Skip to content
master
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
art
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

License Go Report Card Gitter Creeper

About

Creeper is a next-generation crawler which fetches web page by creeper script. As a cross-platform embedded crawler, you can use it for your news app, subscribe program, etc.

Warning: At present this project is still under early stage development, please do not use in the production environment.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	c := creeper.Open("./hacker_news.crs")
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run. Console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

Town is a lambda like expression for saving (in)mutable string. Most of the time, we used it to store url.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

When you need town, use it as if you were calling a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not used. Yeah, it is a special parameter.

Expression in town definition line like name="something", represents parameter name has a default value "something".

Incidentally, @page is a parameter that will automatically increasing when current page has no more content.

Node

Nodes are tree structure that represent the data structure you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

Like yaml, nodes distinguishes the hierarchy by indentation.

Node Name

Node has name. title is a field name, represents a general string data. news[] is a array name, represents a parent structure with multiple sub-data.

Page

Page indicates where to fetching the field data. It can be a town expression or field reference.

Field reference is a advanced usage of Node, you can found the details in ./eh.crs.

If a node owned page and fun at the same time, page should on the left of ->, fun should on the right of ->. Which is page -> fun

Fun

Fun represents the data processing process.

There are all supported funs:

Name Parameters Description
$ (selector: string) Relative CSS selector (select from parent node)
$root (selector: string) Absolute CSS selector (select from body)
html inner HTML
text inner text
outerHTML outer HTML
attr (attr: string) attribute value
style style attribute value
href href attribute value
src src attribute value
class class attribute value
id id attribute value
calc (prec: int) calculate arithmetic expression
match (regexp: string) match first sub-string via regular expression
expand (regexp: string, target: string) expand matched strings to target string

Author

Plutonist

impl.moe · Github @wspl

About

🐾 Creeper - The Next Generation Crawler Framework (Go)

Topics

Resources

License

Releases

No releases published

Packages

No packages published

Languages

You can’t perform that action at this time.