crawler
Here are 4,032 public repositories matching this topic...
With the current installation and deployment manual, it is nearly impossible to deploy successfully on the first attempt.
It would also be good to provide an official Docker image based on Python 3.
I don't know how to proceed.
If one opens the docs link in the README, it opens on readthedocs.io. There is no navigation bar for browsing to the quick start or advanced pages; you can only reach them by searching for "quick start" and clicking the result. From there, navigation links for browsing through the docs do appear.
Just for the record:
I'm using Firefox (60.9.0 ESR) on Windows 10 Pro.
Really gr
The problem is this: I want to crawl paginated product information, so I used a for loop, executing document = Jsoup.connect(domain+reviews+String.format(b, p)).get() and changing the value of p to change the review page number.
But after crawling the first page, when crawling the second page of reviews (the line document = Jsoup.connect(domain+reviews+String.format(b, p)).get(); runs before crawling each page), this error appeared:
java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.s
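A common cause of mid-pagination timeouts is issuing back-to-back requests with no delay and no retry. The original code is Java/Jsoup, but the same pattern can be sketched in Python; the URL template, page count, and helper names below are hypothetical placeholders, not the poster's actual code:

```python
import time
import urllib.request

def build_page_url(base, template, page):
    # Build the review-page URL for a given page number (template is hypothetical).
    return base + template.format(page)

def fetch_with_retry(url, retries=3, timeout=10, delay=2.0):
    # Fetch a URL with a per-request timeout, retrying after a pause on failure.
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # back off before retrying

def crawl_reviews(base, template, pages, delay=1.0):
    # Iterate over review pages politely, pausing between requests.
    for p in range(1, pages + 1):
        html = fetch_with_retry(build_page_url(base, template, p))
        yield p, html
        time.sleep(delay)  # pause so the server is not hammered page after page
```

The pause between pages is often enough on its own: servers that accept the first request frequently drop or throttle an immediate second connection, which shows up as exactly this kind of connect timeout.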
Task execution fails with the Docker installation
Bug description
Following the tutorial documentation, after installing and starting with docker-compose up -d, running a task immediately reports an error.
I don't know where the problem is.
My Docker environment is Windows 10.
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: xueqiu)
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Nov 7 2019, 10:44:02) - [GCC 8.3.0], pyOpenSSL 19
Scrapy crawler deduplication bug
What is the current behavior?
Crawling a website that uses # (hashes) for URL navigation does not crawl the pages reached via #.
URLs containing # are not followed.
If the current behavior is a bug, please provide the steps to reproduce
Try crawling a website like mykita.com/en/
What is the motivation / use case for changing the behavior?
Though hashes are not meant to chan
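The behavior described above is typical of crawlers that normalize URLs by stripping the fragment, so every #-link collapses to the same page and is deduplicated away. A minimal sketch of the difference using the standard library (how this particular crawler normalizes URLs is an assumption):

```python
from urllib.parse import urldefrag

links = [
    "https://mykita.com/en/#shop",
    "https://mykita.com/en/#stores",
    "https://mykita.com/en/",
]

# Typical normalization: drop the fragment, so all three links collapse
# to a single URL and the #-navigated "pages" are never queued.
stripped = {urldefrag(u).url for u in links}

# Keeping the fragment in the dedup key treats each #-view as its own page.
kept = set(links)
```

With fragment stripping, `stripped` contains one URL while `kept` contains three, which matches the reported symptom of #-links not being followed.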
The quick start section seems to forget to mention the pipeline setting, and without that setting the yielded items appear to produce wrong results. As with #137, please update the documentation; if help is needed, I can contribute as well.
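For context, enabling a Scrapy pipeline is a one-line settings entry plus the pipeline class itself; the project and class names below are hypothetical, not the project's actual ones:

```python
# settings.py (hypothetical project): register the pipeline with a
# priority between 0 and 1000 (lower runs earlier).
ITEM_PIPELINES = {
    "myproject.pipelines.MyItemPipeline": 300,
}

# pipelines.py: a minimal pipeline. Without the setting above it is never
# invoked, so yielded items pass through unprocessed.
class MyItemPipeline:
    def process_item(self, item, spider):
        item["cleaned"] = True  # stand-in for real per-item processing
        return item
```

This is likely the kind of omission the report refers to: the spider yields items correctly, but nothing processes them until ITEM_PIPELINES is set.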
Python client call returns empty
I ran these 4 lines of code and it can obtain IPs, but I cannot retrieve them when calling via the Python client.
Describe the bug
When using the cdp driver, this error sometimes appears while a browser page is being closed.
{"level":"warn","time":"x","url":"x","error":"rpcc: the connection is closing","time":"x","message":"failed to close browser page"}
{"level":"error","time":"x","error":": rpcc: the connection is closing: session: detach timed out for session 5C391DF4E758E985AE3CBAA03774E562","t
There are several things that are not accurately documented or are outdated:
- -v2 is used in the examples but does not work
- # duckduckgo not supported, although it is in the list of supported search engines
- To get a list of all search engines, --config is suggested, but that just fails
I copied the examples/sciencenet_spider.py example and tried to run it using Python 3.6, but:
python sciencenet_spider.py
[2018:04:14 22:21:26] Spider started!
[2018:04:14 22:21:26] Using selector: KqueueSelector
[2018:04:14 22:21:26] Base url: http://blog.sciencenet.cn/
[2018:04:14 22:21:26] Item "Post": 0
[2018:04:14 22:21:26] Requests count: 0
[2018:04:14 22:21:26] Error coun
On this gif ( https://raw.githubusercontent.com/constverum/ProxyBroker/master/docs/source/_static/cli_serve_example.gif ) the server prints an info line when a client connects.
The current version doesn't do that, though it would be very useful. I tried the same command shown in the GIF.
https://twisted.readthedocs.io/en/latest/core/howto/defer-intro.html#inline-callbacks-using-yield says "On Python 3, instead of writing returnValue(json.loads(responseBody)) you can instead write return json.loads(responseBody). This can be a significant readability advantage, but unfortunately if you need compatibility with Python 2, this isn’t an option.".
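The underlying change is PEP 380: on Python 3 a generator (and hence a coroutine) may return a value directly, which is what lets Twisted drop returnValue. As a rough analogy, the same behavior with stdlib asyncio (not Twisted itself):

```python
import asyncio
import json

async def parse_body(response_body: str):
    # On Python 3, a plain return delivers the coroutine's result to the
    # awaiter; no returnValue()-style helper is needed.
    return json.loads(response_body)

result = asyncio.run(parse_body('{"ok": true}'))
```

Twisted's @inlineCallbacks generators gained the same capability on Python 3, which is exactly the readability advantage the linked documentation describes.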