crawler
Here are 4,032 public repositories matching this topic...
With the current installation and deployment manual, it is nearly impossible to deploy successfully on the first attempt.
It would also be good to provide an official Docker image based on Python 3.
I don't know how to proceed.
If one opens the docs link in the README, it opens on readthedocs.io. There is no navigation bar for browsing to the quick start or advanced pages; you can only reach them by searching for "quick start" and clicking the result. From there, navigation links for browsing through the docs do appear.
Just for the record:
I'm using Firefox (60.9.0 ESR) on Windows 10 Pro.
Really gr
The problem is this: I want to crawl paginated product information, so I used a for loop, executing document = Jsoup.connect(domain+reviews+String.format(b, p)).get() and changing the value of p to change the review page number.
But after crawling the first page, when crawling the second page of reviews (the line document = Jsoup.connect(domain+reviews+String.format(b, p)).get(); runs before crawling each page), this error appeared:
java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.s
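A common cause of mid-pagination timeouts is issuing back-to-back requests with no delay and no retry. The original code is Java/Jsoup, but the same pattern can be sketched in Python; the URL template, page count, and helper names below are hypothetical placeholders, not the poster's actual code:

```python
import time
import urllib.request

def build_page_url(base, template, page):
    # Build the review-page URL for a given page number (template is hypothetical).
    return base + template.format(page)

def fetch_with_retry(url, retries=3, timeout=10, delay=2.0):
    # Fetch a URL with a per-request timeout, retrying after a pause on failure.
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # back off before retrying

def crawl_reviews(base, template, pages, delay=1.0):
    # Iterate over review pages politely, pausing between requests.
    for p in range(1, pages + 1):
        html = fetch_with_retry(build_page_url(base, template, p))
        yield p, html
        time.sleep(delay)  # pause so the server is not hammered page after page
```

The pause between pages is often enough on its own: servers that accept the first request frequently drop or throttle an immediate second connection, which shows up as exactly this kind of connect timeout.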
Task execution fails with the Docker installation
Bug description
Following the tutorial documentation, after installing and starting with docker-compose up -d, running a task immediately reports an error.
I don't know where the problem is.
My Docker environment is Windows 10.
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: xueqiu)
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Nov 7 2019, 10:44:02) - [GCC 8.3.0], pyOpenSSL 19
Scrapy crawler deduplication bug
What is the current behavior?
Crawling a website that uses # (hashes) for URL navigation does not crawl the pages reached via #.
URLs containing # are not followed.
If the current behavior is a bug, please provide the steps to reproduce
Try crawling a website like mykita.com/en/
What is the motivation / use case for changing the behavior?
Though hashes are not meant to chan
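The behavior described above is typical of crawlers that normalize URLs by stripping the fragment, so every #-link collapses to the same page and is deduplicated away. A minimal sketch of the difference using the standard library (how this particular crawler normalizes URLs is an assumption):

```python
from urllib.parse import urldefrag

links = [
    "https://mykita.com/en/#shop",
    "https://mykita.com/en/#stores",
    "https://mykita.com/en/",
]

# Typical normalization: drop the fragment, so all three links collapse
# to a single URL and the #-navigated "pages" are never queued.
stripped = {urldefrag(u).url for u in links}

# Keeping the fragment in the dedup key treats each #-view as its own page.
kept = set(links)
```

With fragment stripping, `stripped` contains one URL while `kept` contains three, which matches the reported symptom of #-links not being followed.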
The quick start section seems to forget to mention the pipeline setting, and without that setting the yielded items appear to produce wrong results. As with #137, please update the documentation; if help is needed, I can contribute as well.
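For context, enabling a Scrapy pipeline is a one-line settings entry plus the pipeline class itself; the project and class names below are hypothetical, not the project's actual ones:

```python
# settings.py (hypothetical project): register the pipeline with a
# priority between 0 and 1000 (lower runs earlier).
ITEM_PIPELINES = {
    "myproject.pipelines.MyItemPipeline": 300,
}

# pipelines.py: a minimal pipeline. Without the setting above it is never
# invoked, so yielded items pass through unprocessed.
class MyItemPipeline:
    def process_item(self, item, spider):
        item["cleaned"] = True  # stand-in for real per-item processing
        return item
```

This is likely the kind of omission the report refers to: the spider yields items correctly, but nothing processes them until ITEM_PIPELINES is set.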
Python client call returns empty
I ran these 4 lines of code and it can obtain IPs, but I cannot retrieve them when calling via the Python client.
Describe the bug
When using the cdp driver, this error sometimes appears while a browser page is being closed.
{"level":"warn","time":"x","url":"x","error":"rpcc: the connection is closing","time":"x","message":"failed to close browser page"}
{"level":"error","time":"x","error":": rpcc: the connection is closing: session: detach timed out for session 5C391DF4E758E985AE3CBAA03774E562","t
There are several things that are not accurately documented or are outdated:
- -v2 is used in the examples but does not work
- # duckduckgo not supported, although it is in the list of supported search engines
- To get a list of all search engines, --config is suggested, but that just fails
I copied the examples/sciencenet_spider.py example and tried to run it using Python 3.6, but:
python sciencenet_spider.py
[2018:04:14 22:21:26] Spider started!
[2018:04:14 22:21:26] Using selector: KqueueSelector
[2018:04:14 22:21:26] Base url: http://blog.sciencenet.cn/
[2018:04:14 22:21:26] Item "Post": 0
[2018:04:14 22:21:26] Requests count: 0
[2018:04:14 22:21:26] Error coun
On this gif ( https://raw.githubusercontent.com/constverum/ProxyBroker/master/docs/source/_static/cli_serve_example.gif ) the server prints an info line when a client connects.
The current version doesn't do that, though it would be very useful. I tried the same command shown in the GIF.
https://twisted.readthedocs.io/en/latest/core/howto/defer-intro.html#inline-callbacks-using-yield says "On Python 3, instead of writing returnValue(json.loads(responseBody)) you can instead write return json.loads(responseBody). This can be a significant readability advantage, but unfortunately if you need compatibility with Python 2, this isn’t an option.".
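The underlying change is PEP 380: on Python 3 a generator (and hence a coroutine) may return a value directly, which is what lets Twisted drop returnValue. As a rough analogy, the same behavior with stdlib asyncio (not Twisted itself):

```python
import asyncio
import json

async def parse_body(response_body: str):
    # On Python 3, a plain return delivers the coroutine's result to the
    # awaiter; no returnValue()-style helper is needed.
    return json.loads(response_body)

result = asyncio.run(parse_body('{"ok": true}'))
```

Twisted's @inlineCallbacks generators gained the same capability on Python 3, which is exactly the readability advantage the linked documentation describes.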