scrapy
Here are 2,051 public repositories matching this topic...
I've also been learning Python web scraping recently — thanks for sharing.
After setting up the environment today, I hit a problem running the Spider_Python project: it could not connect to MongoDB, with an error saying pymongo has no Connection module. After looking up the current pymongo API, I made the following change, and the spider now runs and stores data in MongoDB correctly.
```python
def Connection(self):
    # Connect to MongoDB (localhost:27017). pymongo no longer ships a
    # Connection class, so use pymongo.MongoClient instead; mongodb and
    # posts are handles to the database and collection.
    mongoclient = pymongo.MongoClient()
    mongodb = mongoclient[self.database]
    posts = mongodb.posts
    return posts
```
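For completeness, the corrected connection pattern can be wrapped in a small helper. This is a sketch: the database name, collection name, and sample document are placeholders, and a local mongod on the default port 27017 is assumed.

```python
def save_post(post, database="spider_db"):
    """Insert one document using pymongo.MongoClient (the replacement for
    the removed pymongo.Connection). Returns the new document's _id."""
    import pymongo  # imported lazily so the sketch stays self-contained

    client = pymongo.MongoClient()  # connects to localhost:27017 by default
    posts = client[database].posts  # "posts" collection, as in the fix above
    return posts.insert_one(post).inserted_id
```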
The quick start section seems to omit the pipeline setting, and without it yielded items are not processed correctly — the same problem as #137. Please update the documentation; I'm happy to contribute the change if that helps.
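For reference, enabling a pipeline is a one-line addition to settings.py. The class path below is a hypothetical placeholder, not a name from the project:

```python
# settings.py — without this setting, yielded items bypass the pipeline
# entirely. The key is the pipeline's import path; the integer is its
# order in the chain (0-1000, lower numbers run first).
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,
}
```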
Python client retrieves nothing
I ran those four lines of code and IPs can be obtained, but when I call the service from the Python client nothing is returned.
It would be a much better user experience to use custom widgets for spider args. For example, if we could select a category from a list or enter a URL in a separate field, it would be much easier for the end user to work with.
Hi, according to the following links:
https://doc.scrapy.org/en/latest/topics/spiders.html#spiderargs
https://scrapyd.readthedocs.io/en/stable/api.html#schedule-json
parameters can be passed to the Spider class during initialization, but I can't see anywhere in the UI to enter them.
It would be great if this feature were added.
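Under the hood, the feature would just forward extra form fields to scrapyd's schedule.json endpoint, which passes any field beyond project/spider/settings to the spider's `__init__`. A minimal standard-library sketch — the project, spider, and `category` argument names are placeholders:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Extra form fields (here "category") reach the spider as keyword
# arguments, i.e. self.category inside the spider.
payload = urlencode({
    "project": "myproject",
    "spider": "myspider",
    "category": "books",
}).encode()

req = Request("http://localhost:6800/schedule.json", data=payload)
# urllib.request.urlopen(req)  # uncomment to actually schedule the job
```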
linux:HTTPConnectionPool(host='192.168.0.24', port=6801): Max retries exceeded with url: /listprojects.json (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f0a78b2d828>: Failed to establish a new connection: [Errno 111] Connection refused',))
windows:HTTPConnectionPool(host='localhost', port=6801): Max retries exceeded with url: /jobs (Caused by Ne
Documentation incorrectly states that any software accepting the CONNECT method can be used as a proxy
Hello,
I was trying to build my own image with a 3rd-party HTTP proxy.
Expected Behavior
According to the documentation: "you can use every software which accept the CONNECT method (Squid, Tinyproxy, etc.)."
Actual Behavior
This is not the case, because Scrapoxy expects to receive a 200 response on http://xx.xx.
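To make the requirement concrete: a CONNECT-capable proxy receives a request like the one built below and must answer with a 200 status line before tunnelling. This is a sketch of the wire format only, not Scrapoxy's actual check code; host and port are placeholders.

```python
def connect_request(host, port):
    """Raw HTTP CONNECT request a client sends to open a tunnel."""
    return (f"CONNECT {host}:{port} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n\r\n").encode()

# A compliant proxy answers with a status line such as:
#   HTTP/1.1 200 Connection established
# Per the issue above, Scrapoxy treats anything other than a 200 here
# as a failed proxy.
```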
Describe the bug
from scrapy.conf import settings
ModuleNotFoundError: No module named 'scrapy.conf'
To Reproduce
On Windows 10, run: scrapy crawl lianjia
Desktop (please complete the following information)
- OS: Windows 10
- Python: 3.7
- Scrapy: 1.7.3
- Redis:
- Elasticsearch:
- Kibana:
Additional context
Adding this would be helpful.
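The error comes from the scrapy.conf module having been removed in modern Scrapy releases; settings are now read through scrapy.utils.project (or `self.settings` inside a spider). A minimal replacement sketch — the lazy import keeps the snippet self-contained:

```python
def load_settings():
    """Replacement for the removed `from scrapy.conf import settings`."""
    # Works anywhere a Scrapy project is on the path; inside a spider or
    # pipeline, prefer the injected `self.settings` object instead.
    from scrapy.utils.project import get_project_settings
    return get_project_settings()
```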
It took me hours to figure this out, so I want to help anyone else having trouble getting this running on Heroku.
Kimurai uses lsof, so an Aptfile containing the single line lsof needs to be included in the root folder, along with the Heroku apt buildpack. Can you add this to the docs? Thanks!
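A minimal setup sketch, assuming the community apt buildpack (double-check the buildpack name against Heroku's current docs):

```shell
# Aptfile in the repo root lists apt packages, one per line; Kimurai needs lsof.
echo "lsof" > Aptfile

# Add the apt buildpack ahead of the default one (run once per app):
# heroku buildpacks:add --index 1 heroku-community/apt
```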
Crawl a file
The documentation says we can download videos or other types of files, but I googled and haven't found any example of this. Can you give an example of crawling a file that is not an image?
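Scrapy's built-in FilesPipeline handles non-image downloads. The sketch below shows the two settings it needs plus the item field it reads; the storage path and URL are placeholders:

```python
# settings.py — FilesPipeline downloads any file type (PDFs, videos,
# archives), unlike ImagesPipeline which is image-specific.
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # directory where downloaded files are stored

# In the spider, yield items with a "file_urls" list; after downloading,
# the pipeline adds a "files" field with checksums and local paths.
item = {"file_urls": ["http://example.com/archive/video.mp4"]}
```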
Is there an option to crawl events out of Facebook?
If not, would it be easy to implement? I could assist if there is interest in that.

Describe the bug
Following the tutorial, I installed and started the stack with docker-compose up -d, but running a task fails immediately. Where could the problem be? My Docker host is Windows 10.
```
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: xueqiu)
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Nov 7 2019, 10:44:02) - [GCC 8.3.0], pyOpenSSL 19
```