HunterChao/Crawler
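This repository contains three crawlers: Lagou, Douban, and Lianjia.
Lagou: scraping the profile information of all companies
The entry script is lagou.py. Because Lagou restricts requests by IP address, the crawler rotates proxy IPs to get around the anti-crawling limits. 0103.txt lists the usable proxy IPs, and lagou.py picks one of them at random at run time.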
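The random proxy selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in lagou.py: it assumes 0103.txt holds one `ip:port` entry per line (the actual file format is not documented here), and the helper names `load_proxies`, `pick_proxy`, and `fetch` are hypothetical. It uses only the standard library's `urllib.request`; the real script may well use a different HTTP client.

```python
import random
import urllib.request


def load_proxies(path):
    """Read proxy addresses from a text file, skipping blank lines.

    Assumption: one "ip:port" entry per line.
    """
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def pick_proxy(proxies, rng=random):
    """Choose one proxy at random and return a scheme-to-URL mapping."""
    address = rng.choice(proxies)
    # The same HTTP proxy is used for both http and https requests.
    return {"http": "http://" + address, "https": "http://" + address}


def fetch(url, proxies, timeout=10):
    """Fetch url through a randomly chosen proxy and return the body as bytes."""
    handler = urllib.request.ProxyHandler(pick_proxy(proxies))
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Calling `fetch(url, load_proxies("0103.txt"))` once per request would send each request through a freshly chosen proxy. A fuller version would probably also drop proxies that time out or get blocked.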
Screenshot of a sample of the data scraped from Lagou
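Lianjia: scraping second-hand housing listings
This crawler is built on the Scrapy framework. The entry script is run.py; run it directly from the console, with no need to start the crawl from cmd.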
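A launcher script of this kind is commonly written with Scrapy's `scrapy.cmdline.execute`, which runs the equivalent of `scrapy crawl <spider>` in-process. The sketch below shows that pattern only; it is not the repository's run.py. The spider name `lianjia` and the helper names `build_crawl_argv` and `main` are assumptions made for illustration.

```python
SPIDER_NAME = "lianjia"  # hypothetical spider name; the project's actual name may differ


def build_crawl_argv(spider_name):
    """Return the argument list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def main(spider_name=SPIDER_NAME):
    """Start the crawl in-process instead of typing the command in a shell."""
    # Imported lazily so the module can be loaded without Scrapy installed.
    from scrapy import cmdline

    # Runs the crawl; Scrapy normally exits the interpreter when it finishes.
    cmdline.execute(build_crawl_argv(spider_name))
```

Calling `main()` from a script placed in the Scrapy project directory (the one containing scrapy.cfg) starts the spider, which is what makes it possible to launch the crawl from an IDE's run button rather than a command prompt.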
Screenshot of a sample of the data scraped from Lianjia
For a detailed write-up of the Lianjia project, see the Zhihu column: https://zhuanlan.zhihu.com/p/25132058?refer=pythoncrawl
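Douban: scraping movie information
This crawler collects all movie information on Douban category by category, more than 87,000 records in total.
It consists of GetPage.py, which reads the movie category information, and FullContents.py, which crawls the detail pages of the movies in each category.
For a detailed write-up of the Douban movie crawler, see the Zhihu column: https://zhuanlan.zhihu.com/p/24771128?refer=pythoncrawl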

