#

scraping

Here are 1,663 public repositories matching this topic...

oldani
oldani commented Feb 18, 2019

Using proxies with requests-html works fine until you render a JS site: once you render a page, pyppeteer doesn't know about these proxies and will expose your IP. This is undesirable behavior when scraping with proxies.

The idea is that whenever someone passes in proxies to the session object or any method call, make pyppeteer also use these proxies. #265
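
As a rough illustration of the idea, the sketch below translates a requests-style `proxies` mapping into a Chromium `--proxy-server` launch argument. The helper name `chromium_proxy_args` is hypothetical and is not part of requests-html or pyppeteer. It also assumes that dropping any `user:password@` credentials is acceptable, since the `--proxy-server` flag does not carry them.

```python
from urllib.parse import urlsplit


def chromium_proxy_args(proxies):
    """Build Chromium launch args from a requests-style proxies dict.

    Hypothetical helper for illustration only. Credentials embedded in
    the proxy URL are intentionally dropped.
    """
    rules = []
    for scheme in ("http", "https"):
        url = proxies.get(scheme)
        if not url:
            continue
        # Assume an HTTP proxy when the URL has no explicit scheme.
        if "://" not in url:
            url = "http://" + url
        parts = urlsplit(url)
        if not parts.hostname:
            continue
        target = f"{parts.scheme}://{parts.hostname}"
        if parts.port:
            target += f":{parts.port}"
        # Per-scheme rule, e.g. "https=http://10.0.0.2:8080".
        rules.append(f"{scheme}={target}")
    return [f"--proxy-server={';'.join(rules)}"] if rules else []
```

The returned list could then be handed to the browser at launch time, for example as the `args` option of `pyppeteer.launch`. How the session would thread it through is left open here.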

nate-anderson
nate-anderson commented Nov 21, 2018

Scraping a Google search results page with HTML links containing attributes href and ping, such as:

<a href="https://en.wikipedia.org/wiki/Go_(programming_language)" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://en.wikipedia.org/wiki/Go_(programming_language)&amp;ved=2ahUKEwi-yY2t5eTeAhUzNX0KHXbrD7cQFjADegQIDRAB"><h3 class="LC20lb">Go (programming language) - Wikipedia</h3
1BOB
1BOB commented Nov 17, 2017

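The problem is this: I want to scrape paginated product information, so I use a for loop that executes document = Jsoup.connect(domain+reviews+String.format(b, p)).get() and changes the value of p to change the review page number.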

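However, after the first page has been scraped, fetching the second page of reviews fails (this statement runs every time a page of reviews is about to be fetched: document = Jsoup.connect(domain+reviews+String.format(b, p)).get();) with the following error: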
java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.s
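
The cause of the timeout is not stated in the report. One common mitigation for paginated fetches, assuming the failure is transient (for example, the server throttling rapid requests), is to retry each page with a growing delay. The sketch below shows that pattern in Python rather than Java. `fetch_pages`, `fetch`, and the retry parameters are hypothetical names, with `fetch` standing in for the `Jsoup.connect(...).get()` call.

```python
import time


def fetch_pages(fetch, pages, retries=3, base_delay=1.0, sleep=time.sleep):
    """Fetch each page, retrying on network errors with exponential backoff.

    `fetch` is any callable that takes a page number and returns the
    document (a stand-in for the real HTTP call).
    """
    results = {}
    for page in pages:
        for attempt in range(retries):
            try:
                results[page] = fetch(page)
                break
            except OSError:
                # OSError covers ConnectionError and TimeoutError.
                if attempt == retries - 1:
                    raise  # Give up after the last attempt.
                sleep(base_delay * 2 ** attempt)
    return results
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. In practice a fixed pause between successful pages may also be worth adding.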

ferret
gyy52380
gyy52380 commented Oct 13, 2019

Describe the bug
When using the cdp driver, this error sometimes appears while closing a browser page.

{"level":"warn","time":"x","url":"x","error":"rpcc: the connection is closing","time":"x","message":"failed to close browser page"}
{"level":"error","time":"x","error":": rpcc: the connection is closing: session: detach timed out for session 5C391DF4E758E985AE3CBAA03774E562","t
brandonmp
brandonmp commented Nov 16, 2016

I'm trying to type some stuff into a page w/ artoo, and I think simulate() will do the trick.

I've never used simulate(), though, so I have no idea what the syntax is.

The GitHub page for simulate linked in the artoo docs has no documentation, and the only docs I can find are for jquery-simulate-ext.

Are there any examples I

nyov
nyov commented Feb 21, 2020

Line 279

https://github.com/scrapy/parsel/blob/332b7e87ba046c48f8b17ea3a4064015f1f58ffe/parsel/selector.py#L277-L279

appears to raise a warning in the scrapy build on Python 3.7 for the docs target:

https://travis-ci.org/scrapy/scrapy/jobs/653351006#L345-L350

Warning, treated as error:

/home/travis/build/scrapy/scrapy/.tox/docs/lib/python3.7/site-packages/parsel/selector.py:d
jekyll
mdlincoln
mdlincoln commented Jan 29, 2020

We need to put strings into a lot of places:

  • YAML
  • markdown
  • html inside markdown
  • code blocks
  • arguments added to includes like with {% include figure.html caption="..." %} (which are a special case too b/c they actually get processed as markdown...)

It's not always intuitive how ' and " will interact in which domain, and there are often multiple ways to do it which can be
