crawling
Here are 477 public repositories matching this topic...
Don't know how to navigate the docs
If one opens the docs link provided in the README, the README itself opens on readthedocs.io. There is no navigation bar for browsing to the Quick Start or Advanced pages; you can only reach them by searching for "quick start" and clicking the result. From there, navigation links for browsing the docs do appear.
Just for the record:
I'm using Firefox (60.9.0 esr) on Windows 10 Pro.
Really gr
What is the current behavior?
Crawling a website that uses # (hashes) for URL navigation does not crawl the pages behind the hash: URLs containing # are simply not followed.
If the current behavior is a bug, please provide the steps to reproduce
Try crawling a website like mykita.com/en/
What is the motivation / use case for changing the behavior?
Though hashes are not meant to chan
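The behavior described above usually comes from URL normalization: many crawlers strip the fragment before deduplicating, so every #-route collapses into the same base URL. A minimal sketch of that normalization step (illustrative, not this crawler's actual code):

```python
from urllib.parse import urldefrag

# Typical crawler normalization: drop the fragment before deduplication.
# This is why '#'-based routes all collapse into one URL and are never
# crawled as separate pages.
def normalize(url: str) -> str:
    base, _fragment = urldefrag(url)
    return base

print(normalize("https://mykita.com/en/#/shop"))  # https://mykita.com/en/
```

A crawler that should follow hash routes would have to keep the fragment as part of the deduplication key instead.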
The developer of the website I intend to scrape information from is sloppy and has left a lot of broken links.
When I execute an otherwise effective Ferret script on a list of pages, it stops altogether at every 404.
Is there a DOCUMENT_EXISTS or anything that would help the script go on?
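One possible workaround while waiting for an answer: pre-filter the URL list outside the script, so the scraper only ever sees pages that exist. A hedged sketch (the `check` callable is an assumption standing in for a real HTTP HEAD request):

```python
# Hypothetical workaround: filter a URL list down to live pages before
# handing it to the scraping script. `check` is injectable so a real
# checker (e.g. an HTTP HEAD via urllib) can be swapped in.
from typing import Callable, Iterable, List

def live_urls(urls: Iterable[str], check: Callable[[str], int]) -> List[str]:
    """Keep only URLs whose status check does not return 404."""
    return [u for u in urls if check(u) != 404]

# Example with a stubbed checker standing in for real HTTP requests:
statuses = {"https://example.com/a": 200, "https://example.com/b": 404}
print(live_urls(statuses, statuses.get))  # ['https://example.com/a']
```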
SeleniumRequest should use meta to pass arguments
self.wait_time = wait_time
self.wait_until = wait_until
self.screenshot = screenshot
self.script = script
When using scrapy_redis.scheduler.Scheduler, these attributes won't be serialized.
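The idea behind the request above could look like the following sketch: keep the Selenium options in the request's meta dict, which serializers treat as plain data, instead of as instance attributes. The class here only mirrors the issue for illustration; it is not the library's actual implementation:

```python
# Hedged sketch: store Selenium options in request.meta so schedulers
# that serialize requests (e.g. scrapy_redis) can round-trip them.
# Illustrative only -- not scrapy-selenium's real SeleniumRequest.
class SeleniumRequest:
    def __init__(self, url, wait_time=None, wait_until=None,
                 screenshot=False, script=None, meta=None):
        self.url = url
        self.meta = dict(meta or {})
        # Everything the downloader middleware needs lives in meta,
        # which is serialized as plain data.
        self.meta["selenium"] = {
            "wait_time": wait_time,
            "wait_until": wait_until,
            "screenshot": screenshot,
            "script": script,
        }

req = SeleniumRequest("https://example.com", wait_time=10)
print(req.meta["selenium"]["wait_time"])  # 10
```

The middleware would then read its options from `request.meta["selenium"]` rather than from attributes on the request object.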
Scrapy has a setting directive implemented for Sphinx documentation that allows linking to settings while formatting them as code in an easy manner.
Looking at #212, I think Spidermon could benefit from implementing such a directive as well.
Documentation Needed
CONTRIBUTING.md has some guidelines, but essentially there is simply a lot that needs to be filled out in the docs.
Also, if you would like to use another documentation format, feel free. Listing everything is something I came up with in early development, but it's prob
See the code and update the docs.
Are you submitting a bug report or a feature request?
Feature request/documentation enhancement
What is the current behavior?
The documentation for getting up and running is insufficient with regard to requirements and dependencies. I encountered this when trying to resolve #31 on a fresh Win
Datasets whose identifiers contain uppercase letters are being duplicated in the status.json file in the project's working_dir. This causes the desired flag in the DIG UI to be reset to zero, so the data is not ingested into the system.
Example status.json:
{
  "desired_docs": {
    "imfCPI": 0,
    "imfcpi": 1
  },
  "added_docs": {
    "imf
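The duplication suggests identifiers are being compared case-sensitively somewhere in the pipeline. A minimal sketch of a case-insensitive merge that would collapse the duplicates while preserving the desired flag (names here are illustrative, not DIG's actual code):

```python
# Sketch of the suspected fix: merge identifiers case-insensitively so
# "imfCPI" and "imfcpi" become one record, keeping the desired flag set.
def merge_desired(docs: dict) -> dict:
    merged = {}
    for name, desired in docs.items():
        key = name.lower()
        merged[key] = max(merged.get(key, 0), desired)
    return merged

print(merge_desired({"imfCPI": 0, "imfcpi": 1}))  # {'imfcpi': 1}
```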

Description
When I scrape without proxy, both https and http urls work.
Using proxy through https works just fine. My problem is when I try http urls.
At that point I get the twisted.web.error.SchemeNotSupported: Unsupported scheme: b'' error. As far as I can tell, most people have this issue the other way around.
Steps to Reproduce
Expected