
crawling

Here are 472 public repositories matching this topic...

teodoroanca
teodoroanca commented Apr 16, 2020

Description

When I scrape without a proxy, both https and http URLs work.
Using the proxy for https URLs works just fine. My problem is with http URLs.
At that point I get the twisted.web.error.SchemeNotSupported: Unsupported scheme: b'' error.

As far as I can see, most people hit this issue the other way around.

Steps to Reproduce

  1. Scrape an http link through a proxy

Expected
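One plausible cause of the error above is the proxy URL itself: Twisted reports Unsupported scheme: b'' when the value it is handed has no scheme at all. A minimal sketch of a guard one could apply before passing the proxy along (the normalize_proxy helper and the assumption that a bare host:port should default to http:// are mine, not Scrapy's):

```python
from urllib.parse import urlparse

def normalize_proxy(proxy: str) -> str:
    """Ensure the proxy URL carries an explicit scheme.

    Twisted's tunnelling agent raises SchemeNotSupported when the
    scheme is missing, which surfaces as b'' in the error message.
    """
    if "://" not in proxy:
        # Assume a plain HTTP proxy when no scheme is given (assumption).
        proxy = "http://" + proxy
    parsed = urlparse(proxy)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported proxy scheme: {parsed.scheme!r}")
    return proxy
```

With a guard like this in place, a bare `127.0.0.1:8080` becomes `http://127.0.0.1:8080` before the downloader ever sees it.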

Pit-Storm
Pit-Storm commented Oct 21, 2019

If one opens the docs link provided in the README, the Readme opens on readthedocs.io. There is no navigation bar for browsing to the quick start or advanced pages; you can only reach them by searching for "quick start" and clicking the result. Only then do navigation links for browsing through the docs appear.

Just for the record:
I'm using Firefox (60.9.0 ESR) on Windows 10 Pro.

Really gr

jlvdh
jlvdh commented Nov 27, 2018

What is the current behavior?

Crawling a website that uses # (hashes) for URL navigation does not visit the pages behind those hashes.

URLs containing # are not followed.

If the current behavior is a bug, please provide the steps to reproduce

Try crawling a website like mykita.com/en/

What is the motivation / use case for changing the behavior?

Though hashes are not meant to chan
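The behaviour described above matches how most crawlers canonicalize URLs: the fragment after # is resolved by the browser and never sent to the server, so links that differ only in their fragment collapse to one crawl target. A minimal sketch of that canonicalization (the function name is my own):

```python
from urllib.parse import urldefrag

def canonicalize(url: str) -> str:
    # Fragments (#...) are client-side only, so crawlers usually
    # strip them before deduplicating URLs.
    base, _fragment = urldefrag(url)
    return base

seen = set()
for link in ["https://mykita.com/en/", "https://mykita.com/en/#shop"]:
    seen.add(canonicalize(link))
# Both links collapse to a single crawl target.
```

This is why a site that relies on # for navigation appears to the crawler as one page; following such links requires treating fragments as routes, not stripping them.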

ferret
Ziinc
Ziinc commented Dec 11, 2019

Currently, the Crawly.Engine APIs are lacking for spider monitoring and management, especially when there is no access to logs.

I think some critical areas are:

  • spider crawl stats (scraped item count, dropped request/item count, scrape speed)
  • stop_all_spiders to stop all running spiders

The stopping of spiders should be easy to implement.

For the spider stats, since so
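As a language-neutral illustration of the kind of per-spider stats registry requested above (Crawly itself is Elixir; this Python sketch and all of its names are hypothetical, not Crawly's API):

```python
from collections import Counter

class SpiderStats:
    """Hypothetical in-memory registry: one Counter per running spider."""

    def __init__(self):
        self._stats = {}

    def incr(self, spider: str, key: str, n: int = 1) -> None:
        # e.g. key in {"scraped_items", "dropped_requests", "dropped_items"}
        self._stats.setdefault(spider, Counter())[key] += n

    def snapshot(self, spider: str) -> dict:
        # Read-only copy of the counters, usable even without log access.
        return dict(self._stats.get(spider, Counter()))

    def stop_all_spiders(self, running: set) -> list:
        # Naive stop_all: record every running spider, then clear the set.
        stopped = sorted(running)
        running.clear()
        return stopped
```

Scrape speed could then be derived from periodic snapshots (items scraped between two snapshots divided by the interval), rather than stored as a counter itself.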

elmaestro08
elmaestro08 commented Oct 20, 2018

Datasets with identifiers containing upper case letters are being duplicated in the status.json file contained in the working_dir of the project. This is causing the desired flag in the DIG UI to be reset to zero. Hence, the data is not ingested into the system.

Example status.json:

{
  "desired_docs": {
    "imfCPI": 0,
    "imfcpi": 1
  },
  "added_docs": {
    "imf
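A minimal sketch of the kind of case-insensitive merge that would prevent such duplicates (the merge rule, keeping the first-seen spelling and taking the maximum of the desired flags, is my assumption, not DIG's actual behaviour):

```python
def dedupe_case_insensitive(desired_docs: dict) -> dict:
    """Merge identifiers that differ only in letter case."""
    merged = {}
    first_spelling = {}  # lowercase key -> spelling kept in the output
    for ident, flag in desired_docs.items():
        key = ident.lower()
        if key in first_spelling:
            # Case-variant duplicate: keep the stronger desired flag.
            kept = first_spelling[key]
            merged[kept] = max(merged[kept], flag)
        else:
            first_spelling[key] = ident
            merged[ident] = flag
    return merged
```

Applied to the example above, "imfCPI": 0 and "imfcpi": 1 would merge into a single entry with the desired flag preserved, so the dataset is not reset to zero.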

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but enables extending it for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

  • Updated Nov 13, 2019
  • C#
