
crawling

Here are 472 public repositories matching this topic...

teodoroanca
teodoroanca commented Apr 16, 2020

Description

When I scrape without a proxy, both https and http URLs work.
Using the proxy for https URLs works just fine. My problem is with http URLs.
At that point I get the twisted.web.error.SchemeNotSupported: Unsupported scheme: b'' error.

As far as I can see, most people hit this issue the other way around.

Steps to Reproduce

  1. Scrape an http link through a proxy

Expected
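One plausible cause of the error above is the proxy URL itself: Twisted reports Unsupported scheme: b'' when the value it is handed has no scheme at all. A minimal sketch of a guard one could apply before passing the proxy along (the normalize_proxy helper and the assumption that a bare host:port should default to http:// are mine, not Scrapy's):

```python
from urllib.parse import urlparse

def normalize_proxy(proxy: str) -> str:
    """Ensure the proxy URL carries an explicit scheme.

    Twisted's tunnelling agent raises SchemeNotSupported when the
    scheme is missing, which surfaces as b'' in the error message.
    """
    if "://" not in proxy:
        # Assume a plain HTTP proxy when no scheme is given (assumption).
        proxy = "http://" + proxy
    parsed = urlparse(proxy)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported proxy scheme: {parsed.scheme!r}")
    return proxy
```

With a guard like this in place, a bare `127.0.0.1:8080` becomes `http://127.0.0.1:8080` before the downloader ever sees it.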

Pit-Storm
Pit-Storm commented Oct 21, 2019

If one opens the docs link provided in the README, the Readme opens on readthedocs.io. There is no navigation bar for browsing to the quick start or advanced pages; you can only reach them by searching for "quick start" and clicking the result. Only then do navigation links for browsing through the docs appear.

Just for the record:
I'm using Firefox (60.9.0 ESR) on Windows 10 Pro.

Really gr

jlvdh
jlvdh commented Nov 27, 2018

What is the current behavior?

Crawling a website that uses # (hashes) for URL navigation does not visit the pages behind those hashes.

URLs containing # are not followed.

If the current behavior is a bug, please provide the steps to reproduce

Try crawling a website like mykita.com/en/

What is the motivation / use case for changing the behavior?

Though hashes are not meant to chan
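The behaviour described above matches how most crawlers canonicalize URLs: the fragment after # is resolved by the browser and never sent to the server, so links that differ only in their fragment collapse to one crawl target. A minimal sketch of that canonicalization (the function name is my own):

```python
from urllib.parse import urldefrag

def canonicalize(url: str) -> str:
    # Fragments (#...) are client-side only, so crawlers usually
    # strip them before deduplicating URLs.
    base, _fragment = urldefrag(url)
    return base

seen = set()
for link in ["https://mykita.com/en/", "https://mykita.com/en/#shop"]:
    seen.add(canonicalize(link))
# Both links collapse to a single crawl target.
```

This is why a site that relies on # for navigation appears to the crawler as one page; following such links requires treating fragments as routes, not stripping them.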

ferret
Ziinc
Ziinc commented Dec 11, 2019

Currently, the Crawly.Engine APIs are lacking for spider monitoring and management, especially when there is no access to logs.

I think some critical areas are:

  • spider crawl stats (scraped item count, dropped request/item count, scrape speed)
  • stop_all_spiders to stop all running spiders

The stopping of spiders should be easy to implement.

For the spider stats, since so
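As a language-neutral illustration of the kind of per-spider stats registry requested above (Crawly itself is Elixir; this Python sketch and all of its names are hypothetical, not Crawly's API):

```python
from collections import Counter

class SpiderStats:
    """Hypothetical in-memory registry: one Counter per running spider."""

    def __init__(self):
        self._stats = {}

    def incr(self, spider: str, key: str, n: int = 1) -> None:
        # e.g. key in {"scraped_items", "dropped_requests", "dropped_items"}
        self._stats.setdefault(spider, Counter())[key] += n

    def snapshot(self, spider: str) -> dict:
        # Read-only copy of the counters, usable even without log access.
        return dict(self._stats.get(spider, Counter()))

    def stop_all_spiders(self, running: set) -> list:
        # Naive stop_all: record every running spider, then clear the set.
        stopped = sorted(running)
        running.clear()
        return stopped
```

Scrape speed could then be derived from periodic snapshots (items scraped between two snapshots divided by the interval), rather than stored as a counter itself.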

elmaestro08
elmaestro08 commented Oct 20, 2018

Datasets with identifiers containing upper case letters are being duplicated in the status.json file contained in the working_dir of the project. This is causing the desired flag in the DIG UI to be reset to zero. Hence, the data is not ingested into the system.

Example status.json:

{
  "desired_docs": {
    "imfCPI": 0,
    "imfcpi": 1
  },
  "added_docs": {
    "imf
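A minimal sketch of the kind of case-insensitive merge that would prevent such duplicates (the merge rule, keeping the first-seen spelling and taking the maximum of the desired flags, is my assumption, not DIG's actual behaviour):

```python
def dedupe_case_insensitive(desired_docs: dict) -> dict:
    """Merge identifiers that differ only in letter case."""
    merged = {}
    first_spelling = {}  # lowercase key -> spelling kept in the output
    for ident, flag in desired_docs.items():
        key = ident.lower()
        if key in first_spelling:
            # Case-variant duplicate: keep the stronger desired flag.
            kept = first_spelling[key]
            merged[kept] = max(merged[kept], flag)
        else:
            first_spelling[key] = ident
            merged[ident] = flag
    return merged
```

Applied to the example above, "imfCPI": 0 and "imfcpi": 1 would merge into a single entry with the desired flag preserved, so the dataset is not reset to zero.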

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but enables extending it for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

  • Updated Nov 13, 2019
  • C#
