Link Checker
The Link Checker is a StormCrawler adaptation for URL checking. Instead of crawling, it checks the status of URLs and persists the results in a database (currently MariaDB/MySQL). It is important to understand that the Link Checker is not a standalone application but a Storm topology that runs inside a cluster. For more information on Storm topologies, please see the documentation of the Apache Storm project.
How to setup and run
In your IDE
- Clone this repository into an IDE workspace
- Either create a file crawler-test.flux or change the file name in class at.oeaw.acdh.linkchecker.LinkcheckerTest, line 13, so that it points to a valid flux file
- Adapt the settings in crawler.flux and crawler-conf.yaml (or whatever these files are called in your test environment) as described in the cluster setup.
- Execute class at.oeaw.acdh.linkchecker.LinkcheckerTest
In a local cluster
- Before you can run the Link Checker, you need to install Apache Storm. Download Apache Storm 2.2.0 (the currently supported version) from https://archive.apache.org/dist/storm/apache-storm-2.2.0/apache-storm-2.2.0.tar.gz
- Clone this repository.
- Run mvn install in the working directory.
- Add your Hikari connection pool properties to crawler-conf.yaml (and change any other parameters you wish, e.g. http.agent):
HIKARI:
  driverClassName: com.mysql.cj.jdbc.Driver
  jdbcUrl: {your database url, e.g. "jdbc:mysql://localhost:3307/stormychecker"}
  username: {your database username}
  password: {your database password}
- Point to your crawler-conf.yaml file in crawler.flux:
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: {path to your crawler-conf.yaml file}
    override: true
Note: If you set it to "crawler-conf.yaml", you can directly use the crawler-conf.yaml in this repository.
- To start the Link Checker in local mode, run
apache-storm-2.2.0/bin/storm local path/to/this/repository/target/linkchecker-2.1.0.jar org.apache.storm.flux.Flux --local path/to/this/repository/crawler.flux --local-ttl 3600
For a remote cluster setup, please see the documentation of the Apache Storm project.
Simple Explanation of Current Implementation
Our SQL database has 6 tables:
- url: This is the table from which the Link Checker reads the URLs to check. It is populated by another application (in our case, the curation-module).
- status: This is the table that linkchecker saves the results into.
- history: If a URL is checked more than once, the previous checking result is saved in the history table and the record in the status table is updated.
- providerGroup
- context: This table saves the contexts in which URLs appear.
- url_context: joins the url table n-to-n to the context table, so that each URL can appear in multiple contexts. The table also stores the last time the link was ingested and a boolean flag indicating whether the join is still active. Only URLs with at least one active join are considered for checking.
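To illustrate the active-join rule above, a query in this spirit could select the candidate URLs. This is only a sketch: the join column and flag names (url_id, active) are assumptions for illustration, not the actual schema of this repository.

```java
// Hypothetical sketch of selecting only URLs with at least one
// active url_context join. Column names are assumptions.
public class ActiveUrlQuery {
    public static String buildQuery() {
        return "SELECT DISTINCT u.url FROM url u "
             + "JOIN url_context uc ON uc.url_id = u.id "
             + "WHERE uc.active = 1";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery());
    }
}
```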
crawler.flux defines our topology. It defines all the spouts, bolts and streams.
- at.ac.oeaw.acdh.linkchecker.spout.RASASpout uses the resource availability status API to fill up a buffer with URLs to check.
- com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt partitions the URLs by a configured criterion.
- at.ac.oeaw.acdh.linkchecker.bolt.MetricsFetcherBolt fetches the URLs. It sends redirects back to URLPartitionerBolt and sends the rest onwards down the stream to StatusUpdaterBolt. It is a modification of com.digitalpebble.stormcrawler.bolt.FetcherBolt.
- at.ac.oeaw.acdh.linkchecker.bolt.StatusUpdaterBolt persists the results in the status table of the database via the resource availability status API.
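To make the wiring concrete, a minimal Flux file in the spirit of crawler.flux could look as follows. This is a sketch, not the repository's actual topology definition: the component ids, grouping fields, stream id, and parallelism values are assumptions.

```yaml
name: "linkchecker"

spouts:
  - id: "spout"
    className: "at.ac.oeaw.acdh.linkchecker.spout.RASASpout"
    parallelism: 1

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "at.ac.oeaw.acdh.linkchecker.bolt.MetricsFetcherBolt"
    parallelism: 1
  - id: "status"
    className: "at.ac.oeaw.acdh.linkchecker.bolt.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"          # URLs to check enter the topology here
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "partitioner"    # field grouping keeps each partition key on one fetcher task
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]        # assumed partition field
  - from: "fetcher"        # checking results go to the status updater
    to: "status"
    grouping:
      type: SHUFFLE
  - from: "fetcher"        # redirects are sent back for re-partitioning
    to: "partitioner"
    grouping:
      streamId: "redirect" # assumed stream name
      type: SHUFFLE
```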