Link Checker
The Link Checker is a StormCrawler adaptation for URL checking. Instead of crawling, it checks the status of URLs and persists the results in a database (currently MariaDB/MySQL). It is important to understand that the Link Checker is not a standalone application but a Storm topology that runs inside a cluster. For more information on Storm topologies, please see the documentation of the Apache Storm project.
How to setup and run
In your IDE
- Clone this repository into an IDE workspace
- Either create a file crawler-test.flux or change the file name in class at.oeaw.acdh.linkchecker.LinkcheckerTest, line 13, so that it points to a valid flux file
- Adapt the settings in crawler.flux and crawler-conf.yaml (or whatever these files are called in your test environment) as described in the cluster setup.
- Execute class at.oeaw.acdh.linkchecker.LinkcheckerTest
In a local cluster
- Before you can run the Link Checker, you need to install Apache Storm. Download Apache Storm 2.2.0 (the currently supported version) from https://archive.apache.org/dist/storm/apache-storm-2.2.0/apache-storm-2.2.0.tar.gz
- Clone this repository.
- Run mvn install in the working directory.
- Add your Hikari connection pool properties to crawler-conf.yaml (and change any other parameters you wish, e.g. http.agent):
HIKARI:
  driverClassName: com.mysql.cj.jdbc.Driver
  jdbcUrl: {your database url, e.g. "jdbc:mysql://localhost:3307/stormychecker"}
  username: {your database username}
  password: {your database password}
- Point to your crawler-conf.yaml file in crawler.flux:
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: {path to your crawler-conf.yaml file}
    override: true
Note: If you set it to "crawler-conf.yaml", you can directly use the crawler-conf.yaml in this repository.
- To start the Link Checker in local mode, run
apache-storm-2.2.0/bin/storm local path/to/this/repository/target/linkchecker-2.1.0.jar org.apache.storm.flux.Flux --local path/to/this/repository/crawler.flux --local-ttl 3600
For a remote cluster setup, please see the documentation of the Apache Storm project.
Simple Explanation of Current Implementation
Our SQL database has 6 tables:
- url: This is the table from which the Link Checker reads the URLs to check. It is populated by another application (in our case, the curation-module).
- status: This is the table that linkchecker saves the results into.
- history: If a URL is checked more than once, the previous checking result is saved in the history table and the record in the status table is updated.
- providerGroup
- context: This table saves the contexts in which URLs appear.
- url_context: joins the url table n-to-n to the context table, so that each URL can appear in multiple contexts. The table also stores the last time the link was ingested and a boolean flag indicating whether the join is still active. Only URLs with at least one active join are considered for checking.
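To illustrate the active-join rule above, a query in this spirit could select the candidate URLs. This is only a sketch: the join column and flag names (url_id, active) are assumptions for illustration, not the actual schema of this repository.

```java
// Hypothetical sketch of selecting only URLs with at least one
// active url_context join. Column names are assumptions.
public class ActiveUrlQuery {
    public static String buildQuery() {
        return "SELECT DISTINCT u.url FROM url u "
             + "JOIN url_context uc ON uc.url_id = u.id "
             + "WHERE uc.active = 1";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery());
    }
}
```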
crawler.flux defines our topology. It defines all the spouts, bolts and streams.
- at.ac.oeaw.acdh.linkchecker.spout.RASASpout uses the resource availability status API to fill up a buffer with URLs to check.
- com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt partitions the URLs by a configured criterion.
- at.ac.oeaw.acdh.linkchecker.bolt.MetricsFetcherBolt fetches the URLs. It sends redirects back to URLPartitionerBolt and sends the rest onwards down the stream to StatusUpdaterBolt. It is a modification of com.digitalpebble.stormcrawler.bolt.FetcherBolt.
- at.ac.oeaw.acdh.linkchecker.bolt.StatusUpdaterBolt persists the results in the status table of the database via the resource availability status API.
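To make the wiring concrete, a minimal Flux file in the spirit of crawler.flux could look as follows. This is a sketch, not the repository's actual topology definition: the component ids, grouping fields, stream id, and parallelism values are assumptions.

```yaml
name: "linkchecker"

spouts:
  - id: "spout"
    className: "at.ac.oeaw.acdh.linkchecker.spout.RASASpout"
    parallelism: 1

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "at.ac.oeaw.acdh.linkchecker.bolt.MetricsFetcherBolt"
    parallelism: 1
  - id: "status"
    className: "at.ac.oeaw.acdh.linkchecker.bolt.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"          # URLs to check enter the topology here
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "partitioner"    # field grouping keeps each partition key on one fetcher task
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]        # assumed partition field
  - from: "fetcher"        # checking results go to the status updater
    to: "status"
    grouping:
      type: SHUFFLE
  - from: "fetcher"        # redirects are sent back for re-partitioning
    to: "partitioner"
    grouping:
      streamId: "redirect" # assumed stream name
      type: SHUFFLE
```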