Pinned repositories
-
stormcrawlerfight
Crawl configurations for benchmarking / testing StormCrawler
-
storm2
Minimal resources for testing Storm 2 - requires the branch 2.x from SC
-
behemoth Archived
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
-
crawler-commons
Forked from crawler-commons/crawler-commons. A set of reusable Java components that implement functionality common to any web crawler
-
sc-warc
WARC resources for StormCrawler
-
azazello
Azazello is an open source platform for large scale document analysis based on Apache Spark
-
tescobank
Setup for crawling tescobank with SC
-
textclassification-examples
Use cases for DigitalPebble's TextClassification API
-
TextClassification
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
-
behemoth-commoncrawl Archived
Support for old (pre 2013) CommonCrawl dataset in Behemoth
-
tika-cc
Resources for generating a corpus of docs from CC for Tika
-
elasticsearch-hadoop
Forked from elastic/elasticsearch-hadoop. Elasticsearch real-time search and analytics natively integrated with Hadoop
-
NutchFight
Resources for comparing versions 1.8 and 2.x of Apache Nutch
-
behemoth-elasticsearch Archived
ElasticSearch module for Behemoth
-
behemoth-textclassification Archived
Module for classifying Behemoth documents with a model from our Text Classification API
-
TextClassificationPlugin
GATE Processing Resource wrapping DigitalPebble's TextClassification API
-
ngrams-api
Java API for querying an N-grams corpus. Uses Lucene for indexing and searching data in the Google Web-1T format