DigitalPebble Ltd


storm-crawler

Scalable web crawler based on Apache Storm

java web-crawler distributed apache-storm

Java; Apache-2.0; 201 forks; 607 stars; 34 open issues (1 issue needs help); 1 pull request; updated Jan 9, 2020
ansible-storm

Ansible playbook for deploying a Storm cluster

ansible storm playbook stormcrawler

0 forks; 1 star; 0 open issues; 0 pull requests; updated Jan 9, 2020
stormcrawlerfight

Crawl configurations for benchmarking / testing StormCrawler

Shell; Apache-2.0; 5 forks; 8 stars; 0 open issues; 0 pull requests; updated Sep 19, 2019
storm2

Minimal resources for testing Storm 2. Requires the 2.x branch of StormCrawler (SC).

Java; 0 forks; 1 star; 0 open issues; 0 pull requests; updated Oct 22, 2018
behemoth Archived

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

java nlp hadoop mapreduce

Java; 61 forks; 285 stars; 12 open issues; 1 pull request; updated Apr 25, 2018
crawler-commons
Forked from crawler-commons/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler

Java; Apache-2.0; 61 forks; 4 stars; 0 open issues; 0 pull requests; updated Apr 4, 2017
storm
Forked from apache/storm
Mirror of Apache Storm

Java; Apache-2.0; 4,015 forks; 0 stars; 0 open issues; 0 pull requests; updated Feb 27, 2017
sc-warc

WARC resources for StormCrawler

1 fork; 2 stars; 3 open issues; 0 pull requests; updated Oct 20, 2016
azazello

Azazello is an open source platform for large scale document analysis based on Apache Spark

Java; Apache-2.0; 1 fork; 7 stars; 2 open issues (1 issue needs help); 0 pull requests; updated Apr 20, 2016
nutch
Forked from apache/nutch
Mirror of Apache Nutch

Java; Apache-2.0; 1,159 forks; 0 stars; 0 open issues; 0 pull requests; updated Nov 25, 2015
tescobank

Setup for crawling tescobank with SC

Java; Apache-2.0; 2 forks; 4 stars; 0 open issues; 0 pull requests; updated Sep 23, 2015
textclassification-examples

Use cases for DigitalPebble's TextClassification API

Java; Apache-2.0; 3 forks; 10 stars; 0 open issues; 0 pull requests; updated Sep 1, 2015
TextClassification

A text classification API in Java, originally developed by DigitalPebble Ltd. The API is independent of the underlying ML implementations and can serve as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

Java; Apache-2.0; 19 forks; 45 stars; 1 open issue; 0 pull requests; updated Sep 1, 2015
behemoth-commoncrawl Archived

Support for the old (pre-2013) CommonCrawl dataset in Behemoth

Java; 0 forks; 4 stars; 0 open issues; 0 pull requests; updated Apr 20, 2015
tika-cc

Resources for generating a corpus of documents from CC for Tika

Shell; 0 forks; 0 stars; 0 open issues; 0 pull requests; updated Nov 28, 2014
elasticsearch-hadoop
Forked from elastic/elasticsearch-hadoop
Elasticsearch real-time search and analytics natively integrated with Hadoop

Java; Apache-2.0; 847 forks; 1 star; 0 open issues; 0 pull requests; updated Sep 29, 2014
NutchFight

Resources for comparing versions 1.8 and 2.x of Apache Nutch

Java; Apache-2.0; 0 forks; 4 stars; 0 open issues; 0 pull requests; updated Jun 4, 2014
behemoth-elasticsearch Archived

Elasticsearch module for Behemoth

Java; 0 forks; 1 star; 0 open issues; 0 pull requests; updated Feb 12, 2014
behemoth-textclassification Archived

Module for classifying Behemoth documents with a model from our Text Classification API

Java; 0 forks; 1 star; 0 open issues; 0 pull requests; updated Nov 22, 2012
TextClassificationPlugin

GATE Processing Resource wrapping DigitalPebble's TextClassification API

Java; 3 forks; 5 stars; 1 open issue; 1 pull request; updated Jul 12, 2012
ngrams-api

Java API for querying an n-grams corpus. Uses Lucene to index and search data in the Google Web-1T format.

Java; 2 forks; 4 stars; 0 open issues; 0 pull requests; updated Apr 27, 2012
