common-crawl

Is your feature request related to a problem? Please describe.
Because OSCAR relies on the fastText language classifier, which was trained on Wikipedia, the datasets also contain sentences in other languages. For instance, the Tajik data (tg.txt) contains large chunks of Uzbek sentences in Cyrillic script.

Describe the solution you'd like
Train new models using other data othe

Here are 36 public repositories matching this topic...

commoncrawl / cc-pyspark

signedsecurity / sigurlfind3r

commoncrawl / news-crawl

michaelharms / comcrawl

oscar-corpus / goclassy

commoncrawl / cc-crawl-statistics

oscar-corpus / ungoliant

[Feature request] Train a classifier to better classify languages

IBM / cc-dbp

commoncrawl / cc-webgraph

bminixhofer / gerpt2

commoncrawl / cc-notebooks

oscar-corpus / oscar-website

hrbrmstr / cc

tokenmill / common-crawl-utils

code402 / warc-benchmark

HRN-Projects / common_crawl_with_scrapy

socket-var / nyt-twitter-cc-hadoop

Mgosi / Big-Data-Analysis-using-MapReduce-in-Hadoop

mwoss / mors

siddheswarc / EDA-using-MapReduce

toimik / CommonCrawl

seanbethard / corpuswork

ErikGartner / prometheus-cc-extractor

fizerkhan / CommonCrawlDocumentDownload

fizerkhan / KeywordAnalysis

fizerkhan / cdx-index-client

ggodreau / huhdewp

srmocher / fake-science

Dahouabdelhalim / Discourse-marksers-and-Web-crawling

hadrianw / abracabra
