# text-extraction

Here are 91 public repositories matching this topic...

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated Sep 2, 2020
Python

unidoc / unipdf

Open

[FEATURE] Early-termination while processing contentstream

gunnsth commented Oct 21, 2019

Is your feature request related to a problem? Please describe.
The problem is inefficiency when simply looking for a single operand and then stopping processing.
For example, if only looking for a single colored pixel in a page.

Describe the solution you'd like
It would make sense to be able to set a stop flag on the processor and return out of the handler, which would cause the proc

feature good first issue performance
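
The stop-flag idea in this request can be illustrated with a minimal sketch. This is not unipdf's API (unipdf is a Go library). The `Processor` class, its `add_handler`, `stop`, and `process` methods, and the tuple representation of operations are all hypothetical names chosen for illustration. The only detail taken from the PDF format is that `rg` is the operator that sets the non-stroking RGB color.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation of one content stream operation:
# (operator name, parameters). Not taken from unipdf.
Operation = Tuple[str, tuple]


class Processor:
    """Toy content stream processor with an early-termination flag."""

    def __init__(self) -> None:
        self._handlers: List[Callable[["Processor", Operation], None]] = []
        self._stopped = False
        self.processed = 0  # number of operations visited in the last run

    def add_handler(self, handler: Callable[["Processor", Operation], None]) -> None:
        self._handlers.append(handler)

    def stop(self) -> None:
        # A handler calls this and then returns; the loop below checks the flag.
        self._stopped = True

    def process(self, operations: Iterable[Operation]) -> None:
        self._stopped = False
        self.processed = 0
        for op in operations:
            self.processed += 1
            for handler in self._handlers:
                handler(self, op)
                if self._stopped:
                    return  # skip the rest of the stream


found: List[tuple] = []


def find_first_fill_color(proc: Processor, op: Operation) -> None:
    name, params = op
    if name == "rg":  # set non-stroking RGB color
        found.append(params)
        proc.stop()


ops: List[Operation] = [
    ("q", ()),
    ("rg", (1.0, 0.0, 0.0)),
    ("re", (0, 0, 10, 10)),
    ("f", ()),
    ("Q", ()),
]

proc = Processor()
proc.add_handler(find_first_fill_color)
proc.process(ops)
print(found, proc.processed)
```

In this sketch the handler only sets a flag and returns normally, so the processing loop stays in control of when to exit. Raising an exception from the handler would be an alternative design, at the cost of making every caller handle it.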

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Updated Jul 6, 2020
Python

whitelok / image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition

machine-learning awesome ocr deep-learning text-extraction text-recognition deep-learning-algorithms convolutional-neural-networks text-detection scene-texts

Updated Aug 29, 2020

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf text-extraction pdf-files pdf-invoice unidoc pdf-library

Updated May 23, 2019
Go

miso-belica / jusText

Heuristic based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated Jul 1, 2020
Python

shixzie / nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

nlp go golang matter artists natural-language-processing algorithm parse text songs text-extraction keyword registermodel

Updated Sep 18, 2017
Go

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

r text-extraction rstats pdf-files r-package poppler pdf-format poppler-library pdftools

Updated Oct 2, 2020
C++

ICIJ / datashare

Better analyze information, in all its forms

docker elasticsearch extract text-extraction named-entity-recognition web-gui datashare investigative-journalism

Updated Oct 2, 2020
Java

cdown / srt

A simple library for parsing, modifying, and composing SRT files.

python text-extraction subtitles public-domain subtitle srt subtitles-parsing

Updated Aug 31, 2020
Python

bookieio / breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)

python text-mining text-extraction html-parsing html-extraction html-extractor

Updated Aug 2, 2019
HTML

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

microsoft pdf ocr aws-lambda lambda-functions tesseract text-extraction asyncio searchable-pdfs pdf-ocr-extraction

Updated Feb 7, 2018
Python

victorqribeiro / ocr

Simple app to extract text from pictures using Tesseract

ocr tesseract text-extraction text-recognition image-recognition

Updated Dec 28, 2019
HTML

vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

ocr php-library tika apache text-extraction text-recognition

Updated Aug 29, 2020
PHP

adbar / trafilatura

Web scraping library and command-line tool to download, extract (metadata, main text, comments), and convert the output

nlp crawler text-mining scraper news web-scraper text-extraction web-scraping tei-xml news-articles html2text article-extractor news-scraper text-cleaning text-preprocessing

Updated Oct 2, 2020
Python

JonathanRaiman / wikipedia_ner

📖 Labeled examples from wiki dumps in Python

python wikipedia text-extraction dataset named-entity-recognition

Updated Aug 8, 2016
Jupyter Notebook

vsymbol / CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Updated Sep 26, 2020
Python

sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

language pdf stream julia iso text-extraction adobe pdf-files pdf-document cos pdf-specification pdf-library pdf-development

Updated Jul 29, 2020
Julia

lu4p / cat

Extract text from plaintext, .docx, .odt, .pdf and .rtf files. Pure go.

cat go golang cross-platform text-extraction extract-text pdftotext docx2txt textextracting rtf-to-text pdf2txt odt2txt

Updated Aug 13, 2020
Go

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf machine-learning ocr pipeline text-extraction pdf-to-text language-model

Updated Sep 15, 2020
HTML

noyesno / awka

Revive awka - Awk to C Compiler

c text-mining code-generator compiler awk text-extraction text-processing awk2c awka

Updated Oct 11, 2018
C

ckorzen / pdf-text-extraction-benchmark

A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.

pdf tex benchmark evaluation extraction text-extraction arxiv

Updated Nov 27, 2018
TeX

rajesh-bhat / spark-ai-summit-2020-text-extraction

keras cnn text-extraction lstm text-recognition text-detection summit ctc-loss spark-ai

Updated Aug 24, 2020
Jupyter Notebook

fourdigits / wagtail_textract

Text extraction for Wagtail document search

search django wagtail tesseract text-extraction textract

Updated Oct 1, 2020
Python

jmriebold / BoilerPy3

Python port of Boilerpipe library

text-extraction boilerpipe boilerpy html-text-extraction full-text-extraction

Updated Dec 22, 2019
Python

mknz / mirusan

A PDF collection reader with built-in full-text search engine

electron python search-engine pdf elm whoosh text-extraction pdf-viewer full-text-search

Updated Jun 3, 2017
JavaScript

greed2411 / tokyo

tokyo, a REST API that, given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and extracts text 💪.

clojure extension filetype text-extraction ring mime-types text-parser extract-text apache-tika document-processing text-parsing

Updated Jun 13, 2020
Clojure

Arxa / video_text_detection

Bachelor Thesis | Text extraction from complex video scenes

opencv video gradle javafx image-processing text-extraction junit testfx

Updated Mar 15, 2019
Java

bmoscon / ArticleParse

Heuristic text extraction from news sites in Python3

python analysis text-analysis text-extraction heuristics boilerplate-removal

Updated Dec 31, 2017
Python

IDisposable / IFilterExtractor

A simple component to extract just the text from any file that has an IFilter installed. Available as a C++ COM component and as a C# .NET library.

text-mining com text-extraction ifilter

Updated Mar 31, 2017
C++
