text-extraction

Here are 178 public repositories matching this topic...

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated Aug 5, 2023
Python

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure Go)

golang pdf signing text-extraction pdf-generator pdf-generation pdf-reader pdf-manipulation pdf-library pdf-document-processor pdf-compression pdf-sign pdf-reports

Updated Aug 4, 2023
Go

adbar / trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Updated Aug 7, 2023
Python

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Updated Jul 25, 2023
Python

whitelok / image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

machine-learning awesome ocr deep-learning text-extraction text-recognition deep-learning-algorithms convolutional-neural-networks text-detection scene-texts

Updated Jul 24, 2022

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf text-extraction pdf-files pdf-invoice unidoc pdf-library

Updated May 23, 2019
Go

miso-belica / jusText

Heuristic-based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated Jan 24, 2023
Python

ICIJ / datashare

A self-hosted search engine for documents.

docker elasticsearch extract text-extraction named-entity-recognition web-gui datashare investigative-journalism

Updated Aug 8, 2023
Java

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

r text-extraction rstats pdf-files r-package poppler pdf-format poppler-library pdftools

Updated May 26, 2023
C++

shixzie / nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

nlp go golang natural-language-processing parse text text-extraction

Updated Sep 18, 2017
Go

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

python library tools command-line text-extraction subtitles subtitle srt subtitles-parsing mit-license command-line-tool subtitle-parser subtitle-fixer

Updated Jul 1, 2023
Python

bookieio / breadability

A rework of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)

python text-mining text-extraction html-parsing html-extraction html-extractor

Updated Feb 21, 2023
HTML

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

python pdf machine-learning ocr pipeline text-extraction pdf-to-text language-model extract-text parsr pd3f

Updated Mar 31, 2023
HTML

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

pdf ocr aws-lambda lambda-functions tesseract text-extraction searchable-pdfs pdf-ocr-extraction

Updated Feb 7, 2018
Python

vsymbol / CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Updated Dec 8, 2022
Python

SapienzaNLP / extend

Entity Disambiguation as text extraction (ACL 2022)

nlp natural-language-processing acl pytorch text-extraction entity-linking entity-disambiguation entity-disambiguation-models acl2022

Updated Apr 17, 2022
Python

archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

scala big-data spark apache-spark hadoop analysis python3 text-extraction pyspark digital-humanities dataframe big-data-analytics webarchives network-graphing

Updated Jul 9, 2023
Scala

sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

pdf julia text-extraction pdf-files pdf-document pdf-specification pdf-library pdf-development

Updated Mar 4, 2023
Julia

vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

ocr php-library tika apache text-extraction text-recognition

Updated Apr 14, 2023
PHP

victorqribeiro / ocr

Simple app to extract text from pictures using Tesseract

ocr tesseract text-extraction text-recognition image-recognition

Updated Jul 19, 2021
HTML
