Module for automatic summarization of text documents and HTML pages.
Updated Aug 5, 2023 - Python
Golang PDF library for creating and processing PDF files (pure Go)
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
This repository has moved! https://github.com/unidoc/unipdf
Heuristic-based boilerplate removal tool
A self-hosted search engine for documents.
Text Extraction, Rendering and Converting of PDF Documents
[UNMAINTAINED] Extract values from strings and fill your structs with NLP.
A simple library and set of tools for parsing, modifying, and composing SRT files.
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
AWS Lambda functions to extract text from various binary formats.
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Entity Disambiguation as text extraction (ACL 2022)
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
PDF Reader Library for Native Julia.
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Simple app to extract text from pictures using Tesseract