Module for automatic summarization of text documents and HTML pages.
Updated Oct 23, 2022 - Python
Golang PDF library for creating and processing PDF files (pure Go)
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
This repository has moved! https://github.com/unidoc/unipdf
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Heuristic based boilerplate removal tool
Better analyze information, in all its forms
Text Extraction, Rendering and Converting of PDF Documents
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
A simple library for parsing, modifying, and composing SRT files.
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
AWS Lambda functions to extract text from various binary formats.
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Entity Disambiguation as text extraction (ACL 2022)
Simple app to extract text from pictures using Tesseract
PDF Reader Library for Native Julia.
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats