cseas / ocr-table

Extract tables from scanned image PDFs using Optical Character Recognition.

ocr-table

This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

Install Requirements

Tesseract OCR
```
sudo apt-get install tesseract-ocr
```
Imagemagick
```
sudo apt-get install imagemagick
```
PDF Utilities
```
sudo apt-get install poppler-utils
```
Python packages
```
sudo pip install -r requirements.txt
```
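Before running the scripts, it can help to confirm that the system tools are on your `PATH`. The following is a minimal sketch, not part of this repository. The executable names (`tesseract`, `convert`, `pdftoppm`) are assumptions about what the packages above install, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names for the packages listed above:
# tesseract-ocr -> tesseract, imagemagick -> convert, poppler-utils -> pdftoppm.
REQUIRED_TOOLS = ["tesseract", "convert", "pdftoppm"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If a tool is reported missing, re-run the matching install command above. Newer ImageMagick releases may ship the executable as `magick` instead of `convert`, in which case the list needs adjusting.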

Usage

Clear the pdf/ folder and copy the PDF files you want to scan into it.
Run the OCR:
```
python3 shellocr.py
```
The extracted text files will be available in the txt/ folder once the process completes.
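For readers curious about what a PDF-to-text OCR pass of this kind involves, here is a rough sketch. It is not the code in `shellocr.py`, whose internals are not shown here. It assumes pages are rendered with `pdftoppm` from poppler-utils and recognized with the `tesseract` command-line tool. The function names and the `tmp_pages` working directory are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Build the poppler command that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, output_base):
    """Build the Tesseract command; it writes its result to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, txt_dir="txt", work_dir="tmp_pages"):
    """Render one PDF to page images, OCR each page, return the .txt paths."""
    pdf_path = Path(pdf_path)
    txt_dir, work_dir = Path(txt_dir), Path(work_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    work_dir.mkdir(parents=True, exist_ok=True)

    # pdftoppm writes one image per page, named <prefix>-<page>.png
    prefix = work_dir / pdf_path.stem
    subprocess.run(pdftoppm_command(pdf_path, prefix), check=True)

    outputs = []
    for image in sorted(work_dir.glob(pdf_path.stem + "-*.png")):
        output_base = txt_dir / image.stem
        subprocess.run(tesseract_command(image, output_base), check=True)
        # Tesseract appends ".txt" to the output base it is given
        outputs.append(Path(str(output_base) + ".txt"))
    return outputs
```

Calling `ocr_pdf(p)` for each `p` in `Path("pdf").glob("*.pdf")` would mirror the pdf/ to txt/ flow described above. The 300 DPI default is a common starting point for OCR, not a value taken from this project.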

Alternate

If the method above doesn't work for you, try this alternate method.
Save your file as input.pdf in the repository's root directory.
Run:
```
python3 pdf_miner.py
```
