ocr-table

This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

Install Requirements

  1. Tesseract OCR

    sudo apt-get install tesseract-ocr
  2. ImageMagick

    sudo apt-get install imagemagick
  3. PDF Utilities

    sudo apt-get install poppler-utils
  4. Python packages

    sudo pip3 install -r requirements.txt
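
Before running the OCR, it can help to confirm that the command-line tools installed above are actually on your PATH. The following is a minimal sketch and not part of this repository: the helper name `missing_tools` is hypothetical, and the executable names assume the usual Debian/Ubuntu packaging (`tesseract` from tesseract-ocr, `convert` from imagemagick, `pdftoppm` from poppler-utils).

```python
import shutil

# Executable names are an assumption based on the usual Debian/Ubuntu packages.
REQUIRED_TOOLS = ["tesseract", "convert", "pdftoppm"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.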

Usage

  1. Clear the pdf/ folder, then copy all the PDF files you want to scan into it.

  2. Run the OCR:

    python3 shellocr.py
  3. The extracted text files will be available in the txt/ folder once the process completes.
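
The steps above can be pictured as a small batch loop: for every PDF in pdf/, render its pages to images, OCR each image, and write the combined text into txt/. The sketch below is an assumption about how such a pipeline might be wired together with the tools listed under Install Requirements; it is not the actual contents of shellocr.py. The function names, the 300 DPI setting, and the `work/` scratch directory are all hypothetical.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, image_prefix, dpi=300):
    """Build a pdftoppm command that renders each PDF page to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def ocr_command(image_path, output_base):
    """Build a tesseract command; tesseract appends '.txt' to output_base."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_folder(pdf_dir="pdf", txt_dir="txt", work_dir="work", run=subprocess.run):
    """OCR every PDF in pdf_dir and write one text file per PDF into txt_dir."""
    pdf_dir, txt_dir, work_dir = Path(pdf_dir), Path(txt_dir), Path(work_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Hypothetical layout: one scratch subdirectory per PDF.
        page_dir = work_dir / pdf_path.stem
        page_dir.mkdir(parents=True, exist_ok=True)
        run(render_command(pdf_path, page_dir / "page"), check=True)

        pages = []
        # pdftoppm names the page images "<prefix>-<page number>.png".
        for image in sorted(page_dir.glob("page-*.png")):
            run(ocr_command(image, image.with_suffix("")), check=True)
            pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))

        out_file = txt_dir / (pdf_path.stem + ".txt")
        out_file.write_text("\n".join(pages), encoding="utf-8")
```

Under these assumptions, calling `ocr_folder()` from the repository root would process everything in pdf/ and leave one text file per PDF in txt/. Building the commands in separate functions keeps the external tool invocations easy to inspect or adjust.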

Alternate

If the above doesn't work for you, try the alternate method:

  1. Save your file as input.pdf in the root directory.

  2. Run:

    python3 pdf_miner.py
