Linux Unix: Extract Text From PDFs And Images With gImageReader, A Tesseract OCR GUI

Monday, 03 January 2011

gImageReader is a graphical GTK frontend to tesseract-ocr, a free software optical character recognition (OCR) engine.

Tesseract is a raw OCR engine, with no document layout analysis, no output formatting and no graphical user interface (GUI).

gImageReader takes an image or PDF file and turns it into text. It lets you select columns or parts of a document, can open multipage PDF files and images in common formats, sends the selected area to Tesseract for recognition, and can spell check the output.

Optional: Install Tesseract OCR 3.0 SVN in Ubuntu Lucid and Maverick

Tesseract OCR 3.0 is still in development, but in my tests it worked much better than the current stable version. Furthermore, the PPA below comes with a lot of extra Tesseract language files, so I suggest installing the latest Tesseract OCR 3.0 SVN. This is, however, optional!

Warning: add the PPA, install the latest Tesseract and then disable the PPA again, because it contains a lot of bleeding edge packages!

Add the PPA and install Tesseract OCR 3.0 SVN:

sudo add-apt-repository ppa:alex-p/notesalexp
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
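If you want to double check that the package really came from the PPA, something like the sketch below may help. The helper name show_origin is made up for this example; it only wraps the standard "apt-cache policy" command, which prints the installed version and the repositories that offer the package.

```shell
# show_origin is a hypothetical helper name, not part of any package.
# It prints the installed/candidate versions of a package and the
# repositories offering it, or a short message if apt-cache is missing.
show_origin() {
    if command -v apt-cache >/dev/null 2>&1; then
        apt-cache policy "$1"
    else
        echo "apt-cache not available; cannot query $1"
    fi
}

# Example (not run here):
#   show_origin tesseract-ocr
```

In the output, the installed version should be listed with the ppa.launchpad.net address if it was pulled from the PPA.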

You can install some extra languages from this PPA, such as Bulgarian, Catalan, Czech, Danish, German, Greek, Finnish, Indonesian, Hungarian, Italian, Dutch, Polish, Romanian, Spanish and so on. Simply search for "tesseract-ocr" in Synaptic and you should easily find all these packages - install the ones you'll need later on.
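If you prefer the terminal over Synaptic, here is a rough sketch. It assumes the language packages follow the same naming pattern as tesseract-ocr-eng above (tesseract-ocr- plus a three-letter language code); the helper name lang_packages and the codes deu (German) and ita (Italian) are only examples, so check the real names with "apt-cache search tesseract-ocr" first.

```shell
# lang_packages is a hypothetical helper: it turns language codes into
# package names, one per line, assuming the "tesseract-ocr-<code>" pattern.
lang_packages() {
    printf 'tesseract-ocr-%s\n' "$@"
}

# Example (not run here):
#   apt-cache search tesseract-ocr
#   sudo apt-get install $(lang_packages deu ita)
```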

Now you must disable the PPA: press ALT + F2 and enter:

gksu software-properties-gtk

Then, on the "Other Software" tab, look for the line(s) containing "http://ppa.launchpad.net/alex-p/notesalexp" and either disable or delete them.
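The same thing can probably be done without the GUI by commenting out the PPA's "deb" lines. This is only a sketch: comment_out_deb is a made-up helper, and the file name under /etc/apt/sources.list.d/ is an assumption (it usually includes the PPA owner, the PPA name and your release codename), so look in that directory for the actual file.

```shell
# comment_out_deb is a hypothetical helper: it prints the given sources
# list file with every active "deb" / "deb-src" line commented out.
# Lines that are already commented are left untouched.
comment_out_deb() {
    sed 's/^[[:space:]]*deb/# &/' "$1"
}

# Example (not run here); the path below is an assumption:
#   f=/etc/apt/sources.list.d/alex-p-notesalexp-lucid.list
#   comment_out_deb "$f" | sudo tee "$f.new" >/dev/null && sudo mv "$f.new" "$f"
#   sudo apt-get update
```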

gImageReader

gImageReader is available for Linux and Windows, with .deb, .rpm and .exe files available for download.

To use gImageReader, select the PDF or image you want to extract the text from and click "Recognize all" to process the whole page. To extract only a part of the document, draw a selection with your mouse and then click "Recognize selection".

If you've installed the Tesseract OCR language data for the PDF or image you're trying to open, gImageReader will automatically detect the language.
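For comparison, this is roughly what a whole-page recognition looks like when calling Tesseract directly, without the GUI. The helper name ocr_image and the file name scan.tif are made up for this sketch; the "tesseract image outputbase -l lang" form is the engine's usual command line, but which image formats it accepts depends on how your Tesseract was built.

```shell
# ocr_image is a hypothetical helper: it runs Tesseract on one image and
# writes the text to <image name without extension>.txt.
# $1 = image file, $2 = language code (defaults to "eng").
ocr_image() {
    img=$1
    lang=${2:-eng}
    tesseract "$img" "${img%.*}" -l "$lang"
}

# Example (not run here): creates scan.txt from scan.tif
#   ocr_image scan.tif eng
```

Unlike gImageReader, this gives you no column or area selection and no spell check, which is exactly what the GUI adds on top of the raw engine.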

Thanks to LFFL for the gImageReader tip!
