Paul Vorbach edited this page Oct 16, 2016 · 8 revisions

Overview

The main view is divided into three sections:

  1. Images
  2. Traineddata files
  3. Working pane, which itself consists of five steps:
    1. Pre-processing
    2. Box Editor
    3. Symbol Overview
    4. Recognition
    5. Evaluation

The working pane shows the image selected in the images list, processed with the selected traineddata file.

Getting started

Installing Tesseract language files

  1. Download the latest release of Tesseract language files from the tessdata releases page. ("Tessdata" is what Tesseract calls its definitions for recognizing a specific language.)
  2. Extract the archive to a directory on your computer (this guide uses C:\Program Files\Tesseract-OCR\tessdata as an example).
  3. Set an environment variable TESSDATA_PREFIX to the parent of the newly created directory. (C:\Program Files\Tesseract-OCR in our example.)
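
To confirm that the steps above worked, a small check along the following lines may help. This is only a sketch: the function name `find_traineddata` is hypothetical and not part of tesseract4java. It assumes, as described above, that TESSDATA_PREFIX points at the parent of the tessdata directory and that language files are named `<code>.traineddata`.

```python
import os
from pathlib import Path


def find_traineddata(env=None):
    """Return sorted language codes found under <TESSDATA_PREFIX>/tessdata.

    Raises RuntimeError with a readable message when the setup is incomplete.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")

    # Per the steps above, the variable points at the PARENT of "tessdata".
    tessdata = Path(prefix) / "tessdata"
    if not tessdata.is_dir():
        raise RuntimeError(f"no tessdata directory found at {tessdata}")

    # Language files are named <code>.traineddata, e.g. eng.traineddata.
    return sorted(p.stem for p in tessdata.glob("*.traineddata"))
```

Calling `find_traineddata()` without arguments reads the real environment. An empty list means the directory exists but contains no language files.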

Creating a new project

File → New Project asks for the directory that contains your images. You can also select which image types to look for. When you create the project, a directory called "tesseract-project" is created inside the selected directory, and all project-relevant data is stored there.

Pre-processing

The pre-processing step currently supports binarization of images that are not already black and white. Adjust the parameters and click the "Preview" button to update the image. Once you are happy with the result, click "Apply to all images". If a single image needs different settings, find parameters that work better for it and click "Apply to current image".
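
For readers unfamiliar with binarization, the idea can be sketched as follows. This page does not say which algorithm or parameters tesseract4java uses, so the fixed global threshold below is an assumption chosen for simplicity, and `binarize` and `black_ratio` are hypothetical names.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to black (0) / white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def black_ratio(binary):
    """Fraction of black pixels, a quick way to compare threshold settings."""
    total = sum(len(row) for row in binary)
    black = sum(px == 0 for row in binary for px in row)
    return black / total if total else 0.0
```

Raising the threshold turns more pixels black, which is roughly the kind of effect to look for in the preview while tuning the parameters.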

Box Editor

tesseract4java also comes with a box editor for training Tesseract. The left side shows a table of all bounding boxes that Tesseract found on the current image; you can filter the table with a simple substring search. The right side shows the bounding boxes rendered on the image. When the table is filtered, only the matching boxes are visible on the image. The currently selected box is shown in red.

If you find a box that does not match a symbol, you can correct its coordinates or characters by editing the values in the fields at the top. The magnifying-glass icons let you change the zoom level of the image.

Instead of looking up a bounding box in the table, you can left-click it directly on the image. Right-clicking a box lets you split it in half or merge it with the previous or next box.
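
The split and merge operations can be pictured with a minimal sketch. The `Box` class and both functions are hypothetical. The sketch assumes that a split cuts the box at its horizontal midpoint, that both halves keep the original symbol until you correct them, and that coordinates follow the box file convention with the origin at the bottom-left corner of the image. The editor's actual behavior may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def split_in_half(box):
    """Cut a box at its horizontal midpoint into a left and a right box."""
    mid = (box.left + box.right) // 2
    return (
        Box(box.symbol, box.left, box.bottom, mid, box.top),
        Box(box.symbol, mid, box.bottom, box.right, box.top),
    )


def merge(a, b):
    """Return the smallest box covering both inputs, joining their symbols."""
    return Box(
        a.symbol + b.symbol,
        min(a.left, b.left),
        min(a.bottom, b.bottom),
        max(a.right, b.right),
        max(a.top, b.top),
    )
```

Splitting is useful when two touching characters were detected as one box; merging covers the opposite case, where one character was broken into two boxes.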

Once you are happy with the bounding boxes on an image, save the box file with File → Save Box File. The box file is saved to the "preprocessed" folder in the "tesseract-project" directory. Box files are compatible with other Tesseract box editors. When you have created several box files, you can run Tesseract for training, as described on the Tesseract wiki.
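
Because box files are plain text, they are easy to inspect outside the editor. The sketch below assumes the common Tesseract box file layout of one symbol per line in the form `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. The function names are hypothetical, and files written by other tools may deviate from this layout.

```python
def parse_box_line(line):
    """Parse '<symbol> <left> <bottom> <right> <top> <page>' into a tuple."""
    # Split from the right so the symbol may be more than one character.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"malformed box line: {line!r}")
    left, bottom, right, top, page = map(int, numbers)
    return symbol, left, bottom, right, top, page


def format_box_line(symbol, left, bottom, right, top, page=0):
    """Inverse of parse_box_line."""
    return f"{symbol} {left} {bottom} {right} {top} {page}"


def read_box_file(path):
    """Read every non-empty line of a UTF-8 box file."""
    with open(path, encoding="utf-8") as fh:
        return [parse_box_line(line) for line in fh if line.strip()]
```

A round trip such as `format_box_line(*parse_box_line(line))` should reproduce a well-formed line unchanged, which makes for a simple sanity check before training.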

Symbol Overview

This view lists all characters on the current image, ranked by their number of occurrences (shown in parentheses). When you select a symbol, you see a list of all variants of that symbol on the current image. By default, the variants are ranked by Tesseract's detection confidence; the ranking can be changed to size (width × height) or weight (number of black pixels). You can also review a character's features as seen by Tesseract by right-clicking a symbol and selecting "Show features".
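
The grouping and ranking described above can be sketched as follows. The data layout (dictionaries with `symbol`, `confidence`, `width`, `height`, and `weight` keys) and the function name are assumptions made for illustration, not tesseract4java's internal model. The sketch sorts variants in ascending order so that the least confident ones come first; the direction used by the actual view is not stated on this page.

```python
from collections import Counter, defaultdict

# The three orderings named in the text above.
SORT_KEYS = {
    "confidence": lambda v: v["confidence"],
    "size": lambda v: v["width"] * v["height"],
    "weight": lambda v: v["weight"],
}


def symbol_overview(variants, order="confidence"):
    """Group variants by symbol; return [(symbol, count, sorted_variants), ...]."""
    groups = defaultdict(list)
    for v in variants:
        groups[v["symbol"]].append(v)
    counts = Counter({s: len(vs) for s, vs in groups.items()})
    key = SORT_KEYS[order]
    # most_common() yields symbols from most to least frequent.
    return [(s, n, sorted(groups[s], key=key)) for s, n in counts.most_common()]
```

Looking at the extremes of any of the three orderings is a quick way to spot outliers, for example a very large "e" that is really two merged characters.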

The purpose of this step is to give you a way to quickly identify and correct incorrect symbols in the box files of the previous step.

Recognition

This step displays the original scan and the recognized text side by side. You can show or hide not only different levels of bounding boxes, such as symbols, words, and paragraphs, but also detected baselines and line numbers (only visible when the zoom level is ≥ 50%).

When word boxes are enabled, right-clicking them gives you information about the word recognition confidence and detected font features. When symbol boxes are shown, right-clicking them gives you possible alternative characters.

Evaluation

TODO

Batch Export

TODO