Usage
The main view is divided into three sections:
- Images
- Traineddata files
- Working pane, which itself consists of various steps:
- Pre-processing
- Box Editor
- Symbol Overview
- Recognition
- Evaluation
The working pane shows the image selected in the image list, processed with the selected traineddata file.
- Download the latest release of Tesseract language files from the tessdata releases page. ("Tessdata" is what Tesseract calls its definitions for recognizing a specific language.)
- Unzip/untar the archive to some directory on your computer (in this guide, I'll use C:\Program Files\Tesseract-OCR\tessdata as an example).
- Set an environment variable TESSDATA_PREFIX to the parent of the newly created directory (C:\Program Files\Tesseract-OCR in our example).
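If the language files do not show up, it can help to confirm what the environment variable actually resolves to. The following is a minimal sketch, not part of tesseract4java: the class name `TessdataCheck` and its helper are hypothetical, and it assumes the convention used in this guide, where TESSDATA_PREFIX points to the parent of the tessdata directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of tesseract4java.
class TessdataCheck {
    /**
     * Returns the expected tessdata directory for a TESSDATA_PREFIX value,
     * or null if the value is missing or blank.
     * Assumption: the prefix is the PARENT of the "tessdata" directory.
     */
    static Path tessdataDir(String prefix) {
        if (prefix == null || prefix.trim().isEmpty()) {
            return null;
        }
        return Paths.get(prefix).resolve("tessdata");
    }

    public static void main(String[] args) {
        Path dir = tessdataDir(System.getenv("TESSDATA_PREFIX"));
        if (dir == null) {
            System.out.println("TESSDATA_PREFIX is not set");
        } else if (Files.isDirectory(dir)) {
            System.out.println("Found tessdata directory: " + dir);
        } else {
            System.out.println("No tessdata directory at: " + dir);
        }
    }
}
```

Run it from the same shell or session that launches the application, so that both see the same environment.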
File → New Project will ask you for a directory where your images are located. You can also check which kinds of images you want to look for. Once you create the new project, a directory called "tesseract-project" is created in the selected directory where all project-relevant data is stored.
The pre-processing step currently supports binarization of non-black-and-white images. You can play with the parameters and update the image by clicking the "Preview" button. Once you are happy with the results, you can click "Apply to all images". Single images can then be optimized by finding other parameters that work better and clicking "Apply to current image".
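To illustrate what binarization does, here is a minimal sketch using a single global brightness threshold. This is an assumption chosen for simplicity and is not necessarily the method tesseract4java uses; the class `BinarizeSketch` and the `threshold` parameter are hypothetical.

```java
import java.awt.image.BufferedImage;

// Illustrative only: a simple global threshold, not the tool's own algorithm.
class BinarizeSketch {
    /**
     * Converts an image to pure black and white.
     * Pixels whose brightness is below the threshold (0-255) become black,
     * all others become white.
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted brightness (ITU-R BT.601 luma coefficients).
                int luma = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

Tuning the parameters in the "Preview" step is comparable to adjusting `threshold` here: too low and faint strokes vanish, too high and background noise turns black.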
tesseract4java also comes with a box editor for training Tesseract. The left side shows a table of all bounding boxes Tesseract found on the current image, which you can filter with a simple substring search. The right side shows the bounding boxes rendered on the image. When the list is filtered, only the matching boxes are visible on the image. The currently selected box is shown in red.
If you find a box that does not match a symbol, you can change its coordinates or characters by editing the fields at the top. The magnifying glass icons let you change the zoom level of the image.
As an alternative to finding a bounding box in the table, you can also simply left-click on it on the image. Right-clicking on it lets you split a box in half or merge it with the previous or next box.
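The split and merge operations can be pictured as simple rectangle arithmetic. The sketch below is an assumption about how such operations might work, not the editor's actual code: `BoxOps`, `Box`, and the choice to split at the horizontal midpoint and to keep the original text on both halves are all hypothetical.

```java
// Hypothetical model of a bounding box and the split/merge operations.
class BoxOps {
    static final class Box {
        final String text;
        final int left, bottom, right, top;

        Box(String text, int left, int bottom, int right, int top) {
            this.text = text;
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    /** Splits a box into a left and a right half at its horizontal midpoint. */
    static Box[] splitInHalf(Box b) {
        int mid = b.left + (b.right - b.left) / 2;
        return new Box[] {
            new Box(b.text, b.left, b.bottom, mid, b.top),
            new Box(b.text, mid, b.bottom, b.right, b.top)
        };
    }

    /** Merges two boxes into the smallest box that encloses both. */
    static Box merge(Box a, Box b) {
        return new Box(
                a.text + b.text,
                Math.min(a.left, b.left),
                Math.min(a.bottom, b.bottom),
                Math.max(a.right, b.right),
                Math.max(a.top, b.top));
    }
}
```

After a split, the text of each half usually still needs to be corrected by hand in the fields at the top.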
After you are happy with the bounding boxes on an image, you should save the box file with File → Save Box File. The box file is then saved to the "preprocessed" folder in the "tesseract-project" directory. Box files are compatible with other Tesseract Box Editors. When you have created several box files, you can run Tesseract for training, as described on the Tesseract wiki.
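Box files are plain text, which is why they can be shared between editors. In the classic Tesseract box format, each line holds a symbol followed by its left, bottom, right, and top coordinates and a page number, with the origin at the bottom-left corner of the image. The following minimal sketch reads and writes one such line; the class `BoxFileLine` is hypothetical, and it assumes a symbol without embedded whitespace.

```java
// Hypothetical reader/writer for one line of a Tesseract box file:
// <symbol> <left> <bottom> <right> <top> <page>
class BoxFileLine {
    final String symbol;
    final int left, bottom, right, top, page;

    BoxFileLine(String symbol, int left, int bottom, int right, int top, int page) {
        this.symbol = symbol;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
        this.page = page;
    }

    /** Parses one box file line; assumes the symbol contains no whitespace. */
    static BoxFileLine parse(String line) {
        String[] p = line.trim().split("\\s+");
        if (p.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields: " + line);
        }
        return new BoxFileLine(p[0],
                Integer.parseInt(p[1]), Integer.parseInt(p[2]),
                Integer.parseInt(p[3]), Integer.parseInt(p[4]),
                Integer.parseInt(p[5]));
    }

    /** Formats the box back into the box file line syntax. */
    String format() {
        return symbol + " " + left + " " + bottom + " " + right + " " + top + " " + page;
    }

    public static void main(String[] args) {
        BoxFileLine box = parse("T 26 5 48 31 0");
        System.out.println(box.symbol + " is " + (box.right - box.left) + " px wide");
        System.out.println(box.format());
    }
}
```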
This view lists all characters on the current image, ranked by their number of occurrences (shown in parentheses). When you select a symbol, you see a list of all variants of that symbol on the current image. By default, the variants are ranked by Tesseract's detection confidence; you can change the ranking to size (width × height) or weight (number of black pixels). You can also review a character's features as seen by Tesseract by right-clicking on a symbol and selecting "Show features".
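The three ranking criteria can be expressed as comparators over the variants of a symbol. This is a minimal sketch under stated assumptions: the `Variant` class and its fields are hypothetical, and the sort direction (ascending, so the least confident or smallest variants come first) is chosen for illustration and may differ from what the application does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the symbol variants and their ranking criteria.
class SymbolRanking {
    static final class Variant {
        final float confidence;   // detection confidence reported by Tesseract
        final int width, height;  // bounding box dimensions in pixels
        final int blackPixels;    // number of black pixels inside the box

        Variant(float confidence, int width, int height, int blackPixels) {
            this.confidence = confidence;
            this.width = width;
            this.height = height;
            this.blackPixels = blackPixels;
        }
    }

    // Ascending order is an assumption made for this example.
    static final Comparator<Variant> BY_CONFIDENCE =
            Comparator.comparingDouble((Variant v) -> v.confidence);
    static final Comparator<Variant> BY_SIZE =
            Comparator.comparingInt((Variant v) -> v.width * v.height);
    static final Comparator<Variant> BY_WEIGHT =
            Comparator.comparingInt((Variant v) -> v.blackPixels);

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Variant> ranked(List<Variant> variants, Comparator<Variant> order) {
        List<Variant> copy = new ArrayList<Variant>(variants);
        copy.sort(order);
        return copy;
    }
}
```

Sorting by size or weight tends to push outliers, such as merged characters or specks of noise, to the ends of the list, where they are easy to spot.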
The purpose of this step is to give you a way to quickly identify and correct incorrect symbols in the box files of the previous step.
This step displays the original scan and the recognized text side by side. You can show or hide different levels of bounding boxes (symbols, words, and paragraphs) as well as detected baselines and line numbers (only visible when the zoom level is ≥ 50%).
When word boxes are enabled, right-clicking them gives you information about the word recognition confidence and detected font features. When symbol boxes are shown, right-clicking them gives you possible alternative characters.
TODO
TODO