Upe Blog: Java Wrapper for Tesseract OCR Library

Tesseract is a very popular OCR library written in C++. It can be used to recognize the characters in an image that contains text, and it can also report the position of each word or character. Tesseract provides a command line tool and a C++ API. However, there is no implementation that lets Java users call Tesseract directly from their applications.

As a part of my GSoC project with Apache PDFBox, I implemented a Java wrapper for the Tesseract C++ API that lets Java users call Tesseract directly from their applications. The code repository can be found here.

To use the Java API, simply add Tesseract-JNI-Wrapper-1.0.0.jar to your project. If you are using Maven, add this to your pom.xml:

<dependency>
  <groupId>org.apache.pdfbox.ocr</groupId>
  <artifactId>Tesseract-JNI-Wrapper</artifactId>
  <version>1.0.0</version>
</dependency>

Here is sample code that uses the Java API to invoke Tesseract:

public String getOCRText(BufferedImage image) { // pass a BufferedImage (RGB) of the scanned image
  TessBaseAPI api = new TessBaseAPI();
  // First argument: location of the training data files; second: language
  boolean init = api.init("src/main/resources/data", "eng");
  try {
    api.setBufferedImage(image);
    String text = api.getUTF8Text();
    System.out.println(text);
    return text;
  } finally {
    api.end(); // always release the Tesseract instance, even if OCR throws
  }
}
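
Since the method above expects an RGB BufferedImage, scans that arrive as grayscale or indexed images may need converting first. Here is a minimal sketch of one way to do that with the standard Java imaging classes. It assumes that "RGB" means BufferedImage.TYPE_INT_RGB, which the wrapper may or may not require exactly; the class name RgbImageLoader and its methods are made up for this example and are not part of the wrapper.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the Tesseract JNI wrapper.
public class RgbImageLoader {

    // Returns the image unchanged if it is already TYPE_INT_RGB,
    // otherwise draws it onto a new TYPE_INT_RGB image of the same size.
    public static BufferedImage toRgb(BufferedImage source) {
        if (source.getType() == BufferedImage.TYPE_INT_RGB) {
            return source;
        }
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    // Reads an image file and converts it to RGB.
    public static BufferedImage load(File file) throws IOException {
        BufferedImage image = ImageIO.read(file);
        if (image == null) {
            // ImageIO.read returns null when no registered reader understands the file
            throw new IOException("Unsupported or unreadable image: " + file);
        }
        return toRgb(image);
    }
}
```

With this in place, the OCR call could look like getOCRText(RgbImageLoader.load(new File("scan.png"))), where scan.png is a placeholder file name.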

Getting the position of each OCRed word:

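public void printOCRTextPositions(BufferedImage image) {
  TessBaseAPI api = new TessBaseAPI();
  boolean init = api.init("src/main/resources/data", "eng"); // location of the training data files
  try {
    api.setBufferedImage(image);
    api.getResultIterator(); // prepare the iterator over the recognized words
    if (api.isResultIteratorAvailable()) {
      do {
        System.out.println(api.getWord().trim()); // the recognized word
        String result = api.getBoundingBox();     // its bounding box, as a string
        System.out.println(result);
      } while (api.resultIteratorNext());
    }
  } finally {
    api.end(); // always release the Tesseract instance
  }
}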
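
Because getBoundingBox() returns the box as a String, you will probably want numbers if you plan to do anything with the positions. The exact format of that string is not described in this post, so the following is only a sketch under one assumption: that the string contains the coordinates as non-negative integers separated by any non-digit characters. The class BoundingBoxParser is hypothetical, and the order and meaning of the values should be checked against the wrapper's real output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Tesseract JNI wrapper.
public class BoundingBoxParser {

    // Matches each run of digits; assumes coordinates are non-negative integers.
    private static final Pattern INTEGER = Pattern.compile("\\d+");

    // Returns every integer found in the string, in the order it appears.
    public static List<Integer> extractIntegers(String boundingBox) {
        List<Integer> values = new ArrayList<>();
        Matcher matcher = INTEGER.matcher(boundingBox);
        while (matcher.find()) {
            values.add(Integer.parseInt(matcher.group()));
        }
        return values;
    }
}
```

For example, extractIntegers("10,20,110,45") returns the list [10, 20, 110, 45]. The sample string is made up for illustration and is not actual wrapper output.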

P.S.
This wrapper currently works on macOS and Linux. It has not been tested on Windows. If anyone is willing to develop or improve the functionality of this wrapper, please let me know.

Wednesday, October 8, 2014
