Wednesday, October 8, 2014

Java Wrapper for Tesseract OCR Library

Tesseract is a very popular OCR library written in C++. It can be simply used to identify characters in a given image that contains text. In addition to that it can be used to get positions of each word/ character. Tesseract provides a command line tool and a C++ api to give services to users. However there is not a implementation for Java users that can directly use Tesseract for their applications.

As a part of my GSoC project in Apache PDFBox  I implemented a Java wrapper for Tesseract C++ api that can be used by Java users to directly use Tesseract in their applications. Code repository can be found from here.

To use Java API simply import Tesseract-JNI-Wrapper-1.0.0.jar to your project. If you are using maven, add this to your pom

<dependency>
  <groupId>org.apache.pdfbox.ocr</groupId>
  <artifactId>Tesseract-JNI-Wrapper</artifactId>
  <name>Tesseract Jni Wrapper</name>
  <version>1.0.0</version>
</dependency>


Here is a sample code that can use Java API invoke Tesseract.

public String getOCRText(BufferedImage image){ //You need to send BufferedImage (RGB) of scanned image
  TessBaseAPI api = new TessBaseAPI();
  boolean init = api.init("src/main/resources/data", "eng"); // position of Training data files
  api.setBufferedImage(image);
  String text = api.getUTF8Text();
  System.out.println(text);
  api.end();
  return text;
}


Getting positions of each OCRed word

public void printOCRTextPositions(BufferedImage image){
  TessBaseAPI api = new TessBaseAPI();
  boolean init = api.init("src/main/resources/data", "eng");
  api.setBufferedImage(image);
  api.getResultIterator();
  if (api.isResultIteratorAvailable()) {
    do {
      System.out.println(api.getWord().trim());
      String result = api.getBoundingBox();
      System.out.println(result);
    } while (api.resultIteratorNext());
  }
  api.end();
}


P.S.
This wrapper currently is working in MacOS and Linux environments. It wasn't tested in Windows environments. If anyone is willing to develop or improve functionalities of this wrapper please let me know.

Tuesday, October 7, 2014

Continuous Integration for GitHub - Travis CI

Travis CI is a very impressive and cool CI tool that can directly fetch and automatically build your GitHub projects. Following few steps you can easily integrate your GitHub projects with Travis CI

1. Got to https://travis-ci.org/ and log in using your GitHub account

2. click + button and add your project to Travis CI




3. Add .travis.yml file to the root folder of the project and push it to GitHub
This is the file that contains configuration details to Travis CI about your project details like language and build instructions
If your project is a java maven project, you can simply add

language: java

install: mvn install -Dmaven.compiler.target=1.6 -Dmaven.compiler.source=1.6 -DskipTests=true

script: mvn test -Dmaven.compiler.target=1.6 -Dmaven.compiler.source=1.6

For more configuration details refer to the documentation of Travis CI

4. Do some change to your project and push it to GitHub. Commit will be reflected in Travis Console same time and it will start to build project automatically and send build details to your mail.