Wednesday, October 8, 2014

Java Wrapper for Tesseract OCR Library

Tesseract is a very popular OCR library written in C++. It can be used to recognize the characters in an image that contains text, and it can also report the position of each word or character. Tesseract provides a command-line tool and a C++ API. However, there is no implementation that lets Java users call Tesseract directly from their applications.

As a part of my GSoC project in Apache PDFBox, I implemented a Java wrapper for the Tesseract C++ API that lets Java users call Tesseract directly from their applications. The code repository can be found here.

To use the Java API, simply add Tesseract-JNI-Wrapper-1.0.0.jar to your project. If you are using Maven, add this to your pom.xml:

<dependency>
  <groupId>org.apache.pdfbox.ocr</groupId>
  <artifactId>Tesseract-JNI-Wrapper</artifactId>
  <version>1.0.0</version>
</dependency>


Here is sample code that uses the Java API to invoke Tesseract.

// Pass a BufferedImage (RGB) of the scanned image
public String getOCRText(BufferedImage image){
  TessBaseAPI api = new TessBaseAPI();
  boolean init = api.init("src/main/resources/data", "eng"); // location of the training data files, then the language
  api.setBufferedImage(image);
  String text = api.getUTF8Text();
  System.out.println(text);
  api.end();
  return text;
}
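
The method above expects an RGB BufferedImage. As a minimal sketch using only the standard Java image classes, a scanned file could be loaded and converted like this. The class name RgbImageLoader is hypothetical, and treating "RGB" as BufferedImage.TYPE_INT_RGB is my assumption, not something the wrapper documents.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the wrapper.
final class RgbImageLoader {

    /** Returns the image itself if it is already TYPE_INT_RGB, otherwise an RGB copy. */
    static BufferedImage toRgb(BufferedImage source) {
        if (source.getType() == BufferedImage.TYPE_INT_RGB) {
            return source;
        }
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            // Paint a white background first so transparent pixels do not turn black.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    /** Reads an image file and converts it to RGB. */
    static BufferedImage load(File file) throws IOException {
        BufferedImage image = ImageIO.read(file); // returns null for unsupported formats
        if (image == null) {
            throw new IOException("Unsupported or unreadable image: " + file);
        }
        return toRgb(image);
    }
}
```

The result of RgbImageLoader.load(new File("scan.png")) can then be passed to getOCRText (the file name is only an example).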


Getting the position of each OCRed word

public void printOCRTextPositions(BufferedImage image){
  TessBaseAPI api = new TessBaseAPI();
  boolean init = api.init("src/main/resources/data", "eng");
  api.setBufferedImage(image);
  api.getResultIterator();
  if (api.isResultIteratorAvailable()) {
    do {
      System.out.println(api.getWord().trim());
      String result = api.getBoundingBox();
      System.out.println(result);
    } while (api.resultIteratorNext());
  }
  api.end();
}


P.S.
This wrapper currently works on Mac OS and Linux. It has not been tested on Windows. If anyone is willing to develop or improve the functionality of this wrapper, please let me know.

Tuesday, October 7, 2014

Continuous Integration for GitHub - Travis CI

Travis CI is a very impressive and cool CI tool that can fetch and build your GitHub projects automatically. By following a few steps you can easily integrate your GitHub projects with Travis CI:

1. Go to https://travis-ci.org/ and log in using your GitHub account

2. Click the + button and add your project to Travis CI




3. Add a .travis.yml file to the root folder of the project and push it to GitHub
This file tells Travis CI about your project, such as its language and build instructions.
If your project is a Java Maven project, you can simply add

language: java

install: mvn install -Dmaven.compiler.target=1.6 -Dmaven.compiler.source=1.6 -DskipTests=true

script: mvn test -Dmaven.compiler.target=1.6 -Dmaven.compiler.source=1.6

For more configuration details, refer to the Travis CI documentation.

4. Make a change to your project and push it to GitHub. The commit will show up in the Travis console right away, and Travis will start building the project automatically and send the build details to your email.


Thursday, January 30, 2014

LDAP connector for WSO2 ESB

While working on our internship project, "Infra portal", we had to connect to the WSO2 user store LDAP (Lightweight Directory Access Protocol) directory for several purposes: authenticating users, getting the list of users in a particular group, adding new entries (users, in our case), editing entries, and deleting them. These operations can be done using the javax.naming.* packages in Java. However, because most of our development is done in Jaggery, we had to write a Java client for each of them and integrate it with the Jaggery server. Then we could invoke its methods inside Jaggery. As time went on, it became really painful to write a method in a single Java class for every operation the application required. The class grew too large, and most of its methods were repetitions of earlier ones with small changes, which is not a good design.

So we searched for a better solution and finally decided to write an LDAP connector for WSO2 ESB. The main reasons for writing an LDAP connector were:

  1. Connecting through an ESB connector makes most integrations with external LDAP directories easy
  2. It gives a loosely coupled interface between LDAP and the client
  3. Others who want to use LDAP in their products can re-use it easily
  4. It is language independent (data exchange is done using REST or SOAP)


Our connector implements four basic functions and one special function.

Basic functions

  • Add new entry
  • Delete an entry
  • Update an entry
  • Search for an entry


Special function

  • Authenticate user


Before using any operation, you must provide the admin authentication details to the ESB. For that there is an init operation.

<ldap.init xmlns="http://ws.apache.org/ns/synapse">
    <providerUrl>ldap://192.168.1.164:389/</providerUrl>
    <securityPrincipal>cn=admin,dc=wso2,dc=com</securityPrincipal>
    <securityCredentials>comadmin</securityCredentials>
</ldap.init>

This signs in as the admin of the LDAP directory, who can perform any operation on it.
It is better to put this in a local entry and refer to it in other operations with configKey.

Add new entry

<ldap.addEntry configKey="LdapConfig">
    <objectClass>inetOrgPerson</objectClass>
    <dn>uid=dimuthuu2,ou=staff,dc=wso2,dc=com</dn>
    <attributes>cn=Dimuthu2Upeksha,mail=dimuthuu2wso2.com,userPassword=123,sn=Dimuthu2</attributes>
</ldap.addEntry>

Adding a new entry takes 3 parameters.
1. objectClass - This is a mandatory parameter. It defines the objectClass of the new entry
2. dn - Distinguished name of the new entry
3. attributes - Other attributes you need to add to the entry
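
To make the attributes parameter more concrete, here is a rough sketch of how a comma-separated key=value string could be turned into javax.naming attributes. This is only an illustration: the class LdapAttributeParser is hypothetical, the connector's real parsing may differ, and values that themselves contain commas are not handled.

```java
import javax.naming.directory.Attributes;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.BasicAttributes;

// Hypothetical helper, not the connector's actual code.
final class LdapAttributeParser {

    /** Builds the attribute set for a new entry from "key=value,key=value" text. */
    static Attributes parse(String objectClass, String attributes) {
        Attributes result = new BasicAttributes(true); // true = ignore case of attribute IDs
        result.put(new BasicAttribute("objectClass", objectClass));
        for (String pair : attributes.split(",")) {
            if (pair.trim().isEmpty()) {
                continue; // tolerate a trailing comma or an empty string
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value but got: " + pair);
            }
            result.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return result;
    }
}
```

With plain JNDI, the resulting Attributes object could be passed together with the dn to DirContext.createSubcontext(dn, attributes) to create the entry.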


Delete an entry

<ldap.deleteEntry configKey="LdapConfig">
    <dn>uid=dimuthuu2,ou=staff,dc=wso2,dc=com</dn>
</ldap.deleteEntry>

Update an entry

<ldap.updateEntry configKey="LdapConfig">
    <dn>uid=dimuthuu2,ou=staff,dc=wso2,dc=com</dn>
    <attributes>cn=Dimuthu2Upeksha,mail=dimuthuu2wso2.com,userPassword=123,sn=Dimuthu2</attributes>
</ldap.updateEntry>

1. dn - Distinguished name of the entry whose attributes need to be updated
2. attributes - Key-value pairs of the attributes that need to be changed

Search for an entry

This searches for a particular entry or a set of entries matching the given keywords.

<ldap.searchEntry configKey="LdapConfig">
    <objectClass>inetOrgPerson</objectClass>
    <filters>uid=dimuthuu</filters>
    <dn>ou=staff,dc=wso2,dc=com</dn>
    <attributes>uid,mail</attributes>
</ldap.searchEntry>

1. objectClass - Type of the entries to search for
2. filters - Keywords to search for. In the case above: search for entries whose uid is "dimuthuu"
3. dn - Distinguished name of the scope in which the search should be applied
4. attributes - Attributes of the entry that should be included in the search result
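
For readers who know plain JNDI, the four parameters above roughly correspond to an LDAP filter string plus a SearchControls object. The sketch below shows one possible mapping. It assumes the objectClass and the filters are combined with AND and that filters is a comma-separated list of key=value pairs. The class LdapSearchSketch is hypothetical and not taken from the connector.

```java
import javax.naming.directory.SearchControls;

// Hypothetical helper, not the connector's actual code.
final class LdapSearchSketch {

    /** Escapes the characters that RFC 4515 reserves inside a filter value. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\5c"); break;
                case '*':  out.append("\\2a"); break;
                case '(':  out.append("\\28"); break;
                case ')':  out.append("\\29"); break;
                case '\0': out.append("\\00"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds an AND filter such as (&(objectClass=inetOrgPerson)(uid=dimuthuu)). */
    static String buildFilter(String objectClass, String filters) {
        StringBuilder filter = new StringBuilder("(&(objectClass=")
                .append(escape(objectClass)).append(')');
        for (String pair : filters.split(",")) {
            if (pair.trim().isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value but got: " + pair);
            }
            filter.append('(').append(pair.substring(0, eq).trim()).append('=')
                  .append(escape(pair.substring(eq + 1).trim())).append(')');
        }
        return filter.append(')').toString();
    }

    /** Limits the result to the requested attributes and searches the whole subtree. */
    static SearchControls controls(String attributes) {
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setReturningAttributes(attributes.trim().split("\\s*,\\s*"));
        return controls;
    }
}
```

With a connected DirContext, the search itself would then be ctx.search(dn, LdapSearchSketch.buildFilter(objectClass, filters), LdapSearchSketch.controls(attributes)).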


Authenticate

LDAP authentication is one of the major requirements in most LDAP-based applications. To simplify authentication, there is a special operation. For the given user and password, it tells whether authentication succeeded or not.

<ldap.authenticate configKey="LdapConfig">
    <dn>uid=dimuthuu,ou=staff,dc=wso2,dc=com</dn>
    <password>1234</password>
</ldap.authenticate>

1. dn - Distinguished name of the user
2. password - Password of the user
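
For comparison, this is roughly what such an authentication check looks like with the javax.naming.* packages directly: try a simple bind with the user's dn and password and see whether the server accepts it. This is a minimal sketch, not the connector's implementation. The class LdapBindCheck is hypothetical, and the factory class name is the one shipped with the standard JDK LDAP provider.

```java
import java.util.Hashtable;
import javax.naming.AuthenticationException;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper, not the connector's actual code.
final class LdapBindCheck {

    /** Builds the JNDI environment for a simple bind as the given dn. */
    static Hashtable<String, String> bindEnvironment(String providerUrl, String dn, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, providerUrl);
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, dn);
        env.put(Context.SECURITY_CREDENTIALS, password);
        return env;
    }

    /**
     * Returns true if the server accepts the bind and false if it rejects the credentials.
     * Other problems (server unreachable, malformed dn) are reported as NamingException.
     */
    static boolean authenticate(String providerUrl, String dn, String password) throws NamingException {
        if (password == null || password.isEmpty()) {
            // An empty password would be sent as an unauthenticated bind,
            // which many servers accept, so reject it up front.
            return false;
        }
        DirContext ctx = null;
        try {
            ctx = new InitialDirContext(bindEnvironment(providerUrl, dn, password));
            return true;
        } catch (AuthenticationException e) {
            return false;
        } finally {
            if (ctx != null) {
                ctx.close();
            }
        }
    }
}
```

The same three environment values (provider URL, principal and credentials) are what the init operation takes for the admin user.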

-----------------------------------------
Special thanks go to the WSO2 ESB team, including Dushan ayya and Isuru ayya, for giving us great help when we were in trouble.

If you think this should be improved or that I'm missing something here, please comment below. Thanks.