Showing posts with label Langage. Show all posts

Tuesday, April 14, 2015

SinMin - Sinhala Corpus Project

We (Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardan, Lahiru Lasadun) started SinMin - Sinhala Corpus Project as our final year undergraduate project, under the supervision of Dr. Chinthana Wimalasuriya, Mr. N. H. N. D. de Silva and Prof. Gihan Dias.

A rich language corpus opens up a wide range of research and applications for a language, including:

1. Statistical analysis of language usage patterns
2. Translators
3. Spelling and grammar tools
4. Backend support for third-party applications such as OCR tools

A corpus usually contains a collection of authentic texts in the language. Rather than keeping them only as raw text files, SinMin also stores them in several databases with different schemas, which lets it process language data in real time. In addition, SinMin provides a REST API that lets third-party applications query and retrieve data.
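To give a feel for how a third-party application might talk to such a REST API, here is a minimal Python sketch that only builds a query URL. The base URL, the `wordFrequency` path and the parameter names `value`, `time` and `category` are all hypothetical placeholders chosen for illustration; they are not taken from the actual SinMin API.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real SinMin deployment may use a different one.
BASE_URL = "http://localhost:8080/api"


def build_word_frequency_url(word, year=None, category=None):
    """Build a GET URL asking for the frequency of `word`.

    The endpoint name and parameter names are assumptions made for this
    sketch, not the documented SinMin API.
    """
    params = {"value": word}
    if year is not None:
        params["time"] = year
    if category is not None:
        params["category"] = category
    # urlencode percent-encodes the Sinhala characters as UTF-8.
    return f"{BASE_URL}/wordFrequency?{urlencode(params)}"


url = build_word_frequency_url("මම", year=2014, category="NEWS")
print(url)
```

A client would then send a GET request to this URL and parse the response, whose format depends on the real API.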




The SinMin web interface lets you find and visualize patterns in Sinhala usage across different time periods and categories.
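As a rough illustration of the kind of pattern this refers to, the sketch below counts how often a word appears per year, optionally restricted to one category. The sample sentences, the category names and the whitespace tokenization are assumptions made for the example; this is not how SinMin itself computes its statistics.

```python
from collections import Counter, defaultdict

# Tiny made-up sample of (year, category, text); not real SinMin data.
documents = [
    (2013, "NEWS", "මම ගෙදර යමි"),
    (2014, "NEWS", "මම පාසල් යමි මම"),
    (2014, "CREATIVE", "ඔහු ගෙදර යයි"),
]


def frequency_by_year(docs, word, category=None):
    """Return {year: occurrences of `word`}, optionally for one category."""
    counts = defaultdict(int)
    for year, cat, text in docs:
        if category is not None and cat != category:
            continue
        # Naive whitespace tokenization, assumed sufficient for this sketch.
        counts[year] += Counter(text.split())[word]
    return dict(counts)


trend = frequency_by_year(documents, "මම")
news_only = frequency_by_year(documents, "ගෙදර", category="NEWS")
print(trend, news_only)
```

Plotting such per-year counts for each category gives the kind of usage trend described above.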



Useful Links



SinMin Sinhala Corpus currently crawls data from the following sources