Tuesday, April 14, 2015

SinMin - Sinhala Corpus Project

We (Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardan, Lahiru Lasadun) started the SinMin - Sinhala Corpus Project as our final-year undergraduate project, under the supervision of Dr. Chinthana Wimalasuriya, Mr. N. H. N. D. de Silva and Prof. Gihan Dias.

A rich language corpus opens up a wide range of research topics and applications for a language, including:

1. Statistical analysis of language usage patterns
2. Translators
3. Spelling and grammar tools
4. Backend support for third-party applications such as OCR tools

Usually, a corpus is a collection of authentic texts in the language. Rather than storing them only as raw text files, SinMin also stores them in several databases with different schemas, which lets SinMin process language data in real time. In addition, the SinMin corpus provides a REST API that lets third-party applications query and find data.




The SinMin web interface can illustrate and find patterns that occur in the Sinhala language over different time periods and across different categories.
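
To give a rough idea of what "patterns over time periods and categories" can mean, here is a minimal Python sketch that counts word frequencies per (year, category) pair and then reads off the trend for one word. This is only an illustration, not SinMin's actual implementation: the sample records, the whitespace tokenization, and the function names `count_by_period_and_category` and `trend` are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (year, category, text). Not real SinMin data.
documents = [
    (2013, "news", "රට ගම රට"),
    (2013, "blog", "පොත ගම"),
    (2014, "news", "රට පොත පොත"),
]


def count_by_period_and_category(docs):
    """Count word occurrences for each (year, category) pair."""
    counts = defaultdict(Counter)
    for year, category, text in docs:
        # Assumption: words are separated by whitespace.
        counts[(year, category)].update(text.split())
    return counts


def trend(counts, word, category):
    """Return {year: frequency} for one word within one category."""
    return {
        year: counts[(year, cat)][word]
        for (year, cat) in sorted(counts)
        if cat == category
    }


counts = count_by_period_and_category(documents)
print(trend(counts, "රට", "news"))  # {2013: 2, 2014: 1}
```

A real corpus would need proper Sinhala tokenization and a database behind it, but the shape of the query (word, category, time period) stays the same.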



Useful Links



The SinMin Sinhala Corpus currently crawls data from the following sources

3 comments:

  1. I am really interested in your project and have already read the research paper "Implementing a Corpus for Sinhala Language". But when I try to access the SinMin web interface (http://sinhala-corpus.projects.uom.lk/sinmin-web), it times out. Is the site still operational, or is there another way to access it?

  2. Thanks for your interest. Sorry, the site is down due to a lack of resources. However, if you let me know exactly what you need, I may be able to help. Do you need to access the data?

    Replies
    1. Yes, please. My final year research project is "Bootstrapping Sinhala Linguistic Data for NLP Applications", so access to that data would be very helpful. Can you send me a way to access your data? E-mail: kasunlakmal.klj@gmail.com
