We (Dimuthu Upeksha, Chamila Wijayarathna, Maduranga Siriwardan, Lahiru Lasadun) started SinMin, the Sinhala Corpus Project, as our final-year undergraduate project under the supervision of Dr. Chinthana Wimalasuriya, Mr. N. H. N. D. de Silva and Prof. Gihan Dias.
A rich language corpus enables a wide range of research on a language, including:
1. Statistical analysis of language usage patterns
2. Translators
3. Spelling and grammar tools
4. Backend support for third-party applications such as OCR tools
A corpus is usually a collection of authentic texts in a language. Rather than storing these only as raw text files, SinMin also stores them in several databases with different schemas, which lets it process language data in real time. SinMin also exposes a REST API that lets third-party applications query and retrieve data.
The SinMin web interface lets users visualize and find patterns in Sinhala usage across different time periods and categories.
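To make the idea of usage patterns over time periods and categories concrete, here is a minimal Python sketch of a word-frequency count grouped by year and category. It is only an illustration under stated assumptions: the sample records, the `(year, category, text)` layout, and the function names are hypothetical and do not reflect SinMin's actual schemas, data, or API. Tokens are split on whitespace, which is a simplification.

```python
from collections import Counter, defaultdict

# Hypothetical sample records (year, category, text); not actual SinMin data.
SAMPLE_DOCS = [
    (2013, "news", "අද දවස හොඳයි අද"),
    (2014, "news", "අද පාසල් නිවාඩු"),
    (2014, "blog", "මම අද ලියමි"),
]


def word_frequencies(docs):
    """Count whitespace-separated tokens per (year, category) group."""
    freq = defaultdict(Counter)
    for year, category, text in docs:
        freq[(year, category)].update(text.split())
    return freq


def frequency_over_time(freq, word, category):
    """Return {year: count} for one word within one category."""
    return {
        year: counter[word]
        for (year, cat), counter in sorted(freq.items())
        if cat == category and counter[word] > 0
    }


freq = word_frequencies(SAMPLE_DOCS)
trend = frequency_over_time(freq, "අද", "news")
print(trend)  # {2013: 2, 2014: 1}
```

Grouping the counts by `(year, category)` up front is what makes per-period and per-category lookups cheap afterwards; a real corpus would keep such aggregates in a database rather than in memory.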
Useful Links
- SinMin Web : http://sinhala-corpus.projects.uom.lk/sinmin-web
- Crawled raw data files in XML format : http://sinhala-corpus.projects.uom.lk/sinmin-web/data
- Documentation : http://sinhala-corpus.projects.uom.lk/docs
- API documentation : http://sinhala-corpus.projects.uom.lk/docs/display/ds/REST+API
- Source Code : https://github.com/sinmin/core
- Additional Resources
  - http://www.slideshare.net/cdwijayarathna/sinmin-literature-review-presentation
  - http://www.slideshare.net/cdwijayarathna/implementing-a-corpus-for-sinhala-language
  - http://www.slideshare.net/cdwijayarathna/sinmin-final-presentation
  - Chamila's blog post : http://cdwijayarathna.blogspot.com/2015/04/sinmin-corpus-for-sinhala-language.html
The SinMin Sinhala Corpus currently crawls data from the following sources:
- Sinhala Online Newspapers
  - Lankadeepa - http://lankadeepa.lk/
  - Divaina - http://www.divaina.com/
  - Dinamina - http://www.dinamina.lk/2014/06/26/
  - Lakbima - http://www.lakbima.lk/
  - Mawbima - http://www.mawbima.lk/
  - Rawaya - http://ravaya.lk/
  - Silumina - http://www.silumina.lk/
- Sinhala News Sites
  - Ada Derana - http://sinhala.adaderana.lk/
- Sinhala Religious and Educational Magazines
  - Aloka Udapadi - http://www.lakehouse.lk/alokoudapadi/
  - Budusarana - http://www.lakehouse.lk/budusarana/
  - Namaskara - http://namaskara.lk/
  - Sarasawiya - http://sarasaviya.lk/
  - Vidusara - http://www.vidusara.com/
  - Wijeya - http://www.wijeya.lk/
- Sri Lanka Gazette in Sinhala - http://documents.gov.lk/gazette/
- Online Mahawansaya - http://mahamegha.lk/mahawansa/
- Sinhala Movie Subtitles - http://www.baiscopelk.com/category/සිංහල-උපසිරැස/
- Sinhala Wikipedia - http://si.wikipedia.org/
- Sinhala Blogs