IndoLib: A Natural Language Processing Toolkit for Low-Resource South Asian Languages

Timalsina, Nitya

Citation

Timalsina, Nitya. 2022. IndoLib: A Natural Language Processing Toolkit for Low-Resource South Asian Languages. Master's thesis, Harvard University Division of Continuing Education.

Abstract

Of the world's 7,151 living languages, 665 (9.299%) are spoken by nearly 2 billion people across Southern Asia. Of these 665 languages, 251 (37.74%) are endangered, and the vast majority remain underrepresented in language systems. This thesis presents IndoLib, a new toolkit designed to support natural language processing (NLP) research in South Asian languages, here covering the Indo-Aryan, Dravidian, and Sino-Tibetan language families. IndoLib includes four primary components: (i) monolingual and multilingual datasets to expand language modeling and language detection for thirty-one Indic languages, (ii) fine-tuned multilingual models for named entity recognition (NER) and summarization, (iii) a bilingual dataset of Sanskrit-English and English-Sanskrit parallel sentences, and (iv) a fine-tuned machine translation model for translation in both directions between Sanskrit and English. On evaluation, the fine-tuned multilingual NER model and the bilingual translation model outperform current benchmark models. This thesis is intended to aid researchers interested in applying transfer learning to develop or optimize transformer-based models for South Asian languages.
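
The two percentages quoted in the abstract follow directly from the counts it gives. As a quick check, here is a minimal Python sketch that uses only those stated figures. The variable names are illustrative and do not come from the thesis.

```python
# Counts as stated in the abstract; variable names are illustrative only.
living_languages = 7151
southern_asia_languages = 665
endangered_languages = 251

# Share of the world's living languages spoken across Southern Asia.
share_of_world = 100 * southern_asia_languages / living_languages

# Share of those Southern Asian languages that are endangered.
share_endangered = 100 * endangered_languages / southern_asia_languages

print(f"{share_of_world:.3f}%")
print(f"{share_endangered:.2f}%")
```

Note that the second percentage is taken relative to the 665 Southern Asian languages, not to the 7,151 worldwide total.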

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37373336

Collections

DCE Theses and Dissertations
