Improving Bengali and Hindi Large Language Models

dc.contributor.advisor: Barbosa, Denilson (Computing Science)
dc.contributor.author: Shahriar, Arif
dc.date.accessioned: 2025-05-28T19:11:40Z
dc.date.available: 2025-05-28T19:11:40Z
dc.date.issued: 2024-06
dc.description.abstract: Bengali and Hindi are two widely spoken yet low-resource languages. The state of the art in modeling such languages uses BERT with the Wordpiece tokenizer. We observed that the Wordpiece tokenizer often breaks words into meaningless tokens, failing to separate roots from affixes. Moreover, Wordpiece does not take fine-grained character-level information into account. We hypothesize that modeling fine-grained character-level information, or the interactions between roots and affixes, helps with highly inflected and morphologically complex languages such as Bengali and Hindi. We therefore used BERT with two alternative tokenizers - a Bengali and Hindi Unigram tokenizer and a character-level tokenizer - and observed better performance. We then pre-trained two language models accordingly and evaluated them on masked token detection, in both correct and erroneous settings, across many NLU tasks. We provide experimental evidence that Unigram and character-level tokenizers lead to better pre-trained models for Bengali and Hindi, outperforming the previous state of the art and BERT with a Wordpiece vocabulary. Ours is the first study investigating the efficacy of different tokenization methods for modeling Bengali and Hindi.
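The root/affix problem the abstract describes can be illustrated with a minimal, self-contained sketch (plain Python, no tokenizer library). The subword split shown in the comments is hypothetical - real Unigram or Wordpiece segmentations are learned from a corpus - and the example word is chosen for illustration only:

```python
def char_tokenize(word):
    """Character-level tokenization: every Unicode character is its own token."""
    return list(word)

# Hindi word "लड़कियाँ" ("girls"): morphologically, a root plus a feminine
# plural affix. A morphology-aware subword tokenizer would ideally split
# at the root/affix boundary (hypothetical split: ["लड़क", "ियाँ"]),
# whereas a poorly fitted Wordpiece vocabulary may emit fragments that
# cross that boundary. Character-level tokenization sidesteps the issue
# by exposing every character to the model.
word = "लड़कियाँ"
tokens = char_tokenize(word)

# The tokens always reassemble losslessly into the original word.
assert "".join(tokens) == word
print(tokens)
```

Character-level tokenization trades longer sequences for guaranteed coverage of every surface form, which is one way to model the fine-grained information the abstract argues Wordpiece misses.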
dc.identifier.doi: https://doi.org/10.7939/r3-thma-q304
dc.language.iso: en
dc.rights: This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
dc.subject: Language model
dc.subject: NLU
dc.subject: Low-resource languages
dc.subject: Self-supervised Pretraining
dc.subject: Tokenizer
dc.subject: BERT
dc.title: Improving Bengali and Hindi Large Language Models
dc.type: http://purl.org/coar/resource_type/c_46ec
thesis.degree.grantor: http://id.loc.gov/authorities/names/n79058482
thesis.degree.level: Master's
thesis.degree.name: Master of Science
ual.date.graduation: Spring 2024
ual.department: Department of Computing Science
ual.jupiterAccess: http://terms.library.ualberta.ca/public

Files

Original bundle

Name: Shahriar_Arif_202401_MSc.pdf
Size: 654.02 KB
Format: Adobe Portable Document Format