Identifying Cognate Sets Across Dictionaries of Related Languages

dc.contributor.advisor: Kondrak, Grzegorz (Computing Science)
dc.contributor.author: St Arnaud, Adam, J.J.
dc.contributor.other: Beck, David (Linguistics)
dc.contributor.other: Amaral, J. Nelson (Computing Science)
dc.date.accessioned: 2025-05-28T21:56:16Z
dc.date.available: 2025-05-28T21:56:16Z
dc.date.issued: 2017-11
dc.description.abstract: Cognates are words in related languages that have originated from the same word in an ancestor language, such as the English/German word pair father/Vater. Cognate information is critical in the field of historical linguistics, where it is used to determine the relationships between languages and to construct the ancestor languages they originated from. Most recent work in cognate identification focuses on the task of clustering cognates within lists of words each having an identical definition. In that task, only orthographic or phonetic information about a word is utilized when making cognate judgments. We present a system for the more challenging task of identifying cognate sets across dictionaries of related languages. The likelihood of a cognate relationship is calculated on the basis of a rich set of features that capture both phonetic and semantic similarity, as well as the presence of regular sound correspondences. The pairwise similarity scores are combined with an average-score clustering algorithm to create sets of words from different languages that may originate from a common proto-word. When tested on the Algonquian language family, our system detects 63% of cognate sets while maintaining cluster purity of 70%.
dc.identifier.doi: https://doi.org/10.7939/R3NV99Q98
dc.language.iso: en
dc.rights: This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
dc.subject: Machine learning
dc.subject: Natural language processing
dc.subject: Computational diachronic linguistics
dc.subject: Cognates
dc.subject: Computational linguistics
dc.title: Identifying Cognate Sets Across Dictionaries of Related Languages
dc.type: http://purl.org/coar/resource_type/c_46ec
thesis.degree.grantor: http://id.loc.gov/authorities/names/n79058482
thesis.degree.level: Master's
thesis.degree.name: Master of Science
ual.date.graduation: Fall 2017
ual.department: Department of Computing Science
ual.jupiterAccess: http://terms.library.ualberta.ca/public
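
Illustrative sketch (not taken from the thesis): the abstract mentions combining pairwise similarity scores with an average-score clustering algorithm and reporting cluster purity. The Python below shows one plausible reading of those two ideas, assuming a greedy agglomerative procedure that keeps merging the two clusters with the highest mean pairwise score until no pair reaches a threshold. The function names, the threshold, the word labels, and the toy scores are all hypothetical; the thesis's actual features, scoring model, and clustering details may differ.

```python
from collections import Counter
from itertools import combinations


def average_score(a, b, score):
    """Mean pairwise score between every item of cluster a and every item of cluster b."""
    return sum(score(x, y) for x in a for y in b) / (len(a) * len(b))


def cluster(words, score, threshold):
    """Greedy average-score clustering (assumed procedure, not the thesis's code)."""
    clusters = [[w] for w in words]  # start with one singleton cluster per word
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average pairwise score.
        (i, j), best = max(
            (((i, j), average_score(clusters[i], clusters[j], score))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def purity(clusters, gold):
    """Fraction of items that share the majority gold label of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(gold[w] for w in c).most_common(1)[0][1] for c in clusters)
    return majority / total


# Toy, made-up pairwise scores for placeholder word IDs (not real data).
toy_scores = {
    frozenset(("a1", "a2")): 0.9,
    frozenset(("a1", "a3")): 0.8,
    frozenset(("a2", "a3")): 0.7,
    frozenset(("b1", "b2")): 0.85,
}


def toy_score(x, y):
    # Unlisted pairs get a low default score.
    return toy_scores.get(frozenset((x, y)), 0.1)


found = cluster(["a1", "a2", "a3", "b1", "b2"], toy_score, threshold=0.5)
gold = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
print(found, purity(found, gold))
```

In this sketch the threshold controls the trade-off the abstract reports: a lower threshold merges more aggressively, which can recover more cognate sets at the cost of purity.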

Files

Original bundle

Name: StArnaud_Adam_JJ_201704_MSc.pdf
Size: 543.2 KB
Format: Adobe Portable Document Format