Syntax and Sensibility: Using language models to detect and correct syntax errors

dc.contributor.author: Santos, E.A.
dc.contributor.author: Campbell, J.C.
dc.contributor.author: Patel, D.
dc.contributor.author: Hindle, Abram
dc.contributor.author: Amaral, J.N.
dc.date.accessioned: 2025-05-01T02:06:33Z
dc.date.available: 2025-05-01T02:06:33Z
dc.date.issued: 2018
dc.description: Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically reporting syntax errors and their precise locations poorly. We propose a methodology that locates where syntax errors occur and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine what tokens are more likely at the estimated error location. We compare n-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not require that the problem source code come from the same domain as the training data. We evaluated against a repository of real student mistakes. Our tools are able to find a syntactically valid fix within their top-2 suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.
dc.identifier.doi: https://doi.org/10.7939/r3-6n5v-c611
dc.language.iso: en
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: language models
dc.subject: estimated error location
dc.subject: incomprehensible syntax errors
dc.subject: n-gram models
dc.subject: LSTM model
dc.subject: Java code
dc.subject: long short-term memory
dc.subject: GitHub
dc.title: Syntax and Sensibility: Using language models to detect and correct syntax errors
dc.type: http://purl.org/coar/resource_type/R60J-J5BD
ual.jupiterAccess: http://terms.library.ualberta.ca/public
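
Illustrative sketch (not part of the deposited record): the description above says that errors are located by finding tokens a language model considers out of place, and that fixes come from asking the model which tokens are more likely at that location. The Python below is a minimal toy version of that idea under loud assumptions: it uses a bigram model with add-one smoothing instead of the n-gram and LSTM models the authors compare, it trains on three hand-written token sequences rather than a GitHub corpus, and it only proposes single-token insertions. All function names (train_bigram, prob, locate_error, suggest_insertions) are hypothetical and are not taken from the authors' tool.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count token bigrams over token sequences assumed to be syntactically valid."""
    counts = defaultdict(Counter)
    for tokens in corpus:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts


def prob(counts, prev, cur, vocab_size):
    """Add-one smoothed estimate of P(cur | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][cur] + 1) / (total + vocab_size)


def locate_error(counts, tokens, vocab_size):
    """Return the index of the token the model finds least likely after its predecessor.

    An index equal to len(tokens) means the end of the input itself was unexpected.
    """
    padded = ["<s>"] + tokens + ["</s>"]
    scores = [prob(counts, p, c, vocab_size) for p, c in zip(padded, padded[1:])]
    return min(range(len(scores)), key=scores.__getitem__)


def suggest_insertions(counts, tokens, index, k=2):
    """Suggest up to k tokens to insert before position `index`, most likely first."""
    prev = (["<s>"] + tokens)[index]
    return [tok for tok, _ in counts[prev].most_common(k)]


# Toy training data (an assumption for illustration, not the authors' corpus).
corpus = [
    ["int", "x", "=", "1", ";"],
    ["int", "y", "=", "2", ";"],
    ["x", "=", "y", ";"],
]
vocab_size = len({tok for seq in corpus for tok in seq} | {"</s>"})
counts = train_bigram(corpus)

broken = ["int", "x", "1", ";"]  # the "=" is missing
where = locate_error(counts, broken, vocab_size)
print(where)                                      # 2, the position of "1"
print(suggest_insertions(counts, broken, where))  # ['=']
```

In this toy case the model flags "1" because it never saw a literal directly after an identifier, and the only token it saw after "x" is "=", so inserting "=" is the top suggestion. A fuller version would also try deletions and substitutions and would check each candidate with a real parser, which is what makes a suggestion "syntactically valid" in the sense of the description.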

Files

Original bundle

Name: santos2018SANER-syntax.pdf
Size: 384.46 KB
Format: Adobe Portable Document Format