Are Large Language Models Good Essay Graders?
Abstract
We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. Specifically, we investigate the applicability of LLMs such as GPT-3.5T and Llama-2 to Automated Essay Scoring (AES), a crucial natural language processing (NLP) application in education. Our study explores both zero-shot and few-shot learning approaches, employing various prompting techniques to enhance performance. Using the ASAP dataset, a well-known benchmark for the AES task, we compare the numeric grades produced by the LLMs to the scores assigned by human raters. Our research reveals that both GPT-3.5T and Llama-2 generally assign lower scores than the human raters, and that neither LLM correlates well with the human scores. In particular, GPT-3.5T tends to be harsher and more misaligned with human evaluations than Llama-2. On the other hand, both LLMs not only reliably detect spelling and grammar mistakes but also appear to take those mistakes into account when computing their scores. Additionally, we extended our analysis to the most recent release, Llama-3, which shows promising improvements in alignment with human scores, suggesting that newer generations of LLMs have the potential to be more effective at AES. Overall, our results offer a cautiously optimistic view of LLMs as tools to assist in the grading of written essays, highlighting both their current limitations and their future potential.
