Are Large Language Models Good Essay Graders?

Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Abstract

We evaluate how effectively Large Language Models (LLMs) assess essay quality, focusing on their alignment with human grading. Specifically, we investigate the applicability of LLMs such as GPT-3.5T and Llama-2 to Automated Essay Scoring (AES), a crucial natural language processing (NLP) application in education. Our study explores both zero-shot and few-shot learning approaches, employing various prompting techniques to enhance performance. Using the ASAP dataset, a well-known dataset for the AES task, we compare the numeric grades provided by the LLMs to the scores assigned by human raters. Our results show that both GPT-3.5T and Llama-2 generally assign lower scores than the human raters do. Furthermore, neither LLM correlates well with the human scores. In particular, GPT-3.5T tends to be harsher and more misaligned with human evaluations than Llama-2. On the other hand, both LLMs can reliably detect spelling and grammar mistakes, and both seem to take those mistakes into account when computing their scores. We also extended our analysis to the most recent release, Llama-3, which shows promising improvements in alignment with human scores. This suggests that newer generations of LLMs have the potential to be more effective at AES. Overall, our results offer a cautiously optimistic view of using LLMs as tools to assist in the grading of written essays, highlighting both their current limitations and their future potential.
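
The comparison described in the abstract, LLM-assigned grades against human-assigned grades, can be illustrated with a minimal sketch. It makes two assumptions that the abstract does not confirm: that alignment is measured with Pearson correlation, and that harshness is summarized as the mean signed difference between LLM and human scores. The function names and the toy scores are hypothetical and are not taken from the thesis or from the ASAP dataset.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: the abstract does not name its agreement measure;
    Pearson is used here only as one common choice.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def mean_bias(llm_scores, human_scores):
    """Mean signed difference (LLM minus human).

    A negative value means the LLM grades lower than the human raters.
    """
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("need two non-empty score lists of equal length")
    return mean(l - h for l, h in zip(llm_scores, human_scores))


if __name__ == "__main__":
    # Hypothetical toy scores for five essays; NOT data from the thesis.
    human = [8, 6, 9, 4, 7]
    llm = [6, 5, 6, 4, 5]
    print("mean bias:", mean_bias(llm, human))
    print("pearson r:", pearson(llm, human))
```

Reporting the two numbers separately keeps systematic harshness (a negative bias) distinct from poor agreement (a low correlation), the two findings the abstract states independently.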

Item Type

http://purl.org/coar/resource_type/c_46ec

Other License Text / Link

This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

en
