Exploring Methods for Generating and Evaluating Skill Targeted Reading Comprehension Questions
Abstract
It takes skilled teachers a significant amount of time and effort to create high-quality reading comprehension questions, often making it impractical to target a particular reader’s weaknesses. Recently, language models have been proposed as a tool to help teachers fill this gap by generating questions targeting specific skill types. In this thesis, we propose SoftSkillQG, a new soft-prompt-based language model for generating skill-targeted reading comprehension questions that requires no manual effort to target new skills. We compare SoftSkillQG against a variety of strong baselines and show that it outperforms existing techniques on four out of five question quality metrics on the SBRCS dataset and on human evaluation of Context Specificity on the QuAIL dataset. However, on the QuAIL dataset, T5 WTA, a previously proposed method using manually created “hard” prompts, outperforms SoftSkillQG in terms of perplexity and these same five metrics. We investigate why SoftSkillQG performs poorly relative to T5 WTA on the QuAIL dataset by examining the effects of both dataset size and prompt initialization on SoftSkillQG’s performance. We show that dataset size may be affecting performance, but augmenting training with silver data from the SQuAD dataset did not improve results. On the other hand, initializing the prompt of SoftSkillQG with the same prompt used by T5 WTA yielded nearly the same perplexity on the QuAIL dataset. Finally, we perform a first-of-its-kind analysis using the human annotations from our previous experiments to compare five different methods for evaluating sets of generated questions. We find that MS-Jaccard4 best captures the diversity of a set of questions; Best Reference evaluation aligns most closely with human judgement of Answerability; Cartesian Product evaluation aligns most closely with Context Specificity; and Fréchet BERT Distance aligns most closely with Fluency.
