Exploring Methods for Generating and Evaluating Skill Targeted Reading Comprehension Questions

Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Supervisor / Co-Supervisor and Their Department(s)

Citation for Previous Publication

Link to Related Item

Abstract

It takes skilled teachers a significant amount of time and effort to create high-quality reading comprehension questions, often making it impractical to target a particular reader's weaknesses. Recently, language models have been proposed as a tool to fill this gap, allowing teachers to generate questions targeting specific skill types. In this thesis, we propose SoftSkillQG, a new soft-prompt-based language model for generating skill-targeted reading comprehension questions that requires no manual effort to target new skills. We compare SoftSkillQG against a variety of strong baselines and show that it outperforms existing techniques on four out of five question quality metrics on the SBRCS dataset and on human evaluation of Context Specificity on the QuAIL dataset. However, on the QuAIL dataset, T5 WTA, a previously proposed method using manually created "hard" prompts, outperforms SoftSkillQG in terms of perplexity and these same five metrics. We investigate why SoftSkillQG performs poorly relative to T5 WTA on the QuAIL dataset by examining the effect of both dataset size and prompt initialization on SoftSkillQG's performance. We show that dataset size may affect performance, but augmenting training with silver data from the SQuAD dataset did not improve results. On the other hand, initializing SoftSkillQG's prompt with the same prompt used by T5 WTA yielded nearly the same perplexity on the QuAIL dataset. Finally, we perform a first-of-its-kind analysis using the human annotations from our previous experiments to compare five different methods for evaluating sets of generated questions. We find that MS-Jaccard4 best captures the diversity of a set of questions; Best Reference Evaluation aligns most closely with human judgement of Answerability; Cartesian Product evaluation aligns most closely with Context Specificity; and Fréchet BERT Distance aligns most closely with Fluency.

Item Type

http://purl.org/coar/resource_type/c_46ec

Alternative

License

Other License Text / Link

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

en

Location

Time Period

Source