Column Type Annotation Using Large Language Models
Abstract
A vast amount of information on the web is stored in tabular format, making accurate table interpretation crucial for data analysis and knowledge extraction. Column Type Annotation (CTA), the process of assigning semantic types to table columns, is essential for effective table querying and understanding.
This thesis investigates the CTA task in two parts. First, we critically evaluate established CTA benchmarks and identify major issues that affect the measured performance of models on them. Our findings show that once these issues are addressed, model performance can drop by up to 30% relative to previously reported results.
Second, we apply Large Language Models (LLMs) to the CTA task. By employing techniques such as Retrieval-Augmented Generation (RAG) and drawing on the models' reasoning capabilities, we demonstrate how LLMs can achieve state-of-the-art performance on CTA tasks. Our approach yields a 10% improvement over simple prompting methods, making LLMs competitive with, and in some cases superior to, current leading pre-trained models designed specifically for CTA.
