Extending Tables using a Web Table Corpus
Abstract
The web contains a large volume of tables that provide structured information about entities and their relationships. This data can serve as a source for exploratory search and for gathering information about entities of interest. This thesis focuses on one particular exploratory search task: given a query table and a corpus of web tables, find a ranked list of additional columns from the corpus that describe the entities of the query table. We refer to this task as “table extension.”
Table extension poses several challenges. A main challenge is that, in the absence of schema information for web tables, it is often unclear which tables and columns are relevant to the query. In addition, multiple related columns may represent the same concept, which can lead to duplicate columns in the extended table. In this thesis, we propose a five-step framework to address these challenges. Our framework establishes functional dependencies between columns and uses them to identify more appropriate extensions. Duplicate columns are detected and consolidated through clustering. We evaluate our framework on a publicly available gold standard containing 233 web tables, using DBpedia as ground truth. Our evaluation shows that the number of unique relevant columns added by our solution is, on average, three times that of two state-of-the-art baselines. Furthermore, the precision of our method is higher than that of both baselines, meaning that fewer irrelevant columns are retrieved.
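To make the idea of a functional dependency between columns concrete, the following is a minimal Python sketch. It is an illustration only, not the framework described in the thesis. The function names, the row-as-dictionary table representation, and the sample data are all assumptions made for this example. A column is treated as a candidate extension when the entity column determines it, that is, when each entity maps to exactly one value in that column.

```python
def functionally_determines(rows, lhs, rhs):
    """Return True if every value of column `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        if key in seen and seen[key] != value:
            return False  # same key, two different values: the dependency is violated
        seen[key] = value
    return True


def candidate_extensions(query_entities, table, entity_col):
    """List columns of `table` that `entity_col` determines, provided the table
    mentions at least one of the query entities (hypothetical helper)."""
    if not any(row[entity_col] in query_entities for row in table):
        return []  # no overlap with the query table, so nothing to extend with
    columns = [c for c in table[0] if c != entity_col]
    return [c for c in columns if functionally_determines(table, entity_col, c)]


# Hypothetical web table: "capital" depends on "country", "visitor" does not.
web_table = [
    {"country": "Canada", "capital": "Ottawa", "visitor": "Ann"},
    {"country": "Canada", "capital": "Ottawa", "visitor": "Bob"},
    {"country": "France", "capital": "Paris", "visitor": "Cy"},
]

extensions = candidate_extensions({"Canada", "France"}, web_table, "country")
```

In this toy example only the "capital" column qualifies, because "Canada" appears with two different values in the "visitor" column. A real system would also need to tolerate noisy values and match entity names approximately, which this sketch does not attempt.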
