Annotating Web Tables Using Surface Text Patterns

Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Abstract

Although the World Wide Web has long been treated as an immense source of data, most of the information it provides is unstructured and sometimes ambiguous, which makes it unreliable. The Web also contains a relatively large amount of structured data in the form of tables, which are carefully constructed by humans. Unfortunately, each relational table on the Web carries its own "schema". The semantics of the columns and the relationships between them are often ill-defined, which makes machine interpretation of the schema difficult and sometimes impossible. We study the problem of annotating Web tables: given a table and a set of relevant documents, each describing or mentioning the element(s) of a row, the goal is to find surface text patterns that best describe the contexts for each column or combination of columns. The problem is challenging because of the number of potential patterns, the amount of noise in the texts, and the numerous ways in which rows can be mentioned. We develop a two-stage framework: in the first stage, candidate patterns are generated using sliding windows over the texts; in the second stage, patterns are generalized and redundant patterns are removed. Experiments are conducted to evaluate the quality of the annotations in comparison with human annotations.
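The two-stage idea in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function names, the `<C0>`/`<NUM>` placeholders, the fixed window size, the whitespace tokenization, and the support threshold are all assumptions chosen for illustration. Stage 1 replaces cell values with column placeholders and slides a fixed-size window over the text; stage 2 generalizes numeric tokens, collapses duplicate patterns, and keeps only patterns supported by several rows.

```python
from collections import Counter

def mark_cells(row, text):
    """Replace each cell value found in the text with a column placeholder such as <C0>."""
    for col, value in enumerate(row):
        text = text.replace(value, f"<C{col}>")
    return text.split()  # naive whitespace tokenization (assumption)

def candidate_patterns(row, text, window=4):
    """Stage 1 (sketch): emit every window of tokens that mentions at least one column."""
    tokens = mark_cells(row, text)
    patterns = []
    for start in range(max(len(tokens) - window + 1, 1)):
        chunk = tokens[start:start + window]
        if any(tok.startswith("<C") for tok in chunk):
            patterns.append(tuple(chunk))
    return patterns

def generalize(pattern):
    """Stage 2 (sketch): abstract tokens that vary between rows; here, only numbers."""
    return tuple("<NUM>" if tok.isdigit() else tok for tok in pattern)

def annotate(rows, docs, window=4, min_support=2):
    """Count generalized patterns per row; duplicates collapse, rare patterns are dropped."""
    support = Counter()
    for row, text in zip(rows, docs):
        support.update({generalize(p) for p in candidate_patterns(row, text, window)})
    return {p: n for p, n in support.items() if n >= min_support}

# Hypothetical toy input: one document per table row.
rows = [("Paris", "France"), ("Ottawa", "Canada")]
docs = [
    "Paris is the capital of France since 987",
    "Ottawa is the capital of Canada since 1857",
]
patterns = annotate(rows, docs)
for pattern, count in sorted(patterns.items()):
    print(" ".join(pattern), count)
```

In this toy input, a pattern such as "the capital of <C1>" survives because it appears for both rows, which hints at how a shared surface pattern could describe the context of a column. Real text would need proper tokenization, fuzzy matching of cell values, and a more principled notion of redundancy.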

Item Type

http://purl.org/coar/resource_type/c_46ec

Other License Text / Link

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

en
