Analyzing And Extracting Lists On The Web
Date
Author
Institution
Degree Level
Degree
Department
Supervisor / Co-Supervisor and Their Department(s)
Examining Committee Member(s) and Their Department(s)
Citation for Previous Publication
Link to Related Item
Abstract
The amount of information available on the Web is rapidly growing, and the need for extracting more useful and relevant data from this tremendously large source has become an interesting research challenge. Among various types of useful information that can be extracted, lists in particular are highly valuable as they provide groupings of related items. Such groupings are often interpretable and may present data in a more structured and condensed format that can be fed to other applications. In this thesis we explore some of the properties of lists embedded in web pages. Based on these properties, we propose a technique for classifying web pages into two categories: those containing lists, and the rest. Our results show that unlike some previous work, not all list-specific html tags are useful for identifying list-containing web pages. We also study the related problem of locating lists in a page. We cast the problem of detecting the boundaries of a list as a classification task and build a classifier using relevant page features. As the classifier produces a sequence of labels for each page, we examine some of the properties of this sequence and show how the accuracy of the detection can be further improved by rejecting some of the sequences that are less likely to indicate a list.
