A New Method For Semi-Supervised Density-Based Projected Clustering

Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Abstract

Density-based clustering methods extract high-density clusters that are separated by regions of lower density. HDBSCAN* is an existing algorithm for producing a density-based cluster hierarchy. To obtain clusters from this hierarchy, it includes an instance of FOSC (Framework for Optimal Selection of Clusters) to extract significant clusters, based on a measure known as cluster stability. We introduce CASAR (Compact And Separation Adjusted Ratio), a new algorithm for extracting significant clusters from an HDBSCAN* hierarchy. CASAR is similar to FOSC, but it defines local cluster quality differently and uses a different aggregation method for comparing the quality of descendant clusters to that of their ancestors in the hierarchy. The local cluster quality that CASAR uses is based on the validation index DBCV (Density-Based Cluster Validation). CASAR is designed to extract individual density-based clusters from subspaces, and is not meant to be a general-purpose replacement for cluster stability. We also introduce a new semi-supervised density-based method for finding relevant subspaces. Given a set of should-link objects that belong to an undiscovered cluster, our method finds an appropriate set of attributes for extracting the cluster. Our method makes use of well-established qualities of density-based clusters, and as such, it can be used as a pre-processing step for a wide variety of density-based clustering algorithms. We combine this method with HDBSCAN* and CASAR to produce a semi-supervised density-based projected clustering algorithm. In a series of experiments, we compare CASAR and cluster stability on both synthetic and real data sets. We also compare our semi-supervised density-based projected clustering algorithm to an existing semi-supervised projected clustering algorithm and to a well-known unsupervised projected clustering algorithm.
We conclude this thesis with a summary of the strengths and weaknesses of our method, a summary of experimental findings, and a discussion about possible directions for future work.
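
The FOSC-style selection that the abstract refers to can be illustrated with a minimal sketch. It assumes a cluster tree in which every node already carries a local quality score (cluster stability, in the HDBSCAN* setting), and it keeps a cluster whenever its own score is at least the aggregated score of the best selection among its descendants. The `Node` class, the `select_clusters` function, and the `aggregate` parameter are hypothetical names introduced here for illustration. The abstract does not specify CASAR's DBCV-based quality or its aggregation rule, so the `max` variant below only shows how swapping the aggregation can change the outcome; it is not CASAR.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Node:
    """One cluster in a hierarchy (hypothetical structure for illustration)."""
    name: str
    quality: float  # local quality score, e.g. cluster stability
    children: List["Node"] = field(default_factory=list)


def _visit(node: Node, aggregate: Callable[[Iterable[float]], float]) -> Tuple[float, List[Node]]:
    """Return (best score, selected clusters) for the subtree rooted at node."""
    if not node.children:
        return node.quality, [node]
    results = [_visit(child, aggregate) for child in node.children]
    descendant_score = aggregate([score for score, _ in results])
    # Ties favor the ancestor, so a cluster is split only when splitting is strictly better.
    if node.quality >= descendant_score:
        return node.quality, [node]
    return descendant_score, [n for _, nodes in results for n in nodes]


def select_clusters(root: Node, aggregate: Callable[[Iterable[float]], float] = sum) -> List[Node]:
    """Select non-overlapping clusters below the root; the root itself is never selected."""
    selected: List[Node] = []
    for child in root.children:
        _, nodes = _visit(child, aggregate)
        selected.extend(nodes)
    return selected


# Toy hierarchy with made-up quality scores.
tree = Node("root", 0.0, [
    Node("A", 5.0, [Node("A1", 2.0), Node("A2", 2.0)]),
    Node("B", 3.2, [Node("B1", 3.0), Node("B2", 0.5)]),
])

sum_selection = [n.name for n in select_clusters(tree)]
max_selection = [n.name for n in select_clusters(tree, aggregate=max)]
```

With summation, cluster B is replaced by its children because their combined score (3.5) exceeds B's own score (3.2); with the illustrative `max` aggregation, B is kept. In both cases no selected cluster is an ancestor of another, which is the constraint that makes the result a flat partition of the selected objects.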

Item Type

http://purl.org/coar/resource_type/c_46ec

Other License Text / Link

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

en
