Clustering Survival Data using Random Forest and Persistent Homology

Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Mathematical and Statistical Sciences

Specialization

Biostatistics

Supervisor / Co-Supervisor and Their Department(s)

Examining Committee Member(s) and Their Department(s)

Citation for Previous Publication

Link to Related Item

Abstract

Survival data are most often analyzed with the Cox proportional hazards model to identify factors associated with patients' survival time. Recently, however, the random survival forest (RSF), a non-parametric ensemble method constructed by bagging classification trees for survival data, has been used as an alternative that gives better survival prediction and ranks the importance of the covariates associated with it. In addition to identifying variable importance for survival prediction, we explored clusters in survival data using the variables identified as important in the RSF analysis. Clustering of survival data (patients) to assess their survival experience was investigated using two methods: random forest clustering based on partitioning around medoids, and persistent homology (PH), a topological data analysis (TDA) technique, for cluster identification in lower dimension (dimension zero). With both methods, we identified groups of patients with different survival experience, accounting for the covariates most important in determining that experience. The resulting clusters were compared using the log-rank test and were found to differ significantly in survival experience. PH was also used to explore more detailed characteristic features of patients at higher dimension (dimension one). Both clustering methods offer a promising way to explore groups within patients, giving insight into patient handling and valuable information for providing quality service to patients who need more attention. All analyses in this thesis were carried out on two datasets: a kidney dataset and a liver dataset.
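As an illustration of the dimension-zero step mentioned in the abstract, the following is a minimal Python sketch, not the thesis's implementation. It relies on a standard fact: for a Vietoris-Rips filtration, every point starts as its own connected component at scale 0, and components die at the edge lengths of a minimum spanning tree, so dimension-zero persistence can be computed with a union-find pass over sorted pairwise distances. The function names `h0_deaths` and `count_clusters`, the Euclidean distance, and the toy points are all assumptions made for this example; the distance and data actually used in the thesis are not stated here.

```python
from itertools import combinations
from math import dist  # Euclidean distance (assumed metric, Python 3.8+)


def h0_deaths(points):
    """Return the finite death scales of dimension-zero classes.

    Hypothetical helper: processes pairwise distances in increasing
    order and records the scale at which two components merge.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj   # two components merge at scale d
            deaths.append(d)  # one dimension-zero class dies here
    return deaths  # n - 1 values; one class never dies


def count_clusters(points, scale):
    """Number of connected components still alive at the given scale."""
    return 1 + sum(d > scale for d in h0_deaths(points))


# Toy data (not from the thesis): two visibly separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(h0_deaths(toy_points))
print(count_clusters(toy_points, 5.0))
```

A long gap between consecutive death scales suggests a cluster structure that persists over a wide range of scales; in this toy data, three merges happen at scale 1 and the last merge happens only when the two groups join.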

Item Type

http://purl.org/coar/resource_type/c_46ec

Alternative

License

Other License Text / Link

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Subject/Keywords

Language

en

Location

Time Period

Source