Time Series Contextual Anomaly Detection for Detecting Stock Market Manipulation
Abstract
Anomaly detection in time series is one of the fundamental problems in data mining. It arises in many domains, including intrusion detection in computer networks, anomaly detection in healthcare sensor data, and fraud detection in securities. Although there has been extensive work on anomaly detection, most techniques look for individual objects that differ from normal objects and do not take the temporal aspect of the data into consideration. We are particularly interested in contextual anomaly detection methods for time series that are applicable to fraud detection in securities, a problem with significant impact on national and international securities markets. In this thesis, we propose a prediction-based Contextual Anomaly Detection (CAD) method for complex time series that are not described through deterministic models. First, a subset of time series is selected based on the window size parameter. Second, a centroid is calculated representing the expected behaviour of the time series in the group. Then, the centroid values are used, along with the correlation of each time series with the centroid, to predict the values of the time series. The proposed method improves recall from 7% to 33% compared to kNN and random walk without compromising precision. We also propose a formalized method to improve the performance of CAD by using big data techniques to eliminate false positives. The method aims to capture the expected behaviour of stocks through sentiment analysis of tweets about those stocks. We present a case study and explore developing sentiment analysis models to improve anomaly detection in the stock market. The experimental results confirm that the proposed method is effective in improving CAD: it removes irrelevant anomalies by correctly identifying 28% of false positives.
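The three CAD steps described in the abstract (windowing, centroid, correlation-weighted prediction) can be illustrated with a minimal sketch. The abstract does not give the exact prediction rule, so everything specific here is an assumption made for illustration: the function names `pearson` and `cad_scores`, the use of all series as the group, the centroid as a plain mean, and the rule "previous value plus correlation times the centroid's change". It should not be read as the thesis's actual implementation.

```python
from math import sin, sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def cad_scores(series, window):
    """Score each point by how far it deviates from a centroid-based prediction.

    series: list of equal-length lists of floats (one list per time series).
    window: number of past steps used to estimate each correlation.
    """
    n_steps = len(series[0])
    # Assumed centroid: the mean of all series at each time step.
    centroid = [sum(s[t] for s in series) / len(series) for t in range(n_steps)]
    scores = [[0.0] * n_steps for _ in series]
    for t in range(window, n_steps):
        past_centroid = centroid[t - window:t]
        centroid_change = centroid[t] - centroid[t - 1]
        for i, s in enumerate(series):
            # Correlation of this series with the centroid over the window.
            corr = pearson(s[t - window:t], past_centroid)
            # Assumed prediction rule: follow the centroid's move, scaled by corr.
            predicted = s[t - 1] + corr * centroid_change
            scores[i][t] = abs(s[t] - predicted)
    return scores

# Synthetic example: five series sharing one trend, with a spike injected
# into series 2 at time step 30.
base = [sin(t / 5.0) for t in range(60)]
series = [[(1 + 0.1 * k) * b + k for b in base] for k in range(5)]
series[2][30] += 5.0
scores = cad_scores(series, window=10)
```

In this toy setup the largest scores belong to the series with the injected spike, around the time of the spike. A real application would still need a thresholding rule to turn scores into anomaly labels, which this sketch leaves out.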
