Tackling the Big Data 4 Vs for Anomaly Detection

José Camacho; Gabriel Maciá-Fernández; Jesús Esteban Díaz Verdejo; Pedro García-Teodoro

Abstract:

In this paper, a framework for anomaly detection and forensics in Big Data is introduced. The framework tackles the Big Data 4 Vs: Variety, Veracity, Volume and Velocity. The varied nature of the data sources is treated by transforming the typically unstructured data into a high-dimensional, structured data set. To overcome both the uncertainty (low veracity) and the high dimension introduced, a latent variable method, in particular Principal Component Analysis (PCA), is applied. PCA is well known for its outstanding capabilities to extract information from high-dimensional data sets. However, PCA is limited to small, though highly multivariate, data sets. To handle this limitation, a kernel computation of PCA is employed. This avoids computational problems due to the size (number of observations) of the data sets and allows parallelism. Also, hierarchical models are proposed for cases where dimensionality is extreme. Finally, to handle high velocity in analyzing time series data flows, the Exponentially Weighted Moving Average (EWMA) approach is employed. All these steps are discussed in the paper, and the VAST 2012 mini challenge 2 is used for illustration.
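As a rough illustration of how a kernel computation and an EWMA update might fit together, the sketch below accumulates a variables-by-variables cross-product matrix batch by batch and extracts principal components from it. This is not taken from the paper: it assumes that the "kernel" is the cross-product matrix X^T X, that the data are already preprocessed (centered and scaled), and that EWMA is applied through a forgetting factor. The names `update_cross_product`, `pca_from_cross_product` and `lam` are hypothetical.

```python
import numpy as np


def update_cross_product(xx, batch, lam=1.0):
    """Update the cross-product matrix with a new batch of observations.

    Assumption: lam is an EWMA-style forgetting factor. lam = 1.0 weights
    all batches equally; lam < 1.0 down-weights older batches.
    """
    return lam * xx + batch.T @ batch


def pca_from_cross_product(xx, n_components):
    """Return the leading eigenvalues and loadings of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(xx)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]


# Synthetic data standing in for a preprocessed feature matrix.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))

# Process the observations in batches; memory use depends only on the
# number of variables, not on the number of observations.
xx = np.zeros((5, 5))
for batch in np.array_split(data, 10):
    xx = update_cross_product(xx, batch, lam=1.0)

eigvals, loadings = pca_from_cross_product(xx, n_components=2)
```

With `lam=1.0` the accumulated matrix equals the cross-product of the full data set, so the batches could equally be processed in parallel and summed. Choosing `lam` below 1 would make the model track recent data more closely, at the cost of forgetting older observations.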

Research areas:

Ethical hacking and digital forensics

Year:

2014

Type of Publication:

In Proceedings

Keywords:

Big data, Anomaly detection, PCA, MEDA

Book title:

INFOCOM'2014 Workshop on Security and Privacy in Big Data
