Tackling the Big Data 4 Vs for Anomaly Detection
- José Camacho; Gabriel Maciá-Fernández; Jesús Esteban Díaz Verdejo; Pedro García-Teodoro
- Abstract:
- In this paper, a framework for anomaly detection and forensics in Big Data is introduced. The framework tackles the Big Data 4 Vs: Variety, Veracity, Volume and Velocity. The varied nature of the data sources is handled by transforming the typically unstructured data into a high-dimensional, structured data set. To overcome both the uncertainty (low veracity) and the high dimensionality this introduces, a latent variable method, in particular Principal Component Analysis (PCA), is applied. PCA is well known for its ability to extract information from high-dimensional data sets. However, PCA is limited to data sets with a small number of observations, even if they are highly multivariate. To overcome this limitation, a kernel computation of PCA is employed. This avoids the computational problems caused by the size (number of observations) of the data sets and allows parallelism. Hierarchical models are also proposed for cases where dimensionality is extreme. Finally, to handle high velocity when analyzing time series data flows, the Exponentially Weighted Moving Average (EWMA) approach is employed. All these steps are discussed in the paper, and the VAST 2012 Mini Challenge 2 is used for illustration.
- Year:
- 2014
- Type of Publication:
- In Proceedings
- Keywords:
- Big data, Anomaly detection, PCA, MEDA
- Book title:
- INFOCOM'2014 Workshop on Security and Privacy in Big Data
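
The abstract combines two ideas: computing PCA from a cross-product matrix rather than from the full observation matrix, and using EWMA to keep up with streaming data. The following is a minimal sketch of how these could fit together, not the authors' implementation. It assumes mean-centered data arriving in batches, a hypothetical forgetting factor `lam`, and extracts only the first principal component by power iteration. All function names are illustrative.

```python
import math

def ewma_cross_product(prev, batch, lam):
    """Return lam * prev + batch^T batch for a batch of row observations.

    The result is M x M (M = number of variables), so its size does not
    grow with the number of observations seen so far.
    """
    m = len(prev)
    out = [[lam * prev[i][j] for j in range(m)] for i in range(m)]
    for row in batch:
        for i in range(m):
            for j in range(m):
                out[i][j] += row[i] * row[j]
    return out

def leading_component(xx, iters=200):
    """Power iteration for the leading eigenvector of a symmetric PSD matrix.

    Simplification: starts from a uniform vector, which fails if that
    vector happens to be orthogonal to the leading eigenvector.
    """
    m = len(xx)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(xx[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Toy stream (assumed already mean-centered): two variables that move together.
batches = [
    [[1.0, 1.0], [-1.0, -1.0]],
    [[2.0, 2.0], [-2.0, -2.0]],
]
xx = [[0.0, 0.0], [0.0, 0.0]]
for batch in batches:
    xx = ewma_cross_product(xx, batch, lam=0.9)

pc1 = leading_component(xx)  # loading vector of the first component
```

Because each batch contributes only its own cross-product, batches could be processed in parallel and summed, and older batches are progressively down-weighted by `lam`.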