Towards Scalable and Accurate Online Feature Selection for Big Data
Feature selection is important in many big data applications, but it faces at least two critical challenges. First, in many applications the dimensionality is extremely high, in the millions, and keeps growing. Second, feature selection has to be highly scalable, preferably online, so that each feature can be processed in a sequential scan. In this paper, we develop SAOLA, a Scalable and Accurate OnLine Approach for feature selection. Building on a theoretical analysis of a lower bound on the pairwise correlations between features in the currently selected feature subset, SAOLA employs novel online pairwise comparison techniques to address both challenges and to maintain a parsimonious model over time. An empirical study on a series of real-world benchmark data sets shows that SAOLA scales to data sets of extremely high dimensionality and outperforms state-of-the-art feature selection methods.
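To make the idea of online pairwise comparison concrete, the following is a minimal Python sketch, not the SAOLA algorithm itself. The abstract does not state the correlation measure, the bound, or the exact comparison rule, so everything here is an assumption for illustration: the absolute Pearson correlation as the measure, the `min_relevance` threshold, the two redundancy rules, and the names `abs_corr` and `online_select`. Features arrive one at a time, and each new feature is compared only against the features currently selected.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple


def abs_corr(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def online_select(
    stream: Iterable[Tuple[str, Sequence[float]]],
    label: Sequence[float],
    min_relevance: float = 0.1,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Process features one by one and keep a small, non-redundant subset."""
    # Maps feature name -> (values, relevance to the label).
    selected: Dict[str, Tuple[List[float], float]] = {}
    for name, values in stream:
        rel = abs_corr(values, label)
        if rel < min_relevance:
            continue  # irrelevant feature: drop it immediately
        keep_new = True
        for other, (other_vals, other_rel) in list(selected.items()):
            pair = abs_corr(values, other_vals)
            if other_rel >= rel and pair >= rel:
                # Assumed rule 1: an existing feature is at least as relevant
                # and strongly correlated with the new one, so the new one
                # is treated as redundant.
                keep_new = False
                break
            if rel > other_rel and pair >= other_rel:
                # Assumed rule 2: the new feature is more relevant and
                # makes the existing one redundant.
                del selected[other]
        if keep_new:
            selected[name] = (list(values), rel)
    return list(selected)


if __name__ == "__main__":
    label = [1, 2, 3, 4, 5, 6, 7, 8]
    stream = [
        ("a", [1, 2, 3, 4, 5, 6, 8, 7]),      # strongly relevant
        ("b", [1, 2, 3, 4, 5, 6, 8, 7]),      # exact duplicate of "a"
        ("c", [1, -1, -1, 1, 1, -1, -1, 1]),  # uncorrelated with the label
        ("d", [1, 2, 3, 4, 5, 6, 7, 8]),      # more relevant than "a"
    ]
    print(online_select(stream, label))
```

In this sketch each arriving feature costs one relevance computation plus at most one pairwise comparison per currently selected feature, which is why keeping the selected subset small matters for scalability.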