Classifying noisy and incomplete medical data by a differential latent semantic indexing approach
It is well recognized that medical datasets are often noisy and incomplete because of difficulties in data collection and integration. Noise and incompleteness in medical data pose substantial challenges for accurate classification. A differential latent semantic indexing (DLSI) approach, an improvement on the standard LSI method, has been proposed for information retrieval and has demonstrated better performance than the standard LSI approach. The key idea is that DLSI adapts to the unique characteristics of each individual record or document. Through experiments on real datasets, we show that DLSI outperforms the standard LSI method on noisy and incomplete medical datasets. The results strongly indicate that the DLSI approach is also capable of analyzing numerical medical data.