Efficient Use of Data for Prediction and Validation

Publication, Journal Article
Liu, L; Jung, SH
Published in: Advances in Artificial Intelligence and Machine Learning
January 1, 2024

Prediction model building is one of the most important tasks in the analysis of high-dimensional data, and a fitted prediction model should be validated before future use. When conducting such an analysis, the whole data set must therefore serve both training and validation. Under a hold-out method, a larger training set yields a more efficient fitted prediction model, but the correspondingly smaller validation set lowers the validation power. To balance the efficiency of the fitted prediction model against its validation, a 50-50 allocation of the whole data set is a popular hold-out choice. A prediction-and-validation procedure should use the information embedded in the whole data set as efficiently as possible, and cross-validation (CV) methods have become very popular for this reason. In a CV method, a large portion of the data set is used to train models while the remaining small portion is used for validation, and this procedure is repeated until the whole data set has been used for validation. Because each data point is used for both training and validation, increasing the training portion raises training efficiency but lowers validation power through increased over-fitting, i.e., more frequent use of each data point for training. As a further step toward efficient use of the whole data, we propose using the whole data set for both training and validation, called the 1-fold CV method. Fitting the prediction model on the whole data maximizes training efficiency, but reusing the whole data set for validation is expected to make the naive validation power very low; we therefore estimate the validation power of CV methods by permutation methods. Through extensive simulation and real data studies, we conclude that the newly proposed 1-fold CV method uses the available data set very efficiently.
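The permutation-based validation idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the model (ordinary least squares), the validation statistic (correlation between fitted and observed outcomes), and all function names are assumptions. The model is fit on the whole data set and validated on the same data (the 1-fold CV idea), and the validation statistic is calibrated against a permutation null built by refitting on outcome-permuted copies of the data.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    # Least-squares fit; stands in for whatever prediction model is trained.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def validation_stat(X, y, beta):
    # Correlation between fitted and observed outcomes on the SAME data
    # used for training -- the "1-fold CV" reuse of the whole data set.
    return np.corrcoef(X @ beta, y)[0, 1]

def one_fold_cv_permutation_test(X, y, n_perm=500):
    # Observed statistic: train and validate on the whole data set.
    obs = validation_stat(X, y, fit_ols(X, y))
    # Permutation null: refit on outcome-permuted data so the null
    # distribution reflects the over-fitting induced by data reuse.
    null = np.array([
        validation_stat(X, yp, fit_ols(X, yp))
        for yp in (rng.permutation(y) for _ in range(n_perm))
    ])
    pval = (1 + np.sum(null >= obs)) / (1 + n_perm)
    return obs, pval

# Toy example: n=100 samples, p=20 features, 5 of them informative.
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.standard_normal(n)

obs, pval = one_fold_cv_permutation_test(X, y)
```

Refitting on each permuted copy matters here: training-set statistics are inflated by over-fitting even under the null, so a fixed reference distribution would be miscalibrated, whereas the permutation null is subject to the same data reuse as the observed statistic.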

Published In

Advances in Artificial Intelligence and Machine Learning

DOI

10.54364/AAIML.2024.41106
EISSN

2582-9793

Publication Date

January 1, 2024

Volume

4

Issue

1

Start / End Page

1834 / 1846

Citation

APA: Liu, L., & Jung, S. H. (2024). Efficient Use of Data for Prediction and Validation. Advances in Artificial Intelligence and Machine Learning, 4(1), 1834–1846. https://doi.org/10.54364/AAIML.2024.41106
Chicago: Liu, L., and S. H. Jung. “Efficient Use of Data for Prediction and Validation.” Advances in Artificial Intelligence and Machine Learning 4, no. 1 (January 1, 2024): 1834–46. https://doi.org/10.54364/AAIML.2024.41106.
ICMJE: Liu L, Jung SH. Efficient Use of Data for Prediction and Validation. Advances in Artificial Intelligence and Machine Learning. 2024 Jan 1;4(1):1834–46.
MLA: Liu, L., and S. H. Jung. “Efficient Use of Data for Prediction and Validation.” Advances in Artificial Intelligence and Machine Learning, vol. 4, no. 1, Jan. 2024, pp. 1834–46. Scopus, doi:10.54364/AAIML.2024.41106.
NLM: Liu L, Jung SH. Efficient Use of Data for Prediction and Validation. Advances in Artificial Intelligence and Machine Learning. 2024 Jan 1;4(1):1834–1846.
