Rate-distortion bounds on Bayes risk in supervised learning
An information-theoretic framework is presented for estimating the number of labeled samples needed to train a classifier in a parametric Bayesian setting. Ideas from rate-distortion theory are used to derive bounds for the average L1 or L∞ distance between the learned classifier and the true maximum a posteriori classifier in terms of familiar information-theoretic quantities and the number of training samples available. The maximum a posteriori classifier is viewed as a random source, labeled training data are viewed as a finite-rate encoding of the source, and the L1 or L∞ Bayes risk is viewed as the average distortion. The result is a framework dual to the well-known probably approximately correct (PAC) framework. PAC bounds characterize worst-case learning performance of a family of classifiers whose complexity is captured by the Vapnik-Chervonenkis (VC) dimension. The rate-distortion framework, on the other hand, characterizes the average-case performance of a family of data distributions in terms of a quantity called the interpolation dimension, which represents the complexity of the family of data distributions. The resulting bounds do not suffer from the pessimism typical of the PAC framework, particularly when the training set is small.