
Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective

Publication, Conference
Chen, C; Duan, J; Chen, Y; Zhang, J; Tran, SD; Xu, Y; Chen, L; Zeng, B; Chilimbi, T
Published in: Advances in Neural Information Processing Systems
January 1, 2022

Contrastive learning (CL) has been the de facto technique for self-supervised representation learning (SSL), with impressive empirical success in areas such as multi-modal representation learning. However, the traditional CL loss only considers negative samples from a minibatch, which can cause biased gradients due to the non-decomposability of the loss. For the first time, we consider optimizing a more generalized contrastive loss, where each data sample is associated with an infinite number of negative samples. We show that directly applying minibatch stochastic optimization can lead to gradient bias. To remedy this, we propose an efficient Bayesian data-augmentation technique that augments the contrastive loss into a decomposable one, to which standard stochastic optimization can be applied directly without gradient bias. Specifically, our augmented loss defines a joint distribution over the model parameters and the augmented parameters, which can be conveniently optimized by a proposed stochastic expectation-maximization algorithm. Our framework is more general and is related to several popular SSL algorithms. We verify our framework on both small-scale models and several large foundation models, including SSL on ImageNet and SSL for vision-language representation learning. Experimental results indicate the existence of gradient bias in all cases and demonstrate the effectiveness of the proposed method in improving upon the previous state of the art. Remarkably, our method can outperform the strong MoCo-v3 under the same hyper-parameter setting with only around half of the minibatch size, and also obtains strong results on the recent public benchmark ELEVATER for few-shot image classification.
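To make the non-decomposability argument concrete, here is a minimal numerical sketch (not the authors' code; the toy similarity function and all tensor names are illustrative assumptions). It shows that for an InfoNCE-style loss, whose normalizer is a log-sum-exp over all negatives, the average of gradients computed on minibatch-restricted losses does not match the gradient of the full loss:

```python
# Minimal sketch of the gradient-bias phenomenon described in the abstract.
# The log-sum-exp over ALL negatives does not decompose into a sum over
# minibatches, so the minibatch gradient is a biased estimate of the full one.
import torch

torch.manual_seed(0)

d, n_neg, batch = 8, 1000, 10
w = torch.randn(d, requires_grad=True)   # toy "encoder" parameter
anchor = torch.randn(d)
positive = torch.randn(d)
negatives = torch.randn(n_neg, d)

def info_nce(w, negs):
    # Toy similarity scores that depend on w so the gradient is nontrivial.
    s_pos = (w * anchor * positive).sum()
    s_neg = negs @ (w * anchor)
    return -(s_pos - torch.logsumexp(torch.cat([s_pos.view(1), s_neg]), dim=0))

# Gradient of the "full" loss that uses all negatives at once.
full_grad = torch.autograd.grad(info_nce(w, negatives), w)[0]

# Average of gradients computed with negatives restricted to minibatches.
mb_grads = [torch.autograd.grad(info_nce(w, chunk), w)[0]
            for chunk in negatives.split(batch)]
avg_mb_grad = torch.stack(mb_grads).mean(dim=0)

print("gradient gap:", (full_grad - avg_mb_grad).norm().item())
# The gap is nonzero and shrinks only as the batch size grows, which is why
# naive minibatch CL tends to need large batches.
```

The paper's Bayesian data augmentation addresses this by rewriting the loss into a decomposable form over model and augmented parameters, optimized with stochastic EM; the sketch above only illustrates why the naive minibatch estimator is biased.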


Published In

Advances in Neural Information Processing Systems

ISSN

1049-5258

Publication Date

January 1, 2022

Volume

35

Related Subject Headings

  • 4611 Machine learning
  • 1702 Cognitive Sciences
  • 1701 Psychology
 

Citation

APA
Chen, C., Duan, J., Chen, Y., Zhang, J., Tran, S. D., Xu, Y., … Chilimbi, T. (2022). Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective. In Advances in Neural Information Processing Systems (Vol. 35).

Chicago
Chen, C., J. Duan, Y. Chen, J. Zhang, S. D. Tran, Y. Xu, L. Chen, B. Zeng, and T. Chilimbi. “Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective.” In Advances in Neural Information Processing Systems, Vol. 35, 2022.

ICMJE
Chen C, Duan J, Chen Y, Zhang J, Tran SD, Xu Y, et al. Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective. In: Advances in Neural Information Processing Systems. 2022.

MLA
Chen, C., et al. “Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective.” Advances in Neural Information Processing Systems, vol. 35, 2022.

NLM
Chen C, Duan J, Chen Y, Zhang J, Tran SD, Xu Y, Chen L, Zeng B, Chilimbi T. Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective. Advances in Neural Information Processing Systems. 2022.
