Conference · 12th International Conference on Learning Representations, ICLR 2024 · January 1, 2024
Despite the impressive generalization capabilities of deep neural networks, they have been repeatedly shown to be overconfident when they are wrong. Fixing this issue is known as model calibration, and has consequently received much attention in the form o ...
Conference · Proceedings of Machine Learning Research · January 1, 2023
Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due ...
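
The Mixup primitive described above is easy to state concretely. Below is a minimal NumPy sketch of producing one mixed minibatch; the function name, the Beta(α, α) mixing weight, and the toy shapes are our assumptions for illustration, not code from the paper.

```python
import numpy as np

def mixup_batch(x1, y1, x2, y2, alpha=0.2):
    """Return a random convex combination of two batches of inputs and one-hot labels."""
    lam = np.random.beta(alpha, alpha)              # mixing weight lambda ~ Beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# toy usage: two minibatches of 4 examples with 3 classes
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
y1, y2 = np.eye(3)[[0, 1, 2, 0]], np.eye(3)[[2, 2, 1, 0]]
x_mix, y_mix = mixup_batch(x1, y1, x2, y2)          # train on (x_mix, y_mix) instead of raw pairs
```
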
Conference · Proceedings of Machine Learning Research · January 1, 2023
Sparse coding, which refers to modeling a signal as sparse linear combinations of the elements of a learned dictionary, has proven to be a successful (and interpretable) approach in applications such as signal processing, computer vision, and medical imagi ...
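
As a concrete instance of the coding step (writing a signal as a sparse combination of dictionary atoms), here is a hedged NumPy sketch using ISTA with a fixed dictionary; the dictionary-learning step is omitted and all parameter choices are ours, not the paper's.

```python
import numpy as np

def ista_sparse_code(D, y, lam=0.05, n_iter=200):
    """Minimize 0.5*||y - D z||^2 + lam*||z||_1 over the code z (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2                   # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        w = z - D.T @ (D @ z - y) / L               # gradient step on the quadratic term
        z = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # soft-thresholding
    return z

# toy usage: a 20-atom dictionary in R^10 and a 3-sparse ground-truth code
rng = np.random.default_rng(0)
D = rng.standard_normal((10, 20)); D /= np.linalg.norm(D, axis=0)
z_true = np.zeros(20); z_true[[1, 5, 9]] = [1.0, -2.0, 1.5]
z_hat = ista_sparse_code(D, D @ z_true)             # recovers a sparse code close to z_true
```
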
Conference · Proceedings of Machine Learning Research · January 1, 2023
In deep learning, often the training process finds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mecha ...
Conference · EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings · January 1, 2023
Pre-trained language models have been shown to encode linguistic structures like parse trees in their embeddings while being trained unsupervised. Some doubts have been raised about whether the models are doing parsing or only some computation weakly correlated ...
Conference · Advances in Neural Information Processing Systems · January 1, 2023
We focus on the task of learning a single index model σ(w* · x) with respect to the isotropic Gaussian distribution in d dimensions. Prior work has shown that the sample complexity of learning w* is governed by the information exponent k* of the link funct ...
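
For intuition about the model y = σ(w*·x) with isotropic Gaussian x, here is a hedged NumPy sketch of the easiest case (ReLU link, information exponent 1), where even the plain first moment E[y·x] aligns with w* by Stein's lemma. This is only a toy sanity check, not the estimator or sample-complexity analysis from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 20000
w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)

X = rng.standard_normal((n, d))              # isotropic Gaussian inputs in d dimensions
y = np.maximum(X @ w_star, 0.0)              # single index model with ReLU link sigma

w_hat = (y[:, None] * X).mean(axis=0)        # first-moment estimator, proportional to w*
w_hat /= np.linalg.norm(w_hat)
print("alignment <w_hat, w*> =", round(float(w_hat @ w_star), 3))   # close to 1
```
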
Conference · Proceedings of Machine Learning Research · January 1, 2023
Compositionality is a common property in many modalities including text and images, but the compositional generalization of multi-modal models is not well-understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic ...
Conference · 11th International Conference on Learning Representations, ICLR 2023 · January 1, 2023
Monotonic linear interpolation (MLI) - on the line connecting a random initialization with the minimizer it converges to, the loss and accuracy are monotonic - is a phenomenon that is commonly observed in the training of neural networks. Such a phenomenon m ...
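
The MLI measurement itself is simple to reproduce on a toy problem. The sketch below (our own logistic-regression setup, not the paper's experiments) trains with gradient descent and then evaluates the loss along the straight line from the random initialization to the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X @ rng.standard_normal(5) > 0).astype(float)

def loss(w):                                     # logistic loss on the toy data
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

w0 = rng.standard_normal(5)                      # random initialization
w = w0.copy()
for _ in range(500):                             # plain gradient descent to a minimizer
    w -= 0.5 * grad(w)

alphas = np.linspace(0.0, 1.0, 11)               # loss along the line from w0 to w
curve = [round(loss((1 - a) * w0 + a * w), 3) for a in alphas]
print(curve)                                     # typically decreases monotonically
```
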
Conference · 11th International Conference on Learning Representations, ICLR 2023 · January 1, 2023
Self-supervised learning has significantly improved the performance of many NLP tasks. However, how self-supervised learning discovers useful representations, and why it is better than traditional approaches such as probabilistic models, are still largel ...
Conference · 11th International Conference on Learning Representations, ICLR 2023 · January 1, 2023
Depth separation-why a deeper network is more powerful than a shallower one-has been a major problem in deep learning theory. Previous results often focus on representation power. For example, Safran et al. (2019) constructed a function that is easy to app ...
Conference · 11th International Conference on Learning Representations, ICLR 2023 · January 1, 2023
Recently, researchers observed that gradient descent for deep neural networks operates in an “edge-of-stability” (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than the stability threshold 2/η (where η is the step size). Despit ...
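
The quantity at the center of this entry, sharpness versus 2/η, can be estimated numerically. Below is a hedged sketch (toy objective, finite-difference Hessian-vector products, all constants ours) of measuring the top Hessian eigenvalue by power iteration and comparing it with the stability threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

def grad(w):                                     # gradient of a toy loss 0.5*||Aw-b||^2 + 0.025*sum(w^4)
    return A.T @ (A @ w - b) + 0.1 * w ** 3

def hvp(w, v, eps=1e-4):                         # finite-difference Hessian-vector product
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def sharpness(w, iters=100):                     # top Hessian eigenvalue via power iteration
    v = rng.standard_normal(w.shape)
    for _ in range(iters):
        v = hvp(w, v)
        v /= np.linalg.norm(v)
    return float(v @ hvp(w, v))                  # Rayleigh quotient

eta = 0.01
w = rng.standard_normal(10)
for _ in range(200):                             # plain gradient descent with step size eta
    w -= eta * grad(w)
print("sharpness:", round(sharpness(w), 2), " threshold 2/eta:", 2 / eta)
```
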
Conference · Advances in Neural Information Processing Systems · January 1, 2023
Finding an approximate second-order stationary point (SOSP) is a well-studied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, l ...
Journal Article · Mathematical Programming · June 1, 2022
Tucker decomposition is a popular technique for many data analysis and machine learning applications. Finding a Tucker decomposition is a nonconvex optimization problem. As the scale of the problems increases, local search algorithms such as stochastic gra ...
Journal Article · Mathematical Programming · June 1, 2022
Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape ...
Conference · Proceedings of Machine Learning Research · January 1, 2022
This paper studies online algorithms augmented with multiple machine-learned predictions. We give a generic algorithmic framework for online covering problems with multiple predictions that obtains an online solution that is competitive against the perform ...
Conference · Proceedings of Machine Learning Research · January 1, 2022
Recently, many reinforcement learning techniques have been shown to have provable guarantees in the simple case of linear dynamics, especially in problems like linear quadratic regulators. However, in practice many tasks require learning a policy from rich ...
Conference · ICLR 2022 - 10th International Conference on Learning Representations · January 1, 2022
In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk ...
Conference · Advances in Neural Information Processing Systems · January 1, 2022
We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel ...
Other · ACM Transactions on Algorithms · August 1, 2021
In this article, we introduce the online service with delay problem. In this problem, there are n points in a metric space that issue service requests over time, and there is a server that serves these requests. The goal is to minimize the sum of distance ...
Journal Article · Journal of the ACM · March 1, 2021
Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in mach ...
Conference · Proceedings of Machine Learning Research · January 1, 2021
Choosing the right parameters for optimization algorithms is often the key to their success in practice. Solving this problem using a learning-to-learn approach, using meta-gradient descent on a meta-objective based on the trajectory that the optimizer gene ...
Conference · Advances in Neural Information Processing Systems · January 1, 2021
In this paper we study the training dynamics for gradient flow on over-parametrized tensor decomposition problems. Empirically, such training process often first fits larger components and then discovers smaller components, which is similar to a tensor def ...
Conference · Advances in Neural Information Processing Systems · January 1, 2021
The emerging field of learning-augmented online algorithms uses ML techniques to predict future input parameters and thereby improve the performance of online algorithms. Since these parameters are, in general, real-valued functions, a natural approach is ...
Conference · Proceedings of Machine Learning Research · January 1, 2021
We give an algorithm for exact sampling from the Bingham distribution p(x) ∝ exp(x⊺Ax) on the sphere S^{d-1} with expected runtime of poly(d, λmax(A) - λmin(A)). The algorithm is based on rejection sampling, where the proposal distribution is a polynomial appr ...
Conference · Proceedings of Machine Learning Research · January 1, 2021
While over-parameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on over-parameterization do not fully explain the reason—they either work in the Neural Tangent Kernel regime where ...
Chapter · January 1, 2021
In this chapter, we introduce nonnegative matrix factorization and topic modeling. We will see that there is a natural structural assumption called separability that allows us to circumvent the worst-case NP-hardness results for nonnegative matrix factoriza ...
Journal Article · Proceedings of the Annual ACM Symposium on Theory of Computing · June 8, 2020
Estimating the normalizing constant of an unnormalized probability distribution has important applications in computer science, statistical physics, machine learning, and statistics. In this work, we consider the problem of estimating the normalizing const ...
Journal Article · Advances in Neural Information Processing Systems · January 1, 2020
Over-parametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optimal solutions. In this paper we study a closely related tensor decom ...
Conference · 37th International Conference on Machine Learning, ICML 2020 · January 1, 2020
We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error gu ...
Conference · 37th International Conference on Machine Learning, ICML 2020 · January 1, 2020
A popular line of recent research incorporates ML advice in the design of online algorithms to improve their performance in typical instances. These papers treat the ML algorithm as a blackbox, and redesign online algorithms to take advantage of ML predict ...
Conference · Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms · January 1, 2019
We study the fundamental problem of high-dimensional mean estimation in a robust model where a constant fraction of the samples are adversarially corrupted. Recent work gave the first polynomial time algorithms for this problem with dimension-independent e ...
Journal Article · Advances in Neural Information Processing Systems · January 1, 2019
Mode connectivity (Garipov et al., 2018; Draxler et al., 2018) is a surprising phenomenon in the loss landscape of deep nets. Optima, at least those discovered by gradient-based optimization, turn out to be connected by simple paths on which the loss functio ...
Conference · 7th International Conference on Learning Representations, ICLR 2019 · January 1, 2019
We give a new algorithm for learning a two-layer neural network under a general class of input distributions. Assuming there is a ground-truth two-layer network y = Aσ(Wx) + ξ, where A, W are weight matrices, ξ represents noise, and the number of neurons i ...
Conference · 7th International Conference on Learning Representations, ICLR 2019 · January 1, 2019
Word embedding is a powerful tool in natural language processing. In this paper we consider the problem of word embedding composition - given vector representations of two words, compute a vector for the entire phrase. We give a generative model that can c ...
Journal Article · Foundations and Trends in Machine Learning · January 1, 2019
Spectral methods have been the mainstay in several domains such as machine learning, applied mathematics and scientific computing. They involve finding a certain kind of spectral decomposition to obtain basis functions that can capture important structures ...
Conference · Advances in Neural Information Processing Systems · January 1, 2019
Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In con ...
Conference · Proceedings of Machine Learning Research · January 1, 2019
We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial time algorithms for this problem with near-optimal ...
Conference · Proceedings of Machine Learning Research · January 1, 2019
Variance reduction techniques like SVRG (Johnson and Zhang, 2013) provide simple and fast algorithms for optimizing a convex finite-sum objective. For nonconvex objectives, these techniques can also find a first-order stationary point (with small gradient) ...
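
For reference, here is a minimal SVRG loop on a toy least-squares finite sum (our own instance and step size; the paper's contribution concerns nonconvex objectives and second-order stationary points, which this sketch does not reproduce).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):                                # gradient of the i-th summand 0.5*(a_i.w - b_i)^2
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):
    return A.T @ (A @ w - b) / n

w_snap, eta = np.zeros(d), 0.01
for _ in range(20):                              # outer loop: recompute the snapshot gradient
    mu = full_grad(w_snap)
    w = w_snap.copy()
    for _ in range(n):                           # inner loop: variance-reduced stochastic steps
        i = rng.integers(n)
        w -= eta * (grad_i(w, i) - grad_i(w_snap, i) + mu)
    w_snap = w
print("objective:", round(0.5 * np.mean((A @ w_snap - b) ** 2), 4))
```
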
Conference · Proceedings of Machine Learning Research · January 1, 2019
Folklore results in the theory of Stochastic Approximation indicate the (minimax) optimality of Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951) with polynomially decaying step sizes and iterate averaging (Ruppert, 1988; Polyak and Juditsky, 19 ...
Conference · 35th International Conference on Machine Learning, ICML 2018 · January 1, 2018
Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive paramete ...
Conference · 35th International Conference on Machine Learning, ICML 2018 · January 1, 2018
Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model, 2) they are an "end-to-end" approa ...
Conference · Advances in Neural Information Processing Systems · January 1, 2018
A key task in Bayesian machine learning is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). One prevalent example of this is sampling posteriors in parametric distributions, such as latent- ...
Conference · Advances in Neural Information Processing Systems · January 1, 2018
Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex nonsmooth losses (such as modern deep networks), the population risk is generally s ...
Conference · 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings · January 1, 2018
We consider the problem of learning a one-hidden-layer neural network: we assume the input x ∈ R^d is from a Gaussian distribution and the label y = a⊤σ(Bx) + ξ, where a is a nonnegative vector in R^m with m ≤ d, B ∈ R^{m×d} is a full-rank weight matrix, and ξ is ...
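
The data-generating model in this entry is concrete enough to simulate. A hedged sketch follows (σ chosen as ReLU and all constants ours, purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 20, 10, 1000                           # m <= d as in the abstract
B = rng.standard_normal((m, d))                  # full-rank weight matrix (with high probability)
a = np.abs(rng.standard_normal(m))               # nonnegative second-layer weights

X = rng.standard_normal((n, d))                  # x ~ N(0, I_d)
xi = 0.01 * rng.standard_normal(n)               # noise
y = np.maximum(X @ B.T, 0.0) @ a + xi            # y = a^T sigma(B x) + xi with sigma = ReLU
```
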
Conference · Proceedings of Machine Learning Research · January 1, 2018
Matrix completion is a well-studied problem with many machine learning applications. In practice, the problem is often solved by non-convex optimization algorithms. However, the current theoretical analysis for non-convex algorithms relies crucially on the ...
Conference · Proceedings of the Annual ACM Symposium on Theory of Computing · June 19, 2017
In this paper, we introduce the online service with delay problem. In this problem, there are n points in a metric space that issue service requests over time, and a server that serves these requests. The goal is to minimize the sum of distance traveled by ...
Conference · Proceedings of the Annual ACM Symposium on Theory of Computing · June 19, 2017
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the max ...
Journal Article · Journal of Machine Learning Research · April 1, 2017
We present a novel analysis of the dynamics of tensor power iterations in the overcomplete regime where the tensor CP rank is larger than the input dimension. Finding the CP decomposition of an overcomplete tensor is NP-hard in general. We consider the cas ...
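
The tensor power update itself is a one-liner. The sketch below runs it on a symmetric third-order tensor with orthonormal components, an easier setting than the overcomplete regime analyzed in the paper; it is only meant to show the iteration u ← T(I, u, u)/‖T(I, u, u)‖.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 5
A, _ = np.linalg.qr(rng.standard_normal((d, k)))        # orthonormal components a_1, ..., a_k
T = np.einsum('di,ei,fi->def', A, A, A)                  # T = sum_i a_i (x) a_i (x) a_i

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
for _ in range(50):
    u = np.einsum('def,e,f->d', T, u, u)                 # u <- T(I, u, u)
    u /= np.linalg.norm(u)
print("max_i |<u, a_i>| =", round(float(np.max(np.abs(A.T @ u))), 3))   # converges to a component
```
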
Conference · Advances in Neural Information Processing Systems · January 1, 2017
Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape ...
Conference · 34th International Conference on Machine Learning, ICML 2017 · January 1, 2017
Generalization is defined for the training of generative adversarial networks (GANs), and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. In particular, training may appear to be ...
Conference · 34th International Conference on Machine Learning, ICML 2017 · January 1, 2017
In this paper we develop a new framework that captures the common landscape underlying the common non-convex low-rank matrix problems including matrix sensing, matrix completion and robust PCA. In particular, we show for all above problems (including asymm ...
Conference · 34th International Conference on Machine Learning, ICML 2017 · January 1, 2017
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedur ...
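
The idea of perturbed gradient descent can be illustrated on a tiny strict-saddle objective: run gradient descent and, whenever the gradient is small, add a small random perturbation. The thresholds, perturbation radius and toy function below are ours; see the paper for the actual algorithm and its almost dimension-free guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w):                                     # f(x, y) = 0.5*x^2 - 0.5*y^2 + 0.25*y^4
    x, y = w                                     # strict saddle at (0, 0); minima at (0, +/-1)
    return np.array([x, -y + y ** 3])

def perturbed_gd(w, eta=0.1, g_thresh=1e-3, radius=1e-2, iters=500):
    for _ in range(iters):
        g = grad(w)
        if np.linalg.norm(g) <= g_thresh:        # near-stationary point: perturb
            w = w + radius * rng.standard_normal(w.shape)
        else:                                    # otherwise take an ordinary gradient step
            w = w - eta * g
    return w

w = perturbed_gd(np.zeros(2))                    # start exactly at the saddle point
print("final iterate:", np.round(w, 3))          # ends near one of the minima (0, +/-1)
```
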
Conference · Journal of Machine Learning Research · June 6, 2016
Local search heuristics for non-convex optimization are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in ...
Journal Article · IEEE Transactions on Signal Processing · April 1, 2016
This paper addresses two fundamental problems in the context of hidden Markov models (HMMs). The first problem is concerned with the characterization and computation of a minimal order HMM that realizes the exact joint densities of an output process based ...
Conference · 33rd International Conference on Machine Learning, ICML 2016 · January 1, 2016
This paper considers the problem of canonical correlation analysis and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. We consider the setting of finding the top-k canonical/eigen subspace, and solve these problems throug ...
Conference · 33rd International Conference on Machine Learning, ICML 2016 · January 1, 2016
In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. For example, we might ha ...
Conference · 33rd International Conference on Machine Learning, ICML 2016 · January 1, 2016
Recently, there has been considerable progress on designing algorithms with provable guarantees - typically using linear algebraic methods - for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be ...
Conference · SIAM Journal on Computing · January 1, 2016
In the nonnegative matrix factorization (NMF) problem we are given an n × m nonnegative matrix M and an integer r > 0. Our goal is to express M as AW, where A and W are nonnegative matrices of size n×r and r×m, respectively. In some applications, it makes ...
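
For concreteness, the sketch below fits the factorization M ≈ AW with the classic Lee-Seung multiplicative updates on a synthetic instance. This is a standard heuristic shown only to make the objective tangible; it is not the provable algorithm developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 30, 40, 5
M = np.abs(rng.standard_normal((n, r))) @ np.abs(rng.standard_normal((r, m)))   # nonnegative, rank r

A = np.abs(rng.standard_normal((n, r)))          # nonnegative initialization
W = np.abs(rng.standard_normal((r, m)))
eps = 1e-12
for _ in range(500):                             # multiplicative updates keep both factors nonnegative
    W *= (A.T @ M) / (A.T @ A @ W + eps)
    A *= (M @ W.T) / (A @ W @ W.T + eps)
print("relative error:", round(np.linalg.norm(M - A @ W) / np.linalg.norm(M), 4))
```
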
Conference · Advances in Neural Information Processing Systems · January 1, 2016
Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in pro ...
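
One simple non-convex algorithm of the kind this entry refers to is gradient descent on a factorized objective over the observed entries. The sketch below is a generic illustration with our own rank, step size and sampling rate, not the specific algorithm analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 50, 3, 0.3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # ground-truth rank-r matrix
mask = rng.random((n, n)) < p                                    # set of observed entries

U = 0.1 * rng.standard_normal((n, r))            # small random initialization
V = 0.1 * rng.standard_normal((n, r))
eta = 0.02
for _ in range(2000):                            # gradient descent on 0.5*||P_Omega(U V^T - M)||_F^2
    R = mask * (U @ V.T - M)                     # residual on observed entries only
    U, V = U - eta * (R @ V), V - eta * (R.T @ U)
print("relative error on the full matrix:",
      round(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M), 4))
```
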
Conference · Leibniz International Proceedings in Informatics, LIPIcs · August 1, 2015
Tensor rank and low-rank tensor decompositions have many applications in learning and complexity theory. Most known algorithms use unfoldings of tensors and can only handle rank up to n^⌊p/2⌋ for a p-th order tensor in R^{n^p}. Previously no efficient algorithm ...
Conference · Proceedings of the Annual ACM Symposium on Theory of Computing · June 14, 2015
Efficiently learning mixtures of Gaussians is a fundamental problem in statistics and learning theory. Given samples coming from a random one out of k Gaussian distributions in R^n, the learning problem asks to estimate the means and the covariance matrices ...
Journal Article · Algorithmica · May 1, 2015
We present a new algorithm for independent component analysis which has provable performance guarantees. In particular, suppose we are given samples of the form y=Ax+η where A is an unknown but non-singular n×n matrix, x is a random variable whose coordina ...
Conference · Journal of Machine Learning Research · January 1, 2015
Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non- ...
Conference · Journal of Machine Learning Research · January 1, 2015
In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of comput ...
Conference · Journal of Machine Learning Research · January 1, 2015
We provide guarantees for learning latent variable models emphasizing on the overcomplete regime, where the dimensionality of the latent space exceeds the observed dimensionality. In particular, we consider multiview mixtures, ICA, and sparse coding models ...
Conference · Journal of Machine Learning Research · January 1, 2015
We analyze stochastic gradient descent for optimizing non-convex functions. In many cases for non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates are trapped in saddle points. In this paper we ...
Conference · 32nd International Conference on Machine Learning, ICML 2015 · January 1, 2015
We develop a family of accelerated stochastic algorithms that optimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide ra ...
Conference · 32nd International Conference on Machine Learning, ICML 2015 · January 1, 2015
Non-negative matrix factorization (NMF) is a natural model of admixture and is widely used in science and engineering. A plethora of algorithms have been developed to tackle NMF, but due to the non-convex nature of the problem, there is little guarantee on ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2015
This note is a short version of that in [1]. It is intended as a survey for the 2015 Algorithmic Learning Theory (ALT) conference. This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent vari ...
Journal Article · Journal of Machine Learning Research · August 1, 2014
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models-including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation-which exploits a certain tenso ...
Journal Article · 2014 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton 2014 · January 30, 2014
In this paper, we study the problem of finding a minimal order (quasi-) Hidden Markov Model for a random process, which is the output process of an unknown stationary HMM of finite order. In the main theorem, we show that excluding a measure zero set in th ...
Journal Article · Journal of Machine Learning Research · January 1, 2014
Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remov ...
Journal Article · Journal of Machine Learning Research · January 1, 2014
In sparse recovery we are given a matrix A ∈ R^{n×m} ("the dictionary") and a vector of the form AX where X is sparse, and the goal is to recover X. This is a central notion in signal processing, statistics and machine learning. But in applications ...
Journal Article · 31st International Conference on Machine Learning, ICML 2014 · January 1, 2014
We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an n node multilayer network that has degree at most n^γ for some γ < 1 and each edge has ...
Journal Article · Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS · December 1, 2013
We give a new (1 + ε)-approximation for the SPARSEST CUT problem on graphs where small sets expand significantly more than the sparsest cut (expansion of sets of size n/r exceeds that of the sparsest cut by a factor of √(log n log r), for some small r; this condit ...
Journal Article · 30th International Conference on Machine Learning, ICML 2013 · January 1, 2013
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to app ...
Journal Article · Journal of Machine Learning Research · January 1, 2013
Detecting hidden communities from observed interactions is a classical problem. Theoretical analysis of community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we prov ...
Journal Article · Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS · December 1, 2012
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in m ...
Journal Article · Advances in Neural Information Processing Systems · December 1, 2012
We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax +η where A is an unknown n × n matrix and x is a random variable whose components ...
Journal Article · Proceedings of the ACM Conference on Electronic Commerce · July 10, 2012
A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. This is an important concept in most domains where networks arise: social, technological, biological, etc ...
Journal Article · Proceedings of the Annual ACM Symposium on Theory of Computing · June 26, 2012
The Nonnegative Matrix Factorization (NMF) problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormo ...
Journal Article · Algorithmica (New York) · December 1, 2011
We study the problem of solving simple stochastic games, and give both an interesting new algorithm and a hardness result. We show a reduction from fine approximation of simple stochastic games to coarse approximation of a polynomial sized game, which can ...
Journal Article · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · September 8, 2011
How to color 3-colorable graphs with few colors is a problem of longstanding interest. The best polynomial-time algorithm uses n^0.2072 colors. There are no indications that coloring using say O(log n) colors is hard. It has been suggested that SDP hierarc ...
Journal Article · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · July 11, 2011
We give new algorithms for a variety of randomly-generated instances of computational problems using a linearization technique that reduces to solving a system of linear equations. These algorithms are derived in the context of learning with structured noi ...
Journal Article · Communications of the ACM · May 1, 2011
Computational complexity studies intractable problems like NP-complete problems, which are conjectured to require more computational resources than can be provided by the fastest computers. The intractability of this problem leads to a concrete realization ...
Journal Article · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · December 1, 2009
We study the problem of solving simple stochastic games, and give both an interesting new algorithm and a hardness result. We show a reduction from fine approximation of simple stochastic games to coarse approximation of a polynomial sized game, which can ...