Publications

2021
P. Nakkiran, P. Venkat, S. Kakade, and T. Ma, “Optimal Regularization Can Mitigate Double Descent,” in International Conference on Learning Representations, 2021. arXiv Version.
Recent empirical and theoretical studies have shown that many learning algorithms -- from linear regression to neural networks -- can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised the question of whether we need to re-think our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. Our results suggest that it may also be informative to study the test risk scalings of various algorithms in the context of appropriately tuned regularization.
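As a rough, hedged illustration of the phenomenon this abstract describes (not the paper's experiments or theory), the following numpy sketch sweeps model size in a random-feature ridge regression and compares a nearly unregularized fit against a validation-tuned ℓ2 penalty; the data-generating process, feature map, and penalty grid are all illustrative choices.

```python
# Toy sketch (not from the paper): random-feature ridge regression, sweeping
# model size. A nearly unregularized fit typically peaks near the interpolation
# threshold (p close to n_train), while a validation-tuned penalty does not.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_val, n_test = 20, 40, 200, 2000
w_star = rng.normal(size=d) / np.sqrt(d)

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_star + 0.5 * rng.normal(size=n)

X_tr, y_tr = sample(n_train)
X_va, y_va = sample(n_val)
X_te, y_te = sample(n_test)

def ridge_fit(Phi, y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def mse(Phi, y, w):
    return np.mean((Phi @ w - y) ** 2)

for p in [5, 10, 20, 40, 80, 160]:            # model size (number of features)
    W = rng.normal(size=(d, p)) / np.sqrt(d)  # fixed random-feature map
    Phi_tr, Phi_va, Phi_te = (np.tanh(X @ W) for X in (X_tr, X_va, X_te))
    w0 = ridge_fit(Phi_tr, y_tr, 1e-8)        # (nearly) unregularized
    candidates = [ridge_fit(Phi_tr, y_tr, lam) for lam in 10.0 ** np.arange(-6, 3)]
    w_best = min(candidates, key=lambda w: mse(Phi_va, y_va, w))
    print(f"p={p:4d}  unregularized test MSE={mse(Phi_te, y_te, w0):7.3f}  "
          f"tuned test MSE={mse(Phi_te, y_te, w_best):7.3f}")
```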
S. S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei, “Few-Shot Learning via Learning the Representation, Provably,” in International Conference on Learning Representations, 2021. arXiv Version.
This paper studies few-shot learning via representation learning, where one uses T source tasks with n1 data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only n2 (≪ n1) data. Specifically, we focus on the setting where there exists a good common representation between source and target, and our goal is to understand how much of a sample size reduction is possible. First, we study the setting where this common representation is low-dimensional and provide a fast rate of O(C(Φ)/(n1 T) + k/n2); here, Φ is the representation function class, C(Φ) is its complexity measure, and k is the dimension of the representation. When specialized to linear representation functions, this rate becomes O(dk/(n1 T) + k/n2), where d (≫ k) is the ambient input dimension; this is a substantial improvement over the O(d/n2) rate obtained without representation learning. This result bypasses the Ω(1/T) barrier under the i.i.d. task assumption, and captures the desired property that all n1 T samples from the source tasks can be pooled together for representation learning. Next, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say, in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural network learning. Our results demonstrate that representation learning can fully utilize all n1 T samples from the source tasks.
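As a hedged toy sketch of the linear-representation setting described above (not the paper's algorithm or rates), one can pool estimates from all T source tasks to learn a k-dimensional subspace and then fit the small target task inside it; every numeric choice below is illustrative.

```python
# Toy sketch: learn a shared k-dim linear representation from many source
# tasks, then fit a data-poor target task inside the learned subspace.
# Not the paper's method; all problem sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d, k, T, n1, n2 = 100, 5, 50, 50, 10
B = np.linalg.qr(rng.normal(size=(d, k)))[0]     # ground-truth representation

def make_task(n):
    w = B @ rng.normal(size=k)                   # task regressor lies in span(B)
    X = rng.normal(size=(n, d))
    return X, X @ w + 0.1 * rng.normal(size=n), w

# Source phase: per-task least-squares estimates, then a rank-k SVD.
W_hat = np.stack(
    [np.linalg.lstsq(*make_task(n1)[:2], rcond=None)[0] for _ in range(T)], axis=1)
U = np.linalg.svd(W_hat, full_matrices=False)[0][:, :k]

# Target phase: only n2 samples, solved in the learned k-dimensional space.
X_t, y_t, w_t = make_task(n2)
w_rep = U @ np.linalg.lstsq(X_t @ U, y_t, rcond=None)[0]
w_raw = np.linalg.lstsq(X_t, y_t, rcond=None)[0]  # no representation learning
print("estimation error with learned representation:", np.linalg.norm(w_rep - w_t))
print("estimation error without representation     :", np.linalg.norm(w_raw - w_t))
```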
Y. Wang, R. Wang, and S. M. Kakade, “An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap,” in NeurIPS, 2021. arXiv Version.
A fundamental question in the theory of reinforcement learning is: suppose the optimal Q-function lies in the linear span of a given d-dimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in d) sample size lower bound, which holds even if the agent has access to a generative model of the environment. One may hope that this information-theoretic barrier for RL can be circumvented by further supposing an even more favorable assumption: there exists a constant suboptimality gap between the optimal Q-value of the best action and that of the second-best action (for all states). The hope is that having a large suboptimality gap would permit easier identification of optimal actions themselves, thus making the problem tractable; indeed, provided the agent has access to a generative model, sample-efficient RL is in fact possible with the addition of this more favorable assumption. This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves this question in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal Q-function. Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting. Complementing our negative hardness result, we give two positive results showing that provably sample-efficient RL is possible either under an additional low-variance assumption or under a novel hypercontractivity assumption (both implicitly place stronger conditions on the underlying dynamics model).
X. Liu, W. Kong, S. Kakade, and S. Oh, “Robust and Differentially Private Mean Estimation,” in NeurIPS, 2021. arXiv Version.
Differential privacy has emerged as a standard requirement in a variety of applications ranging from the U.S. Census to data collected in commercial devices, initiating an extensive line of research in accurately and privately releasing statistics of a database. An increasing number of such databases consist of data from multiple sources, not all of which can be trusted. This leaves existing private analyses vulnerable to attacks by an adversary who injects corrupted data. Despite the significance of designing algorithms that guarantee privacy and robustness (to a fraction of data being corrupted) simultaneously, even the simplest questions remain open. For the canonical problem of estimating the mean from i.i.d. samples, we introduce the first efficient algorithm that achieves both privacy and robustness for a wide range of distributions. This achieves optimal accuracy matching the known lower bounds for robustness, but the sample complexity has a factor of √d gap from the known lower bounds. We further show that this gap is due to computational efficiency; we introduce the first family of algorithms that close this gap but take exponential time. The innovation is in exploiting resilience (a key property in robust estimation) to adaptively bound the sensitivity and improve privacy.
Y. Bai, et al., “How Important is the Train-Validation Split in Meta-Learning?” in International Conference on Machine Learning, Virtual, 2021. arXiv Version.
Meta-learning aims to perform fast adaptation on a new task through learning a "prior" from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split (the train-val method), where the prior adapts to the task on one split of the data and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct train-train method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful in the linear centroid meta-learning problem. In the agnostic case, we show that the expected loss of the train-val method is minimized at the optimal prior for meta testing, and this is not the case for the train-train method in general without structural assumptions on the data. In contrast, in the realizable case where the data are generated from linear models, we show that both the train-val and train-train losses are minimized at the optimal prior in expectation. Further, perhaps surprisingly, our main result shows that the train-train method achieves a strictly better excess loss in this realizable case, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that sample splitting may not always be preferable, especially when the data are realizable by the model. We validate our theories by experimentally showing that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.
R. Wang, D. P. Foster, and S. M. Kakade, “What are the Statistical Limits of Offline RL with Linear Function Approximation?” in International Conference on Learning Representations, 2021. arXiv Version.
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. This work focuses on the basic question of what are the necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: 1) we have realizability in that the true value function of every policy is linear in a given set of features, and 2) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift,” Journal of Machine Learning Research, vol. 22, 2021. arXiv Version.
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
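For the tabular softmax case discussed in this abstract, a minimal illustrative sketch (the MDP, step size, and use of exact gradients are my own choices, not the paper's experiments) is exact policy gradient ascent on the discounted return of a small random MDP:

```python
# Minimal sketch: exact policy gradient ascent with a tabular softmax policy
# on a small random MDP, tracking the gap to the optimal value. Illustrative
# only; the MDP and hyperparameters are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a next-state distribution
R = rng.uniform(size=(S, A))                   # rewards in [0, 1]
rho = np.ones(S) / S                           # start-state distribution

def value(pi):                                 # exact policy evaluation
    P_pi = np.einsum('sap,sa->sp', P, pi)
    R_pi = np.sum(pi * R, axis=1)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def q_values(pi):
    return R + gamma * P @ value(pi)           # shape (S, A)

# Optimal value via value iteration, for reference.
V_star = np.zeros(S)
for _ in range(2000):
    V_star = np.max(R + gamma * P @ V_star, axis=1)

theta = np.zeros((S, A))                       # softmax policy parameters
for it in range(5000):
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    Q = q_values(pi)
    advantage = Q - np.sum(pi * Q, axis=1, keepdims=True)
    # Discounted state visitation under pi, starting from rho.
    P_pi = np.einsum('sap,sa->sp', P, pi)
    d_pi = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    grad = (d_pi[:, None] / (1 - gamma)) * pi * advantage  # softmax policy gradient
    theta += 1.0 * grad
    if it % 1000 == 0:
        print(f"iter {it:5d}  optimality gap {float(rho @ (V_star - value(pi))):.4f}")
```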
C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points,” Journal of the ACM, 2021. arXiv Version.
Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient---their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points.
2020
N. Agarwal, S. Kakade, R. Kidambi, Y. T. Lee, P. Netrapalli, and A. Sidford, Leverage Score Sampling for Faster Accelerated Regression and Empirical Risk Minimization. ALT: ArXiv Report, 2020. Publisher's Version.
Given a matrix A ∈ ℝ^(n×d) and a vector b ∈ ℝ^n, we show how to compute an ϵ-approximate solution to the regression problem min_{x ∈ ℝ^d} (1/2)‖Ax − b‖₂² in time Õ((n + √(d·κ_sum)) · s · log(1/ϵ)), where κ_sum = tr(A⊤A)/λ_min(A⊤A) and s is the maximum number of non-zero entries in a row of A. Our algorithm improves upon the previous best running time of Õ((n + √(n·κ_sum)) · s · log(1/ϵ)). We achieve our result through a careful combination of leverage score sampling techniques, proximal point methods, and accelerated coordinate descent. Our method not only matches the performance of previous methods, but further improves whenever leverage scores of rows are small (up to polylogarithmic factors). We also provide a non-linear generalization of these results that improves the running time for solving a broader class of ERM problems.
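As a hedged illustration of the basic primitive in the title -- sampling rows of A proportionally to their statistical leverage scores and solving the reweighted, smaller regression -- here is a toy numpy sketch; it is not the paper's accelerated algorithm, and the problem sizes are arbitrary.

```python
# Toy sketch of leverage score sampling for least squares: sample rows of A
# with probability proportional to their leverage scores, reweight, and solve
# the much smaller regression. Not the accelerated method from the paper.
import numpy as np

rng = np.random.default_rng(3)
n, d = 20000, 50
A = rng.normal(size=(n, d)) * rng.uniform(0.1, 3.0, size=(n, 1))  # uneven rows
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=n)

# Leverage score of row i is the squared norm of the i-th row of U, where
# A = U S V^T is a thin SVD; the scores sum to rank(A) = d.
U = np.linalg.svd(A, full_matrices=False)[0]
lev = np.sum(U ** 2, axis=1)
probs = lev / lev.sum()

m = 2000                                        # sketch size
idx = rng.choice(n, size=m, replace=True, p=probs)
w = 1.0 / np.sqrt(m * probs[idx])               # reweighting for unbiasedness
A_s, b_s = w[:, None] * A[idx], w * b[idx]

x_full = np.linalg.lstsq(A, b, rcond=None)[0]
x_sketch = np.linalg.lstsq(A_s, b_s, rcond=None)[0]
print("relative error of sketched solve:",
      np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full))
```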
C. Ilin, S. E. Annan-Phan, X. H. Tai, S. Mehra, S. M. Hsiang, and J. E. Blumenstock, Public Mobility Data Enables COVID-19 Forecasting and Management at Local and Global Scales, vol. 28120. NBER: ArXiv Report, 2020. Publisher's Version.
Policymakers everywhere are working to determine the set of restrictions that will effectively contain the spread of COVID-19 without excessively stifling economic activity. We show that publicly available data on human mobility — collected by Google, Facebook, and other providers — can be used to evaluate the effectiveness of non-pharmaceutical interventions and forecast the spread of COVID-19. This approach relies on simple and transparent statistical models, and involves minimal assumptions about disease dynamics. We demonstrate the effectiveness of this approach using local and regional data from China, France, Italy, South Korea, and the United States, as well as national data from 80 countries around the world.
C. Jin, S. M. Kakade, A. Krishnamurthy, and Q. Liu, Sample-Efficient Reinforcement Learning of Undercomplete POMDPs. NeurIPS: ArXiv Report, 2020. Publisher's Version.
Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge leads to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of POMDPs. In particular, we present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works. OOM-UCB achieves an optimal sample complexity of Õ(1/ε²) for finding an ε-optimal policy, along with being polynomial in all other relevant quantities. As an interesting special case, we also provide a computationally and statistically efficient algorithm for POMDPs with deterministic state transitions.
C. Wei, S. Kakade, and T. Ma, The Implicit and Explicit Regularization Effects of Dropout. NeurIPS: ArXiv Report, 2020. Publisher's Version.
Dropout is a widely-used regularization technique, often required to obtain state-of-the-art performance for a number of architectures. This work demonstrates that dropout introduces two distinct but entangled regularization effects: an explicit effect (also studied in prior work), which occurs since dropout modifies the expected training objective, and, perhaps surprisingly, an additional implicit effect from the stochasticity in the dropout training update. This implicit regularization effect is analogous to the effect of stochasticity in small mini-batch stochastic gradient descent. We disentangle these two effects through controlled experiments. We then derive analytic simplifications which characterize each effect in terms of the derivatives of the model and the loss, for deep neural networks. We demonstrate that these simplified, analytic regularizers accurately capture the important aspects of dropout, showing that they can faithfully replace dropout in practice.
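A small numeric sketch of the explicit effect mentioned above, specialized to linear regression (this closed form is the standard one for linear models and is offered only as an illustration, not the paper's deep-network derivation): the expected inverted-dropout objective equals the clean squared error plus a data-dependent ℓ2-style penalty.

```python
# Monte Carlo check (linear regression only, a standard identity used here as
# an illustration): the expected dropout objective equals the clean squared
# error plus a data-dependent penalty, (p / (1 - p)) * sum_j x_j^2 w_j^2.
import numpy as np

rng = np.random.default_rng(4)
d, p = 10, 0.3                                  # input dimension, drop probability
x, w = rng.normal(size=d), rng.normal(size=d)
y = 1.7

def expected_dropout_loss(n_samples=500_000):
    mask = rng.random((n_samples, d)) > p       # keep each coordinate with prob 1 - p
    x_tilde = mask * x / (1 - p)                # inverted-dropout scaling
    return np.mean((x_tilde @ w - y) ** 2)

clean = (x @ w - y) ** 2
explicit_penalty = p / (1 - p) * np.sum(x ** 2 * w ** 2)
print("Monte Carlo expected dropout loss:", expected_dropout_loss())
print("clean loss + explicit penalty    :", clean + explicit_penalty)
```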
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. COLT: ArXiv Report, 2020. Publisher's Version.
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
K. Zhang, S. M. Kakade, T. Başar, and L. F. Yang, Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity. NeurIPS: ArXiv Report, 2020. Publisher's Version.
Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address this fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of Õ(|S||A||B|(1−γ)⁻³ϵ⁻²) for finding the Nash equilibrium (NE) value up to some ϵ error, and the ϵ-NE policies with a smooth planning oracle, where γ is the discount factor and S, A, B denote the state space and the action spaces for the two agents. We further show that this sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, i.e., it queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, which has an Ω̃(|S|(|A|+|B|)(1−γ)⁻³ϵ⁻²) lower bound and in which this model-based approach is near-optimal with only a gap in the |A|, |B| dependence. Our results not only demonstrate the sample efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and its limitation (being less adaptive and suboptimal in |A|, |B|), a tradeoff that particularly arises in the multi-agent context.
N. Kohli, E. Aiken, and J. E. Blumenstock, Privacy Guarantees for Personal Mobility Data in Humanitarian Response. KDD 2020 Workshop on Humanitarian Mapping, San Diego, CA, 2020. Publisher's Version.
A. Agarwal, M. Henaff, S. Kakade, and W. Sun, PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning. NeurIPS: ArXiv Report, 2020. Publisher's Version.
Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case ℓ∞ assumptions; this includes approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
A. Agarwal, S. Kakade, A. Krishnamurthy, and W. Sun, FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs. NeurIPS: ArXiv Report, 2020. Publisher's Version.
In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common practice to make parametric assumptions where values or policies are functions of some low dimensional feature space. This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. Structurally, we make precise connections between these low rank MDPs and latent variable models, showing how they significantly generalize prior formulations for representation learning in RL. Algorithmically, we develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
R. Wang, S. S. Du, L. F. Yang, and S. M. Kakade, Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning? NeurIPS: ArXiv Report, 2020. Publisher's Version.
Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is ε-near to the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an ε-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.
S. Kakade, P. Liang, V. Sharan, and G. Valiant, Prediction with a Short Memory. STOC: ArXiv Report, 2020. Publisher's Version.
We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observations together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by I, then a simple Markov model over the most recent I/ϵ observations obtains expected KL error ϵ -- and hence ℓ1 error √ϵ -- with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with n hidden states, I is bounded by log n, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length-O(log n/ϵ) windows of observations achieves this error, provided the length of the sequence is d^(Ω(log n/ϵ)), where d is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with n hidden states, a window length of log n/ϵ is information-theoretically necessary to achieve expected ℓ1 error √ϵ. Second, the d^(Θ(log n/ϵ)) samples required to estimate the Markov model for an observation alphabet of size d are necessary for any computationally tractable learning algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
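A hedged toy version of the "trivial prediction algorithm" described above -- empirical frequencies of short windows -- on an HMM-generated sequence; the HMM, window length, and add-one smoothing below are illustrative choices, not the paper's construction.

```python
# Toy sketch: predict the next symbol of an HMM-generated sequence using only
# empirical frequencies of short windows (an order-k Markov model with add-one
# smoothing). The HMM, window length, and smoothing are illustrative choices.
import numpy as np
from collections import Counter, defaultdict

rng = np.random.default_rng(5)
n_hidden, n_obs, length, k = 4, 3, 100_000, 3
T = rng.dirichlet(np.ones(n_hidden), size=n_hidden)    # hidden-state transitions
O = rng.dirichlet(np.ones(n_obs), size=n_hidden)       # emission probabilities

h, seq = 0, []
for _ in range(length):
    seq.append(rng.choice(n_obs, p=O[h]))
    h = rng.choice(n_hidden, p=T[h])

train, test = seq[: length // 2], seq[length // 2 :]

counts = defaultdict(Counter)                           # window -> next-symbol counts
for i in range(len(train) - k):
    counts[tuple(train[i : i + k])][train[i + k]] += 1

def avg_log_loss(data):
    total = 0.0
    for i in range(len(data) - k):
        c = counts[tuple(data[i : i + k])]
        prob = (c[data[i + k]] + 1) / (sum(c.values()) + n_obs)  # add-one smoothing
        total += -np.log(prob)
    return total / (len(data) - k)

print("order-%d window model avg log loss: %.4f" % (k, avg_log_loss(test)))
print("uniform-prediction baseline       : %.4f" % np.log(n_obs))
```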
S. Kakade, A. Krishnamurthy, K. Lowrey, M. Ohnishi, and W. Sun, Information Theoretic Regret Bounds for Online Nonlinear Control. NeurIPS: ArXiv Report, 2020. Publisher's Version.

This work studies the problem of sequential control in an unknown, nonlinear dynamical system, where we model the underlying system dynamics as an unknown function in a known Reproducing Kernel Hilbert Space. This framework yields a general setting that permits discrete and continuous control inputs as well as non-smooth, non-differentiable dynamics. Our main result, the Lower Confidence-based Continuous Control (LC3) algorithm, enjoys a near-optimal O(√T) regret bound against the optimal controller in episodic settings, where T is the number of episodes. The bound has no explicit dependence on the dimension of the system dynamics, which could be infinite, but instead depends only on information-theoretic quantities. We empirically show its application to a number of nonlinear control tasks and demonstrate the benefit of exploration for learning model dynamics.

Project website.

