S. M. Kakade, “On the Sample Complexity of Reinforcement Learning,” PhD thesis, 2003.

This thesis is a detailed investigation into the following question: how much data must an agent collect in order to perform "reinforcement learning" successfully? This question is analogous to the classical issue of the sample complexity in supervised learning, but is harder because of the increased realism of the reinforcement learning setting. This thesis summarizes recent sample complexity results in the reinforcement learning literature and builds on these results to provide novel algorithms with strong performance guarantees.

We focus on a variety of reasonable performance criteria and sampling models by which agents may access the environment. For instance, in a policy search setting, we consider the problem of how much simulated experience is required to reliably choose a "good" policy among a restricted class of policies \Pi (as in Kearns, Mansour, and Ng [2000]). In a more online setting, we consider the case in which an agent is placed in an environment and must follow one unbroken chain of experience with no access to "offline" simulation (as in Kearns and Singh [1998]).

We build on the sample based algorithms suggested by Kearns, Mansour, and Ng [2000]. Their sample complexity bounds have no dependence on the size of the state space, an exponential dependence on the planning horizon time, and linear dependence on the complexity of \Pi . We suggest novel algorithms with more restricted guarantees whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class \Pi , but have only a polynomial dependence on the horizon time. We pay particular attention to the tradeoffs made by such algorithms.


T. Xie, D. J. Foster, Y. Bai, N. Jiang, and S. M. Kakade, “The Role of Coverage in Online Reinforcement Learning,” 2022. arXiv Version
Coverage conditions -- which assert that the data logging distribution adequately covers the state space -- play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing -- somewhat surprisingly -- that the mere existence of a data distribution with good coverage can enable sample-efficient online RL. Concretely, we show that coverability -- that is, existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability -- can be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms for sample-efficient exploration, even when the agent does not know said distribution. We complement this result by proving that several weaker notions of coverage, despite being sufficient for offline RL, are insufficient for online RL. We also show that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and propose a new complexity measure, the sequential extrapolation coefficient, to provide a unification.
D. Madeka, K. Torkkola, C. Eisenach, A. Luo, D. P. Foster, and S. M. Kakade, “Deep Inventory Management,” 2022. arXiv Version
This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, our results show that several policy learning approaches are competitive with or outperform classical methods. In order to train these algorithms, we develop novel techniques to convert historical data into a simulator. On the theoretical side, we present learnability results on a subclass of inventory control problems, where we provide a provable reduction of the reinforcement learning problem to that of supervised learning. On the algorithmic side, we present a model-based reinforcement learning procedure (Direct Backprop) to solve the periodic review inventory control problem by constructing a differentiable simulator. Under a variety of metrics Direct Backprop outperforms model-free RL and newsvendor baselines, in both simulations and real-world deployments.
J. C. Perdomo, A. Krishnamurthy, P. Bartlett, and S. Kakade, “A Sharp Characterization of Linear Estimators for Offline Policy Evaluation,” ArXiv Report, 2022. Publisher's Version
Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature.
In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation (OPE), unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of OPE.
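As a concrete illustration of the two classical estimators analyzed in this paper, here is a minimal numpy sketch of LSTD and linear fitted iteration for policy evaluation. The tiny tabular setup and all variable names are our own illustrative choices, not the paper's.

```python
import numpy as np

def lstd(phi, phi_next, r, gamma):
    """Least-squares temporal difference: solve
    Phi^T (Phi - gamma * Phi') theta = Phi^T r for the value weights."""
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ r
    return np.linalg.solve(A, b)

def fqi(phi, phi_next, r, gamma, iters=200):
    """Fitted iteration for policy evaluation: repeatedly regress the
    bootstrapped target r + gamma * V(next) onto the features."""
    theta = np.zeros(phi.shape[1])
    proj = np.linalg.pinv(phi)  # least-squares projection onto the features
    for _ in range(iters):
        target = r + gamma * (phi_next @ theta)
        theta = proj @ target
    return theta
```

On a two-state chain with tabular (identity) features, both recover the exact value function; their behavior differs only once function approximation makes the projection lossy.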
D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin, The Statistical Complexity of Interactive Decision Making. ArXiv Report, 2021. Publisher's Version
A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This question is analogous to the classical problem of optimal (supervised) statistical learning, where there are well-known complexity measures (e.g., VC dimension and Rademacher complexity) that govern the statistical complexity of learning. However, characterizing the statistical complexity of interactive learning is substantially more challenging due to the adaptive nature of the problem. The main result of this work provides a complexity measure, the Decision-Estimation Coefficient, that is proven to be both necessary and sufficient for sample-efficient interactive learning. In particular, we provide:
1. a lower bound on the optimal regret for any interactive decision making problem, establishing the Decision-Estimation Coefficient as a fundamental limit.
2. a unified algorithm design principle, Estimation-to-Decisions (E2D), which transforms any algorithm for supervised estimation into an online algorithm for decision making. E2D attains a regret bound matching our lower bound, thereby achieving optimal sample-efficient learning as characterized by the Decision-Estimation Coefficient.
Taken together, these results constitute a theory of learnability for interactive decision making. When applied to reinforcement learning settings, the Decision-Estimation Coefficient recovers essentially all existing hardness results and lower bounds. More broadly, the approach can be viewed as a decision-theoretic analogue of the classical Le Cam theory of statistical estimation; it also unifies a number of existing approaches -- both Bayesian and frequentist.

COVID-19 White Papers

D. Siddarth, et al., Mitigate/Suppress/Maintain: Local Targets for Victory Over COVID. Edmond J. Safra Center for Ethics: Rapid Response Initiative, 2020. Publisher's Version
There is growing consensus around a strategy centered on testing, tracing, and supported isolation (TTSI) to suppress COVID, save lives, and safely reopen the economy. Given the high prevalence the disease has reached in many OECD countries, this requires the expansion of TTSI inputs to scales and with a speed that have little precedent (Siddarth and Weyl, 2020). As scarcity of testing supplies is expected to remain during the build-up to a surge, authorities will need to allocate these tests within and across localities to minimize loss of life. This paper documents a preliminary framework that aims to provide such guidance to multiscale geographies, in conjunction with our previous recommendations. Given unfortunate limits in current testing capacity, we describe a testing and tracing regime in which orders of magnitude fewer resources can be used to suppress the disease. Such suppression should be rapidly scaled in low-prevalence areas (within and across regions) to the extent that those areas can be insulated from other areas. In particular, if travel restrictions are used to allow asynchronous suppression, and if logistics permit the use of mobile resources, a smaller number of tests per day may be required. At the same time, vulnerable communities and essential workforces must be protected across regions, as prescribed in Phase I of the Roadmap to Pandemic Resilience (Allen et al., 2020).
D. S. Allen, A. Bassuk, S. Block, G. J. Busenberg, and M. Charpignon, Pandemic Resilience: Getting it Done. Edmond J. Safra Center for Ethics: Rapid Response Initiative, 2020. Publisher's Version
On April 27, the CDC [Centers for Disease Control and Prevention] changed its guidance to support broader use of testing not only for therapeutic purposes, but also for disease control. In the most recent guidance, released May 3, first priority goes to hospitalized patients, first responders with symptoms, and residents in congregate living contexts with symptoms. But there is now also a second priority category that includes asymptomatic individuals from groups experiencing disparate impacts of the disease and 'persons without symptoms who are prioritized by health departments or clinicians, for any reason, including but not limited to: public health monitoring, sentinel surveillance, or 'screening of other asymptomatic individuals according to state and local plans' (bold in original, italics added). The last phrase supports broad testing of contacts of COVID [coronavirus disease]-positive individuals and of essential workers, even when they have mild symptoms or none at all. This Supplement to our Roadmap to Pandemic Resilience offers guidance to help state and local governments develop TTSI (testing, tracing, and supported isolation) programs in support of such testing for purposes of disease control and suppression.
V. Hart, et al., Outpacing the Virus: Digital Response to Containing the Spread of COVID-19 while Mitigating Privacy Risks. Edmond J. Safra Center for Ethics: Rapid Response Initiative, 2020. Publisher's Version
There is a growing consensus that we must use a combined strategy of medical and technological tools to provide a response at a scale that can outpace the speed and proliferation of the SARS-CoV-2 virus. A process of identifying exposed individuals who have come into contact with diagnosed individuals, called “contact tracing,” has been shown to effectively enable suppression of new cases of SARS-CoV-2 (COVID-19). Important concerns around protecting patients’ confidentiality and civil liberties, together with a lack of familiarity with available privacy-protecting technologies, have led to suboptimal privacy implementations and hindered adoption. This paper reviews the trade-offs of these methods, their techniques, the necessary rate of adoption, and critical security and privacy controls and concerns for an information system that can accelerate medical response. Proactive use of intentionally designed technology can support voluntary participation from the public toward the goals of smart testing, effective resource allocation, and relaxing some physical distancing measures, but only when it guarantees and assures an individual’s complete control over disclosure and uses data in a way that protects individual rights.
J. Chan, et al., “PACT: Privacy Sensitive Protocols and Mechanisms for Mobile Contact Tracing,” IEEE Data Engineering Bulletin, pp. 15-35, 2020. Publisher's Version
The global health threat from COVID-19 has been controlled in a number of instances by large-scale testing and contact tracing efforts. We created this document to suggest three functionalities through which we might best harness computing technologies to support the goals of public health organizations in minimizing morbidity and mortality associated with the spread of COVID-19, while protecting the civil liberties of individuals. In particular, this work advocates for a third-party-free approach to assisted mobile contact tracing, because such an approach mitigates the security and privacy risks of requiring a trusted third party. We also explicitly consider the inferential risks involved in any contact tracing system, where any alert to a user could itself give rise to de-anonymizing information. More generally, we hope to participate in bringing together colleagues in industry, academia, and civil society to discuss and converge on ideas around a critical issue arising from attempts to mitigate the COVID-19 pandemic.

Technical Notes

K. Huang, S. M. Kakade, J. D. Lee, and Q. Lei, A Short Note on the Relationship of Information Gain and Eluder Dimension. ArXiv Report, 2021. arXiv Version
Eluder dimension and information gain are two complexity measures widely used in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples where it is known to be small are function spaces (vector spaces). In these cases, the primary tool for upper bounding the eluder dimension is the elliptic potential lemma. Interestingly, the elliptic potential lemma also features prominently in the analysis of linear bandits/reinforcement learning and their nonparametric generalization, the information gain. We show that this is not a coincidence -- eluder dimension and information gain are equivalent in a precise sense for reproducing kernel Hilbert spaces.
E. Hazan and S. Kakade, Revisiting the Polyak step size. ArXiv Report, 2019. Publisher's Version
This paper revisits the Polyak step size schedule for convex optimization problems, proving that a simple variant of it simultaneously attains near optimal convergence rates for the gradient descent algorithm, for all ranges of strong convexity, smoothness, and Lipschitz parameters, without a priori knowledge of these parameters.
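For reference, the classical Polyak rule that this variant builds on sets the stepsize from the current suboptimality gap, eta_t = (f(x_t) - f*) / ||grad f(x_t)||^2. A minimal sketch of gradient descent with this rule (the textbook version requiring knowledge of f*, not the paper's variant) is:

```python
import numpy as np

def polyak_gd(f, grad, x0, f_star, steps=100):
    """Gradient descent with the classical Polyak step size
    eta_t = (f(x_t) - f*) / ||grad f(x_t)||^2.
    Needs the optimal value f*, but no smoothness or strong-convexity
    constants -- the parameter-freeness the paper's variant preserves."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        gnorm2 = g @ g
        if gnorm2 == 0.0:  # at a stationary point
            break
        eta = (f(x) - f_star) / gnorm2
        x = x - eta * g
    return x
```

On the quadratic f(x) = ||x - a||^2 the rule gives a constant stepsize of 1/4, halving the distance to the optimum at every iteration.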
C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm. ArXiv Report, 2019. Publisher's Version
In this note, we derive concentration inequalities for random vectors with subGaussian norm (a generalization of both subGaussian random vectors and norm bounded random vectors), which are tight up to logarithmic factors.
D. P. Foster, S. M. Kakade, and T. Zhang, Multi-View Dimensionality Reduction via Canonical Correlation Analysis. Technical Report TTI-TR-2008-4, Toyota Technological Institute-Chicago, 2008. Publisher's Version
We analyze the multi-view regression problem where we have two views X = (X(1),X(2)) of the input data and a target variable Y of interest. We provide sufficient conditions under which we can reduce the dimensionality of X (via a projection) without losing predictive power for Y. Crucially, this projection can be computed via a Canonical Correlation Analysis only on the unlabeled data. The algorithmic template is as follows: with unlabeled data, perform CCA and construct a certain projection; with the labeled data, do least squares regression in this lower dimensional space. We show how, under certain natural assumptions, the number of labeled samples could be significantly reduced (in comparison to the single view setting) — in particular, we show how this dimensionality reduction does not lose predictive power for Y (thus it introduces only a little bias but could drastically reduce the variance). We explore two separate assumptions under which this is possible and show how, under either assumption alone, dimensionality reduction could reduce the labeled sample complexity. The two assumptions we consider are a conditional independence assumption and a redundancy assumption. The typical conditional independence assumption is that the views X(1) and X(2) are independent conditioned on Y — we relax this to require independence only conditioned on some hidden state H. Under the redundancy assumption, we have that the best predictor from each view is roughly as good as the best predictor using both views.
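The algorithmic template above is short enough to sketch directly. The numpy version below (regularization constant, data shapes, and the whitened-SVD route to CCA are our illustrative choices) performs CCA on unlabeled pairs of views, then least squares on the projected labeled view:

```python
import numpy as np

def cca_projection(x1, x2, k, reg=1e-6):
    """Top-k CCA directions for view 1, computed from unlabeled data only."""
    x1 = x1 - x1.mean(0)
    x2 = x2 - x2.mean(0)
    n = x1.shape[0]
    s11 = x1.T @ x1 / n + reg * np.eye(x1.shape[1])
    s22 = x2.T @ x2 / n + reg * np.eye(x2.shape[1])
    s12 = x1.T @ x2 / n

    def inv_sqrt(s):  # symmetric inverse square root via eigendecomposition
        w, v = np.linalg.eigh(s)
        return v @ np.diag(w ** -0.5) @ v.T

    r1, r2 = inv_sqrt(s11), inv_sqrt(s22)
    u, _, _ = np.linalg.svd(r1 @ s12 @ r2)
    return r1 @ u[:, :k]  # (d1 x k) projection for view 1

def multi_view_regression(x1_unlab, x2_unlab, x1_lab, y_lab, k):
    """The template from the abstract: CCA on unlabeled data, then least
    squares regression in the k-dimensional projected space of view 1."""
    w = cca_projection(x1_unlab, x2_unlab, k)
    z = x1_lab @ w
    beta, *_ = np.linalg.lstsq(z, y_lab, rcond=None)
    return w, beta
```

When both views are noisy copies of a shared signal, the top canonical direction aligns with that signal, so a one-dimensional regression already predicts Y well.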


J. Wu, D. Zou, V. Braverman, Q. Gu, and S. M. Kakade, “The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift,” NeurIPS, 2022. arXiv Version
We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted by online SGD) for this problem. We establish sharp instance-dependent excess risk upper and lower bounds for this approach. Our bounds suggest that for a large class of linear regression instances, transfer learning with O(N^2) source data (and scarce or no target data) is as effective as supervised learning with N target data. In addition, we show that finetuning, even with only a small amount of target data, could drastically reduce the amount of source data required by pretraining. Our theory sheds light on the effectiveness and limitation of pretraining as well as the benefits of finetuning for tackling covariate shift problems.
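The pipeline analyzed here is simply online SGD run first on source samples and then continued from the same iterate on target samples. A minimal sketch (the stepsizes and the noiseless test case are our choices, not the paper's tuned schedule):

```python
import numpy as np

def sgd_pass(w, X, y, lr):
    """One pass of online SGD on the squared loss, one sample per step."""
    for x_i, y_i in zip(X, y):
        w = w - lr * (x_i @ w - y_i) * x_i
    return w

def pretrain_finetune(Xs, ys, Xt, yt, lr_s=0.01, lr_t=0.01):
    """Pretrain on the source domain, then finetune the same iterate on
    the (possibly scarce) target domain."""
    w = np.zeros(Xs.shape[1])
    w = sgd_pass(w, Xs, ys, lr_s)  # pretraining on source data
    w = sgd_pass(w, Xt, yt, lr_t)  # finetuning on target data
    return w
```

Because the regression vector is shared across domains, pretraining alone already drives the iterate toward the truth; finetuning corrects for the shifted covariate directions that the source data underweights.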
B. Barak, B. L. Edelman, S. Goel, S. Kakade, E. Malach, and C. Zhang, “Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit,” NeurIPS, 2022. arXiv Version
There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning k-sparse parities of n bits, a canonical family of problems which pose theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with n^{O(k)} examples, with loss (and error) curves abruptly dropping after n^{O(k)} iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in n^{O(k)} time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
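To make the setup concrete, here is a minimal version of the (n, k)-sparse parity problem together with a plain per-sample SGD learner for a two-layer ReLU network. The width, stepsize, hinge loss, and initialization below are illustrative stand-ins, not the paper's training configuration:

```python
import numpy as np

def parity_data(num, n, support, rng):
    """Uniform inputs in {-1,+1}^n; the label is the parity (product) of
    the k coordinates in `support`, the hidden feature set."""
    X = rng.choice([-1.0, 1.0], size=(num, n))
    y = np.prod(X[:, support], axis=1)
    return X, y

def train_two_layer(X, y, width=64, lr=0.05, epochs=10, seed=0):
    """Per-sample SGD on a two-layer ReLU network with a hinge loss
    (an illustrative learner for the parity task above)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(width, n))
    a = rng.choice([-1.0, 1.0], size=width) / width
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = np.maximum(W @ x_i, 0.0)
            if y_i * (a @ h) < 1.0:            # hinge margin violated
                a_new = a + lr * y_i * h       # outer-layer gradient step
                W += lr * y_i * np.outer(a * (h > 0.0), x_i)
                a = a_new
    return W, a
```

The abrupt loss drops the paper reports show up when tracking the error of such a learner over training time as n, k, and the sample budget are scaled.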
N. Saunshi, et al., “Understanding Contrastive Learning Requires Incorporating Inductive Biases,” ICML, 2022. Publisher's Version
Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain the success of contrastive learning on downstream classification tasks prove guarantees depending on properties of augmentations and the value of the contrastive loss of representations. We demonstrate that such analyses, which ignore the inductive biases of the function class and training algorithm, cannot adequately explain the success of contrastive learning, even provably leading to vacuous guarantees in some settings. Extensive experiments on image and text domains highlight the ubiquity of this problem -- different function classes and algorithms behave very differently on downstream tasks, despite having the same augmentations and contrastive losses. Theoretical analysis is presented for the class of linear representations, where incorporating inductive biases of the function class allows contrastive learning to work with less stringent conditions compared to prior analyses.
J. Tuyls, S. Yao, S. Kakade, and K. Narasimhan, “Multi-Stage Episodic Control for Strategic Exploration in Text Games,” ICLR, 2022. Publisher's Version
Text adventure games present unique challenges to reinforcement learning methods due to their combinatorially large action spaces and sparse rewards. The interplay of these two factors is particularly demanding because large action spaces require extensive exploration, while sparse rewards provide limited feedback. This work proposes to tackle the explore-vs-exploit dilemma using a multi-stage approach that explicitly disentangles these two strategies within each episode. Our algorithm, called eXploit-Then-eXplore (XTX), begins each episode using an exploitation policy that imitates a set of promising trajectories from the past, and then switches over to an exploration policy aimed at discovering novel actions that lead to unseen state spaces. This policy decomposition allows us to combine global decisions about which parts of the game space to return to with curiosity-based local exploration in that space, motivated by how a human may approach these games. Our method significantly outperforms prior approaches by 27% and 11% average normalized score over 12 games from the Jericho benchmark (Hausknecht et al., 2020) in both deterministic and stochastic settings, respectively. On the game of Zork1, in particular, XTX obtains a score of 103, more than a 2x improvement over prior methods, and pushes past several known bottlenecks in the game that have plagued previous state-of-the-art methods.
D. Zou, J. Wu, V. Braverman, Q. Gu, and S. M. Kakade, “Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime,” NeurIPS, 2022. arXiv Version
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses are for single-pass SGD, which is a less practical variant than the commonly used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be too pessimistic to explain the superior generalization ability on particular problem instances. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wisely, than GD in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves the number of stochastic gradient evaluations, and therefore is preferable in terms of computational time.
A. Kusupati, et al., “Matryoshka Representations for Adaptive Deployment,” NeurIPS, 2022. arXiv Version
Learned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context rigid, fixed-capacity representations can be either over- or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL), which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility of the learned Matryoshka Representations offers: (a) up to 14x smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14x real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities -- vision (ViT, ResNet), vision + language (ALIGN) and language (BERT). MRL code and pretrained models are open-sourced at this https URL.
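The coarse-to-fine idea can be sketched as one objective: the same embedding is truncated to each nested granularity, scored by that granularity's own linear head, and the per-granularity losses are summed. The dimensions, softmax cross-entropy loss, and head layout below are our illustrative choices, not the paper's exact training recipe:

```python
import numpy as np

def mrl_loss(z, y, heads, dims):
    """Matryoshka-style objective: truncate embedding z to each granularity
    m in `dims`, classify with that granularity's head, and sum the
    cross-entropy losses so every prefix of z stays useful on its own."""
    total = 0.0
    for m in dims:
        logits = z[:, :m] @ heads[m]          # use only the first m coords
        logits = logits - logits.max(1, keepdims=True)  # stable softmax
        p = np.exp(logits)
        p /= p.sum(1, keepdims=True)
        total += -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    return total
```

At deployment time, one simply keeps the first m coordinates of the trained embedding, with m chosen to fit the task's compute budget.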
J. T. Ash, C. Zhang, S. Goel, A. Krishnamurthy, and S. Kakade, Anti-Concentrated Confidence Bonuses for Scalable Exploration. ICLR, 2021. Publisher's Version
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement learning. The LinUCB algorithm, a centerpiece of the stochastic linear bandits literature, prescribes an elliptical bonus which addresses the challenge of leveraging shared information in large action spaces. This bonus scheme cannot be directly transferred to high-dimensional exploration problems, however, due to the computational cost of maintaining the inverse covariance matrix of action features. We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus, using an ensemble of regressors trained to predict random noise from policy network-derived features. Using this approximation, we obtain stochastic linear bandit algorithms which attain Õ(d√T) regret bounds for poly(d) fixed actions. We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks.
B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, Inductive Biases and Variable Creation in Self-Attention Mechanisms. ICML, 2021. Publisher's Version
Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules, where our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables: they can represent sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length. Furthermore, we propose new experimental protocols to support this analysis and to guide the practice of training Transformers, built around the large body of work on provably learning sparse Boolean functions.
Y. Efroni, S. Kakade, A. Krishnamurthy, and C. Zhang, Sparsity in Partially Controllable Linear Systems. ICML, 2021. Publisher's Version
A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the control inputs; such systems are only partially controllable. The focus of this work is on a large class of partially controllable linear dynamical systems, specified by an underlying sparsity pattern. Our main results establish structural conditions and finite-sample guarantees for learning to control such systems. In particular, our structural results characterize those state variables which are irrelevant for optimal control, an analysis which departs from classical control techniques. Our algorithmic results adapt techniques from high-dimensional statistics -- specifically soft-thresholding and semiparametric least-squares -- to exploit the underlying sparsity pattern in order to obtain finite-sample guarantees that significantly improve over those based on certainty-equivalence. We also corroborate these theoretical improvements over certainty-equivalent control through a simulation study.
J. Wu, D. Zou, V. Braverman, Q. Gu, and S. M. Kakade, Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression. ICML, 2021. Publisher's Version
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by multiple geometric stepsize decays, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regression problems (Ge et al., 2019), and provably outperforms SGD with polynomially decaying stepsize in terms of the statistical minimax rates. However, a sharp analysis for the last iterate of SGD with decaying stepsize in the overparameterized setting was still open. In this paper, we provide a problem-dependent analysis of the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems. In particular, for SGD with geometrically decaying stepsize (or tail geometrically decaying stepsize), we prove nearly matching upper and lower bounds on the excess risk. Our results demonstrate the generalization ability of SGD for a wide class of overparameterized problems, and can recover the minimax optimal results up to logarithmic factors in the classical regime. Moreover, we provide an excess risk lower bound for SGD with polynomially decaying stepsize and illustrate the advantage of geometrically decaying stepsize in an instance-wise manner, which complements the minimax rate comparison made in previous work.
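The schedule in question (constant at first, then multiplied by a fixed factor at each phase boundary) can be written down directly; the phase count and decay factor below are illustrative choices:

```python
import numpy as np

def geometric_stepsizes(eta0, total_steps, num_phases=4, decay=0.5):
    """Geometrically decaying stepsize schedule: the horizon is split into
    equal phases (total_steps assumed divisible by num_phases) and the
    stepsize is multiplied by `decay` at each phase change."""
    phase_len = total_steps // num_phases
    return np.array([eta0 * decay ** (t // phase_len)
                     for t in range(total_steps)])
```

A polynomially decaying schedule would instead set eta_t proportional to eta0 / (t + 1)^p; the paper's lower bound shows why the geometric schedule above is preferable instance-wise.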
K. Pillutla, S. M. Kakade, and Z. Harchaoui, “Robust Aggregation for Federated Learning,” IEEE Transactions on Signal Processing, 2022. arXiv Version

We present a robust aggregation approach to make federated learning robust to settings in which a fraction of the devices may be sending corrupted updates to the server. The proposed approach relies on a robust secure aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of calls to a regular non-robust secure average oracle. The robust aggregation oracle is privacy-preserving, similar to the secure average oracle it builds upon. We provide experimental results of the proposed approach with linear models and deep networks for two tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption.
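The geometric-median aggregate at the heart of this approach can be sketched with the classical smoothed Weiszfeld iteration. This is an illustrative stand-in for the aggregation step only, not the paper's privacy-preserving secure-oracle implementation:

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    """Smoothed Weiszfeld iteration for the geometric median: repeatedly
    re-average the points with weights inversely proportional to their
    distance from the current estimate."""
    z = points.mean(axis=0)  # initialize at the (non-robust) mean
    for _ in range(iters):
        d = np.linalg.norm(points - z, axis=1)
        w = 1.0 / np.maximum(d, eps)  # eps guards against division by zero
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z
```

Unlike the coordinate-wise mean, the geometric median stays near the honest cluster of updates even when a single device submits an arbitrarily corrupted vector.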



D. Zou, J. Wu, V. Braverman, Q. Gu, D. P. Foster, and S. M. Kakade, The Benefits of Implicit Regularization from SGD in Least Squares Problems. NeurIPS, 2021, pp. 39. Publisher's Version. Abstract:
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance. Taken together, our results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
B. Huang, et al., Going Beyond Linear RL: Sample Efficient Neural Function Approximation. NeurIPS, 2021. Publisher's Version. Abstract:
Deep Reinforcement Learning (RL) powered by neural net approximation of the Q function has had enormous empirical success. While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions. This is the focus of this work, where we study function approximation with two-layer neural networks (considering both ReLU and polynomial activation functions). Our first result is a computationally and statistically efficient algorithm in the generative model setting under completeness for two-layer neural networks. Our second result considers this setting but under only realizability of the neural net function class. Here, assuming deterministic dynamics, the sample complexity scales linearly in the algebraic dimension. In all cases, our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
S. S. Du, et al., Bilinear Classes: A Structural Framework for Provable Generalization in RL. ICML: ArXiv Report, 2021. Publisher's Version. Abstract:
This work introduces Bilinear Classes, a new structural framework that permits generalization in reinforcement learning in a wide variety of settings through the use of function approximation. The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear Q∗/V∗ model, in which both the optimal Q-function and the optimal V-function are linear in some known feature space. Our main result provides an RL algorithm with polynomial sample complexity for Bilinear Classes; notably, this sample complexity is stated in terms of a reduction to the generalization error of an underlying supervised learning sub-problem. These bounds nearly match the best known sample complexity bounds for existing models. Furthermore, the framework also extends to the infinite-dimensional (RKHS) setting: for the Linear Q∗/V∗ model, linear MDPs, and linear mixture MDPs, we provide sample complexities that have no explicit dependence on the feature dimension (which could be infinite) and instead depend only on information-theoretic quantities.
B. Huang, et al., Optimal Gradient-based Algorithms for Non-concave Bandit Optimization. NeurIPS, 2021. Publisher's Version. Abstract:
Bandit problems with linear or concave reward have been extensively studied, but relatively few works have studied bandits with non-concave reward. This work considers a large family of bandit problems where the unknown underlying reward function is non-concave, including the low-rank generalized linear bandit problem and the two-layer neural network bandit problem with polynomial activations. For the low-rank generalized linear bandit problem, we provide a minimax-optimal algorithm in the dimension, refuting both conjectures in [LMT21, JWWN19]. Our algorithms are based on a unified zeroth-order optimization paradigm that applies in great generality and attains optimal rates in several structured polynomial settings (in the dimension). We further demonstrate the applicability of our algorithms in RL in the generative model setting, resulting in improved sample complexity over prior approaches. Finally, we show that the standard optimistic algorithms (e.g., UCB) are sub-optimal by dimension factors. In the neural-net setting (with polynomial activation functions) with noiseless reward, we provide a bandit algorithm with sample complexity equal to the intrinsic algebraic dimension. Again, we show that optimistic approaches have worse sample complexity, polynomial in the extrinsic dimension (which could be exponentially worse in the polynomial degree).
J. T. Ash, S. Goel, A. Krishnamurthy, and S. Kakade, Gone Fishing: Neural Active Learning with Fisher Embeddings. NeurIPS, 2021. Publisher's Version. Abstract:
There is an increasing need for effective active learning algorithms that are compatible with deep neural networks. While there are many classic, well-studied sample selection methods, the non-convexity and varying internal representation of neural models make it unclear how to extend these approaches. This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks that addresses these concerns. BAIT draws inspiration from the theoretical analysis of maximum likelihood estimators (MLE) for parametric models. It selects batches of samples by optimizing a bound on the MLE error in terms of the Fisher information, which we show can be implemented efficiently at scale by exploiting linear-algebraic structure especially amenable to execution on modern hardware. Our experiments show that BAIT outperforms the previous state of the art on both classification and regression problems, and is flexible enough to be used with a variety of model architectures.
D. Zou, J. Wu, V. Braverman, Q. Gu, and S. M. Kakade, Benign Overfitting of Constant-Stepsize SGD for Linear Regression. COLT: ArXiv Report, 2021. Publisher's Version. Abstract:
There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific for SGD) and (ii) the bias bound provides a sharp geometric characterization in terms of the location of the initial iterate (and how it aligns with the data covariance matrix). We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares (minimum-norm interpolation) and ridge regression.
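The algorithm analyzed here is simple to write down. A minimal sketch of constant-stepsize SGD with iterate averaging for least squares (names and the synthetic setup are illustrative):

```python
import numpy as np

def averaged_sgd(X, y, eta=0.05, steps=2000, seed=0):
    """Constant-stepsize SGD for least squares; the output is the running
    average of the iterates rather than the last iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for t in range(steps):
        i = rng.integers(n)                     # sample one data point
        w -= eta * (X[i] @ w - y[i]) * X[i]     # single-sample gradient step
        avg += (w - avg) / (t + 1)              # running average of iterates
    return avg
```

No explicit regularizer appears anywhere; the bias-variance decomposition in the abstract characterizes when this unregularized procedure nonetheless generalizes.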
A. Kusupati, et al., LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes. NeurIPS, 2021. Publisher's Version. Abstract:
Learning binary representations of instances and classes is a classical problem with several high-potential applications. In modern settings, compressing high-dimensional neural representations into low-dimensional binary codes is a challenging task that often requires large bit-codes to be accurate. In this work, we propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes. Our method does not require any side information, like annotated attributes or label meta-data, and learns extremely low-dimensional binary codes (~20 bits for ImageNet-1K). The learnt codes are super-efficient while still ensuring nearly optimal classification accuracy for ResNet50 on ImageNet-1K. We demonstrate that the learnt codes capture intrinsically important features in the data by discovering an intuitive taxonomy over classes. We further quantitatively measure the quality of our codes by applying them to efficient image retrieval as well as out-of-distribution (OOD) detection. On the ImageNet-100 retrieval problem, our learnt binary codes outperform 16-bit HashNet using only 10 bits, and are as accurate as 10-dimensional real representations. Finally, our learnt binary codes can perform OOD detection, out of the box, as accurately as a baseline that needs ~3000 samples to tune its threshold, while we require none. Code and pre-trained models are available at this https URL.
R. Wang, Y. Wu, R. Salakhutdinov, and S. M. Kakade, Instabilities of Offline RL with Pre-Trained Neural Representation. ICML: ArXiv Report, 2021. Publisher's Version. Abstract:
In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated. Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold, else there are lower bounds exhibiting exponential error amplification (in the problem horizon) unless the data collection distribution has only a mild distribution shift relative to the target policy. This work studies these issues from an empirical perspective to gauge how stable offline RL methods are. In particular, our methodology explores these ideas when using features from pre-trained neural networks, in the hope that these representations are powerful enough to permit sample efficient offline RL. Through extensive experiments on a range of tasks, we see that substantial error amplification does occur even when using such pre-trained representations (trained on the same task itself); we find offline RL is stable only under extremely mild distribution shift. The implications of these results, both from a theoretical and an empirical perspective, are that successful offline RL (where we seek to go beyond the low distribution shift regime) requires substantially stronger conditions beyond those which suffice for successful supervised learning.
P. Nakkiran, P. Venkat, S. Kakade, and T. Ma, Optimal Regularization Can Mitigate Double Descent. ICLR: ArXiv Report, 2021. Publisher's Version. Abstract:
Recent empirical and theoretical studies have shown that many learning algorithms -- from linear regression to neural networks -- can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised the question of whether we need to rethink our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain linear regression models with isotropic data distributions, optimally-tuned ℓ2 regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned ℓ2 regularization can mitigate double descent for more general models, including neural networks. Our results suggest that it may also be informative to study the test risk scalings of various algorithms in the context of appropriately tuned regularization.
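A small simulation makes the point concrete: at the interpolation threshold (sample size equal to model size), the minimum-norm least-squares solution suffers the double-descent peak, while even a fixed, moderate ridge penalty tames it. The dimensions, noise level, and ridge parameter below are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 30, 30, 0.5                      # interpolation threshold: n == d
w_star = rng.normal(size=d) / np.sqrt(d)       # unit-scale true parameter
X = rng.normal(size=(n, d))
y = X @ w_star + sigma * rng.normal(size=n)

# Minimum-norm interpolant: its variance blows up as X becomes ill-conditioned.
w_min = np.linalg.pinv(X) @ y

# Ridge with a fixed, moderate penalty controls that variance.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# With isotropic test data, the test excess risk is the parameter-space error.
risk_min = np.sum((w_min - w_star) ** 2)
risk_ridge = np.sum((w_ridge - w_star) ** 2)
```

With high probability over the random draw, `risk_ridge` is far below `risk_min` at n = d; the theoretical result above shows that with the optimal penalty the whole risk curve becomes monotone.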
S. S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei, Few-Shot Learning via Learning the Representation, Provably. ICLR: ArXiv Report, 2021. Publisher's Version. Abstract:
This paper studies few-shot learning via representation learning, where one uses T source tasks with n1 data points per task to learn a representation in order to reduce the sample complexity of a target task for which there are only n2 (≪ n1) data points. Specifically, we focus on the setting where there exists a good "common representation" between source and target, and our goal is to understand how much of a sample size reduction is possible. First, we study the setting where this common representation is low-dimensional and provide a fast rate of O(C(Φ)/(n1 T) + k/n2); here, Φ is the representation function class, C(Φ) is its complexity measure, and k is the dimension of the representation. When specialized to linear representation functions, this rate becomes O(dk/(n1 T) + k/n2), where d (≫ k) is the ambient input dimension; this is a substantial improvement over the rate without representation learning, i.e. O(d/n2). This result bypasses the Ω(1/√T) barrier under the i.i.d. task assumption, and can capture the desired property that all n1 T samples from the source tasks can be pooled together for representation learning. Next, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say, in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural network learning. Our results demonstrate that representation learning can fully utilize all n1 T samples from the source tasks.
Y. Wang, R. Wang, and S. M. Kakade, An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap. NeurIPS, 2021. Publisher's Version. Abstract:
A fundamental question in the theory of reinforcement learning is: if the optimal Q-function lies in the linear span of a given d-dimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in d) sample size lower bound, which holds even if the agent has access to a generative model of the environment. One may hope that this information-theoretic barrier for RL can be circumvented by further supposing an even more favorable assumption: that there exists a "constant suboptimality gap" between the optimal Q-value of the best action and that of the second-best action (for all states). The hope is that having a large suboptimality gap would permit easier identification of the optimal actions themselves, thus making the problem tractable; indeed, provided the agent has access to a generative model, sample-efficient RL is in fact possible with the addition of this more favorable assumption. This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves it in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal Q-function. Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting. Complementing our negative hardness result, we give two positive results showing that provably sample-efficient RL is possible either under an additional low-variance assumption or under a novel hypercontractivity assumption (both implicitly place stronger conditions on the underlying dynamics model).
X. Liu, W. Kong, S. Kakade, and S. Oh, Robust and Differentially Private Mean Estimation. NeurIPS, 2021. Publisher's Version. Abstract:
Differential privacy has emerged as a standard requirement in a variety of applications ranging from the U.S. Census to data collected in commercial devices, initiating an extensive line of research in accurately and privately releasing statistics of a database. An increasing number of such databases consist of data from multiple sources, not all of which can be trusted. This leaves existing private analyses vulnerable to attacks by an adversary who injects corrupted data. Despite the significance of designing algorithms that simultaneously guarantee privacy and robustness (to a fraction of the data being corrupted), even the simplest questions remain open. For the canonical problem of estimating the mean from i.i.d. samples, we introduce the first efficient algorithm that achieves both privacy and robustness for a wide range of distributions. It achieves optimal accuracy, matching the known lower bounds for robustness, but its sample complexity has a factor-of-√d gap from the known lower bounds. We further show that this gap is due to computational efficiency: we introduce the first family of algorithms that close the gap, but they take exponential time. The innovation is in exploiting resilience (a key property in robust estimation) to adaptively bound the sensitivity and improve privacy.
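For reference, the textbook Gaussian-mechanism baseline for private mean estimation is: clip each sample, average, and add noise calibrated to the clipped sensitivity. This is a standard baseline, not the paper's algorithm (which adaptively bounds sensitivity via resilience):

```python
import numpy as np

def private_mean(samples, radius, eps=1.0, delta=1e-5, seed=0):
    """Gaussian mechanism for the mean: clip each sample to a ball of the
    given radius, average, and add noise calibrated to the replace-one
    sensitivity 2*radius/n.  Clipping also gives crude robustness to
    wildly corrupted samples."""
    rng = np.random.default_rng(seed)
    n, d = samples.shape
    norms = np.linalg.norm(samples, axis=1, keepdims=True)
    clipped = samples * np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * radius / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```

The weakness of this baseline is exactly what motivates the paper: a fixed clipping radius forces a worst-case sensitivity bound, whereas bounding sensitivity adaptively yields much better accuracy.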
Y. Bai, et al., How Important is the Train-Validation Split in Meta-Learning?. ICML: ArXiv Report, 2021. Publisher's Version. Abstract:
Meta-learning aims to perform fast adaptation on a new task through learning a "prior" from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split (the "train-val" method), where the prior adapts to the task on one split of the data and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood, either in theory or in practice, particularly in comparison to the more direct "train-train" method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study of whether and when the train-validation split is helpful in the linear centroid meta-learning problem. In the agnostic case, we show that the expected loss of the train-val method is minimized at the optimal prior for meta-testing, and that this is not the case for the train-train method in general without structural assumptions on the data. In contrast, in the realizable case where the data are generated from linear models, we show that both the train-val and train-train losses are minimized at the optimal prior in expectation. Further, perhaps surprisingly, our main result shows that the train-train method achieves a strictly better excess loss in this realizable case, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that sample splitting may not always be preferable, especially when the data is realizable by the model. We validate our theory by showing experimentally that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.
R. Wang, D. P. Foster, and S. M. Kakade, What are the Statistical Limits of Offline RL with Linear Function Approximation?. ICLR: ArXiv Report, 2021. Publisher's Version. Abstract:
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, and the literature largely consists of sufficient conditions. This work focuses on the basic question of which representational and distributional conditions are necessary to permit provably sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if (i) we have realizability, in that the true value function of every policy is linear in a given set of features, and (ii) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. JMLR: ArXiv Report, 2021. Publisher's Version. Abstract:
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points,” ACM, 2021. Publisher's Version. Abstract:
Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient---their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points.
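The mechanism is easy to illustrate: run gradient descent, and when the gradient is small (a candidate saddle point), inject a small random perturbation. A minimal sketch on a toy strict-saddle function; the stepsize, thresholds, and test function are illustrative, and the analyzed algorithm is more careful about when and how it perturbs:

```python
import numpy as np

def perturbed_gd(grad, x0, eta=0.1, steps=2000, g_thresh=1e-3, radius=0.1, seed=0):
    """Gradient descent that adds a small uniform perturbation whenever the
    gradient is nearly zero, so it can leave strict saddle points."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < g_thresh:
            x = x + rng.uniform(-radius, radius, size=x.shape)  # escape kick
        else:
            x = x - eta * g
    for _ in range(100):                        # polish: pure GD to settle at a minimum
        x = x - eta * grad(x)
    return x

# f(x, y) = (x^2 - 1)^2 + y^2 has a strict saddle at the origin
# and minima at (±1, 0); plain GD started at the origin never moves.
def grad_f(v):
    x, y = v
    return np.array([4.0 * x * (x * x - 1.0), 2.0 * y])
```

Started exactly at the saddle (0, 0), the perturbation pushes the iterate off the unstable direction and gradient descent then carries it to one of the two minima.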


N. Agarwal, S. Kakade, R. Kidambi, Y. T. Lee, P. Netrapalli, and A. Sidford, Leverage Score Sampling for Faster Accelerated Regression and Empirical Risk Minimization. ALT: ArXiv Report, 2020. Publisher's Version. Abstract:
Given a matrix A ∈ ℝ^(n×d) and a vector b ∈ ℝ^n, we show how to compute an ϵ-approximate solution to the regression problem min_{x ∈ ℝ^d} (1/2)‖Ax − b‖₂² in time Õ((n + √(d⋅κ_sum))⋅s⋅log(1/ϵ)), where κ_sum = tr(A⊤A)/λ_min(A⊤A) and s is the maximum number of non-zero entries in a row of A. Our algorithm improves upon the previous best running time of Õ((n + √(n⋅κ_sum))⋅s⋅log(1/ϵ)). We achieve our result through a careful combination of leverage score sampling techniques, proximal point methods, and accelerated coordinate descent. Our method not only matches the performance of previous methods, but further improves whenever the leverage scores of the rows are small (up to polylogarithmic factors). We also provide a non-linear generalization of these results that improves the running time for solving a broader class of ERM problems.
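Leverage scores, the central sampling tool here, have a compact definition: the i-th score is the squared norm of the i-th row of any orthonormal basis for A's column space, and the scores sum to rank(A). A minimal sketch (computed via SVD for clarity, which is not how a fast solver would obtain them):

```python
import numpy as np

def leverage_scores(A):
    """Row leverage scores of A: squared row norms of an orthonormal basis
    for A's column space.  Each score lies in [0, 1], they sum to rank(A),
    and they measure how important each row is for sampling."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > 1e-12]                        # keep directions with nonzero spectrum
    return np.einsum('ij,ij->i', U, U)         # squared row norms of the basis
```

Sampling rows with probability proportional to these scores preserves the quadratic form A⊤A with few rows, which is what the running-time improvement above exploits when the scores are small.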
C. Ilin, S. E. Annan-Phan, X. H. Tai, S. Mehra, S. M. Hsiang, and J. E. Blumenstock, Public Mobility Data Enables COVID-19 Forecasting and Management at Local and Global Scales, vol. 28120. NBER: ArXiv Report, 2020. Publisher's Version. Abstract:
Policymakers everywhere are working to determine the set of restrictions that will effectively contain the spread of COVID-19 without excessively stifling economic activity. We show that publicly available data on human mobility — collected by Google, Facebook, and other providers — can be used to evaluate the effectiveness of non-pharmaceutical interventions and forecast the spread of COVID-19. This approach relies on simple and transparent statistical models, and involves minimal assumptions about disease dynamics. We demonstrate the effectiveness of this approach using local and regional data from China, France, Italy, South Korea, and the United States, as well as national data from 80 countries around the world.
C. Jin, S. M. Kakade, A. Krishnamurthy, and Q. Liu, Sample-Efficient Reinforcement Learning of Undercomplete POMDPs. NeurIPS: ArXiv Report, 2020. Publisher's Version. Abstract:
Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge leads to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of POMDPs. In particular, we present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works. OOM-UCB achieves an optimal sample complexity of Õ(1/ε²) for finding an ε-optimal policy, along with being polynomial in all other relevant quantities. As an interesting special case, we also provide a computationally and statistically efficient algorithm for POMDPs with deterministic state transitions.
C. Wei, S. Kakade, and T. Ma, The Implicit and Explicit Regularization Effects of Dropout. NeurIPS: ArXiv Report, 2020. Publisher's Version. Abstract:
Dropout is a widely-used regularization technique, often required to obtain state-of-the-art performance on a number of architectures. This work demonstrates that dropout introduces two distinct but entangled regularization effects: an explicit effect (also studied in prior work), which occurs because dropout modifies the expected training objective, and, perhaps surprisingly, an additional implicit effect from the stochasticity in the dropout training update. This implicit regularization effect is analogous to the effect of stochasticity in small mini-batch stochastic gradient descent. We disentangle these two effects through controlled experiments. We then derive analytic simplifications which characterize each effect in terms of the derivatives of the model and the loss, for deep neural networks. We demonstrate that these simplified, analytic regularizers accurately capture the important aspects of dropout, showing they can faithfully replace dropout in practice.
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. COLT: ArXiv Report, 2020. Publisher's Version. Abstract:
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
K. Zhang, S. M. Kakade, T. Başar, and L. F. Yang, Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity. NeurIPS: ArXiv Report, 2020. Publisher's Version. Abstract:
Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and planning phases and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address this fundamental question about sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of Õ(|S||A||B|(1−γ)⁻³ϵ⁻²) for finding the Nash equilibrium (NE) value up to some ϵ error, and the ϵ-NE policies with a smooth planning oracle, where γ is the discount factor and S, A, B denote the state space and the action spaces for the two agents. We further show that this sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, i.e., if it queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with an Ω̃(|S|(|A|+|B|)(1−γ)⁻³ϵ⁻²) lower bound, where this model-based approach is near-optimal with only a gap in the |A|,|B| dependence. Our results not only demonstrate the sample efficiency of this basic model-based approach in MARL, but also elaborate on a fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and its limitation (being less adaptive and suboptimal in |A|,|B|), which particularly arises in the multi-agent context.
A. Agarwal, M. Henaff, S. Kakade, and W. Sun, PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning. NeurIPS: ArXiv Report, 2020. Publisher's VersionAbstract
Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case ℓ∞ assumptions; this includes approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
A. Agarwal, S. Kakade, A. Krishnamurthy, and W. Sun, FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs. NeurIPS: ArXiv Report, 2020. Publisher's VersionAbstract
In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common practice to make parametric assumptions where values or policies are functions of some low dimensional feature space. This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. Structurally, we make precise connections between these low rank MDPs and latent variable models, showing how they significantly generalize prior formulations for representation learning in RL. Algorithmically, we develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
R. Wang, S. S. Du, L. F. Yang, and S. M. Kakade, Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning? NeurIPS: ArXiv Report, 2020. Publisher's VersionAbstract
Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is within ε of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an ε-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.
S. Kakade, A. Krishnamurthy, K. Lowrey, M. Ohnishi, and W. Sun, Information Theoretic Regret Bounds for Online Nonlinear Control. NeurIPS: ArXiv Report, 2020. Publisher's VersionAbstract

This work studies the problem of sequential control in an unknown, nonlinear dynamical system, where we model the underlying system dynamics as an unknown function in a known Reproducing Kernel Hilbert Space. This framework yields a general setting that permits discrete and continuous control inputs as well as non-smooth, non-differentiable dynamics. Our main result, the Lower Confidence-based Continuous Control (LC3) algorithm, enjoys a near-optimal O(√T) regret bound against the optimal controller in episodic settings, where T is the number of episodes. The bound has no explicit dependence on the dimension of the system dynamics, which could be infinite, but instead only depends on information theoretic quantities. We empirically show its application to a number of nonlinear control tasks and demonstrate the benefit of exploration for learning model dynamics.

Project website.

V. Ramanujan and R. Somani, Soft Threshold Weight Reparameterization for Learnable Sparsity. ICML: ArXiv Report, 2020. Publisher's VersionAbstract
Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus of maximizing prediction accuracy given an overall parameter budget. Existing methods rely on uniform or heuristic non-uniform sparsity budgets which have sub-optimal layer-wise parameter allocation resulting in a) lower prediction accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. STR smoothly induces sparsity while learning pruning thresholds thereby obtaining a non-uniform sparsity budget. Our method achieves state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy over existing results by up to 10% in the ultra sparse (99%) regime and can also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is a simple mechanism which learns effective sparsity budgets that contrast with popular heuristics. Code, pretrained models and sparsity budgets are at this https URL.
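The soft-threshold reparameterization at the heart of STR is compact enough to sketch. Below is a minimal NumPy illustration, assuming a sigmoid mapping from a free parameter s to a positive threshold (one of the choices the paper describes); per-layer learnable thresholds and the training loop are omitted.

```python
import numpy as np

def soft_threshold(w, s):
    # Map the free parameter s to a positive threshold via a sigmoid,
    # then shrink each weight toward zero, zeroing those below the threshold.
    alpha = 1.0 / (1.0 + np.exp(-s))
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

w = np.array([0.9, -0.3, 0.05, -1.2])
print(soft_threshold(w, s=0.0))  # s=0 gives threshold 0.5; small weights vanish
```

In training, s would be learned jointly with w by backpropagating through this operator, which is what lets a non-uniform sparsity budget emerge across layers.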
W. Kong, R. Somani, Z. Song, S. Kakade, and S. Oh, Robust Meta-learning for Mixed Linear Regression with Small Batches. NeurIPS: ArXiv Report, 2020. Publisher's VersionAbstract
A common challenge faced in practical supervised learning, such as medical image processing and robotic interactions, is that there are plenty of tasks but each task cannot afford to collect enough labeled examples to be learned in isolation. However, by exploiting the similarities across those tasks, one can hope to overcome such data scarcity. Under a canonical scenario where each task is drawn from a mixture of k linear regressions, we study a fundamental question: can abundant small-data tasks compensate for the lack of big-data tasks? Existing second-moment-based approaches show that such a trade-off is efficiently achievable, with the help of medium-sized tasks with Ω(k^{1/2}) examples each. However, this algorithm is brittle in two important scenarios. The predictions can be arbitrarily bad (i) even with only a few outliers in the dataset; or (ii) even if the medium-sized tasks are slightly smaller, with o(k^{1/2}) examples each. We introduce a spectral approach that is simultaneously robust under both scenarios. To this end, we first design a novel outlier-robust principal component analysis algorithm that achieves optimal accuracy. This is followed by a sum-of-squares algorithm to exploit the information from higher order moments. Together, this approach is robust against outliers and achieves a graceful statistical trade-off; the lack of Ω(k^{1/2})-size tasks can be compensated for with smaller tasks, which can now be as small as O(log k).
M. Braverman, X. Chen, S. M. Kakade, K. Narasimhan, C. Zhang, and Y. Zhang, Calibration, Entropy Rates, and Memory in Language Models. ICML: ArXiv Report, 2020. Publisher's VersionAbstract
Building accurate language models that capture meaningful long-term dependencies is a core challenge in natural language processing. Towards this end, we present a calibration-based approach to measure long-term discrepancies between a generative sequence model and the true distribution, and use these discrepancies to improve the model. Empirically, we show that state-of-the-art language models, including LSTMs and Transformers, are \emph{miscalibrated}: the entropy rates of their generations drift dramatically upward over time. We then provide provable methods to mitigate this phenomenon. Furthermore, we show how this calibration-based approach can also be used to measure the amount of memory that language models use for prediction.
A. Agarwal, S. Kakade, and L. F. Yang, On the Optimality of Sparse Model-Based Planning for Markov Decision Processes. COLT: ArXiv Report, 2020. Publisher's VersionAbstract
This work considers the sample and computational complexity of obtaining an ϵ-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this work, we study the effectiveness of the most natural plug-in approach to model-based planning: we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP. We ask arguably the most basic and unresolved question in model-based planning: is the naive "plug-in" approach, non-asymptotically, minimax optimal in the quality of the policy it finds, given a fixed sample size? Here, the non-asymptotic regime refers to when the sample size is sublinear in the model size. With access to a generative model, we resolve this question in the strongest possible sense: our main result shows that \emph{any} high accuracy solution in the plug-in model constructed with N samples provides an ϵ-optimal policy in the true underlying MDP (where ϵ is the minimax accuracy with N samples at every state, action pair). In comparison, all prior (non-asymptotically) minimax optimal results use model-free approaches, such as the Variance Reduced Q-value iteration algorithm (Sidford et al 2018), while the best known model-based results (e.g. Azar et al 2013) require larger sample sizes in their dependence on the planning horizon or the state space. Notably, we show that the model-based approach allows the use of \emph{any} efficient planning algorithm in the empirical MDP, which simplifies algorithm design as this approach does not tie the algorithm to the sampling procedure. The core of our analysis is a novel "absorbing MDP" construction to address the statistical dependency issues that arise in the analysis of model-based planning approaches, a construction which may be helpful more generally.
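The plug-in approach itself is simple to state: estimate the transition model by maximum likelihood from generative-model samples, then run any planner on the empirical MDP. A minimal tabular sketch (the sampler interface and the value-iteration planner are illustrative choices, not the paper's setup):

```python
import numpy as np

def plug_in_policy(sample_next_state, rewards, n_states, n_actions,
                   n_samples, gamma=0.9, iters=500):
    # Maximum likelihood estimate of the transition model from N samples
    # per (state, action) pair drawn from the generative model.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= n_samples
    # Plan in the empirical MDP; any planner works, here value iteration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = rewards + gamma * P_hat @ Q.max(axis=1)
    return Q.argmax(axis=1)
```

The paper's point is precisely that any high-accuracy planner applied to the estimated model yields an ϵ-optimal policy in the true MDP, decoupling algorithm design from the sampling procedure.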
S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang, Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? ICLR: ArXiv Report, 2020. Publisher's VersionAbstract
Modern deep learning methods provide effective means to learn good representations. However, is a good representation itself sufficient for sample efficient reinforcement learning? This question has largely been studied only with respect to (worst-case) approximation error, in the more classical approximate dynamic programming literature. With regards to the statistical viewpoint, this question is largely unexplored, and the extant body of literature mainly focuses on conditions which permit sample efficient reinforcement learning with little understanding of what are necessary conditions for efficient reinforcement learning. This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.
S. Arora, S. S. Du, S. Kakade, Y. Luo, and N. Saunshi, Provable Representation Learning for Imitation Learning via Bi-level Optimization. ICML: ArXiv Report, 2020. Publisher's VersionAbstract
A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization problem where the "outer" optimization tries to learn the joint representation and the "inner" optimization encodes the imitation learning setup and tries to learn task-specific parameters. We instantiate this framework for the imitation learning settings of behavior cloning and observation-alone. Theoretically, we show using our framework that representation learning can provide sample complexity benefits for imitation learning in both settings. We also provide proof-of-concept experiments to verify our theory.
W. Kong, R. Somani, Z. Song, S. Kakade, and S. Oh, Meta-learning for mixed linear regression. ICML: ArXiv Report, 2020. Publisher's VersionAbstract
In modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labeled data. These include data from medical image processing and robotic interaction. Even though each individual task cannot be meaningfully trained in isolation, one seeks to meta-learn across the tasks from past experiences by exploiting some similarities. We study a fundamental question of interest: when can abundant tasks with small data compensate for a lack of tasks with big data? We focus on a canonical scenario where each task is drawn from a mixture of k linear regressions, and identify sufficient conditions for such a graceful exchange to hold: the total number of examples necessary with only small-data tasks scales similarly as when big-data tasks are available. To this end, we introduce a novel spectral approach and show that we can efficiently utilize small-data tasks with the help of Ω̃(k^{3/2}) medium-data tasks, each with Ω̃(k^{1/2}) examples.
E. Hazan, S. M. Kakade, and K. Singh, The Nonstochastic Control Problem. ALT: ArXiv Report, 2020. Publisher's VersionAbstract
We consider the problem of controlling an unknown linear dynamical system in the presence of (nonstochastic) adversarial perturbations and adversarial convex loss functions. In contrast to classical control, the a priori determination of an optimal controller here is hindered by the latter's dependence on the yet unknown perturbations and costs. Instead, we measure regret against an optimal linear policy in hindsight, and give the first efficient algorithm that guarantees a sublinear regret bound, scaling as T^{2/3}, in this setting.
D. Davis, D. Drusvyatskiy, S. Kakade, and J. D. Lee, Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics (FOCM), 2020: ArXiv Report, 2020. Publisher's VersionAbstract
This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science -- including all popular deep learning architectures.
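As a concrete instance of the method the theorem covers, the sketch below runs the stochastic subgradient method on the nonsmooth (semialgebraic, locally Lipschitz) function f(x) = |x|; the step-size schedule and Gaussian noise model are illustrative assumptions.

```python
import random

def subgrad_abs(x):
    # A selection from the subdifferential of |x|; choose 0 at x = 0.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def stochastic_subgradient_descent(x0, steps=5000, seed=0):
    # Square-summable-but-not-summable step sizes with additive noise.
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        g = subgrad_abs(x) + rng.gauss(0.0, 1.0)  # noisy subgradient
        x -= g / t ** 0.75
    return x
```

Consistent with the guarantee, the iterates cluster around x = 0, the unique first-order stationary point of |x|.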


J. Thickstun, Z. Harchaoui, D. P. Foster, and S. M. Kakade, Coupled Recurrent Models for Polyphonic Music Composition. ISMIR: ArXiv Report, 2019. Publisher's VersionAbstract
This paper introduces a novel recurrent model for music composition that is tailored to the structure of polyphonic music. We propose an efficient new conditional probabilistic factorization of musical scores, viewing a score as a collection of concurrent, coupled sequences: i.e. voices. To model the conditional distributions, we borrow ideas from both convolutional and recurrent neural models; we argue that these ideas are natural for capturing music's pitch invariances, temporal structure, and polyphony. We train models for single-voice and multi-voice composition on 2,300 scores from the KernScores dataset.
R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli, The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure. NeurIPS: ArXiv Report, 2019. Publisher's VersionAbstract
Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations SGD is run for) is known in advance, SGD's final iterate behavior with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of √T in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step sizes. In particular, the final iterate behavior with a step decay schedule is off the minimax rate by only log factors (in the condition number for the strongly convex case, and in T for the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the stepsizes employed. These results demonstrate the subtlety in establishing optimal learning rate schemes (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.
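The two schedule families under comparison are easy to state. A small sketch contrasting step decay with a polynomially decaying schedule (the constants below -- a cut by a factor of 2 every 30 epochs -- are illustrative, not the paper's tuned values):

```python
def step_decay_lr(lr0, epoch, epochs_per_cut=30, factor=0.5):
    # Cut the learning rate by a constant factor every fixed number of
    # epochs, so the rate decays geometrically over training.
    return lr0 * factor ** (epoch // epochs_per_cut)

def poly_decay_lr(lr0, t, power=0.5):
    # Polynomial decay, e.g. lr0 / sqrt(t + 1), for comparison.
    return lr0 / (t + 1) ** power

print(step_decay_lr(0.1, 65))  # two cuts have occurred by epoch 65
```

The paper's result is that, for the final iterate on least squares, the geometric schedule is within log factors of minimax optimal while every polynomial schedule is not.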
A. Rajeswaran, C. Finn, S. Kakade, and S. Levine, Meta-Learning with Implicit Gradients. NeurIPS: ArXiv Report, 2019. Publisher's VersionAbstract
A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
C. Finn, A. Rajeswaran, S. Kakade, and S. Levine, Online Meta-Learning. ICML: ArXiv Report, 2019. Publisher's VersionAbstract
A central capability of intelligent systems is the ability to continuously build upon previous experiences to speed up and enhance learning of new tasks. Two distinct research paradigms have studied this question. Meta-learning views this problem as learning a prior over model parameters that is amenable to fast adaptation on a new task, but typically assumes the set of tasks is available together as a batch. In contrast, online (regret based) learning considers a sequential setting in which problems are revealed one after the other, but conventionally trains only a single model without any task-specific adaptation. This work introduces an online meta-learning setting, which merges ideas from both of the aforementioned paradigms to better capture the spirit and practice of continual lifelong learning. We propose the follow the meta leader algorithm which extends the MAML algorithm to this setting. Theoretically, this work provides an O(log T) regret guarantee with only one additional higher order smoothness assumption in comparison to the standard online setting. Our experimental evaluation on three different large-scale tasks suggests that the proposed algorithm significantly outperforms alternatives based on traditional online learning approaches.
G. Cadamuro, R. K. Vinayak, J. Blumenstock, S. Kakade, and J. N. Shapiro, “The Illusion of Change: Correcting for Bias when Inferring Changes in Sparse, Societal-Scale Data,” The Web Conference 2019, pp. 2608–2615, 2019. Publisher's VersionAbstract
Societal-scale data is playing an increasingly prominent role in social science research; examples from research on geopolitical events include questions on how emergency events impact the diffusion of information or how new policies change patterns of social interaction. Such research often draws critical inferences from observing how an exogenous event changes meaningful metrics like network degree or network entropy. However, as we show in this work, standard estimation methodologies make systematically incorrect inferences when the event also changes the sparsity of the data. To address this issue, we provide a general framework for inferring changes in social metrics when dealing with non-stationary sparsity. We propose a plug-in correction that can be applied to any estimator, including several recently proposed procedures. Using both simulated and real data, we demonstrate that the correction significantly improves the accuracy of the estimated change under a variety of plausible data generating processes. In particular, using a large dataset of calls from Afghanistan, we show that whereas traditional methods substantially overestimate the impact of a violent event on social diversity, the plug-in correction reveals the true response to be much more modest.
N. Agarwal, B. Bullins, E. Hazan, S. M. Kakade, and K. Singh, Online Control with Adversarial Disturbances. ICML: ArXiv Report, 2019. Publisher's VersionAbstract
We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that can do nearly as well as a procedure that has full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this problem. From a technical standpoint, this work generalizes previous work in two main aspects: our model allows for adversarial noise in the dynamics, and allows for general convex costs.
R. K. Vinayak, W. Kong, G. Valiant, and S. M. Kakade, Maximum Likelihood Estimation for Learning Populations of Parameters. ArXiv Report, 2019. Publisher's VersionAbstract
Consider a setting with N independent individuals, each with an unknown parameter p_i ∈ [0,1] drawn from some unknown distribution P⋆. After observing the outcomes of t independent Bernoulli trials per individual, i.e., X_i ∼ Binomial(t, p_i), our objective is to accurately estimate P⋆. This problem arises in numerous domains, including the social sciences, psychology, health care, and biology, where the size of the population under study is usually large while the number of observations per individual is often limited. Our main result shows that, in the regime where t ≪ N, the maximum likelihood estimator (MLE) is both statistically minimax optimal and efficiently computable. Precisely, for sufficiently large N, the MLE achieves the information theoretic optimal error bound of O(1/t) for t
K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. ICLR: ArXiv Report, 2019. Publisher's VersionAbstract
We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enables solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.
E. Hazan, S. M. Kakade, K. Singh, and A. V. Soest, Provably Efficient Maximum Entropy Exploration. ICML: ArXiv Report, 2019. Publisher's VersionAbstract
Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal: what might we hope that the agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies that are induced by how the agent behaves. For example, one natural, intrinsically defined objective is for the agent to learn a policy which induces a distribution over state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting where we have sample based access to the MDP, our proposed algorithm is provably efficient, both in terms of its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which calls an approximate MDP solver.
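The conditional gradient idea can be sketched in miniature. Below, the planning-oracle step is replaced by a linear maximization over a fixed set of candidate state distributions (in the paper, these vertices come from an approximate MDP planning oracle); this is an illustrative reduction, not the full algorithm.

```python
import numpy as np

def max_entropy_mixture(dists, iters=200):
    # Frank-Wolfe on the entropy of a mixture of given state distributions.
    dists = np.asarray(dists, dtype=float)       # shape: (k, n_states)
    d = dists.mean(axis=0)                       # start at the uniform mixture
    for t in range(iters):
        grad = -np.log(d + 1e-12) - 1.0          # gradient of the entropy H(d)
        vertex = dists[np.argmax(dists @ grad)]  # linear maximization step
        eta = 2.0 / (t + 2)                      # standard Frank-Wolfe step size
        d = (1 - eta) * d + eta * vertex
    return d

print(max_entropy_mixture([[1.0, 0.0], [0.0, 1.0]]))  # close to [0.5, 0.5]
```

Each Frank-Wolfe step only needs a linear optimization over the vertices, which is what lets the full algorithm lean on a black box planner rather than solving the entropic objective directly.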


S. Kakade, P. Liang, V. Sharan, and G. Valiant, Prediction with a Short Memory. STOC: ArXiv Report, 2020. Publisher's VersionAbstract
We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observations together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by I, then a simple Markov model over the most recent I/ϵ observations obtains expected KL error ϵ -- and hence ℓ1 error √ϵ -- with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with n hidden states, I is bounded by log n, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length O(log n/ϵ) windows of observations achieves this error, provided the length of the sequence is d^{Ω(log n/ϵ)}, where d is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with n hidden states, a window length of log n/ϵ is information-theoretically necessary to achieve expected ℓ1 error √ϵ. Second, the d^{Θ(log n/ϵ)} samples required to estimate the Markov model for an observation alphabet of size d are necessary for any computationally tractable learning algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
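The "trivial prediction algorithm" in the result is just an empirical-frequency model over fixed-length windows; a minimal sketch (the alphabet and window length are illustrative):

```python
from collections import Counter, defaultdict

def fit_window_model(seq, window):
    # Count what followed each length-`window` context in the sequence.
    counts = defaultdict(Counter)
    for i in range(len(seq) - window):
        counts[tuple(seq[i:i + window])][seq[i + window]] += 1
    return counts

def predict(counts, context):
    # Predict the most frequent continuation of the context, if it was seen.
    ctx = tuple(context)
    return counts[ctx].most_common(1)[0][0] if ctx in counts else None

model = fit_window_model("abcabcabc", window=2)
print(predict(model, "ab"))  # 'c'
```

The theorem says that, once the window length reaches O(log n/ϵ) and enough data has been seen, this short-memory predictor is already near-optimal on average, despite ignoring everything before the window.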
K. Pillutla, V. Roulet, S. Kakade, and Z. Harchaoui, A smoother way to train structured prediction models. NeurIPS: ArXiv Report, 2018. Publisher's VersionAbstract
We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants, including extensions to deep structured prediction. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. The experimental results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.
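The smoothing at the core of the framework replaces the nonsmooth max in the max-margin objective with a smooth surrogate; log-sum-exp (entropic) smoothing is one standard instantiation. A minimal sketch, with the smoothing parameter mu as an assumption:

```python
import math

def smoothed_max(scores, mu):
    # Log-sum-exp smoothing: a smooth upper bound on max(scores) that is
    # within mu * log(len(scores)) of the true max; smaller mu is sharper.
    m = max(scores)  # subtract the max for numerical stability
    return m + mu * math.log(sum(math.exp((s - m) / mu) for s in scores))
```

As mu → 0 this recovers the hard max, while for mu > 0 its gradient (a softmax) is smooth, which is what enables the fast primal gradient-based optimization the abstract refers to.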
S. Kakade and J. D. Lee, Provably Correct Automatic Subdifferentiation for Qualified Programs. NeurIPS: ArXiv Report, 2018. Publisher's Version. Abstract:
The Cheap Gradient Principle (Griewank 2008) --- the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of 5) as that of simply computing the function itself --- is of central importance in optimization; it allows us to quickly obtain (high dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures. The current state of affairs is markedly different with regards to computing subderivatives: widely used ML libraries, including TensorFlow and PyTorch, do not correctly compute (generalized) subderivatives even on simple examples. This work considers the question: is there a Cheap Subgradient Principle? Our main result shows that, under certain restrictions on our library of nonsmooth functions (standard in nonlinear programming), provably correct generalized subderivatives can be computed at a computational cost that is within a (dimension-free) factor of 6 of the cost of computing the scalar function itself.
P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford, Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification. JMLR: ArXiv Report, 2018. Publisher's Version. Abstract:
This work characterizes the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD). In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient both to reduce the variance of the stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method involving averaging the final few iterates of SGD to decrease the variance in SGD's final iterate. This work presents non-asymptotic excess risk bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batch SGD yields provable near-linear parallelization speedups over SGD with batch size one, which allows for understanding learning rate versus batch size tradeoffs for the final iterate of an SGD method. These results are then utilized in providing a highly parallelizable SGD method that obtains the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD methods. A non-asymptotic analysis of communication-efficient parallelization schemes such as model-averaging/parameter-mixing methods is then provided. Finally, this work sheds light on some fundamental differences in SGD's behavior when dealing with agnostic noise in the (non-realizable) least squares regression problem. In particular, the work shows that the step sizes that ensure minimax risk for the agnostic case must be a function of the noise properties. This paper builds on the operator view of analyzing SGD methods, introduced by Defossez and Bach (2015), and develops a novel analysis that bounds these operators to characterize the excess risk. These techniques are of broader interest in analyzing computational aspects of stochastic approximation.
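As a concrete (and purely illustrative) rendering of the two averaging schemes analyzed above, the sketch below runs mini-batch SGD on the least squares objective and returns the tail average of the final iterates. All hyperparameter values are our own assumptions, not the tuned choices from the paper.

```python
import numpy as np

def minibatch_tail_averaged_sgd(X, y, lr=0.1, batch_size=8,
                                n_steps=2000, tail_frac=0.5, seed=0):
    """Mini-batch SGD on the least squares objective, returning the
    average of the final `tail_frac` fraction of iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    tail_start = int(n_steps * (1 - tail_frac))
    tail_sum = np.zeros(d)
    for step in range(n_steps):
        idx = rng.integers(0, n, size=batch_size)        # sample a mini-batch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w = w - lr * grad
        if step >= tail_start:
            tail_sum += w                                # accumulate tail iterates
    return tail_sum / (n_steps - tail_start)
```

On well-conditioned noiseless data this recovers the least squares solution; the paper's analysis quantifies how large the batch size can be before the parallelization speedup degrades.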
R. Kidambi, P. Netrapalli, P. Jain, and S. M. Kakade, On the insufficiency of existing momentum schemes for Stochastic Optimization. ICLR: ArXiv Report, 2018. Publisher's Version. Abstract:
Momentum-based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent only in the deterministic case, where the gradients are exact. In the stochastic case, the popular explanation for their wide applicability is that these fast gradient methods partially mimic their exact-gradient counterparts, resulting in some practical gain. This work provides a counterpoint to this belief by proving that there exist simple problem instances where these methods cannot outperform SGD despite the best setting of their parameters. These negative problem instances are, in an informal sense, generic; they do not look like carefully constructed pathological instances. These results suggest (along with empirical evidence) that HB's or NAG's practical performance gains are a by-product of mini-batching. Furthermore, this work provides a viable (and provable) alternative which, on the same set of problem instances, significantly improves over HB, NAG, and SGD's performance. This algorithm, referred to as Accelerated Stochastic Gradient Descent (ASGD), is a simple-to-implement stochastic algorithm based on a relatively less popular variant of Nesterov's acceleration. Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD.
P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford, Accelerating Stochastic Gradient Descent. COLT: ArXiv Report, 2018. Publisher's Version. Abstract:
There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes the conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems.
J. Thickstun, Z. Harchaoui, D. Foster, and S. M. Kakade, Invariances and Data Augmentation for Supervised Music Transcription. International Conference on Acoustics, Speech, and Signal Processing (ICASSP): ArXiv Report, 2018. Publisher's Version. Abstract:
This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art on human recordings. The translation-invariant network discussed in this paper, which combines a traditional filterbank with a convolutional neural network, was the top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation. This class of models shares parameters in the log-frequency domain, which exploits the frequency invariance of music to reduce the number of model parameters and avoid overfitting to the training data. All models in this paper were trained with supervision by labeled data from the MusicNet dataset, augmented by random label-preserving pitch-shift transformations.
C. Wu, et al., Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines. ICLR: ArXiv Report, 2018. Publisher's Version. Abstract:
Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exacerbated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the stochastic policy itself and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis as well as numerical results, including an analysis of the suboptimality of the optimal state-dependent baseline. The result is a computationally efficient policy gradient algorithm, which scales to high-dimensional control problems, as demonstrated by a synthetic 2000-dimensional target matching task. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks. Finally, we show that the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.
Q. Huang, S. Kakade, W. Kong, and G. Valiant, Recovering Structured Probability Matrices. ITCS: ArXiv Report, 2018. Publisher's Version. Abstract:
We consider the problem of accurately recovering a matrix B of size M by M, which represents a probability distribution over M² outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information-theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient algorithm that accurately recovers the underlying M by M matrix using Θ(M) samples. This result easily translates to Θ(M)-sample algorithms for learning topic models and learning hidden Markov models. These linear sample complexities are optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Ω(M) samples. We provide an even stronger lower bound: distinguishing whether a sequence of observations was drawn from the uniform distribution over M observations versus being generated by an HMM with two hidden states requires Ω(M) observations. This precludes sublinear-sample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear-sample estimators for quantities such as the entropy rate of HMMs.
M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi, Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. ICML: ArXiv Report, 2018. Publisher's Version. Abstract:
Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: (1) they are easy to implement without explicit knowledge of the underlying model; (2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; (3) they inherently allow for richly parameterized policies. A notable drawback is that even in the most basic continuous control problem (that of linear quadratic regulators), these methods must solve a non-convex optimization problem, where little is understood about their efficiency from both computational and statistical perspectives. In contrast, system identification and model-based planning in optimal control theory have a much more solid theoretical footing, where much is known with regard to their computational and statistical properties. This work bridges this gap by showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem-dependent quantities) with regard to their sample and computational complexities.


V. Sharan, S. Kakade, P. Liang, and G. Valiant, Learning Overcomplete HMMs. NIPS: ArXiv Report, 2017. Publisher's Version. Abstract:
We study the problem of learning overcomplete HMMs---those that have many hidden states but a small output alphabet. Despite having significant practical importance, such HMMs are poorly understood with no known positive or negative results for efficient learning. In this paper, we present several new results---both positive and negative---which help define the boundaries between the tractable and intractable settings. Specifically, we show positive results for a large subclass of HMMs whose transition matrices are sparse, well-conditioned, and have small probability mass on short cycles. On the other hand, we show that learning is impossible given only a polynomial number of samples for HMMs with a small output alphabet and whose transition matrices are random regular graphs with large degree. We also discuss these results in the context of learning HMMs which can capture long-term dependencies.
P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, V. K. Pillutla, and A. Sidford, A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares). FSTTCS: ArXiv Report, 2017. Publisher's Version. Abstract:
This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model mis-specification.
J. Thickstun, Z. Harchaoui, and S. Kakade, Learning Features of Music from Scratch. ICLR: ArXiv Report, 2017. Publisher's Version. Abstract:
This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions. The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.
P. Jain, C. Jin, S. M. Kakade, and P. Netrapalli, Computing Matrix Squareroot via Non Convex Local Search. AISTATS: ArXiv Report, 2017. Publisher's Version. Abstract:
While there has been a significant amount of work studying gradient descent techniques for non-convex optimization problems over the last few years, all existing results establish either local convergence with good rates or global convergence with highly suboptimal rates, for many problems of interest. In this paper, we take the first step in getting the best of both worlds---establishing global convergence and obtaining a good rate of convergence for the problem of computing the square root of a positive definite (PD) matrix, which is a widely studied problem in numerical linear algebra with applications in machine learning and statistics, among others. Given a PD matrix M and a PD starting point U₀, we show that gradient descent with an appropriately chosen step size finds an ϵ-accurate square root of M in O(α log(‖M − U₀²‖_F/ϵ)) iterations, where α = (max{‖U₀‖₂², ‖M‖₂} / min{σ_min²(U₀), σ_min(M)})^{3/2}. Our result is the first to establish global convergence for this problem and to show that it is robust to errors in each iteration. A key contribution of our work is the general proof technique, which we believe should further excite research in understanding deterministic and stochastic variants of simple non-convex gradient descent algorithms with good global convergence rates for other problems in machine learning and numerical linear algebra.
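The non-convex iteration analyzed above is plain gradient descent on f(U) = ‖U² − M‖²_F. Here is a minimal sketch for a symmetric PD input; the scaled-identity starting point and the heuristic step size are our own choices, not the paper's tuned step-size rule.

```python
import numpy as np

def matrix_sqrt_gd(M, n_iters=2000):
    """Gradient descent on f(U) = ||U^2 - M||_F^2 over symmetric matrices.
    For symmetric U the gradient is 2*((U^2 - M) @ U + U @ (U^2 - M))."""
    d = M.shape[0]
    U = np.eye(d) * np.sqrt(np.trace(M) / d)   # a PD starting point (our choice)
    step = 0.05 / np.linalg.norm(M, 2)         # heuristic step size (our choice)
    for _ in range(n_iters):
        R = U @ U - M
        U = U - step * 2 * (R @ U + U @ R)
    return U
```

For a diagonal M the iterates decouple per entry and converge monotonically to the entrywise square roots, which makes the global behavior easy to verify by hand.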
A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade, Towards Generalization and Simplicity in Continuous Control. NIPS: ArXiv Report, 2017. Publisher's Version. Abstract:
This work shows that policies with simple linear and RBF parameterizations can be trained to solve a variety of continuous control tasks, including the OpenAI gym benchmarks. The performance of these trained policies is competitive with state-of-the-art results obtained with more elaborate parameterizations such as fully connected neural networks. Furthermore, existing training and testing scenarios are shown to be very limited and prone to over-fitting, thus giving rise to only trajectory-centric policies. Training with a diverse initial state distribution is shown to produce more global policies with better generalization. This allows for interactive control scenarios where the system recovers from large on-line perturbations, as shown in the supplementary video.
C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, How to Escape Saddle Points Efficiently. ICML: ArXiv Report, 2017. Publisher's Version.


Q. Huang, R. Ge, S. M. Kakade, and M. Dahleh, Minimal Realization Problems for Hidden Markov Models. IEEE Transactions on Signal Processing: ArXiv Report, 2016. Publisher's Version. Abstract:
Consider a stationary discrete random process with alphabet size d, which is assumed to be the output process of an unknown stationary Hidden Markov Model (HMM). Given the joint probabilities of finite-length strings of the process, we are interested in finding a finite-state generative model to describe the entire process. In particular, we focus on two classes of models: HMMs and quasi-HMMs, which form a strictly larger class of models containing HMMs. In the main theorem, we show that if the random process is generated by an HMM of order at most k, whose transition and observation probability matrices are in general position, namely almost everywhere on the parameter space, both the minimal quasi-HMM realization and the minimal HMM realization can be efficiently computed based on the joint probabilities of all the length-N strings, for N > 4⌈log_d(k)⌉ + 1. In this paper, we also aim to compare and connect two lines of literature: realization theory of HMMs, and the recent development in learning latent variable models with tensor decomposition techniques.
P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford, Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm. COLT: ArXiv Report, 2016. Publisher's Version. Abstract:
This work provides improved guarantees for streaming principal component analysis (PCA). Given $A_1, \ldots, A_n\in \mathbb{R}^{d\times d}$ sampled independently from distributions satisfying $\mathbb{E}[A_i] = \Sigma$ for $\Sigma \succeq \mathbf{0}$, this work provides an $O(d)$-space linear-time single-pass streaming algorithm for estimating the top eigenvector of $\Sigma$. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes the top eigenvector of the empirical covariance $\frac{1}{n} \sum_{i \in [n]} A_i$ as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previously known sample complexities of streaming algorithms by either a multiplicative factor of $O(d)$ or $1/\mathrm{gap}$ where $\mathrm{gap}$ is the relative distance between the top two eigenvalues of $\Sigma$. These results are achieved through a novel analysis of the classic Oja's algorithm, one of the oldest and most popular algorithms for streaming PCA. In particular, this work shows that simply picking a random initial point $w_0$ and applying the update rule $w_{i + 1} = w_i + \eta_i A_i w_i$ suffices to accurately estimate the top eigenvector, with a suitable choice of $\eta_i$. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA.
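The update rule quoted in the abstract, w_{i+1} = w_i + η_i A_i w_i, is essentially all there is to Oja's algorithm. Below is a hedged sketch: the constant step size and the per-step renormalization are our simplifications for numerical convenience, whereas the paper's guarantees rely on a tuned schedule η_i.

```python
import numpy as np

def oja_top_eigenvector(sample_stream, d, eta=0.01, seed=0):
    """Run Oja's update w <- w + eta * (A_i @ w) over a stream of matrices
    with E[A_i] = Sigma, from a random start, renormalizing each step for
    numerical stability; returns a unit-norm top-eigenvector estimate."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for A in sample_stream:
        w = w + eta * (A @ w)       # the one-line Oja update
        w /= np.linalg.norm(w)
    return w
```

A degenerate but valid stream emits the exact covariance every time, in which case the iteration behaves like power iteration and converges to the top eigenvector.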
R. Ge, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford, Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis. ICML: ArXiv Report, 2016. Publisher's Version. Abstract:
This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved. We obtain our results by reducing CCA to the top-k generalized eigenvector problem. We solve this problem through a general framework that simply requires black-box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent, we obtain a running time of O(zk√(κ/ρ)·log(1/ϵ)·log(kκ/ρ)), where z is the total number of nonzero entries, κ is the condition number, and ρ is the relative eigenvalue gap of the appropriate matrices. Our algorithm is linear in the input size and the number of components k up to a log(k) factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.
C. Jin, S. M. Kakade, and P. Netrapalli, Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent. NIPS: ArXiv Report, 2016. Publisher's Version. Abstract:
Matrix completion, where we wish to recover a low rank matrix by observing a few entries from it, is a problem widely studied in both theory and practice, with many applications. Most of the provable algorithms so far on this problem have been restricted to the offline setting, where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications the online version, where we observe one entry at a time and dynamically update our estimate, is more appealing. While existing algorithms are efficient for the offline setting, they could be highly inefficient for the online setting. In this paper, we propose the first provable, efficient online algorithm for matrix completion. Our algorithm starts from an initial estimate of the matrix and then performs non-convex stochastic gradient descent (SGD). After every observation, it performs a fast update involving only one row of two tall matrices, giving near-linear total runtime. Our algorithm can be naturally used in the offline setting as well, where it gives sample complexity and runtime competitive with state-of-the-art algorithms. Our proofs introduce a general framework to show that SGD updates tend to stay away from saddle surfaces, and could be of broader interest in proving tight rates for other non-convex problems.
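The per-observation update described above touches only one row of each factor. A minimal hedged sketch of that single step follows; the function name and learning rate are our own, and the paper's algorithm also includes an initialization phase that this snippet omits.

```python
import numpy as np

def online_mc_step(U, V, i, j, value, lr=0.1):
    """One non-convex SGD step for matrix completion with M ~ U @ V.T:
    after observing entry (i, j), only row i of U and row j of V are
    updated, so each observation costs O(rank) time."""
    err = U[i] @ V[j] - value          # residual on the observed entry
    new_Ui = U[i] - lr * err * V[j]
    V[j] = V[j] - lr * err * U[i]      # uses the old U[i], as in SGD
    U[i] = new_Ui
    return U, V
```

Each step moves the two touched rows so as to reduce the squared error on the entry just observed, which is what yields the near-linear total runtime over a stream of observations.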
D. Garber, et al., Faster Eigenvector Computation via Shift-and-Invert Preconditioning. ICML: ArXiv Report, 2016. Publisher's Version. Abstract:
We give faster algorithms and improved sample complexities for estimating the top eigenvector of a matrix Σ---i.e., computing a unit vector x such that xᵀΣx ≥ (1−ϵ)λ₁(Σ). Offline Eigenvector Estimation: Given an explicit A ∈ ℝ^{n×d} with Σ = AᵀA, we show how to compute an ϵ-approximate top eigenvector in time Õ([nnz(A) + d·sr(A)/gap²]·log(1/ϵ)) and Õ([nnz(A)^{3/4}(d·sr(A))^{1/4}/√gap]·log(1/ϵ)). Here nnz(A) is the number of nonzeros in A, sr(A) is the stable rank, and gap is the relative eigengap. By separating the gap dependence from the nnz(A) term, our first runtime improves upon the classical power and Lanczos methods. It also improves prior work using fast subspace embeddings [AC09, CW13] and stochastic optimization [Sha15c], giving significantly better dependencies on sr(A) and ϵ. Our second running time improves these further when nnz(A) ≤ d·sr(A)/gap². Online Eigenvector Estimation: Given a distribution D with covariance matrix Σ and a vector x₀ which is an O(gap)-approximate top eigenvector for Σ, we show how to refine to an ϵ approximation using O(var(D)/(gap·ϵ)) samples from D. Here var(D) is a natural notion of variance. Combining our algorithm with previous work to initialize x₀, we obtain improved sample complexity and runtime results under a variety of assumptions on D. We achieve our results using a general framework that we believe is of independent interest. We give a robust analysis of the classic method of shift-and-invert preconditioning to reduce eigenvector computation to approximately solving a sequence of linear systems. We then apply fast stochastic variance reduced gradient (SVRG) based system solvers to achieve our claims.
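The reduction at the heart of the result above is classical shift-and-invert iteration: power iteration applied to (shift·I − Σ)⁻¹, which inflates a small eigengap of Σ into a large one. In this sketch we use an exact dense solve where the paper substitutes an approximate stochastic linear system solver, and we assume a suitable shift value is given.

```python
import numpy as np

def shift_invert_top_eigenvector(Sigma, shift, n_iters=50):
    """Power iteration on B^{-1} with B = shift*I - Sigma. When `shift`
    slightly exceeds lambda_1(Sigma), B^{-1} has a large relative eigengap,
    so few iterations suffice; each iteration is one linear solve."""
    d = Sigma.shape[0]
    B = shift * np.eye(d) - Sigma
    w = np.ones(d) / np.sqrt(d)
    for _ in range(n_iters):
        w = np.linalg.solve(B, w)      # stand-in for an approximate solver
        w /= np.linalg.norm(w)
    return w
```

With shift = λ₁ + O(gap·λ₁), the two largest eigenvalues of B⁻¹ differ by a constant factor, so convergence no longer depends on the original gap; the cost moves into solving the (ill-conditioned but structured) linear systems.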


Q. Huang and S. M. Kakade, Super-Resolution Off the Grid. NIPS: ArXiv Report, 2015. Publisher's Version. Abstract:
Super-resolution is the problem of recovering a superposition of point sources using bandlimited measurements, which may be corrupted with noise. This signal processing problem arises in numerous imaging problems, ranging from astronomy to biology to spectroscopy, where it is common to take (coarse) Fourier measurements of an object. Of particular interest is obtaining estimation procedures which are robust to noise, with the following desirable statistical and computational properties: we seek to use coarse Fourier measurements (bounded by some cutoff frequency); we hope to take a (quantifiably) small number of measurements; we desire our algorithm to run quickly. Suppose we have k point sources in d dimensions, where the points are separated by at least Δ from each other (in Euclidean distance). This work provides an algorithm with the following favorable guarantees: (1) The algorithm uses Fourier measurements whose frequencies are bounded by O(1/Δ) (up to log factors); previous algorithms require a cutoff frequency which may be as large as Ω(d/Δ). (2) The number of measurements taken by, and the computational complexity of, our algorithm are bounded by a polynomial in both the number of points k and the dimension d, with no dependence on the separation Δ; in contrast, previous algorithms depended inverse-polynomially on the minimal separation and exponentially on the dimension for both of these quantities. Our estimation procedure itself is simple: we take random bandlimited measurements (as opposed to taking an exponential number of measurements on the hyper-grid). Furthermore, our analysis and algorithm are elementary (based on concentration bounds for sampling and the singular value decomposition).
R. Frostig, R. Ge, S. M. Kakade, and A. Sidford, Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. ICML: ArXiv Report, 2015. Publisher's Version. Abstract:
We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several algorithms that reduce the minimization of a strongly convex function to approximate minimizations of regularizations of the function. Using these results, we accelerate recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original problem.
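The framework's core loop, approximately minimizing a sequence of strongly convex regularized subproblems and re-centering, can be sketched as follows. The inner solver here is plain gradient descent purely for illustration; the point of the paper is that any fast stochastic solver can be plugged in as a black box, and every step count below is an assumption of ours.

```python
import numpy as np

def approx_proximal_point(grad_f, x0, lam=1.0, inner_lr=0.1,
                          inner_steps=50, outer_steps=20):
    """Approximate proximal point: repeatedly minimize (approximately)
    the lam-strongly-convex subproblem f(x) + (lam/2)*||x - x_t||^2,
    then re-center at the approximate minimizer."""
    x = np.array(x0, dtype=float)
    for _ in range(outer_steps):
        center = x.copy()
        y = x.copy()
        for _ in range(inner_steps):
            # gradient of the regularized subproblem
            y = y - inner_lr * (grad_f(y) + lam * (y - center))
        x = y
    return x
```

The added quadratic term makes each subproblem strongly convex (hence cheap for the inner solver), while re-centering removes the bias that a fixed regularizer would introduce into the original problem.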
K. Chaudhuri, S. Kakade, P. Netrapalli, and S. Sanghavi, Convergence Rates of Active Learning for Maximum Likelihood Estimation. NIPS: ArXiv Report, 2015. Publisher's Version. Abstract:
An active learner is given a class of models, a large set of unlabeled examples, and the ability to interactively query labels of a subset of these examples; the goal of the learner is to learn a model in the class that fits the data well. Previous theoretical work has rigorously characterized the label complexity of active learning, but most of this work has focused on the PAC or the agnostic PAC model. In this paper, we shift our attention to a more general setting: maximum likelihood estimation. Provided certain conditions hold on the model class, we provide a two-stage active learning algorithm for this problem. The conditions we require are fairly general, and cover the widely popular class of Generalized Linear Models, which in turn include models for binary and multi-class classification, regression, and conditional random fields. We provide an upper bound on the label requirement of our algorithm, and a lower bound that matches it up to lower order terms. Our analysis shows that unlike binary classification in the realizable case, just a single extra round of interaction is sufficient to achieve near-optimal performance in maximum likelihood estimation. On the empirical side, the recent work in [Zhang12] and [Zhang14] (on active linear and logistic regression) shows the promise of this approach.
D. Belanger and S. M. Kakade, A Linear Dynamical System Model for Text. ICML: ArXiv Report, 2015. Publisher's Version. Abstract:
Low dimensional representations of words allow accurate NLP models to be trained on limited annotated data. While most representations ignore words' local context, a natural way to induce context-dependent representations is to perform inference in a probabilistic latent-variable sequence model. Given the recent success of continuous vector space word representations, we provide such an inference procedure for continuous states, where words' representations are given by the posterior mean of a linear dynamical system. Here, efficient inference can be performed using Kalman filtering. Our learning algorithm is extremely scalable, operating on simple cooccurrence counts for both parameter initialization using the method of moments and subsequent iterations of EM. In our experiments, we employ our inferred word embeddings as features in standard tagging tasks, obtaining significant accuracy improvements. Finally, the Kalman filter updates can be seen as a linear recurrent neural network. We demonstrate that using the parameters of our model to initialize a non-linear recurrent neural network language model reduces its training time by a day and yields lower perplexity.
Q. Huang, R. Ge, and S. M. Kakade, Learning Mixtures of Gaussians in High Dimensions. STOC: ArXiv Report, 2015. Publisher's VersionAbstract
Efficiently learning mixtures of Gaussians is a fundamental problem in statistics and learning theory. Given samples coming from a random one out of k Gaussian distributions in Rn, the learning problem asks to estimate the means and the covariance matrices of these Gaussians. This learning problem arises in many areas ranging from the natural sciences to the social sciences, and has also found many machine learning applications. Unfortunately, learning mixtures of Gaussians is an information-theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is exponential in the number of Gaussian components in the worst case. In this work, we show that provided we are in high enough dimensions, the class of Gaussian mixtures is learnable in its most general form under a smoothed analysis framework, where the parameters are randomly perturbed from an adversarial starting point. In particular, given samples from a mixture of Gaussians with randomly perturbed parameters, when n > Ω(k^2), we give an algorithm that learns the parameters with polynomial running time and a polynomial number of samples. The central algorithmic ideas consist of new ways to decompose the moment tensor of the Gaussian mixture by exploiting its structural properties. The symmetries of this tensor are derived from the combinatorial structure of higher order moments of Gaussian distributions (sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop new tools for bounding smallest singular values of structured random matrices, which could be useful in other smoothed analysis settings.
R. Frostig, R. Ge, S. M. Kakade, and A. Sidford, Competing with the Empirical Risk Minimizer in a Single Pass. COLT: ArXiv Report, 2015. Publisher's VersionAbstract
In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or the M-estimator -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties:
* The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
* The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors.
* The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.
* The algorithm is easily parallelizable.
Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM.
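The single-pass, constant-memory flavor of algorithm described above can be illustrated with a minimal averaged-SGD sketch for streaming least squares. This is a toy under simplified assumptions, not the paper's exact algorithm; the step size, problem sizes, and function name are illustrative choices:

```python
import numpy as np

def streaming_least_squares(X, y, lr=0.05):
    """One pass of SGD with iterate averaging (Polyak-Ruppert).

    Each sample is used exactly once, and memory is O(d): one current
    iterate and one running average, mirroring the streaming setting."""
    d = X.shape[1]
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t, (x, target) in enumerate(zip(X, y), start=1):
        grad = (x @ w - target) * x       # gradient of 0.5 * (x.w - y)^2
        w -= lr * grad
        w_avg += (w - w_avg) / t          # running average of the iterates
    return w_avg

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(5000, 3))
y = X @ w_true + 0.1 * rng.normal(size=5000)
w_hat = streaming_least_squares(X, y)
```

The averaging step is what gives such schemes their favorable statistical rates; the raw SGD iterate alone would oscillate at a scale set by the step size.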


A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, Tensor decompositions for learning latent variable models. JMLR: ArXiv Report, 2014. Publisher's VersionAbstract
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
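The tensor power update at the heart of this approach can be sketched on a toy orthogonally decomposable tensor (the helper name and toy sizes are ours; the paper's robust variant additionally handles perturbations and deflation):

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """Tensor power update u <- T(I, u, u) / ||T(I, u, u)|| for a symmetric
    3-tensor. For an orthogonal decomposition T = sum_i lam_i v_i^(x3),
    the components v_i are the stable fixed points."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = np.einsum('ijk,j,k->i', T, u, u)   # contract T with u twice
        u /= np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # recovered weight
    return lam, u

# Toy orthogonally decomposable tensor: components e1, e2, weights 2 and 1.
d = 4
v1, v2 = np.eye(d)[0], np.eye(d)[1]
T = 2.0 * np.einsum('i,j,k->ijk', v1, v1, v1) \
  + 1.0 * np.einsum('i,j,k->ijk', v2, v2, v2)
lam, u = tensor_power_iteration(T)
```

Unlike the matrix case, convergence here is quadratic near a component, which is part of why the power method is attractive for these structured tensors.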
A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, Least Squares Revisited: Scalable Approaches for Multi-class Prediction. ICML: ArXiv Report, 2014. Publisher's VersionAbstract
This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust and parameter free algorithms are essentially iterative least-squares updates and very versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees. Owing to their effective use of second-order structure, these algorithms are substantially better than first-order methods in many practical scenarios. On the empirical side, we present a scalable stagewise variant of our approach, which achieves dramatic computational speedups over popular optimization packages such as Liblinear and Vowpal Wabbit on standard datasets (MNIST and CIFAR-10), while attaining state-of-the-art accuracies.
D. Hsu, S. M. Kakade, and T. Zhang, An Analysis of Random Design Linear Regression. Foundations of Computational Mathematics: ArXiv Report, 2014. Publisher's VersionAbstract
This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the ``out-of-sample'' prediction error, as opposed to the ``in-sample'' (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.


A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade, A Tensor Spectral Approach to Learning Mixed Membership Community Models. COLT: ArXiv Report, 2013. Publisher's VersionAbstract
Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed the mixed membership Dirichlet model, first introduced by Airoldi et al. This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on the low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model.
A. Anandkumar, D. Hsu, M. Janzamin, and S. M. Kakade, When are overcomplete topic models identifiable? Uniqueness of tensor Tucker decompositions with structured sparsity. NIPS: ArXiv Report, 2013. Publisher's VersionAbstract
Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of "higher order" expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allows for overcomplete models, and requires the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition.
A. Anandkumar, D. Hsu, A. Javanmard, and S. M. Kakade, Learning Linear Bayesian Networks with Latent Variables. Algorithmica special issue on Machine Learning (2014), ICML: ArXiv Report, 2013. Publisher's VersionAbstract
Unsupervised estimation of latent variable models is a fundamental problem central to numerous applications of machine learning and statistics. This work presents a principled approach for estimating broad classes of such models, including probabilistic topic models and latent linear Bayesian networks, using only second-order observed moments. The sufficient conditions for identifiability of these models are primarily based on weak expansion constraints on the topic-word matrix, for topic models, and on the directed acyclic graph, for Bayesian networks. Because no assumptions are made on the distribution among the latent variables, the approach can handle arbitrary correlations among the topics or latent factors. In addition, a tractable learning method via ℓ1 optimization is proposed and studied in numerical experiments.
P. Dhillon, D. P. Foster, S. M. Kakade, and L. Ungar, A Risk Comparison of Ordinary Least Squares vs Ridge Regression. JMLR: ArXiv Report, 2013. Publisher's VersionAbstract
We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a Principal Component Analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method is within a constant factor (namely 4) of the risk of ridge regression.
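The projected-then-OLS estimator compared against ridge here is essentially principal component regression; a minimal sketch follows (the function name `pcr` and the demo sizes are ours, and this illustrates the estimator only, not the note's risk analysis):

```python
import numpy as np

def pcr(X, y, k):
    """Project the data onto the top-k principal directions (right singular
    vectors of X), then run un-regularized OLS in that k-dim subspace."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                          # d x k PCA basis
    beta_sub, *_ = np.linalg.lstsq(X @ V_k, y, rcond=None)
    return V_k @ beta_sub                   # coefficients in the original space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=100)
beta_pcr = pcr(X, y, k=3)                   # with k = d this is plain OLS
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Taking k equal to the full dimension recovers ordinary least squares exactly; the interesting regime for the risk comparison is k < d, where the PCA truncation plays the regularizing role that the ridge penalty plays for ridge regression.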
S. M. Kakade, I. Lobel, and H. Nazerzadeh, Optimal Dynamic Mechanism Design and the Virtual Pivot Mechanism. Operations Research: SSRN Report, 2013. Publisher's VersionAbstract
We consider the problem of designing optimal mechanisms for settings where agents have dynamic private information. We present the Virtual-Pivot Mechanism, which is optimal in a large class of environments that satisfy a separability condition. The mechanism satisfies a rather strong equilibrium notion (it is periodic ex-post incentive compatible and individually rational). We provide both necessary and sufficient conditions for immediate incentive compatibility for mechanisms that satisfy periodic ex-post incentive compatibility in future periods. The result also yields a strikingly simple mechanism for selling a sequence of items to a single buyer. We also show that the allocation rule of the Virtual-Pivot Mechanism has a very simple structure (a Virtual Index) in multi-armed bandit settings. Finally, we show through examples that the relaxation technique we use does not produce optimal dynamic mechanisms in general non-separable environments.
D. Hsu and S. M. Kakade, Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions. 4th Innovations in Theoretical Computer Science (ITCS): ArXiv Report, 2013. Publisher's VersionAbstract
This work provides a computationally efficient and statistically consistent moment-based estimator for mixtures of spherical Gaussians. Under the condition that component means are in general position, a simple spectral decomposition technique yields consistent parameter estimates from low-order observable moments, without additional minimum separation assumptions needed by previous computationally efficient estimation procedures. Thus computational and information-theoretic barriers to efficient estimation in mixture models are precluded when the mixture components have means in general position and spherical covariances. Some connections are made to estimation problems related to independent component analysis.
A. Agarwal, D. P. Foster, D. Hsu, S. M. Kakade, and A. Rakhlin, Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization (SIOPT), 2013 and NIPS 2011: ArXiv Report, 2013. Publisher's VersionAbstract
This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f(x) at any query point x ∈ X. The quantity of interest is the regret of the algorithm, which is the sum of the function values at the algorithm's query points minus the optimal function value. We demonstrate a generalization of the ellipsoid algorithm that incurs Õ(poly(d)√T) regret. Since any algorithm has regret at least Ω(√T) on this problem, our algorithm is optimal in terms of the scaling with T.


S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Regularization Techniques for Learning with Matrices,” Journal of Machine Learning Research, vol. 13, no. 59. pp. 1865-1890, 2012. Publisher's VersionAbstract
There is a growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning.
D. Hsu, S. M. Kakade, and T. Zhang, Errata: Analysis of a randomized approximation scheme for matrix multiplication. ArXiv Report, 2012. Publisher's VersionAbstract

This note gives a simple analysis of a randomized approximation scheme for matrix multiplication proposed by Sarlos (2006) based on a random rotation followed by uniform column sampling. The result follows from a matrix version of Bernstein's inequality and a tail inequality for quadratic forms in subgaussian random vectors.



The analysis in Example 4.3 is flawed due to an incorrect bound on

  \|E[X_j^2]\|_2 .

This can be fixed by introducing a factor of

  max_i { \|a_i\|_2 / \|b_i\|_2 , \|b_i\|_2 / \|a_i\|_2 }

in the bound, which ultimately leads to a significantly worse result.

(Note: There is no problem if A = B.)

Thanks to Joel Tropp for pointing out this bug.

A slightly different algorithm avoids this issue and leads to the same running time bound; see

for details.

DH 2012-11-25


A slightly different sampling scheme also fixes the problem; see

for details.

DH 2014-10-17


A. Anandkumar, D. Hsu, and S. M. Kakade, A Method of Moments for Mixture Models and Hidden Markov Models. Conference on Learning Theory (COLT): ArXiv Report, 2012. Publisher's VersionAbstract
Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexities, which typically scale exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment.
A. Anandkumar, F. Huang, D. Hsu, and S. M. Kakade, Learning High-Dimensional Mixtures of Graphical Models. Neural Information Processing Systems (NIPS): ArXiv Report, 2012. Publisher's VersionAbstract
We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable corresponding to the mixture components is hidden and each mixture component over the observed variables can have a potentially different Markov graph structure and parameters. We propose a novel approach for estimating the mixture components, and our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. Our method is efficient when the union graph, which is the union of the Markov graphs of the mixture components, has sparse vertex separators between any pair of observed variables. This includes tree mixtures and mixtures of bounded degree graphs. For such models, we prove that our method correctly recovers the union graph structure and the tree structures corresponding to maximum-likelihood tree approximations of the mixture components. The sample and computational complexities of our method scale as poly(p, r), for an r-component mixture of p-variate graphical models. We further extend our results to the case when the union graph has sparse local separators between any pair of observed variables, such as mixtures of locally tree-like graphs, and the mixture components are in the regime of correlation decay.
D. Hsu, S. M. Kakade, and P. Liang, Identifiability and Unmixing of Latent Parse Trees. Neural Information Processing Systems (NIPS): ArXiv Report, 2012. Publisher's VersionAbstract
This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models.
D. P. Foster, S. M. Kakade, and R. Salakhutdinov, Domain Adaptation: Overfitting and Small Sample Statistics. AISTAT: ArXiv Report, 2012. Publisher's VersionAbstract
We study the prevalent problem in which a test distribution differs from the training distribution. We consider a setting where our training set consists of a small number of sample domains, but where we have many samples in each domain. Our goal is to generalize to a new domain. For example, we may want to learn a similarity function using only certain classes of objects, but we desire that this similarity function be applicable to object classes not present in our training sample (e.g. we might seek to learn that "dogs are similar to dogs" even though images of dogs were absent from our training set). Our theoretical analysis shows that we can select many more features than domains while avoiding overfitting by utilizing data-dependent variance properties. We present a greedy feature selection algorithm based on using T-statistics. Our experiments validate this theory showing that our T-statistic based greedy feature selection is more robust at avoiding overfitting than the classical greedy procedure.
A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y. - K. Liu, A Spectral Algorithm for Latent Dirichlet Allocation. Neural Information Processing Systems (NIPS): ArXiv Report, 2012. Publisher's VersionAbstract
The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on k×k matrices, where k is the number of latent factors (e.g. the number of topics), rather than in the d-dimensional observed space (typically d≫k).
D. Hsu, S. M. Kakade, and T. Zhang, An Analysis of Random Design Linear Regression. Conference on Learning Theory (COLT): ArXiv Report, 2012. Publisher's VersionAbstract
This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the ``out-of-sample'' prediction error, as opposed to the ``in-sample'' (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.
S. Bubeck, N. Cesa-Bianchi, and S. M. Kakade, Towards minimax policies for online linear optimization with bandit feedback. Conference on Learning Theory (COLT): ArXiv Report, 2012. Publisher's VersionAbstract
We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order √(dn log N) for any finite action set with N actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous √d factor compared to previous works, and gives a regret bound of order √(dn log n) for any compact set of actions. Without further assumptions on the action set, this last bound is minimax optimal up to a logarithmic factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization with expert advice in d dimensions is the same as for the basic d-armed bandit with expert advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain computationally efficient strategies with minimax optimal regret bounds in specific examples. More precisely, we study two canonical action sets: the hypercube and the Euclidean ball. In the former case, we obtain the first computationally efficient algorithm with a d√n regret, thus improving by a factor √(d log n) over the best known result for a computationally efficient algorithm. In the latter case, our approach gives the first algorithm with a √(dn log n) regret, again shaving off an extraneous √d compared to previous works.
E. Hazan and S. M. Kakade, (weak) Calibration is Computationally Hard. Conference on Learning Theory (COLT): ArXiv Report, 2012. Publisher's VersionAbstract
We show that the existence of a computationally efficient calibration algorithm, with a low weak calibration rate, would imply the existence of an efficient algorithm for computing approximate Nash equilibria - thus implying the unlikely conclusion that every problem in PPAD is solvable in polynomial time.
D. Hsu, S. M. Kakade, and T. Zhang, Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability: ArXiv Report, 2012. Publisher's VersionAbstract
We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and approximate matrix multiplication are given to illustrate the utility of the new bounds.


A. Anandkumar, K. Chaudhuri, D. Hsu, S. M. Kakade, L. Song, and T. Zhang, Spectral Methods for Learning Multivariate Latent Tree Structure. NIPS 2011: ArXiv Report, 2011. Publisher's VersionAbstract
This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics.
J. Blitzer, D. P. Foster, and S. M. Kakade, Domain Adaptation with Coupled Subspaces. AISTAT 2011: , 2011. Publisher's VersionAbstract
Domain adaptation algorithms address a key issue in applied machine learning: How can we train a system under a source distribution but achieve high performance under a different target distribution? We tackle this question for divergent distributions where crucial predictive target features may not even have support under the source distribution. In this setting, the key intuition is that if we can link target-specific features to source features, we can learn effectively using only source labeled data. We formalize this intuition, as well as the assumptions under which such coupled learning is possible. This allows us to give finite sample target error bounds (using only source training data) and an algorithm which performs at the state-of-the-art on two natural language processing adaptation tasks which are characterized by novel target features.
S. M. Kakade, A. T. Kalai, V. Kanade, and O. Shamir, Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression. NIPS 2011: ArXiv Report, 2011. Publisher's VersionAbstract
Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) recently provided the first provably efficient method for learning SIMs and GLMs, under the assumptions that the data are in fact generated under a GLM and under certain monotonicity and Lipschitz constraints. However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We also provide an empirical study, demonstrating their feasibility in practice.
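The flavor of update studied here, a perceptron-like step taken through a known monotone link, can be sketched as follows. This is a simplified noiseless toy with an illustrative function name, not the paper's exact algorithm or its statistical analysis:

```python
import numpy as np

def glm_update_fit(X, y, link, n_iter=500):
    """Perceptron-like updates through a known monotone link u:
    w += mean over samples of (y_i - u(w . x_i)) x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        w = w + X.T @ (y - link(X @ w)) / n   # average residual-weighted step
    return w

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(1)
w_true = np.array([1.5, -1.0])
X = rng.normal(size=(2000, 2))
y = sigmoid(X @ w_true)            # noiseless GLM responses for illustration
w_hat = glm_update_fit(X, y, sigmoid)
```

The appeal of such updates is that each step only needs evaluations of the link, avoiding the non-convex estimation procedures the abstract mentions; the paper's contribution is making this style of method provably efficient without a fresh sample per iteration.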
D. Hsu, S. M. Kakade, and T. Zhang, Robust matrix decomposition with outliers. IEEE Transactions on Information Theory: ArXiv Report, 2011. Publisher's VersionAbstract
Suppose a given observation matrix can be decomposed as the sum of a low-rank matrix and a sparse matrix (outliers), and the goal is to recover these individual components from the observed sum. Such additive decompositions have applications in a variety of numerical problems including system identification, latent variable graphical modeling, and principal components analysis. We study conditions under which recovering such a decomposition is possible via a combination of ℓ1 norm and trace norm minimization. We are specifically interested in the question of how many outliers are allowed so that convex programming can still achieve accurate recovery, and we obtain stronger recovery guarantees than previous studies. Moreover, we do not assume that the spatial pattern of outliers is random, which stands in contrast to related analyses under such assumptions via matrix completion.


N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, IEEE Transactions on Information Theory: ArXiv Report, 2010. Publisher's VersionAbstract
Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches.
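One acquisition step of the GP-UCB rule, picking the point that maximizes posterior mean plus a scaled posterior standard deviation, might be sketched as follows. The RBF kernel, the confidence parameter beta, the noise level, and the function name are all illustrative choices, not the paper's specific schedule:

```python
import numpy as np

def gp_ucb_step(X_obs, y_obs, X_cand, beta=4.0, noise=0.1, ls=1.0):
    """One GP-UCB acquisition under an RBF-kernel GP prior: return the index
    of the candidate maximizing posterior mean + sqrt(beta) * posterior std."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls ** 2)
    K = k(X_obs, X_obs) + noise ** 2 * np.eye(len(X_obs))
    Ks = k(X_cand, X_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)                   # posterior mean
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    ucb = mu + np.sqrt(beta * np.maximum(var, 0.0))       # optimism bonus
    return int(np.argmax(ucb))

# With one good observation at x=0, the rule still explores the distant,
# highly uncertain candidate at x=5 rather than re-querying near x=0.
pick = gp_ucb_step(np.array([[0.0]]), np.array([1.0]),
                   np.array([[0.0], [5.0]]))
```

The uncertainty bonus is exactly what ties the regret analysis to information gain: points are queried when their posterior variance, and hence their potential information, is large.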
A. Strehl, J. Langford, L. Li, and S. Kakade, Learning from Logged Implicit Exploration Data. NIPS, 2010.
We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, under which the "offline" data was logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
D. Hsu, S. M. Kakade, J. Langford, and T. Zhang, Multi-Label Prediction via Compressed Sensing. NIPS, 2009.
We consider multi-label prediction problems with large output spaces under the assumption of output sparsity -- that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting.
S. M. Kakade, O. Shamir, K. Sridharan, and A. Tewari, Learning exponential families in high-dimensions: Strong convexity and sparsity. AISTATS, 2010.
The versatility of exponential families, along with their attendant convexity properties, makes them a popular and effective statistical model. A central issue is learning these models in high dimensions, such as when there is some sparsity pattern of the optimal parameter. This work characterizes a certain strong convexity property of general exponential families, which allows their generalization ability to be quantified. In particular, we show how this property can be used to analyze generic exponential families under ℓ1 regularization.


D. Hsu, S. M. Kakade, and T. Zhang, A Spectral Algorithm for Learning Hidden Markov Models. COLT, 2009.
Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations---it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing, where the observation space can be the entire vocabulary of a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications.
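The population (exact-moment) version of the spectral construction is short enough to write out. The following numpy sketch builds the low-order moment matrices of a small synthetic HMM in closed form, runs the SVD-based construction, and checks the resulting observable-operator recursion against the classical forward algorithm; the tiny HMM and its parameters are of course illustrative, and in practice the moments are estimated from observation triples.

```python
import numpy as np

# A toy HMM: 2 hidden states, 3 observations.
T = np.array([[0.7, 0.2],      # T[i, j] = P(h' = i | h = j)
              [0.3, 0.8]])
O = np.array([[0.5, 0.1],      # O[x, j] = P(obs = x | h = j)
              [0.4, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])

# Exact low-order moments (estimated from sampled triples in practice).
P1 = O @ pi
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T for x in range(3)]

# Spectral step: U spans the range of the rank-2 moment matrix P21.
U = np.linalg.svd(P21)[0][:, :2]
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(3)]

def spectral_prob(seq):
    # Observable-operator recursion: Pr[x_1..x_t] = binf' B_{x_t} ... B_{x_1} b1.
    b = b1
    for x in seq:
        b = B[x] @ b
    return float(binf @ b)

def forward_prob(seq):
    # Classical forward algorithm, for comparison.
    a = O[seq[0]] * pi
    for x in seq[1:]:
        a = O[x] * (T @ a)
    return float(a.sum())
```

With exact moments the two routines agree to machine precision; with estimated moments the spectral estimate converges as the number of sampled triples grows.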
E. Even-Dar, S. M. Kakade, and Y. Mansour, Online Markov Decision Processes. Mathematics of Operations Research, 2009.
We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions.
N. Devanur and S. M. Kakade, The Price of Truthfulness for Pay-Per-Click Auctions. ACM Conference on Electronic Commerce, 2009.
We analyze the problem of designing a truthful pay-per-click auction where the click-through rates (CTRs) of the bidders are unknown to the auction. Such an auction faces the classic explore/exploit dilemma: while gathering information about the click-through rates of advertisers, the mechanism may lose revenue; however, this gleaned information may prove valuable in the future for a more profitable allocation. In this sense, such mechanisms are prime candidates to be designed using multi-armed bandit techniques. However, a naive application of multi-armed bandit algorithms would not take into account the strategic considerations of the players -- players might manipulate their bids (which determine the auction's revenue) so as to maximize their own utility. Hence, we consider the natural restriction that the auction be truthful. The revenue that we could hope to achieve is the expected revenue of a Vickrey auction that knows the true CTRs, and we define the truthful regret to be the difference between the expected revenue of the auction and this Vickrey revenue. This work sharply characterizes what regret is achievable under a truthful restriction. We show that this truthful restriction imposes statistical limits on the achievable regret -- the achievable regret is Θ(T^{2/3}), while for traditional bandit algorithms (without the truthful restriction) the achievable regret is Θ(T^{1/2}) (where T is the number of rounds). We term the extra T^{1/6} factor the `price of truthfulness'.
S. M. Kakade and S. Shalev-Shwartz, Mind the Duality Gap: Logarithmic regret algorithms for online optimization. NIPS Proceedings, 2009.
We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed by Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions.
S. M. Kakade and A. Tewari, On the Generalization Ability of Online Strongly Convex Programming Algorithms. Proceedings of NIPS, 2009.
This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, holding with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. The bound also solves an open problem regarding the convergence rate of Pegasos, a recently proposed method for solving the SVM optimization problem.
S. M. Kakade, K. Sridharan, and A. Tewari, On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. Proceedings of NIPS, 2009.
We provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes. These bounds make short work of providing a number of corollaries, including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either ℓ1 or ℓ2 constraints), margin bounds (including both ℓ1 and ℓ2 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and covering numbers (with norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds (improving upon a number of previous results). Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction (up to a constant factor of 2).
K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, Multi-View Clustering via Canonical Correlation Analysis. ICML, 2009; also TTI-C Tech Report TTI-TR-2008-5.
Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA). Under the assumption that the views are uncorrelated given the cluster label, we show that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature. We provide results for mixtures of Gaussians and mixtures of log concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).
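A minimal two-view, two-cluster illustration of the CCA-then-cluster idea follows. Everything here is synthetic and schematic: CCA is computed in plain numpy via an SVD of the whitened cross-covariance, and a simple median split on the top canonical variate stands in for a full clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 600, 20
labels = rng.integers(0, 2, n)
mu1 = rng.standard_normal((2, d))   # per-cluster means, view one
mu2 = rng.standard_normal((2, d))   # per-cluster means, view two
# Given the cluster label, the two views are independent (the multi-view assumption):
X1 = mu1[labels] + 2.0 * rng.standard_normal((n, d))
X2 = mu2[labels] + 2.0 * rng.standard_normal((n, d))

def cca_top_direction(A, B, reg=1e-3):
    # Top canonical direction for view A via SVD of the whitened cross-covariance.
    A = A - A.mean(0)
    B = B - B.mean(0)
    Ca = A.T @ A / len(A) + reg * np.eye(A.shape[1])
    Cb = B.T @ B / len(B) + reg * np.eye(B.shape[1])
    Wa = np.linalg.inv(np.linalg.cholesky(Ca)).T   # whitening transform for A
    Wb = np.linalg.inv(np.linalg.cholesky(Cb)).T   # whitening transform for B
    Uc, _, _ = np.linalg.svd(Wa.T @ (A.T @ B / len(A)) @ Wb)
    return Wa @ Uc[:, 0]

w = cca_top_direction(X1, X2)
proj = (X1 - X1.mean(0)) @ w                  # one-dimensional CCA projection
pred = (proj > np.median(proj)).astype(int)   # cluster by splitting at the median
acc = max((pred == labels).mean(), 1 - (pred == labels).mean())
```

Because the only correlation between the views is carried by the cluster label, the top canonical direction aligns with the between-cluster mean difference, so even this one-dimensional projection separates the clusters well.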


V. Dani, T. Hayes, and S. M. Kakade, “Stochastic Linear Optimization under Bandit Feedback,” 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, pp. 355-366, 2008.
In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer (2002)), we view the arms as vectors in Rn, and require that the costs be linear functions of the chosen vector. As before, it is assumed that the cost functions are sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds T and the dimensionality n (rather than the number of arms, k, which may be exponential in n or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds for the regret. In certain special cases (such as when the decision region is a polytope), the regret is polylog(T). In general, though, the optimal regret is Θ(√T) — our lower bounds rule out the possibility of obtaining polylog(T) rates in general. We present two variants of an algorithm based on the idea of "upper confidence bounds." The first, due to Auer (2002), but not fully analyzed, obtains regret whose dependence on n and T are both essentially optimal, but which may be computationally intractable when the decision set is a polytope. The second version can be efficiently implemented when the decision set is a polytope (given as an intersection of half-spaces), but gives up a factor of √n in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.
K. Sridharan and S. M. Kakade, “An Information Theoretic Framework for Multi-view Learning,” 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, pp. 403-414, 2008.
In the multi-view learning paradigm, the input variable is partitioned into two different views X1 and X2 and there is a target variable Y of interest. The underlying assumption is that either view alone is sufficient to predict the target Y accurately. This provides a natural semi-supervised learning setting in which unlabeled data can be used to eliminate hypotheses from either view whose predictions tend to disagree with predictions based on the other view. This work explicitly formalizes an information theoretic, multi-view assumption and studies the multi-view paradigm in the PAC style semi-supervised framework of Balcan and Blum [2006]. Underlying the PAC style framework is that an incompatibility function is assumed to be known — roughly speaking, this incompatibility function is a means to score how good a function is based on the unlabeled data alone. Here, we show how to derive incompatibility functions for certain loss functions of interest, so that minimizing this incompatibility over unlabeled data helps reduce expected loss on future test cases. In particular, we show how the class of empirically successful co-regularization algorithms falls into our framework and provide performance bounds (using the results in Rosenberg and Bartlett [2007], Farquhar et al. [2005]). We also provide a normative justification for canonical correlation analysis (CCA) as a dimensionality reduction technique. In particular, we show (for strictly convex loss functions of the form ℓ(w·x, y)) that we can first use CCA as a dimensionality reduction technique and (if the multi-view assumption is satisfied) this projection does not throw away much predictive information about the target Y — the benefit being that subsequent learning with a labeled set need only work in this lower dimensional space.
P. Bartlett, V. Dani, T. Hayes, S. Kakade, A. Rakhlin, and A. Tewari, “High-Probability Regret Bounds for Bandit Online Linear Optimization,” 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, pp. 335-342, 2008.
We present a modification of the algorithm of Dani et al. (9) for the online linear optimization problem in the bandit setting, which with high probability has regret at most O*(√T) against an adaptive adversary. This improves on the previous algorithm (9), whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension (n^{3/2}) as that exhibited by Dani et al. The results of this paper rest firmly on those of (9) and the remarkable technique of Auer et al. (2) for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs. bandit settings.
S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Efficient Bandit Algorithms for Online Multiclass Prediction,” ICML '08: Proceedings of the 25th international conference on Machine learning, pp. 440–447, 2008.
This paper introduces the Banditron, a variant of the Perceptron [Rosenblatt, 1958], for the multiclass bandit setting. The multiclass bandit setting models a wide range of practical supervised learning applications where the learner only receives partial feedback (referred to as "bandit" feedback, in the spirit of multi-armed bandit models) with respect to the true label (e.g. in many web applications users often only provide positive "click" feedback which does not necessarily fully disclose a true label). The Banditron has the ability to learn in a multiclass classification setting with "bandit" feedback which only reveals whether the prediction made by the algorithm was correct (but does not necessarily reveal the true label). We provide (relative) mistake bounds which show how the Banditron enjoys favorable performance, and our experiments demonstrate the practicality of the algorithm. Furthermore, this paper pays close attention to the important special case when the data is linearly separable --- a problem which has been exhaustively studied in the full information setting yet is novel in the bandit setting.
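A compact numpy sketch of the Banditron loop on synthetic linearly separable data; the dimensions, exploration rate γ = 0.05, and the held-out evaluation are illustrative choices. The key point is that the update uses only the binary feedback "was the sampled prediction correct?" yet is, in expectation, the multiclass Perceptron update.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, T = 3, 10, 4000
W_star = rng.standard_normal((k, d))          # hidden linear concept
X = rng.standard_normal((T, d))
Y = (X @ W_star.T).argmax(1)                  # separable multiclass labels

W = np.zeros((k, d))
gamma = 0.05                                  # exploration rate
for x, y in zip(X, Y):
    yhat = int((W @ x).argmax())              # greedy prediction
    p = np.full(k, gamma / k)
    p[yhat] += 1.0 - gamma                    # mostly exploit, explore w.p. gamma
    ytil = int(rng.choice(k, p=p))            # the label actually predicted
    # Bandit feedback: we only learn whether ytil == y, never y itself.
    # Importance-weighted update, unbiased for the Perceptron update:
    W[yhat] -= x
    if ytil == y:
        W[ytil] += x / p[ytil]

# Held-out accuracy of the learned greedy predictor.
X_test = rng.standard_normal((1000, d))
acc = float(((X_test @ W.T).argmax(1) == (X_test @ W_star.T).argmax(1)).mean())
```

The 1/p[ytil] importance weight is what makes the update unbiased, at the cost of extra variance; this is exactly the exploration/variance trade-off governed by γ.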
M. Seeger, S. M. Kakade, and D. P. Foster, “Information Consistency of Nonparametric Gaussian Process Methods,” IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 2376-2382, 2008.
Bayesian nonparametric models are widely and successfully used for statistical prediction. While posterior consistency properties are well studied in quite general settings, results have been proved using abstract concepts such as metric entropy, and they come with subtle conditions which are hard to validate and not intuitive when applied to concrete models. Furthermore, convergence rates are difficult to obtain. By focusing on the concept of information consistency for Bayesian Gaussian process (GP) models, consistency results and convergence rates are obtained via a regret bound on cumulative log loss. These results depend strongly on the covariance function of the prior process, thereby giving a novel interpretation to penalization with reproducing kernel Hilbert space norms and to commonly used covariance function classes and their parameters. The proof of the main result employs elementary convexity arguments only. A theorem of Widom is used in order to obtain precise convergence rates for several covariance functions widely used in practice.
S. M. Kakade and D. P. Foster, “Deterministic Calibration and Nash Equilibrium,” Journal of Computer and System Sciences (Learning Theory Special Issue), vol. 74, no. 1, pp. 115-130, 2008.
We provide a natural learning process in which the joint frequency of (time-averaged) empirical play converges into the set of convex combinations of Nash equilibria. Furthermore, the actual distribution of players' actions is close to some (approximate) Nash equilibria on most rounds (on all but a vanishing fraction of the rounds). In this process, all players rationally choose their actions using a public prediction made by a deterministic, weakly calibrated algorithm. For this to be possible, we show that such a deterministic (weakly) calibrated learning algorithm exists.
V. Dani, T. Hayes, and S. M. Kakade, “The Price of Bandit Information for Online Optimization,” Proceedings of NIPS, pp. 345-352, 2008.
In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O*(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O*(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n) T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems.


D. Ramanan, S. Baker, and S. M. Kakade, Leveraging Archival Video for Building Face Datasets. 2007 IEEE 11th International Conference on Computer Vision (ICCV), 2007.
We introduce a semi-supervised method for building large, labeled datasets of faces by leveraging archival video. Specifically, we have implemented a system for labeling 11 years worth of archival footage from a television show. We have compiled a dataset of 611,770 faces, orders of magnitude larger than existing collections. It includes variation in appearance due to age, weight gain, changes in hairstyles, and other factors difficult to observe in smaller-scale collections. Face recognition in an uncontrolled setting can be difficult. We argue (and demonstrate) that there is much structure at varying timescales in the video data that makes recognition much easier. At local time scales, one can use motion and tracking to group face images together - we may not know the identity, but we know a single label applies to all faces in a track. At medium time scales (say, within a scene), one can use appearance features such as hair and clothing to group tracks across shot boundaries. However, at longer timescales (say, across episodes), one can no longer use clothing as a cue. This suggests that one needs to carefully encode representations of appearance, depending on the timescale at which one intends to match. We assemble our final dataset by classifying groups of tracks in a nearest-neighbors framework. We use a face library obtained by labeling track clusters in a reference episode. We show that this classification is significantly easier when exploiting the hierarchical structure naturally present in the video sequences. From a data-collection point of view, tracking is vital because it adds non-frontal poses to our face collection. This is important because we know of no other method for collecting images of non-frontal faces "in the wild".
S. M. Kakade, A. T. Kalai, and K. Ligett, Playing Games with Approximation Algorithms. STOC '07: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, 2007.
In an online linear optimization problem, on each period t, an online algorithm chooses st ∈ S from a fixed (possibly infinite) set S of feasible decisions. Nature (who may be adversarial) chooses a weight vector wt ∈ Rn, and the algorithm incurs cost c(st,wt), where c is a fixed cost function that is linear in the weight vector. In the full-information setting, the vector wt is then revealed to the algorithm, and in the bandit setting, only the cost experienced, c(st,wt), is revealed. The goal of the online algorithm is to perform nearly as well as the best fixed s ∈ S in hindsight. Many repeated decision-making problems with weights fit naturally into this framework, such as online shortest-path, online TSP, online clustering, and online weighted set cover. Previously, it was shown how to convert any efficient exact offline optimization algorithm for such a problem into an efficient online algorithm, in both the full-information and the bandit settings, with average cost nearly as good as that of the best fixed s ∈ S in hindsight. However, in the case where the offline algorithm is an approximation algorithm with ratio α > 1, the previous approach only worked for special types of approximation algorithms. We show how to convert any offline approximation algorithm for a linear optimization problem into a corresponding online approximation algorithm, with a polynomial blowup in runtime. If the offline algorithm has an α-approximation guarantee, then the expected cost of the online algorithm on any sequence is not much larger than α times that of the best s ∈ S, where the best is chosen with the benefit of hindsight. Our main innovation is combining Zinkevich's algorithm for convex optimization with a geometric transformation that can be applied to any approximation algorithm.
Standard techniques generalize the above result to the bandit setting, except that a "Barycentric Spanner" for the problem is also (provably) necessary as input. Our algorithm can also be viewed as a method for playing large repeated games, where one can only compute approximate best-responses, rather than best-responses.
S. M. Kakade and D. P. Foster, “Multi-View Regression via Canonical Correlation Analysis,” International Conference on Computational Learning Theory (COLT), vol. 4539, pp. 82-96, 2007.
In the multi-view regression problem, we have a regression problem where the input variable (which is a real vector) can be partitioned into two different views, where it is assumed that either view of the input is sufficient to make accurate predictions — this is essentially (a significantly weaker version of) the co-training assumption for the regression problem. We provide a semi-supervised algorithm which first uses unlabeled data to learn a norm (or, equivalently, a kernel) and then uses labeled data in a ridge regression algorithm (with this induced norm) to provide the predictor. The unlabeled data is used via canonical correlation analysis (CCA, which is closely related to PCA for two random variables) to derive an appropriate norm over functions. We are able to characterize the intrinsic dimensionality of the subsequent ridge regression problem (which uses this norm) by the correlation coefficients provided by CCA in a rather simple expression. Interestingly, the norm used by the ridge regression algorithm is derived from CCA, unlike in standard kernel methods where a special a priori norm is assumed (i.e. a Banach space is assumed). We discuss how this result shows that unlabeled data can decrease the sample complexity.
E. Even-Dar, S. M. Kakade, and Y. Mansour, “The value of observation for monitoring dynamic systems,” IJCAI'07: Proceedings of the 20th international joint conference on Artificial intelligence, pp. 2474–2479, 2007.
We consider the fundamental problem of monitoring (i.e. tracking) the belief state in a dynamic system, when the model is only approximately correct and when the initial belief state might be unknown. In this general setting where the model is (perhaps only slightly) mis-specified, monitoring (and consequently planning) may be impossible as errors might accumulate over time. We provide a new characterization, the value of observation, which allows us to bound the error accumulation. The value of observation is a parameter that governs how much information the observations provide. For instance, when this parameter is 1 a Partially Observable MDP (POMDP) is an MDP, while for an unobservable Markov decision process the parameter is 0. Thus, the new parameter characterizes a spectrum from MDPs to unobservable MDPs, depending on the amount of information conveyed in the observations.
L. Ortiz, R. Schapire, and S. M. Kakade, Maximum Entropy Correlated Equilibria. AISTATS, 2007.
We study maximum entropy correlated equilibria in (multi-player) games and provide two gradient-based algorithms that are guaranteed to converge to such equilibria. Although we do not provide convergence rates for these algorithms, they do have strong connections to other algorithms (such as iterative scaling) which are effective heuristics for tasks such as statistical estimation.


A. Beygelzimer, S. M. Kakade, and J. Langford, Cover Trees for Nearest Neighbor. ICML '06: Proceedings of the 23rd international conference on Machine learning, 2006.
We present a tree data structure for fast nearest neighbor operations in general n-point metric spaces (where the data set consists of n points). The data structure requires O(n) space regardless of the metric's structure, yet maintains all performance properties of a navigating net (Krauthgamer & Lee, 2004b). If the point set has a bounded expansion constant c, which is a measure of the intrinsic dimensionality, as defined in (Karger & Ruhl, 2002), the cover tree data structure can be constructed in O(c^6 n log n) time. Furthermore, nearest neighbor queries require time only logarithmic in n, in particular O(c^{12} log n) time. Our experimental results show speedups over brute force search varying between one and several orders of magnitude on natural machine learning datasets.
E. Even-Dar, S. M. Kakade, M. Kearns, and Y. Mansour, (In)Stability Properties of Limit Order Dynamics. ACM Conference on Electronic Commerce, 2006.
We study the stability properties of the dynamics of the standard continuous limit-order mechanism that is used in modern equity markets. We ask whether such mechanisms are susceptible to “butterfly effects” — the infliction of large changes on common measures of market activity by only small perturbations of the order sequence. We show that the answer depends strongly on whether the market consists of “absolute” traders (who determine their prices independent of the current order book state) or “relative” traders (who determine their prices relative to the current bid and ask). We prove that while the absolute model enjoys provably strong stability properties, the relative model is vulnerable to great instability. Our theoretical results are supported by large-scale experiments using limit order data from INET, a large electronic exchange for NASDAQ stocks.
S. M. Kakade and D. P. Foster, Calibration via Regression. 2006 IEEE Information Theory Workshop (ITW '06), Punta del Este, 2006.
In the online prediction setting, the concept of calibration entails having the empirical (conditional) frequencies match the claimed predicted probabilities. This contrasts with more traditional online prediction goals of getting a low cumulative loss. The differences between these goals have typically made them hard to compare with each other. This paper shows how to get an approximate form of calibration out of a traditional online loss minimization algorithm, namely online regression. As a corollary, we show how to construct calibrated forecasts on a collection of subsequences.
S. M. Kakade, M. W. Seeger, and D. P. Foster, Worst-Case Bounds for Gaussian Process Models. Advances in Neural Information Processing Systems 18 (NIPS 2005), 2006.
We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms — including Gaussian regression and logistic regression — which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) frameworks. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy.
S. M. Kakade and A. Kalai, From Batch to Transductive Online Learning. Advances in Neural Information Processing Systems 18 (NIPS 2005), 2006.
It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequence of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.


E. Even-Dar, S. M. Kakade, and Y. Mansour, Reinforcement Learning in POMDPs Without Resets. IJCAI'05: Proceedings of the 19th international joint conference on Artificial intelligence: , 2005. Publisher's VersionAbstract
We consider the most realistic reinforcement learning setting in which an agent starts in an unknown environment (the POMDP) and must follow one continuous and uninterrupted chain of experience with no access to "resets" or "offline" simulation. We provide algorithms for general connected POMDPs that obtain near optimal average reward. One algorithm we present has a convergence rate which depends exponentially on a certain horizon time of an optimal policy, but has no dependence on the number of (unobservable) states. The main building block of our algorithms is an implementation of an approximate reset strategy, which we show always exists in every POMDP. An interesting aspect of our algorithms is how they use this strategy when balancing exploration and exploitation.
E. Even-Dar, S. M. Kakade, and Y. Mansour, Planning in POMDPs Using Multiplicity Automata. Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005): , 2005. Publisher's VersionAbstract
Planning and learning in Partially Observable MDPs (POMDPs) are among the most challenging tasks in both the AI and Operations Research communities. Although solutions to these problems are intractable in general, there might be special cases, such as structured POMDPs, which can be solved efficiently. A natural and possibly efficient way to represent a POMDP is through the predictive state representation (PSR) - a representation which recently has been receiving increasing attention. In this work, we relate POMDPs to multiplicity automata, showing that POMDPs can be represented by multiplicity automata with no increase in the representation size. Furthermore, we show that the size of the multiplicity automaton is equal to the rank of the predictive state representation. Therefore, we relate both the predictive state representation and POMDPs to the well-founded multiplicity automata literature. Based on the multiplicity automata representation, we provide a planning algorithm which is exponential only in the multiplicity automata rank rather than the number of states of the POMDP. As a result, whenever the predictive state representation is logarithmic in the standard POMDP representation, our planning algorithm is efficient.
S. M. Kakade, M. Kearns, L. Ortiz, R. Pemantle, and S. Suri, The Economic Properties of Social Networks. Advances in Neural Information Processing Systems 17 (NIPS 2004): , 2005. Publisher's VersionAbstract
We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations.
S. M. Kakade and M. Kearns, Trading in Markovian Price Models. 18th Annual Conference on Learning Theory, COLT 2005: , 2005. Publisher's VersionAbstract
We examine a Markovian model for the price evolution of a stock, in which the probability of local upward or downward movement is arbitrarily dependent on the current price itself (and perhaps some auxiliary state information). This model directly and considerably generalizes many of the most well-studied price evolution models in classical finance, including a variety of random walk, drift and diffusion models. Our main result is a “universally profitable” trading strategy — a single fixed strategy whose profitability competes with the optimal strategy (which knows all of the underlying parameters of the infinite and possibly nonstationary Markov process).
S. M. Kakade and A. Y. Ng, Online Bounds for Bayesian Algorithms. NIPS'04: Proceedings of the 17th International Conference on Neural Information Processing Systems: , 2005. Publisher's VersionAbstract
We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using log loss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.
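For the Gaussian linear regression case, the flavor of these bounds can be checked numerically. The sketch below is illustrative, not the paper's code: it assumes a N(0, I) prior on the weights and unit observation noise, runs Bayesian linear regression online, accumulates its sequential log loss, and compares it to the best fixed model in hindsight plus a log-determinant penalty of the kind such bounds contain.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 200
X = rng.normal(size=(T, d))
theta_true = np.array([1.0, -0.5, 0.25])
y = X @ theta_true + rng.normal(size=T)

def online_bayes_log_loss(X, y):
    """Sequential log loss of Bayesian linear regression,
    prior N(0, I) on the weights, unit observation noise."""
    d = X.shape[1]
    P = np.eye(d)            # posterior precision
    b = np.zeros(d)          # precision-weighted posterior mean
    total = 0.0
    for x, yt in zip(X, y):
        S = np.linalg.inv(P)
        mean, var = x @ (S @ b), 1.0 + x @ S @ x   # predictive distribution
        total += 0.5 * (np.log(2 * np.pi * var) + (yt - mean) ** 2 / var)
        P += np.outer(x, x)
        b += yt * x
    return total

def fixed_model_log_loss(X, y, theta):
    """Cumulative log loss of one fixed unit-variance Gaussian model."""
    resid = y - X @ theta
    return 0.5 * (len(y) * np.log(2 * np.pi) + resid @ resid)

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]   # best fixed model in hindsight
online = online_bayes_log_loss(X, y)
comp = fixed_model_log_loss(X, y, theta_star)
# a log-det penalty of the kind appearing in such bounds:
bound = (0.5 * theta_star @ theta_star
         + 0.5 * np.linalg.slogdet(np.eye(d) + X.T @ X)[1])
print(online - comp, bound)
```

In this Gaussian setup the regret inequality (online minus comparator at most the penalty) can be checked to hold exactly, via the standard ridge identity for the marginal likelihood.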
E. Even-Dar, S. M. Kakade, and Y. Mansour, Experts in a Markov Decision Process. Advances in Neural Information Processing Systems 17 (NIPS 2004): , 2005. Publisher's VersionAbstract
We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard.
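The algorithmic idea can be caricatured as running one experts algorithm per state. The sketch below is a simplification under full-information, per-round rewards, not the paper's actual algorithm (which feeds long-horizon Q-value estimates to each per-state learner); the class name and parameters are illustrative.

```python
import numpy as np

class PerStateHedge:
    """One multiplicative-weights ('Hedge') learner over actions per state."""
    def __init__(self, n_states, n_actions, eta=0.1):
        self.w = np.ones((n_states, n_actions))
        self.eta = eta

    def policy(self, s):
        return self.w[s] / self.w[s].sum()

    def update(self, s, rewards):
        # rewards: (estimated) payoff of every action at state s this round
        self.w[s] *= np.exp(self.eta * np.asarray(rewards))

agent = PerStateHedge(n_states=2, n_actions=2)
for t in range(500):
    s = t % 2
    # adversary held fixed for simplicity: action 0 pays at state 0, action 1 at state 1
    rewards = [1.0, 0.0] if s == 0 else [0.0, 1.0]
    agent.update(s, rewards)
print(agent.policy(0), agent.policy(1))
```

Each per-state learner concentrates on its locally best action; the number of states never enters the Hedge regret bound, only the number of actions (logarithmically), matching the shape of the guarantee in the abstract.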


S. Kakade, M. Kearns, and L. Ortiz, Graphical Economics. International Conference on Computational Learning Theory: , 2004. Publisher's VersionAbstract
We introduce a graph-theoretic generalization of classical Arrow-Debreu economics, in which an undirected graph specifies which consumers or economies are permitted to engage in direct trade, and the graph topology may give rise to local variations in the prices of commodities. Our main technical contributions are: (1) a general existence theorem for graphical equilibria, which require local markets to clear; (2) an improved algorithm for computing approximate equilibria in standard (non-graphical) economies, which generalizes the algorithm of Deng et al. [2002] to non-linear utility functions; (3) an algorithm for computing equilibria in the graphical setting, which runs in time polynomial in the number of consumers in the special but important case in which the graph is a tree (again permitting non-linear utility functions). We also highlight many interesting learning problems that arise in our model, and relate them to learning in standard game theory and economics, graphical games, and graphical models for probabilistic inference.
S. Kakade, M. Kearns, Y. Mansour, and L. Ortiz, Competitive Algorithms for VWAP and Limit Order Trading. Proceedings of the ACM Electronic Commerce Conference: , 2004. Publisher's VersionAbstract
We introduce new online models for two important aspects of modern financial markets: Volume Weighted Average Price trading and limit order books. We provide an extensive study of competitive algorithms in these models and relate them to earlier online algorithms for stock trading.
S. M. Kakade and D. P. Foster, Deterministic Calibration and Nash Equilibrium. International Conference on Computational Learning Theory: , 2004. Publisher's VersionAbstract
We provide a natural learning process in which the joint frequency of (time-averaged) empirical play converges into the set of convex combinations of Nash equilibria. Furthermore, the actual distribution of players' actions is close to some (approximate) Nash equilibria on most rounds (on all but a vanishing fraction of the rounds). In this process, all players rationally choose their actions using a public prediction made by a deterministic, weakly calibrated algorithm. For this to be possible, we show that such a deterministic (weakly) calibrated learning algorithm exists.


S. Kakade, M. Kearns, J. Langford, and L. Ortiz, Correlated Equilibria in Graphical Games. EC '03: Proceedings of the 4th ACM conference on Electronic commerce: , 2003. Publisher's VersionAbstract
We examine correlated equilibria in the recently introduced formalism of graphical games, a succinct representation for multiplayer games. We establish a natural and powerful relationship between the graphical structure of a multiplayer game and a certain Markov network representing distributions over joint actions. Our first main result establishes that this Markov network succinctly represents all correlated equilibria of the graphical game up to expected payoff equivalence. Our second main result provides a general algorithm for computing correlated equilibria in a graphical game based on its associated Markov network. For a special class of graphical games that includes trees, this algorithm runs in time polynomial in the graphical game representation (which is polynomial in the number of players and exponential in the graph degree).
D. Bagnell, S. Kakade, A. Ng, and G. Schneider, Policy Search by Dynamic Programming. Advances in Neural Information Processing Systems 16 (NIPS 2003): , 2003. Publisher's VersionAbstract
We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
S. Kakade, M. Kearns, and J. Langford, Exploration in Metric State Spaces. Proceedings of the 20th International Conference on Machine Learning: , 2003. Publisher's VersionAbstract
We present metric-E3, a provably near-optimal algorithm for reinforcement learning in Markov decision processes in which there is a natural metric on the state space that allows the construction of accurate local models. The algorithm is a generalization of the E3 algorithm of Kearns and Singh, and assumes a black box for approximate planning. Unlike the original E3, metric-E3 finds a near-optimal policy in an amount of time that does not directly depend on the size of the state space, but instead depends on the covering number of the state space. Informally, the covering number is the number of neighborhoods required for accurate local modeling.
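As a toy illustration of the quantity the guarantee depends on, the greedy construction below (not from the paper) computes an eps-cover of sample points under the Euclidean metric; the size of this cover, not the number of points, is what metric-E3-style bounds scale with.

```python
import numpy as np

def covering_number(points, eps):
    """Greedy upper bound on the eps-covering number of a point set:
    keep a point as a new center only if no existing center is within eps."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return len(centers)

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(500, 2))
# Halving eps roughly quadruples the cover size in two dimensions,
# independently of how many sample points were drawn.
print(covering_number(pts, 0.5), covering_number(pts, 0.25))
```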


S. Kakade, Y. W. Teh, and S. Roweis, An Alternative Objective Function for Markovian Fields. Proceedings of the Nineteenth International Conference on Machine Learning: , 2002.Abstract
In labelling or prediction tasks, a trained model's test performance is often based on the quality of its single-time marginal distributions over labels rather than its joint distribution over label sequences. We propose using a new cost function for discriminative learning that more accurately reflects such test time conditions. We present an efficient method to compute the gradient of this cost for Maximum Entropy Markov Models, Conditional Random Fields, and for an extension of these models...
S. Kakade and P. Dayan, Acquisition and Extinction in Autoshaping. Psychological Review: , 2002. Publisher's VersionAbstract
C. R. Gallistel and J. Gibbon (2000) presented quantitative data on the speed with which animals acquire behavioral responses during autoshaping, together with a statistical model of learning intended to account for them. Although this model captures the form of the dependencies among critical variables, its detailed predictions are substantially at variance with the data. In the present article, further key data on the speed of acquisition are used to motivate an alternative model of learning, in which animals can be interpreted as paying different amounts of attention to stimuli according to estimates of their differential reliabilities as predictors.
S. Kakade and P. Dayan, Dopamine: Generalization and Bonuses. Neural Networks: , 2002. Publisher's VersionAbstract
In the temporal difference model of primate dopamine neurons, their phasic activity reports a prediction error for future reward. This model is supported by a wealth of experimental data. However, in certain circumstances, the activity of the dopamine cells seems anomalous under the model, as they respond in particular ways to stimuli that are not obviously related to predictions of reward. In this paper, we address two important sets of anomalies, those having to do with generalization and novelty. Generalization responses are treated as the natural consequence of partial information; novelty responses are treated by the suggestion that dopamine cells multiplex information about reward bonuses, including exploration bonuses and shaping bonuses. We interpret this additional role for dopamine in terms of the mechanistic attentional and psychomotor effects of dopamine, having the computational role of guiding exploration.
S. Kakade, A Natural Policy Gradient. Advances in Neural Information Processing Systems 14 (NIPS 2001): , 2002.Abstract
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
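The "greedy" behavior of the natural gradient can be seen in a two-armed bandit with a softmax policy, a hypothetical toy rather than the paper's experiments: preconditioning the vanilla gradient by the (pseudo-)inverse Fisher information yields a direction proportional to the advantages themselves, pushing straight toward the greedy action.

```python
import numpy as np

r = np.array([1.0, 0.0])            # rewards of the two arms

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def vanilla_grad(theta):
    # grad of expected reward: for softmax params,
    # d(log pi(a))/d(theta_b) = 1[a=b] - pi(b)
    pi = softmax(theta)
    return pi * (r - pi @ r)

def fisher(theta):
    # Fisher information: E_pi[ grad log pi  grad log pi^T ]
    pi = softmax(theta)
    g = np.eye(2) - pi              # row a is grad log pi(a)
    return sum(pi[a] * np.outer(g[a], g[a]) for a in range(2))

theta = np.zeros(2)
nat = np.linalg.pinv(fisher(theta)) @ vanilla_grad(theta)
print(nat)                          # proportional to the advantages r - pi.r
```

At theta = 0 the vanilla gradient is the pi-weighted advantage, while the natural gradient is the advantage itself, i.e. the direction a greedy policy-iteration step would favor.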
N. D. Daw, S. Kakade, and P. Dayan, Opponent Interactions Between Serotonin and Dopamine. Neural Networks: , 2002.Abstract
Discusses a neural network model concerning the apparent opponent partnership of serotonin and dopamine. Anatomical and pharmacological evidence suggests that the dorsal raphe serotonin system and the ventral tegmental and substantia nigra dopamine system may act as mutual opponents. In the light of the temporal difference model of the involvement of the dopamine system in reward learning, the proposed model incorporates 3 aspects of motivational opponency involving dopamine and serotonin: (1) a tonic serotonergic signal reports the long-run average reward rate as part of an average-case reinforcement learning model; (2) a tonic dopaminergic signal reports the long-run average punishment rate in a similar context; and (3) a phasic serotonin signal might report an ongoing prediction error for future punishment.
J. Langford, M. Zinkevich, and S. Kakade, Competitive Analysis of the Explore/Exploit Tradeoff. Proceedings of the Nineteenth International Conference on Machine Learning: , 2002. Publisher's VersionAbstract
We investigate the explore/exploit trade-off in reinforcement learning using competitive analysis applied to an abstract model. We state and prove lower and upper bounds on the competitive ratio. The essential conclusion of our analysis is that optimizing the explore/exploit trade-off is much easier with a few pieces of extra knowledge such as the stopping time or upper and lower bounds on the value of the optimal exploitation policy.


S. Kakade, Optimizing Average Reward Using Discounted Rewards. COLT '01/EuroCOLT '01: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory: , 2001. Publisher's VersionAbstract
In many reinforcement learning problems, it is appropriate to optimize the average reward. In practice, this is often done by solving the Bellman equations using a discount factor close to 1. In this paper, we provide a bound on the average reward of the policy obtained by solving the Bellman equations which depends on the relationship between the discount factor and the mixing time of the Markov chain. We extend this result to the direct policy gradient of Baxter and Bartlett, in which a discount parameter is used to find a biased estimate of the gradient of the average reward with respect to the parameters of a policy. We show that this biased gradient is an exact gradient of a related discounted problem and provide a bound on the optima found by following these biased gradients of the average reward. Further, we show that the exact Hessian in this related discounted problem is an approximate Hessian of the average reward, with equality in the limit as the discount factor tends to 1. We then provide an algorithm to estimate the Hessian from a sample path of the underlying Markov chain, which converges with probability 1.
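The basic relationship being bounded is easy to see numerically. The sketch below uses an illustrative two-state chain, not the paper's examples: for a fixed policy on an ergodic chain, (1 - gamma) times the discounted value approaches the average reward rho as gamma tends to 1, at a rate governed by how fast the chain mixes.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition matrix of the induced chain
r = np.array([1.0, 0.0])

# average reward: stationary distribution dotted with r
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
rho = pi @ r

errs = {}
for gamma in (0.9, 0.99, 0.999):
    V = np.linalg.solve(np.eye(2) - gamma * P, r)   # discounted values
    errs[gamma] = np.max(np.abs((1 - gamma) * V - rho))
    print(gamma, errs[gamma])
```

The gap shrinks linearly in (1 - gamma), with a constant set by the chain's mixing properties, which is the tradeoff the paper's bound makes explicit.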
S. Kakade and P. Dayan, Dopamine Bonuses. Advances in Neural Information Processing Systems 13: , 2001. Publisher's VersionAbstract
Substantial data support a temporal difference (TD) model of dopamine (DA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, DA activity seems anomalous under the TD model, responding to non-rewarding stimuli. We address these anomalies by suggesting that DA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al's non-distorting shaping bonuses. We interpret this additional role for DA in terms of the unconditional attentional and psychomotor effects of dopamine, having the computational role of guiding exploration.
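The underlying TD account is the textbook simulation below (a generic tapped-delay-line sketch, not the paper's code): a cue reliably predicts a later reward, and over trials the phasic prediction error migrates from the time of reward to the transition into the cue-bearing state.

```python
import numpy as np

T, cue, rew = 10, 2, 6          # trial length, cue onset, reward time
w = np.zeros(T)                 # one weight per time-since-cue "state"
alpha, gamma = 0.3, 1.0

def trial(w):
    x = np.zeros((T, T))        # complete serial compound representation
    for t in range(cue, T):
        x[t, t - cue] = 1.0
    deltas = np.zeros(T)
    for t in range(T - 1):
        r = 1.0 if t == rew else 0.0
        v, v_next = w @ x[t], w @ x[t + 1]
        deltas[t] = r + gamma * v_next - v      # TD prediction error
        w += alpha * deltas[t] * x[t]
    return deltas

for _ in range(200):
    deltas = trial(w)
print(deltas[cue - 1], deltas[rew])
```

After learning, the error at the (now fully predicted) reward is near zero while a phasic error of about 1 fires at the onset of the predictive state, which is the baseline behavior against which the paper's bonus responses are anomalies.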
P. Dayan and S. Kakade, Explaining Away in Weight Space. Advances in Neural Information Processing Systems 13 (NIPS 2000): , 2001. Publisher's VersionAbstract
Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.
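A minimal version of the Kalman filter account of backward blocking can be simulated directly; all parameters below are illustrative, not the paper's. Training A+B -> reward and then A alone -> reward drives the estimated weight of B back down, through the negative posterior covariance the filter learns between the two weights.

```python
import numpy as np

def kalman_step(mu, S, x, y, obs_var=0.1):
    """One Kalman filter update for weights w ~ N(mu, S), observation y = x.w + noise."""
    k = S @ x / (x @ S @ x + obs_var)   # Kalman gain
    mu = mu + k * (y - x @ mu)
    S = S - np.outer(k, x @ S)
    return mu, S

mu, S = np.zeros(2), np.eye(2)          # prior over the two stimulus weights
for _ in range(20):                     # phase 1: A+B together -> reward 1
    mu, S = kalman_step(mu, S, np.array([1.0, 1.0]), 1.0)
wB_after_phase1 = mu[1]
for _ in range(20):                     # phase 2: A alone -> reward 1
    mu, S = kalman_step(mu, S, np.array([1.0, 0.0]), 1.0)
print(wB_after_phase1, mu[1])           # w_B falls in phase 2: backward blocking
```

The weight on B is revised downward even though B is never presented in phase 2: the evidence for A "explains away" B's earlier apparent contribution, exactly the weight-space effect the abstract describes.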


S. Kakade and P. Dayan, Acquisition in Autoshaping. Advances in Neural Information Processing Systems 12: , 2000. Publisher's VersionAbstract
Quantitative data on the speed with which animals acquire behavioral responses during classical conditioning experiments should provide strong constraints on models of learning. However, most models have simply ignored these data; the few that have attempted to address them have failed by at least an order of magnitude. We discuss key data on the speed of acquisition, and show how to account for them using a statistically sound model of learning, in which differential reliabilities of stimuli play a crucial role.
P. Dayan, S. Kakade, and P. R. Montague, “Learning and Selective Attention,” Nature Neuroscience, vol. 3. pp. 1218-1223, 2000. Publisher's VersionAbstract
Selective attention involves the differential processing of different stimuli, and has widespread psychological and neural consequences. Although computational modeling should offer a powerful way of linking observable phenomena at different levels, most work has focused on the relatively narrow issue of constraints on processing resources. By contrast, we consider statistical and informational aspects of selective attention, divorced from resource constraints, which are evident in animal conditioning experiments involving uncertain predictions and unreliable stimuli. Neuromodulatory systems and limbic structures are known to underlie attentional effects in such tasks.