Publications by Year: 2019

J. Thickstun, Z. Harchaoui, D. P. Foster, and S. M. Kakade, Coupled Recurrent Models for Polyphonic Music Composition. ISMIR: ArXiv Report, 2019.
This paper introduces a novel recurrent model for music composition that is tailored to the structure of polyphonic music. We propose an efficient new conditional probabilistic factorization of musical scores, viewing a score as a collection of concurrent, coupled sequences: i.e. voices. To model the conditional distributions, we borrow ideas from both convolutional and recurrent neural models; we argue that these ideas are natural for capturing music's pitch invariances, temporal structure, and polyphony. We train models for single-voice and multi-voice composition on 2,300 scores from the KernScores dataset.
R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli, The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure. NeurIPS: ArXiv Report, 2019.
Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the final iterate of SGD has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem, with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations SGD is run for) is known in advance, SGD's final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of √T in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step sizes. In particular, the final iterate with a step decay schedule is off the minimax rate by only log factors (in the condition number for the strongly convex case, and in T for the non-strongly convex case). Finally, in stark contrast to the known-horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the step sizes employed. These results demonstrate the subtlety in establishing optimal learning rate schemes (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.
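As a concrete illustration of the schedule the paper studies, a minimal sketch follows; the decay factor 0.5 and the number of phases are illustrative choices, not the paper's tuned constants, and the horizon T is assumed known up front.

```python
import numpy as np

def step_decay_lr(eta0, t, T, num_phases=5, decay=0.5):
    """Step decay: split the known horizon T into num_phases equal phases,
    hold the learning rate constant within a phase, and cut it by a constant
    factor at each boundary, so it decays geometrically, not polynomially."""
    phase = min(t // max(1, T // num_phases), num_phases - 1)
    return eta0 * decay ** phase

# SGD on streaming least squares, the setting analyzed in the paper.
rng = np.random.default_rng(0)
d, T = 10, 5000
w_star, w = rng.normal(size=d), np.zeros(d)
for t in range(T):
    x = rng.normal(size=d)               # fresh sample each iteration
    y = x @ w_star + 0.1 * rng.normal()  # noisy linear observation
    w -= step_decay_lr(0.1, t, T) * (x @ w - y) * x  # output the final iterate w
```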
A. Rajeswaran, C. Finn, S. Kakade, and S. Levine, Meta-Learning with Implicit Gradients. NeurIPS: ArXiv Report, 2019.
A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner loop using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that required to compute a single inner loop gradient, and with no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
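To make the implicit gradient concrete: at an (approximate) solution φ of the regularized inner problem, the meta-gradient requires only one linear solve at φ, not backpropagation through the optimizer's path. A minimal numpy sketch, assuming a toy problem small enough to form the inner Hessian explicitly:

```python
import numpy as np

def implicit_meta_gradient(grad_test, hessian_train, lam):
    """iMAML-style implicit meta-gradient: solve
        (I + (1/lam) * H_train) g = grad_test
    where H_train is the Hessian of the task training loss at the inner
    solution phi and grad_test is the gradient of the task test loss at phi.
    The result depends only on phi, not on how the inner optimizer got there."""
    d = hessian_train.shape[0]
    return np.linalg.solve(np.eye(d) + hessian_train / lam, grad_test)
```

In practice the linear solve is carried out with a few conjugate gradient steps using Hessian-vector products, so the Hessian is never materialized; that is what keeps the memory footprint comparable to a single inner-loop gradient.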
C. Finn, A. Rajeswaran, S. Kakade, and S. Levine, Online Meta-Learning. ICML: ArXiv Report, 2019.
A central capability of intelligent systems is the ability to continuously build upon previous experiences to speed up and enhance learning of new tasks. Two distinct research paradigms have studied this question. Meta-learning views this problem as learning a prior over model parameters that is amenable to fast adaptation on a new task, but typically assumes the set of tasks is available together as a batch. In contrast, online (regret based) learning considers a sequential setting in which problems are revealed one after the other, but conventionally trains only a single model without any task-specific adaptation. This work introduces an online meta-learning setting, which merges ideas from both paradigms to better capture the spirit and practice of continual lifelong learning. We propose the follow-the-meta-leader algorithm, which extends the MAML algorithm to this setting. Theoretically, this work provides an O(log T) regret guarantee with only one additional higher-order smoothness assumption in comparison to the standard online setting. Our experimental evaluation on three different large-scale tasks suggests that the proposed algorithm significantly outperforms alternatives based on traditional online learning approaches.
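A minimal sketch of the follow-the-meta-leader idea: after each new task arrives, re-fit the meta-parameters against the post-adaptation losses of all tasks seen so far. For brevity this sketch uses a first-order approximation of the meta-gradient (the task gradient evaluated at the adapted parameters), in the spirit of first-order MAML rather than the full second-order update; the per-task `grad_fn` is an assumed interface.

```python
import numpy as np

def adapt(theta, grad_fn, alpha=0.1):
    """One inner gradient step on a task's training loss (MAML-style)."""
    return theta - alpha * grad_fn(theta)

def ftml_round(theta, seen_tasks, eta=0.01, meta_steps=25):
    """One round of follow-the-meta-leader: approximately minimize the
    average post-adaptation loss over every task seen so far, using a
    first-order approximation of the meta-gradient."""
    for _ in range(meta_steps):
        meta_grad = np.zeros_like(theta)
        for grad_fn in seen_tasks:  # each task exposes the gradient of its loss
            meta_grad += grad_fn(adapt(theta, grad_fn))
        theta = theta - eta * meta_grad / len(seen_tasks)
    return theta
```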
G. Cadamuro, R. K. Vinayak, J. Blumenstock, S. Kakade, and J. N. Shapiro, “The Illusion of Change: Correcting for Bias when Inferring Changes in Sparse, Societal-Scale Data,” The Web Conference 2019, pp. 2608–2615, 2019.
Societal-scale data is playing an increasingly prominent role in social science research; examples from research on geopolitical events include questions on how emergency events impact the diffusion of information or how new policies change patterns of social interaction. Such research often draws critical inferences from observing how an exogenous event changes meaningful metrics like network degree or network entropy. However, as we show in this work, standard estimation methodologies make systematically incorrect inferences when the event also changes the sparsity of the data. To address this issue, we provide a general framework for inferring changes in social metrics when dealing with non-stationary sparsity. We propose a plug-in correction that can be applied to any estimator, including several recently proposed procedures. Using both simulated and real data, we demonstrate that the correction significantly improves the accuracy of the estimated change under a variety of plausible data generating processes. In particular, using a large dataset of calls from Afghanistan, we show that whereas traditional methods substantially overestimate the impact of a violent event on social diversity, the plug-in correction reveals the true response to be much more modest.
E. Hazan and S. Kakade, Revisiting the Polyak step size. ArXiv Report, 2019.
This paper revisits the Polyak step size schedule for convex optimization problems, proving that a simple variant of it simultaneously attains near optimal convergence rates for the gradient descent algorithm, for all ranges of strong convexity, smoothness, and Lipschitz parameters, without a priori knowledge of these parameters.
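For reference, the classical Polyak rule sets the step size from the current suboptimality gap, which presumes the optimal value f⋆ is known; the paper's contribution is a variant that removes this kind of prior knowledge. A minimal sketch of the classical rule:

```python
import numpy as np

def polyak_gd(f, grad_f, x0, f_star, iters=100):
    """Gradient descent with the classical Polyak step size
    eta_t = (f(x_t) - f*) / ||grad f(x_t)||^2, which requires
    knowing the optimal value f*."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        gap = f(x) - f_star
        if gap <= 0 or not np.any(g):  # already optimal (up to numerics)
            break
        x = x - (gap / (g @ g)) * g
    return x

# Example: a poorly conditioned quadratic with known optimum f* = 0.
A = np.diag([1.0, 10.0])
x_opt = polyak_gd(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                  x0=[5.0, -3.0], f_star=0.0)
```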
N. Agarwal, B. Bullins, E. Hazan, S. M. Kakade, and K. Singh, Online Control with Adversarial Disturbances. ICML: ArXiv Report, 2019.
We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that performs nearly as well as a procedure with full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this problem. From a technical standpoint, this work generalizes previous work in two main respects: our model allows for adversarial noise in the dynamics, and allows for general convex costs.
R. K. Vinayak, W. Kong, G. Valiant, and S. M. Kakade, Maximum Likelihood Estimation for Learning Populations of Parameters. ArXiv Report, 2019.
Consider a setting with N independent individuals, each with an unknown parameter p_i ∈ [0,1] drawn from some unknown distribution P⋆. After observing the outcomes of t independent Bernoulli trials per individual, i.e., X_i ∼ Binomial(t, p_i), our objective is to accurately estimate P⋆. This problem arises in numerous domains, including the social sciences, psychology, health care, and biology, where the size of the population under study is usually large while the number of observations per individual is often limited. Our main result shows that, in the regime where t ≪ N, the maximum likelihood estimator (MLE) is both statistically minimax optimal and efficiently computable. Precisely, for sufficiently large N, the MLE achieves the information theoretic optimal error bound of O(1/t) for t < c log N, with respect to the earth mover's distance between the estimated and true distributions.
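A minimal sketch of the estimator's structure: discretize [0,1] into a grid, model P⋆ as a weight vector on the grid, and maximize the binomial mixture likelihood over the simplex. The problem is convex in the weights; plain EM iterations are used here purely for simplicity, and the grid size and iteration count are illustrative choices.

```python
import numpy as np
from scipy.stats import binom

def mle_population(X, t, grid_size=101, em_iters=500):
    """Grid-based MLE sketch: maximize
        sum_i log sum_j w_j * Binomial(X_i; t, p_j)
    over weights w on the simplex, via EM."""
    p_grid = np.linspace(0.0, 1.0, grid_size)
    L = binom.pmf(X[:, None], t, p_grid[None, :])  # L[i, j] = P(X_i | t, p_j)
    w = np.full(grid_size, 1.0 / grid_size)
    for _ in range(em_iters):
        post = L * w                                # unnormalized responsibilities
        post /= post.sum(axis=1, keepdims=True)
        w = post.mean(axis=0)                       # M-step: average responsibilities
    return p_grid, w

# Example: a large population, each member observed for only t = 5 trials.
rng = np.random.default_rng(0)
p_true = rng.beta(2, 5, size=10000)  # stand-in for the unknown P*
X = rng.binomial(5, p_true)
p_grid, w_hat = mle_population(X, t=5)
```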
C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm. ArXiv Report, 2019.
In this note, we derive concentration inequalities for random vectors with subGaussian norm (a generalization of both subGaussian random vectors and norm bounded random vectors), which are tight up to logarithmic factors.
K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. ICLR: ArXiv Report, 2019.
We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enables solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.
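A minimal sketch of the interplay the paper describes: short-horizon trajectory optimization whose terminal cost is a learned value function, so local plans inherit global information. Here `model(s, a) -> (s_next, r)` and `value_fn` are assumed interfaces, and random shooting stands in for the paper's trajectory optimizer.

```python
import numpy as np

def plan_action(model, value_fn, state, action_dim, horizon=10,
                n_candidates=256, rng=None):
    """Roll random action sequences through the internal model over a short
    horizon, bootstrap the tail with the learned value function, and return
    the first action of the best sequence (MPC-style receding horizon)."""
    rng = rng or np.random.default_rng()
    best_score, best_action = -np.inf, None
    for _ in range(n_candidates):
        s, score = state, 0.0
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        for a in actions:
            s, r = model(s, a)
            score += r
        score += value_fn(s)  # terminal value extends the effective horizon
        if score > best_score:
            best_score, best_action = score, actions[0]
    return best_action
```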
E. Hazan, S. M. Kakade, K. Singh, and A. V. Soest, Provably Efficient Maximum Entropy Exploration. ICML: ArXiv Report, 2019.
Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal: what might we hope that the agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies induced by how the agent behaves. For example, one natural, intrinsically defined objective is for the agent to learn a policy which induces a distribution over state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting where we have sample based access to the MDP, our proposed algorithm is provably efficient, both in terms of its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which relies on an approximate MDP solver.
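A minimal sketch of the Frank-Wolfe loop for the uniform-state-distribution (entropy) objective: treat the gradient of the entropy of the current mixed state distribution as a reward, ask the planning oracle for a policy that (approximately) maximizes it, and mix the new policy in with the standard Frank-Wolfe weight. `plan_oracle(reward) -> policy` and `state_dist(policy) -> d` are assumed interfaces over a tabular state space.

```python
import numpy as np

def maxent_explore(plan_oracle, state_dist, n_states, n_iters=50):
    """Conditional gradient (Frank-Wolfe) on H(d) = -sum_s d(s) log d(s),
    where d is the state-visitation distribution of a mixture of policies."""
    d_mix = np.full(n_states, 1.0 / n_states)  # smoothed initialization
    policies = []
    for k in range(n_iters):
        reward = -np.log(d_mix + 1e-8) - 1.0   # gradient of entropy w.r.t. d
        pi_k = plan_oracle(reward)             # approx. best-response policy
        alpha = 2.0 / (k + 2)                  # standard Frank-Wolfe step size
        d_mix = (1 - alpha) * d_mix + alpha * state_dist(pi_k)
        policies.append((alpha, pi_k))         # mixture weight before later decay
    return policies, d_mix
```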