Publications by Year: 2017

D. P. Foster, S. M. Kakade, and T. Zhang, Multi-View Dimensionality Reduction via Canonical Correlation Analysis. Technical Report TTI-TR-2008-4, Toyota Technological Institute at Chicago, 2017.
We analyze the multi-view regression problem where we have two views X = (X(1), X(2)) of the input data and a target variable Y of interest. We provide sufficient conditions under which we can reduce the dimensionality of X (via a projection) without losing predictive power for Y. Crucially, this projection can be computed via a Canonical Correlation Analysis (CCA) on the unlabeled data alone. The algorithmic template is as follows: with unlabeled data, perform CCA and construct a certain projection; with the labeled data, do least squares regression in this lower dimensional space. We show how, under certain natural assumptions, the number of labeled samples can be significantly reduced (in comparison to the single view setting); in particular, we show that this dimensionality reduction does not lose predictive power for Y (thus it introduces little bias but can drastically reduce the variance). We explore two separate assumptions under which this is possible and show how, under either assumption alone, dimensionality reduction can reduce the labeled sample complexity. The two assumptions we consider are a conditional independence assumption and a redundancy assumption. The typical conditional independence assumption is that the views X(1) and X(2) are independent conditioned on Y; we relax this to require only that the views are independent conditioned on some hidden state H. Under the redundancy assumption, the best predictor from each view is roughly as good as the best predictor using both views.
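A minimal sketch of the two-stage template above, assuming scikit-learn's CCA and synthetic two-view data; every shape, constant, and variable name here is illustrative rather than taken from the paper:

```python
# Sketch: CCA on unlabeled two-view data, then least squares in the
# low-dimensional space. Data is synthetic and purely illustrative.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Unlabeled data: two correlated views of the same underlying inputs.
X1_unlab = rng.normal(size=(5000, 50))
X2_unlab = X1_unlab @ rng.normal(size=(50, 40)) + 0.1 * rng.normal(size=(5000, 40))

# Step 1: fit CCA on unlabeled data only, keeping k canonical directions.
k = 5
cca = CCA(n_components=k).fit(X1_unlab, X2_unlab)

# Step 2: on a small labeled set, project one view and run least squares.
X1_lab = rng.normal(size=(100, 50))
y = X1_lab @ rng.normal(size=50) + 0.1 * rng.normal(size=100)

Z = cca.transform(X1_lab)             # projection onto canonical directions
model = LinearRegression().fit(Z, y)  # regression in the k-dimensional space
```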
M. R. Khan and J. Blumenstock, Determinants of Mobile Money Adoption in Pakistan. 2017.
In this work, we analyze the adoption of mobile money in Pakistan using the call detail records of a major telecom company as our input. Our results highlight that different sections of society adopt digital financial services in different ways, but that user-mobility features are the most important ones for predicting the adoption and use of mobile money services.
V. Sharan, S. Kakade, P. Liang, and G. Valiant, Learning Overcomplete HMMs. NIPS, 2017 (arXiv report).
We study the problem of learning overcomplete HMMs---those that have many hidden states but a small output alphabet. Despite having significant practical importance, such HMMs are poorly understood with no known positive or negative results for efficient learning. In this paper, we present several new results---both positive and negative---which help define the boundaries between the tractable and intractable settings. Specifically, we show positive results for a large subclass of HMMs whose transition matrices are sparse, well-conditioned, and have small probability mass on short cycles. On the other hand, we show that learning is impossible given only a polynomial number of samples for HMMs with a small output alphabet and whose transition matrices are random regular graphs with large degree. We also discuss these results in the context of learning HMMs which can capture long-term dependencies.
P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, V. K. Pillutla, and A. Sidford, A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares). FSTTCS, 2017 (arXiv report).
This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model mis-specification.
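A minimal numpy sketch of constant-step-size SGD with iterate averaging for least squares, the estimator analyzed in the paper; the data, step size, and plain running average below are illustrative assumptions:

```python
# Sketch: single-pass SGD on least squares with a running average of iterates.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 20
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

eta = 0.01           # constant step size
w = np.zeros(d)      # current SGD iterate
w_bar = np.zeros(d)  # running average of the iterates

for t in range(n):
    i = rng.integers(n)                # sample one observation
    grad = (X[i] @ w - y[i]) * X[i]    # stochastic gradient of the squared loss
    w -= eta * grad
    w_bar += (w - w_bar) / (t + 1)     # online update of the iterate average

print(np.linalg.norm(w_bar - w_star))  # averaged iterate is close to the truth
```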
J. Thickstun, Z. Harchaoui, and S. Kakade, Learning Features of Music from Scratch. ICLR, 2017 (arXiv report).
This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions. The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.
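A minimal sketch of the frame-level, multi-label note-prediction setup the paper defines, assuming synthetic stand-ins for spectrogram frames and note labels (none of the data below is MusicNet itself):

```python
# Sketch: per-frame multi-label note prediction as independent logistic
# outputs over spectrogram features. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins, n_notes = 2000, 128, 84    # frames, frequency bins, note labels

X = rng.normal(size=(n_frames, n_bins))      # stand-in for log-spectrogram frames
Y = (rng.random(size=(n_frames, n_notes)) < 0.05).astype(float)  # active notes

W = np.zeros((n_bins, n_notes))
eta = 0.1
for _ in range(200):
    P = 1.0 / (1.0 + np.exp(-X @ W))         # per-note probabilities (sigmoid)
    W -= eta * X.T @ (P - Y) / n_frames      # gradient of summed cross-entropy
```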
P. Jain, C. Jin, S. M. Kakade, and P. Netrapalli, Computing Matrix Squareroot via Non Convex Local Search. AISTATS, 2017 (arXiv report).
While there has been a significant amount of work studying gradient descent techniques for non-convex optimization problems over the last few years, for many problems of interest all existing results establish either local convergence with good rates or global convergence with highly suboptimal rates. In this paper, we take the first step in getting the best of both worlds: establishing global convergence with a good rate of convergence for the problem of computing the squareroot of a positive definite (PD) matrix, a widely studied problem in numerical linear algebra with applications in machine learning and statistics, among others. Given a PD matrix M and a PD starting point U₀, we show that gradient descent with an appropriately chosen step size finds an ε-accurate squareroot of M in O(α log(‖M − U₀²‖_F / ε)) iterations, where α = (max{‖U₀‖₂², ‖M‖₂} / min{σ_min(U₀)², σ_min(M)})^(3/2). Our result is the first to establish global convergence for this problem and to show that it is robust to errors in each iteration. A key contribution of our work is the general proof technique, which we believe should further excite research into understanding deterministic and stochastic variants of simple non-convex gradient descent algorithms, with good global convergence rates, for other problems in machine learning and numerical linear algebra.
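A minimal numpy sketch of the non-convex iteration above: gradient descent on ‖U² − M‖_F² from a PD starting point. The problem instance and step size are illustrative assumptions, not the paper's prescribed constants:

```python
# Sketch: gradient descent on f(U) = ||U @ U - M||_F^2 for a PD matrix M.
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = rng.normal(size=(d, d))
M = A @ A.T + d * np.eye(d)        # a well-conditioned PD target
U = np.eye(d)                      # PD starting point U_0

eta = 0.2 / np.linalg.norm(M, 2)   # step size scaled by the spectral norm of M
for _ in range(500):
    R = U @ U - M                  # residual U^2 - M
    U -= eta * (R @ U + U @ R)     # gradient of f(U), up to a factor of 2

print(np.linalg.norm(U @ U - M))   # residual should now be small
```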
A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade, Towards Generalization and Simplicity in Continuous Control. NIPS, 2017 (arXiv report).
This work shows that policies with simple linear and RBF parameterizations can be trained to solve a variety of continuous control tasks, including the OpenAI gym benchmarks. The performance of these trained policies is competitive with state-of-the-art results obtained with more elaborate parameterizations, such as fully connected neural networks. Furthermore, existing training and testing scenarios are shown to be very limited and prone to over-fitting, thus giving rise to only trajectory-centric policies. Training with a diverse initial state distribution is shown to produce more global policies with better generalization. This allows for interactive control scenarios where the system recovers from large on-line perturbations, as shown in the supplementary video.
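A minimal sketch of the two policy parameterizations the abstract refers to: a linear map of the observation, and an RBF policy built from random Fourier features. The dimensions, feature scales, and absence of a training loop are illustrative simplifications:

```python
# Sketch: linear and random-Fourier-feature (RBF) policy parameterizations.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, n_rbf = 8, 2, 100

W_lin = np.zeros((act_dim, obs_dim))   # linear policy weights

P = rng.normal(size=(n_rbf, obs_dim))  # random projections for Fourier features
phase = rng.uniform(0, 2 * np.pi, size=n_rbf)
W_rbf = np.zeros((act_dim, n_rbf))     # weights over the RBF features

def linear_policy(s):
    return W_lin @ s                   # action is linear in the observation

def rbf_policy(s, bandwidth=1.0):
    feats = np.sin(P @ s / bandwidth + phase)  # features approximating an RBF kernel
    return W_rbf @ feats

print(linear_policy(np.ones(obs_dim)), rbf_policy(np.ones(obs_dim)))
```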
C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, How to Escape Saddle Points Efficiently. ICML, 2017 (arXiv report).
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations that depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, so this result shows that perturbed gradient descent can escape saddle points almost for free.
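A minimal sketch of the perturbation idea on a toy objective with a strict saddle at the origin: when the gradient is small, the iterate is perturbed by noise drawn from a small ball, after which plain gradient descent moves off the saddle. The objective, thresholds, and constants are illustrative assumptions:

```python
# Sketch: perturbed gradient descent escaping a strict saddle point.
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Toy objective f(x) = (x[0]**2 - 1)**2 + x[1]**2, with a saddle at the
    # origin and local minima at (+1, 0) and (-1, 0).
    return np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])

x = np.zeros(2)                          # start exactly at the saddle point
eta, g_thresh, radius = 0.05, 1e-3, 0.1  # step size, gradient threshold, ball radius

for _ in range(1000):
    g = grad_f(x)
    if np.linalg.norm(g) < g_thresh:
        # Gradient is small: perturb with a sample from a small 2-D ball.
        noise = rng.normal(size=2)
        noise *= radius * np.sqrt(rng.uniform()) / np.linalg.norm(noise)
        x = x + noise
    else:
        x = x - eta * g                  # ordinary gradient step

print(x)  # ends up near a local minimum (+-1, 0), not at the saddle
```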