2012

S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Regularization Techniques for Learning with Matrices,” Journal of Machine Learning Research, vol. 13, no. 59, pp. 1865–1890, 2012.
There is a growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning.
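For the reader's convenience, the duality invoked in the abstract can be stated as follows (one standard normalization; the paper's own statement may differ in constants): a function $f$ is $\beta$-strongly convex with respect to a norm $\|\cdot\|$ if and only if its Fenchel conjugate $f^*$ is $(1/\beta)$-strongly smooth with respect to the dual norm $\|\cdot\|_*$, that is,

  $f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{\beta}{2}\,\|y - x\|^2$ for all $x, y$

if and only if

  $f^*(y) \le f^*(x) + \langle \nabla f^*(x),\, y - x \rangle + \tfrac{1}{2\beta}\,\|y - x\|_*^2$ for all $x, y$.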
D. Hsu, S. M. Kakade, and T. Zhang, Errata: Analysis of a randomized approximation scheme for matrix multiplication. ArXiv Report, 2012.

This note gives a simple analysis of a randomized approximation scheme for matrix multiplication proposed by Sarlos (2006) based on a random rotation followed by uniform column sampling. The result follows from a matrix version of Bernstein's inequality and a tail inequality for quadratic forms in subgaussian random vectors.
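As an illustration of the scheme described in the abstract, here is a minimal NumPy sketch of the rotate-then-sample idea (my own reading of the abstract, not the authors' code; the dense QR-based rotation below is a slow stand-in for a fast transform such as a randomized Hadamard rotation):

import numpy as np

def approx_matmul(A, B, k, seed=None):
    # Approximates A @ B (A is m x n, B is n x p) by:
    #   1. applying a random orthonormal rotation Theta to the shared inner
    #      dimension, so that (A Theta^T)(Theta B) = A B exactly,
    #   2. sampling k of the n rotated columns/rows uniformly at random,
    #   3. rescaling by n / k so the estimate is unbiased.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    signs = rng.choice([-1.0, 1.0], size=n)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # slow stand-in for a fast rotation
    Theta = Q * signs                                  # Q @ diag(signs), still orthonormal
    A_rot = A @ Theta.T
    B_rot = Theta @ B
    idx = rng.integers(0, n, size=k)                   # uniform sampling with replacement
    return (n / k) * (A_rot[:, idx] @ B_rot[idx, :])

The purpose of the rotation is to spread mass evenly across the inner dimension so that uniform sampling is effective; the errata below concerns a step in the analysis of this estimator, not its unbiasedness.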

Errata

 

The analysis in Example 4.3 is flawed due to an incorrect bound on $\|E[X_j^2]\|_2$. It is possible to fix this by introducing a factor of $\max_i \{ \|a_i\|_2 / \|b_i\|_2,\ \|b_i\|_2 / \|a_i\|_2 \}$ in the bound, which ultimately leads to a significantly worse result.

(Note: There is no problem if A = B.)

Thanks to Joel Tropp for pointing out this bug.

A slightly different algorithm avoids this issue and leads to the same running time bound; see http://arxiv.org/abs/1211.5414 for details.

DH 2012-11-25

---

A slightly different sampling scheme also fixes the problem; see http://arxiv.org/abs/1410.4429 for details.

DH 2014-10-17

 

A. Anandkumar, D. Hsu, and S. M. Kakade, A Method of Moments for Mixture Models and Hidden Markov Models. Conference on Learning Theory (COLT): ArXiv Report, 2012.
Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm), which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity, which typically scales exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment.
A. Anandkumar, F. Huang, D. Hsu, and S. M. Kakade, Learning High-Dimensional Mixtures of Graphical Models. Neural Information Processing Systems (NIPS): ArXiv Report, 2012.
We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable corresponding to the mixture components is hidden and each mixture component over the observed variables can have a potentially different Markov graph structure and parameters. We propose a novel approach for estimating the mixture components, and our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. Our method is efficient when the union graph, which is the union of the Markov graphs of the mixture components, has sparse vertex separators between any pair of observed variables. This includes tree mixtures and mixtures of bounded degree graphs. For such models, we prove that our method correctly recovers the union graph structure and the tree structures corresponding to maximum-likelihood tree approximations of the mixture components. The sample and computational complexities of our method scale as $\poly(p, r)$, for an r-component mixture of p-variate graphical models. We further extend our results to the case when the union graph has sparse local separators between any pair of observed variables, such as mixtures of locally tree-like graphs, and the mixture components are in the regime of correlation decay.
D. Hsu, S. M. Kakade, and P. Liang, Identifiability and Unmixing of Latent Parse Trees. Neural Information Processing Systems (NIPS): ArXiv Report, 2012.
This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models.
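The abstract's idea of numerically checking identifiability via the rank of a Jacobian matrix can be illustrated generically. The sketch below is my own toy version (the moment map in the usage line is made up, not one of the paper's parsing models): it differentiates a parameter-to-moments map at a random parameter by finite differences and reports the numerical rank.

import numpy as np

def numerical_jacobian_rank(moment_map, dim_theta, eps=1e-6, tol=1e-8, seed=0):
    # Heuristic identifiability check: if the Jacobian of the map from
    # parameters to observable moments has full column rank at a generic
    # (random) parameter, the parameters are locally identifiable there.
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(dim_theta)
    m0 = np.asarray(moment_map(theta), dtype=float)
    J = np.empty((m0.size, dim_theta))
    for j in range(dim_theta):
        step = np.zeros(dim_theta)
        step[j] = eps
        J[:, j] = (np.asarray(moment_map(theta + step), dtype=float) - m0) / eps
    svals = np.linalg.svd(J, compute_uv=False)
    return int(np.sum(svals > tol * svals.max()))

# Toy usage with a hypothetical two-parameter moment map: the Jacobian of
# theta -> (theta_0 + theta_1, theta_0 * theta_1) is rank deficient exactly
# when theta_0 == theta_1, and full rank at a generic random theta.
print(numerical_jacobian_rank(lambda t: [t[0] + t[1], t[0] * t[1]], dim_theta=2))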
D. P. Foster, S. M. Kakade, and R. Salakhutdinov, Domain Adaptation: Overfitting and Small Sample Statistics. AISTATS: ArXiv Report, 2012.
We study the prevalent problem in which a test distribution differs from the training distribution. We consider a setting where our training set consists of a small number of sample domains, but where we have many samples in each domain. Our goal is to generalize to a new domain. For example, we may want to learn a similarity function using only certain classes of objects, but we desire that this similarity function be applicable to object classes not present in our training sample (e.g. we might seek to learn that "dogs are similar to dogs" even though images of dogs were absent from our training set). Our theoretical analysis shows that we can select many more features than domains while avoiding overfitting by utilizing data-dependent variance properties. We present a greedy feature selection algorithm based on T-statistics. Our experiments validate this theory, showing that our T-statistic-based greedy feature selection is more robust at avoiding overfitting than the classical greedy procedure.
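The T-statistic idea can be sketched concretely; the following is my simplified, hypothetical version of such a selector (it scores each feature by the mean-to-standard-error ratio of its per-domain correlation with the target and keeps the top-scoring features in one shot, rather than reproducing the paper's exact greedy procedure):

import numpy as np

def tstat_feature_selection(domains, n_select):
    # domains: list of (X, y) pairs, one per training domain, with X of shape (n_i, d).
    # Score each feature by a T-statistic of its per-domain correlation with y:
    # features whose usefulness is consistent across the few available domains
    # score highly, which guards against overfitting to any single domain.
    per_domain_corr = []
    for X, y in domains:
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        corr = (Xc * yc[:, None]).mean(axis=0) / (Xc.std(axis=0) * yc.std() + 1e-12)
        per_domain_corr.append(corr)
    C = np.stack(per_domain_corr)                              # (num_domains, d)
    se = C.std(axis=0, ddof=1) / np.sqrt(C.shape[0]) + 1e-12   # standard error across domains
    t_stats = C.mean(axis=0) / se
    return np.argsort(-np.abs(t_stats))[:n_select]             # indices of the selected features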
A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu, A Spectral Algorithm for Latent Dirichlet Allocation. Neural Information Processing Systems (NIPS): ArXiv Report, 2012.
The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on k×k matrices, where k is the number of latent factors (e.g. the number of topics), rather than in the d-dimensional observed space (typically d≫k).
D. Hsu, S. M. Kakade, and T. Zhang, An Analysis of Random Design Linear Regression. Conference on Learning Theory (COLT): ArXiv Report, 2012.
This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the “out-of-sample” prediction error, as opposed to the “in-sample” (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which is present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.
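A small simulation sketch of the distinction drawn above between in-sample (fixed design) and out-of-sample (random design) prediction error, using plain ridge regression (an illustration of the setting only, not of the paper's analysis; the noise level and regularization below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 50, 1.0
beta = rng.standard_normal(d) / np.sqrt(d)

X = rng.standard_normal((n, d))                 # random design: covariates drawn i.i.d.
y = X @ beta + 0.5 * rng.standard_normal(n)

# Ridge estimator; ordinary least squares is the lam = 0 special case.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# In-sample (fixed design) excess error: averaged over the observed rows of X.
in_sample = np.mean((X @ (beta_hat - beta)) ** 2)

# Out-of-sample (random design) excess error: averaged over fresh draws of x,
# i.e. ||beta_hat - beta||_Sigma^2, which is the squared Euclidean norm here
# because Sigma = I.
out_of_sample = np.sum((beta_hat - beta) ** 2)

print(in_sample, out_of_sample)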
S. Bubeck, N. Cesa-Bianchi, and S. M. Kakade, Towards minimax policies for online linear optimization with bandit feedback. Conference on Learning Theory (COLT): ArXiv Report, 2012.
We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order $\sqrt{dn\log N}$ for any finite action set with $N$ actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous works, and gives a regret bound of order $d\sqrt{n\log n}$ for any compact set of actions. Without further assumptions on the action set, this last bound is minimax optimal up to a logarithmic factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization with expert advice in dimension $d$ is the same as for the basic $d$-armed bandit with expert advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain computationally efficient strategies with minimax optimal regret bounds in specific examples. More precisely we study two canonical action sets: the hypercube and the Euclidean ball. In the former case, we obtain the first computationally efficient algorithm with a $d\sqrt{n}$ regret, thus improving by a factor $\sqrt{d\log n}$ over the best known result for a computationally efficient algorithm. In the latter case, our approach gives the first algorithm with a $\sqrt{dn\log n}$ regret, again shaving off an extraneous $\sqrt{d}$ compared to previous works.
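To make the exponential-weights part of the abstract concrete, here is a compact sketch for a finite action set (my illustration: it uses uniform exploration and a pseudoinverse-based loss estimator, rather than the particular exploration distribution or the Mirror Descent strategies the paper analyzes):

import numpy as np

def exp_weights_linear_bandit(actions, loss_vectors, eta, gamma, seed=0):
    # actions: (N, d) finite action set; loss_vectors: (n, d) hidden loss vectors z_t.
    # Each round: sample an action from exponential weights mixed with uniform
    # exploration, observe only the scalar loss <a_t, z_t>, and form an unbiased
    # least-squares estimate of z_t for the weight update.
    rng = np.random.default_rng(seed)
    N, d = actions.shape
    log_w = np.zeros(N)
    total = 0.0
    for z in loss_vectors:
        q = np.exp(log_w - log_w.max())
        q /= q.sum()
        p = (1.0 - gamma) * q + gamma / N                # mix in uniform exploration
        i = rng.choice(N, p=p)
        obs = float(actions[i] @ z)                      # bandit feedback only
        total += obs
        P = actions.T @ (p[:, None] * actions)           # E_{a ~ p}[a a^T]
        z_hat = np.linalg.pinv(P) @ actions[i] * obs     # unbiased when the actions span R^d
        log_w -= eta * (actions @ z_hat)
    return total

With a suitably chosen exploration distribution, this style of algorithm is what achieves the $\sqrt{dn\log N}$ rate quoted above for finite action sets.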
E. Hazan and S. M. Kakade, (weak) Calibration is Computationally Hard. Conference on Learning Theory (COLT): ArXiv Report, 2012.
We show that the existence of a computationally efficient calibration algorithm, with a low weak calibration rate, would imply the existence of an efficient algorithm for computing approximate Nash equilibria, thus implying the unlikely conclusion that every problem in PPAD is solvable in polynomial time.
D. Hsu, S. M. Kakade, and T. Zhang, Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability: ArXiv Report, 2012.
We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and approximate matrix multiplication are given to illustrate the utility of the new bounds.
D. Hsu, S. M. Kakade, and T. Zhang, A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability: ArXiv Report, 2012.
We prove an exponential probability tail inequality for positive semidefinite quadratic forms in a subgaussian random vector. The bound is analogous to one that holds when the vector has independent Gaussian entries.
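For orientation, the Gaussian-entries analogue alluded to above takes the following familiar form (stated here in a standard normalization, as in Laurent and Massart's chi-square bound; the subgaussian version in the paper carries the same trace/Frobenius/operator-norm structure, with constants depending on the subgaussian parameter): if $x \sim N(0, I_n)$ and $\Sigma = A^\top A$, then for all $t > 0$,

  $\Pr\big[\, \|A x\|^2 > \mathrm{tr}(\Sigma) + 2\sqrt{\mathrm{tr}(\Sigma^2)\, t} + 2\,\|\Sigma\|\, t \,\big] \le e^{-t}.$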