2008

V. Dani, T. Hayes, and S. M. Kakade, “Stochastic Linear Optimization under Bandit Feedback,” 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, pp. 355-366, 2008.
In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer (2002)), we view the arms as vectors in R^n, and require that the costs be linear functions of the chosen vector. As before, it is assumed that the cost functions are sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds T and the dimensionality n (rather than the number of arms, k, which may be exponential in n or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds for the regret. In certain special cases (such as when the decision region is a polytope), the regret is polylog(T). In general, though, the optimal regret is Θ(√T) — our lower bounds rule out the possibility of obtaining polylog(T) rates in general. We present two variants of an algorithm based on the idea of "upper confidence bounds." The first, due to Auer (2002), but not fully analyzed, obtains regret whose dependence on n and T are both essentially optimal, but which may be computationally intractable when the decision set is a polytope. The second version can be efficiently implemented when the decision set is a polytope (given as an intersection of half-spaces), but gives up a factor of √n in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.
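To make the "upper confidence bound" idea concrete, here is a minimal sketch for a finite decision set: maintain a ridge regression estimate of the unknown cost vector and play the arm whose optimistic (lower confidence) cost is smallest. This illustrates the general technique, not the paper's exact algorithm or its tuned confidence radius; the function names and the fixed beta parameter are assumptions.

```python
import numpy as np

def confidence_ball_bandit(arms, cost_fn, T, lam=1.0, beta=2.0):
    """Upper-confidence-bound play for a stochastic linear bandit.

    arms:    (k, n) array of decision vectors.
    cost_fn: cost_fn(x) returns a noisy linear cost <mu, x> + noise.
    beta:    confidence-radius parameter (set by theory; fixed here).
    """
    k, n = arms.shape
    A = lam * np.eye(n)          # regularized Gram matrix
    b = np.zeros(n)              # running sum of cost_t * x_t
    costs = []
    for t in range(T):
        mu_hat = np.linalg.solve(A, b)          # ridge estimate of mu
        A_inv = np.linalg.inv(A)
        # confidence width of each arm: sqrt(x^T A^{-1} x)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        lcb = arms @ mu_hat - beta * widths     # lowest plausible cost
        x = arms[np.argmin(lcb)]                # play the optimistic arm
        c = cost_fn(x)
        A += np.outer(x, x)
        b += c * x
        costs.append(c)
    return np.array(costs)

# toy usage: unknown cost vector mu, Gaussian noise
rng = np.random.default_rng(0)
mu = np.array([0.8, -0.3, 0.5])
arms = rng.normal(size=(20, 3))
costs = confidence_ball_bandit(arms, lambda x: x @ mu + 0.1 * rng.normal(), T=500)
```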
P. Bartlett, V. Dani, T. Hayes, S. Kakade, A. Rakhlin, and A. Tewari, “High-Probability Regret Bounds for Bandit Online Linear Optimization,” 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, pp. 335-342, 2008.
We present a modification of the algorithm of Dani et al. (9) for the online linear optimization problem in the bandit setting, which with high probability has regret at most O*(√T) against an adaptive adversary. This improves on the previous algorithm (9), whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension (n^{3/2}) as that exhibited by Dani et al. The results of this paper rest firmly on those of (9) and the remarkable technique of Auer et al. (2) for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.
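The "optimistic estimates" technique of Auer et al. (2) is easiest to see in the K-armed case, where it gives the Exp3.P algorithm: each arm's importance-weighted gain estimate receives an added β/p bonus so that, with high probability, the estimates upper-bound the true gains. The sketch below shows that simpler K-armed form under assumed parameter settings, not the linear-bandit construction analyzed in this paper.

```python
import numpy as np

def exp3p(K, gain_fn, T, gamma=0.1, eta=0.05, beta=0.01, seed=0):
    """Exp3.P-style K-armed bandit with optimistic gain estimates.

    gain_fn(i, t) must return the gain of arm i at round t, in [0, 1].
    The beta/p bonus makes each estimate an upper bound on the true
    gain with high probability, which is what converts in-expectation
    regret bounds into high-probability ones.
    """
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    gains = []
    for t in range(T):
        p = (1 - gamma) * weights / weights.sum() + gamma / K
        i = int(rng.choice(K, p=p))
        g = gain_fn(i, t)
        gains.append(g)
        g_hat = beta / p                 # optimistic bonus for every arm
        g_hat[i] += g / p[i]             # importance-weighted observed gain
        weights *= np.exp(eta * g_hat)
        weights /= weights.max()         # rescale to avoid overflow
    return np.array(gains)
```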
K. Sridharan and S. M. Kakade, “An Information Theoretic Framework for Multi-view Learning,” 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, pp. 403-414, 2008.
In the multi-view learning paradigm, the input variable is partitioned into two different views X1 and X2, and there is a target variable Y of interest. The underlying assumption is that either view alone is sufficient to predict the target Y accurately. This provides a natural semi-supervised learning setting in which unlabeled data can be used to eliminate hypotheses from either view whose predictions tend to disagree with predictions based on the other view. This work explicitly formalizes an information theoretic, multi-view assumption and studies the multi-view paradigm in the PAC-style semi-supervised framework of Balcan and Blum [2006]. Underlying the PAC-style framework is the assumption that an incompatibility function is known — roughly speaking, this incompatibility function is a means to score how good a function is based on the unlabeled data alone. Here, we show how to derive incompatibility functions for certain loss functions of interest, so that minimizing this incompatibility over unlabeled data helps reduce expected loss on future test cases. In particular, we show how the class of empirically successful co-regularization algorithms fall into our framework and provide performance bounds (using the results in Rosenberg and Bartlett [2007], Farquhar et al. [2005]). We also provide a normative justification for canonical correlation analysis (CCA) as a dimensionality reduction technique. In particular, we show (for strictly convex loss functions of the form ℓ(w · x, y)) that we can first use CCA as a dimensionality reduction technique, and (if the multi-view assumption is satisfied) this projection does not throw away much predictive information about the target Y — the benefit being that subsequent learning with a labeled set need only work in this lower dimensional space.
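A minimal sketch of the CCA-then-predict recipe described above, on synthetic two-view data: fit CCA using only paired (unlabeled) views, then train a predictor on a small labeled set in the projected space. The data-generating process, the component count, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# synthetic two-view data: a shared latent signal z drives both views and Y
n, d1, d2, k = 1000, 20, 30, 3
z = rng.normal(size=(n, k))                      # shared latent factors
x1 = z @ rng.normal(size=(k, d1)) + 0.5 * rng.normal(size=(n, d1))
x2 = z @ rng.normal(size=(k, d2)) + 0.5 * rng.normal(size=(n, d2))
y = (z[:, 0] > 0).astype(int)                    # target depends on the latent

# step 1: fit CCA on the (plentiful, effectively unlabeled) paired views
cca = CCA(n_components=k)
cca.fit(x1, x2)                                  # labels never used here

# step 2: learn a predictor in the low-dimensional projected space,
# using only a small labeled subset
x1_proj, _ = cca.transform(x1, x2)
labeled = slice(0, 100)                          # pretend only 100 labels exist
clf = LogisticRegression().fit(x1_proj[labeled], y[labeled])
print("held-out accuracy:", clf.score(x1_proj[200:], y[200:]))
```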
S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Efficient Bandit Algorithms for Online Multiclass Prediction,” ICML '08: Proceedings of the 25th International Conference on Machine Learning, pp. 440-447, 2008.
This paper introduces the Banditron, a variant of the Perceptron [Rosenblatt, 1958], for the multiclass bandit setting. The multiclass bandit setting models a wide range of practical supervised learning applications where the learner only receives partial feedback (referred to as "bandit" feedback, in the spirit of multi-armed bandit models) with respect to the true label (e.g. in many web applications users often only provide positive "click" feedback, which does not necessarily fully disclose the true label). The Banditron is able to learn in a multiclass classification setting with "bandit" feedback that only reveals whether the prediction made by the algorithm was correct (but does not necessarily reveal the true label). We provide (relative) mistake bounds which show that the Banditron enjoys favorable performance, and our experiments demonstrate the practicality of the algorithm. Furthermore, this paper pays close attention to the important special case when the data is linearly separable — a problem which has been exhaustively studied in the full information setting yet is novel in the bandit setting.
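The Banditron update itself is short. The sketch below follows the update rule described in the paper: predict greedily, explore uniformly with probability γ, and apply an importance-weighted Perceptron-style update using only the correct/incorrect bit. The class interface and the value of γ are illustrative choices.

```python
import numpy as np

class Banditron:
    """Banditron sketch (Kakade, Shalev-Shwartz, Tewari, ICML 2008).

    A multiclass Perceptron variant that only observes whether its
    (randomized) prediction was correct, never the true label itself.
    """
    def __init__(self, n_features, n_classes, gamma=0.05, seed=0):
        self.W = np.zeros((n_classes, n_features))
        self.gamma = gamma
        self.k = n_classes
        self.rng = np.random.default_rng(seed)

    def step(self, x, y):
        scores = self.W @ x
        y_hat = int(np.argmax(scores))            # greedy prediction
        p = np.full(self.k, self.gamma / self.k)  # uniform exploration mass
        p[y_hat] += 1.0 - self.gamma
        y_tilde = int(self.rng.choice(self.k, p=p))
        correct = (y_tilde == y)                  # the only feedback used
        # importance-weighted Perceptron-style update
        U = np.zeros_like(self.W)
        if correct:
            U[y_tilde] += x / p[y_tilde]
        U[y_hat] -= x
        self.W += U
        return y_tilde, correct
```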
M. Seeger, S. M. Kakade, and D. P. Foster, “Information Consistency of Nonparametric Gaussian Process Methods,” IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 2376-2382, 2008.
Bayesian nonparametric models are widely and successfully used for statistical prediction. While posterior consistency properties are well studied in quite general settings, results have been proved using abstract concepts such as metric entropy, and they come with subtle conditions which are hard to validate and not intuitive when applied to concrete models. Furthermore, convergence rates are difficult to obtain. By focusing on the concept of information consistency for Bayesian Gaussian process (GP) models, consistency results and convergence rates are obtained via a regret bound on cumulative log loss. These results depend strongly on the covariance function of the prior process, thereby giving a novel interpretation to penalization with reproducing kernel Hilbert space norms and to commonly used covariance function classes and their parameters. The proof of the main result employs elementary convexity arguments only. A theorem of Widom is used in order to obtain precise convergence rates for several covariance functions widely used in practice.
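The central quantity here is the cumulative log loss of the online GP predictor, which by the chain rule equals the negative log marginal likelihood of the data; the paper's results bound its regret against any fixed function in the associated RKHS. The following is a minimal sketch of that quantity for GP regression, assuming a squared-exponential covariance with unit prior variance and a fixed noise level.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential covariance, one of the classes analyzed."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def cumulative_log_loss(X, y, noise=0.1):
    """Online cumulative negative log predictive density of GP regression.

    Round t predicts y[t] from the posterior given the first t points;
    by the chain rule the sum equals the joint negative log marginal
    likelihood, the quantity the paper's regret bound controls.
    """
    n = len(y)
    total = 0.0
    for t in range(n):
        if t == 0:
            mean, var = 0.0, 1.0 + noise**2      # prior predictive
        else:
            K = rbf_kernel(X[:t], X[:t]) + noise**2 * np.eye(t)
            k_star = rbf_kernel(X[:t], X[t:t+1])[:, 0]
            mean = k_star @ np.linalg.solve(K, y[:t])
            var = 1.0 - k_star @ np.linalg.solve(K, k_star) + noise**2
        total += 0.5 * (np.log(2 * np.pi * var) + (y[t] - mean) ** 2 / var)
    return total
```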
S. M. Kakade and D. P. Foster, “Deterministic Calibration and Nash Equilibrium,” Journal of Computer and System Sciences (Learning Theory Special Issue), vol. 74, no. 1, pp. 115-130, 2008.
We provide a natural learning process in which the joint frequency of (time-averaged) empirical play converges into the set of convex combinations of Nash equilibria. Furthermore, the actual distribution of players' actions is close to some (approximate) Nash equilibria on most rounds (on all but a vanishing fraction of the rounds). In this process, all players rationally choose their actions using a public prediction made by a deterministic, weakly calibrated algorithm. For this to be possible, we show that such a deterministic (weakly) calibrated learning algorithm exists.
V. Dani, T. Hayes, and S. M. Kakade, “The Price of Bandit Information for Online Optimization,” Proceedings of NIPS, pp. 345-352, 2008.
In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ R^n in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O*(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O*(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)·T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems.
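The paper's bandit algorithm relies on mixing in exploration over a spanning set of decisions, using the single observed scalar to build an unbiased estimate of the cost vector, and feeding those estimates to a full-information learner. The sketch below illustrates only that core idea for a finite decision set, with a crude exploration basis and a Hedge learner; it is a simplification under stated assumptions, not the paper's algorithm (which uses a barycentric spanner and a more careful analysis).

```python
import numpy as np

def bandit_linear_hedge(decisions, cost_fn, T, gamma=0.1, eta=0.05, seed=0):
    """Exploration-basis sketch for bandit linear optimization.

    decisions: (k, n) array of decision vectors (assumed to span R^n).
    cost_fn(x, t): returns only the scalar cost <c_t, x>.
    With prob. gamma we play a basis direction to estimate c_t; the
    importance-weighted estimate is unbiased and drives Hedge updates.
    """
    rng = np.random.default_rng(seed)
    k, n = decisions.shape
    basis = decisions[:n]                 # crude exploration basis (assumption:
    basis_inv = np.linalg.inv(basis)      # the first n rows are independent)
    weights = np.ones(k)
    total = 0.0
    for t in range(T):
        if rng.random() < gamma:          # explore: play a basis decision
            j = int(rng.integers(n))
            c_obs = cost_fn(basis[j], t)
            total += c_obs
            # unbiased estimate: E[c_hat] = c_t since P(explore j) = gamma/n
            c_hat = (n / gamma) * c_obs * basis_inv[:, j]
            weights *= np.exp(-eta * (decisions @ c_hat))
            weights /= weights.max()      # rescale for numerical stability
        else:                             # exploit: sample from Hedge
            p = weights / weights.sum()
            i = int(rng.choice(k, p=p))
            total += cost_fn(decisions[i], t)
    return total
```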