2002

S. Kakade, Y. W. Teh, and S. Roweis, An Alternative Objective Function for Markovian Fields. Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
In labelling or prediction tasks, a trained model's test performance is often based on the quality of its single-time marginal distributions over labels rather than its joint distribution over label sequences. We propose using a new cost function for discriminative learning that more accurately reflects such test time conditions. We present an efficient method to compute the gradient of this cost for Maximum Entropy Markov Models, Conditional Random Fields, and for an extension of these models...
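To make the contrast concrete, the two objectives for a conditional model over a label sequence y_1, ..., y_T given input x can be written as follows (generic notation, not necessarily the paper's):

    \mathcal{L}_{\text{joint}}(\theta) = \log p_\theta(y_{1:T} \mid x),
    \qquad
    \mathcal{L}_{\text{marginal}}(\theta) = \sum_{t=1}^{T} \log p_\theta(y_t \mid x),
    \quad\text{where } p_\theta(y_t \mid x) = \sum_{y_1,\dots,y_{t-1},\,y_{t+1},\dots,y_T} p_\theta(y_{1:T} \mid x).

In chain-structured models the single-time marginals, and hence the gradient of the second objective, can be obtained with forward-backward recursions.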
S. Kakade and P. Dayan, Acquisition and Extinction in Autoshaping. Psychological Review, 2002.
C. R. Gallistel and J. Gibbon (2000) presented quantitative data on the speed with which animals acquire behavioral responses during autoshaping, together with a statistical model of learning intended to account for them. Although this model captures the form of the dependencies among critical variables, its detailed predictions are substantially at variance with the data. In the present article, further key data on the speed of acquisition are used to motivate an alternative model of learning, in which animals can be interpreted as paying different amounts of attention to stimuli according to estimates of their differential reliabilities as predictors.
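One loose way to picture "paying different amounts of attention to stimuli according to estimates of their differential reliabilities" is a delta-rule learner whose per-stimulus learning rate is scaled by a running reliability estimate. The sketch below only conveys that idea and is not the model developed in the paper:

    def attention_weighted_update(weights, attention, present, reward, base_lr=0.1):
        # present: set of stimuli shown on this trial; weights/attention: dicts keyed by stimulus.
        prediction = sum(weights[s] for s in present)
        error = reward - prediction
        for s in present:
            weights[s] += base_lr * attention[s] * error
            # attention drifts toward 1 when the stimulus predicts well, toward 0 otherwise
            attention[s] += 0.1 * ((1.0 - min(abs(error), 1.0)) - attention[s])
        return error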
S. Kakade and P. Dayan, Dopamine: Generalization and Bonuses. Neural Networks, 2002.
In the temporal difference model of primate dopamine neurons, their phasic activity reports a prediction error for future reward. This model is supported by a wealth of experimental data. However, in certain circumstances, the activity of the dopamine cells seems anomalous under the model, as they respond in particular ways to stimuli that are not obviously related to predictions of reward. In this paper, we address two important sets of anomalies, those having to do with generalization and novelty. Generalization responses are treated as the natural consequence of partial information; novelty responses are treated by the suggestion that dopamine cells multiplex information about reward bonuses, including exploration bonuses and shaping bonuses. We interpret this additional role for dopamine in terms of the mechanistic attentional and psychomotor effects of dopamine, having the computational role of guiding exploration.
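For context, the temporal-difference prediction error the abstract refers to is delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), and one way to read the bonus proposal is that novelty-related quantities are added to the reward term. The sketch below is illustrative only; the specific bonus form is a placeholder, not the paper's:

    def td_error(V, s, s_next, reward, bonus=0.0, gamma=0.98):
        # Standard TD(0) prediction error, with an additive bonus (e.g. an
        # exploration or shaping bonus) folded into the reward signal.
        return (reward + bonus) + gamma * V[s_next] - V[s]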
S. Kakade, A Natural Policy Gradient. Advances in Neural Information Processing Systems 14 (NIPS 2001), 2002.
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
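As a rough illustration of the update involved (a sketch, not the paper's implementation): the natural gradient preconditions the vanilla policy gradient with the Fisher information matrix of the policy parameterization, stepping along F(theta)^{-1} * grad eta(theta) rather than grad eta(theta). A minimal NumPy version, assuming the gradient and a Fisher estimate are already available:

    import numpy as np

    def natural_gradient_step(theta, grad, fisher, lr=0.1, ridge=1e-6):
        # theta <- theta + lr * F^{-1} grad; the ridge term keeps F invertible.
        F = fisher + ridge * np.eye(theta.shape[0])
        return theta + lr * np.linalg.solve(F, grad)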
S. Kakade and J. Langford, Approximately Optimal Approximate Reinforcement Learning. Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
In order to solve realistic reinforcement learning problems, it is critical that approximate algorithms be used. In this paper, ...
J. Langford, M. Zinkevich, and S. Kakade, Competitive Analysis of the Explore/Exploit Tradeoff. Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
We investigate the explore/exploit trade-off in reinforcement learning using competitive analysis applied to an abstract model. We state and prove lower and upper bounds on the competitive ratio. The essential conclusion of our analysis is that optimizing the explore/exploit trade-off is much easier with a few pieces of extra knowledge such as the stopping time or upper and lower bounds on the value of the optimal exploitation policy.
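For reference, one generic way to define the competitive ratio in this setting (not necessarily the paper's exact formulation) compares the value of the optimal exploitation policy to the value obtained by the online learner:

    \mathrm{CR} = \sup_{\text{instances}} \frac{V^{*}}{\mathbb{E}\!\left[ V^{\text{alg}} \right]} \;\ge\; 1,

with smaller values meaning the explore/exploit strategy gives up less relative to an agent that already knows how to exploit.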
N. D. Daw, S. Kakade, and P. Dayan, Opponent Interactions Between Serotonin and Dopamine. Neural Networks, 2002.
This article discusses a neural network model concerning the apparent opponent partnership of serotonin and dopamine. Anatomical and pharmacological evidence suggests that the dorsal raphe serotonin system and the ventral tegmental and substantia nigra dopamine system may act as mutual opponents. In the light of the temporal difference model of the involvement of the dopamine system in reward learning, the proposed model incorporates 3 aspects of motivational opponency involving dopamine and serotonin: (1) a tonic serotonergic signal reports the long-run average reward rate as part of an average-case reinforcement learning model; (2) a tonic dopaminergic signal reports the long-run average punishment rate in a similar context; and (3) a phasic serotonin signal might report an ongoing prediction error for future punishment.
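As a small illustration of the average-case reinforcement learning setting the abstract invokes (a sketch under generic assumptions, not the paper's model): in average-reward TD learning, a slowly updated estimate rho of the long-run reward rate is subtracted from each reward, so tonic signals in the opponency story correspond to quantities like rho:

    def average_reward_td(V, s, s_next, reward, rho, rho_lr=0.01):
        # delta_t = r_t - rho + V(s_{t+1}) - V(s_t); rho tracks the long-run
        # average reward rate and is nudged by the same prediction error.
        delta = reward - rho + V[s_next] - V[s]
        rho = rho + rho_lr * delta
        return delta, rho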