2001

S. Kakade, Optimizing Average Reward Using Discounted Rewards. COLT '01/EuroCOLT '01: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, 2001.
In many reinforcement learning problems, it is appropriate to optimize the average reward. In practice, this is often done by solving the Bellman equations using a discount factor close to 1. In this paper, we provide a bound on the average reward of the policy obtained by solving the Bellman equations which depends on the relationship between the discount factor and the mixing time of the Markov chain. We extend this result to the direct policy gradient of Baxter and Bartlett, in which a discount parameter is used to find a biased estimate of the gradient of the average reward with respect to the parameters of a policy. We show that this biased gradient is an exact gradient of a related discounted problem and provide a bound on the optima found by following these biased gradients of the average reward. Further, we show that the exact Hessian in this related discounted problem is an approximate Hessian of the average reward, with equality in the limit as the discount factor tends to 1. We then provide an algorithm to estimate the Hessian from a sample path of the underlying Markov chain, which converges with probability 1.
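As a rough illustration of the practice the abstract describes (not code from the paper), the sketch below solves the discounted Bellman equations by value iteration on a made-up two-state MDP and then measures the average reward of the resulting greedy policy as the discount factor approaches 1. All function names, the toy transition kernel, and the rewards are hypothetical.

```python
# Minimal sketch: discounted value iteration as a proxy for average-reward
# optimization, on a toy MDP invented for illustration.
import numpy as np

def greedy_policy_from_discounted_vi(P, R, gamma, iters=5000):
    """P: (A, S, S) transition kernel, R: (S, A) rewards."""
    S = P.shape[1]
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * np.einsum('asj,j->sa', P, V)   # Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def average_reward(P, R, policy):
    """Average reward of a policy via the stationary distribution of its chain."""
    S = P.shape[1]
    P_pi = P[policy, np.arange(S), :]                  # (S, S) induced chain
    mu = np.ones(S) / S
    for _ in range(5000):                              # power iteration
        mu = mu @ P_pi
    return float(mu @ R[np.arange(S), policy])

# Toy two-state, two-action MDP (numbers made up for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],               # action 0
              [[0.5, 0.5], [0.6, 0.4]]])              # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
for gamma in (0.9, 0.99, 0.999):                       # discount factor -> 1
    pi = greedy_policy_from_discounted_vi(P, R, gamma)
    print(gamma, pi, average_reward(P, R, pi))
```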
S. Kakade and P. Dayan, Dopamine Bonuses. Advances in Neural Information Processing Systems 13, 2001.
Substantial data support a temporal difference (TD) model of dopamine (DA) neuron activity in which the cells provide a global error signal for reinforcement learning. However, in certain circumstances, DA activity seems anomalous under the TD model, responding to non-rewarding stimuli. We address these anomalies by suggesting that DA cells multiplex information about reward bonuses, including Sutton's exploration bonuses and Ng et al.'s non-distorting shaping bonuses. We interpret this additional role for DA in terms of the unconditional attentional and psychomotor effects of dopamine, which have the computational role of guiding exploration.
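The abstract refers to two concrete kinds of bonus: Sutton-style exploration or novelty bonuses, and Ng et al.'s potential-based shaping bonuses F(s, s') = γφ(s') − φ(s), which leave the optimal policy unchanged. The sketch below shows, in a minimal way, how such bonuses enter a TD(0) error (the quantity the TD model identifies with the phasic DA signal); it is not the paper's model, and the chain, potentials, and constants are illustrative assumptions.

```python
# Minimal sketch: a TD(0) error whose reward is augmented with a novelty
# bonus and a potential-based shaping bonus (toy values only).
import numpy as np

def td_error_with_bonuses(V, s, s_next, r, gamma, phi, novelty_bonus=0.0):
    shaping = gamma * phi[s_next] - phi[s]        # potential-based shaping bonus
    return (r + novelty_bonus + shaping) + gamma * V[s_next] - V[s]

# Toy usage on a 3-state chain (states 0 -> 1 -> 2; numbers made up).
V = np.zeros(3)
phi = np.array([0.0, 0.5, 1.0])                   # shaping potential phi(s)
alpha, gamma = 0.1, 0.95
transitions = [(0, 1, 0.0), (1, 2, 1.0), (2, 2, 0.0)]
for s, s_next, r in transitions:
    delta = td_error_with_bonuses(V, s, s_next, r, gamma, phi,
                                  novelty_bonus=0.1)   # e.g. a novel stimulus
    V[s] += alpha * delta                          # delta plays the role of the DA signal
```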
P. Dayan and S. Kakade, Explaining Away in Weight Space. Advances in Neural Information Processing Systems 13 (NIPS 2000), 2001.
Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.
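As a loose sketch of the kind of model the abstract describes (not the authors' code), the snippet below runs a Kalman filter over two stimulus-to-reward weights. Training stimuli A and B together induces a negative posterior covariance between their weights, so subsequently pairing A alone with reward pushes the weight of B back down, i.e. backward blocking via explaining away in weight space. The priors, noise variances, and trial counts are made up for illustration.

```python
# Minimal sketch: Kalman filtering over association weights reproduces
# backward blocking (toy parameters only).
import numpy as np

def kalman_step(w, P, x, r, obs_var=0.1, drift_var=0.01):
    P = P + drift_var * np.eye(len(w))         # random-walk drift on the weights
    s = x @ P @ x + obs_var                    # innovation variance
    K = P @ x / s                              # Kalman gain
    w = w + K * (r - x @ w)                    # weight (mean) update
    P = P - np.outer(K, x) @ P                 # posterior covariance update
    return w, P

w, P = np.zeros(2), np.eye(2)                  # weights for stimuli A and B
A_and_B, A_alone = np.array([1.0, 1.0]), np.array([1.0, 0.0])

for _ in range(20):                            # stage 1: A+B paired with reward
    w, P = kalman_step(w, P, A_and_B, r=1.0)
wB_after_stage1 = w[1]
for _ in range(20):                            # stage 2: A alone paired with reward
    w, P = kalman_step(w, P, A_alone, r=1.0)
print(wB_after_stage1, w[1])                   # w_B falls: backward blocking
```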