textbook: Reinforcement Learning: An Introduction - Richard Sutton and Andrew Barto
******** Fundamentals of Reinforcement Learning
**** week1
multi-armed bandit
(state added)
=> MDP (Markov Decision Process): state-value function, action-value function, policy
reward hypothesis (Michael Littman)
episode, discount rate
state-value Bellman eq., action-value Bellman eq.
state-value optimality eq., action-value optimality eq.
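for reference, the four equations in the textbook's notation (p = environment dynamics, gamma = discount rate):
v_\pi(s)   = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a) \left[ r + \gamma \, v_\pi(s') \right]
q_\pi(s,a) = \sum_{s', r} p(s', r|s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s') \, q_\pi(s', a') \right]
v_*(s)     = \max_{a} \sum_{s', r} p(s', r|s, a) \left[ r + \gamma \, v_*(s') \right]
q_*(s,a)   = \sum_{s', r} p(s', r|s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]
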
iterative policy evaluation by state-value Bellman eq.
policy improvement by greedification
policy iteration = policy evaluation + policy improvement
(these iterations are dynamic programming)
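a minimal Python sketch of one policy-iteration cycle for a tabular MDP; the transitions dict mapping (s, a) -> list of (prob, next_state, reward, done) is a hypothetical representation of the model, not anything from the course:

import numpy as np

def policy_evaluation(policy, transitions, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation using the state-value Bellman equation."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(policy[s][a] *
                        sum(p * (r + gamma * V[s2] * (not done))
                            for p, s2, r, done in transitions[(s, a)])
                        for a in range(n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:            # stop when the value function has converged
            return V

def greedify(V, transitions, n_states, n_actions, gamma=0.9):
    """Policy improvement: return the deterministic policy greedy w.r.t. V."""
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2] * (not done))
                 for p, s2, r, done in transitions[(s, a)])
             for a in range(n_actions)]
        policy[s, int(np.argmax(q))] = 1.0
    return policy

alternating these two functions until the policy stops changing is policy iteration.
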
generalized policy iteration (GPI)
(brute-force policy evaluation is practically impossible in many cases)
=> policy evaluation with dynamic programming,
which requires a model of the environment (the dynamics p(s', r | s, a))
Warren Powell's industry case study
******** Sample-based Learning Methods
**** week1
generalized policy iteration using Monte Carlo methods: sample episodes, then average the returns
epsilon-soft policy
off-policy learning: behavior policy vs. target policy
=> needs importance sampling
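a minimal Python sketch of off-policy Monte Carlo prediction with ordinary importance sampling; pi(s, a) and b(s, a) are hypothetical callables giving target and behavior action probabilities, returns is a defaultdict(list) keyed by state:

def mc_off_policy_prediction(episode, V, returns, pi, b, gamma=1.0):
    """Every-visit update of V from one episode generated by the behavior policy b.
    episode is a list of (state, action, reward), reward following (state, action)."""
    G, rho = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r
        rho *= pi(s, a) / b(s, a)                  # importance sampling ratio rho_{t:T-1}
        returns[s].append(rho * G)                 # weight the sampled return
        V[s] = sum(returns[s]) / len(returns[s])   # ordinary IS: simple average

with pi == b the ratio is always 1 and this reduces to plain on-policy Monte Carlo averaging.
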
batch RL (Emma Brunskill)
temporal-difference (TD) learning: update from each (s, a, r, s') transition, not from whole episodes
that is, it needs only experience (no model of the environment, unlike dynamic programming)
Richard Sutton: prediction learning is a natural way we learn,
and TD learning is a proper formalization of prediction learning
“Comparing TD and Monte Carlo” in Week 3 of Sample-based Learning Methods
=> also shows the effect of the step size
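a minimal Python sketch of tabular TD(0) prediction, assuming V is a numpy array (or a dict with all states initialized):

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) update from a single (s, r, s') transition; the target bootstraps from V(s')."""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])    # step size alpha times the TD error
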
**** week4
Sarsa: TD algorithm from action-value Bellman eq., policy iteration, on-policy
Q-learning: TD algorithm from action-value Bellman optimality eq., value iteration, off-policy without importance sampling
Q-learning learns the optimal action values directly, so it often learns faster than Sarsa
Sarsa accounts for its own exploration, so it behaves more reliably (safely) during learning (cf. the cliff-walking example)
Expected Sarsa: lower-variance targets than Sarsa, more computation per step, can be off-policy without importance sampling
greedy Expected Sarsa (= Expected Sarsa with a greedy target policy) = Q-learning
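a minimal Python sketch contrasting the three TD targets; Q is a tabular numpy array indexed [state, action] and policy_probs(s) is a hypothetical callable returning the target policy's action probabilities:

import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    return r + gamma * Q[s_next, a_next]                 # uses the action actually taken next

def q_learning_target(Q, r, s_next, gamma=0.99):
    return r + gamma * np.max(Q[s_next])                 # greedy (optimality) bootstrap

def expected_sarsa_target(Q, r, s_next, policy_probs, gamma=0.99):
    return r + gamma * policy_probs(s_next) @ Q[s_next]  # expectation under the target policy

def td_control_update(Q, s, a, target, alpha=0.5):
    Q[s, a] += alpha * (target - Q[s, a])

plugging a greedy policy_probs into expected_sarsa_target recovers q_learning_target.
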
**** week5
Dyna-Q: model-based learning (learn a model, then plan from simulated experience)
is like off-policy learning with experience replay
Dyna-Q+: copes with an inaccurate (changing) model by adding an exploration bonus for long-untried state-action pairs during planning
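a minimal Python sketch of the Dyna-Q loop for one real step, assuming a deterministic learned model stored as a dict (s, a) -> (r, s'):

import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.95, n_planning=10):
    """Direct RL + model learning + planning from one real transition."""
    # direct RL: Q-learning update from the real experience
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # model learning: remember the last observed outcome (deterministic world assumed)
    model[(s, a)] = (r, s_next)
    # planning: n Q-learning updates on transitions replayed from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        # Dyna-Q+ would add a bonus kappa * sqrt(tau) to pr here, where tau is
        # the time since (ps, pa) was last tried in the real environment
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])
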
model-based learning (Drew Bagnell): quadratic value-function approximation for continuous states and actions
******** Prediction and Control with Function Approximation
**** week2
coarse coding, tile coding
function approximation with NN
3 functions to approximate in RL: policy, value functions (state-value, action-value), model
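a minimal Python sketch of semi-gradient TD(0) with a linear value function over (e.g. tile-coded) features; x(s) is a hypothetical feature function returning a numpy vector:

import numpy as np

def semi_gradient_td0(w, x, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """One update of w for v_hat(s, w) = w . x(s)."""
    v_next = 0.0 if done else np.dot(w, x(s_next))
    delta = r + gamma * v_next - np.dot(w, x(s))   # TD error
    w += alpha * delta * x(s)                      # gradient of v_hat w.r.t. w is x(s)
    return w
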
**** week3
average reward
differential returns and differential value functions
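for reference, in the average-reward setting (textbook Chapter 10):
r(\pi)   = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[ R_t \mid S_0, A_{0:t-1} \sim \pi ]          (average reward)
G_t      = (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \cdots                                  (differential return)
\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})                          (differential TD error)
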
reward design: intrinsic reward (Satinder Singh)
**** week4
softmax policy (compared to epsilon-greedy)
policy gradient learning: learn the policy directly (but the policy-gradient theorem still involves the action-value function)
approximating the action value via a learned state value, which is the Critic
=> Actor-Critic
Critic learns using semi-gradient TD
Actor learns using the TD error from the Critic
Gaussian policy for continuous actions
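a minimal Python sketch of a one-step actor-critic update with a tabular softmax actor (preferences theta[s, a]) and a tabular critic V; this is a simplified discounted version (it omits the textbook's gamma^t factor on the actor update), and a Gaussian policy would change only how the policy and its log-gradient are computed:

import numpy as np

def softmax_probs(theta, s):
    prefs = theta[s] - np.max(theta[s])            # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_w=0.1, alpha_theta=0.01, gamma=0.99):
    v_next = 0.0 if done else V[s_next]
    delta = r + gamma * v_next - V[s]              # TD error from the critic
    V[s] += alpha_w * delta                        # critic: TD(0) update
    grad_ln_pi = -softmax_probs(theta, s)          # d ln pi(a|s) / d theta[s, :]
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_ln_pi   # actor: policy-gradient step
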
******** A Complete Reinforcement Learning System (Capstone)
**** week2
The Hedonistic Neuron (by A. Harry Klopf)
Eligibility Trace: contingent (Actor) or non-contingent (Critic)
Agnostic System Identification for Model-Based Reinforcement Learning 2012 (Drew Bagnell)
mobile health (Susan Murphy)
Adam optimizer = momentum + per-parameter (vector) adaptive step sizes
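a minimal Python sketch of one Adam step, assuming numpy arrays for the weights w, gradient g, and moment estimates m, v:

import numpy as np

def adam_update(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum (m) plus a per-parameter step size derived from v; t starts at 1."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
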
data efficiency needed => experience replay with replay buffer
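a minimal Python sketch of a uniform replay buffer (hypothetical class, not the course's implementation):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)       # oldest transitions are dropped first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), batch_size)   # uniform minibatch
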
reproducibility crisis (Joelle Pineau)