- Use past experience with an incompletely known system to predict its future behavior
- Assign credit via the differences between temporally successive predictions
- Good for pattern-recognition problems
- Incremental, low-memory alternative to supervised learning, shown to be equally effective or better
- In supervised learning (historically the more popular approach)
- The learner learns to associate pairs (input -> expected output)
- Single-step: all information about correctness is revealed at once (e.g. predicting the weather)
- Multi-step: correctness is revealed only several steps after the prediction must be made (e.g. evaluating chess moves)
- Real-world problems tend to be multi-step
- Experience comes in observation-outcome sequences x_1, x_2, ..., x_m, z
- Each x_t is a vector of real-valued observations at time t
- z is a real-valued outcome of the sequence
- The learner produces a sequence of predictions P_1, P_2, ..., P_m, each trying to estimate z
- Predictions are computed from the observations using a weight vector 'w'
- w is updated iteratively as part of the learning process
- Usually the update procedure for delta_w in supervised learning is:
- delta_w_t = alpha (z - P_t) grad_w P_t, applied as w <- w + sum over t of delta_w_t
- All delta_w_t depend on 'z', which is only known at the end of the sequence
- All P_t (and their gradients grad_w P_t) must be kept in memory until then
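The supervised procedure might be sketched as follows; this is a minimal illustration assuming a linear predictor P_t = w . x_t (so grad_w P_t = x_t) and a made-up step size alpha, not code from the original work:

```python
import numpy as np

def supervised_update(w, xs, z, alpha=0.1):
    """One pass of the supervised rule: delta_w_t = alpha * (z - P_t) * grad_w P_t.
    For a linear predictor P_t = w . x_t, the gradient is just x_t.
    Note: every x_t (and hence every P_t) must be kept until the outcome z arrives."""
    xs = [np.asarray(x, dtype=float) for x in xs]
    delta_w = np.zeros_like(w)
    for x_t in xs:                       # every update needs the final outcome z
        P_t = w @ x_t
        delta_w += alpha * (z - P_t) * x_t
    return w + delta_w
```

The memory cost is visible in the signature: the whole sequence `xs` has to be available before any weight change can be applied.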
TD allows computing delta_w_t without z:
- delta_w_t = alpha (P_{t+1} - P_t) sum over k=1..t of grad_w P_k, with P_{m+1} defined as z
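The incremental TD computation can be sketched like this; again a hypothetical linear predictor and made-up step size, with a running sum `e` that accumulates the past gradients so nothing else needs to be stored:

```python
import numpy as np

def td_updates(w, xs, z, alpha=0.1):
    """TD-style updates: delta_w_t = alpha * (P_{t+1} - P_t) * sum_{k<=t} grad_w P_k.
    The prediction after the last step, P_{m+1}, is defined to be the outcome z,
    so each update uses only two successive predictions -- z is never needed early."""
    xs = [np.asarray(x, dtype=float) for x in xs]
    e = np.zeros_like(w)                 # running sum of gradients (eligibility trace)
    delta_w = np.zeros_like(w)
    preds = [w @ x for x in xs] + [z]    # P_1..P_m, with P_{m+1} := z
    for t, x_t in enumerate(xs):
        e += x_t                         # grad_w P_t = x_t for a linear predictor
        delta_w += alpha * (preds[t + 1] - preds[t]) * e
    return w + delta_w
```

Weighting older gradients by a decay factor lambda^(t-k) instead of summing them uniformly gives the general TD(lambda) family; the sketch above is the lambda = 1 case.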
If P_t is a linear function of x_t and w (so grad_w P_t = x_t), the supervised rule equals the Widrow-Hoff rule:
- delta_w_t = alpha (z - w . x_t) x_t
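A quick numerical check of the claim, on made-up random data: because z - P_t telescopes into the sum of successive differences (P_{k+1} - P_k), the TD updates summed over a sequence produce exactly the same total weight change as the supervised (Widrow-Hoff) updates for a linear predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)
xs = rng.normal(size=(5, 4))              # observation vectors x_1..x_5 (made up)
z = 1.0                                   # outcome of the sequence
alpha = 0.05

preds = list(xs @ w) + [z]                # P_1..P_5, with P_6 := z

# Supervised / Widrow-Hoff: delta_w_t = alpha * (z - P_t) * x_t
sup = sum(alpha * (z - preds[t]) * xs[t] for t in range(len(xs)))

# TD: delta_w_t = alpha * (P_{t+1} - P_t) * sum_{k<=t} x_k
e = np.zeros(4)
td = np.zeros(4)
for t in range(len(xs)):
    e += xs[t]
    td += alpha * (preds[t + 1] - preds[t]) * e

assert np.allclose(sup, td)               # identical total weight change
```

The individual delta_w_t differ between the two rules; only their sums over the whole sequence coincide, which is what lets TD spread the computation over time.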