Learning to Predict by Temporal Differences


Key ideas

  • Use past experience with an incompletely known system to predict its future behavior
  • Assign credit via the differences between temporally successive predictions
  • Good for pattern-recognition problems
  • An iterative, low-cost alternative to supervised learning, shown to be equally effective or better

Temporal-difference and supervised learning approaches

  • In supervised learning (historically the more popular approach)
    • The learner learns to associate pairs (input -> expected output)
  • In temporal-difference learning
    • The learner updates from the differences between its own successive predictions, rather than waiting for the final outcome

Single vs multi step prediction

  • Single-step: all information about correctness is revealed at once (e.g., tomorrow's weather)
  • Multi-step: correctness is revealed only several steps after the prediction is made (e.g., chess moves, whose quality is only known at the end of the game)
  • Real-world problems tend to be multi-step

Computational improvements

  • Experience comes in observation-outcome sequences x_1, x_2, ..., x_m, z
    • x_t is a vector of real-valued observations at time t
    • z is the outcome of the sequence
  • For each sequence the learner produces predictions P_1, P_2, ..., P_m, each an estimate of z
    • Predictions are computed from x_t and a vector of modifiable weights w
  • Learning means updating w: each step yields an increment Δw_t, and w is updated by the accumulated sum of these deltas
  • In supervised learning the usual update procedure is Δw_t = α (z − P_t) ∇_w P_t, where α is the learning rate and ∇_w P_t is the gradient of P_t with respect to w (see the sketch below)
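A minimal sketch of this supervised procedure, assuming linear predictions P_t = w·x_t (so ∇_w P_t = x_t); the toy sequence is made up for illustration:

```python
import numpy as np

def supervised_updates(xs, z, w, alpha=0.1):
    """Accumulate Δw_t = α (z - P_t) ∇_w P_t over one sequence.

    Assumes linear predictions P_t = w·x_t, so ∇_w P_t = x_t.
    Every increment depends on z, so nothing can be applied until
    the sequence ends -- this is "the problem" described below.
    """
    delta_w = np.zeros_like(w)
    for x_t in xs:
        p_t = w @ x_t                       # prediction P_t
        delta_w += alpha * (z - p_t) * x_t  # Δw_t
    return w + delta_w

# Toy sequence: three observation vectors, outcome z = 1.
xs = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
print(supervised_updates(xs, z=1.0, w=np.zeros(2)))
```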

The problem

  • Every Δw_t depends on z, which is only known at the end of the sequence
  • All the P_t (and their gradients) must therefore be kept in memory until then

TD(λ)

  • TD allows Δw_t to be computed incrementally, without waiting for z:

    • The prediction error z − P_t can be written as a telescoping sum of successive prediction differences: z − P_t = Σ_{k=t}^{m} (P_{k+1} − P_k), where P_{m+1} = z
    • Rewriting the supervised update in these terms, and weighting past gradients by recency, gives the TD(λ) update: Δw_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} λ^(t−k) ∇_w P_k
    • This can be computed at every step: there is no need to remember P_1..P_t (the weighted sum of gradients can be maintained incrementally), and the computation is spread evenly over the sequence (see the sketch below)
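A minimal sketch of the incremental computation, again assuming linear predictions P_t = w·x_t; the weighted sum Σ λ^(t−k) ∇_w P_k is maintained as a running "eligibility trace" e_t = λ e_{t−1} + x_t:

```python
import numpy as np

def td_lambda_updates(xs, z, w, alpha=0.1, lam=0.7):
    """Accumulate Δw_t = α (P_{t+1} - P_t) Σ_k λ^(t-k) ∇_w P_k.

    Linear predictions P_t = w·x_t are assumed, so ∇_w P_t = x_t.
    The weighted sum of past gradients is kept in the trace e,
    so no past predictions or gradients need to be stored.
    """
    delta_w = np.zeros_like(w)
    e = np.zeros_like(w)                  # Σ_{k<=t} λ^(t-k) x_k
    preds = [w @ x for x in xs] + [z]     # P_1..P_m, with P_{m+1} = z
    for t, x_t in enumerate(xs):
        e = lam * e + x_t                 # update the trace
        delta_w += alpha * (preds[t + 1] - preds[t]) * e
    return w + delta_w
```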
  • With λ = 1 and P_t a linear function of x_t and w, this produces the same weight changes as the Widrow-Hoff rule (see the check below):

    • Δw_t = α (z − wᵀx_t) x_t
    • z − wᵀx_t is the error between the prediction and reality (z)
    • x_t multiplies the error to determine how each weight changes
      • If the error is positive, adding a multiple of x_t to w increases wᵀx_t and therefore reduces the error
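A quick numeric check of that equivalence, reusing the two sketch functions above on the same toy sequence:

```python
# With λ = 1, TD(λ) and the supervised (Widrow-Hoff) procedure
# produce the same total weight change over a sequence.
xs = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
w_sup = supervised_updates(xs, z=1.0, w=np.zeros(2))
w_td1 = td_lambda_updates(xs, z=1.0, w=np.zeros(2), lam=1.0)
print(np.allclose(w_sup, w_td1))  # True
```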

Propagation

  • When P_t is not a linear function of x_t and w, it can be computed by a neural network, with ∇_w P_t obtained by backpropagation
  • λ then acts as a recency weighting: credit for the error P_{t+1} − P_t is passed back to the prediction made k steps earlier with weight λ^k
    • TD(1) weights all past predictions equally, recovering the supervised (Widrow-Hoff) behavior
    • TD(0) assigns credit only to the most recent prediction, reducing the update to Δw_t = α (P_{t+1} − P_t) ∇_w P_t (see the sketch below)
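A minimal sketch of the nonlinear case, using TD(0) with a single-sigmoid predictor P_t = σ(w·x_t) as a stand-in for a full network (whose gradient would instead come from backpropagation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def td0_updates_sigmoid(xs, z, w, alpha=0.5):
    """TD(0) for a nonlinear predictor P_t = σ(w·x_t).

    Here ∇_w P_t = P_t (1 - P_t) x_t in closed form; a multilayer
    network would compute this gradient by backpropagation.
    """
    delta_w = np.zeros_like(w)
    preds = [sigmoid(w @ x) for x in xs] + [z]    # P_{m+1} = z
    for t, x_t in enumerate(xs):
        grad = preds[t] * (1.0 - preds[t]) * x_t  # ∇_w P_t
        delta_w += alpha * (preds[t + 1] - preds[t]) * grad
    return w + delta_w
```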