43 changes: 16 additions & 27 deletions lectures/mccall_q.md
@@ -19,7 +19,6 @@ This lecture illustrates a powerful machine learning technique called Q-learning

{cite}`Sutton_2018` presents Q-learning and a variety of other statistical learning procedures.


The Q-learning algorithm combines ideas from

* dynamic programming
@@ -30,8 +29,8 @@ This lecture applies a Q-learning algorithm to the situation faced by a McCal

Relative to the dynamic programming formulation of the McCall worker model that we studied in {doc}`quantecon lecture <mccall_model>`, a Q-learning algorithm gives the worker less knowledge about

* the random process that generates a sequence of wages
* the reward function that tells consequences of accepting or rejecting a job

The Q-learning algorithm invokes a statistical learning model to learn about these things.

@@ -266,10 +265,11 @@ This definition of $Q(w,a)$ presumes that in subsequent periods the worker take
An optimal Q-function for our McCall worker satisfies

$$
\begin{aligned}
Q\left(w,\text{accept}\right) & =\frac{w}{1-\beta} \\
Q\left(w,\text{reject}\right) & =c+\beta\int\max_{\text{accept, reject}}\left\{ \frac{w'}{1-\beta},Q\left(w',\text{reject}\right)\right\} dF\left(w'\right)
\end{aligned}
$$ (eq:impliedq)


Note that the first equation of system {eq}`eq:impliedq` presumes that after the agent has accepted an offer, he will not have the option to reject that same offer in the future.
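
Before turning to Q-learning itself, it is worth noting that system {eq}`eq:impliedq` can be solved directly by successive approximations once the wage distribution is discretized. The following minimal sketch does this for an illustrative finite wage grid `w_grid` with draw probabilities `q` and placeholder values of `c` and `β`; these are stand-ins, not the lecture's calibration.

```python
import numpy as np

# A minimal sketch: solve (eq:impliedq) by iterating on its second
# equation over a finite wage grid.  The grid, probabilities, c and β
# below are illustrative stand-ins, not the lecture's calibration.
β, c = 0.98, 25.0
n = 50
w_grid = np.linspace(10, 60, n + 1)       # candidate wages w_0, ..., w_n
q = np.ones(n + 1) / (n + 1)              # draw probabilities standing in for F

Q_accept = w_grid / (1 - β)               # first equation of (eq:impliedq) holds exactly
Q_reject = np.zeros(n + 1)                # Q(w, reject) is constant in w

for _ in range(2_000):
    continuation = np.maximum(Q_accept, Q_reject)   # max over actions at each w'
    new_val = c + β * q @ continuation              # second equation of (eq:impliedq)
    err = abs(new_val - Q_reject[0])
    Q_reject[:] = new_val
    if err < 1e-10:
        break

# the reservation wage is the smallest wage at which accepting is optimal
w_bar = w_grid[np.argmax(Q_accept >= Q_reject)]
```

With the Q-table in hand, an optimal policy is simply to accept any wage at or above `w_bar`.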
@@ -352,7 +352,7 @@ $$
\begin{aligned}
w & + \beta \max_{\textrm{accept, reject}} \left\{ Q (w, \textrm{accept}), Q(w, \textrm{reject}) \right\} - Q (w, \textrm{accept}) = 0 \cr
c & +\beta\int\max_{\text{accept, reject}}\left\{ Q(w', \textrm{accept}),Q\left(w',\text{reject}\right)\right\} dF\left(w'\right) - Q\left(w,\text{reject}\right) = 0 \cr
\end{aligned}
$$ (eq:probtosample1)

Notice the integral over $F(w')$ on the second line.
@@ -366,7 +366,7 @@ $$
\begin{aligned}
w & + \beta \max_{\textrm{accept, reject}} \left\{ Q (w, \textrm{accept}), Q(w, \textrm{reject}) \right\} - Q (w, \textrm{accept}) = 0 \cr
c & +\beta \max_{\text{accept, reject}}\left\{ Q(w', \textrm{accept}),Q\left(w',\text{reject}\right)\right\} - Q\left(w,\text{reject}\right) \approx 0 \cr
\end{aligned}
$$(eq:probtosample2)
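
Comparing {eq}`eq:probtosample1` with {eq}`eq:probtosample2`, the integral against $F(w')$ has been replaced by an evaluation at a single draw $w' \sim F$. The sketch below, which reuses the illustrative `w_grid`, `q`, `β`, `c`, `Q_accept`, and `Q_reject` from the earlier sketch, computes the reject-equation residual both ways.

```python
rng = np.random.default_rng(0)

def reject_residual_exact(Q_accept, Q_reject):
    # left side of the second equation of (eq:probtosample1):
    # c + β E max{Q(w', accept), Q(w', reject)} - Q(w, reject)
    return c + β * q @ np.maximum(Q_accept, Q_reject) - Q_reject[0]

def reject_residual_one_draw(Q_accept, Q_reject):
    # the surrogate in (eq:probtosample2): replace the expectation
    # with an evaluation at a single draw w' ~ F
    s_next = rng.choice(n + 1, p=q)
    return c + β * max(Q_accept[s_next], Q_reject[s_next]) - Q_reject[0]
```

Averaging `reject_residual_one_draw` over many draws recovers `reject_residual_exact`, which is the sense in which the single-draw version is an unbiased but noisy stand-in for the integral.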


@@ -387,7 +387,7 @@ $$
\begin{aligned}
w & + \beta \max_{\textrm{accept, reject}} \left\{ \hat Q_t (w_t, \textrm{accept}), \hat Q_t(w_t, \textrm{reject}) \right\} - \hat Q_t(w_t, \textrm{accept}) = \textrm{diff}_{\textrm{accept},t} \cr
c & +\beta\int\max_{\text{accept, reject}}\left\{ \hat Q_t(w_{t+1}, \textrm{accept}),\hat Q_t\left(w_{t+1},\text{reject}\right)\right\} - \hat Q_t\left(w_t,\text{reject}\right) = \textrm{diff}_{\textrm{reject},t} \cr
\end{aligned}
$$ (eq:old105)

The adaptive learning scheme would then be some version of
@@ -401,9 +401,6 @@ to objects in equation system {eq}`eq:old105`.

This informal argument takes us to the threshold of Q-learning.



## Q-Learning

Let's first describe a $Q$-learning algorithm precisely.
@@ -436,10 +433,10 @@ $$ (eq:old3)
where

$$
\begin{aligned}
\widetilde{TD}\left(w,\text{accept}\right) & = \left[ w+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w,a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{accept}\right) \\
\widetilde{TD}\left(w,\text{reject}\right) & = \left[ c+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{reject}\right),\;w'\sim F
\end{aligned}
$$ (eq:old4)

The terms $\widetilde{TD}(w,a) $ for $a = \left\{\textrm{accept,reject} \right\}$ are the **temporal difference errors** that drive the updates.
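
Written in code, one pass of the recursion {eq}`eq:old3` with the errors {eq}`eq:old4` is short. The sketch below stores the Q-table as an $(n+1) \times 2$ array whose columns correspond to accept and reject; it reuses the illustrative `w_grid`, `q`, `β`, `c`, and `rng` from the sketches above, and the step size `α` is likewise a placeholder rather than the lecture's setting.

```python
# One temporal-difference update of (eq:old3)-(eq:old4) on a tabular Q.
# Column 0 is accept, column 1 is reject; α is an illustrative step size.
α = 0.1
Q = np.zeros((n + 1, 2))

def td_update(Q, s):
    """One update at wage index s, drawing w' ~ F for the reject error."""
    s_next = rng.choice(n + 1, p=q)                       # w' ~ F
    TD_accept = w_grid[s] + β * Q[s, :].max() - Q[s, 0]   # first line of (eq:old4)
    TD_reject = c + β * Q[s_next, :].max() - Q[s, 1]      # second line of (eq:old4)
    Q_new = Q.copy()
    Q_new[s, 0] += α * TD_accept                          # recursion (eq:old3)
    Q_new[s, 1] += α * TD_reject
    return Q_new, s_next
```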
@@ -516,9 +513,6 @@ By using the $\epsilon$-greedy method and also by increasing the number of episo

**Remark:** Notice that the $\widetilde{TD}$ associated with an optimal Q-table defined in equation (2) above automatically satisfies $\widetilde{TD}=0$ for all state-action pairs. Whether a limit of our Q-learning algorithm converges to an optimal Q-table depends on whether the algorithm visits all state-action pairs often enough.
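
The exploration needed for that is typically supplied by the $\epsilon$-greedy method mentioned above. A minimal sketch, under the same illustrative conventions as the earlier code (action 0 = accept, action 1 = reject, with ε an arbitrary placeholder value):

```python
def choose_action(Q, s, ε=0.1):
    """ε-greedy: with probability ε pick an action at random (explore),
    otherwise pick the action with the highest current Q value (exploit)."""
    if rng.uniform() < ε:
        return int(rng.integers(2))
    return int(np.argmax(Q[s, :]))
```

Randomizing with probability ε is what keeps the algorithm visiting both actions at every wage often enough for the remark above to have bite.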




We implement this pseudo code in a Python class.

For simplicity and convenience, we let `s` represent the state index between $0$ and $n=50$ and $w_s=w[s]$.
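
As a preview, and under the same illustrative conventions as the sketches above, the kind of episode loop the pseudo code describes can be written as follows; this is a simplification for orientation, not the class implemented below.

```python
def run_episode(Q, T=100, ε=0.1):
    """Sketch of one episode of at most T steps: draw an initial wage index,
    choose actions ε-greedily, and update the Q entry of the chosen action."""
    s = rng.choice(n + 1, p=q)                     # initial wage draw
    for _ in range(T):
        a = choose_action(Q, s, ε)
        s_next = rng.choice(n + 1, p=q)            # w' ~ F
        if a == 0:                                 # accept: reward w[s], wage unchanged
            TD = w_grid[s] + β * Q[s, :].max() - Q[s, 0]
        else:                                      # reject: reward c, move to the draw
            TD = c + β * Q[s_next, :].max() - Q[s, 1]
        Q[s, a] += α * TD                          # in-place version of (eq:old3)
        s = s if a == 0 else s_next
    return Q
```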
@@ -736,19 +730,16 @@ plot_epochs(ns_to_plot=[100, 1000, 10000, 100000, 200000])

The above graphs indicate that

* the Q-learning algorithm has trouble learning the Q-table well for wages that are rarely drawn

* the quality of approximation to the "true" value function computed by value function iteration improves for longer epochs

## Employed Worker Can't Quit


The preceding version of temporal difference Q-learning described in equation system (4) lets an employed worker quit, i.e., reject her wage as an incumbent and instead receive unemployment compensation this period
and draw a new offer next period.


This is an option that the McCall worker described in {doc}`this quantecon lecture <mccall_model>` would not take.

See {cite}`Ljungqvist2012`, chapter 7 on search, for a proof.
@@ -757,20 +748,18 @@ But in the context of Q-learning, giving the worker the option to quit and get u
unemployed turns out to accelerate the learning process by promoting experimentation rather than premature exploitation.


To illustrate this, we'll amend our formulas for temporal differences to forbid an employed worker from quitting a job she had accepted earlier.

With this understanding about available choices, we obtain the following temporal difference values:

$$
\begin{aligned}
\widetilde{TD}\left(w,\text{accept}\right) & = \left[ w+\beta\widetilde{Q}^{old}\left(w,\text{accept}\right) \right]-\widetilde{Q}^{old}\left(w,\text{accept}\right) \\
\widetilde{TD}\left(w,\text{reject}\right) & = \left[ c+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{reject}\right),\;w'\sim F
\end{aligned}
$$ (eq:temp-diff)

It turns out that formulas {eq}`eq:temp-diff` combined with our Q-learning recursion (3) can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.

But learning is slower because an agent who ends up accepting a wage offer prematurely loses the option to explore new states in the same episode and to adjust the value associated with that state.
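
In code, the only change relative to the earlier `td_update` sketch is in the accept branch, where the max over next-period actions disappears because an employed worker must keep her wage; the illustrative grid, probabilities, and step size are reused from above.

```python
def td_update_no_quit(Q, s):
    """One update using (eq:temp-diff): an employed worker cannot quit,
    so the accept branch has no max over next-period actions."""
    s_next = rng.choice(n + 1, p=q)                       # w' ~ F
    TD_accept = w_grid[s] + β * Q[s, 0] - Q[s, 0]         # stays employed at w[s]
    TD_reject = c + β * Q[s_next, :].max() - Q[s, 1]      # unchanged from before
    Q_new = Q.copy()
    Q_new[s, 0] += α * TD_accept
    Q_new[s, 1] += α * TD_reject
    return Q_new, s_next
```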
