
MDP #171

Merged: 5 commits merged into master on Sep 3, 2015

Conversation

@oyamad (Member) commented Aug 24, 2015

I finally finished coding the Markov Decision Process (MDP) class. There are a couple of issues to discuss.

  1. An MDP instance can be created in two ways, in which the transition matrix is

    • a 3-dimensional array defined on S x A x S ("product formulation")
    • a 2-dimensional array defined on SA x S ("state-action pairs formulation")

    where S is the state space, A the action space, and SA the set of feasible state-action pairs. The former should be straightforward; the latter requires additional information about the correspondence between the indices of SA and the actual state-action pairs. See the following example:

    >>> s_indices = [0, 0, 1]  # State indices
    >>> a_indices = [0, 1, 0]  # Action indices
    >>> R = [5, 10, -1]                   # Rewards, one per state-action pair
    >>> Q = [(0.5, 0.5), (0, 1), (0, 1)]  # Transition probabilities, one row per state-action pair
    >>> beta = 0.95
    >>> mdp = MDP(R, Q, beta, s_indices, a_indices)

    Here, S = {0, 1}, A = {0, 1}, and SA = {(0, 0), (0, 1), (1, 0)} (action 1 is infeasible at state 1). The correspondence is represented by s_indices and a_indices. I wonder whether this is an "optimal" implementation strategy.
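    (For comparison, a product-formulation sketch of this same example appears after this list.)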

  2. Complications are basically kept in __init__ (so it is very long), to keep other methods simple and more or less intuitive.

  3. Any suggestions for improvement in particular on variable/method/function names (and of course for other things too) will be appreciated.

  4. I tried to follow the ideas in ENH: make calls to compute_fixed_point allocate less memory #40 and ENH: moved the lucas_tree tuple to a new LucasTree class #41; I wonder if I did it right. The method operator_iteration might be replaced with compute_fixed_point, although the latter does not exactly fit this setting.

  5. Results of solve are stored in MDPSolveResult. Is there other important information to store there?

  6. The policy iteration algorithm is usually initialized with a policy function, while the policy_iteration method here requires an initial value function, to be consistent with the other solution methods.

  7. The class deals only with the infinite-horizon case; beta=1 is not accepted. (Do we want a finite-horizon algorithm?)

  8. I also added random_mdp. I hope this will be useful for performance tests.
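
As referenced in item 1, here is a minimal sketch of the same example in the product formulation. This assumes the constructor also accepts a 2-dimensional R defined on S x A and a 3-dimensional Q defined on S x A x S, with the infeasible pair marked by a reward of -inf (in which case the corresponding row of Q is arbitrary):

    >>> import numpy as np
    >>> R = [[5, 10],
    ...      [-1, -np.inf]]         # -inf marks the infeasible pair (s, a) = (1, 1)
    >>> Q = [[(0.5, 0.5), (0, 1)],
    ...      [(0, 1), (0.5, 0.5)]]  # the row for the infeasible pair is arbitrary
    >>> beta = 0.95
    >>> mdp = MDP(R, Q, beta)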

I also added some notebooks on MDP to QuantEcon.site:

I have some more examples here.

@jstac (Contributor) commented Aug 25, 2015

Wow, this is great!! I have to spend some time with the code, but the implementation looks outstanding.

The notebooks are very helpful too.

This is a really big contribution to open source economics.

@robertdkirkby commented:

Great stuff!

A few points on the 'modified policy iteration' that may or may not be helpful (hopefully they are). I caution that they are all based on my experience with a slightly different implementation (instead of just one Q, my own codes differentiate between endogenous and exogenous states).

From a speed perspective --- and you may already do this; I was not sure from the pseudocode and don't write Python myself, so didn't want to go digging --- once you calculate the policy indexes sigma, rather than calculating r(sigma) at each of your k iterations in step 3, you can calculate it once at the beginning of step 3 as a matrix and then just reuse it for each of the k iterations. In my experience this provides a non-trivial speed boost.

Using an algorithm similar to, but different from, your current algorithm, it is possible to contaminate everything with -Infs (if your final true v has some -Infs and your initial v0 doesn't). This has happened to me in practice. It is easily avoided by also checking that span(u-v) is finite before you do step 4. I'm not certain this will happen with your implementation, but it might be worth checking.

Another thing I would mention, based on my experience with an alternative implementation of your modified policy iteration algorithm: while value function iteration is guaranteed to converge if you set iter_max high enough (thanks to the contraction mapping property), this is not true of the modified policy iteration algorithm if your k is big relative to the grid size (of S and A). This is something I have found by experience. It might therefore be worth adding an internal option so that, after a certain number of tries, if convergence has not yet been reached, the modified policy iteration algorithm will just switch over automatically to doing value function iteration (i.e., you might perhaps run up to iter_max/2 with modified policy iteration, and from there up to iter_max with value function iteration). That way the user can be certain that if they just keep increasing iter_max they will eventually get convergence.

(P.S. I haven't contributed anything to quantecon, but I 'watch' the repo and this is my first comment here.)

@oyamad (Member, Author) commented Aug 25, 2015

Dear @robertdkirkby: Thank you for your comments, and welcome to quantecon!

once you calculate the policy indexes sigma, rather than calculate r(sigma) at each of your k iterations in step 3, you can calculate it once at the beginning of step 3 as a matrix and then just reuse for each of the k iterations.

Let me explain what I do in the code. sigma in step 2 and u in step 3 (as described here) are actually computed simultaneously. This sigma is used to generate the operator T_sigma, with R_sigma and Q_sigma computed once, and the same T_sigma is then applied k times without recomputation, if I understand correctly. But maybe I am missing something important.
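
In NumPy terms, what I have in mind is roughly the following sketch (illustrative names and data, not the actual code in this PR):

    import numpy as np

    # Illustrative product-formulation data (not from the PR): R is (n, m), Q is (n, m, n)
    R = np.array([[5.0, 10.0], [-1.0, -np.inf]])
    Q = np.array([[[0.5, 0.5], [0.0, 1.0]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    beta, k = 0.95, 20
    n = R.shape[0]
    v = np.zeros(n)                          # current value function

    # sigma (step 2) and u = T v (step 3) are computed simultaneously from the same array
    vals = R + beta * Q.dot(v)               # shape (n, m)
    sigma = vals.argmax(axis=1)              # greedy policy
    u = vals[np.arange(n), sigma]            # u = T v

    # R_sigma and Q_sigma are computed once from sigma ...
    R_sigma = R[np.arange(n), sigma]         # shape (n,)
    Q_sigma = Q[np.arange(n), sigma]         # shape (n, n)

    # ... and the same T_sigma is applied k times without recomputing R_sigma or Q_sigma
    for _ in range(k):
        u = R_sigma + beta * Q_sigma.dot(u)  # u <- T_sigma u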

it is possible to contaminate everything with -Infs (if your final true v has some -Infs and your initial v0 doesn't).

This can happen when the input reward is such that for some s, R[s, a] = -inf for all a. I implicitly assumed that the user will not give such data. Maybe we should check that this is not the case.

While the value function iteration is guaranteed to converge if you set iter_max high enough (thanks to contraction mapping property) this is not true of the modified policy iteration algorithm if your k is big relative to grid size (of S and A).

Thank you for reminding me of this; I had forgotten that in the proof of convergence of modified policy iteration, I needed the assumption that v^0 be such that T v^0 >= v^0 (otherwise the monotonicity of T and T_sigma would have no bite). This condition is satisfied in particular when v^0 = v_sigma for some feasible policy sigma (for example the policy that maximizes the one-shot reward function), or when v^0 is so small that v^0 <= min_(s, a) r(s, a) / (1 - beta). I think I will change the default v_init (when it is not supplied) to one of these, and correct the notebook. Thanks!

@oyamad (Member, Author) commented Aug 25, 2015

I chose to set v_init for modified PI to min_(s, a) r(s, a) / (1 - beta).
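
Concretely, a minimal sketch of that default, assuming the product formulation and that -inf entries (marking infeasible pairs) are excluded from the minimum:

    import numpy as np

    def default_v_init(R, beta):
        # v_init = min_{(s, a)} r(s, a) / (1 - beta); R has shape (n, m), and -inf
        # entries (assumed to mark infeasible pairs) are excluded from the minimum
        R = np.asarray(R, dtype=float)
        r_min = R[R > -np.inf].min()
        return np.full(R.shape[0], r_min / (1 - beta))

With this choice we have T v_init >= v_init, so the monotone convergence argument discussed above applies.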

Any suggestions are welcome for this and for the other methods as well.

@mmcky (Contributor) commented Aug 25, 2015

Due to the util changes made to close #165, there are some small merge conflicts that will need to be fixed in this branch prior to merging.

@robertdkirkby commented:

Thanks @oyamad

  1. Sounds like you already have this covered.
  2. I would strongly advocate that you allow r() to take -Inf values. Below are three examples of why it would be useful (the second is really just an explicit example of the third). Before the examples, though, I would highlight that it is quite easy to do. The value function iteration requires no modification whatsoever. The policy iteration method could check for r()=-Inf and throw an error saying this is not allowed with this solution method. The modified policy iteration method requires only the minor modification mentioned previously, checking that span(v-u) is finite before doing Step 4; given that you already calculate span(v-u), this will cost you essentially nothing in terms of run time. In summary, I would argue that it is both very useful in many applications to let r() take -Inf values, and easy enough to implement in the codes.
    a. Solving a stochastic growth model where util(consumption=0)=-Inf, while allowing for the state-action of capital = 0 (if capital equals zero, then output is zero, and so consumption must be zero, giving a return of -Inf). In some sense this is just a 'mistake' by the user, who should 'know' to rule out capital = zero, but I suspect it is a 'mistake' users would often make.
    b. Implementing a stochastic growth model with putty-clay investment, so that one cannot choose (an action of) a capital level (k') less than the current capital level (k) net of depreciation (delta). If you allow r() to take -Inf values this is easy: just make r()=-Inf for all actions which would imply k'<(1-delta)k (see the sketch after this list).
    c. r()=-Inf values are very useful for implementing constraints. In many economic applications the return/gain r() is a combination of a utility function u() and some (e.g. budget) constraints. If r()=-Inf is allowed, then you can just assign a -Inf value to any breach of the constraints and the rest of the MDP works just as normal.
    [In the language/notation of Stokey, Lucas & Prescott (1989), allowing r() to take -Inf values is an easy way of implementing a feasible policy correspondence, Gamma, that depends on the current state (endog & exog states).]
    [Update: It occurs to me that you may have some way of imposing constraints by use of zeros in Q(), but it is not clear to me how exactly this would work in terms of interacting with r() and ensuring that such actions would never be optimal.]
  3. Do you have a copy of the proof handy? I hadn't heard of this result before and would be interested in seeing it.
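
To make 2b concrete, here is a rough sketch with an illustrative capital grid (I don't write Python, so treat it as pseudocode):

    import numpy as np

    # Putty-clay constraint from 2b: actions with k' < (1 - delta) * k are infeasible
    k_grid = np.linspace(0.0, 10.0, 5)   # capital grid; states and actions both index into it
    delta = 0.1
    n = m = len(k_grid)

    R = np.ones((n, m))                  # placeholder finite rewards r(k, k')

    # Impose the constraint by assigning -Inf to the infeasible state-action pairs
    k = k_grid[:, np.newaxis]            # current capital, shape (n, 1)
    k_next = k_grid[np.newaxis, :]       # chosen capital,  shape (1, m)
    R[k_next < (1 - delta) * k] = -np.inf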

@oyamad (Member, Author) commented Aug 26, 2015

Thanks @robertdkirkby

The "product formulation" assumes that constraints are expressed by setting r(s, a) = -inf for infeasible (s, a)'s. This works with no problem as long as for all states s, r(s, a) > -inf for some action a. I think this accommodates your point c.

I am afraid it would be problematic if we have some s such that r(s, a) = -inf for all a. Mathematically, even if the optimal value function is well defined, the optimal policy is not. If the values are all -inf, NumPy returns -inf as the max and the first index that attains the max, namely index 0 in this case, as the argmax. In the case u(0) = -inf and the current capital is 0, zero consumption yields -inf by assumption, whereas positive consumption also yields -inf for another reason, namely infeasibility. Note that the assignment of indices to actions is arbitrary: if index 0 were assigned to some positive consumption and 1 to zero consumption, then the argmax would still be 0. So I think the user should be advised to assign reward -inf only to infeasible actions (and not to include states at which all actions are infeasible).
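
To illustrate the NumPy behavior (a quick check at the interpreter):

    >>> import numpy as np
    >>> r = np.array([-np.inf, -np.inf])  # all actions yield reward -inf at this state
    >>> print(r.max(), r.argmax())        # max is -inf; argmax is the first index attaining it
    -inf 0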

This is not very difficult. It is similar to the proof of convergence of policy iteration.
Let v^0, v^1, v^2, ... and sigma1, sigma2, ... be the sequences of values and policies obtained by modified policy iteration.
If v^0 <= T v^0, then we have T v^0 <= v^1 <= T v^1 (and so on).
Let me type the proof here:

v^0 <= T v^0 (by assumption)
    =  T_sigma1 v^0 (by the definition of sigma1)
    <= T_sigma1^2 v^0 (by the monotonicity of T_sigma1)
    ...
    <= T_sigma1^k v^0
    <= T_sigma1^(k+1) v^0
    =  v^1 (by the definition of v^1);

from the inequality T_sigma1^k v^0 <= v^1, we have

T_sigma1^(k+1) v^0 <= T_sigma1 v^1 (by the monotonicity of T_sigma1)
                   <= T v^1 (since T_sigma v <= T v for any sigma and v),

so v^1 = T_sigma1^(k+1) v^0 <= T v^1.

By this, we have T^i v^0 <= v^i for all i (by the monotonicity of T). It is clear that v^i <= v^*. So, if v^0 <= T v^0, then the sequence converges monotonically to v^* from below.

@robertdkirkby commented:

  1. It does cover my c. I missed the 'all' in 'for some s, all a' in your earlier reply. I now agree with your earlier statement that it might be worth checking for, as I suspect many beginner users will make the kind of mistake described in the u(c=0)=-inf example with capital=0 included.
  2. Thanks.

@oyamad (Member, Author) commented Aug 27, 2015

The description of the algorithms in the notebook was misleading because of iter_max; with it, finite termination is trivial (the problem was that I tried to describe the algorithms and the implementation at the same time). I removed the termination condition with iter_max from the description.

@oyamad (Member, Author) commented Aug 28, 2015

I rebased and force-pushed.

@oyamad (Member, Author) commented Aug 30, 2015

As another demo, I used MDP to solve the career-job model:

A remark:
In using compute_fixed_point in the Solutions, the default max_iter=50 is used. This cap is binding, and it makes a significant difference, in particular in the case of beta = 0.99. In fact, the median first passage time in this case seems to be about 14 with a large enough number of iterations (or with policy iteration), whereas the Solutions notebook claims that it is about 11.
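
For instance, the cap can be lifted by passing max_iter explicitly to compute_fixed_point. A toy sketch of the call (the operator T below is only a stand-in, not the career-job operator):

    import numpy as np
    import quantecon as qe

    # Toy contraction used only to illustrate the call signature
    T = lambda v: 0.95 * v + 1.0
    v = np.zeros(1)

    # Raise max_iter above the default of 50 so the tolerance, not the iteration cap, binds
    v_star = qe.compute_fixed_point(T, v, error_tol=1e-5, max_iter=1000)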

@oyamad (Member, Author) commented Sep 2, 2015

Following the suggestion from @robertdkirkby, I added a method that checks that every state has at least one action with finite reward. It also checks, in the case of state-action pairs, that for every state there is at least one available action.
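
Roughly, the checks look like this (a sketch with a hypothetical function name; the actual method may differ in its details):

    import numpy as np

    def check_feasibility(R, s_indices=None):
        # Sketch of the added checks; not the exact code in this PR
        R = np.asarray(R)
        if s_indices is None:
            # Product formulation: R has shape (n, m); every row needs a finite entry
            if not (R > -np.inf).any(axis=1).all():
                raise ValueError("every state must have at least one action with finite reward")
        else:
            # sa-pair formulation: R has length L and s_indices maps each row to its state
            s_indices = np.asarray(s_indices)
            n = s_indices.max() + 1
            if len(np.unique(s_indices)) < n:
                raise ValueError("every state must have at least one available action")
            for s in range(n):
                if not (R[s_indices == s] > -np.inf).any():
                    raise ValueError("every state must have at least one action with finite reward")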

- markov.MDP
- markov.random_mdp
- Checks: every state has at least one action with finite reward; for the sa-pair formulation, every state has at least one available action
- following sparse matrix support in MarkovChain
@oyamad (Member, Author) commented Sep 3, 2015

Following a discussion with @jstac, I renamed MDP to DiscreteDP. I hope I have made all the relevant changes properly.

@jstac (Contributor) commented Sep 3, 2015

Everything looks great. Thanks again! I'll merge now and start writing the lecture.

jstac added a commit that referenced this pull request on Sep 3, 2015.
jstac merged commit e14c863 into master and deleted the mdp branch on Sep 3, 2015.