
MDP #171

Merged: 5 commits merged into master on Sep 3, 2015

Conversation

@oyamad (Member) commented Aug 24, 2015

I finally finished coding the Markov Decision Process (MDP) class. There are a couple of issues to discuss.

  1. An MDP instance can be created in two ways, in which the transition matrix is

    • a 3-dimensional array defined on S x A x S ("product formulation")
    • a 2-dimensional array defined on SA x S ("state-action pairs formulation")

    where S is the state space, A the action space, and SA the set of feasible state-action pairs. The former should be straightforward; the latter requires additional information about the correspondence between the indices of SA and the actual state-action pairs. See the following example:

    >>> s_indices = [0, 0, 1]  # State indices
    >>> a_indices = [0, 1, 0]  # Action indices
    >>> R = [5, 10, -1]                   # Rewards, one per state-action pair
    >>> Q = [(0.5, 0.5), (0, 1), (0, 1)]  # Transition probabilities, one row per state-action pair
    >>> beta = 0.95
    >>> mdp = MDP(R, Q, beta, s_indices, a_indices)

    Here, S = {0, 1}, A = {0, 1}, and SA = {(0, 0), (0, 1), (1, 0)} (action 1 is infeasible at state 1). The correspondence is represented by s_indices and a_indices. I wonder whether this is an "optimal" implementation strategy.
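    (For comparison, a product-formulation sketch of this same example appears after this list.)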

  2. Complications are basically kept in __init__ (so it is very long), to keep other methods simple and more or less intuitive.

  3. Any suggestions for improvement in particular on variable/method/function names (and of course for other things too) will be appreciated.

  4. I tried to follow the ideas in ENH: make calls to compute_fixed_point allocate less memory #40 and ENH: moved the lucas_tree tuple to a new LucasTree class #41; I wonder if I did it right. The method operator_iteration might be replaced with compute_fixed_point, although the latter does not exactly fit this setting.

  5. Results of solve are stored in MDPSolveResult. Is there other important information to store there?

  6. The policy iteration algorithm is usually initialized with a policy function, while the policy_iteration method here requires an initial value function, to be consistent with the other solution methods.

  7. The class deals only with the infinite-horizon case; beta=1 is not accepted. (Do we want a finite-horizon algorithm?)

  8. I also added random_mdp. I hope this will be useful for performance tests.
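
As referenced in item 1, here is a minimal sketch of the same example in the product formulation. This assumes the constructor also accepts a 2-dimensional R defined on S x A and a 3-dimensional Q defined on S x A x S, with the infeasible pair marked by a reward of -inf (in which case the corresponding row of Q is arbitrary):

    >>> import numpy as np
    >>> R = [[5, 10],
    ...      [-1, -np.inf]]         # -inf marks the infeasible pair (s, a) = (1, 1)
    >>> Q = [[(0.5, 0.5), (0, 1)],
    ...      [(0, 1), (0.5, 0.5)]]  # the row for the infeasible pair is arbitrary
    >>> beta = 0.95
    >>> mdp = MDP(R, Q, beta)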

I also added some notebooks on MDP to QuantEcon.site:

I have some more examples here.

@jstac (Contributor) commented Aug 25, 2015

Wow, this is great!! I have to spend some time with the code, but the implementation looks outstanding.

The notebooks are very helpful too.

This is a really big contribution to open source economics.

@robertdkirkby commented:

Great stuff!

A few points on the 'modified policy iteration' that may or may not be helpful (hopefully they are). I caution that they are all based on my experience with a slightly different implementation (instead of just one Q, my own codes differentiate between endogenous and exogenous states).

From a speed perspective --- and you may already do this; I was not sure from the pseudocode and don't write Python myself, so didn't want to go digging --- once you calculate the policy indexes sigma, rather than calculating r(sigma) at each of your k iterations in step 3, you can calculate it once at the beginning of step 3 as a matrix and then just reuse it for each of the k iterations. In my experience this provides a non-trivial speed boost.

Using an algorithm similar to, but different from, your current algorithm, it is possible to contaminate everything with -Infs (if your final true v has some -Infs and your initial v0 doesn't). This has happened to me in practice. It is easily avoided by also checking that span(u-v) is finite before you do step 4. I'm not certain this will happen with your implementation, but it might be worth checking.

Another thing I would mention, based on my experience with an alternative implementation of your modified policy iteration algorithm: while value function iteration is guaranteed to converge if you set iter_max high enough (thanks to the contraction mapping property), this is not true of the modified policy iteration algorithm if your k is big relative to the grid size (of S and A). This is something I have found by experience. It might therefore be worth adding an internal option so that, after a certain number of tries, if convergence has not yet been reached, the modified policy iteration algorithm will just switch over automatically to doing value function iteration (i.e., you might perhaps run up to iter_max/2 with modified policy iteration, and from there up to iter_max with value function iteration). That way the user can be certain that if they just keep increasing iter_max they will eventually get convergence.

(P.S. I haven't contributed anything to quantecon, but I 'watch' the repo and this is my first comment here.)

@oyamad (Member, Author) commented Aug 25, 2015

Dear @robertdkirkby: Thank you for your comments, and welcome to quantecon!

once you calculate the policy indexes sigma, rather than calculate r(sigma) at each of your k iterations in step 3, you can calculate it once at the beginning of step 3 as a matrix and then just reuse for each of the k iterations.

Let me explain what I do in the code. sigma in step 2 and u in step 3 (as described here) are actually computed simultaneously. This sigma is used to generate the operator T_sigma, with R_sigma and Q_sigma computed once, and the same T_sigma is then applied k times without recomputation, if I understand correctly. But maybe I am missing something important.
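
In NumPy terms, what I have in mind is roughly the following sketch (illustrative names and data, not the actual code in this PR):

    import numpy as np

    # Illustrative product-formulation data (not from the PR): R is (n, m), Q is (n, m, n)
    R = np.array([[5.0, 10.0], [-1.0, -np.inf]])
    Q = np.array([[[0.5, 0.5], [0.0, 1.0]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    beta, k = 0.95, 20
    n = R.shape[0]
    v = np.zeros(n)                          # current value function

    # sigma (step 2) and u = T v (step 3) are computed simultaneously from the same array
    vals = R + beta * Q.dot(v)               # shape (n, m)
    sigma = vals.argmax(axis=1)              # greedy policy
    u = vals[np.arange(n), sigma]            # u = T v

    # R_sigma and Q_sigma are computed once from sigma ...
    R_sigma = R[np.arange(n), sigma]         # shape (n,)
    Q_sigma = Q[np.arange(n), sigma]         # shape (n, n)

    # ... and the same T_sigma is applied k times without recomputing R_sigma or Q_sigma
    for _ in range(k):
        u = R_sigma + beta * Q_sigma.dot(u)  # u <- T_sigma u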

it is possible to contaminate everything with -Infs (if your final true v has some -Infs and your initial v0 doesn't).

This can happen when the input reward is such that for some s, R[s, a] = -inf for all a. I implicitly assumed that the user will not give such data. Maybe we should check that this is not the case.

While the value function iteration is guaranteed to converge if you set iter_max high enough (thanks to contraction mapping property) this is not true of the modified policy iteration algorithm if your k is big relative to grid size (of S and A).

Thank you for reminding me of this; I had forgotten that in the proof of convergence of modified policy iteration, I needed the assumption that v^0 be such that T v^0 >= v^0 (otherwise the monotonicity of T and T_sigma would have no bite). This condition is satisfied in particular when v^0 = v_sigma for some feasible policy sigma (for example the policy that maximizes the one-shot reward function), or when v^0 is so small that v^0 <= min_(s, a) r(s, a) / (1 - beta). I think I will change the default v_init (when it is not supplied) to one of these, and correct the notebook. Thanks!

@oyamad (Member, Author) commented Aug 25, 2015

I chose to set v_init for modified PI to min_(s, a) r(s, a) / (1 - beta).
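
Concretely, a minimal sketch of that default, assuming the product formulation and that -inf entries (marking infeasible pairs) are excluded from the minimum:

    import numpy as np

    def default_v_init(R, beta):
        # v_init = min_{(s, a)} r(s, a) / (1 - beta); R has shape (n, m), and -inf
        # entries (assumed to mark infeasible pairs) are excluded from the minimum
        R = np.asarray(R, dtype=float)
        r_min = R[R > -np.inf].min()
        return np.full(R.shape[0], r_min / (1 - beta))

With this choice we have T v_init >= v_init, so the monotone convergence argument discussed above applies.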

Any suggestions are welcome for this and for the other methods as well.

@mmcky (Contributor) commented Aug 25, 2015

Due to the util changes made to close #165, there are some small merge conflicts that will need to be fixed in this branch prior to merging.

@robertdkirkby commented:

Thanks @oyamad

  1. Sounds like you already have this covered.
  2. I would strongly advocate that you allow r() to take -Inf values. Below are three examples of why it would be useful (the second is really just an explicit example of the third). Before the examples, though, I would highlight that it is quite easy to do. The value function iteration requires no modification whatsoever. The policy iteration method could check for r()=-Inf and throw an error saying this is not allowed with this solution method. The modified policy iteration method requires only the minor modification mentioned previously, checking that span(v-u) is finite before doing Step 4; given that you already calculate span(v-u), this will cost you essentially nothing in terms of run time. In summary, I would argue that it is both very useful in many applications to let r() take -Inf values, and easy enough to implement in the codes.
    a. Solving a stochastic growth model where util(consumption=0)=-Inf, while allowing for the state-action of capital = 0 (if capital equals zero, then output is zero, and so consumption must be zero, giving a return of -Inf). In some sense this is just a 'mistake' by the user, who should 'know' to rule out capital = zero, but I suspect it is a 'mistake' users would often make.
    b. Implementing a stochastic growth model with putty-clay investment, so that one cannot choose (an action of) a capital level (k') less than the current capital level (k) net of depreciation (delta). If you allow r() to take -Inf values this is easy: just make r()=-Inf for all actions which would imply k'<(1-delta)k (see the sketch after this list).
    c. r()=-Inf values are very useful for implementing constraints. In many economic applications the return/gain r() is a combination of a utility function u() and some (e.g. budget) constraints. If r()=-Inf is allowed, then you can just assign a -Inf value to any breach of the constraints and the rest of the MDP works just as normal.
    [In the language/notation of Stokey, Lucas & Prescott (1989), allowing r() to take -Inf values is an easy way of implementing a feasible policy correspondence, Gamma, that depends on the current state (endog & exog states).]
    [Update: It occurs to me that you may have some way of imposing constraints by use of zeros in Q(), but it is not clear to me how exactly this would work in terms of interacting with r() and ensuring that such actions would never be optimal.]
  3. Do you have a copy of the proof handy? I hadn't heard of this result before and would be interested in seeing it.
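
To make 2b concrete, here is a rough sketch with an illustrative capital grid (I don't write Python, so treat it as pseudocode):

    import numpy as np

    # Putty-clay constraint from 2b: actions with k' < (1 - delta) * k are infeasible
    k_grid = np.linspace(0.0, 10.0, 5)   # capital grid; states and actions both index into it
    delta = 0.1
    n = m = len(k_grid)

    R = np.ones((n, m))                  # placeholder finite rewards r(k, k')

    # Impose the constraint by assigning -Inf to the infeasible state-action pairs
    k = k_grid[:, np.newaxis]            # current capital, shape (n, 1)
    k_next = k_grid[np.newaxis, :]       # chosen capital,  shape (1, m)
    R[k_next < (1 - delta) * k] = -np.inf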

@oyamad (Member, Author) commented Aug 26, 2015

Thanks @robertdkirkby

The "product formulation" assumes that constraints are expressed by setting r(s, a) = -inf for infeasible (s, a)'s. This works with no problem as long as for all states s, r(s, a) > -inf for some action a. I think this accommodates your point c.

I am afraid it would be problematic if we have some s such that r(s, a) = -inf for all a. Mathematically, even if the optimal value function is well defined, the optimal policy is not. If the values are all -inf, NumPy returns -inf as the max and the first index that attains the max, namely index 0 in this case, as the argmax. In the case u(0) = -inf and the current capital is 0, zero consumption yields -inf by assumption, whereas positive consumption also yields -inf for another reason, namely infeasibility. Note that the assignment of indices to actions is arbitrary: if index 0 were assigned to some positive consumption and 1 to zero consumption, then the argmax would still be 0. So I think the user should be advised to assign reward -inf only to infeasible actions (and not to include states at which all actions are infeasible).
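
To illustrate the NumPy behavior (a quick check at the interpreter):

    >>> import numpy as np
    >>> r = np.array([-np.inf, -np.inf])  # all actions yield reward -inf at this state
    >>> print(r.max(), r.argmax())        # max is -inf; argmax is the first index attaining it
    -inf 0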

This is not very difficult. It is similar to the proof of convergence of policy iteration.
Let v^0, v^1, v^2, ... and sigma1, sigma2, ... be the sequences of values and policies obtained by modified policy iteration.
If v^0 <= T v^0, then we have T v^0 <= v^1 <= T v^1 (and so on).
Let me type the proof here:

v^0 <= T v^0 (by assumption)
    =  T_sigma1 v^0 (by the definition of sigma1)
    <= T_sigma1^2 v^0 (by the monotonicity of T_sigma1)
    ...
    <= T_sigma1^k v^0
    <= T_sigma1^(k+1) v^0
    =  v^1 (by the definition of v^1);

from the inequality T_sigma1^k v^0 <= v^1, we have

T_sigma1^(k+1) v^0 <= T_sigma1 v^1 (by the monotonicity of T_sigma1)
                   <= T v^1 (since T_sigma v <= T v for any sigma and v),

so v^1 = T_sigma1^(k+1) v^0 <= T v^1.

By this, we have T^i v^0 <= v^i for all i (by the monotonicity of T). It is clear that v^i <= v^*. So, if v^0 <= T v^0, then the sequence converges monotonically to v^* from below.

@robertdkirkby commented:

  1. It does cover my c. I missed the 'all' in 'for some s, all a' in your earlier reply. I now agree with your earlier statement that it might be worth checking for, as I suspect many beginner users will make the kind of mistake described in the u(c=0)=-inf example with capital=0 included.
  2. Thanks.

@oyamad (Member, Author) commented Aug 27, 2015

The description of the algorithms in the notebook was misleading because of iter_max; with it, finite termination is trivial (the problem was that I tried to describe the algorithms and the implementation at the same time). I removed the termination condition with iter_max from the description.

@oyamad (Member, Author) commented Aug 28, 2015

I rebased and force-pushed.

@oyamad (Member, Author) commented Aug 30, 2015

As another demo, I used MDP to solve the career-job model:

A remark:
In using compute_fixed_point in the Solutions, the default max_iter=50 is used. This cap is binding, and it makes a significant difference, in particular in the case of beta = 0.99. In fact, the median first passage time in this case seems to be about 14 with a large enough number of iterations (or with policy iteration), whereas the Solutions notebook claims that it is about 11.
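
For instance, the cap can be lifted by passing max_iter explicitly to compute_fixed_point. A toy sketch of the call (the operator T below is only a stand-in, not the career-job operator):

    import numpy as np
    import quantecon as qe

    # Toy contraction used only to illustrate the call signature
    T = lambda v: 0.95 * v + 1.0
    v = np.zeros(1)

    # Raise max_iter above the default of 50 so the tolerance, not the iteration cap, binds
    v_star = qe.compute_fixed_point(T, v, error_tol=1e-5, max_iter=1000)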

@oyamad (Member, Author) commented Sep 2, 2015

Following the suggestion from @robertdkirkby, I added a method that checks that every state has at least one action with finite reward. It also checks, in the case of state-action pairs, that for every state there is at least one available action.
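
Roughly, the checks look like this (a sketch with a hypothetical function name; the actual method may differ in its details):

    import numpy as np

    def check_feasibility(R, s_indices=None):
        # Sketch of the added checks; not the exact code in this PR
        R = np.asarray(R)
        if s_indices is None:
            # Product formulation: R has shape (n, m); every row needs a finite entry
            if not (R > -np.inf).any(axis=1).all():
                raise ValueError("every state must have at least one action with finite reward")
        else:
            # sa-pair formulation: R has length L and s_indices maps each row to its state
            s_indices = np.asarray(s_indices)
            n = s_indices.max() + 1
            if len(np.unique(s_indices)) < n:
                raise ValueError("every state must have at least one available action")
            for s in range(n):
                if not (R[s_indices == s] > -np.inf).any():
                    raise ValueError("every state must have at least one action with finite reward")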

- markov.MDP
- markov.random_mdp
- Checks: every state has at least one action with finite reward; for the sa-pair formulation, every state has at least one available action
- following sparse matrix support in MarkovChain
@oyamad (Member, Author) commented Sep 3, 2015

Following a discussion with @jstac, I renamed MDP to DiscreteDP. I hope I have made all the relevant changes properly.

@jstac (Contributor) commented Sep 3, 2015

Everything looks great. Thanks again! I'll merge now and start writing the lecture.

jstac added a commit that referenced this pull request on Sep 3, 2015.
jstac merged commit e14c863 into master and deleted the mdp branch on Sep 3, 2015.