<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>RLAI open web page</title>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">
<meta http-equiv="Cache-control" content="no-cache">
<link href="/style.css" rel="stylesheet" type="text/css">
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<style type="text/css">
<!--
p {font-family: Verdana, Trebuchet, Arial, Palatino, sans-serif; font-size: 14px}
code {font-family: Monaco, courier}
pre {font-family: Monaco, courier}
sub {font-size: 80%}
-->
</style>
</head>
<body style="background-color: rgb(255, 255, 255);">
<table style="width: 100%;" border="0" cellpadding="3" cellspacing="0">
<tbody>
<tr>
<td rowspan="2" bgcolor="#ffffff" valign="top" width="1%"> <a
href="http://rlai.cs.ualberta.ca"><img src="/rl-logo.jpg" alt="Home"
border="0"></a></td>
<td style="vertical-align: top;"><span
style="color: rgb(102, 102, 102);">Reinforcement Learning and
Artificial
Intelligence (RLAI)</span><br>
</td>
</tr>
<tr>
<td style="vertical-align: bottom;"> <span
style="font-family: sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 24px; line-height: normal; font-size-adjust: none;">Reinforcement
learning interface documentation (python) version 5</span> </td>
</tr>
</tbody>
</table>
<strong><span style="font-weight: normal;"><span
style="color: rgb(102, 102, 102);"><br>
</span></span></strong>The <span style="font-weight: bold;">ambition</span>
of this web
page is to fully describe how to use the Python module defining a
standard reinforcement learning interface. We describe 1) how to
construct an interface object for a given agent and environment, 2) the
inputs and outputs of the interface object, and 3) the inputs and
outputs of the functions (procedures) defining the agent and
environment. Not covered are the internal workings of the interface
object or of any particular agent or environment.<br>
<hr>
<p>The RLI (Reinforcement Learning Interface) module provides a
standard interface for computational experiments with
reinforcement-learning agents and environments. The interface is
designed to facilitate comparison of different agent designs and their
application to different problems (environments). This documentation
presents the general ideas of the interface and a few examples of its
use. Following that, the source code for the <code>RLinterface</code>
class and its methods (<code>episode</code>, <code>steps</code>,
<code>episodes</code>, and their variants) is linked to answer any remaining questions. </p>
<p>An <code>RLinterface</code> is a Python object, created by calling <code>RLinterface(agentFunction,
environmentFunction)</code>. The <code>agentFunction</code> and <code>environmentFunction</code>
define the agent and environment that will participate in the
interface. There will be libraries of standard <code>agentFunction</code>'s
and <code>environmentFunction</code>'s, and of course you can write
your own. An <code>environmentFunction</code> normally takes an action
from the <code>agentFunction</code> and produces a sensation and
reward, while the <code>agentFunction</code> does the reverse: </p>
<blockquote>
<p><code>environmentFunction(action) ==> sensation, reward</code></p>
<p><code>agentFunction(sensation, reward) ==> action</code></p>
</blockquote>
<p>(An <code>action</code> is defined as anything accepted by <code>environmentFunction</code>
and a <code>sensation</code> is defined as anything produced by <code>environmentFunction</code>;
<code>reward</code>s must be numbers.) Together, the <code>agentFunction</code>
and <code>environmentFunction</code> can be used to generate <em>episodes</em>
-- sequences of sensations s, actions a, and rewards r: </p>
<blockquote>
<p><code>from RLinterface import RLinterface</code> <br>
<code>rli = RLinterface(myAgent, myEnv) </code></p>
<p><code>rli.episode(maxSteps) ==></code><strong><code> </code></strong>s<sub>0</sub>,
a<sub>0</sub>, r<sub>1</sub>, s<sub>1</sub>, a<sub>1</sub>, r<sub>2</sub>,
s<sub>2</sub>, a<sub>2</sub>, ..., r<sub>T</sub>, <code>'terminal'</code></p>
</blockquote>
<p>where <code>'terminal'</code> is a special sensation recognized by <code>RLinterface</code>
and <code>agentFunction.</code> (In a continuing problem there would
be just one never-terminating episode.)</p>
<p>To produce the initial s<sub>0</sub>, and a<sub>0</sub>, the <code>agentFunction</code>
and <code>environmentFunction</code> must also support being called
with fewer arguments:</p>
<blockquote>
<p><code>environmentFunction() ==> sensation</code></p>
<p><code>agentFunction(sensation) ==> action</code></p>
</blockquote>
<p>When the <code>environmentFunction</code> is called in this way
(with no arguments) it should start a new episode -- reset the
environment to a characteristic initial state (or distribution of
states) and produce just a sensation without a reward. When the <code>agentFunction</code>
is called in this way (with just one argument) it should not try to
process a reward on this step and should also initialize itself for the
beginning of an episode. The <code>agentFunction</code> and <code>environmentFunction</code>
will always be called in this "reduced" way before being called in the
"normal" way. </p>
<p>Episodes can be generated by calling <code>rli.episode(maxNumSteps)</code>
as above or, alternatively (and necessarily for continuing problems),
segments of an episode can be generated by calling <code>rli.steps(numSteps)</code>,
which returns the sequence of experience on the next<code> numSteps</code>
steps. For example, suppose <code>rli</code> is a freshly made
RLinterface and we run it for a single step, then for one more step,
and then for two steps after that: </p>
<blockquote>
<p><code>rli.steps(1) ==> </code> s<sub>0</sub>, a<sub>0</sub></p>
<p><code>rli.steps(1) ==> </code>r<sub>1</sub>, s<sub>1</sub>, a<sub>1</sub></p>
<p><code>rli.steps(2) ==> </code>r<sub>2</sub>, s<sub>2</sub>, a<sub>2</sub>,
r<sub>3</sub>, s<sub>3</sub>, a<sub>3</sub></p>
</blockquote>
<p>Each call to <code>rli.steps</code> continues the current episode.
To start a new episode, call <code>rli.episode(1)</code>, which
returns the same result as the first line above. Note that if <code>rli.steps(numSteps)</code>
is called on an episodic problem, it will run for <code>numSteps</code> steps
even if episodes terminate and start along the way. Thus, for example,</p>
<blockquote>
<p><code>rli.episode(1) ==> </code>s<sub>0</sub>, a<sub>0</sub> </p>
<p><code>rli.steps(4) ==> </code>r<sub>1</sub>, s<sub>1</sub>, a<sub>1</sub>,
r<sub>2</sub>, <code>'terminal'</code>, s<sub>0</sub>, a<sub>0</sub>, r<sub>1</sub>,
s<sub>1</sub>, a<sub>1</sub></p>
</blockquote>
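<p>One practical consequence (a hedged sketch, using the <code>rli</code>
constructed in the sketch above, or any other interface): because <code>steps</code>
runs straight through terminations, the number of episodes completed within a
fixed budget of steps can be read off by counting the <code>'terminal'</code>
sensations in the returned list.</p>
<pre><code>
experience = rli.steps(1000)                      # one flat list spanning episode boundaries
episodesFinished = experience.count('terminal')   # each completed episode contributes one 'terminal'
print(episodesFinished, "episodes finished in 1000 steps")
</code></pre>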
<p>The method <code>rli.episode<u>s</u>(numEpisodes,
maxStepsPerEpisode, maxStepsTotal)</code> is also provided for
efficiently running multiple episodes.</p>
<h3>Examples</h3>
<p>Here we do Q-learning with a random policy, presuming an MDP with N
states and M actions.</p>
<pre><code>
import random
from RLinterface import RLinterface

N, M = 10, 4                       # numbers of states and actions (example sizes)
Q = [[0.0] * M for _ in range(N)]  # action-value table Q[s][a], initialized to zero
alpha = 0.1                        # step size
gamma = 0.9                        # discount rate
lastS, lastA = None, None          # previous sensation and action, saved for the update

def agentFunction(s, r=None):
    global lastS, lastA
    if r is not None:              # Q-learning update for the previous transition
        target = r if s == 'terminal' else r + gamma * max(Q[s])
        Q[lastS][lastA] += alpha * (target - Q[lastS][lastA])
    lastS = s
    lastA = random.randrange(M)    # random policy; better to do epsilon greedy
    return lastA

state = 0

def environmentFunction(a=None):
    ...                            # environment details elided
    return s, 0

rli = RLinterface(agentFunction, environmentFunction)
rli.steps(1000)
</code></pre>
If additional arguments are needed for the routines, use lambda
expressions:
<pre><code>
def agentFunction(agent, s, r=None):
    ...
    return a

def environmentFunction(environment, a=None):
    if a is None:
        return s0
    else:
        ...
        return s, r

env = makeEnvironment ...
agt = makeAgent ...
rli = RLinterface(lambda s, r=None: agentFunction(agt, s, r),
                  lambda a=None: environmentFunction(env, a))
rli.episodes(10, 100000)
</code></pre>
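<p>An alternative to the lambda wrappers (a sketch only; the class and method
names below are hypothetical, not part of the module): since <code>RLinterface</code>
only needs two callables, bound methods of agent and environment objects can be
passed directly.</p>
<pre><code>
class MyAgent:
    def act(self, s, r=None):
        # learn from r when it is given, then choose an action
        ...
        return 0                  # placeholder action

class MyEnv:
    def step(self, a=None):
        if a is None:             # start of episode
            ...
            return 0              # placeholder initial sensation
        ...
        return 0, 0               # placeholder (sensation, reward)

agt, env = MyAgent(), MyEnv()
rli = RLinterface(agt.act, env.step)
rli.episodes(10, 100000)
</code></pre>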
<h3>Calling Sequences for the RLinterface methods</h3>
<p>
Here are the details for calling the RLinterface methods introduced
above:
</p>
<p><a name="RLinterface"></a>
</p>
<code><b>RLinterface</b></code>(<var>agentFunction,
environmentFunction</var>)<br>
<li> This function sets up an interface object, which can then be used
to run simulated episodes and steps. The two arguments are both
functions, and are described below.
<p><a name="agentFunction"></a> </p>
<dl>
<code><b>agentFunction</b></code>(<var>s</var> <big>[</big>, <var>r</var><big>]</big>)<br>
<dd>This function does the learning and chooses the actions for the
agent. It will be called with a sensation <var>s</var>, and optionally
with a reward <var>r</var>. If there is no reward, it indicates that
this is to be the first move in an episode. The agent function should
always return an action. For agents which have a specific first move
for an episode, the code will look something like this:
<pre><code>
def agentFunction(s, r=None):
    if r is None:                # start of episode
        return a0                # return initial action
    else:                        # learn from previous action
        learn with s and r (and previously saved info)
        a = choose next action
        return a                 # return next action
</code></pre>
<p>Where the first move is chosen the same as other moves, the
code will
look like this:<br>
</p>
<pre><code>
def agentFunction(s, r=None):
    if r is not None:            # learn from previous action
        learn with s and r (and previously saved info)
    a = choose next action
    return a                     # return next action
</code></pre>
</dd>
</dl>
<p> <a name="environmentFunction"></a></p>
<dl>
<code><b>environmentFunction</b></code>(<big>[</big><var>a</var><big>]</big>)<br>
<dd>This function does the environment task, such as determining
the next
state, or sensation after a move. It may be called with or without an
action <var>a</var>. If there is no action, it indicates that this is
the start of a new episode. In this case, the function should only
return the initial sensation. Otherwise, it should return a new
sensation and a reward.
<pre><code>
def environmentFunction(a=None):
    if a is None:                # start of episode
        return s0                # return initial sensation
    else:
        s, r = do action a
        return s, r              # return next sensation and reward
</code></pre>
</dd>
</dl>
</li>
<p>The object created by <code>RLinterface</code> has the following
methods (a combined usage sketch follows the list):
</p>
<p> <a name="step"></a></p>
<code><b>step</b></code>()<br>
<li>Runs the simulation for exactly one step. Returns the list of
sensations, actions and rewards from that step.
</li>
<p><a name="steps"></a>
<a name="stepsQ"></a></p>
<code><b>steps</b></code>(<var>numSteps</var>)<br>
<code><b>stepsQ</b></code>(<var>numSteps</var>)<br>
<li>
Runs the simulation for <var>numSteps</var> steps, regardless of
episode endings (if any). If <code>steps</code> is used, it will
return a list of the sensations, actions and rewards in the simulation.
If this is not wanted, use <code>stepsQ</code> instead (the quicker
and quieter version). </li>
<p><a name="episode"></a>
<a name="episodeQ"></a></p>
<code><b>episode</b></code>(<big>[</big><var>maxSteps</var><big>]</big>)<br>
<code><b>episodeQ</b></code>(<big>[</big><var>maxSteps</var><big>]</big>)<br>
<li>
Runs a single episode (until state 'terminal' is reached). If <code>episode</code>
is used, it will return a list of the sensations, actions and rewards
in the episode. If this is not wanted, use <code>episodeQ</code>
instead (the quicker and quieter version). If <var>maxSteps</var> is
specified, the simulation will stop after that many steps even if the
end of the episode hasn't been reached.
</li>
<p><a name="episodes"></a>
<a name="episodesQ"></a></p>
<code><b>episodes</b></code>(<var>numEpisodes</var> <big>[</big>,
<var>maxSteps,
maxStepsTotal</var><big>]</big>)<br>
<code><b>episodesQ</b></code>(<var>numEpisodes</var> <big>[</big>, <var>maxSteps,
maxStepsTotal</var><big>]</big>)<br>
<li>
Runs <var>numEpisodes</var> episodes. If <code>episodes</code> is
used, it will return a list of the sensations, actions and rewards in
the episodes. If this is not wanted, use <code>episodesQ</code>
instead (the quicker and quieter version). If <var>maxSteps</var> is
specified, it indicates the maximum number of steps allowed for each
episode. If <var>maxStepsTotal</var> is specified, it limits the
number of steps for all of the episodes together (regardless of whether
an episode has finished, or the specified number of episodes have run).
</li>
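<p>A combined usage sketch for the methods listed above (<code>rli</code> is
any <code>RLinterface</code> object; the particular numbers are arbitrary):</p>
<pre><code>
rli.step()                        # exactly one step; returns that step's experience
data = rli.episode(10000)         # one episode, at most 10000 steps, experience returned
rli.episodeQ(10000)               # same, but quicker and quieter (no list returned)
data = rli.steps(500)             # 500 steps, running straight through any terminations
rli.stepsQ(100000)                # a long quiet run
rli.episodesQ(100, 1000, 50000)   # 100 episodes, at most 1000 steps each, 50000 steps total
</code></pre>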
<h3>Source Code for RLinterface Module</h3>
Source code for the RLinterface module is located <a
href="RLinterface.py">here</a>.
<hr>
<font size="-1">This <a href="/openpageinfrastructure/openpageidea.html">open web page</a> is hosted by the <a href="http://www.rlai.net">RLAI group</a> at the <a href="http://www.cs.ualberta.ca">University of Alberta</a>.
<a href="/openpageinfrastructure/termsofuse.html">Terms of use</a></font>
</body>
</html>