"""Core classes."""
class Sample:
"""Represents a reinforcement learning sample.
Used to store observed experience from an MDP. Represents a
standard `(s, a, r, s', terminal)` tuple.
Note: This is not the most efficient way to store things in the
replay memory, but it is a convenient class to work with when
sampling batches, or saving and loading samples while debugging.
Parameters
----------
state: array-like
Represents the state of the MDP before taking an action. In most
cases this will be a numpy array.
action: int, float, tuple
For discrete action domains this will be an integer. For
continuous action domains this will be a floating point
number. For a parameterized action MDP this will be a tuple
containing the action and its associated parameters.
reward: float
The reward received for executing the given action in the given
state and transitioning to the resulting state.
next_state: array-like
This is the state the agent transitions to after executing the
`action` in `state`. Expected to be the same type/dimensions as
the state.
is_terminal: boolean
True if this action finished the episode. False otherwise.
"""
pass
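

# --- Illustrative sketch (not part of the required interface) ---
# A minimal sketch of one way Sample could be realized: simply store the
# (s, a, r, s', terminal) tuple as attributes. The class name below is
# hypothetical; the interface above only specifies the documented fields.
class ExampleSample(Sample):
    """A concrete Sample holding the transition as plain attributes."""

    def __init__(self, state, action, reward, next_state, is_terminal):
        self.state = state
        self.action = action
        self.reward = reward
        self.next_state = next_state
        self.is_terminal = is_terminal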
class Preprocessor:
    """Preprocessor base class.

    This is a suggested interface for the preprocessing steps. You may
    implement any of these functions. Feel free to add or change the
    interface to suit your needs.

    A Preprocessor can be used to perform some fixed operations on the
    raw state from an environment. For example, in ConvNet based
    networks which use images as the raw state, it is often useful to
    convert the image to greyscale or downsample the image.

    Preprocessors are implemented as classes so that they can have
    internal state. This can be useful for things like the
    AtariPreprocessor which maxes over k frames.

    If you're using internal states, such as for keeping a sequence of
    inputs like in Atari, you should probably call reset when a new
    episode begins so that state doesn't leak in from episode to
    episode.

    An illustrative (and purely optional) example subclass is sketched
    after this class definition.
    """
    def process_state_for_network(self, state):
        """Preprocess the given state before giving it to the network.

        Should be called just before the action is selected.

        This is a different method from process_state_for_memory
        because the replay memory may require a different storage
        format to reduce memory usage. For example, storing images as
        uint8 in memory is a lot more efficient than float32, but the
        networks work better with floating point images.

        Parameters
        ----------
        state: np.ndarray
          Generally a numpy array. A single state from an environment.

        Returns
        -------
        processed_state: np.ndarray
          Generally a numpy array. The state after processing. Can be
          modified in any way.
        """
        return state
    def process_state_for_memory(self, state):
        """Preprocess the given state before giving it to the replay memory.

        Should be called just before appending this to the replay memory.

        This is a different method from process_state_for_network
        because the replay memory may require a different storage
        format to reduce memory usage. For example, the memory may
        store images as uint8 while the network expects images in
        floating point.

        Parameters
        ----------
        state: np.ndarray
          A single state from an environment. Generally a numpy array.

        Returns
        -------
        processed_state: np.ndarray
          Generally a numpy array. The state after processing. Can be
          modified in any manner.
        """
        return state
    def process_batch(self, samples):
        """Process a batch of samples.

        If your replay memory storage format is different from your
        network input, you may want to apply this function to your
        sampled batch before running it through your update function.

        Parameters
        ----------
        samples: list(tensorflow_rl.core.Sample)
          List of samples to process

        Returns
        -------
        processed_samples: list(tensorflow_rl.core.Sample)
          Samples after processing. Can be modified in any way, but
          the list length will generally stay the same.
        """
        return samples
    def process_reward(self, reward):
        """Process the reward.

        Useful for things like reward clipping. The Atari environments
        from the DQN paper do this. Instead of using the raw score,
        they take the sign of the change in score.

        Parameters
        ----------
        reward: float
          Reward to process

        Returns
        -------
        processed_reward: float
          The processed reward
        """
        return reward
    def reset(self):
        """Reset any internal state.

        Will be called at the start of every new episode. Makes it
        possible to do history snapshots.
        """
        pass
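

# --- Illustrative sketch (not part of the required interface) ---
# A hedged example of a Preprocessor subclass in the spirit of the Atari
# preprocessing described above: convert frames to greyscale, downsample to a
# fixed size, keep uint8 frames for the replay memory, hand float32 frames to
# the network, and clip rewards to their sign. The class name, the default
# frame size, and the use of Pillow for resizing are assumptions made for
# this sketch, not requirements of the interface. Imports are placed here to
# keep the sketch self-contained.
import numpy as np
from PIL import Image


class ExampleAtariPreprocessor(Preprocessor):
    def __init__(self, new_size=(84, 84)):
        self.new_size = new_size

    def _to_grey(self, state):
        # Convert an RGB frame to greyscale and downsample it.
        img = Image.fromarray(state).convert('L').resize(self.new_size)
        return np.asarray(img, dtype=np.uint8)

    def process_state_for_memory(self, state):
        # uint8 keeps the replay memory roughly 4x smaller than float32.
        return self._to_grey(state)

    def process_state_for_network(self, state):
        # The network prefers floating point input scaled to [0, 1].
        return self._to_grey(state).astype(np.float32) / 255.0

    def process_reward(self, reward):
        # Reward clipping as in the DQN paper: keep only the sign.
        return float(np.sign(reward))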
class ReplayMemory:
    """Interface for replay memories.

    We have found this to be a useful interface for the replay
    memory. Feel free to add, modify or delete methods/attributes to
    this class.

    It is expected that the replay memory has implemented the
    __iter__, __getitem__, and __len__ methods.

    If you are storing raw Sample objects in your memory, then you may
    not need the end_episode method, and you may want to tweak the
    append method. This will make the sample method easy to implement
    (just randomly draw samples saved in your memory).

    However, the above approach will waste a lot of memory (each state
    will be stored twice: once as the next_state of one sample and
    again as the state of the following sample). Depending on your
    machine resources you may want to implement a version that stores
    samples in a more memory efficient manner.

    Methods
    -------
    append(state, action, reward, debug_info=None)
      Add a sample to the replay memory. The sample can be any python
      object, but it is suggested that tensorflow_rl.core.Sample be
      used.
    end_episode(final_state, is_terminal, debug_info=None)
      Set the final state of an episode and mark whether it was a true
      terminal state (i.e. the env returned is_terminal=True), or an
      artificial terminal state (i.e. the agent quit the episode
      early, but could have kept running the episode).
    sample(batch_size, indexes=None)
      Return a list of samples from the memory. Each class will
      implement a different method of choosing the
      samples. Optionally, specify the sample indexes manually.
    clear()
      Reset the memory. Deletes all references to the samples.
    """
    def __init__(self, max_size, window_length):
        """Set up the memory.

        You should specify the maximum size of the memory. Once the
        memory fills up, the oldest values should be removed. You can
        try the collections.deque class as the underlying storage, but
        your sample method will be very slow.

        We recommend using a list as a ring buffer. Just track the
        index where the next sample should be inserted in the list.
        (One possible ring-buffer sketch follows this class.)
        """
        pass
    def append(self, state, action, reward):
        raise NotImplementedError('This method should be overridden')

    def end_episode(self, final_state, is_terminal):
        raise NotImplementedError('This method should be overridden')

    def sample(self, batch_size, indexes=None):
        raise NotImplementedError('This method should be overridden')

    def clear(self):
        raise NotImplementedError('This method should be overridden')
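

# --- Illustrative sketch (not part of the required interface) ---
# A hedged sketch of the ring-buffer idea suggested in __init__ above: keep a
# plain Python list of at most max_size entries and overwrite the oldest one
# once the buffer is full. It stores whole transition tuples, so it is simple
# but not memory efficient, and it does not make use of window_length. The
# class name and the tuple layout are assumptions made for this sketch.
import random


class ExampleRingBufferMemory(ReplayMemory):
    def __init__(self, max_size, window_length):
        self.max_size = max_size
        self.window_length = window_length
        self.buffer = []
        self.next_index = 0  # position where the next entry will be written

    def _insert(self, item):
        if len(self.buffer) < self.max_size:
            self.buffer.append(item)
        else:
            self.buffer[self.next_index] = item  # overwrite the oldest entry
        self.next_index = (self.next_index + 1) % self.max_size

    def append(self, state, action, reward):
        # Store the observed (state, action, reward); a fuller implementation
        # would also stitch in the next state and terminal flag.
        self._insert((state, action, reward, False))

    def end_episode(self, final_state, is_terminal):
        self._insert((final_state, None, 0.0, is_terminal))

    def sample(self, batch_size, indexes=None):
        # Draw uniformly at random unless explicit indexes are given.
        if indexes is None:
            indexes = [random.randrange(len(self.buffer))
                       for _ in range(batch_size)]
        return [self.buffer[i] for i in indexes]

    def clear(self):
        self.buffer = []
        self.next_index = 0

    def __len__(self):
        return len(self.buffer)

    def __getitem__(self, index):
        return self.buffer[index]

    def __iter__(self):
        return iter(self.buffer)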