Policy Development Task Specification
oprc_env provides a reinforcement learning environment for the task of multi-agent coverage by a swarm of UAVs. It does not include an agent to perform reinforcement learning on this task (note, however, that the oprc repository aims to supply such an agent).
The purpose of this document is to provide a specification of the interface exposed by the oprc_env modules, and the interface required of a reinforcement learning agent which is compatible with this environment.
At this point, all code for the project is in Haskell, and all interfaces are defined in terms of Haskell modules. In the future, there are plans to support other popular frameworks for reinforcement learning. In particular, this environment may eventually be compatible with the popular OpenAI Gym.
In this environment, a reinforcement learning agent is tasked with observing the entirety of a two-dimensional search space (representing an area of land with varied terrain) using a team of drones.
Each drone may be controlled independently, and observes subsets of the terrain directly below it. Drones may fly at a high altitude, in which case they can observe a large area of land in low detail. Drones may also fly low, observing a smaller portion of the search space in high detail. A drone may also ascend and descend to swap between these two altitudes. Finally, drones may of course move horizontally, so that they may observe new patches of the search space.
The individual patches of land which make up the search space may require either low or high levels of scrutiny. A patch which requires high scrutiny will be fully observed by a drone viewing it from a low altitude. If a drone views this patch from a high altitude, the patch will be classified as requiring high scrutiny, and this information will be made available to the planning system. However, this same patch will not be fully observed until it is viewed from a low altitude.
The goal of this task is to fully observe each point in the search space in as few time steps as possible. At any point, an agent performing this task has access to a description of where all of the patches in the search space are, as well as the information known about each of these patches based on observations made in the task so far.
The most important piece of the environment-agent interface is the Policy typeclass:
class Policy p where
  nextMove :: p -> WorldView -> NextActions
See: WorldView, NextActions
Note that the term 'policy' may refer either to an instance of the above typeclass or to the general reinforcement learning notion of a policy.
At minimum, an agent must have an instance of Policy to interact with oprc_env. To understand what a reasonable instance of Policy looks like, it will help to describe the task - and the data structures used to represent its many parts - in more detail.
One fundamental data structure in this project is the Environment:
type Environment = Map.Map Position Patch
An Environment can be thought of as a two-dimensional surface made of individual units of land. In this project, such a unit of land is called a Patch. A patch contains information describing what level of scrutiny is required to observe it:
--Levels of Scrutiny that may be required
data DetailReq = Close | Far
  deriving (Eq, Show)

--a patch is a single spot in the map
data Patch = Patch DetailReq
  deriving Eq
Each patch in the environment is associated with a location. Locations are represented by the Position datatype, which effectively represents a pair of Cartesian coordinates:
type XCoord = Integer
type YCoord = Integer

data Position = Position XCoord YCoord
  deriving Eq
The environment data structure simply associates each element in a collection of locations with the type of patch which can be found at that location. Though it is not represented this way in the code, it is sometimes helpful to think of the environment as a graph, where each node represents a (Position, Patch) pair, and edges exist between nodes whose x and y coordinates each differ by no more than one. Valid Environment values must be connected, i.e. a path must exist between any two nodes in this graph.
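As an illustration, a minimal environment could be constructed directly with Data.Map. The positions and patch types below are arbitrary, chosen only to show the shape of the data; Close is read here as requiring a low-altitude pass and Far as observable from high altitude, matching the description above.

import qualified Data.Map as Map

--a tiny 2x2 environment; every position is adjacent to the others, so it is connected
exampleEnv :: Environment
exampleEnv = Map.fromList
  [ (Position 0 0, Patch Far)    --fully observable even from high altitude
  , (Position 0 1, Patch Close)  --needs a low-altitude pass to be fully observed
  , (Position 1 0, Patch Far)
  , (Position 1 1, Patch Close)
  ]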
The Env module contains many useful functions to operate on these data structures.
The information stored in Environment is not directly accessible to an agent performing this task. Rather, the agent must operate on partial information about the environment which becomes more complete as the task progresses. An agent may have three levels of knowledge about a patch:
- The patch has not been seen; nothing is known about it
- The patch has been seen, but requires higher scrutiny to observe completely
- The patch has been fully observed.
Information known about an individual patch is represented by PatchInfo.
--given a patch of land, what do we know about it?
data PatchInfo =
    Unseen                    --nothing is known about the patch
  | Classified DetailReq      --only the type of the patch is known
  | FullyObserved Env.Patch   --this patch has been adequately observed
  deriving Eq
The EnvironmentInfo datatype represents the information known by an agent about the environment as a whole. It contains the same collection of Positions as the corresponding Environment. This datatype is a crucial component of the WorldView data supplied to an agent, as seen in Policy.
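The exact definition of EnvironmentInfo lives in the oprc_env source, but it can be pictured as a map from positions to what is currently known about each patch. The sketch below, including the allFullyObserved helper, is illustrative rather than a quotation of the actual module:

--conceptual picture only; see the oprc_env source for the real definition
type EnvironmentInfo = Map.Map Position PatchInfo

--a hypothetical helper: has every patch in the search space been fully observed?
allFullyObserved :: EnvironmentInfo -> Bool
allFullyObserved = all isObserved . Map.elems
  where
    isObserved (FullyObserved _) = True
    isObserved _ = False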
In addition to the environment itself, oprc_env simulates a collection of drones which observe the search region. As previously described, a drone may occupy either a high or a low altitude. Therefore, the position of a drone is its (x, y) coordinates together with an altitude. This is represented by DronePosition:
data DronePosition = DronePos Env.Position Env.Altitude
  deriving (Eq, Show)
Drones also possess a status. A drone may be:
- waiting for a command
- assigned a command, but not yet acting
- in the middle of executing an action
This is represented by DroneStatus:
type StepsRemaining = Integer

data DroneStatus =
    Unassigned
  | Assigned Action
  | Acting Action StepsRemaining
  deriving (Eq, Show)
The action assigned to a drone may be any of:
- Moving horizontally (North, South-West, etc.)
- Moving vertically (ascending or descending)
- Hovering
Relevant code snippets:
data Action =
    MoveCardinal Env.CardinalDir
  | MoveIntercardinal Env.IntercardinalDir
  | MoveVertical VerticalDirection
  | Hover
  deriving (Eq, Show)

data VerticalDirection = Ascend | Descend
  deriving (Eq, Show)

data CardinalDir = North | South | East | West
  deriving (Show, Eq)

data IntercardinalDir = NE | SE | NW | SW
  deriving (Show, Eq)
In order to represent an ensemble of drones, each drone is given an ID number. Then the DroneList datatype represents the collection of drones being controlled in the current task:
data Drone = DroneID Integer
  deriving (Eq, Show)

type DroneList = [Drone]
Finally, the EnsembleStatus represents what each drone in the collection is doing:
type EnsembleStatus = [(Drone, DroneStatus)]
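For example, a snapshot of a two-drone ensemble might look like the following (the values are hypothetical, and module qualifiers are omitted for readability):

--drone 1 is idle and waiting for a command, while drone 2 is partway
--through a northward move with four time steps left to go
exampleStatus :: EnsembleStatus
exampleStatus =
  [ (DroneID 1, Unassigned)
  , (DroneID 2, Acting (MoveCardinal North) 4)
  ]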
Now that some key datatypes have been explained, consider the Policy typeclass again:
class Policy p where
  nextMove :: p -> WorldView -> NextActions
nextMove takes a WorldView as input:
data WorldView =
  WorldView {
    getView :: EnvironmentInfo
  , getDroneList :: Ensemble.DroneList
  , getEnsembleStatus :: Ensemble.EnsembleStatus
  }
  deriving Eq
Once applied to a WorldView, nextMove must yield a NextActions, which specifies what each drone should do next:
type NextActions = [(Drone, Action)]
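For instance, a deliberately naive policy that ignores its observations and tells every drone to hover could be written roughly as follows. This is a sketch only, assuming the names exported by the Policy and Ensemble modules match those shown above; it is not part of oprc_env itself:

--a policy carrying no state of its own
data HoverPolicy = HoverPolicy

instance Policy HoverPolicy where
  --assign Hover to every drone in the ensemble, regardless of what is known
  nextMove _ worldView = [ (drone, Hover) | drone <- getDroneList worldView ]

A more useful policy would instead inspect getView and getEnsembleStatus to decide which drones should move, ascend, or descend.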
At this point, enough information has been provided to implement a policy which is technically valid. The remaining notes describe the reward structure of the environment, as well as other relevant details about the way drones interact with the world.
Though an agent could learn this information directly from experience in the reinforcement learning setting, the specific dynamics of this environment are summarized here.
Actions each take a certain amount of time to complete (Timed):
instance Timed Action where
  duration Hover = 1
  duration (MoveCardinal _) = 10
  duration (MoveIntercardinal _) = 14
  duration (MoveVertical _) = 10
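For example, under these durations a drone assigned MoveIntercardinal NE followed by MoveVertical Descend will be busy for 14 + 10 = 24 time steps before it is ready for another command, while a drone that simply hovers is ready again after a single step.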
The altitude a drone occupies affects how much of the search space it can view. A low-flying drone can view only the patch directly below it, whereas a high-flying drone can view this patch as well as its eight neighbors:
viewableFrom :: DronePosition -> [Position]
viewableFrom (DronePos pos Low) = [pos]
viewableFrom (DronePos pos High) = pos : (Env.neighborsOf pos)
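As a concrete illustration (assuming Env.neighborsOf returns the eight surrounding positions, as described above):

-- >>> viewableFrom (DronePos (Position 3 4) Low) == [Position 3 4]
-- True
-- >>> length (viewableFrom (DronePos (Position 3 4) High))
-- 9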
This reinforcement learning task terminates when all patches in the environment have been fully observed.
While the goal of this task is to adequately observe the entire search space in the least possible time, it is not possible to craft a reasonable reward signal which corresponds to this goal exactly. For now, the following reward structure is proposed as a closely related signal that provides much more frequent feedback from which the agent can learn.
At each time step, the reward signal will be computed as a sum of the following components:
- 0.9 for each patch which has become fully observed since the last time step
- 0.1 for each patch which has become classified in the last time step
Patches which go from completely unclassified to fully observed in one observation trigger both conditions, for a total reward of one.
Note that the reward signal has not yet been implemented in code.
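As a sketch of what that computation might eventually look like, assuming EnvironmentInfo can be treated as a map from Position to PatchInfo as pictured earlier (all names here are illustrative, not part of the current codebase):

--reward earned between two successive snapshots of the agent's knowledge
rewardBetween :: EnvironmentInfo -> EnvironmentInfo -> Double
rewardBetween before after =
    0.9 * fromIntegral (countNewly fullyObserved)
  + 0.1 * fromIntegral (countNewly atLeastClassified)
  where
    --count the patches that satisfy a predicate now but did not before
    countNewly f = length [ pos | (pos, info) <- Map.toList after
                                , f info
                                , maybe True (not . f) (Map.lookup pos before) ]
    fullyObserved (FullyObserved _) = True
    fullyObserved _ = False
    --a patch that jumps straight from Unseen to FullyObserved satisfies both
    --predicates, earning both components for a total reward of one
    atLeastClassified Unseen = False
    atLeastClassified _ = True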