Policy Development Task Specification
oprc_env provides a reinforcement learning environment for the task of multi-agent coverage by a swarm of UAVs. It does not include an agent to perform reinforcement learning on this task (note, however, that the oprc repository aims to supply such an agent).
The purpose of this document is to provide a specification of the interface exposed by the oprc_env modules, and the interface required of a reinforcement learning agent which is compatible with this environment.
At this point, all code for the project is in Haskell, and all interfaces are defined in terms of Haskell modules. In the future, there are plans to support other popular frameworks for reinforcement learning. In particular, this environment may eventually be compatible with the popular OpenAI Gym.
In this environment, a reinforcement learning agent is tasked with observing the entirety of a two-dimensional search space (representing an area of land with varied terrain) using a team of drones.
Each drone may be controlled independently and observes the subset of the terrain directly below it. Drones may fly at a high altitude, in which case they observe a large area of land in low detail. Drones may also fly low, observing a smaller portion of the search space in high detail. A drone may ascend or descend to switch between these two altitudes. Finally, drones may move horizontally so that they can observe new patches of the search space.
The individual patches of land which make up the search space may require either low or high levels of scrutiny. A patch which requires high scrutiny will be fully observed by a drone viewing it from a low altitude. If a drone views this patch from a high altitude, the patch will be classified as requiring high scrutiny, and this information will be made available to the planning system. However, this same patch will not be fully observed until it is viewed from a low altitude.
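As an illustration of this rule, the following sketch uses hypothetical Altitude and ObservationStatus types (they are not part of oprc_env, which defines its own representations) together with the DetailReq type shown later in this document:
--Hypothetical types for illustration only
data Altitude = High | Low
data ObservationStatus
  = Unseen         --never observed
  | Classified     --known to require close scrutiny, but not yet fully observed
  | FullyObserved  --nothing more to learn about this patch

--The observation rule described above, in terms of the DetailReq type shown below
observe :: Altitude -> DetailReq -> ObservationStatus
observe Low  _     = FullyObserved  --a low pass fully observes any patch
observe High Far   = FullyObserved  --patches needing only distant scrutiny are done from high up
observe High Close = Classified     --patches needing close scrutiny still require a low pass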
The goal of this task is to fully observe each point in the search space in as few time steps as possible. At any point, an agent performing this task has access to a description of where all of the patches in the search space are, as well as the information known about each of these patches based on observations made in the task so far.
The most important piece of the environment-agent interface is the policy typeclass:
class Policy p where
  nextMove :: p -> WorldView -> NextActions
See: WorldView, NextActions
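As a minimal sketch of what an instance might look like, a policy could simply wrap a pure function from observations to actions. The FnPolicy name below is hypothetical, and the sketch assumes Policy, WorldView, and NextActions are imported from the oprc_env modules:
--A policy defined directly by a pure function from observations to actions
newtype FnPolicy = FnPolicy (WorldView -> NextActions)

instance Policy FnPolicy where
  nextMove (FnPolicy f) worldView = f worldView
A learning agent would typically carry additional state (such as learned parameters) in its policy type and consult that state inside nextMove.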
Note that the term 'policy' may refer either to an instance of the above typeclass or to the general reinforcement learning notion of a policy.
At minimum, an agent must have an instance of Policy to interact with oprc_env. To understand what a reasonable instance of Policy looks like, it will help to describe the task - and the data structures used to represent its many parts - in more detail.
One fundamental data structure in this project is the Environment:
type Environment = Map.Map Position Patch
An Environment can be thought of as a two-dimensional surface made of individual units of land. In this project, such a unit of land is called a Patch. A patch contains information describing what level of scrutiny is required to observe it:
--Levels of Scrutiny that may be required
data DetailReq = Close | Far
  deriving (Eq, Show)

--a patch is a single spot in the map
data Patch = Patch DetailReq
  deriving Eq
Each patch in the environment is associated with a location. Locations are represented by the Position datatype, which effectively represents a pair of Cartesian coordinates:
type XCoord = Integer
type YCoord = Integer

data Position = Position XCoord YCoord
  deriving Eq
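For example, a small Environment could be constructed with Map.fromList, as sketched below. This assumes a qualified import of Data.Map (matching the Map.Map alias used above) and an Ord instance for Position, which Map keys require:
import qualified Data.Map as Map

--A tiny 2x2 environment: the patch at (0, 1) requires close scrutiny,
--while the remaining patches can be fully observed from a high altitude
smallEnvironment :: Environment
smallEnvironment = Map.fromList
  [ (Position 0 0, Patch Far)
  , (Position 0 1, Patch Close)
  , (Position 1 0, Patch Far)
  , (Position 1 1, Patch Far)
  ]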
The environment data structure simply associates each location in a collection of locations with the type of patch found at that location. Though it is not represented this way in the code, it is sometimes helpful to think of the environment as a graph, where each node represents a (Position, Patch) pair and edges exist between nodes whose x and y coordinates each differ by no more than one. A valid Environment must be connected, i.e. a path must exist between any two nodes in the environment.
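The connectivity requirement can be checked with a simple graph traversal over the environment's positions. The sketch below is illustrative rather than the library's own validation code, and it assumes an Ord instance for Position (needed by Data.Set):
import qualified Data.Map as Map
import qualified Data.Set as Set

--Positions whose x and y coordinates each differ from the given position's
--by no more than one (excluding the position itself)
neighborsOf :: Position -> [Position]
neighborsOf (Position x y) =
  [ Position (x + dx) (y + dy)
  | dx <- [-1, 0, 1], dy <- [-1, 0, 1], (dx, dy) /= (0, 0) ]

--An environment is connected if every position is reachable from an
--arbitrary starting position by repeatedly stepping to neighboring positions
isConnected :: Environment -> Bool
isConnected env = case Map.keys env of
  []      -> True
  (p : _) -> Set.size (explore (Set.singleton p) [p]) == Map.size env
  where
    explore visited []         = visited
    explore visited (q : rest) =
      let next = [ n | n <- neighborsOf q
                     , Map.member n env
                     , not (Set.member n visited) ]
      in explore (foldr Set.insert visited next) (next ++ rest)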
This reinforcement learning task terminates when