
Policy Development Task Specification


Introduction

oprc_env provides a reinforcement learning environment for the task of multi-agent coverage by a swarm of UAVs. It does not include an agent to learn this task (note, however, that the oprc repository aims to supply such an agent).

The purpose of this document is to provide a specification of the interface exposed by the oprc_env modules, and the interface required of a reinforcement learning agent which is compatible with this environment.

At this point, all code for the project is in Haskell, and all interfaces are defined in terms of Haskell modules. In the future, there are plans to support other popular reinforcement learning frameworks; in particular, this environment may eventually be compatible with OpenAI Gym.

Quick Task Overview

In this environment, a reinforcement learning agent is tasked with observing the entirety of a two-dimensional search space (representing an area of land with varied terrain) using a team of drones.

Each drone may be controlled independently and observes the terrain directly below it. A drone may fly at a high altitude, in which case it observes a large area of land in low detail, or fly at a low altitude, observing a smaller portion of the search space in high detail. A drone may ascend or descend to switch between these two altitudes. Finally, drones may of course move horizontally, so that they can observe new patches of the search space.

The individual patches of land which make up the search space may require either low or high levels of scrutiny. A patch which requires high scrutiny will be fully observed by a drone viewing it from a low altitude. If a drone views this patch from a high altitude, the patch will be classified as requiring high scrutiny, and this information will be made available to the planning system. However, this same patch will not be fully observed until it is viewed from a low altitude.

The goal of this task is to fully observe each point in the search space in as few time steps as possible. At any point, an agent performing this task has access to a description of where all of the patches in the search space are, as well as the information known about each of these patches based on observations made in the task so far.

The Policy Typeclass

The most important piece of the environment-agent interface is the policy typeclass:

class Policy p where
  nextMove :: p -> WorldView -> NextActions

See: WorldView, NextActions

Note that the term 'policy' may refer either to an instance of the above typeclass or to the general reinforcement learning notion of a policy.

At minimum, an agent must provide an instance of Policy to interact with oprc_env. To understand what a reasonable instance of Policy looks like, it will help to describe the task, and the data structures used to represent its many parts, in more detail.

The Environment

One fundamental data structure in this project is the Environment:

type Environment = Map.Map Position Patch

An Environment can be thought of as a two-dimensional surface made of individual units of land. In this project, such a unit of land is called a Patch. A patch contains information describing what level of scrutiny is required to observe it:

--Levels of Scrutiny that may be required
data DetailReq = Close | Far
  deriving (Eq, Show)


--a patch is a single spot in the map
data Patch = Patch DetailReq
  deriving Eq

Each patch in the environment is associated with a location. Locations are represented by the Position datatype, which effectively represents a pair of Cartesian coordinates:

type XCoord = Integer
type YCoord = Integer


data Position = Position XCoord YCoord
  deriving Eq

The Environment data structure simply associates each location in a collection with the type of patch found at that location. Though it is not represented this way in the code, it is sometimes helpful to think of the environment as a graph, where each node represents a (Position, Patch) pair and edges exist between distinct nodes whose x and y coordinates each differ by no more than one. Valid Environment values must be connected, i.e. a path must exist between any two nodes in the environment.
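For illustration, here is a sketch of a small environment built directly with Map.fromList. It assumes Data.Map is imported qualified as Map, as the Environment type synonym suggests, and that Position has the Ord instance a Map key requires:

import qualified Data.Map as Map

--an illustrative 2x2 environment: the y = 1 row requires close scrutiny,
--while the y = 0 row only needs to be seen from afar; all four positions
--are mutually adjacent, so the connectivity requirement is satisfied
smallEnv :: Environment
smallEnv = Map.fromList
  [ (Position 0 0, Patch Far)
  , (Position 1 0, Patch Far)
  , (Position 0 1, Patch Close)
  , (Position 1 1, Patch Close)
  ]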

The Env module contains many useful functions to operate on these data structures.

The Environment View

The information stored in an Environment is not directly accessible to an agent performing this task. Rather, the agent must operate on partial information about the environment, which becomes more complete as the task progresses. An agent may have one of three levels of knowledge about a patch:

  • The patch has not been seen; nothing is known about it
  • The patch has been seen, but requires higher scrutiny to observe completely
  • The patch has been fully observed

Information known about an individual patch is represented by PatchInfo:

--given a patch of land, what do we know about it?
data PatchInfo =
    Unseen --nothing is known about the patch
  | Classified DetailReq --only the type of the patch is known
  | FullyObserved Env.Patch --this patch has been adequately observed
  deriving Eq

The EnvironmentInfo datatype represents the information known by an agent about the environment as a whole. It contains the same collection of Positions as the corresponding Environment. This datatype is a crucial component of the WorldView data supplied to an agent, as seen in Policy.
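The exact definition lives in oprc_env, but for intuition EnvironmentInfo can be pictured as a map from Positions to PatchInfo values, mirroring the Environment type. Under that assumption (an illustration only, not the real definition), simple queries are easy to express:

--sketch only: assumes EnvironmentInfo maps each Position to a PatchInfo
--list the positions the agent currently knows nothing about
unseenPositions :: EnvironmentInfo -> [Env.Position]
unseenPositions info = Map.keys (Map.filter (== Unseen) info)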

Drone

In addition to the environment itself, oprc_env simulates a collection of drones which observe the search region. As previously described, a drone may occupy either a high or a low altitude. Therefore, the position of a drone is its (x, y) coordinates together with an altitude. This is represented by DronePosition:

data DronePosition = DronePos Env.Position Env.Altitude
  deriving (Eq, Show)
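For example, a drone flying high above the point (2, 3) would have the following position (an illustrative value only; constructor qualification depends on how Env is imported):

--an illustrative value: a drone flying high above the point (2, 3)
examplePos :: DronePosition
examplePos = DronePos (Env.Position 2 3) High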

Drones also possess a status. A drone may be:

  • waiting for a command
  • assigned a command, but not yet acting
  • in the middle of executing an action

This is represented by DroneStatus:

type StepsRemaining = Integer


data DroneStatus =
    Unassigned
  | Assigned Action
  | Acting Action StepsRemaining
  deriving (Eq, Show)

The action assigned to a drone may be any of:

  • Moving horizontally (North, South-West, etc.)
  • Moving vertically
  • Hovering

Relevant code snippets:

Action

data Action =
    MoveCardinal Env.CardinalDir
  | MoveIntercardinal Env.IntercardinalDir
  | MoveVertical VerticalDirection
  | Hover
  deriving (Eq, Show)

VerticalDirection

data VerticalDirection = Ascend | Descend
  deriving (Eq, Show)

Cardinal Directions

data CardinalDir = North | South | East | West
  deriving (Show, Eq)

Intercardinal Directions

data IntercardinalDir = NE | SE | NW | SW
  deriving (Show, Eq)

To represent an ensemble of drones, each drone is given an ID number. The DroneList type then represents the collection of drones being controlled in the current task:

data Drone = DroneID Integer
  deriving (Eq, Show)


type DroneList = [Drone]

Finally, the EnsembleStatus represents what each drone in the collection is doing:

type EnsembleStatus = [(Drone, DroneStatus)]
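For example, an EnsembleStatus for a two-drone team might look like the following (an illustrative value only):

--illustrative only: drone 1 is waiting for a command, while drone 2 is
--four time steps away from completing a move to the north
exampleStatus :: EnsembleStatus
exampleStatus =
  [ (DroneID 1, Unassigned)
  , (DroneID 2, Acting (MoveCardinal Env.North) 4)
  ]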

Policy, Again

Now that some key datatypes have been explained, consider the policy typeclass again:

class Policy p where
  nextMove :: p -> WorldView -> NextActions

nextMove takes a WorldView as input:

data WorldView = 
  WorldView {
    getView :: EnvironmentInfo
  , getDroneList :: Ensemble.DroneList
  , getEnsembleStatus :: Ensemble.EnsembleStatus
  }
  deriving Eq

Once applied to a WorldView, nextMove must yield a NextActions, which represents what each drone should do next:

type NextActions = [(Drone, Action)]

At this point, enough information has been provided to implement a policy which is technically valid; a minimal sketch of such a policy appears below. The remaining notes describe the reward structure of the environment, as well as other relevant details about the way drones interact with the world.
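For concreteness, here is a trivial (and deliberately unhelpful) policy that tells every drone to hover on every step. The name HoverPolicy is introduced here purely for illustration; the instance relies only on the getDroneList accessor shown above.

--a trivial policy: every drone is told to hover at every decision point
data HoverPolicy = HoverPolicy

instance Policy HoverPolicy where
  nextMove _ wv = [ (drone, Hover) | drone <- getDroneList wv ]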

Costs and Reward Information

Though it is possible to learn this directly from experience in the reinforcement learning setting, the specific dynamics of this environment are described here for reference.

Actions each take a certain amount of time to complete (Timed):

instance Timed Action where
  duration Hover = 1
  duration (MoveCardinal _) = 10
  duration (MoveIntercardinal _ ) = 14
  duration (MoveVertical _ ) = 10
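These durations add up over a drone's queued actions. As a small sketch (assuming duration returns an Integer, as the literals above suggest):

--sketch: the total time a drone needs to work through a queue of actions
planDuration :: [Action] -> Integer
planDuration = sum . map duration

--e.g. planDuration [MoveCardinal Env.North, MoveVertical Descend, Hover] == 21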

The altitude a drone occupies affects how much it can view. A low-flying drone can view the patch below it, whereas a high-flying drone can view this patch as well as its eight neighbors:

viewableFrom :: DronePosition -> [Position]
viewableFrom (DronePos pos Low) = [pos]
viewableFrom (DronePos pos High) = pos : (Env.neighborsOf pos)
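For example, assuming Env.neighborsOf returns the eight adjacent positions described earlier, a high-flying drone at (2, 3) views the nine positions in the 3x3 block centered on (2, 3), while a low-flying drone at the same spot views only (2, 3) itself.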

This reinforcement learning task terminates when all patches in the environment have been fully observed.

While the goal of this task is to adequately observe the entire search space in the least possible time, it is not possible to craft a reasonable reward signal which corresponds to this goal exactly. For now, the following reward structure is proposed as a closely related signal that provides much more frequent feedback from which the agent can learn.

At each time step, the reward signal will be computed as a sum of the following components:

  • 0.9 for each patch which has become fully observed since the last time step
  • 0.1 for each patch which has become classified in the last time step

Patches which go from completely unclassified to fully observed in one observation trigger both conditions, for a total reward of one.

Note that the reward signal has not yet been implemented in code.
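As a concrete illustration of the rule above, here is a sketch of how such a reward could be computed from two successive states of knowledge. It again assumes, purely for illustration, the Map-based picture of EnvironmentInfo used earlier; it is not the actual implementation.

--sketch only: assumes EnvironmentInfo maps each Position to a PatchInfo
stepReward :: EnvironmentInfo -> EnvironmentInfo -> Double
stepReward before after =
  sum (Map.elems (Map.intersectionWith patchReward before after))
  where
    patchReward old new = observedBonus + classifiedBonus
      where
        observedBonus
          | not (isObserved old) && isObserved new = 0.9
          | otherwise                              = 0
        classifiedBonus
          | not (isClassified old) && isClassified new = 0.1
          | otherwise                                   = 0
    isObserved (FullyObserved _) = True
    isObserved _                 = False
    isClassified Unseen          = False
    isClassified _               = True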