
Policy Development Task Specification


Introduction

oprc_env provides a reinforcement learning environment for the task of multi-agent coverage by a swarm of UAVs. It does not include an agent to perform reinforcement learning on this task (note, however, that the oprc repository aims to supply such an agent).

The purpose of this document is to specify the interface exposed by the oprc_env modules and the interface required of a reinforcement learning agent compatible with this environment.

At this point, all code for the project is in Haskell, and all interfaces are defined in terms of Haskell modules. There are plans to support other popular reinforcement learning frameworks in the future; in particular, this environment may eventually be compatible with OpenAI Gym.

Quick Task Overview

In this environment, a reinforcement learning agent is tasked with observing the entirety of a two-dimensional search space (representing an area of land with varied terrain) using a team of drones.

Each drone may be controlled independently and observes a subset of the terrain directly below it. Drones may fly at a high altitude, in which case they observe a large area of land in low detail, or at a low altitude, observing a smaller portion of the search space in high detail. A drone may ascend or descend to switch between these two altitudes. Finally, drones may of course move horizontally so that they can observe new patches of the search space.
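The command set available to each drone is not spelled out in this overview; as a rough illustration only, a per-drone command type along the lines of the sketch below would capture the movements described above. The names here (DroneCommand, Ascend, Hover, and so on) are assumptions made for the sake of the example, not the actual oprc_env definitions.

-- A hypothetical sketch of per-drone commands; the real oprc_env types
-- may use different names and a richer set of movements.
data Altitude = HighAltitude | LowAltitude
  deriving (Eq, Show)

data CardinalDir = North | South | East | West
  deriving (Eq, Show)

data DroneCommand
  = MoveTowards CardinalDir  -- horizontal movement at the current altitude
  | Ascend                   -- climb from low altitude to high altitude
  | Descend                  -- drop from high altitude to low altitude
  | Hover                    -- remain in place for this time step
  deriving (Eq, Show)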

The individual patches of land which make up the search space may require either a low or a high level of scrutiny. A patch which requires high scrutiny will be fully observed only by a drone viewing it from a low altitude. If a drone views this patch from a high altitude, the patch will be classified as requiring high scrutiny, and this information will be made available to the planning system; however, the patch will not count as fully observed until it is viewed from a low altitude.

The goal of this task is to fully observe each point in the search space in as few time steps as possible. At any point, an agent performing this task has access to a description of where all of the patches in the search space are, as well as the information known about each of these patches based on observations made in the task so far.
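To make the distinction between a patch that has merely been classified and one that has been fully observed concrete, the sketch below shows one possible way to represent the agent's knowledge about a single patch. It reuses the DetailReq type defined further down this page; the constructor names are illustrative assumptions, not the representation actually used by oprc_env.

-- A hypothetical per-patch knowledge state; names are illustrative only.
-- DetailReq (Close | Far) is the scrutiny type defined later on this page.
data PatchKnowledge
  = Unseen                -- no drone has observed this patch yet
  | Classified DetailReq  -- seen from high altitude: the required scrutiny is
                          -- known, but a Close patch still needs a low pass
  | FullyObserved         -- the patch has been seen at the required detail
  deriving (Eq, Show)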

The Policy Typeclass

The most important piece of the environment-agent interface is the policy typeclass:

-- Given a value of the policy type and the current view of the world,
-- a policy produces the next set of commands for the drone swarm.
class Policy p where
  nextMove :: p -> WorldView -> NextActions

See: WorldView, NextActions

Note that the term 'policy' may refer either to an instance of the above typeclass or to the general reinforcement learning notion of a policy.

At minimum, an agent must have an instance of Policy to interact with oprc_env. To understand what a reasonable instance of Policy looks like, it will help to describe the task - and the data structures used to represent its many parts - in more detail.
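As a minimal example of the shape of such an instance, the sketch below defines a policy that ignores its observations entirely and always issues the same pre-computed set of commands. The name ConstantPolicy is not part of oprc_env; it exists only to illustrate how nextMove is implemented against the WorldView and NextActions types.

-- A trivial policy for illustration: it carries a fixed NextActions value
-- and returns it no matter what the WorldView contains.
newtype ConstantPolicy = ConstantPolicy NextActions

instance Policy ConstantPolicy where
  nextMove (ConstantPolicy actions) _worldView = actions

A useful policy would, of course, inspect the WorldView and choose different actions depending on what has been observed so far; this instance only shows where that logic plugs in.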

The Environment

One fundamental data structure in this project is the Environment:

type Environment = Map.Map Position Patch

Environments can be thought of as a two-dimensional surface made up of individual units of land. In this project, such a unit of land is called a Patch. A patch contains information describing what level of scrutiny is required to observe it:

--Levels of Scrutiny that may be required
data DetailReq = Close | Far
  deriving (Eq, Show)


--a patch is a single spot in the map
data Patch = Patch DetailReq
  deriving Eq

Each patch in the environment is associated with a location. Locations are represented by the Position datatype, which is effectively a pair of Cartesian coordinates:

type XCoord = Integer
type YCoord = Integer


data Position = Position XCoord YCoord
  deriving Eq

As the datatype definition suggests, the search space is laid out on a discrete integer grid, so each Position names a single cell of that grid.
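Putting these pieces together, a small environment can be constructed directly with Data.Map. The snippet below is only a sketch: it assumes Data.Map is imported qualified as Map (as the Environment alias suggests) and that Position has the Ord instance that Map keys require, which the definition shown above elides.

import qualified Data.Map as Map

-- An illustrative 2x2 search space in which one patch requires close
-- scrutiny and the rest can be fully observed from high altitude.
smallEnvironment :: Environment
smallEnvironment = Map.fromList
  [ (Position 0 0, Patch Far)
  , (Position 1 0, Patch Far)
  , (Position 0 1, Patch Far)
  , (Position 1 1, Patch Close)
  ]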

This reinforcement learning task terminates when every patch in the search space has been fully observed.
