Final thesis submission #4

Merged
merged 13 commits into from
Oct 19, 2023
10 changes: 10 additions & 0 deletions Chapters/Part_1/chapter_1.tex
@@ -16,6 +16,9 @@ \section{Simulators for robotics}
This section describes the high-level software architecture of a robot simulator.
We introduce the main components that existing simulators typically implement, and then identify a list of properties that can be used to compare them.

\begin{addmargin}{0cm}
\setstretch{1.45}

\subsection{Components}

A \emph{robotics simulator} is a collection of different independent components that, when combined, expose to the user a virtual environment where simulated robots can move and interact.
@@ -59,6 +62,8 @@ \subsection{Components}
Many simulators implement rendering either by integrating external rendering engines or by exposing custom functionalities.
Depending on the selected engine, rendering can be more or less realistic, at the cost of becoming the overall bottleneck of the simulation when the rendering is too detailed.

\end{addmargin}

\paragraph{Graphical User Interface}

Simulators with rendering capabilities usually also implement a \ac{GUI} to simplify the visualisation of, and possibly the interaction with, the simulated scene.
@@ -111,6 +116,9 @@ \section{Enabling technologies}

In this section, we describe the technology enablers that made developing the experiments presented in this thesis possible.

\begin{addmargin}{0cm}
\setstretch{1.4}

\subsection{Gazebo Sim}

Gazebo~\parencite{koenig_design_2004}, developed by Open Robotics, is among the most widely adopted simulators in the robotics community.
@@ -129,6 +137,8 @@ \subsection{Gazebo Sim}
One of the simulation architectures presented in this thesis is based on Gazebo Sim.
The features that motivated the adoption of the simulator, and why they represent a valid choice for the contributed architecture, are discussed in more detail in Section~\ref{sec:scenario_gazebo}.

\end{addmargin}

\subsection{The iCub humanoid robot}

The iCub humanoid robot~\parencite{natale_icub_2017} is an open-source robot platform developed and produced by iCub Tech at the Italian Institute of Technology.
16 changes: 15 additions & 1 deletion Chapters/Part_1/chapter_2.tex
@@ -1,3 +1,11 @@
\begin{flushright}
\textsl{
All models are wrong.\\
But some are useful.
}\\
--- {George Box}
\end{flushright}

\chapter{Robot Modelling}
\label{ch:robot_modelling}

@@ -445,6 +453,9 @@ \subsection{Cross product on $\realn^6$}
.
\end{equation*}

\begin{addmargin}{0cm}
\setstretch{1.45}

\section{Accelerations and Forces}

\subsection{Accelerations}
@@ -580,6 +591,8 @@ \subsection{Forces}
Note that the relation $(\crossvelsix[{\velsix[B]_{A,B}}])^{-\top} = \crossforsix[{\velsix[B]_{A,B}}]$ holds.
The combination of the overline with the star mnemonically marks the $(\cdot)^{-\top}$ operator.

\end{addmargin}

\section{Rigid-body Kinematics}
\label{sec:rigid_body_kinematics}

@@ -943,7 +956,7 @@ \section{Joint Model}

\begin{table}
\centering
\caption{List of motion subspaces for the supported 1 \ac{DoF} joints.}
\caption{List of motion subspaces for the supported 1 \acs{DoF} joints.}
\label{tab:motion_subspaces}
\begin{tblr}{
colspec={Q[l, m]Q[c, m]},
@@ -1274,6 +1287,7 @@ \subsubsection{Link velocity}
%
\end{remark}

\newpage
\subsection{Dynamics}
\label{sec:multibody_dynamics}

28 changes: 23 additions & 5 deletions Chapters/Part_1/chapter_3.tex
@@ -1,14 +1,23 @@
\begin{addmargin}{0cm}
\setstretch{1.42}

\begin{flushright}
\textsl{In life, unlike chess, the game continues after checkmate.}\\
--- Isaac Asimov
\end{flushright}

\chapter{Basics of Reinforcement Learning}
\label{ch:reinforcement_learning}
\vspace*{-4mm}

In the field of \ac{AI}, it has been hypothesised that intelligence can be understood as subserving the maximisation of reward~\parencite{silver_reward_2021}.
It was suggested that intelligent abilities and behaviours could be attained by agents that learn by trial and error, receiving feedback on their performance: the reward.
\ac{RL} is one of the possible \emph{generic} formulations aiming to train agents to solve the problem of maximising the reward.

\acl{RL} operates on a unified setting decoupled into two systems interacting sequentially over time, illustrated in Figure~\ref{fig:rl_setting}.
The \emph{environment} is the world the \emph{agent} interacts with.
Differently from other domains of \ac{ML}, the learned \ac{RL} policy generates actions that may affect not only the current instant (immediate reward), but also the new configuration in which the environment transitions and its corresponding reward.
The trial-and-error nature together with delayed rewards, which give rise to the \emph{credit assignment problem}, are two among the essential features that characterise \acs{RL}~\parencite{sutton_reinforcement_2018}.
Unlike other \ac{ML} domains, the learned \ac{RL} policy generates actions that may affect not only the current instant (immediate reward), but also the new configuration in which the environment transitions and its corresponding reward.
The trial-and-error nature together with delayed rewards, giving rise to the \emph{credit assignment problem}, are two among the essential features that characterise \acs{RL}~\parencite{sutton_reinforcement_2018}.
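%
As a purely illustrative sketch, assuming the classic \texttt{gym} interface (with its four-tuple \texttt{step} return) that will be adopted later in the thesis, the interaction loop over a single episode can be written as follows, where the environment id and the random placeholder policy are arbitrary choices made only for exposition:
%
\begin{verbatim}
import gym

env = gym.make("CartPole-v1")                     # any registered environment
policy = lambda obs: env.action_space.sample()    # placeholder random policy

observation = env.reset()
done, episode_return = False, 0.0
while not done:
    action = policy(observation)                  # the agent picks an action
    observation, reward, done, info = env.step(action)  # environment reacts
    episode_return += reward                      # rewards may be delayed
\end{verbatim}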

This chapter, based on the theory and notation from~\textcite{achiam_spinning_2018} and \textcite{dong_deep_2020}, describes in greater detail the \acl{RL} setting, introduces the terminology and the mathematical framework used throughout the thesis, and presents algorithms to solve the corresponding reward maximisation problem.
Despite being a generic setting from a high-level perspective, all the elements of Figure~\ref{fig:rl_setting} may have different properties that alter the formulation of the specific problem at hand.
@@ -17,6 +26,8 @@ \chapter{Basics of Reinforcement Learning}
The setting that adopts this family of policies is known as \ac{DRL}.
Throughout the thesis, we will use the \ac{RL} and \ac{DRL} terminology interchangeably.

\end{addmargin}

\begin{figure}
\centering
\resizebox{.65\textwidth}{!}{
@@ -285,6 +296,7 @@ \section{Reinforcement Learning Formalism}
The previous section provided an informal introduction to the \acl{RL} setting.
In this section, we consolidate and formalise the notions into a structured framework composed of two key ingredients: \aclp{MDP} and the Bellman equation.

\vspace{-3mm}
\subsection{Markov Decision Processes}

\aclp{MDP}~\parencite{puterman_markov_2005, sutton_reinforcement_2018} are one of the classical formulations of sequential decision-making, which introduce the mathematical framework of choice for the discrete-time stochastic setting described in Section~\ref{section:key_concepts}.
@@ -387,6 +399,7 @@ \subsection{Value functions}
In fact, we can think of $V(s_t)$ as a function providing the expected return of state $s_t$ averaged over all the possible actions $a_t$ that can be taken in this state\footnote{This can be clearly seen from Equation~\eqref{equation:v_as_func_of_q}.}, and $Q(s_t, a_t)$ as providing the expected return of state $s_t$ considering that the action taken was $a_t$.
If this action $a_t$ performs better than average, expressed mathematically as $Q(s_t, a_t) > V(s_t) \implies A(s_t, a_t) > 0$, we could use this information to reinforce the choice of $a_t$ the next time the trajectory evolves through $s_t$.
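Explicitly, since the advantage measures the difference between the two value functions,
%
\begin{equation*}
A(s_t, a_t) = Q(s_t, a_t) - V(s_t) ,
\end{equation*}
%
an action that performs better than the average over the policy yields a positive advantage.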

\pagebreak
\subsection{Bellman Equation}
\label{section:bellman_equation}

@@ -417,12 +430,16 @@ \subsection{Bellman Equation}
Our setting, characterised by continuous action and state spaces, unknown $\mathcal{P}$ and $\mathcal{R}$, and the use of function approximation for the value functions and the policy, does not admit any closed-form solution.
Therefore, we can at best resort to iterative approaches.

\begin{addmargin}{0cm}
\setstretch{1.42}

\section{Algorithms}

After the emergence of \acl{DRL}, the late 2010s saw an extensive research effort that led to a great variety of algorithms.
In this section, we provide a bird's-eye view of the different families of algorithms that can iteratively solve the \ac{RL} problem.
We highlight the main properties that motivate the choice of methods for the setting studied in this thesis.

\vspace*{-3mm}
\subsection{Model-free and Model-based}

The formulation of the \ac{RL} problem provided in Section~\ref{section:reinforcement_learning_problem} states that the agent has no knowledge of the environment's details.
@@ -441,6 +458,8 @@ \subsection{Model-free and Model-based}
Still, when the model is learned from data, it remains highly challenging to obtain a description of a complex environment accurate enough to be exploited by the agent.
In addition, the prediction inaccuracy introduces a strong bias in the learning process that the agent can exploit, resulting in sub-optimal behaviour in the actual environment.

\end{addmargin}

\subsection{Value-based and Policy-based}

The second major categorisation separates the algorithms into \emph{value-based} and \emph{policy-based} methods.
@@ -457,7 +476,7 @@ \subsection{Value-based and Policy-based}
On the other hand, value-based methods can reuse most of the collected data with high sampling efficiency, and the maximisation strategy allows faster improvement towards the optimal policy with lower variance.

Policy-based methods, instead of resorting to a value function, perform an optimisation that directly targets the final aim of reward maximisation.
They represent a policy with a parameterized function, typically a state-conditioned probability distribution, that is optimised directly from collected trajectories.
They represent a policy with a parameterized function, typically a state-conditioned probability distribution, directly optimised from collected trajectories.
Depending on the choice of the policy, these methods are suitable for continuous and high-dimensional action spaces.
In practice, they present better convergence properties by applying small incremental changes at every iteration.
Although these methods are usually slower, less efficient, and more prone to converging to a local optimum, their learning performance could be more stable.
@@ -635,7 +654,7 @@ \subsection{Generalized Advantage Estimation}
\hat{A}_t^{(k)} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) .
\end{equation*}
%
We could think of $k$ as an additional hyperparameter to tune.
Think of $k$ as an additional hyperparameter to tune.
Intuitively, the variance is reduced by considering a smaller number of noisy sampled rewards, and the bias is mitigated by the high discount applied to the bootstrapped value.

While $k$-step TD already helps trade off bias and variance, it selects just one horizon.
@@ -652,7 +671,6 @@ \subsection{Generalized Advantage Estimation}
We can recognise $\text{GAE}(\gamma, 0) = \hat{A}_t^{(1)}$ and $\text{GAE}(\gamma, 1) = \hat{A}_t^{(\infty)}$.
The hyperparameter $0 < \lambda < 1$ balances between these two extremes.
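As a minimal illustrative sketch, assuming rewards $r_0, \dots, r_{T-1}$ and value estimates $V(s_0), \dots, V(s_T)$ collected from a finite trajectory (with the last value acting as bootstrap), the estimator can be computed with a single backward pass:
%
\begin{verbatim}
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: [r_0, ..., r_{T-1}]; values: [V(s_0), ..., V(s_T)]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running  # exponentially weighted sum
        advantages[t] = running
    return advantages
\end{verbatim}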

\newpage
\subsection{Proximal Policy Optimization}
\label{sec:ppo}

9 changes: 8 additions & 1 deletion Chapters/Part_1/chapter_4.tex
@@ -1,3 +1,8 @@
\begin{flushright}
\textsl{Intelligence is what you use when you don’t know what to do.}\\
--- Jean Piaget
\end{flushright}

\chapter{State-of-the-Art and Thesis Content}
\label{ch:sota}

@@ -96,7 +101,7 @@ \section{Review of Push-recovery Strategies}
Many fundamental methods, techniques, and control architectures widely adopted by bipedal locomotion research were first studied in the simplified case of push recovery.
In fact, the ability to react appropriately to external perturbations is paramount for achieving robust locomotion capabilities, and often advances in push recovery research are preparatory for advances in locomotion research~\parencite{jeong_robust_2019}.

Humans are able to employ different strategies to keep their balance, such as \emph{ankle}, \emph{hip}, and \emph{stepping} strategies~\parencite{nashner_organization_1985, maki_role_1997, stephens_humanoid_2007}.
Humans use various strategies to maintain balance, including \emph{ankle}, \emph{hip}, and \emph{stepping} strategies~\parencite{nashner_organization_1985, maki_role_1997, stephens_humanoid_2007}.
These strategies are activated in proportion to the magnitude of the external disturbance.
The effectiveness of human capabilities mainly stems from how different strategies are combined into a continuous motion~\parencite{mcgreavy_unified_2020}.
The applicability of these principles for the generation of control actions applied to robots has traditionally relied on \emph{simplified models} approximating their dynamics, such as the \ac{LIP} model~\parencite{kajita_3d_2001} and the \ac{CP}~\parencite{pratt_capture_2006}.
@@ -237,6 +242,7 @@ \subsection{\autoref{ch:contact_aware_dynamics}: Contact-aware Multibody Dynamic
Software architectures typically used to train \ac{RL} policies for robotics applications rely, in most cases, on general-purpose rigid-body simulators from which experience is sampled.
As observed in the experiments of the previous contribution, the bottleneck limiting the performance of such architectures often does not reside in the optimisation problem that utilises the data, but rather in the process of data generation.

\pagebreak
\paragraph{Open Problem}

Most of the physics engines included in general-purpose simulators, besides computing the evolution of a multibody system according to its laws of motion, also need to implement routines for detecting and resolving collisions.
@@ -262,6 +268,7 @@ \subsection{\autoref{ch:contact_aware_dynamics}: Contact-aware Multibody Dynamic
\textbf{Daniele Pucci:} Resources, Supervision, Funding acquisition.
\end{quote}

\pagebreak
\subsection{\autoref{ch:scaling_rigid_body_simulations}: Scaling Rigid Body Simulations}

Pursuing the aim, started in the previous contribution, of maximising the performance of sampling synthetic data for robot locomotion, we combine the contact-aware state-space representation of free-floating robot dynamics with state-of-the-art \acp{RBDA} to create a novel physics engine that can exploit modern hardware accelerators.
16 changes: 11 additions & 5 deletions Chapters/Part_2/chapter_5.tex
@@ -12,9 +12,12 @@ \chapter{Reinforcement Learning Environments for Robotics}
Finally, we propose a new framework for developing robotic environments, starting with a description of the design goals, and continuing with the implemented software architecture.
This new framework will be validated in the next chapter, where it is used to learn a push-recovery policy for balancing a simulated humanoid robot.

\begin{addmargin}{-2mm}
\setstretch{1.4}
\newpage
\section{Frameworks providing robotic environments}

\vspace*{-0.3mm}
\subsection{Properties}

In this section, we define the properties characterising frameworks that provide robotic \ac{RL} environments.
@@ -77,6 +80,7 @@ \subsection{Properties}
Modern \acp{GPU} are becoming more capable of efficiently computing extremely complex light interactions in the simulated environment, and technologies such as ray tracing are becoming suitable for real-time usage.
%
\end{description}
\end{addmargin}

\subsection{Existing frameworks}

@@ -203,7 +207,9 @@ \section{Design Goals}
Instead, the most popular language for the \ac{RL} logic is Python.
Gym-Ignition is a pure Python package that provides the \verb|Runtime| and \verb|Task| interfaces, together with high-level helpers to compute model-based quantities based on the iDynTree~\parencite{nori_icub_2015} library.

\begin{addmargin}{-0.6cm}
\section{ScenarIO: SCENe interfAces for Robot Input/Output}
\end{addmargin}

\scenario\footnote{\url{https://github.com/robotology/gym-ignition/tree/master/scenario}} is a \cpp library acting as a \ac{HAL} over either simulated or real robots.
The abstraction of the scene is structured into different interfaces:
@@ -250,9 +256,9 @@ \subsection{\scenario Gazebo}
It communicates with the new \emph{Gazebo Sim} simulator, also known in its earlier releases as Ignition Gazebo.

Gazebo Sim is the new generation of the widely used Gazebo Classic simulator, developed by Open Robotics\footnote{\url{https://www.openrobotics.org/}}.
It was used in the new DARPA SubT Challenge\footnote{\url{https://www.subtchallenge.com/}} for both local and cloud simulations.
The monolithic architecture of Gazebo Classic has been split into a suite of multiple independent libraries, with Gazebo Sim being only one among them, and refactored with a more pervasive plugin-based architecture.
For our target applications, we chose to use Gazebo Sim as our main simulation backend for the following two key features:
It was used in the new {\small DARPA} SubT Challenge\footnote{\url{https://www.subtchallenge.com/}} for both local and cloud simulations.
The monolithic architecture of Gazebo Classic has been split into a suite of multiple libraries, with Gazebo Sim being only one among them, and refactored with a more pervasive plugin-based architecture.
For our target applications, we selected Gazebo Sim as our main simulation backend due to the following two key features:
%
\begin{description}
%
@@ -301,8 +307,8 @@ \section{Gym-Ignition}
\resizebox{0.75\textwidth}{!}{
\includegraphics{images/contributions/chapter_5/scenario_and_gym_ignition.tikz}
}
\caption{Architecture of \scenario and Gym-Ignition. Users of the overall framework just need to provide the \acs{URDF} or \acs{SDF} description of their robot and implement the \texttt{Task} interface with the desired decision-making logic. The framework, following a top-down approach, exposes to the \texttt{Agent} algorithms the unified \texttt{gym.Env} interface. The provided \texttt{Runtime} classes either instantiate the simulator, or handle soft real-time logic for real-world robots. The runtimes are generic and can operate on any decision-making logic that exposes the \texttt{Task} interface. Finally, \texttt{Task} implementations use the \scenario \acp{API} to interact with the robots part of the environment.\\
A typical data flow starts with the agent setting the action with \texttt{gym.Env.step}. The processing of the action is a combination of logic inside the active runtime and the active task. In particular, the runtime receives the action and directly forwards it to the task for being processed. The task, by operating transparently over the \scenario \acp{API}, applies the action to the robot, and then waits the runtime to perform the time stepping. After this phase, the task computes the reward, packs the observation, detects if the environment reached the terminal state, and returns all this data back to the agent passing through the \texttt{gym.Env} \acp{API}.}
\caption{Architecture of \scenario and Gym-Ignition. Users of the overall framework just need to provide the \acs{URDF} or \acs{SDF} description of their robot and implement the \texttt{Task} interface with the desired decision-making logic. The framework, following a top-down approach, exposes to the \texttt{Agent} algorithms the unified \texttt{gym.Env} interface. The provided \texttt{Runtime} classes either instantiate the simulator, or handle soft real-time logic for real-world robots. The runtimes are generic and can operate on any decision-making logic that exposes the \texttt{Task} interface. Finally, \texttt{Task} implementations use the \scenario \acsp{API} to interact with the robots part of the environment.\\
A typical data flow starts with the agent setting the action with \texttt{gym.Env.step}. The processing of the action is a combination of logic inside the active runtime and the active task. In particular, the runtime receives the action and directly forwards it to the task to be processed. The task, by operating transparently over the \scenario \acsp{API}, applies the action to the robot, and then waits for the runtime to perform the time stepping. After this phase, the task computes the reward, packs the observation, detects whether the environment reached the terminal state, and returns all this data back to the agent through the \texttt{gym.Env} \acsp{API}.}
\label{fig:scenario_and_gym_ignition}
\end{figure}
