
Update report #457


Merged: 8 commits merged into JuliaReinforcementLearning:master from the report branch on Aug 16, 2021

Conversation

pilgrimygy (Member)

PR Checklist

  • Update NEWS.md?

This technical report is the first evaluation report of the project "Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl" in OSPP. It includes three components: project information, project schedule, and future plan.
## Project Information
- Project name: Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl
- Scheme Description: Recent advances in offline reinforcement learning make it possible to turn reinforcement learning into a data-driven discipline, so that many effective methods from the supervised learning field can be applied. Until now, the only offline method provided in ReinforcementLearning.jl is behavior cloning. We'd like to add more algorithms, such as Batch-Constrained Q-Learning (BCQ)\dcite{DBLP:conf/icml/FujimotoMP19} and Conservative Q-Learning (CQL)\dcite{DBLP:conf/nips/KumarZTL20}. We expect to implement at least three to four modern offline RL algorithms.
Member

Add a reference to behavior cloning.

```
    batch_size::Int
end
```
This implementation of `OfflinePolicy` refers to `QBasedPolicy` ([link](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/master/src/ReinforcementLearningCore/src/policies/q_based_policies/q_based_policy.jl)). It provides a parameter `continuous` to support different action space types, both continuous and discrete. `learner` is the specific algorithm that learns from data and provides the policy. `dataset` and `batch_size` are used to sample data for learning.
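
For context, the struct described above has roughly the following shape. This is a sketch reconstructed from the description in this report; the type parameters, supertype, and field types other than `batch_size` are assumptions rather than the exact code:
```
# Sketch only: fields follow the description above; type parameters,
# supertype, and most field types are assumptions.
mutable struct OfflinePolicy{L,T} <: AbstractPolicy
    learner::L        # algorithm that learns from the dataset and provides the policy
    dataset::T        # offline dataset to sample transitions from
    continuous::Bool  # whether the action space is continuous or discrete
    batch_size::Int   # number of transitions sampled per update
end
```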

Besides, we implement the corresponding functions `π`, `update!`, and `sample`. `π` is used to select actions; its form is determined by the type of the action space. `update!` is used in two stages. In the `PreExperiment` stage, we call it to pre-train algorithms that have a `pretrain_step` parameter (such as PLAS). In the `PreAct` stage, we call it to train the `learner`. Inside `update!`, we call `sample` to draw a batch of data from the dataset. With the development of RLDataset.jl, the `sample` function will be deprecated.
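
As a rough illustration of the two-stage `update!` described above, a simplified sketch follows. It is not the actual ReinforcementLearningZoo.jl code: `pretrain!` is a hypothetical helper, and the exact signatures may differ.
```
# Simplified sketch of the two-stage `update!` described above.
# `pretrain!` is a hypothetical helper; the real signatures may differ.
function RLBase.update!(p::OfflinePolicy, trajectory, env, ::PreExperimentStage)
    # Pre-training: only learners that define `pretrain_step` (e.g. PLAS)
    # run extra updates before the experiment starts.
    if :pretrain_step in fieldnames(typeof(p.learner))
        for _ in 1:p.learner.pretrain_step
            batch = sample(p.dataset, p.batch_size)
            pretrain!(p.learner, batch)
        end
    end
end

function RLBase.update!(p::OfflinePolicy, trajectory, env, ::PreActStage)
    # Regular training: one learner update per sampled batch.
    batch = sample(p.dataset, p.batch_size)
    update!(p.learner, batch)
end
```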
Member

Explain PLAS here.

Member

Add a link to RLDataset.jl.

And it would be better to use the full name, ReinforcementLearningDatasets.jl, in this report.

Comment on lines 71 to 77
```
learner = DQNLearner(
# Omit specific code
),
dataset = dataset,
continuous = false,
batch_size = 64,
)
```
Member

Fix the indent.


Therefore, we unified the parameter name in different algorithms so that different `learner` can be compatible with `OfflinePolicy`.
Member

Suggested change
Therefore, we unified the parameter name in different algorithms so that different `learner` can be compatible with `OfflinePolicy`.
Therefore, we unified the parameter name in different algorithms so that different `learner`s can be compatible with `OfflinePolicy`.


#### Offline RL Algorithms
We used the existing algorithms and hooks to create datasets in several environments (such as CartPole and Pendulum) for training offline RL algorithms. This work can guide the subsequent development of the package RLDataset.jl, for example:
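
The report's own example is not quoted in this thread. As a rough sketch of how such a dataset can be collected with the existing `Agent`/`Trajectory` machinery, the snippet below uses a random policy standing in for a trained one; the specific capacities, sizes, and the choice of `CircularArraySARTTrajectory` are illustrative assumptions.
```
# Illustrative sketch only: collect transitions by running a policy wrapped
# in an Agent, whose trajectory then serves as the offline dataset.
using ReinforcementLearning

env = CartPoleEnv()
agent = Agent(
    policy = RandomPolicy(action_space(env)),  # a trained policy (e.g. DQN) in practice
    trajectory = CircularArraySARTTrajectory(
        capacity = 10_000,
        state = Vector{Float32} => (4,),       # CartPole observations have 4 entries
    ),
)
run(agent, env, StopAfterStep(10_000))
dataset = agent.trajectory  # recorded transitions for offline training
```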
Member

Same as above, add links to CartPole and Pendulum in the docs.


##### Benchmark
We implemented and experimented with offline DQN (in a discrete action space) and offline SAC (in a continuous action space) as benchmarks. The performance of offline DQN in the CartPole environment:
Member

The tenses used in this report are a bit confusing to me. Sometimes the present tense is used, but here the past tense appears. Better to unify them.



With this framework, we can quickly call the offline version of the existing algorithms with almost no additional code. Therefore, the implementation and performance testing of offline DQN and offline SAC can be completed quickly. For example:
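
As a purely illustrative sketch (not the report's original example), the continuous-action case is assembled in the same way as the offline DQN snippet quoted earlier; whether the SAC learner is plugged in under exactly this name is an assumption.
```
# Illustrative sketch only; `SACPolicy` as the plugged-in learner is an
# assumption, and its arguments are omitted as in the report's DQN snippet.
policy = OfflinePolicy(
    learner = SACPolicy(
        # Omit specific code
    ),
    dataset = dataset,
    continuous = true,   # SAC acts in a continuous action space
    batch_size = 64,
)
```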
Member

I think this is the first place SAC is mentioned; better to add a reference here.


\dfig{body;PLAS2.png}

Please refer to this link for the specific code ([link](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/master/src/ReinforcementLearningZoo/src/algorithms/offline_rl/PLAS.jl)). The main function parameters are briefly as follows:
Member

Better to use the link in the docs.

Member Author

What do you mean by the links in the docs? Is there a brief introduction later?


Member Author

Well.

pilgrimygy requested a review from findmyway on August 16, 2021, 07:53
findmyway merged commit 64c87bc into JuliaReinforcementLearning:master on Aug 16, 2021
pilgrimygy deleted the report branch on August 19, 2021, 13:25