Update report #457
Conversation
This technical report is the first evaluation report of the project "Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl" in OSPP. It includes three components: project information, project schedule, and future plan.
## Project Information
- Project name: Enriching Offline Reinforcement Learning Algorithms in ReinforcementLearning.jl
- Scheme Description: Recent advances in offline reinforcement learning make it possible to turn reinforcement learning into a data-driven discipline, such that many effective methods from the supervised learning field can be applied. Until now, the only offline method provided in ReinforcementLearning.jl is behavior cloning. We would like to add more algorithms, such as Batch-Constrained Q-Learning (BCQ)\dcite{DBLP:conf/icml/FujimotoMP19} and Conservative Q-Learning (CQL)\dcite{DBLP:conf/nips/KumarZTL20}. We expect to implement at least three to four modern offline RL algorithms.
Add a reference to behavior cloning.
    batch_size::Int
end
```
This implementation of `OfflinePolicy` refers to `QBasedPolicy` ([link](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/master/src/ReinforcementLearningCore/src/policies/q_based_policies/q_based_policy.jl)). It provides a parameter `continuous` to support different action space types, including continuous and discrete. `learner` is a specific algorithm for learning and providing the policy. `dataset` and `batch_size` are used to sample data for learning.
Replace the link with the one in the docs https://juliareinforcementlearning.org/docs/rlcore/#ReinforcementLearningCore.QBasedPolicy
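For readers following the thread, here is a minimal sketch of what the full struct could look like, reconstructed from the fields described in the quoted paragraph. The type parameters, the `Base.@kwdef` convenience, and the `AbstractPolicy` supertype are assumptions, not a copy of the actual source; see the linked code/docs for the real definition.

```julia
using ReinforcementLearning  # re-exports AbstractPolicy from ReinforcementLearningBase

# Hypothetical reconstruction based on the description above; not the actual source.
Base.@kwdef struct OfflinePolicy{L,D} <: AbstractPolicy
    learner::L          # concrete offline RL algorithm that learns and provides the policy
    dataset::D          # offline dataset the policy is trained from
    continuous::Bool    # whether the action space is continuous or discrete
    batch_size::Int     # number of transitions sampled per update
end
```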
Besides, we implement the corresponding functions `π`, `update!`, and `sample`. `π` is used to select the action, whose form is determined by the type of action space. `update!` can be used in two stages. In the `PreExperiment` stage, we can call this function to pre-train algorithms that have a `pretrain_step` parameter (such as PLAS). In the `PreAct` stage, we call this function to train the `learner`. In the `update!` function, we need to call the `sample` function to sample a batch of data from the dataset. With the development of RLDataset.jl, the `sample` function will be deprecated.
Explain PLAS here.
Add a link to RLDataset.jl. And better to use the full name of ReinforcementLearningDatasets.jl in this report.
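To make the two-stage flow concrete, below is a self-contained toy sketch of the dispatch pattern described in the quoted paragraph. The stage structs are illustrative stand-ins for the stage types in ReinforcementLearningCore, `sample_batch` stands in for the report's `sample` function, and the learner-level `update!` and `pretrain!` methods are assumed to be supplied by the concrete algorithm.

```julia
# Toy stand-ins for the stage types; the real implementation dispatches on
# ReinforcementLearningCore's stage types instead.
struct PreExperimentStage end
struct PreActStage end

struct ToyOfflinePolicy{L,D}
    learner::L
    dataset::D
    batch_size::Int
end

# Illustrative sampler: draw `n` random transitions from a vector-backed dataset.
sample_batch(dataset::Vector, n::Int) = dataset[rand(1:length(dataset), n)]

# PreExperiment: algorithms with a `pretrain_step` field (e.g. PLAS) are
# pre-trained once before the experiment starts.
function update!(p::ToyOfflinePolicy, ::PreExperimentStage)
    if hasproperty(p.learner, :pretrain_step)
        for _ in 1:p.learner.pretrain_step
            pretrain!(p.learner, sample_batch(p.dataset, p.batch_size))  # assumed helper
        end
    end
end

# PreAct: sample a batch from the offline dataset and train the learner.
function update!(p::ToyOfflinePolicy, ::PreActStage)
    update!(p.learner, sample_batch(p.dataset, p.batch_size))  # learner-level update! assumed
end
```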
    learner = DQNLearner(
        # Omit specific code
    ),
    dataset = dataset,
    continuous = false,
    batch_size = 64,
)
Fix the indent.
Therefore, we unified the parameter name in different algorithms so that different `learner` can be compatible with `OfflinePolicy`. |
Suggested change:
- Therefore, we unified the parameter name in different algorithms so that different `learner` can be compatible with `OfflinePolicy`.
+ Therefore, we unified the parameter name in different algorithms so that different `learner`s can be compatible with `OfflinePolicy`.
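As a toy illustration of why the unified names matter (all names below are hypothetical), two different learners can be driven by the same generic code as long as they expose the same parameter names:

```julia
# Hypothetical learners that agree on the field name `batch_size`;
# generic policy code can then treat them interchangeably.
struct ToyDQNLearner
    batch_size::Int
end

struct ToySACLearner
    batch_size::Int
end

# Generic code relying only on the agreed-upon field name.
minibatch_size(learner) = learner.batch_size

minibatch_size(ToyDQNLearner(64)) == minibatch_size(ToySACLearner(64))  # true
```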
#### Offline RL Algorithms
We used the existing algorithms and hooks to create datasets in several environments (such as CartPole and Pendulum) for training offline RL algorithms. This work can guide the subsequent development of the RLDataset.jl package, for example:
Same as above, add links to CartPole and Pendulum in the docs.
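The code block that followed "for example:" in the report is not visible in this diff view. As a rough, hypothetical sketch of the idea (not the report's actual code), a dataset can be collected by rolling out a trained policy and recording transitions. The environment and policy calls below follow general ReinforcementLearning.jl conventions (`reset!`, `state`, `reward`, `is_terminated`, calling the environment with an action), but `collect_dataset` itself is an invented helper.

```julia
using ReinforcementLearning

# Hypothetical helper: roll out a trained policy and record transitions as a
# vector of named tuples, which could later back an offline dataset.
function collect_dataset(policy, env; n_episodes = 100)
    dataset = NamedTuple[]
    for _ in 1:n_episodes
        reset!(env)
        while !is_terminated(env)
            s = state(env)
            a = policy(env)     # policies are callable on the environment
            env(a)              # environments are callable with an action
            push!(dataset, (state = s, action = a, reward = reward(env),
                            next_state = state(env), terminal = is_terminated(env)))
        end
    end
    return dataset
end
```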
##### Benchmark
We implemented and experimented with offline DQN (in discrete action spaces) and offline SAC (in continuous action spaces) as benchmarks. The performance of offline DQN in the CartPole environment:
The tense used in this report is a bit confusing to me: sometimes the present tense is used, and here the past tense. Better to unify them all.
With this framework, we can call the offline version of the existing algorithms with almost no additional code. Therefore, the implementation and performance testing of offline DQN and offline SAC can be completed quickly. For example:
I think this is the first place to mention SAC, better to add a reference here.
\dfig{body;PLAS2.png}
Please refer to this [link](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/master/src/ReinforcementLearningZoo/src/algorithms/offline_rl/PLAS.jl) for the specific code. The brief function parameters are as follows:
Better to use the link in the docs.
What do the links in the docs point to? Is it a brief introduction later?
Well.
PR Checklist