Releases · thu-ml/tianshou
0.4.5
Bug Fix
- Fix tqdm issue (#481)
- Fix atari wrapper to be deterministic (#467)
- Add `writer.flush()` in TensorboardLogger to ensure real-time logging results (#485)
Enhancement
- Implement set_env_attr and get_env_attr for vector environments (#478); a usage sketch follows this list
- Implement BCQPolicy and offline_bcq example (#480)
- Enable `test_collector=None` in 3 trainers to turn off testing during training (#485)
- Fix an inconsistency in the implementation of Discrete CRR: it now uses the `Critic` class for its critic, following the convention of other actor-critic policies (#485)
- Update several offline policies to use the `ActorCritic` class for their optimizers to eliminate randomness caused by parameter sharing between actor and critic (#485)
- Move Atari offline RL examples to `examples/offline` and tests to `test/offline` (#485)
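As a hedged illustration of the new vector-environment accessors from #478 (a minimal sketch; the `get_env_attr(key, id=None)` / `set_env_attr(key, value, id=None)` signatures and the custom attribute name are assumptions):

```python
import gym
from tianshou.env import DummyVectorEnv

# four CartPole environments behind one vectorized interface
envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(4)])

# read an attribute from every sub-environment
print(envs.get_env_attr("spec"))

# set a custom attribute on the first two sub-environments only, then read it back
envs.set_env_attr("my_flag", True, id=[0, 1])
print(envs.get_env_attr("my_flag", id=[0, 1]))
```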
0.4.4
API Change
- add a new class DataParallelNet for multi-GPU training (#461)
- add ActorCritic for deterministic parameter grouping for shared-head actor-critic networks (#458); see the sketch after this list
- collector.collect() now returns 4 extra keys: rew/rew_std/len/len_std (previously this was done in the logger) (#459)
- rename WandBLogger -> WandbLogger (#441)
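A hedged sketch of the ActorCritic grouping from #458: the wrapper collects the actor and critic parameters in a deterministic order and, being a single nn.Module, yields each shared parameter only once; the network shapes below are illustrative.

```python
import torch
from tianshou.utils.net.common import Net, ActorCritic
from tianshou.utils.net.discrete import Actor, Critic

# shared feature extractor ("shared head") with separate actor/critic heads
net = Net(state_shape=(4,), hidden_sizes=[64, 64])
actor = Actor(net, action_shape=2)
critic = Critic(net)

# one optimizer over the deterministically grouped actor + critic parameters
actor_critic = ActorCritic(actor, critic)
optim = torch.optim.Adam(actor_critic.parameters(), lr=3e-4)
```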
Bug Fix
- fix logging in atari examples (#444)
0.4.3
Enhancement
- add Rainbow (#386)
- add WandbLogger (#427)
- add env_id in preprocess_fn (#391); see the sketch after this list
- update README, add new chart and bibtex (#406)
- add Makefile, now you can use `make commit-checks` to automatically perform almost all checks (#432)
- add isort and yapf, apply to existing codebase (#432)
- add spelling check by using `make spelling` (#432)
- update contributing.rst (#432)
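A hedged sketch of a preprocess_fn that uses the new `env_id` argument from #391; the keyword set (obs/obs_next/rew/done/info/env_id) and the convention of returning a Batch with the modified keys follow the Collector documentation, and the per-environment reward scaling is purely illustrative.

```python
import numpy as np
from tianshou.data import Batch

def preprocess_fn(**kwargs):
    # on reset only obs and env_id are passed; during a step the collector also
    # passes obs_next, rew, done, info (and policy)
    if "rew" in kwargs:
        env_id = np.asarray(kwargs["env_id"])
        rew = kwargs["rew"].copy()
        rew[env_id == 0] *= 0.5  # illustrative: rescale rewards from env 0 only
        return Batch(rew=rew)
    return Batch()  # nothing to change at reset time
```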
0.4.2
Enhancement
- Add model-free DQN-family algorithms: IQN (#371), FQF (#376)
- Add model-free on-policy algorithms: NPG (#344, #347), TRPO (#337, #340)
- Add offline RL algorithms: CQL (#359), CRR (#367)
- Support deterministic evaluation for on-policy algorithms (#354)
- Make trainer resumable (#350); see the checkpoint sketch after this list
- Support different state size and fix exception in venv.__del__ (#352, #384)
- Add vizdoom example (#384)
- Add numerical analysis tool and interactive plot (#335, #341)
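A hedged sketch of the checkpoint hook behind the resumable trainer (#350): the `save_checkpoint_fn(epoch, env_step, gradient_step)` signature and the `resume_from_log` flag are assumed from the trainer API, and the path and saved contents are illustrative.

```python
import os
import torch

def save_checkpoint_fn(epoch, env_step, gradient_step):
    # called periodically by the trainer; in practice you would also save
    # policy.state_dict() and optimizer state here so training can be restored
    os.makedirs("log", exist_ok=True)
    path = os.path.join("log", "checkpoint.pth")
    torch.save(
        {"epoch": epoch, "env_step": env_step, "gradient_step": gradient_step},
        path,
    )
    return path
```

Passing `save_checkpoint_fn=save_checkpoint_fn` together with `resume_from_log=True` to a trainer is then expected to restore the epoch/step counters from the logger on restart.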
0.4.1
API Change
- Add observation normalization in BaseVectorEnv (`norm_obs`, `obs_rms`, `update_obs_rms` and `RunningMeanStd`) (#308)
- Add `policy.map_action` to bound raw actions (e.g., mapping from (-inf, inf) to [-1, 1] by clipping or tanh squashing); the mapped action is not stored in the replay buffer (#313)
- Add `lr_scheduler` in on-policy algorithms, typically for `LambdaLR` (#318)
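A hedged sketch of wiring a `LambdaLR` scheduler into an on-policy policy via the new `lr_scheduler` argument; the linear decay mirrors the MuJoCo examples, the loop constants are illustrative, and the single dummy parameter only stands in for a real actor-critic optimizer.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# illustrative training-loop constants
max_epoch, step_per_epoch, step_per_collect = 100, 30000, 2048

# stand-in optimizer; in practice this optimizes the actor-critic parameters
optim = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=3e-4)

# decay the learning rate linearly to zero over the total number of updates
max_update_num = (step_per_epoch // step_per_collect) * max_epoch
lr_scheduler = LambdaLR(optim, lr_lambda=lambda n: 1 - n / max_update_num)
# the scheduler is then handed to the policy, e.g. PPOPolicy(..., lr_scheduler=lr_scheduler)
```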
Note
To adapt to this version, change `action_range=...` to `action_space=env.action_space` in policy initialization.
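A hedged sketch of that migration for a continuous-control policy; the DDPG setup on Pendulum is illustrative and the constructor arguments are not meant to be exhaustive.

```python
import gym
import torch
from tianshou.policy import DDPGPolicy
from tianshou.utils.net.common import Net
from tianshou.utils.net.continuous import Actor, Critic

env = gym.make("Pendulum-v0")
state_shape = env.observation_space.shape
action_shape = env.action_space.shape
max_action = env.action_space.high[0]

actor = Actor(Net(state_shape, hidden_sizes=[64, 64]), action_shape,
              max_action=max_action)
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic = Critic(Net(state_shape, action_shape, hidden_sizes=[64, 64], concat=True))
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)

policy = DDPGPolicy(
    actor, actor_optim, critic, critic_optim,
    action_space=env.action_space,  # previously: action_range=(-2.0, 2.0)
)
```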
Bug Fix
- Fix incorrect behaviors with on-policy algorithms (error when `n/ep==0` and the reward shown in tqdm) (#306, #328)
- Fix q-value mask_action error for obs_next (#310)
Enhancement
- Release SOTA Mujoco benchmark (DDPG/TD3/SAC: #305, REINFORCE: #320, A2C: #325, PPO: #330) and add corresponding notes in /examples/mujoco/README.md
- Fix `numpy>=1.20` typing issue (#323)
- Add cross-platform unittest (#331)
- Add a test on how to deal with finite env (#324)
- Add value normalization in on-policy algorithms (#319, #321)
- Separate advantage normalization and value normalization in PPO (#329)
0.4.0
This release contains several API and behavior changes.
API Change
Buffer
- Add ReplayBufferManager, PrioritizedReplayBufferManager, VectorReplayBuffer, PrioritizedVectorReplayBuffer, CachedReplayBuffer (#278, #280)
- Change the `buffer.add` API from `buffer.add(obs, act, rew, done, obs_next, info, policy, ...)` to `buffer.add(batch, buffer_ids)` in order to add data more efficiently (#280); see the sketch after this list
- Add `set_batch` method in buffer (#278)
- Add `sample_index` method, the same as `sample` but returning only the index instead of both the index and the batch data (#278)
- Add `prev` (one-step previous transition index), `next` (one-step next transition index) and `unfinished_index` (the last modified index whose `done==False`) (#278)
- Add internal method `_alloc_by_keys_diff` in batch to support any form of keys popping up (#280)
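A minimal sketch of the new add call, assuming the batch must carry at least obs/act/rew/done with a leading batch dimension and that `add` returns index and episode-statistic arrays.

```python
import numpy as np
from tianshou.data import Batch, ReplayBuffer

buf = ReplayBuffer(size=10)

# one transition, shaped with a leading batch dimension of 1
transition = Batch(
    obs=np.zeros((1, 4)),
    act=np.array([0]),
    rew=np.array([1.0]),
    done=np.array([False]),
    obs_next=np.zeros((1, 4)),
)

# buffer_ids selects which sub-buffer receives the data; a plain ReplayBuffer
# only has sub-buffer 0, which is also the default
ptr, ep_rew, ep_len, ep_idx = buf.add(transition, buffer_ids=[0])
```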
Collector
- Rewrite the original Collector and split the async functionality into AsyncCollector: Collector only supports sync mode, while AsyncCollector supports both modes (#280)
- Drop `collector.collect(n_episode=List[int])` because the new collector can collect episodes without bias (#280)
- Move `reward_metric` from Collector to trainer (#280)
- Change the `Collector.collect` logic: `AsyncCollector.collect` keeps the previous semantics, where `collect(n_step or n_episode)` will not collect exactly n_step or n_episode transitions; `Collector.collect(n_step or n_episode)` now collects exactly n_step or n_episode transitions (#280)
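A hedged sketch of the stricter sync-collector semantics; the RandomPolicy below is a throwaway stand-in (not part of the library) so the snippet is self-contained, and the buffer/env sizes are illustrative.

```python
import gym
import numpy as np
from tianshou.data import Batch, Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import BasePolicy

class RandomPolicy(BasePolicy):
    """Illustrative stand-in that samples uniform CartPole actions."""
    def forward(self, batch, state=None, **kwargs):
        return Batch(act=np.random.randint(0, 2, size=len(batch.obs)))
    def learn(self, batch, **kwargs):
        return {}

envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(4)])
buf = VectorReplayBuffer(total_size=2000, buffer_num=4)
collector = Collector(RandomPolicy(), envs, buf)

result = collector.collect(n_step=100)  # sync mode: exactly 100 transitions
print(result["n/st"], result["n/ep"])
```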
Policy
- Add `policy.exploration_noise(action, batch) -> action` method instead of implementing the noise in `policy.forward()` (#280)
- Add `Timelimit.truncate` handler in `compute_*_returns` (#296)
- Remove `ignore_done` flag (#296)
- Remove `reward_normalization` option in off-policy algorithms (it will raise an error if set to True) (#298)
Trainer
- Change `collect_per_step` to `step_per_collect` (#293)
- Add `update_per_step` and `episode_per_collect` (#293); `onpolicy_trainer` now supports either step-collect or episode-collect (#293)
- Add BasicLogger and LazyLogger to log data more conveniently (#295)
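A hedged sketch of the renamed trainer options together with the new BasicLogger; the keyword values are illustrative and would be passed to a trainer such as offpolicy_trainer alongside the policy and collectors.

```python
from torch.utils.tensorboard import SummaryWriter
from tianshou.utils import BasicLogger

# the new logger wraps a TensorBoard SummaryWriter (log path is illustrative)
logger = BasicLogger(SummaryWriter("log/dqn"))

# illustrative keyword arguments reflecting the renamed options
trainer_kwargs = dict(
    max_epoch=10,
    step_per_epoch=10000,
    step_per_collect=10,   # renamed from collect_per_step
    update_per_step=1,     # new in #293
    episode_per_test=10,
    batch_size=64,
    logger=logger,
)
```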
0.3.2
0.3.1
API Change
- change `utils.network` args to support any form of MLP by default (#275): remove `layer_num` and `hidden_layer_size`, add `hidden_sizes` (a list of ints indicating the network architecture); see the sketch after this list
- add HDF5 save/load method for ReplayBuffer (#261)
- add offline_trainer (#263)
- move Atari-related network to `examples/atari/atari_network.py` (#275)
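A minimal sketch of the new network specification; the shapes are illustrative.

```python
from tianshou.utils.net.common import Net

# a three-layer MLP (128 -> 128 -> 64) from a 4-dim observation to 2 outputs,
# replacing the old layer_num / hidden_layer_size pair
net = Net(state_shape=(4,), action_shape=2, hidden_sizes=[128, 128, 64])
```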
Bug Fix
- fix a potential bug in discrete behavior cloning policy (#263)
0.3.0.post1
Several bug fixes (trainer, test and docs)
0.3.0
Since the code has changed substantially from v0.2.0 at this point, we release it as version 0.3 from now on.
API Change
- add policy.updating and clarify collecting state and updating state in training (#224)
- change `train_fn(epoch)` to `train_fn(epoch, env_step)` and `test_fn(epoch)` to `test_fn(epoch, env_step)` (#229); see the sketch after this list
- remove out-of-date APIs: collector.sample, collector.render, collector.seed, VectorEnv (#210)
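A hedged sketch of hooks matching the new signature; the epsilon schedule and `policy.set_eps` follow the DQN examples, and the constants are illustrative.

```python
def make_hooks(policy, eps_train=1.0, eps_test=0.05, decay_steps=50000):
    """Build train_fn/test_fn with the new (epoch, env_step) signature."""
    def train_fn(epoch, env_step):
        # linearly anneal epsilon-greedy exploration by global environment step
        policy.set_eps(max(0.1, eps_train * (1 - env_step / decay_steps)))

    def test_fn(epoch, env_step):
        policy.set_eps(eps_test)

    return train_fn, test_fn
```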
Bug Fix
- fix a bug in DDQN: target_q could not be sampled from np.random.rand (#224)
- fix a bug in DQN atari net: it should add a ReLU before the last layer (#224)
- fix a bug in collector timing (#224)
- fix a bug in the converter of Batch: deepcopy a Batch in to_numpy and to_torch (#213)
- ensure buffer.rew has a type of float (#229)