
Updates on Reanalyse / Sample Efficiency (Re-executing MCTS, Parallelization, Stabilization with a target model, etc.) #142

Open · wants to merge 4 commits into master
Conversation

FXDevailly

This pull request aims to provide the following additions (a rough sketch of the resulting reanalyse step follows the list):

  • The re-execution of MCTS in Reanalyse to obtain updated/fresh policy targets (child_visits), which, as stated in the MuZero paper, improves sample efficiency.
  • The use of lagging parameters (a target model) in Reanalyse to stabilize the bootstrapping of root values (as done in the MuZero paper).
  • The ability to run multiple Reanalyse processes in parallel (especially useful when running MCTS is costly and/or the replay buffer is large).
  • An updated 'Reanalyse sampling' method that prioritizes updating the episodes with the oldest targets (root values and child_visits).
  • New settings in the game files: 1) the parameter-update frequency of the target model used in Reanalyse, 2) the number of Reanalyse processes, and 3) whether to update root values from the target model alone (representation + value) or from the root values of the re-executed MCTS.
  • For PER, priorities are now updated in Reanalyse using the new MCTS predictions (in the same way they are initialized when saving an episode).
  • cartpole_sample_efficient.py, with some of the settings used in the MuZero paper (PER=1, num_td_steps=5, num_unroll_steps=5), a train_steps/play_steps ratio of 20, parallelized Reanalyse, a frequently updated target model, and root-value updates based on the re-executed MCTS.
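A minimal sketch of what one such reanalyse step could look like, assuming the MCTS.run interface and the GameHistory fields used elsewhere in this repository (use_updated_mcts_value_targets is the new setting named above; the helper itself is illustrative and is not code from this diff):

import numpy

def reanalyse_game(game_history, target_model, mcts, config):
    # Re-execute MCTS from every stored observation with the (lagging) target
    # model to obtain fresh search statistics.
    fresh_visits, fresh_root_values = [], []
    for i, observation in enumerate(game_history.observation_history):
        root, _ = mcts.run(
            target_model,
            observation,
            config.action_space,  # board games would query the Game object for legal actions instead
            game_history.to_play_history[i],
            add_exploration_noise=False,
        )
        sum_visits = sum(child.visit_count for child in root.children.values())
        fresh_visits.append(
            [
                root.children[a].visit_count / sum_visits if a in root.children else 0
                for a in config.action_space
            ]
        )
        fresh_root_values.append(root.value())

    # Fresh policy targets are always stored; the source of the value targets
    # depends on the new setting.
    game_history.child_visits = fresh_visits
    if config.use_updated_mcts_value_targets:
        game_history.reanalysed_predicted_root_values = numpy.array(fresh_root_values)
    return game_history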

@FXDevailly FXDevailly changed the title Updates on Reanalyse / Sample Efficiency Updates on Reanalyse / Sample Efficiency (Re-executing the MCTS, Parallelization, Stabilization with a target model, etc.) Mar 27, 2021
@FXDevailly FXDevailly changed the title Updates on Reanalyse / Sample Efficiency (Re-executing the MCTS, Parallelization, Stabilization with a target model, etc.) Updates on Reanalyse / Sample Efficiency (Re-executing MCTS, Parallelization, Stabilization with a target model, etc.) Mar 27, 2021
@theword
Contributor

theword commented Mar 29, 2021

The number of games that get played drastically changes.

Your pull request - [screenshot: num_games_sample]

Compared to the default cartpole - [screenshot: num_games_normal]

But the results look good! [screenshot: sample_expertiment]

I didn't compare your other game parameters to the default (which could influence the results and make this an imperfect experiment), but this method seems to be more productive if you can pay the higher computational cost.

@FXDevailly
Author

FXDevailly commented Mar 29, 2021


Thanks for the feedback.
One of the changes influencing the number of played games is the ratio in the game file (20 in the sample_efficient version). I think you could try the same value (20) with the original version (not this pull request) to make the comparison fairer! Also, regarding the computational cost, parallelization (using multiple Reanalyse processes) can compensate for the higher reanalyse cost.
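For reference, an illustrative snippet (not code from the diff) showing where that ratio lives in a game file; "ratio" follows the name used in the existing game configs, while the two reanalyse-related attributes are assumptions based on this PR's description:

class MuZeroConfig:
    def __init__(self):
        ...
        self.ratio = 20  # desired training steps per self-played step (cartpole_sample_efficient.py uses 20)
        self.num_reanalyse_workers = 2  # parallel Reanalyse processes, to offset the extra reanalyse cost
        self.use_updated_mcts_value_targets = True  # bootstrap root values from the re-executed MCTS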

@theword
Contributor

theword commented Mar 29, 2021


Yeah, with parallelization it works out great. I didn't notice any performance problems on my CPU.

However, I think your code is breaking tic-tac-toe and Connect 4 now.

When I run other games (not cartpole or your cartpole sample), I get:

Possible unhandled error from worker: ray::Reanlyse.reanalyse()

Then it freaks out at return self.env.legal_actions() on line 167 of tic-tac-toe:

AttributeError: 'NoneType' object has no attribute 'legal_actions'

This happens whether self.use_updated_mcts_value_targets is False or True.

@theword
Contributor

theword commented Apr 22, 2021


So I revisited this PR because it did increase the efficiency of MuZero.

I tracked the culprit down to this line of code in replay_buffer.py, in the init() of Reanalyze:

self.game.env = None

When legal_actions was called by the MCTS, there was no env object to call it on. When I took out this line of code, everything runs without error and appears to work. Cartpole worked because its legal_actions just returns a list of 0 and 1, but all the other games call legal_actions on self.env. What was the intended purpose of this line?
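A paraphrase of the two Game wrappers (not copied verbatim from the repository) showing why cartpole tolerated the removed env while the board games did not:

class CartpoleGame:
    def legal_actions(self):
        # Never touches self.env, so it still works after Reanalyse sets env to None.
        return [0, 1]

class TicTacToeGame:
    def __init__(self):
        self.env = None  # what Reanalyse leaves behind after self.game.env = None
    def legal_actions(self):
        # Delegates to the underlying environment; with env set to None this raises
        # AttributeError: 'NoneType' object has no attribute 'legal_actions'
        return self.env.legal_actions()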

@FXDevailly
Author

FXDevailly commented Apr 22, 2021


When using cartpole, the "legal_actions" method is part of the Game object itself (self.game in Reanalyse). Since there can be many Reanalyse processes, I wanted to save memory by removing the environments (self.game.env) from these Game objects, since they are not required to run the MCTS.

However, it appears that for other games (such as tic-tac-toe) the "legal_actions" method comes from the env object (self.game.env) itself, so it is required. I guess the memory impact should not be too big if we just remove the line, as you suggested. Thanks for sharing this feedback!

@FXDevailly (Author) left a comment


This line saves a negligible amount of memory but is incompatible with some environments, as pointed out by @theword.
It should therefore be deleted.

replay_buffer.py Outdated
# Import the game class to enable MCTS updates
game_module = importlib.import_module("games." + self.config.game_filename)
self.game = game_module.Game()
self.game.env = None

Suggested change: delete the line
self.game.env = None

@qwyin

qwyin commented Feb 10, 2022

This is great. But after loading a checkpoint and restarting the training, reanalyse_priorities is initialized to None, which leads to an error when self.reanalyse_priorities += 1 is executed.
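One possible guard, sketched against the symptom described above (not code from the diff), would be to fall back to a zero-initialized array when a checkpoint saved without reanalyse priorities is loaded, so the later += 1 update does not fail; sizing it by the number of stored games is an assumption:

import numpy

def restore_reanalyse_priorities(replay_buffer):
    # If the loaded checkpoint predates this PR, reanalyse_priorities may be None.
    if getattr(replay_buffer, "reanalyse_priorities", None) is None:
        replay_buffer.reanalyse_priorities = numpy.zeros(
            len(replay_buffer.buffer), dtype="float32"
        )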

torch.squeeze(values).detach().cpu().numpy()
)
# re-execute MCTS to update targets (child visits and root_values)
l = len(game_history.root_values)


Hey,

Should there not be an else on this part?
If I understand correctly, the if on line 346 is there to trigger "do not use the re-executed MCTS tree roots, use updated values directly from the value function instead".
So these lines should only trigger if that if does not fire, I believe.
Apologies if the mistake is mine.


Ah, I see now that I missed the if on line 392, which solves the problem of updating with tree roots unintentionally.
For computational efficiency though, wouldn't it still be better to add the else instead of computing both?


I believe there shouldn't be an "else" here, since we have to execute MCTS anyway to obtain a "fresh" policy (used as the training target 80% of the time, as mentioned in the MuZero Reanalyze appendix of the paper). Meanwhile, the value target comes either from the re-executed MCTS or from the target network.
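To restate the point in code (illustrative, not the diff): the MCTS re-execution is unconditional because the fresh policy targets are needed either way, and only the source of the value targets changes with the setting; the argument names below are placeholders for values computed earlier in the reanalyse step:

def update_targets(game_history, config, fresh_visits, mcts_root_values, target_model_values):
    # Fresh policy targets from the re-executed MCTS are stored in every case.
    game_history.child_visits = fresh_visits
    if config.use_updated_mcts_value_targets:
        # Bootstrap root values from the re-executed MCTS tree roots.
        game_history.reanalysed_predicted_root_values = mcts_root_values
    else:
        # Bootstrap root values from the lagging target model alone (representation + value).
        game_history.reanalysed_predicted_root_values = target_model_values
    return game_history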
