-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT CSE Optimization - Add a gymnasium environment for reinforcement learning #101856
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
@leculver skimmed through and this looks awesome! Will need a bit of time to review. Will try and get you some feedback by early next week. FYI @dotnet/jit-contrib @matouskozak @mikelle-rogers Also FYI @NoahIslam @LogarithmicFrog1: you might find the approach Lee is taking here a bit more accessible and/or familiar, if you're still up for some collaboration. |
No problem, take the time you need. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks great. Happy to merge this as is.
Mostly my comments are about clarification and trying to match up what you have here with what I have done previously.
src/coreclr/scripts/cse_ml/readme.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see a bit more of a writeup about the overall approach, either here or somewhere else. Things like
- are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?
- are the rewards discounted or undiscounted?
- how are you handling the fact that reward magnitudes can vary greatly from one method to the next?
- what sort of neural net topology gets built? Why is this a good choice?
- how are you producing the aggregate score across all the methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm 100% in agreement about needing to create more writeup and documentation.
I guess I should have been a bit more clear in the intention of this Pull Request. I consider the code here the absolute minimum starting point that other folks (and myself) can play with to make improvements. The code here is meant as that playground for use over the next couple of months.
When I'm further along in experimenting with different approaches, model architecture, and so on, that's when I plan to write everything up. Some of the techniques will certainly change after I've had more time to experiment in the space, so I didn't write down too much about this base-design because I expect a lot of it to be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's quick answers to your questions:
are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?
This version uses incremental rewards by building up a sequence of decisions.
are the rewards discounted or undiscounted?
Rewards are discounted, but not heavily. Actually, we currently just use the stable-baselines default gamma of 0.99. I intentionally haven't tuned hyperparameters in this checkin. Again trying to keep it as simple as possible.
how are you handling the fact that reward magnitudes can vary greatly from one method to the next?
Currently, we use % change in the perfscore. This keeps rewards relatively within the same magnitude. Obviously some methods are longer than others and the change in perfscore for choosing a CSE likely doesn't scale with method length, so this is a place for improvement.
My overall goal with this checkin was simplicity and being able to understand what it's doing. Since the model trains successfully (though doesn't beat the current CSE Heursitic), I did not try to refine them further yet.
what sort of neural net topology gets built? Why is this a good choice?
Currently, it's the default for stable-baselines. I can give you the topology, but this was also a non-choice so far. The default network trained successfully, so I haven't dug further into the design (yet).
how are you producing the aggregate score across all the methods?
I'm just averaging the change in perfscore. I like your method better and will update to that next checkin.
src/coreclr/jit/optcse.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JIT changes look good.
There is some overlap with things from the other RL heuristic but I think it's ok and probably simpler for now to keep them distinct.
return REWARD_SCALE * (prev - curr) / prev | ||
|
||
def _is_valid_action(self, action, method): | ||
# Terminating is only valid if we have performed a CSE. Doing no CSEs isn't allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this because you track the "no cse" cases separately, so when learning you're always doing some cses?
There will certainly be some instances where doing no cses is the best policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My overall goal with this checkin is to get something relatively simple and understandable as the baseline for future work. In this case, my (intentionally) simple reward function isn't capable of understanding an initial "no" choice without adding extra complexity.
A more refined version of this project can and will handle the case where we choose no CSEs to perform, but I did not want to overcomplicate the initial version.
if np.isclose(prev, 0.0): | ||
return 0.0 | ||
|
||
return REWARD_SCALE * (prev - curr) / prev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this answers my question about how the variability in rewards is handled? Is prev
here some fixed policy result (say no cse or the current heuristic)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The architecture of this model is to individually choose each CSE one after another until "none" is selected. The prev
score is the score of the previous decision. For example, let's say the model eventually choses [3, 1, 5, stop]. In the first iteration, prev
will be the perfscore of the method with no CSEs and curr
will be the perfscore of only CSE 3 chosen. On the second iteration, prev
will be the perfscocre of only CSE 3 chose, and curr
will be with CSEs [3, 1]
chosen. And so on.
This isn't the only way to build training, but it's the one I started with.
no_jit_failure = result[result['failed'] != ModelResult.JIT_FAILED] | ||
|
||
# next calculate how often we improved on the heuristic | ||
improved = no_jit_failure[no_jit_failure['model_score'] < no_jit_failure['heuristic_score']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this touches on how the aggregate score is computed.
Generally I like to use the geomean. If we have
(here lower is better) and I expect the "best possible" improvement to be around 0.99 (my policy gets about 0.994).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah interesting. I will add this to the next update, thanks for the suggestion!
- Also better wrapper factoring.
from .method_context import MethodContext | ||
|
||
MIN_CSE = 3 | ||
MAX_CSE = 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a starting number for min and max CSEs and if this works, then we will extend further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. This was the starting point to get something working. We need to think through how to give the model the ability to see and select all CSEs (up to 64 which is the JIT's max). Defining a new architecture is yet another project to work on. I filed that as an issue here: leculver/jitml#8
…learning (dotnet#101856) * Initial code * Add notes * Add CSE_HeuristicRLHook * Move metric print location, double -> int * Produce non-viable entries, fix output issue * Shuffle features by type * Initial JitEnv - not yet working * Change to snake_case * Initial RL implementation with stable-baselines3 * Enable parallel processing, fix some errors * Clean up train.py, allow algorithm selection * Fix paths * Fix issue with null result * Save method indexes * Check if process is still running * Up argument count before warning * Track more statistics on tensorboard * Fix an issue where we didn't let the model know it shouldn't pick something * Reward improvements - Scale up rewards. - Clamp rewards to [-1, 1] - Reward/penalize when complete if there are better/worse CSEs (this is very slow) - Reward when complete based on whether we beat the heuristic or not * Update jitenv.py to remove unused import * Fix inverted graph * Split data into test/train * Refactor for clarity * Use numpy for randomness * Add open questions * Fix a couple of model saving issues * Refactor and cleanup * Add evaluate.py * Fix inverted test/train * Add a way to get the probabilities of actions * Rename file * Clean up imports * Changed action space - 0 to action_space.n-2 are now the CSEs to apply instead of adding and subtracting 1 to the action. - 0 no longer means terminate, instead the action from the model of n-1 is the terminate signal. This is not passed to the JIT. * Add field validator for perf_score This shouldn't happen but it's important enough to validate * Update applicability to ensure we have at least enough viable candidates and not more than total * Fix a few bugs with evaluate * Fix test/train split, some extra output * Remove dead code, simplify format * Rename JitEnv -> JitCseEnv * More renames * Try to factor the observaiton space * Fix test/train split * Reward cleanup - Split reward function into shallow and deep. * Remove 0 perfscore check * Enable deep rewards * Fix issue where jit failed and produced None method * Simplify deeper rewards * Update todo * Add reward customization * Clean up __all__ * Fix issue where we would JIT the first CSE candidate in reset This was leftover code from the previous design of RLHook. * Add two new features, emit selected sequence * Jit one less method per cse chosen in deep rewards * Use info dictionary instead of a specific state * Fix segfault due to null variable * Add superpmi_context Getting the code well-factored so it's easy to modify. * Add tensorboard entry for invalid choices, clear results * Close the environment * Add documentation for JIT changes * Rename method * Normalize observation * Set return type hint for clarity * Add RemoveFeaturesWrapper * Update docstring * Rename function * Move feature normalization to a wrapper - Also better wrapper factoring. * Remove import * Fix warning * Fix Windows issue * Properly log when using A2C * Add readme * Change argument name * Remove whitespace change * Format fixes * Fix formatting * Update readme: Fix grammar, add a note about evaluation * Fixed incorrect filename in readme * Save more data to .json in preparation of other model kinds
Implement a gymnasium environment,
JitCseEnv
, to allow us to rapidly iterate on features, reward functions, and model/neural network architecture for JIT CSE optimization. This change:This implements the bare minimum rewards and features needed to experiment with CSE optimization. The current non-normalized features and simple reward function creates a model that is almost as good as the current, hand-written CSE Heuristic in the JIT. Further developments and improvements will likely be done offline, this is meant to be the skeleton of the project that's shared.
More information can be found in the README.md included in this pull request.
Contributes to: #92915.