[FlashRL 3/N] Add example for FP8 training with FlashRL #169
Merged
SumanthRH merged 11 commits into NovaSky-AI:main on Aug 20, 2025
Conversation
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
Force-pushed from 04e753a to 7bb4e7c
SumanthRH commented on Aug 20, 2025
.. warning::

   FlashRL integration is experimental. While generation times can improve for large models with quantization, we've observed that the time spent in weight syncing is much higher with FlashRL for fp8. This negates most of the benefits of fp8 inference. The slowdown is primarily due to slow weight quantization in vLLM's ``process_weights_after_loading`` function. We are actively working on improving this.
Member (Author)
This is an important warning. I've already improved weight syncing with the batching implementation plus fixes for FlashRL's `patch_load_weights` method, but it is still not good enough. We will revisit the fp8 slowdown, and meanwhile also see if int8 can provide good overall step-time improvements.
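For context, here is a rough sketch of the batching idea: instead of pushing parameters into vLLM one tensor at a time, group them so each `load_weights` call (and any follow-on quantization work) covers many tensors at once. The helper name and batch size below are hypothetical, not the actual SkyRL/FlashRL implementation:

```python
# Hypothetical sketch of batched weight syncing; `sync_weights_batched` and the
# batch size are illustrative, not SkyRL's or FlashRL's actual API.
from typing import Iterable, Tuple

import torch


def sync_weights_batched(
    vllm_model,
    named_params: Iterable[Tuple[str, torch.Tensor]],
    batch_size: int = 64,
) -> None:
    """Group (name, tensor) pairs so per-call overhead, including any
    post-load quantization work, is amortized over many tensors."""
    batch = []
    for name, tensor in named_params:
        batch.append((name, tensor))
        if len(batch) >= batch_size:
            # vLLM model classes accept an iterable of (name, tensor) pairs here.
            vllm_model.load_weights(batch)
            batch = []
    if batch:
        vllm_model.load_weights(batch)
```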
erictang000 reviewed on Aug 20, 2025
| """ | ||
| from skyrl_train.utils import ray_noset_visible_devices, get_all_env_variables, get_ray_pg_ready_with_timeout | ||
|
|
||
| assert not async_engine, "`async_engine` is not supported for FlashRL" |
Collaborator
Just to confirm - we can only use the offline engine for FlashRL, so only single-turn rollouts?
Collaborator
Maybe worth a clarification in the docs - I didn't realize this until I hit this line of code.
Member (Author)
Yeah let me add a warning
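For illustration, a minimal sketch of the kind of guard being discussed, assuming the relevant setting lives at `generator.async_engine` (the exact config path is an assumption and may differ from SkyRL's actual schema):

```python
# Illustrative guard only; the config attribute path `generator.async_engine`
# is an assumption, not necessarily SkyRL's actual config schema.
def check_flashrl_config(cfg) -> None:
    if getattr(cfg.generator, "async_engine", False):
        raise ValueError(
            "FlashRL only supports the offline (synchronous) engine, "
            "i.e. single-turn rollouts. Set generator.async_engine=false."
        )
```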
| How does it work? | ||
| ~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| We pass `quantization=fp8` flag to the vLLM engine at initialization time. This means that the weights are loaded as usual in half precision and then quantized down to fp8. During training, generations are sampled as usual, and in this case, sampled from quantized weights. Since we use online quantization, the scale factor used for quantizing activations are computed on the fly by vLLM internally. |
Collaborator
Suggested change:
- We pass `quantization=fp8` flag to the vLLM engine at initialization time. This means that the weights are loaded as usual in half precision and then quantized down to fp8. During training, generations are sampled as usual, and in this case, sampled from quantized weights. Since we use online quantization, the scale factor used for quantizing activations are computed on the fly by vLLM internally.
+ We pass the `quantization=fp8` flag to the vLLM engine at initialization time. This means that the weights are loaded as usual in half precision and then quantized down to FP8. During training, generations are sampled as usual, but now from the quantized weights. Since vLLM uses online quantization, the scale factors used for quantizing activations are computed dynamically during runtime.
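For readers unfamiliar with vLLM's online quantization, here is a minimal, self-contained sketch of what the `quantization=fp8` path looks like when calling vLLM directly, outside SkyRL's engine wrapper; the model name and sampling settings are placeholders:

```python
# Minimal sketch of vLLM's online FP8 quantization, shown outside SkyRL's
# engine wrapper. The model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    quantization="fp8",  # weights load in half precision, then are quantized to FP8
)
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Because the quantization is online, no pre-quantized checkpoint or calibration data is needed; vLLM computes the activation scales at runtime.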
Force-pushed from 523f734 to 2e90cac
dzorlu referenced this pull request in fleet-ai/SkyRL on Feb 4, 2026
What does this PR do?
WIP PR to add FP8 training with FlashRL.
Note that we currently only support online FP8 quantization. Support for pre-quantized FP8 and int8 will follow soon - it's a bit more involved given that you need to calibrate the scaling factors.
This PR uses a custom vLLM wheel. We found this to be the simplest way to manage the custom vLLM patches in FlashRL. The wheel is a pre-packaged build from the branch https://github.com/SumanthRH/vllm/tree/flashrl. Specifying the git URL directly led to uv building the CPU-only version of vLLM for some reason, so we'll use this wheel for now.
TODO:
- [x] Verify E2E run on Deepspeed and FSDP
- [x] Verify training on Qwen3 14B and 32B
- [x] Upload the wheel to GitHub releases and use the link from GitHub
- [x] Add docs