
[FlashRL 3/N] Add example for FP8 training with FlashRL #169

Merged
SumanthRH merged 11 commits into NovaSky-AI:main from SumanthRH:sumanthrh/its-flashrl-time on Aug 20, 2025

Conversation

@SumanthRH (Member) commented Aug 20, 2025

What does this PR do?

WIP PR to add FP8 training with FlashRL.

Note that we currently only support online FP8 quantization. Support for pre-quantized fp8 and int8 will follow soon - it's a bit more involved given that you need to calibrate scaling factors.
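For readers unfamiliar with the distinction, here is a rough sketch of why online quantization needs no calibration while pre-quantized checkpoints do. This is illustrative Python using per-tensor FP8 (e4m3) scaling in PyTorch; the function names are ours, not FlashRL's or vLLM's actual code:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def quantize_online(w: torch.Tensor):
    # Online (dynamic) quantization: the scale comes straight from the live
    # tensor, so no calibration pass or pre-computed scales are needed.
    scale = w.abs().max().float().clamp(min=1e-12) / FP8_MAX
    return (w.float() / scale).to(torch.float8_e4m3fn), scale

def quantize_with_calibrated_scale(w: torch.Tensor, scale: torch.Tensor):
    # Pre-quantized fp8/int8 checkpoints instead ship scales that were
    # calibrated offline on sample data; producing those scales is the
    # extra step mentioned above.
    return (w.float() / scale).to(torch.float8_e4m3fn)

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, scale = quantize_online(w)
print(w_fp8.dtype, float(scale))
```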

This PR uses a custom vLLM wheel. Found this to be the simplest way to manage the custom vLLM patches in FlashRL. The wheel is a pre-packaged build from the branch: https://github.com/SumanthRH/vllm/tree/flashrl. Specifying the git URL directly led to uv building the CPU-only version of vLLM for some reason, so we'll use this wheel for now.

TODO:

  • Verify E2E run on Deepspeed and FSDP
  • Verify training on qwen3 14B and 32B
  • upload wheel to github releases and use the link from Github
  • add docs

SumanthRH and others added 2 commits August 20, 2025 04:08
@SumanthRH force-pushed the sumanthrh/its-flashrl-time branch from 04e753a to 7bb4e7c on August 20, 2025 at 17:43
@SumanthRH marked this pull request as ready for review on August 20, 2025 at 18:00

.. warning::

   FlashRL integration is experimental. While generation times can improve for large models with quantization, we've observed that the time spent in weight syncing is much higher with FlashRL for fp8. This negates most of the benefits of fp8 inference. The slowdown is primarily due to slow weight quantization in vLLM's ``process_weights_after_loading`` function. We are actively working on improving this.
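To make the slowdown concrete, here is a toy sketch (ours, not vLLM's code) of what happens on every weight sync under online FP8: fresh bf16 weights arrive from the trainer and every synced tensor must be re-quantized, analogous to (but not literally) vLLM re-running `process_weights_after_loading`:

```python
import time
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

# Stand-ins for the quantized linear weights held by the inference engine.
layers = {f"layers.{i}.proj": torch.empty(4096, 4096, dtype=torch.float8_e4m3fn) for i in range(8)}

def weight_sync(new_weights: dict) -> None:
    # 1) receive fresh bf16 weights from the trainer
    # 2) re-quantize every one of them to FP8 before they can be used;
    #    this re-quantization is the step that dominates sync time
    for name, w in new_weights.items():
        scale = w.abs().max().float() / FP8_MAX
        layers[name] = (w.float() / scale).to(torch.float8_e4m3fn)

updated = {name: torch.randn(4096, 4096, dtype=torch.bfloat16) for name in layers}
start = time.perf_counter()
weight_sync(updated)
print(f"toy sync took {time.perf_counter() - start:.3f}s for {len(layers)} tensors")
```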
@SumanthRH (Member Author) commented Aug 20, 2025

This is an important warning. I've already improved weight syncing with the batching implementation plus fixes for FlashRL's `patch_load_weights` method, but it is still not good enough. We will revisit the fp8 slowdown, and meanwhile also see if int8 can provide good overall step-time improvements.
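For context, the batching referred to here groups parameters into large buckets so that each sync round-trip moves many tensors at once instead of paying per-tensor overhead. A rough sketch of the idea (the bucket size and the `engine.load_weights` call below are hypothetical, not SkyRL's actual API):

```python
from typing import Iterator
import torch

def bucket_named_params(named_params: dict, bucket_bytes: int = 512 * 1024 * 1024) -> Iterator[list]:
    """Group parameters into ~bucket_bytes chunks so each weight-sync call
    ships many tensors at once."""
    bucket, size = [], 0
    for name, p in named_params.items():
        bucket.append((name, p))
        size += p.numel() * p.element_size()
        if size >= bucket_bytes:
            yield bucket
            bucket, size = [], 0
    if bucket:
        yield bucket

# Usage sketch: each bucket would be broadcast / loaded in one call, e.g.
#   for bucket in bucket_named_params(dict(model.named_parameters())):
#       engine.load_weights(bucket)   # hypothetical engine API
params = {f"w{i}": torch.randn(1024, 1024) for i in range(8)}
for bucket in bucket_named_params(params, bucket_bytes=8 * 1024 * 1024):
    print(f"{len(bucket)} tensors in this bucket")
```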

@SumanthRH changed the title from "[FlashRL N/N] Add example for FP8 training with FlashRL" to "[FlashRL 3/N] Add example for FP8 training with FlashRL" on Aug 20, 2025
"""
from skyrl_train.utils import ray_noset_visible_devices, get_all_env_variables, get_ray_pg_ready_with_timeout

assert not async_engine, "`async_engine` is not supported for FlashRL"
Collaborator commented:

Just to confirm - we can only use the offline engine for FlashRL, so only single-turn rollouts?

Collaborator commented:

Maybe worth a clarification in the doc, I didn't realize until I hit this line of code.

@SumanthRH (Member Author) commented:

Yeah let me add a warning

How does it work?
~~~~~~~~~~~~~~~~~~

We pass `quantization=fp8` flag to the vLLM engine at initialization time. This means that the weights are loaded as usual in half precision and then quantized down to fp8. During training, generations are sampled as usual, and in this case, sampled from quantized weights. Since we use online quantization, the scale factor used for quantizing activations are computed on the fly by vLLM internally.
Collaborator commented:

Suggested change
We pass `quantization=fp8` flag to the vLLM engine at initialization time. This means that the weights are loaded as usual in half precision and then quantized down to fp8. During training, generations are sampled as usual, and in this case, sampled from quantized weights. Since we use online quantization, the scale factor used for quantizing activations are computed on the fly by vLLM internally.
We pass the `quantization=fp8` flag to the vLLM engine at initialization time. This means that the weights are loaded as usual in half precision and then quantized down to FP8. During training, generations are sampled as usual, but now from the quantized weights. Since vLLM uses online quantization, the scale factors used for quantizing activations are computed dynamically during runtime.
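For reference, here is roughly what that flag looks like with vLLM's standalone `LLM` API. The model name is just an example, and SkyRL wires this flag through its own engine setup rather than constructing `LLM` directly like this:

```python
from vllm import LLM, SamplingParams

# Weights are loaded in half precision and quantized to FP8 at load time;
# activation scales are computed on the fly by vLLM (online quantization).
llm = LLM(
    model="Qwen/Qwen3-14B",        # example model, not prescriptive
    quantization="fp8",
    dtype="bfloat16",
    gpu_memory_utilization=0.85,
)
out = llm.generate(["What does FP8 quantization do?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```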

@SumanthRH force-pushed the sumanthrh/its-flashrl-time branch from 523f734 to 2e90cac on August 20, 2025 at 20:01
@SumanthRH merged commit 825f2e8 into NovaSky-AI:main on Aug 20, 2025
1 check passed
dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026
# What does this PR do?

WIP PR to add FP8 training with FlashRL. 

Note that we currently only support online FP8 quantization. Support for
pre-quantized fp8 and int8 will follow soon - it's a bit more involved
given that you need to calibrate scaling factors.

This PR uses a custom vLLM wheel. Found this to be the simplest way to
manage the custom vLLM patches in FlashRL. The wheel is a pre-packaged
build from the branch: https://github.com/SumanthRH/vllm/tree/flashrl.
Specifying the git URL directly led to uv building the CPU-only version of
vLLM for some reason, so we'll use this wheel for now.

TODO:
- [x] Verify E2E run on Deepspeed and FSDP
- [x] Verify training on qwen3 14B and 32B
- [x] upload wheel to github releases and use the link from Github
- [x] add docs

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>