I'm interested in multi-turn environments with turn-level rewards in the SkyRLGymGenerator.
From my reading of the code, only the final reward from the trajectory is used: https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl-train/skyrl_train/generators/skyrl_gym_generator.py#L187. The `reward` variable is overwritten with the most recent reward at each step, and only its final value is returned at the end of the method.
However, loss functions like GSPO explicitly support token-level advantages.
How am I supposed to provide per-turn (and therefore varying token-level) rewards when only the final reward is ever returned? Or is this not currently supported?
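
For concreteness, here is a rough sketch of the kind of output I have in mind. It is not SkyRL code: the function name `turn_rewards_to_token_rewards` and its arguments are hypothetical, and placing each turn's reward on the last token of that turn's response is just one assumed convention.

```python
from typing import List


def turn_rewards_to_token_rewards(
    turn_lengths: List[int], turn_rewards: List[float]
) -> List[float]:
    """Hypothetical helper (not part of SkyRL): expand per-turn rewards
    into a per-token reward list.

    Assumption: each turn's reward is credited to the last response token
    of that turn, and every other token in the turn gets 0.0.
    """
    if len(turn_lengths) != len(turn_rewards):
        raise ValueError("need exactly one reward per turn")

    token_rewards: List[float] = []
    for length, reward in zip(turn_lengths, turn_rewards):
        if length <= 0:
            raise ValueError("each turn must contain at least one token")
        # Zero reward for all tokens except the last one in the turn.
        token_rewards.extend([0.0] * (length - 1))
        token_rewards.append(reward)
    return token_rewards


# Two turns: 3 response tokens with reward 0.5, then 2 tokens with reward 1.0.
print(turn_rewards_to_token_rewards([3, 2], [0.5, 1.0]))
```

With something like this, the generator would return one reward per response token instead of a single scalar for the whole trajectory.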