[Clarification request] Per-turn rewards in SkyRLGymGenerator agent_loop #201

I'm interested in multi-turn environments with turn-level rewards in the SkyRLGymGenerator.

From my reading of the code, only the final reward from the trajectory is used: https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl-train/skyrl_train/generators/skyrl_gym_generator.py#L187. The `reward` variable is overwritten with the most recent reward at each step, and only its final value is returned at the end of the method.
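To make the pattern concrete, here is a simplified sketch of what I think the loop is doing. This is not the actual SkyRL code: `ToyEnv`, `run_episode`, and the `(reward, done)` return shape are made up for illustration, and the real generator may differ in the details.

```python
from typing import List, Tuple


class ToyEnv:
    """Hypothetical stand-in environment that emits one reward per turn."""

    def __init__(self, turn_rewards: List[float]):
        self.turn_rewards = turn_rewards
        self.turn = 0

    def step(self, action: str) -> Tuple[float, bool]:
        reward = self.turn_rewards[self.turn]
        self.turn += 1
        done = self.turn >= len(self.turn_rewards)
        return reward, done


def run_episode(env: ToyEnv) -> Tuple[float, List[float]]:
    """Return the last reward (the pattern I see) and all rewards (what I want)."""
    reward = 0.0
    per_turn: List[float] = []
    done = False
    while not done:
        # `reward` is overwritten on every turn, so earlier values are lost
        # unless they are also collected separately.
        reward, done = env.step("some action")
        per_turn.append(reward)
    return reward, per_turn


final_reward, all_rewards = run_episode(ToyEnv([0.1, 0.5, 1.0]))
print(final_reward)  # only the last turn's reward survives
print(all_rewards)   # the per-turn rewards I would like to keep
```

With the overwrite pattern, the rewards from the first two turns never reach the trainer.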

However, loss functions like GSPO explicitly support token-level advantages.

How can I provide varying token-level rewards when only the final reward is returned? Or is this not currently supported?
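For reference, this is roughly the shape of output I would like to be able to produce. `per_token_rewards` is a hypothetical helper, not an existing SkyRL API, and placing each turn's reward on the last token of that turn's response is just one convention I'm assuming here.

```python
from typing import List


def per_token_rewards(
    turn_token_counts: List[int], turn_rewards: List[float]
) -> List[float]:
    """Spread per-turn rewards over response tokens.

    Assumed convention: every token gets 0.0 except the last token of each
    turn, which carries that turn's reward.
    """
    if len(turn_token_counts) != len(turn_rewards):
        raise ValueError("need exactly one reward per turn")
    rewards: List[float] = []
    for n_tokens, turn_reward in zip(turn_token_counts, turn_rewards):
        if n_tokens <= 0:
            raise ValueError("each turn must contain at least one token")
        rewards.extend([0.0] * (n_tokens - 1) + [turn_reward])
    return rewards


# Two turns: 3 response tokens with reward 0.5, then 2 tokens with reward 1.0.
print(per_token_rewards([3, 2], [0.5, 1.0]))
```

A list like this, aligned with the response tokens, is what I would expect a token-level advantage computation to consume.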
