Description
Currently, whether in the offline batched codepath or the agent loop codepath, SkyRL gets output from inference engines as strings (i.e. the inference engine detokenizes internally) and re-tokenizes the string output to prepare the input for the training pipeline:
- batched: `response_ids = self.tokenizer(response)["input_ids"]`
- agent loop: `new_resp_tokens = self.tokenizer.encode(output, add_special_tokens=False)`
This is susceptible to inconsistency between the token IDs the LLM actually generated and the token IDs obtained by detokenizing and then re-tokenizing.
This short script illustrates the inconsistency; the assertion at the end can fail (also see the section below): https://gist.github.com/CharlieFRuan/9c3961847176c6e447504e34c35e543f
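For reference, here is a minimal sketch of the same failure mode (separate from the gist; the model name is a placeholder, and the final assertion is not guaranteed to hold for any particular tokenizer):

```python
# Sketch of the detokenize-then-retokenize round trip with a HF tokenizer.
# The model name is an illustrative placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Simulate a sampled sequence: the model may emit "<", "search", ">" as
# separate tokens even if the tokenizer would normally merge them.
generated_ids = (
    tokenizer.encode("<", add_special_tokens=False)
    + tokenizer.encode("search", add_special_tokens=False)
    + tokenizer.encode(">", add_special_tokens=False)
)

# What SkyRL currently receives from the engine: the detokenized string.
generated_text = tokenizer.decode(generated_ids)

# What SkyRL currently trains on: a re-tokenization of that string.
retokenized_ids = tokenizer.encode(generated_text, add_special_tokens=False)

# Not guaranteed to hold: decode followed by encode is not an exact inverse.
assert retokenized_ids == generated_ids, (generated_ids, retokenized_ids)
```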
A fail-safe solution is to always ask the inference engines to return token IDs and use them as-is to prepare the training input, detokenizing only where needed (e.g. for tool call parsing).
For our remote LLM engines (i.e. `vllm_server.py` and `sglang_server.py`), this means SGLang will use the `http://localhost:{port}/generate` endpoint, while vLLM is pending vllm-project/vllm#22587 to land.
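As a rough sketch of what the SGLang side could look like (the request/response field names below reflect my understanding of SGLang's native `/generate` API and are assumptions to verify against the deployed version):

```python
# Querying SGLang's native /generate endpoint for token IDs instead of
# relying on the text output alone. Field names are assumptions.
import requests

resp = requests.post(
    "http://localhost:30000/generate",  # port is a placeholder
    json={
        "text": "What is the capital of France?",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.0},
        # Request per-token information so output token IDs come back
        # with the response rather than being re-tokenized from text.
        "return_logprob": True,
    },
)
out = resp.json()

# In recent SGLang versions, meta_info["output_token_logprobs"] holds
# (logprob, token_id, ...) entries; treat this field name as an assumption.
output_token_ids = [entry[1] for entry in out["meta_info"]["output_token_logprobs"]]
print(out["text"])
print(output_token_ids)
```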
What is token-in-token-out and why do we need it
- Essentially it means that, during the process of converting the LLM engines' output to the trainer's input, we do not want to re-tokenize the engines' text output into token IDs, as this can cause two types of misalignment:
  - Misalignment between what the LLM actually generated and what the trainer thinks the LLM generated
  - In a multi-turn setting, misalignment between the output token IDs from turn N and the input token IDs to turn N+1
- Consider the following example to illustrate the first type of misalignment (a runnable sketch of this toy example follows this list)
  - Say we have a vocabulary of 4 tokens, mapping from token ID to string:
    - `0: <`
    - `1: search`
    - `2: <search`
    - `3: >`
  - During rollout, the underlying LLM generated `0, 1, 3`. The `vLLMEngine` returned the string `<search>` in `InferenceEngineOutput` all the way back to the `Generator` (the detokenization of token IDs to string is done by the vLLM engine internally)
  - When preparing the `GeneratorOutput` for training, we need to re-tokenize the strings. However, `<search>` is tokenized into `2, 3`, causing misalignment
  - The policy model will then be updated based on actions it did not take, potentially harming RL training (though it is unclear how large the effect is in practice)
- To resolve this, we need to always keep the token ID output of the vLLM engines, rather than keeping only the string output and re-tokenizing when needed
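A self-contained sketch of the toy example above, using a hypothetical greedy longest-match tokenizer (not any particular library) to show how re-tokenizing the string output changes the token IDs:

```python
# Toy 4-token vocabulary from the example above.
vocab = {0: "<", 1: "search", 2: "<search", 3: ">"}
text_to_id = {piece: tid for tid, piece in vocab.items()}

def detokenize(token_ids):
    """What the engine does internally before returning a string."""
    return "".join(vocab[t] for t in token_ids)

def tokenize(text):
    """Hypothetical greedy longest-match tokenizer over the toy vocab."""
    ids, i = [], 0
    while i < len(text):
        piece = max(
            (p for p in text_to_id if text.startswith(p, i)),
            key=len,
        )
        ids.append(text_to_id[piece])
        i += len(piece)
    return ids

generated_ids = [0, 1, 3]                  # what the LLM actually sampled
returned_text = detokenize(generated_ids)  # "<search>" -- all the engine returns
retokenized_ids = tokenize(returned_text)  # [2, 3] -- what the trainer would train on

print(generated_ids, retokenized_ids)      # [0, 1, 3] vs [2, 3]
assert retokenized_ids != generated_ids    # the misalignment described above
```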