Description
Currently, whether in the offline batched codepath or the agent loop codepath, SkyRL gets output from inference engines as strings (i.e. the inference engine detokenizes internally) and re-tokenizes the string output to prepare the input for the training pipeline:
- batched: `response_ids = self.tokenizer(response)["input_ids"]`
- agent loop: `new_resp_tokens = self.tokenizer.encode(output, add_special_tokens=False)`
This is susceptible to inconsistency between the token IDs the LLM actually generated and the token IDs obtained by detokenizing and then re-tokenizing.
This short script illustrates the inconsistency; the assertion at the end can fail (also see the section below): https://gist.github.com/CharlieFRuan/9c3961847176c6e447504e34c35e543f
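For reference, here is a minimal sketch of the same failure mode (separate from the gist; the model name is a placeholder, and the final assertion is not guaranteed to hold for any particular tokenizer):

```python
# Sketch of the detokenize-then-retokenize round trip with a HF tokenizer.
# The model name is an illustrative placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Simulate a sampled sequence: the model may emit "<", "search", ">" as
# separate tokens even if the tokenizer would normally merge them.
generated_ids = (
    tokenizer.encode("<", add_special_tokens=False)
    + tokenizer.encode("search", add_special_tokens=False)
    + tokenizer.encode(">", add_special_tokens=False)
)

# What SkyRL currently receives from the engine: the detokenized string.
generated_text = tokenizer.decode(generated_ids)

# What SkyRL currently trains on: a re-tokenization of that string.
retokenized_ids = tokenizer.encode(generated_text, add_special_tokens=False)

# Not guaranteed to hold: decode followed by encode is not an exact inverse.
assert retokenized_ids == generated_ids, (generated_ids, retokenized_ids)
```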
A fail-safe solution is to always ask the inference engines to return token IDs and use them as-is to prepare the training input, detokenizing only where needed (e.g. for tool call parsing).
For our remote LLM engines (i.e. `vllm_server.py` and `sglang_server.py`), this means SGLang will use the `http://localhost:{port}/generate` endpoint, while vLLM is pending vllm-project/vllm#22587 to land.
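As a rough sketch of what the SGLang side could look like (the request/response field names below reflect my understanding of SGLang's native `/generate` API and are assumptions to verify against the deployed version):

```python
# Querying SGLang's native /generate endpoint for token IDs instead of
# relying on the text output alone. Field names are assumptions.
import requests

resp = requests.post(
    "http://localhost:30000/generate",  # port is a placeholder
    json={
        "text": "What is the capital of France?",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.0},
        # Request per-token information so output token IDs come back
        # with the response rather than being re-tokenized from text.
        "return_logprob": True,
    },
)
out = resp.json()

# In recent SGLang versions, meta_info["output_token_logprobs"] holds
# (logprob, token_id, ...) entries; treat this field name as an assumption.
output_token_ids = [entry[1] for entry in out["meta_info"]["output_token_logprobs"]]
print(out["text"])
print(output_token_ids)
```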
What is token-in-token-out and why do we need it
- Essentially it means that, during the process of converting the LLM engines' output to the trainer's input, we do not want to re-tokenize the engines' text output into token IDs, as this can cause two types of misalignment:
  - Misalignment between what the LLM actually generated and what the trainer thinks the LLM generated
  - In a multi-turn setting, misalignment between the output token IDs from turn N and the input token IDs to turn N+1
- Consider the following example to illustrate the first type of misalignment (a runnable sketch of this toy example follows this list)
  - Say we have a vocabulary of 4 tokens, mapping from token ID to string:
    - `0: <`
    - `1: search`
    - `2: <search`
    - `3: >`
  - During rollout, the underlying LLM generated `0, 1, 3`. The `vLLMEngine` returned the string `<search>` in `InferenceEngineOutput` all the way back to the `Generator` (the detokenization of token IDs to string is done by the vLLM engine internally)
  - When preparing the `GeneratorOutput` for training, we need to re-tokenize the strings. However, `<search>` is tokenized into `2, 3`, causing misalignment
  - The policy model will then be updated based on actions it did not take, potentially harming RL training (though it is unclear how large the effect is in practice)
- To resolve this, we need to always keep the token ID output of the vLLM engines, rather than keeping only the string output and re-tokenizing when needed
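A self-contained sketch of the toy example above, using a hypothetical greedy longest-match tokenizer (not any particular library) to show how re-tokenizing the string output changes the token IDs:

```python
# Toy 4-token vocabulary from the example above.
vocab = {0: "<", 1: "search", 2: "<search", 3: ">"}
text_to_id = {piece: tid for tid, piece in vocab.items()}

def detokenize(token_ids):
    """What the engine does internally before returning a string."""
    return "".join(vocab[t] for t in token_ids)

def tokenize(text):
    """Hypothetical greedy longest-match tokenizer over the toy vocab."""
    ids, i = [], 0
    while i < len(text):
        piece = max(
            (p for p in text_to_id if text.startswith(p, i)),
            key=len,
        )
        ids.append(text_to_id[piece])
        i += len(piece)
    return ids

generated_ids = [0, 1, 3]                  # what the LLM actually sampled
returned_text = detokenize(generated_ids)  # "<search>" -- all the engine returns
retokenized_ids = tokenize(returned_text)  # [2, 3] -- what the trainer would train on

print(generated_ids, retokenized_ids)      # [0, 1, 3] vs [2, 3]
assert retokenized_ids != generated_ids    # the misalignment described above
```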