
Follow token-in-token-out protocol for inference engines #123

@CharlieFRuan

Description


Currently, in both the offline batched codepath and the agent loop codepath, SkyRL receives output from inference engines as strings (i.e. the inference engine detokenizes internally), and then re-tokenizes the string output to prepare the input for the training pipeline.

This is susceptible to inconsistency between the token IDs the LLM actually generated and the token IDs obtained by detokenizing and then re-tokenizing that output.

This short script illustrates the inconsistency; the assertion at the end can fail (also see the section below): https://gist.github.com/CharlieFRuan/9c3961847176c6e447504e34c35e543f
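
As a minimal illustration of the same round-trip check (a sketch, not the gist itself; the model name is just a placeholder, assuming a HuggingFace tokenizer is available):

```python
from transformers import AutoTokenizer

# Placeholder model; any HF tokenizer works for this kind of check.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Suppose these are the token IDs the engine actually sampled, built here by
# encoding two fragments separately (as token-by-token sampling might produce).
generated_ids = (
    tokenizer.encode("<", add_special_tokens=False)
    + tokenizer.encode("search>", add_special_tokens=False)
)

text = tokenizer.decode(generated_ids)
retokenized_ids = tokenizer.encode(text, add_special_tokens=False)

# decode -> encode is not guaranteed to round-trip, so this can fail.
assert retokenized_ids == generated_ids, (generated_ids, retokenized_ids)
```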

A fail-safe solution is to always ask the inference engines to return token IDs and use them as-is to prepare the training input, detokenizing only for purposes such as tool call parsing.

For our remote LLM engines (i.e. vllm_server.py and sglang_server.py), this means SGLang will use the http://localhost:{port}/generate endpoint, while vLLM is pending vllm-project/vllm#22587 landing.
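
For illustration, the sketch below shows what token-in-token-out could look like against SGLang's native /generate endpoint: prompt token IDs go in, sampled token IDs come back, and text is only used where needed (e.g. tool call parsing). The request/response field names (input_ids, return_logprob, meta_info.output_token_logprobs) are assumptions based on SGLang's native API and may differ across versions:

```python
import requests

def generate_token_in_token_out(prompt_token_ids, port=30000, max_new_tokens=256):
    """Sketch of token-in-token-out against SGLang's /generate endpoint.

    Field names follow SGLang's native API as of recent versions (assumption);
    verify against the SGLang version you run.
    """
    resp = requests.post(
        f"http://localhost:{port}/generate",
        json={
            "input_ids": prompt_token_ids,           # token-in: no prompt string
            "sampling_params": {"max_new_tokens": max_new_tokens},
            "return_logprob": True,                   # so sampled token IDs come back
        },
    )
    resp.raise_for_status()
    out = resp.json()
    # Each entry is (logprob, token_id, optional token text); keep only the IDs.
    output_token_ids = [tok_id for _, tok_id, *_ in out["meta_info"]["output_token_logprobs"]]
    # The detokenized text is still available (e.g. for tool call parsing), but
    # the training input should be built from output_token_ids directly.
    return output_token_ids, out["text"]
```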

What is token-in-token-out and why do we need it

  • Essentially, it means that when converting the LLM engines’ output into the trainer’s input, we do not re-tokenize the engines’ text output into token IDs, as this can cause two types of misalignment:
    1. Misalignment between what the LLM actually generated and what the trainer thinks the LLM generated
    2. In a multi-turn setting, misalignment between the output token IDs of turn N and the input token IDs to turn N+1
  • Consider the following example, which illustrates the first type of misalignment (see the sketch after this list)
  • Say we have a vocabulary of 4 tokens, mapping token ID to string:
    • 0: <
    • 1: search
    • 2: <search
    • 3: >
  • During rollout, the underlying LLM generates 0, 1, 3
  • vLLMEngine returns the string <search> in InferenceEngineOutput all the way back to the Generator (the detokenization of token IDs into a string happens inside the vLLM engine)
  • When preparing the GeneratorOutput for training, we need to re-tokenize the strings; however, <search> is tokenized into 2, 3, causing a misalignment
  • The policy model is then updated based on actions it did not take, which can harm RL training (though it is unclear how large the effect is in practice)
  • To avoid this, we need to keep the token ID output of the vLLMEngines throughout, rather than keeping only the string output and re-tokenizing when needed
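
The toy example above can be reproduced in a few lines; the greedy longest-match tokenizer here is just a stand-in for real BPE tokenizers preferring longer merges:

```python
# Toy reproduction of the example above (assumed 4-token vocabulary).
vocab = {0: "<", 1: "search", 2: "<search", 3: ">"}
str_to_id = {s: i for i, s in vocab.items()}

def detokenize(ids):
    return "".join(vocab[i] for i in ids)

def retokenize(text):
    # Greedy longest match, standing in for BPE's preference for longer merges.
    ids = []
    while text:
        piece = max((s for s in str_to_id if text.startswith(s)), key=len)
        ids.append(str_to_id[piece])
        text = text[len(piece):]
    return ids

sampled = [0, 1, 3]                # what the LLM actually generated
text = detokenize(sampled)         # "<search>", what the engine returns
assert retokenize(text) == [2, 3]  # what the trainer would reconstruct from text
```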

