
Conversation


@Jialin Jialin commented Nov 19, 2025

Purpose

There are <batch_size> EngineCoreOutput objects generated per decode batch. To reduce GC pressure, we should avoid using GC-tracked container objects (e.g. list) inside EngineCoreOutput in large-batch-size scenarios.

This is a continued effort on top of #28245
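The motivation can be seen directly with gc.is_tracked: a plain Python list of ints must be traversed by the cyclic collector on every pass, while a fixed-dtype NumPy array is opaque to it. A minimal sketch (illustrative only, not vLLM code):

```python
import gc

import numpy as np

# A plain list holds PyObject* references, so the cyclic GC must traverse it.
token_ids = [1, 2, 3]
print(gc.is_tracked(token_ids))  # True

# An int64 ndarray stores raw machine integers; the collector skips it,
# so thousands of per-request outputs add no GC traversal cost.
arr = np.array(token_ids, dtype=np.int64)
print(gc.is_tracked(arr))
```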

Test Plan & Test Result

CI signals

As a follow-up, we should introduce e2e tests to ensure GC costs do not regress.
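Such a regression test could hinge on gc.callbacks, which fire on every collection pass. A hypothetical sketch (the helper name and the budget assertion are illustrative, not existing vLLM test utilities):

```python
import gc


def count_gc_passes(fn) -> int:
    """Run fn() and count how many GC passes started while it ran."""
    passes = 0

    def on_gc(phase, info):
        nonlocal passes
        if phase == "start":
            passes += 1

    gc.callbacks.append(on_gc)
    try:
        fn()
    finally:
        gc.callbacks.remove(on_gc)
    return passes


# An e2e test could then assert a decode loop stays under a GC budget, e.g.:
# assert count_gc_passes(run_decode_batch) == 0
```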


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses garbage collection overhead by replacing list[int] with np.ndarray for new_token_ids in EngineCoreOutput. The changes are consistently applied across the scheduler, engine, output processor, and metrics components, which is great. My review includes one suggestion to further optimize performance by avoiding an unnecessary ndarray -> list -> ndarray conversion cycle in the scheduler, which could reduce overhead in this hot path.

```diff
 EngineCoreOutput(
     request_id=req_id,
-    new_token_ids=new_token_ids,
+    new_token_ids=np.array(new_token_ids),
```


Severity: high

While this change is correct, there's an opportunity to further improve performance by avoiding the np.ndarray -> list -> np.ndarray conversion cycle.

Currently, sampled_token_ids[req_index] (which is a numpy array) is converted to a list using .tolist(), then processed by _update_request_with_output, and finally converted back to a numpy array here.

Consider refactoring _update_request_with_output to operate directly on numpy arrays. This would eliminate the intermediate list conversion, reducing overhead.

A refactored _update_request_with_output might look like this:

```python
def _update_request_with_output(
    self,
    request: Request,
    new_token_ids: np.ndarray,
) -> tuple[np.ndarray, bool]:
    stopped = False
    for num_new, output_token_id in enumerate(new_token_ids, 1):
        request.append_output_token_ids(int(output_token_id))
        stopped = check_stop(request, self.max_model_len)
        if stopped:
            # Truncate the array and break
            new_token_ids = new_token_ids[:num_new]
            break
    return new_token_ids, stopped
```

Then, in update_from_output, you could work with numpy arrays throughout, avoiding the .tolist() and np.array() calls.
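As a self-contained illustration of the truncate-on-stop shape this suggests (with a simplified stop-token check standing in for check_stop, and no vLLM types), slicing the ndarray avoids materializing per-token Python objects:

```python
import numpy as np


def take_until_stop(
    new_token_ids: np.ndarray, stop_id: int
) -> tuple[np.ndarray, bool]:
    """Keep tokens up to and including the first stop_id; report whether we stopped."""
    hits = np.flatnonzero(new_token_ids == stop_id)
    if hits.size:
        # Slicing returns a view; no intermediate list of Python ints is created.
        return new_token_ids[: hits[0] + 1], True
    return new_token_ids, False
```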


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


```diff
         )
-        new_token_ids = engine_core_output.new_token_ids
+        new_token_ids: list[int] = engine_core_output.new_token_ids.tolist()
```


P1: Handle non-ndarray EngineCoreOutput token_ids

In process_outputs we now call engine_core_output.new_token_ids.tolist(), assuming every EngineCoreOutput carries a NumPy array. Several producers still emit plain Python lists (e.g., MockEngineCore.get_outputs in tests/v1/engine/utils.py builds EngineCoreOutput(new_token_ids=[...])), so when those outputs are processed, such as in the log-stats variants of the output-processor tests or any external mock engine, this line raises AttributeError before detokenization/logprob handling. Either convert producers to np.ndarray or accept list inputs here.
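One defensive option is to normalize at the consumer. A sketch, assuming the helper name is hypothetical and not part of the PR:

```python
import numpy as np


def as_token_id_list(new_token_ids) -> list[int]:
    """Accept either the new np.ndarray payload or a legacy list[int] producer."""
    if isinstance(new_token_ids, np.ndarray):
        return new_token_ids.tolist()
    return list(new_token_ids)
```

This keeps the hot path (ndarray input) a single .tolist() call while legacy list producers still work unchanged.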


@Jialin (Collaborator, Author) replied:

Will do

@22quinn 22quinn enabled auto-merge (squash) November 19, 2025 22:05
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2025

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1



2 participants