[Core] Avoid list[int] in EngineCoreOutput for GC efficiency #29033
Conversation
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Code Review
This pull request effectively addresses garbage collection overhead by replacing list[int] with np.ndarray for new_token_ids in EngineCoreOutput. The changes are consistently applied across the scheduler, engine, output processor, and metrics components, which is great. My review includes one suggestion to further optimize performance by avoiding an unnecessary ndarray -> list -> ndarray conversion cycle in the scheduler, which could reduce overhead in this hot path.
  EngineCoreOutput(
      request_id=req_id,
-     new_token_ids=new_token_ids,
+     new_token_ids=np.array(new_token_ids),
While this change is correct, there's an opportunity to further improve performance by avoiding the np.ndarray -> list -> np.ndarray conversion cycle.
Currently, sampled_token_ids[req_index] (which is a numpy array) is converted to a list using .tolist(), then processed by _update_request_with_output, and finally converted back to a numpy array here.
Consider refactoring _update_request_with_output to operate directly on numpy arrays. This would eliminate the intermediate list conversion, reducing overhead.
A refactored _update_request_with_output might look like this:
def _update_request_with_output(
    self,
    request: Request,
    new_token_ids: np.ndarray,
) -> tuple[np.ndarray, bool]:
    stopped = False
    for num_new, output_token_id in enumerate(new_token_ids, 1):
        request.append_output_token_ids(int(output_token_id))
        stopped = check_stop(request, self.max_model_len)
        if stopped:
            # Truncate the array and break
            new_token_ids = new_token_ids[:num_new]
            break
    return new_token_ids, stopped

Then, in update_from_output, you could work with numpy arrays throughout, avoiding the .tolist() and np.array() calls.
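For illustration, the caller side described above could look roughly like this. This is a hedged sketch: sampled_token_ids, req_index, req_id, request, and outputs are assumed names from the surrounding scheduler code, and the remaining EngineCoreOutput fields are omitted.

# Hedged sketch of update_from_output keeping NumPy arrays end to end;
# variable names follow the review comment and are assumptions, not the
# exact vLLM implementation.
new_token_ids = sampled_token_ids[req_index]  # already an np.ndarray
if new_token_ids.size > 0:
    new_token_ids, stopped = self._update_request_with_output(
        request, new_token_ids
    )
    outputs.append(
        EngineCoreOutput(
            request_id=req_id,
            new_token_ids=new_token_ids,  # no .tolist()/np.array() round trip
            # ... other fields omitted ...
        )
    )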
💡 Codex Review
Here are some automated review suggestions for this pull request.
  )
-     new_token_ids = engine_core_output.new_token_ids
+     new_token_ids: list[int] = engine_core_output.new_token_ids.tolist()
Handle non-ndarray EngineCoreOutput token_ids
In process_outputs we now call engine_core_output.new_token_ids.tolist(), assuming every EngineCoreOutput carries a NumPy array. Several producers still emit plain Python lists (e.g., MockEngineCore.get_outputs in tests/v1/engine/utils.py builds EngineCoreOutput(new_token_ids=[...])), so when those outputs are processed—such as in the log-stats variants of the output-processor tests or any external mock engine—this line raises AttributeError before detokenization/logprob handling. Either convert producers to np.ndarray or accept list inputs here.
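One hedged way to do the latter, shown below as a hypothetical helper rather than the fix adopted in the PR:

import numpy as np


def _as_token_id_list(token_ids) -> list[int]:
    # Hypothetical helper: accept either a NumPy array or a plain Python list
    # from the producer and normalize to list[int] for downstream processing.
    if isinstance(token_ids, np.ndarray):
        return token_ids.tolist()
    return list(token_ids)


# Sketch of the call site in process_outputs:
# new_token_ids = _as_token_id_list(engine_core_output.new_token_ids)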
Will do
Purpose
There are <batch_size> EngineCoreOutput objects generated per decode batch. To keep GC overhead low, we should avoid holding Python container objects (e.g. list) inside EngineCoreOutput in large-batch-size scenarios.
This is a continued effort on top of #28245.
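To illustrate why the container type matters for GC (a standalone sketch, not code from this PR): a freshly created Python list is tracked by the cyclic collector, while a plain NumPy int array is not, so swapping the field type removes one tracked object per request per decode step.

import gc

import numpy as np

# A freshly created Python list is tracked by the cyclic garbage collector,
# so one list per EngineCoreOutput means <batch_size> extra tracked objects
# for every decode step.
print(gc.is_tracked([1, 2, 3]))             # True
# A NumPy int array stores its data in a C buffer and is not GC-tracked.
print(gc.is_tracked(np.array([1, 2, 3])))   # False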
Test Plan & Test Result
CI signals
As a follow-up, we should introduce e2e tests to ensure GC costs do not regress.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.