Fix TRT-LLM Multigpu Compatibility #2837
[WIP] What does this PR do?
We need Composer to run our evaluation framework on TRT-LLM models, but this currently breaks in the multi-GPU case. These fixes let Composer run N copies in parallel and feed data in a way that is compatible with multi-GPU TRT-LLM models. Concretely, the changes are (a) not initializing dist, since TRT-LLM manages its own multi-GPU communication, and (b) fixing some race conditions in data loading, where concurrent ranks stepped on each other while preparing the dataset.
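A minimal sketch of the two fixes, for illustration only; `TRTLLM_INDEPENDENT_RANKS`, `maybe_init_dist`, `prepare_dataset`, and `download_fn` are hypothetical names, not the actual PR diff:

```python
# Hypothetical sketch, not the PR code: (a) skip dist init when each rank
# runs an independent TRT-LLM copy, (b) serialize dataset preparation so
# concurrent ranks on one node do not race.
import os

import torch.distributed as dist
from filelock import FileLock  # third-party: pip install filelock

# Hypothetical env var controlling whether ranks run independently.
RUN_INDEPENDENT = os.environ.get("TRTLLM_INDEPENDENT_RANKS", "0") == "1"


def maybe_init_dist() -> None:
    """(a) Only initialize the process group when we actually need it."""
    if RUN_INDEPENDENT:
        # TRT-LLM handles its own multi-GPU communication; initializing
        # torch.distributed here would conflict with it.
        return
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")


def prepare_dataset(cache_dir: str, download_fn) -> None:
    """(b) Guard dataset download with a file lock plus a done-marker,
    so only one rank downloads and the others wait, then reuse the cache."""
    os.makedirs(cache_dir, exist_ok=True)
    with FileLock(os.path.join(cache_dir, ".download.lock")):
        marker = os.path.join(cache_dir, ".done")
        if not os.path.exists(marker):
            download_fn(cache_dir)  # caller-supplied download function
            open(marker, "w").close()
```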
TODO: