[BFCL] Fix Hanging Inference for OSS Models on GPU Platforms #663
This PR addresses issues encountered when running locally-hosted models on GPU-renting platforms (e.g., Lambda Cloud). Specifically, output from `vllm` was not displayed correctly because these models are launched via subprocesses. Additionally, some multi-turn functions (such as `xargs`) rely on subprocesses themselves, which caused inference on certain test entries (such as `multi_turn_36`) to hang indefinitely, halting the pipeline. To fix this, the terminal logging logic has been updated to use a separate thread for reading from the subprocess pipe and printing to the terminal.
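For illustration, here is a minimal sketch of the threaded pipe-reading approach. The function names (`_stream_subprocess_output`, `launch_with_live_logging`) are hypothetical and not taken from the PR; the point is that a dedicated daemon thread keeps the child process's output pipe drained so it can never block on a full buffer:

```python
import subprocess
import sys
import threading


def _stream_subprocess_output(pipe):
    # Read the subprocess pipe line by line and echo to the terminal.
    # Draining the pipe continuously prevents the child process from
    # blocking once the OS pipe buffer fills up.
    for line in iter(pipe.readline, ""):
        sys.stdout.write(line)
        sys.stdout.flush()
    pipe.close()


def launch_with_live_logging(cmd):
    # Launch the model-serving subprocess (e.g., a vllm server) with
    # line-buffered text pipes, merging stderr into stdout.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,
    )
    # Daemon thread: it exits automatically with the main process,
    # so a hung reader cannot keep the pipeline alive.
    reader = threading.Thread(
        target=_stream_subprocess_output,
        args=(process.stdout,),
        daemon=True,
    )
    reader.start()
    return process, reader
```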
Also, for readability, the `_format_prompt` function has been moved to the "Prompting methods" section; this does not change the leaderboard score.