Merged
18 changes: 18 additions & 0 deletions examples/offline_inference/data_parallel.py
@@ -64,6 +64,18 @@ def parse_args():
parser.add_argument(
"--trust-remote-code", action="store_true", help="Trust remote code."
)
parser.add_argument(
"--max-num-seqs",
type=int,
default=64,
help=("Maximum number of sequences to be processed in a single iteration."),
)
Comment on lines +67 to +72

Member:
Are both of these args required to avoid the OOM? 64 is quite small for batch mode; it would be good if we could fix this just with the gpu_memory_utilization reduction...

Collaborator:
Yeah, it also seems that we need much more memory during initialization than before. I was about to investigate this further but didn't get time to do so. Wondering if @yewentao256 could dig further into this?

Member Author:
Yeah, I am happy to dig further, but what is the expected result here? To reduce the memory usage? I am afraid it is something of a tradeoff between speed and memory efficiency.
Basically, the original cause of this OOM issue is #18724, which I think is reasonable to adopt. @houseroad
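
To make the tradeoff under discussion concrete, here is a minimal single-GPU sketch, assuming vLLM's offline LLM API; the model name and the values are illustrative and not taken from this PR.

    from vllm import LLM, SamplingParams

    # First try the reviewer's suggestion: lower only gpu_memory_utilization
    # and leave max_num_seqs at the engine default.
    llm = LLM(
        model="ibm-research/PowerMoE-3b",  # illustrative model choice
        gpu_memory_utilization=0.8,  # cap vLLM's memory pool at 80% of the GPU
        enforce_eager=True,  # skip CUDA graph capture, saving memory at init
    )

    # If that still OOMs, additionally cap the per-step batch; this trades
    # throughput for memory headroom, which is the tradeoff mentioned above:
    # llm = LLM(model=..., gpu_memory_utilization=0.8, max_num_seqs=64)

    outputs = llm.generate(
        ["Hello, my name is"],
        SamplingParams(temperature=0.8, max_tokens=16),
    )
    print(outputs[0].outputs[0].text)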

parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=0.8,
help=("Fraction of GPU memory vLLM is allowed to allocate (0.0, 1.0]."),
)
return parser.parse_args()


@@ -77,6 +89,8 @@ def main(
GPUs_per_dp_rank,
enforce_eager,
trust_remote_code,
max_num_seqs,
gpu_memory_utilization,
):
os.environ["VLLM_DP_RANK"] = str(global_dp_rank)
os.environ["VLLM_DP_RANK_LOCAL"] = str(local_dp_rank)
@@ -127,6 +141,8 @@ def start(rank):
enforce_eager=enforce_eager,
enable_expert_parallel=True,
trust_remote_code=trust_remote_code,
max_num_seqs=max_num_seqs,
gpu_memory_utilization=gpu_memory_utilization,
)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
@@ -181,6 +197,8 @@ def start(rank):
tp_size,
args.enforce_eager,
args.trust_remote_code,
args.max_num_seqs,
args.gpu_memory_utilization,
),
)
proc.start()
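
For completeness, a hedged usage sketch that exercises only the two flags added by this diff; any other flags the script requires are omitted, since they do not appear in the hunks above.

    import subprocess

    # Run the patched example with the two new knobs; both flag names and
    # default values come directly from this diff.
    subprocess.run(
        [
            "python",
            "examples/offline_inference/data_parallel.py",
            "--max-num-seqs", "64",
            "--gpu-memory-utilization", "0.8",
        ],
        check=True,
    )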