[Core] Multiprocessing executor for single-node multi-GPU [1/2] #4345
Conversation
This introduces the `MultiProcGPUExecutor`, which uses multiprocessing for tensor parallelism as an alternative to Ray. This PR does not actually wire it up for use; that will be done in a follow-on PR.

This PR also includes some refactoring to simplify the executor class hierarchy:
- Add a `MultiGPUExecutor` abstract superclass shared between the Ray and vanilla multiprocessing implementations
- Add a `shutdown()` method to the `ExecutorBase` abstract class; the executor is shut down when the `LLMEngine` is garbage collected
- Simplify/centralize GPU `Worker` construction
- Move `ray_utils.py` from the `engine` package to the `executor` package (per @zhuohan123's suggestion)
- Move function call tracing setup to a utils function
- Fix various typing issues

This replaces #3466; see background in that issue.
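The hierarchy described above can be sketched roughly as follows. This is only an illustration of the shape of the refactor: the names `ExecutorBase`, `MultiGPUExecutor`, `MultiProcGPUExecutor`, and `shutdown()` come from the PR, but the queue-based worker protocol and every other detail here are assumptions, not vLLM's actual implementation.

```python
# Hypothetical sketch of the executor hierarchy; only the class names and
# shutdown() are from the PR, the rest is illustrative.
from abc import ABC, abstractmethod
import multiprocessing as mp


class ExecutorBase(ABC):
    @abstractmethod
    def shutdown(self) -> None:
        """Release worker resources (invoked when the engine is GC'd)."""


class MultiGPUExecutor(ExecutorBase, ABC):
    """Logic shared by the Ray and vanilla multiprocessing executors."""


def _worker_loop(rank: int, tasks: mp.Queue, results: mp.Queue) -> None:
    # A real worker would initialize its GPU / distributed group here.
    for task in iter(tasks.get, None):  # None acts as a shutdown sentinel
        results.put((rank, task))


class MultiProcGPUExecutor(MultiGPUExecutor):
    """Tensor-parallel executor using one OS process per rank (sketch)."""

    def __init__(self, world_size: int) -> None:
        self._shut_down = False
        self.task_queues = [mp.Queue() for _ in range(world_size)]
        self.result_queue = mp.Queue()
        self.procs = [
            mp.Process(target=_worker_loop,
                       args=(rank, q, self.result_queue), daemon=True)
            for rank, q in enumerate(self.task_queues)
        ]
        for p in self.procs:
            p.start()

    def shutdown(self) -> None:
        if self._shut_down:
            return
        self._shut_down = True
        for q in self.task_queues:
            q.put(None)  # ask each worker to exit its loop
        for p in self.procs:
            p.join()

    def __del__(self) -> None:
        # Mirrors "executor is shutdown when the LLMEngine is garbage
        # collected" from the PR description.
        self.shutdown()
```

The sentinel-plus-join pattern is one common way to give a process pool a deterministic `shutdown()`, which is the property the abstract method is meant to guarantee.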
@lru_cache(maxsize=None)
def get_distributed_init_method() -> str:
    ip = get_ip()
    port = get_open_port()
is it safe to cache here? what if we want to init distributed for two different groups, and we expect the function to return different results per call?
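The reviewer's concern can be demonstrated in isolation: `lru_cache` on a zero-argument function freezes its first result, so every subsequent caller receives the same `(ip, port)` pair even when a fresh rendezvous address is wanted. The helpers below are illustrative stand-ins for vLLM's `get_ip`/`get_open_port`, not the real implementations.

```python
# Stand-alone demo of the caching concern; get_open_port here is a
# hypothetical stand-in for vLLM's helper of the same name.
import socket
from functools import lru_cache


def get_open_port() -> int:
    # Ask the OS for any currently-free TCP port.
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]


@lru_cache(maxsize=None)
def get_distributed_init_method() -> str:
    return f"tcp://127.0.0.1:{get_open_port()}"


addr1 = get_distributed_init_method()
addr2 = get_distributed_init_method()
# addr1 == addr2: the cached result is reused, so a second distributed
# group could never obtain its own rendezvous address through this
# function while the cache is in place.
```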
vllm/utils.py
@@ -607,3 +614,15 @@ def find_nccl_library():
        raise ValueError("NCCL only supports CUDA and ROCm backends.")
    logger.info(f"Found nccl from library {so_file}")
    return so_file


def enable_trace_function_call_for_process():
actually this is per-thread. the naming is not accurate.
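The distinction the reviewer is pointing at is visible with the standard library alone: `sys.settrace` installs a trace function for the calling thread only, while `threading.settrace` covers threads started afterwards. A small stand-alone demonstration (not vLLM code):

```python
# Demonstrates that trace functions are per-thread, not per-process.
import sys
import threading

calls: list[str] = []


def tracer(frame, event, arg):
    # Record every Python-level function call made in the current thread.
    if event == "call":
        calls.append(
            f"{threading.current_thread().name}:{frame.f_code.co_name}")
    return None  # no per-line tracing needed for this demo


def traced_target():
    pass


sys.settrace(tracer)      # installs the tracer for THIS thread only
traced_target()           # recorded as "MainThread:traced_target"
t = threading.Thread(target=traced_target, name="worker")
t.start()
t.join()                  # calls made in "worker" are NOT recorded
sys.settrace(None)

threading.settrace(tracer)  # covers threads started from now on
t2 = threading.Thread(target=traced_target, name="worker2")
t2.start()
t2.join()                   # now "worker2:traced_target" IS recorded
threading.settrace(None)
```

This is why a name like `enable_trace_function_call_for_process` overstates what a single `sys.settrace` call achieves.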
To be honest, I feel like this PR is still too large to be effectively reviewed. I suggest breaking it down into pieces, each PR having about 100~200 lines of change, so that we can have a quick feedback loop.
@youkaichao here is the first one: #4347
Next one, just introducing the new multi-gpu abstract executor class: #4348
Next one, add shutdown method to ExecutorBase: #4349
Let's keep 1~2 PRs per day to avoid zhuohan and me being overwhelmed?
@youkaichao these are mostly the contents of the original PR that was already reviewed and discussed, just broken up (and mostly very small). Please feel free to review at whatever pace you're comfortable with! Next one is just moving the function tracing setup to a util function: #4352
Closing this in favor of a set of more granular PRs, some already opened (linked above). |