Fix pickle error with remote code models in vLLM Ray worker process #53815
Conversation
kouroshHakha left a comment:
just one nit, let's also rerun the release tests on this PR before merging.
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Why are these changes needed?
Since #53621, vLLM engines running DeepSeek-V2-Lite and related models that fetch remote code fail with pickle errors, because the engine only registers custom configs for serialization by value (which avoids the pickle error) if `transformers_modules` can be imported.

Previously we relied on a call to `AutoProcessor.from_pretrained` to initialize `transformers_modules`, so that `maybe_register_config_serialize_by_value` would execute correctly when AsyncLLM starts. That call was removed in #53621. Now we can use `init_hf_modules` to accomplish the same result more directly.

We could also fix this upstream with vllm-project/vllm#19510.
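The failure mode described above can be sketched with the standard library alone. This is an illustrative stand-in, not vLLM's actual code: the module name `transformers_modules_demo` and the `RemoteConfig` class are hypothetical. By default, pickle serializes a class instance *by reference* (storing only `module.ClassName`), so a Ray worker process that never created the dynamic module cannot unpickle it:

```python
import pickle
import sys
import types

# Simulate a dynamically created "remote code" module, analogous to the
# transformers_modules package that Hugging Face builds at runtime.
# (Module and class names here are hypothetical, for illustration only.)
mod = types.ModuleType("transformers_modules_demo")
exec("class RemoteConfig:\n    pass", mod.__dict__)
sys.modules["transformers_modules_demo"] = mod

cfg = mod.RemoteConfig()
data = pickle.dumps(cfg)  # succeeds: the module is importable right now

# In a fresh worker process the dynamic module was never created, so
# pickle's by-reference lookup fails on import. We simulate that here by
# removing the module before unpickling.
del sys.modules["transformers_modules_demo"]
try:
    pickle.loads(data)
except ImportError as exc:
    print("unpickle failed:", type(exc).__name__)
```

Registering such configs for serialization *by value* (as `maybe_register_config_serialize_by_value` does via cloudpickle) embeds the class definition in the payload instead of a module path, which is why the registration must run; the fix ensures `transformers_modules` exists early enough for that registration to fire.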
Traceback:
Related issue number
Checks
- I've signed off every commit (using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.