
[serve][llm] Generalize DP Fix for LMCache Port Conflicts #57757

@nrghosh

Description

[Serve.LLM] Port collisions with TP/PP when using NIXL/LMCache KV transfer backends

What happened + What you expected to happen

Reference: PR #55802 partially addressed port collisions for Data Parallelism by setting the NIXL side-channel port to base_port + data_parallel_rank.

Issue: With TP/PP and multi-replica deployments, multiple vLLM processes on the same node concurrently probe for open ports via get_open_port() and can select the same one, which leads to binding conflicts (a minimal sketch of this race follows the list below).

This manifests as:

  • Flaky startups and deployment failures
  • Address already in use errors from ZMQ
  • NIXL_ERR_BACKEND errors
  • Stuck initialization when scaling replicas or increasing TP/PP on shared nodes
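A minimal sketch of that race using plain Python sockets (illustrative only; it mimics the probe-then-bind pattern that get_open_port() relies on, not the actual vLLM code):

import socket

def probe_open_port() -> int:
    # Bind to port 0 so the OS picks a free port, read it back, then close
    # the socket. Nothing reserves the port after this function returns.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# If two workers on the same node probe before either one re-binds the port
# (e.g. via ZMQ/NIXL), both can be handed the same number and the second
# bind fails with "Address already in use".
port_a = probe_open_port()
port_b = probe_open_port()
print(port_a, port_b)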

Expected: Each worker should receive a unique port to avoid collisions across all parallelism strategies (DP/TP/PP).

Symptoms

  1. Port collisions: Multiple workers use identical ports

    Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910 [repeated 3x across cluster]
    
  2. NIXL backend errors:

    nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND
    nixl_agent.cpp:481] registerMem: registration failed
    
  3. Deployment failures: Replicas fail to initialize and continuously restart

Root Cause

PR #55802 partially fixed port collisions for DP by adding logic to use data_parallel_rank:

dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank

However, this only works when data_parallel_rank is explicitly set by DPServer, which only occurs in DP deployments.

For TP/PP deployments:

  • data_parallel_rank is not set (or defaults to 0)
  • All workers use offset 0 → same port for all workers
  • Port collision occurs when multiple workers initialize on the same node

Current code (nixl_connector.py):

dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank  # Always 0 for TP/PP!
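As a hypothetical sketch (not the actual connector code), a generalized fix would derive the offset from every parallelism dimension rather than the DP rank alone; the helper name and rank plumbing below are assumptions for illustration:

def side_channel_port(base_port: int,
                      dp_rank: int,
                      tp_size: int,
                      pp_size: int,
                      worker_index: int) -> int:
    # Hypothetical helper: each engine replica owns tp_size * pp_size workers,
    # so reserve a contiguous block of ports per DP rank and give every
    # TP/PP worker its own slot inside that block.
    workers_per_replica = tp_size * pp_size
    return base_port + dp_rank * workers_per_replica + worker_index

# Example: with tp_size=1 and pp_size=2, the two PP workers receive
# base_port and base_port + 1 instead of both landing on base_port + 0.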

Reproduction

Environment

  • Ray: 3.0.0.dev0 (nightly)
  • Python: 3.11.11
  • Cluster: 8-GPU head node

Minimal Config (serve_config.yaml)

applications:
  - name: test-pp2-nixl
    import_path: ray.serve.llm:build_openai_app
    route_prefix: /
    args:
      llm_configs:
        - model_loading_config:
            model_id: facebook/opt-125m
          engine_kwargs:
            pipeline_parallel_size: 2
            tensor_parallel_size: 1
            max_num_seqs: 4
            enforce_eager: true
            kv_transfer_config:
              kv_connector: NixlConnector
              kv_role: kv_both
          deployment_config:
            autoscaling_config:
              min_replicas: 1
              max_replicas: 1
            ray_actor_options:
              num_cpus: 4
              num_gpus: 0

Steps to Reproduce

  1. Deploy the application:

    serve run serve_config.yaml
  2. Check logs for port collision:

    ray logs --grep "Creating v1 connector"
  3. Observe that all workers use the same port:

    Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910
    Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910 [repeated 3x]
    
  4. Check for NIXL errors:

    ray logs --grep "NIXL_ERR_BACKEND"

Versions / Dependencies

  • Ray: 2.47+ (nightly 3.0.0.dev0 tested)
  • vLLM: 0.10+
  • Python: 3.11+

Ways To Reproduce Issue

Option 1 - NIXL with Pipeline Parallelism:
Set num_replicas=2 and pipeline_parallel_size>=2 (or tensor_parallel_size>=2) with kv_transfer_config={'kv_connector': 'NixlConnector', 'kv_role': 'kv_both'}; without port disambiguation, workers on the same node hit bind conflicts.

Option 2 - LMCache with numeric port:
Set kv_connector_extra_config={'lmcache_rpc_port': 5555} with num_replicas>=2; without port disambiguation, the second replica fails with ZMQ EADDRINUSE.
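
A hedged Python equivalent of Option 2 is sketched below, assuming the ray.serve.llm LLMConfig / build_openai_app API and vLLM's LMCacheConnectorV1 connector name; adjust field and connector names to your Ray/vLLM versions:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Two replicas sharing one node, both told to use the same fixed LMCache
# RPC port -> the second replica hits ZMQ EADDRINUSE.
llm_config = LLMConfig(
    model_loading_config=dict(model_id="facebook/opt-125m"),
    engine_kwargs=dict(
        tensor_parallel_size=1,
        enforce_eager=True,
        kv_transfer_config=dict(
            kv_connector="LMCacheConnectorV1",  # connector name assumed; check your vLLM version
            kv_role="kv_both",
            kv_connector_extra_config={"lmcache_rpc_port": 5555},
        ),
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=2, max_replicas=2),
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)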

Impact

Without fix:

  • TP/PP deployments with NIXL/LMCache fail to start
  • Flaky deployments with intermittent port collisions
  • Impossible to scale replicas reliably

With fix:

  • All parallelism strategies (DP/TP/PP) work correctly with unique ports per worker
  • Reliable scaling and deployment
  • No manual port management required

Workaround

Currently, users must manually specify unique ports per worker using NIXL_SIDE_CHANNEL_PORT_BASE in experimental_configs, which is cumbersome and error-prone for multi-worker deployments.
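For reference, a rough sketch of that workaround (the experimental_configs key comes from the description above; the port value is arbitrary and must be kept distinct per replica by hand):

from ray.serve.llm import LLMConfig

# Manually reserve a non-overlapping port range for this deployment; with
# multiple replicas or workers this bookkeeping has to be done by hand.
llm_config = LLMConfig(
    model_loading_config=dict(model_id="facebook/opt-125m"),
    engine_kwargs=dict(
        pipeline_parallel_size=2,
        kv_transfer_config=dict(kv_connector="NixlConnector", kv_role="kv_both"),
    ),
    experimental_configs={"NIXL_SIDE_CHANNEL_PORT_BASE": 6000},  # key name taken from the issue text
)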

Issue Severity

High - Blocks TP/PP deployments with KV transfer backends

Labels

bug · llm · serve · stability · triage
