[BUG]: NATS request-plane race condition on registration maybe #4753

@grahamking

Describe the Bug

On latest main, worker startup fails when the request plane is NATS: registering the endpoint aborts with "Service 'dynamo_backend' not found in registry".

I have seen it work occasionally, so this is likely a race condition.

Steps to Reproduce

Clear etcd, restart NATS, then run this:

python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B

It produces this output:

2025-12-04T14:55:38.090064Z  INFO main.init: Registering model with endpoint types: chat,completions
2025-12-04T14:55:38.090222Z DEBUG dynamo_runtime::local_endpoint_registry: Registering local endpoint: generate
2025-12-04T14:55:38.090253Z DEBUG dynamo_runtime::component::endpoint: Registered engine for endpoint 'generate' in local registry
2025-12-04T14:55:38.090457Z  INFO register._get_runtime_config: Got total KV blocks from scheduler: 750229 (max_total_tokens=750229, page_size=1)
2025-12-04T14:55:38.090712Z DEBUG dynamo_runtime::component::endpoint: Starting endpoint: dynamo/backend/generate
2025-12-04T14:55:38.098016Z DEBUG dynamo_llm::local_model: Registering MDC at path: dynamo/backend/generate/694d9adb9d05322d
2025-12-04T14:55:38.098130Z DEBUG dynamo_runtime::discovery::kv_store: KVStoreDiscovery::register: Registering base model instance_id=7587891215212032557, namespace=dynamo, component=backend, endpoint=generate, key=dynamo/backend/generate/694d9adb9d05322d
2025-12-04T14:55:38.098188Z DEBUG dynamo_runtime::discovery::kv_store: KVStoreDiscovery::register: Serialized instance to 1584 bytes for key=dynamo/backend/generate/694d9adb9d05322d
2025-12-04T14:55:38.098196Z DEBUG dynamo_runtime::discovery::kv_store: KVStoreDiscovery::register: Getting/creating bucket=v1/mdc for key=dynamo/backend/generate/694d9adb9d05322d
2025-12-04T14:55:38.098212Z DEBUG dynamo_runtime::discovery::kv_store: KVStoreDiscovery::register: Inserting into bucket=v1/mdc, key=dynamo/backend/generate/694d9adb9d05322d
2025-12-04T14:55:38.098543Z  INFO dynamo_llm::kv_router::publisher: Registered KvStats Prometheus metrics
2025-12-04T14:55:38.098540Z  INFO dynamo_runtime::component::endpoint: Endpoint starting with request plane mode: nats
2025-12-04T14:55:38.098661Z DEBUG dynamo_runtime::component::endpoint: Registering endpoint health check target endpoint_name=generate
2025-12-04T14:55:38.098802Z DEBUG dynamo_runtime::component::endpoint: Registering endpoint 'generate' with graceful shutdown tracker
2025-12-04T14:55:38.098846Z DEBUG dynamo_runtime::utils::graceful_shutdown: Endpoint registered, total active: 0 -> 1
2025-12-04T14:55:38.098897Z  INFO dynamo_runtime::pipeline::network::manager: Creating NATS request plane server
2025-12-04T14:55:38.098991Z  INFO dynamo_runtime::component::endpoint: Registering endpoint with request plane server endpoint=generate transport="nats"
2025-12-04T14:55:38.099046Z  INFO dynamo_runtime::pipeline::network::ingress::nats_server: NatsMultiplexedServer::register_endpoint called endpoint_name=generate namespace=dynamo component=backend instance_id=7587891215212032557
2025-12-04T14:55:38.099106Z DEBUG dynamo_runtime::pipeline::network::ingress::nats_server: Looking up service group in registry service_name_raw=dynamo_backend service_name=dynamo_backend
2025-12-04T14:55:38.099543Z ERROR main.init: Failed to serve endpoints: Service 'dynamo_backend' not found in registry
2025-12-04T14:55:38.099592Z  INFO main.init: Metrics task succesfully cancelled
2025-12-04T14:55:38.100186Z  INFO dynamo_runtime::distributed: Added NATS service dynamo_backend
2025-12-04T14:55:38.107282Z  INFO decode_handler.cleanup: Engine shutdown
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/grahamk/src/dynamo/main/components/src/dynamo/sglang/__main__.py", line 7, in <module>
    main()
  File "/home/grahamk/src/dynamo/main/components/src/dynamo/sglang/main.py", line 530, in main
    uvloop.run(worker())
  File "/home/grahamk/venv/sglang-0.5.4/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/grahamk/venv/sglang-0.5.4/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/grahamk/src/dynamo/main/components/src/dynamo/sglang/main.py", line 97, in worker
    await init(runtime, config)
  File "/home/grahamk/src/dynamo/main/components/src/dynamo/sglang/main.py", line 175, in init
    await asyncio.gather(
Exception: Service 'dynamo_backend' not found in registry
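
The ordering in the log is consistent with a race: the "not found in registry" error is logged before the "Added NATS service dynamo_backend" line. A small sketch to measure that gap from the two log lines above (the helper name parse_ts is only for illustration):

```python
from datetime import datetime, timedelta

def parse_ts(line: str) -> datetime:
    """Parse the leading RFC 3339 timestamp of a log line."""
    stamp = line.split()[0]
    # Replace the trailing 'Z' so older Python versions can parse it too.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))

# Both lines are copied from the output above.
error_line = ("2025-12-04T14:55:38.099543Z ERROR main.init: Failed to serve "
              "endpoints: Service 'dynamo_backend' not found in registry")
added_line = ("2025-12-04T14:55:38.100186Z  INFO dynamo_runtime::distributed: "
              "Added NATS service dynamo_backend")

gap = parse_ts(added_line) - parse_ts(error_line)
gap_us = gap // timedelta(microseconds=1)

if __name__ == "__main__":
    print(f"service was added {gap_us} microseconds after the lookup failed")
```

So the service appears in the registry well under a millisecond after the lookup gave up, which would also explain why startup occasionally succeeds.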

It's definitely NATS:

  • --store-kv etcd --request-plane tcp works
  • --store-kv file --request-plane nats fails
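
To illustrate the suspected ordering problem, here is a minimal asyncio sketch. It is not Dynamo's code: ServiceRegistry, add_service_later, and the 10 ms delay are hypothetical stand-ins, and waiting on the registry is only one possible fix, not necessarily the right one for the runtime.

```python
import asyncio

class ServiceRegistry:
    """Hypothetical stand-in for the service registry (not Dynamo's API)."""

    def __init__(self):
        self._services = {}
        self._changed = asyncio.Condition()

    async def add(self, name):
        async with self._changed:
            self._services[name] = object()
            self._changed.notify_all()

    def lookup(self, name):
        # Fails immediately if the service has not been added yet.
        if name not in self._services:
            raise LookupError(f"Service '{name}' not found in registry")
        return self._services[name]

    async def wait_for(self, name, timeout=5.0):
        # Blocks until the service is present, or raises on timeout.
        async with self._changed:
            await asyncio.wait_for(
                self._changed.wait_for(lambda: name in self._services),
                timeout,
            )
        return self._services[name]

async def add_service_later(registry, name, delay):
    # Simulates the service being added by a concurrent task.
    await asyncio.sleep(delay)
    await registry.add(name)

async def racy_startup():
    registry = ServiceRegistry()
    adder = asyncio.create_task(add_service_later(registry, "dynamo_backend", 0.01))
    try:
        registry.lookup("dynamo_backend")  # runs before the adder finishes
        return "ok"
    except LookupError as exc:
        return str(exc)
    finally:
        await adder

async def ordered_startup():
    registry = ServiceRegistry()
    adder = asyncio.create_task(add_service_later(registry, "dynamo_backend", 0.01))
    await registry.wait_for("dynamo_backend")  # waits for the adder
    await adder
    return "ok"

if __name__ == "__main__":
    print(asyncio.run(racy_startup()))
    print(asyncio.run(ordered_startup()))
```

In the sketch, the immediate lookup fails with the same message as above, while the variant that waits for the service to be added succeeds.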

Metadata

Labels

bug (Something isn't working)
