
bug: Router Health Check has Synchronously blocking calls (will fail liveness on engine scaling) #431

@TheCodeWrangler


Describe the bug

In practice this was discovered while running tests that scaled backends up and down to verify that traffic was balanced across the new backends. Whenever the engine pods scaled, the router deployment began failing its readiness and health checks and was killed by the k8s controller. My theory is based on the numerous requests.get/post calls (synchronous, blocking the thread) throughout the code base that live inside async def functions.

When the service discovery mode is set to 'k8s' and backend engines scale down or become unresponsive, the background threads responsible for Kubernetes service discovery and engine statistics scraping can hang due to synchronous, blocking HTTP calls. The /health check endpoint in src/vllm_router/routers/main_router.py relies on the liveness of these threads (by calling get_service_discovery().get_health() and get_engine_stats_scraper().get_health()). If these threads hang or die because their internal blocking calls take too long (e.g., due to timeouts when trying to reach scaled-down/dead backends), the Kubernetes liveness probes for the /health endpoint can eventually fail, leading to the router pod being killed unnecessarily. While the get_health() calls themselves are simple thread liveness checks, the underlying problem is the blocking nature of operations within those threads.

To Reproduce

  1. Configure the vLLM router with service_discovery mode set to k8s.
  2. Deploy backend engines that are discoverable via Kubernetes.
  3. Simulate a scenario where one or more backend engines scale down or become unresponsive.
  4. Continuously poll the /health endpoint of the router (see the polling sketch after this list).
  5. Observe that requests to /health may hang or take an extended period to respond if backend services are slow or unresponsive.
  6. If the time taken exceeds the Kubernetes liveness probe timeout, the router pod will be restarted.
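
A minimal polling sketch for steps 4-6 (the router URL and probe cadence are assumptions; adjust them to your deployment):

import time
import requests

ROUTER_HEALTH_URL = "http://localhost:8000/health"  # assumed router address

# Poll /health once per second and log how long each probe takes. In a real
# deployment, any probe slower than the liveness probe's timeoutSeconds would
# count as a failure and, after failureThreshold misses, restart the pod.
while True:
    start = time.monotonic()
    try:
        result = requests.get(ROUTER_HEALTH_URL, timeout=10).status_code
    except requests.RequestException as exc:
        result = f"error: {exc}"
    print(f"/health -> {result} in {time.monotonic() - start:.2f}s")
    time.sleep(1)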

Expected behavior

The /health endpoint should respond quickly. The background threads for service discovery and stats scraping should use non-blocking (asynchronous) HTTP requests with appropriate timeouts when communicating with backend engines. This will prevent these threads from hanging or dying when backends are slow or unresponsive, ensuring the main /health check remains responsive and accurately reflects the router's operational status.

Additional context

The main /health endpoint in src/vllm_router/routers/main_router.py calls:

  • get_service_discovery().get_health()
  • get_engine_stats_scraper().get_health()

These methods themselves are non-blocking thread liveness checks (e.g., return self.watcher_thread.is_alive()).
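
For reference, those liveness checks are roughly of this shape (a paraphrase of the description above, not an exact copy of the source):

def get_health(self) -> bool:
    # Thread-liveness check only: True while the background watcher thread is
    # still running. It performs no I/O itself, so it is cheap, but it also
    # cannot distinguish a healthy thread from one that is alive yet stuck in
    # a blocking requests.get call.
    return self.watcher_thread.is_alive()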

The actual blocking HTTP calls happen within the background threads managed by these services:

  1. For K8s Service Discovery (K8sServiceDiscovery in src/vllm_router/service_discovery.py):

    • The _watch_engines method (running in self.watcher_thread) calls _get_model_name.
    • Inside _get_model_name(self, pod_ip):
      # ...
      response = requests.get(url, headers=headers) # BLOCKING CALL
      # ...

    This call to requests.get(f"http://{pod_ip}:{self.port}/v1/models", ...) passes no timeout, so it can block indefinitely if a pod is unresponsive (a timeout-bounded sketch follows this list).

  2. For Engine Stats Scraper (EngineStatsScraper in src/vllm_router/stats/engine_stats.py):

    • The _scrape_worker method (running in self.scrape_thread) calls _scrape_metrics.
    • _scrape_metrics calls _scrape_one_endpoint(self, url).
    • Inside _scrape_one_endpoint(self, url):
      # ...
      response = requests.get(url + "/metrics", timeout=self.scrape_interval) # BLOCKING CALL
      # ...

    This call to requests.get(url + "/metrics", ...) can block if a backend's metrics endpoint is unresponsive.
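
As noted in item 1 above, the _get_model_name call passes no timeout at all. A minimal interim mitigation, sketched here with assumed timeout values and simplified surrounding code, is to bound the call and treat failures as "pod not ready" instead of hanging:

import requests

def _get_model_name(self, pod_ip):
    url = f"http://{pod_ip}:{self.port}/v1/models"
    try:
        # (connect, read) timeouts are illustrative values, not existing config
        response = requests.get(url, timeout=(1.0, 2.0))
        response.raise_for_status()
    except requests.RequestException:
        return None  # unresponsive pod: skip it instead of blocking the watcher
    # /v1/models returns an OpenAI-style model list; take the first model id
    return response.json()["data"][0]["id"]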

Relevant code snippet from src/vllm_router/routers/main_router.py (showing the non-blocking checks):

The FastAPI endpoint is marked as "async" when in fact it consists of blocking synchronous calls.

@main_router.get("/health")
async def health() -> Response:
    ...

    if not get_service_discovery().get_health(): # Checks watcher_thread.is_alive()
        return JSONResponse(
            content={"status": "Service discovery module is down."}, status_code=503
        )
    if not get_engine_stats_scraper().get_health(): # Checks scrape_thread.is_alive()
        return JSONResponse(
            content={"status": "Engine stats scraper is down."}, status_code=503
        )

    # ... rest of the function

The initialization of these components happens in src/vllm_router/app.py. The core issue is the synchronous nature of requests.get within the background threads of these services when backends are unhealthy.

Recommendations

The identified blocking requests.get calls within K8sServiceDiscovery._get_model_name and EngineStatsScraper._scrape_one_endpoint should be re-implemented with an asynchronous HTTP client (such as httpx or aiohttp) so that these background threads cannot block, hang, or crash, which in turn degrades the reliability of the /health endpoint.
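
A rough illustration of that direction, using httpx (the function names, timeout values, and concurrent-gather structure are assumptions for the sketch, not existing project APIs):

import asyncio
from typing import Optional

import httpx

async def fetch_metrics(client: httpx.AsyncClient, url: str) -> Optional[str]:
    # Async, timeout-bounded replacement for the blocking requests.get call.
    try:
        response = await client.get(url + "/metrics", timeout=5.0)
        response.raise_for_status()
        return response.text
    except httpx.HTTPError:
        return None  # an unresponsive backend no longer stalls the scraper

async def scrape_all(urls: list[str]) -> dict[str, Optional[str]]:
    # One shared client; endpoints are scraped concurrently, so a single slow
    # backend costs at most its own timeout rather than stalling the whole loop.
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch_metrics(client, u) for u in urls))
    return dict(zip(urls, results))

Alternatively, if the surrounding code must stay synchronous, keeping requests but always passing an explicit timeout (as in the earlier sketch) already prevents indefinite hangs.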
