
bug: Router Health Check has Synchronously blocking calls (will fail liveness on engine scaling) #431

@TheCodeWrangler


Describe the bug

In practice this was discovered while running tests that scaled backends up and down to verify that traffic was balanced across the new backends. Whenever the engine pods scaled, the router deployment began failing its readiness and health checks and was killed by the k8s controller. My theory is based on the numerous requests.get/post calls (synchronous, blocking the thread) throughout the code base that live inside async def functions.

When the service discovery mode is set to 'k8s' and backend engines scale down or become unresponsive, the background threads responsible for Kubernetes service discovery and engine statistics scraping can hang due to synchronous, blocking HTTP calls. The /health check endpoint in src/vllm_router/routers/main_router.py relies on the liveness of these threads (by calling get_service_discovery().get_health() and get_engine_stats_scraper().get_health()). If these threads hang or die because their internal blocking calls take too long (e.g., due to timeouts when trying to reach scaled-down/dead backends), the Kubernetes liveness probes for the /health endpoint can eventually fail, leading to the router pod being killed unnecessarily. While the get_health() calls themselves are simple thread liveness checks, the underlying problem is the blocking nature of operations within those threads.

To Reproduce

  1. Configure the vLLM router with service_discovery mode set to k8s.
  2. Deploy backend engines that are discoverable via Kubernetes.
  3. Simulate a scenario where one or more backend engines scale down or become unresponsive.
  4. Continuously poll the /health endpoint of the router (see the polling sketch after this list).
  5. Observe that requests to /health may hang or take an extended period to respond if backend services are slow or unresponsive.
  6. If the time taken exceeds the Kubernetes liveness probe timeout, the router pod will be restarted.
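
A minimal polling sketch for steps 4-6 (the router URL and probe cadence are assumptions; adjust them to your deployment):

import time
import requests

ROUTER_HEALTH_URL = "http://localhost:8000/health"  # assumed router address

# Poll /health once per second and log how long each probe takes. In a real
# deployment, any probe slower than the liveness probe's timeoutSeconds would
# count as a failure and, after failureThreshold misses, restart the pod.
while True:
    start = time.monotonic()
    try:
        result = requests.get(ROUTER_HEALTH_URL, timeout=10).status_code
    except requests.RequestException as exc:
        result = f"error: {exc}"
    print(f"/health -> {result} in {time.monotonic() - start:.2f}s")
    time.sleep(1)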

Expected behavior

The /health endpoint should respond quickly. The background threads for service discovery and stats scraping should use non-blocking (asynchronous) HTTP requests with appropriate timeouts when communicating with backend engines. This will prevent these threads from hanging or dying when backends are slow or unresponsive, ensuring the main /health check remains responsive and accurately reflects the router's operational status.

Additional context

The main /health endpoint in src/vllm_router/routers/main_router.py calls:

  • get_service_discovery().get_health()
  • get_engine_stats_scraper().get_health()

These methods themselves are non-blocking thread liveness checks (e.g., return self.watcher_thread.is_alive()).
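
For reference, those liveness checks are roughly of this shape (a paraphrase of the description above, not an exact copy of the source):

def get_health(self) -> bool:
    # Thread-liveness check only: True while the background watcher thread is
    # still running. It performs no I/O itself, so it is cheap, but it also
    # cannot distinguish a healthy thread from one that is alive yet stuck in
    # a blocking requests.get call.
    return self.watcher_thread.is_alive()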

The actual blocking HTTP calls happen within the background threads managed by these services:

  1. For K8s Service Discovery (K8sServiceDiscovery in src/vllm_router/service_discovery.py):

    • The _watch_engines method (running in self.watcher_thread) calls _get_model_name.
    • Inside _get_model_name(self, pod_ip):
      # ...
      response = requests.get(url, headers=headers) # BLOCKING CALL
      # ...

    This call to requests.get(f"http://{pod_ip}:{self.port}/v1/models", ...) passes no timeout, so it can block indefinitely if a pod is unresponsive (a timeout-bounded sketch follows this list).

  2. For Engine Stats Scraper (EngineStatsScraper in src/vllm_router/stats/engine_stats.py):

    • The _scrape_worker method (running in self.scrape_thread) calls _scrape_metrics.
    • _scrape_metrics calls _scrape_one_endpoint(self, url).
    • Inside _scrape_one_endpoint(self, url):
      # ...
      response = requests.get(url + "/metrics", timeout=self.scrape_interval) # BLOCKING CALL
      # ...

    This call to requests.get(url + "/metrics", ...) can block if a backend's metrics endpoint is unresponsive.
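
As noted in item 1 above, the _get_model_name call passes no timeout at all. A minimal interim mitigation, sketched here with assumed timeout values and simplified surrounding code, is to bound the call and treat failures as "pod not ready" instead of hanging:

import requests

def _get_model_name(self, pod_ip):
    url = f"http://{pod_ip}:{self.port}/v1/models"
    try:
        # (connect, read) timeouts are illustrative values, not existing config
        response = requests.get(url, timeout=(1.0, 2.0))
        response.raise_for_status()
    except requests.RequestException:
        return None  # unresponsive pod: skip it instead of blocking the watcher
    # /v1/models returns an OpenAI-style model list; take the first model id
    return response.json()["data"][0]["id"]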

Relevant code snippet from src/vllm_router/routers/main_router.py (showing the non-blocking checks):

The FastAPI endpoint is marked as "async" when in fact it consists of blocking synchronous calls.

@main_router.get("/health")
async def health() -> Response:
    ...

    if not get_service_discovery().get_health(): # Checks watcher_thread.is_alive()
        return JSONResponse(
            content={"status": "Service discovery module is down."}, status_code=503
        )
    if not get_engine_stats_scraper().get_health(): # Checks scrape_thread.is_alive()
        return JSONResponse(
            content={"status": "Engine stats scraper is down."}, status_code=503
        )

    # ... rest of the function

The initialization of these components happens in src/vllm_router/app.py. The core issue is the synchronous nature of requests.get within the background threads of these services when backends are unhealthy.

Recommendations

The identified blocking requests.get calls within K8sServiceDiscovery._get_model_name and EngineStatsScraper._scrape_one_endpoint should be re-implemented with an asynchronous HTTP client (such as httpx or aiohttp) so that these background threads cannot block, hang, or crash, which in turn degrades the reliability of the /health endpoint.
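
A rough illustration of that direction, using httpx (the function names, timeout values, and concurrent-gather structure are assumptions for the sketch, not existing project APIs):

import asyncio
from typing import Optional

import httpx

async def fetch_metrics(client: httpx.AsyncClient, url: str) -> Optional[str]:
    # Async, timeout-bounded replacement for the blocking requests.get call.
    try:
        response = await client.get(url + "/metrics", timeout=5.0)
        response.raise_for_status()
        return response.text
    except httpx.HTTPError:
        return None  # an unresponsive backend no longer stalls the scraper

async def scrape_all(urls: list[str]) -> dict[str, Optional[str]]:
    # One shared client; endpoints are scraped concurrently, so a single slow
    # backend costs at most its own timeout rather than stalling the whole loop.
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch_metrics(client, u) for u in urls))
    return dict(zip(urls, results))

Alternatively, if the surrounding code must stay synchronous, keeping requests but always passing an explicit timeout (as in the earlier sketch) already prevents indefinite hangs.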
