# Support requests between APIs within the cluster #1503

Merged · 3 commits · Nov 6, 2020

17 changes: 17 additions & 0 deletions docs/deployments/realtime-api/predictors.md
@@ -532,3 +532,20 @@ def predict(self, payload):
            content=data, media_type="text/plain")
        return response
```

## Chaining APIs
Member Author: Let me know if you prefer a different title; I wasn't sure what to go with.

It is possible to make requests from one API to another within a Cortex cluster. All running APIs are accessible from within the predictor at `http://api-<api_name>:8888/predict`, where `<api_name>` is the name of the API you are making a request to.

For example, if there is an API named `text-generator` running in the cluster, you could make a request to it from a different API like this:

```python
import requests

class PythonPredictor:
    def predict(self, payload):
        response = requests.post("http://api-text-generator:8888/predict", json={"text": "machine learning is"})
        # ...
```

Note that the autoscaling configuration (i.e. `target_replica_concurrency`) of the API that makes the request should account for the fact that a request is still counted as "in-flight" with the first API while it is being fulfilled by the second API (during which it is also counted as "in-flight" with the second API). See the [autoscaling docs](autoscaling.md) for more details.
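
As an illustration only, here is a slightly fuller sketch of the same chaining pattern. The `text-generator` name and the `:8888/predict` path come from the example above, while the `prompt` field, the response shape, and the `__init__` wiring are assumptions made for this sketch rather than anything prescribed by Cortex:

```python
import requests

class PythonPredictor:
    def __init__(self, config):
        # In-cluster URL pattern: http://api-<api_name>:8888/predict
        self.text_generator_url = "http://api-text-generator:8888/predict"

    def predict(self, payload):
        # Forward part of this API's payload to the text-generator API
        # (the "prompt" field is hypothetical, chosen for this sketch)
        response = requests.post(self.text_generator_url, json={"text": payload["prompt"]})
        response.raise_for_status()
        # The shape of the returned JSON depends entirely on what text-generator returns
        return {"generated": response.json()}
```

Because the downstream call happens synchronously inside `predict`, each chained request occupies a concurrency slot on both APIs at once, which is exactly the autoscaling caveat described above.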
6 changes: 5 additions & 1 deletion pkg/workloads/cortex/serve/serve.py
```diff
@@ -21,6 +21,7 @@
 from concurrent.futures import ThreadPoolExecutor
 import threading
 import math
+import uuid
 import asyncio
 from typing import Any
```

```diff
@@ -121,7 +122,10 @@ async def register_request(request: Request, call_next):
     try:
         if is_prediction_request(request):
             if local_cache["provider"] != "local":
-                request_id = request.headers["x-request-id"]
+                if "x-request-id" in request.headers:
+                    request_id = request.headers["x-request-id"]
+                else:
+                    request_id = uuid.uuid1()
                 file_id = f"/mnt/requests/{request_id}"
                 open(file_id, "a").close()
```

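For context, here is a minimal standalone sketch of the pattern this hunk modifies, assuming a FastAPI app; the handler name, the `x-request-id` header, and the `/mnt/requests` path mirror the diff above, while `is_prediction_request` is stubbed out and everything else is simplified for illustration:

```python
import uuid

from fastapi import FastAPI, Request

app = FastAPI()


def is_prediction_request(request: Request) -> bool:
    # Stub: the real serve.py decides this from the incoming request itself.
    return request.url.path == "/predict"


@app.middleware("http")
async def register_request(request: Request, call_next):
    if is_prediction_request(request):
        # Fall back to a locally generated ID when the x-request-id header is absent,
        # e.g. when the request comes from another API inside the cluster.
        request_id = request.headers.get("x-request-id", str(uuid.uuid1()))
        # Touch a marker file so in-flight requests can be counted for autoscaling.
        open(f"/mnt/requests/{request_id}", "a").close()
    return await call_next(request)
```

The fallback ID is what allows requests made directly between APIs, which presumably never pass through whatever injects `x-request-id`, to still be registered and counted.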