
Commit 84e55ef

vishalbollu authored and deliahu committed

Add key features and architecture diagram (#2239)

(cherry picked from commit cea16f1)

1 parent f26f269 commit 84e55ef

File tree

5 files changed: +55 −11 lines changed


Diff for: docs/overview.md

+4 lines changed

@@ -32,3 +32,7 @@ Cortex uses a collection of containers, referred to as a pod, as the atomic unit
 * Task

 Visit the workload-specific documentation for more details.
+
+## Architecture Diagram
+
+![](https://user-images.githubusercontent.com/4365343/121231768-ce62e200-c85e-11eb-84b1-3d5d4b999c12.png)

Diff for: docs/workloads/async/async.md

+17 −3 lines changed

@@ -4,12 +4,26 @@ Async APIs are designed for asynchronous workloads in which the user submits an

 Async APIs are a good fit for users who want to submit longer workloads (such as video, audio, or document processing), and do not need the result immediately or synchronously.

+**Key features**
+
+* asynchronously process requests
+* retrieve status and response via HTTP endpoint
+* autoscale based on queue length
+* avoid cold starts
+* scale to 0
+* perform rolling updates
+* automatically recover from failures and spot instance termination
+
 ## How it works

-When you deploy an AsyncAPI, Cortex creates an SQS queue, a pool of Async Gateway workers, and a pool of workers running your containers.
+When you deploy an AsyncAPI, Cortex creates an SQS queue, a pool of Async Gateway workers, and a pool of worker pods. Each worker pod runs a dequeuer sidecar alongside your containers.
+
+Upon receiving a request, the Async Gateway saves the request payload to S3, enqueues the request ID onto an SQS FIFO queue, and responds with the request ID.
+
+The dequeuer sidecar in the worker pod pulls the request from the SQS queue, downloads the request's payload from S3, and makes a POST request to your containers. After the dequeuer receives a response, the corresponding request payload is deleted from S3, and the response is saved in S3 for 7 days.

-The Async Gateway is responsible for submitting the workloads to the queue and for retrieving workload statuses and results. Cortex fully implements and manages the Async Gateway and the queue.
+You can fetch the result by making a GET request to the AsyncAPI endpoint with the request ID; the Async Gateway responds with the status and, if the request has completed, the result.

 The pool of workers running your containers autoscales based on the average number of messages in the queue and can scale down to 0 (if configured to do so).

-![](https://user-images.githubusercontent.com/7456627/111491999-9b67f100-873c-11eb-87f0-effcf4aab01b.png)
+![](https://user-images.githubusercontent.com/4365343/121231833-e470a280-c85e-11eb-8be7-ad0a7cf9bce3.png)
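The request lifecycle added in this hunk (gateway saves payload, enqueues the request ID, dequeuer pulls and POSTs, result fetched by ID) can be sketched with an in-memory model. This is a hypothetical illustration, not Cortex's actual code: a dict stands in for S3, a deque for the SQS FIFO queue, and a plain function call for the POST to your container.

```python
import uuid
from collections import deque

# In-memory stand-ins for the AWS services (illustrative only):
s3 = {}          # plays the role of the S3 bucket
queue = deque()  # plays the role of the SQS FIFO queue
statuses = {}

def gateway_submit(payload: bytes) -> str:
    """Async Gateway: save the payload, enqueue the request ID, return it."""
    request_id = str(uuid.uuid4())
    s3[f"payloads/{request_id}"] = payload
    queue.append(request_id)
    statuses[request_id] = "in_queue"
    return request_id

def dequeuer_step(handler) -> None:
    """Dequeuer sidecar: pull one request ID, fetch its payload, call the
    user container (modeled as `handler`), then replace payload with response."""
    request_id = queue.popleft()
    payload = s3.pop(f"payloads/{request_id}")  # download, then delete payload
    response = handler(payload)                 # stands in for the POST request
    s3[f"responses/{request_id}"] = response    # Cortex keeps this for 7 days
    statuses[request_id] = "completed"

def gateway_get(request_id: str) -> dict:
    """GET on the AsyncAPI endpoint: status, plus the result if completed."""
    return {
        "id": request_id,
        "status": statuses[request_id],
        "result": s3.get(f"responses/{request_id}"),
    }

rid = gateway_submit(b"video-to-transcribe")
dequeuer_step(lambda payload: payload.upper())
print(gateway_get(rid)["status"])  # completed
```

Note how the payload is deleted from the payload prefix and the response written under a separate prefix, mirroring the deletion/retention behavior the section describes.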

Diff for: docs/workloads/batch/batch.md

+13 −4 lines changed

@@ -4,16 +4,25 @@ Batch APIs run distributed and fault-tolerant batch processing jobs on demand.

 Batch APIs are a good fit for users who want to break up their workloads and distribute them across a dedicated pool of workers (for example, running inference on a set of images).

-## How it works
+**Key features**
+
+* distribute a batch job across multiple workers
+* scale to 0 (when there are no batch jobs)
+* trigger `/on-job-complete` hook once all batches have been processed
+* attempt all batches at least once
+* reroute failed batches to a dead letter queue
+* automatically recover from failures and spot instance termination

-When you deploy a Batch API, Cortex creates an endpoint to receive job submissions.
+## How it works

-Upon job submission, Cortex responds with a Job ID, and asynchronously triggers a Batch Job.
+When you deploy a Batch API, Cortex creates an endpoint to receive job submissions. Upon submitting a job, Cortex responds with a Job ID and asynchronously triggers a Batch Job.

-First, Cortex deploys an enqueuer, which breaks up the data in the job into batches and pushes them onto an SQS FIFO queue.
+A Batch Job begins with the deployment of an enqueuer process, which breaks up the data in the job into batches and pushes them onto an SQS FIFO queue.

 After enqueuing is complete, Cortex initializes the requested number of worker pods and attaches a dequeuer sidecar to each pod. The dequeuer is responsible for retrieving batches from the queue and making an HTTP request to your pod for each batch.

 After the worker pods have emptied the queue, the job is marked as complete, and Cortex will terminate the worker pods and delete the SQS queue.

 You can make GET requests to the BatchAPI endpoint to get the status of the job and metrics such as the number of batches completed and failed.
+
+![](https://user-images.githubusercontent.com/4365343/121231862-ed617400-c85e-11eb-96fb-84b10c211131.png)
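The batch lifecycle this hunk describes (enqueuer splits the job into batches, dequeuers drain the queue, then the `/on-job-complete` hook fires) can be sketched as a minimal single-worker simulation. The function names and the in-memory deque are hypothetical stand-ins, not the Cortex implementation.

```python
from collections import deque
from typing import Callable, List

def enqueuer(items: List, batch_size: int) -> deque:
    """Enqueuer: break the job's data into batches and push each batch onto
    the queue (a deque stands in for the SQS FIFO queue)."""
    q = deque()
    for i in range(0, len(items), batch_size):
        q.append(items[i:i + batch_size])
    return q

def run_job(items: List, batch_size: int,
            handle_batch: Callable, on_job_complete: Callable) -> int:
    """Sketch of the job lifecycle: enqueue, drain the queue with one worker,
    then fire the completion hook once the queue is empty."""
    q = enqueuer(items, batch_size)
    completed = 0
    while q:                   # dequeuer: pull batches until the queue is empty
        batch = q.popleft()
        handle_batch(batch)    # stands in for the HTTP request to your pod
        completed += 1
    on_job_complete()          # stands in for the /on-job-complete hook
    return completed

calls = []
n = run_job(["a", "b", "c", "d", "e"], 2, calls.append, lambda: calls.append("done"))
print(n)  # 3
```

In the real system the batches are pulled concurrently by the requested number of worker pods; the single loop above only illustrates the ordering of enqueue, drain, and completion hook.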

Diff for: docs/workloads/realtime/realtime.md

+11 lines changed

@@ -4,8 +4,19 @@ Realtime APIs respond to requests synchronously and autoscale based on in-flight

 Realtime APIs are a good fit for users who want to run stateless containers as a scalable microservice (for example, deploying machine learning models as APIs).

+**Key features**
+
+* respond to requests synchronously
+* autoscale based on request volume
+* avoid cold starts
+* perform rolling updates
+* automatically recover from failures and spot instance termination
+
+* perform A/B tests and canary deployments
+
 ## How it works

 When you deploy a Realtime API, Cortex initializes a pool of worker pods and attaches a proxy sidecar to each of the pods.

 The proxy is responsible for receiving incoming requests, queueing them (if necessary), and forwarding them to your pod when it is ready. Autoscaling is based on aggregate in-flight request volume, which is published by the proxy sidecars.
+
+![](https://user-images.githubusercontent.com/4365343/121231921-fe11ea00-c85e-11eb-9813-6ee114f9a3fc.png)
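The proxy sidecar's role (queue requests when the pod is busy, forward when ready, publish the in-flight count that autoscaling reads) can be modeled with a small toy class. This is a hypothetical sketch of the behavior described above, not Cortex's proxy code.

```python
from collections import deque

class ProxySidecar:
    """Toy model of the proxy sidecar: tracks requests being processed by the
    pod, queues the overflow, and exposes the in-flight metric that
    autoscaling aggregates across pods."""

    def __init__(self, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self.active = 0          # requests currently forwarded to the pod
        self.backlog = deque()   # requests queued because the pod is busy

    def receive(self, request) -> None:
        """Accept an incoming request and forward it if the pod has capacity."""
        self.backlog.append(request)
        self._drain()

    def finish(self) -> None:
        """The pod finished one request; forward the next queued one, if any."""
        self.active -= 1
        self._drain()

    def _drain(self) -> None:
        while self.backlog and self.active < self.max_concurrency:
            self.backlog.popleft()
            self.active += 1     # forwarded to the pod

    def in_flight(self) -> int:
        # published metric: queued + currently processing
        return self.active + len(self.backlog)
```

With `max_concurrency=2`, a third concurrent request sits in the backlog rather than reaching the pod, yet still counts toward `in_flight()`, which is what lets Cortex scale up on queued demand.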

Diff for: docs/workloads/task/task.md

+10 −4 lines changed

@@ -4,12 +4,18 @@ Task APIs provide a lambda-style execution of containers. They are useful for ru

 Task APIs are a good fit when you need to trigger container execution via an HTTP request. They can be used to run tasks (e.g. training models), and can be configured as task runners for orchestrators (such as Airflow).

-## How it works
+**Key features**
+
+* run containers on-demand
+* scale to 0 (when there are no tasks)
+* automatically recover from failures and spot instance termination

-When you deploy a Task API, an endpoint is created to receive task submissions.
+## How it works

-Upon submitting a Task, Cortex will respond with a Task ID and will asynchronously trigger the execution of a Task.
+When you deploy a Task API, an endpoint is created to receive task submissions. Upon submitting a Task, Cortex responds with a Task ID and asynchronously triggers the execution of the Task.

 Cortex will initialize a worker pod based on your API specification. After the worker pod runs to completion, the Task is marked as completed and the pod is terminated.

-You can make GET requests to the Task API endpoint to retreive the status of the Task.
+You can make GET requests to the Task API endpoint to retrieve the status of the Task.
+
+![](https://user-images.githubusercontent.com/4365343/121231738-c30fb680-c85e-11eb-886f-dc4d9bf3ef17.png)
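The Task lifecycle in this section (submit returns a Task ID immediately, a worker later runs to completion, and status is read via GET) can be illustrated with a minimal in-memory class. All names here are hypothetical; this is a sketch of the described flow, not Cortex's API.

```python
import uuid

class TaskAPI:
    """Minimal sketch of the Task API lifecycle: submission returns a Task ID
    right away, the worker runs asynchronously, and GET reads the status."""

    def __init__(self):
        self.tasks = {}

    def submit(self, fn, *args) -> str:
        """Endpoint receiving a task submission: respond with a Task ID and
        record the work to be triggered asynchronously."""
        task_id = str(uuid.uuid4())
        self.tasks[task_id] = {"status": "pending", "fn": fn, "args": args}
        return task_id

    def run_worker(self, task_id: str) -> None:
        """Stands in for the worker pod running to completion."""
        task = self.tasks[task_id]
        try:
            task["result"] = task["fn"](*task["args"])
            task["status"] = "completed"
        except Exception:
            task["status"] = "failed"

    def get_status(self, task_id: str) -> str:
        """Stands in for a GET request to the Task API endpoint."""
        return self.tasks[task_id]["status"]

api = TaskAPI()
tid = api.submit(lambda x: x * 2, 21)
print(api.get_status(tid))  # pending
api.run_worker(tid)
print(api.get_status(tid))  # completed
```

The key point the sketch mirrors is the split between submission (synchronous, returns an ID) and execution (asynchronous, observed only through status polling).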

0 commit comments