Translator example #1592


Merged: 39 commits, Feb 3, 2021
Commits
3d669e2 Add translator example (caleb-kaiser, Nov 23, 2020)
096c0e8 Fix typo in README (caleb-kaiser, Nov 23, 2020)
274ffcf Clean up code snippets (caleb-kaiser, Nov 23, 2020)
93a4273 Add version warning to cluster.yaml (caleb-kaiser, Nov 24, 2020)
25cb9fc Merge branch 'master' into translator-example (caleb-kaiser, Nov 24, 2020)
355fbb5 Update README.md (caleb-kaiser, Nov 24, 2020)
66209af fix trailing whitespace (caleb-kaiser, Nov 24, 2020)
0c0447e Fix trailing whitespace (caleb-kaiser, Nov 24, 2020)
7fa3402 Merge branch 'master' into translator-example (caleb-kaiser, Nov 24, 2020)
2df5dac Update cluster.yaml (caleb-kaiser, Nov 25, 2020)
0b2c98d Merge branch 'translator-example' of https://github.com/cortexlabs/co… (caleb-kaiser, Nov 25, 2020)
9f8089e Merge branch 'master' into translator-example (RobertLucian, Nov 25, 2020)
6a13e26 Merge branch 'master' into translator-example (RobertLucian, Nov 28, 2020)
8c23315 Merge branch 'master' into translator-example (RobertLucian, Dec 1, 2020)
4c006ba Update requirements.txt (RobertLucian, Dec 7, 2020)
4d71c00 Merge branch 'master' into translator-example (RobertLucian, Dec 8, 2020)
6c04eee Merge branch 'master' into translator-example (RobertLucian, Dec 8, 2020)
ea73563 Fix GPU support and readme (caleb-kaiser, Dec 9, 2020)
9b4682f Lint (caleb-kaiser, Dec 9, 2020)
0bb6419 Lint (caleb-kaiser, Dec 9, 2020)
7f83bbc Lint (caleb-kaiser, Dec 9, 2020)
80b61ee Lint (caleb-kaiser, Dec 9, 2020)
d1e9275 Lint (caleb-kaiser, Dec 9, 2020)
91c5a36 Lint (caleb-kaiser, Dec 9, 2020)
0135450 Remove master branch warning (caleb-kaiser, Dec 9, 2020)
afd3387 Remove master branch warning (caleb-kaiser, Dec 9, 2020)
ec98ab1 Remove master branch warning (caleb-kaiser, Dec 9, 2020)
104fe22 Remove master branch warning (caleb-kaiser, Dec 9, 2020)
d63f206 Merge branch 'master' into translator-example (RobertLucian, Jan 21, 2021)
a6287be Update cortex.yaml (RobertLucian, Jan 21, 2021)
a805cf7 Merge branch 'master' into translator-example (RobertLucian, Jan 21, 2021)
55353ac Merge branch 'master' into translator-example (RobertLucian, Jan 21, 2021)
385ece0 Update cluster.yaml (RobertLucian, Jan 22, 2021)
891134c Update cortex.yaml (RobertLucian, Jan 22, 2021)
7849d23 Update README.md (RobertLucian, Jan 22, 2021)
21e65bb Merge branch 'master' into translator-example (RobertLucian, Jan 25, 2021)
8a8bf82 Merge branch 'master' into translator-example (RobertLucian, Jan 30, 2021)
6c53dc4 Merge branch 'master' into translator-example (RobertLucian, Feb 3, 2021)
b784803 Merge branch 'master' into translator-example (RobertLucian, Feb 3, 2021)
129 changes: 129 additions & 0 deletions test/apis/model-caching/python/translator/README.md
@@ -0,0 +1,129 @@
# Translator API

This project implements a multilingual translation API, supporting translations between over 150 languages, using 1,000+ large pre-trained models served from a single EC2 instance via Cortex:


```bash
curl https://***.amazonaws.com/translator -X POST -H "Content-Type: application/json" \
  -d '{"source_language": "en", "destination_language": "phi", "text": "It is a mistake to think you can solve any major problems just with potatoes."}'

{"generated_text": "Sayop an paghunahuna nga masulbad mo ang bisan ano nga dagkong mga problema nga may patatas lamang."}
```

Priorities of this project include:

- __Cost effectiveness.__ Each language-to-language translation is handled by a different ~300 MB model. A traditional setup would deploy all 1,000+ models across many servers to ensure availability, but this API can be run on a single server thanks to Cortex's multi-model caching.
- __Ease of use.__ Predictions are generated with Hugging Face's Transformers library and Cortex's Predictor API, while the translation service itself runs on a Cortex cluster self-hosted in your AWS account.
- __Configurability.__ All tools used in this API are fully open source and modifiable. The deployed service and underlying infrastructure run on your AWS account. The prediction API can be run on CPU and GPU instances.

## Models used

This project uses pre-trained Opus MT neural machine translation models, trained by Jörg Tiedemann and the Language Technology Research Group at the University of Helsinki. The models are hosted for free by Hugging Face. For the full list of language-to-language models, you can view the model repository [here](https://huggingface.co/Helsinki-NLP).
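
To get a feel for these models outside of Cortex, you can load one directly with Transformers. This is a standalone sketch (the `en`-to-`es` pair is just an example) and is not part of the API itself:

```python
# Standalone example: load one Opus MT model directly from the Hugging Face hub.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"  # English to Spanish

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["So long and thanks for all the fish."], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```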

## How to deploy the API

To deploy the API, first spin up a Cortex cluster by running `$ cortex cluster up --config cluster.yaml`. Note that the cluster configuration file we are providing Cortex with (accessible at `cluster.yaml`) requests a g4dn.xlarge GPU instance. If your AWS account does not have access to GPU instances, you can request an EC2 service quota increase [here](https://console.aws.amazon.com/servicequotas), or you can simply use CPU instances (the API will still work on CPU, just with higher latency).

```bash
$ cortex cluster up --config cluster.yaml

email address [press enter to skip]:

verifying your configuration ...

aws access key id ******************** will be used to provision a cluster named "cortex" in us-east-1:

○ using existing s3 bucket: cortex-***** ✓
○ using existing cloudwatch log group: cortex ✓
○ creating cloudwatch dashboard: cortex ✓
○ spinning up the cluster (this will take about 15 minutes) ...
○ updating cluster configuration ✓
○ configuring networking ✓
○ configuring autoscaling ✓
○ configuring logging ✓
○ configuring metrics ✓
○ configuring gpu support ✓
○ starting operator ✓
○ waiting for load balancers ...... ✓
○ downloading docker images ✓

cortex is ready!

```

Once the cluster is spun up (roughly 20 minutes), we can deploy by running:

```bash
cortex deploy
```

(I've configured my CLI to default to the AWS environment by running `cortex env default aws`)

Now, we wait for the API to become live. You can track its status with `cortex get --watch`.

Note that after the API goes live, we may need to wait a few minutes for it to register all the models hosted in the S3 bucket. Because the bucket is so large, it takes Cortex a bit longer than usual. When it's done, running `cortex get translator` should return something like:

```
cortex get translator

using aws environment

status up-to-date requested last update avg request 2XX
live 1 1 3m -- --

metrics dashboard: https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards:name=***

endpoint: http://***.elb.us-east-1.amazonaws.com/translator
example: curl: curl http://***.elb.us-east-1.amazonaws.com/translator -X POST -H "Content-Type: application/json" -d @sample.json

model name model version edit time
marian_converted_v1 1 (latest) 24 Aug 20 14:23:41 EDT
opus-mt-NORTH_EU-NORTH_EU 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-ROMANCE-en 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-SCANDINAVIA-SCANDINAVIA 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-aav-en 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-aed-es 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-de 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-en 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-eo 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-es 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-fi 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-fr 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-nl 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-ru 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-af-sv 1 (latest) 21 Aug 20 10:42:38 EDT
opus-mt-afa-afa 1 (latest) 21 Aug 20 10:42:38 EDT
...
```

This initial deploy will take a bit of time (~9 minutes) as Cortex indexes all the models in the bucket. After Cortex's upcoming release, deploys will take seconds, as model validation will be done in a nonblocking fashion (you can track progress [here](https://github.com/cortexlabs/cortex/issues/1663)).

Once Cortex has indexed all 1,000+ models, we can query the API at the endpoint shown above, structuring the body of our request according to the format expected by our predictor (specified in `predictor.py`):

```json
{
  "source_language": "en",
  "destination_language": "es",
  "text": "So long and thanks for all the fish."
}
```

The response should look something like this:

```
{"generated_text": "Hasta luego y gracias por todos los peces."}
```
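
The same request can be made from Python. A minimal sketch, assuming the `requests` package is installed; the endpoint URL is a placeholder for the one printed by `cortex get translator`:

```python
import requests

endpoint = "http://***.elb.us-east-1.amazonaws.com/translator"  # placeholder endpoint

payload = {
    "source_language": "en",
    "destination_language": "es",
    "text": "So long and thanks for all the fish.",
}

response = requests.post(endpoint, json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"generated_text": "Hasta luego y gracias por todos los peces."}
```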

The API, as currently defined, expects the language codes the Helsinki-NLP team uses in their model names (mostly two-letter ISO codes, plus a few codes such as `phi` that cover groups of languages). If you're unsure of a particular language's code, check the model names. You can also implement logic on the frontend, or within the API itself, to map other abbreviations to these codes, as sketched below.
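
A hypothetical mapping helper (not part of this project) could look like this:

```python
# Hypothetical helper for normalizing language identifiers to Opus MT codes.
# Note that some model names use group codes (e.g. "phi") rather than single-language codes.
LANGUAGE_ALIASES = {
    "english": "en",
    "spanish": "es",
    "french": "fr",
    "german": "de",
}


def normalize_language(value: str) -> str:
    """Map a language name or code to the code used in the Opus MT model names."""
    value = value.strip().lower()
    return LANGUAGE_ALIASES.get(value, value)  # assume it is already a valid code otherwise


assert normalize_language("Spanish") == "es"
assert normalize_language("en") == "en"
```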

## Performance

The first time you request a specific language-to-language translation, the model will be downloaded from S3, which may take some time (~60s, depending on your bandwidth). Every subsequent request will be much faster, because the API is configured to hold up to 250 models on disk and 5 in memory (see `cortex.yaml`). Models already loaded into memory serve predictions fastest (a couple of seconds at most with a GPU), while those on disk take slightly longer, as they need to be swapped into memory. Instances with more memory and disk space can naturally hold more models.

As for caching logic, when space is full, models are evicted from both memory and disk on a least-recently-used (LRU) basis. You can read more about how caching works in the [Cortex docs](https://docs.cortex.dev/).
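
To illustrate the eviction idea, here is a toy LRU cache in Python. This is purely conceptual and is not Cortex's actual implementation:

```python
from collections import OrderedDict


class ToyModelCache:
    """Toy LRU cache illustrating the eviction behavior described above."""

    def __init__(self, capacity, loader):
        self.capacity = capacity     # maximum number of models to keep
        self.loader = loader         # function that loads a model by name
        self.models = OrderedDict()  # name -> model, ordered from oldest to newest use

    def get(self, name):
        if name in self.models:
            self.models.move_to_end(name)    # mark as most recently used
            return self.models[name]
        model = self.loader(name)            # cache miss: load (e.g. download) the model
        self.models[name] = model
        if len(self.models) > self.capacity:
            self.models.popitem(last=False)  # evict the least recently used model
        return model
```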

Finally, note that this project places a heavy emphasis on cost savings, at the expense of optimal performance. If you are interested in improving performance, there are a number of changes you can make. For example, if you know which models are most likely to be needed, you can "warm up" the API by calling them immediately after deploy, as sketched below. Alternatively, if a handful of language pairs make up the bulk of your workload, you can deploy a separate API containing just those models and route traffic accordingly. This will increase cost (though you will still benefit greatly from multi-model caching), but it will also significantly improve the overall latency of your system.
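
For instance, a simple warm-up script could request each high-traffic language pair once, right after the deploy finishes. A sketch, with a placeholder endpoint and placeholder pairs:

```python
import requests

endpoint = "http://***.elb.us-east-1.amazonaws.com/translator"  # placeholder endpoint
common_pairs = [("en", "es"), ("en", "fr"), ("de", "en")]        # adjust to your workload

for source, destination in common_pairs:
    # any short text is enough to force the corresponding model into the cache
    requests.post(
        endpoint,
        json={"source_language": source, "destination_language": destination, "text": "Hello"},
    )
```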

## Projects to thank

This project is built on top of many free and open source tools. If you enjoy it, please consider supporting them by starring their GitHub repos. These projects include Cortex, Transformers, and Helsinki-NLP's Opus MT, as well as the many tools each uses under the hood.
17 changes: 17 additions & 0 deletions test/apis/model-caching/python/translator/cluster.yaml
@@ -0,0 +1,17 @@
# EKS cluster name for cortex (default: cortex)
cluster_name: cortex

# AWS region
region: us-east-1

# instance type
instance_type: g4dn.xlarge

# minimum number of instances (must be >= 0)
min_instances: 1

# maximum number of instances (must be >= 1)
max_instances: 2

# disk storage size per instance (GB) (default: 50)
instance_volume_size: 125
12 changes: 12 additions & 0 deletions test/apis/model-caching/python/translator/cortex.yaml
@@ -0,0 +1,12 @@
- name: translator
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
    multi_model_reloading:
      dir: s3://models.huggingface.co/bert/Helsinki-NLP/
      cache_size: 5
      disk_cache_size: 250
  compute:
    cpu: 1
    gpu: 1 # this is optional, since the api can also run on cpu
24 changes: 24 additions & 0 deletions test/apis/model-caching/python/translator/predictor.py
@@ -0,0 +1,24 @@
from transformers import MarianMTModel, MarianTokenizer, pipeline
import torch


class PythonPredictor:
    def __init__(self, config, python_client):
        self.client = python_client
        # use the current GPU if one is available, otherwise fall back to CPU (-1)
        self.device = torch.cuda.current_device() if torch.cuda.is_available() else -1

    def load_model(self, model_path):
        # called by Cortex to load a model from its locally cached path
        return MarianMTModel.from_pretrained(model_path, local_files_only=True)

    def predict(self, payload):
        # model names follow the Helsinki-NLP convention, e.g. "opus-mt-en-es"
        model_name = "opus-mt-" + payload["source_language"] + "-" + payload["destination_language"]
        tokenizer_path = "Helsinki-NLP/" + model_name
        model = self.client.get_model(model_name)
        tokenizer = MarianTokenizer.from_pretrained(tokenizer_path)

        inf_pipeline = pipeline(
            "text2text-generation", model=model, tokenizer=tokenizer, device=self.device
        )
        result = inf_pipeline(payload["text"])

        return result[0]
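
A hypothetical local smoke test for this predictor (not part of the PR): it stubs the `python_client` so the model is pulled straight from the Hugging Face hub instead of Cortex's cache:

```python
# local_test.py (hypothetical, not included in this PR)
from transformers import MarianMTModel

from predictor import PythonPredictor


class StubClient:
    """Stands in for Cortex's python_client by loading models from the hub."""

    def get_model(self, model_name):
        return MarianMTModel.from_pretrained("Helsinki-NLP/" + model_name)


predictor = PythonPredictor(config={}, python_client=StubClient())
print(
    predictor.predict(
        {
            "source_language": "en",
            "destination_language": "es",
            "text": "So long and thanks for all the fish.",
        }
    )
)
```
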
2 changes: 2 additions & 0 deletions test/apis/model-caching/python/translator/requirements.txt
@@ -0,0 +1,2 @@
transformers==3.5.1
torch
5 changes: 5 additions & 0 deletions test/apis/model-caching/python/translator/sample.json
@@ -0,0 +1,5 @@
{
  "source_language": "en",
  "destination_language": "es",
  "text": "So long and thanks for all the fish."
}