11 changes: 5 additions & 6 deletions README.md
@@ -53,8 +53,7 @@ such as:

### Supported Models

-You can use any JinaBERT model with Alibi or absolute positions or any BERT, CamemBERT or XLM-RoBERTa model with
-absolute positions in `text-embeddings-inference`.
+You can use any JinaBERT model with Alibi or absolute positions or any BERT, CamemBERT, RoBERTa, or XLM-RoBERTa model with absolute positions in `text-embeddings-inference`.

**Support for other model types will be added in the future.**

@@ -96,8 +95,8 @@ curl 127.0.0.1:8080/embed \
-H 'Content-Type: application/json'
```

**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
We also recommend using NVIDIA drivers with CUDA version 12.0 or higher.

To see all options to serve your models:

@@ -236,7 +235,7 @@ Text Embeddings Inference ships with multiple Docker images that you can use to
| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-0.3.0 |
| Hopper (H100) | ghcr.io/huggingface/text-embeddings-inference:hopper-0.3.0 (experimental) |

**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
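
For example, a minimal launch sketch, assuming the Turing image tag follows the same `turing-0.3.0` pattern as the tags in the table above and the quick-start launch command from earlier in the README:

```shell
model=BAAI/bge-large-en-v1.5
volume=$PWD/data

# USE_FLASH_ATTENTION=True re-enables Flash Attention v1 on the Turing image
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:turing-0.3.0 --model-id $model
```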

### API documentation
@@ -329,7 +328,7 @@ cargo install --path router -F candle-cuda-turing --no-default-features
cargo install --path router -F candle-cuda --no-default-features
```

You can now launch Text Embeddings Inference on GPU with:

```shell
model=BAAI/bge-large-en-v1.5
1 change: 1 addition & 0 deletions backends/candle/src/lib.rs
@@ -41,6 +41,7 @@ impl CandleBackend {
    if config.model_type != Some("bert".to_string())
        && config.model_type != Some("xlm-roberta".to_string())
        && config.model_type != Some("camembert".to_string())
+       && config.model_type != Some("roberta".to_string())
    {
return Err(BackendError::Start(format!(
"Model {:?} is not supported",
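For context, this check gates backend startup on the `model_type` field of the model's config.json. A minimal sketch of the same gate, assuming a serde-deserialized config; the `Config` struct here is a hypothetical stand-in for the backend's own config type:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    model_type: Option<String>,
}

// Returns true for the model families the Candle backend accepts after this change.
fn is_supported(config: &Config) -> bool {
    matches!(
        config.model_type.as_deref(),
        Some("bert") | Some("xlm-roberta") | Some("camembert") | Some("roberta")
    )
}

fn main() {
    let config: Config = serde_json::from_str(r#"{ "model_type": "roberta" }"#).unwrap();
    assert!(is_supported(&config));
}
```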
14 changes: 8 additions & 6 deletions router/src/main.rs
@@ -215,12 +215,14 @@ async fn main() -> Result<()> {
tokenizer.with_padding(None);

    // Position IDs offset. Used for RoBERTa and CamemBERT.
-   let position_offset =
-       if &config.model_type == "xlm-roberta" || &config.model_type == "camembert" {
-           config.pad_token_id + 1
-       } else {
-           0
-       };
+   let position_offset = if &config.model_type == "xlm-roberta"
+       || &config.model_type == "camembert"
+       || &config.model_type == "roberta"
+   {
+       config.pad_token_id + 1
+   } else {
+       0
+   };
    let max_input_length = config.max_position_embeddings - position_offset;

let tokenization_workers = args
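Why the offset matters: RoBERTa-family models follow the fairseq convention of reserving position IDs 0 through `pad_token_id` for padding, so real positions start at `pad_token_id + 1` and the usable input length shrinks accordingly. A minimal sketch of the computation above, worked with roberta-base's published values (`pad_token_id = 1`, `max_position_embeddings = 514`):

```rust
fn position_offset(model_type: &str, pad_token_id: usize) -> usize {
    match model_type {
        // RoBERTa-family models reserve IDs 0..=pad_token_id,
        // so real positions start at pad_token_id + 1.
        "xlm-roberta" | "camembert" | "roberta" => pad_token_id + 1,
        _ => 0,
    }
}

fn main() {
    let offset = position_offset("roberta", 1);
    let max_input_length = 514 - offset;
    // roberta-base effectively handles 512 positions despite
    // max_position_embeddings = 514.
    assert_eq!(max_input_length, 512);
}
```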