[Bug/Model Request]: intfloat/multilingual-e5-large should use average pooling #384

ITHwang commented Nov 1, 2024

What happened?

Hi, I'm using intfloat/multilingual-e5-large for a retrieval task, and I found that when E5OnnxEmbedding embeds texts with this model, the output is pooled with CLS pooling:

class E5OnnxEmbedding(OnnxTextEmbedding):
    ...

class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[np.ndarray]):
    """Implementation of the Flag Embedding model."""
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings = output.model_output
        # embeddings[:, 0] keeps only the first ([CLS]) token of each
        # sequence, i.e. CLS pooling.
        return normalize(embeddings[:, 0]).astype(np.float32)
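For illustration, here's a tiny standalone numpy sketch (toy data, not FastEmbed code) of what that indexing does: embeddings[:, 0] keeps only the first ([CLS]) token's vector of each sequence, while average pooling would use every token:

import numpy as np

# Toy "model output": batch of 2 sequences, 3 tokens each, hidden size 4.
last_hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

cls_pooled = last_hidden[:, 0]          # (2, 4): first token only
mean_pooled = last_hidden.mean(axis=1)  # (2, 4): unweighted average over all tokens

print(cls_pooled.shape, mean_pooled.shape)  # (2, 4) (2, 4)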

But I think it would be better to use average pooling, as the paper does when pretraining the model:

Following the popular biencoder architecture, we use a pre-trained Transformer encoder and average pooling over the output layer to get fixed-size text embeddings Eq and Ep. The score is the cosine similarity scaled by a temperature hyperparameter τ: ...
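In code, the scoring described there would look roughly like this (a sketch; tau = 0.01 is just an illustrative value for the temperature):

import numpy as np

def score(e_q: np.ndarray, e_p: np.ndarray, tau: float = 0.01) -> float:
    # Cosine similarity between the query and passage embeddings,
    # scaled by the temperature hyperparameter tau.
    cos_sim = float(e_q @ e_p) / (np.linalg.norm(e_q) * np.linalg.norm(e_p))
    return cos_sim / tau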

So as a workaround I'm using average pooling by overriding E5OnnxEmbedding:

def average_pool(last_hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Zero out the padded positions, then average the token embeddings
    # over the number of unpadded tokens in each sequence.
    mask = np.expand_dims(attention_mask, axis=-1).astype(last_hidden_states.dtype)
    summed = np.sum(last_hidden_states * mask, axis=1)
    counts = np.clip(np.sum(mask, axis=1), 1e-9, None)
    avg_hidden = summed / counts
    return avg_hidden

class CustomE5OnnxEmbedding(E5OnnxEmbedding):
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings, attention_masks = output.model_output, output.attention_mask

        pooled_embeddings = average_pool(embeddings, attention_masks)
        normalized_embeddings = normalize(pooled_embeddings).astype(np.float32)

        return normalized_embeddings

TextEmbedding.EMBEDDINGS_REGISTRY.append(CustomE5OnnxEmbedding)
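For reference, this is how I then use it (a sketch; whether the custom class actually takes precedence over the built-in E5OnnxEmbedding depends on how FastEmbed resolves the registry):

from fastembed import TextEmbedding

model = TextEmbedding(model_name="intfloat/multilingual-e5-large")
# E5 models expect "query: " / "passage: " prefixes on the input texts.
embeddings = list(model.embed(["query: how is the model output pooled?"]))
print(embeddings[0].shape)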

Would you consider changing the pooling method to average pooling?

Separately from this issue: I'm really enjoying FastEmbed, and I appreciate your work on it!

Thanks for your time and consideration!

What Python version are you on? e.g. python --version

  • Python 3.11
  • FastEmbed 0.4.1

Version

0.2.7 (Latest)

What OS are you seeing the problem on?

macOS

Relevant stack traces and/or logs

No response
