[Bug/Model Request]: intfloat/multilingual-e5-large should use average pooling #384

ITHwang commented Nov 1, 2024

What happened?

Hi, I'm using intfloat/multilingual-e5-large for a retrieval task, and I found that when E5OnnxEmbedding embeds texts with this model, the output is pooled with CLS pooling:

class E5OnnxEmbedding(OnnxTextEmbedding):
    ...

class OnnxTextEmbedding(TextEmbeddingBase, OnnxTextModel[np.ndarray]):
    """Implementation of the Flag Embedding model."""
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings = output.model_output
        # embeddings[:, 0] keeps only the first ([CLS]) token of each
        # sequence, i.e. CLS pooling.
        return normalize(embeddings[:, 0]).astype(np.float32)
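For illustration, here's a tiny standalone numpy sketch (toy data, not FastEmbed code) of what that indexing does: embeddings[:, 0] keeps only the first ([CLS]) token's vector of each sequence, while average pooling would use every token:

import numpy as np

# Toy "model output": batch of 2 sequences, 3 tokens each, hidden size 4.
last_hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

cls_pooled = last_hidden[:, 0]          # (2, 4): first token only
mean_pooled = last_hidden.mean(axis=1)  # (2, 4): unweighted average over all tokens

print(cls_pooled.shape, mean_pooled.shape)  # (2, 4) (2, 4)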

But I think it would be better to use average pooling, as the paper does when pretraining the model:

Following the popular biencoder architecture, we use a pre-trained Transformer encoder and average pooling over the output layer to get fixed-size text embeddings Eq and Ep. The score is the cosine similarity scaled by a temperature hyperparameter τ: ...
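In code, the scoring described there would look roughly like this (a sketch; tau = 0.01 is just an illustrative value for the temperature):

import numpy as np

def score(e_q: np.ndarray, e_p: np.ndarray, tau: float = 0.01) -> float:
    # Cosine similarity between the query and passage embeddings,
    # scaled by the temperature hyperparameter tau.
    cos_sim = float(e_q @ e_p) / (np.linalg.norm(e_q) * np.linalg.norm(e_p))
    return cos_sim / tau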

So as a workaround I'm using average pooling by overriding E5OnnxEmbedding:

def average_pool(last_hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Zero out the padded positions, then average the token embeddings
    # over the number of unpadded tokens in each sequence.
    mask = np.expand_dims(attention_mask, axis=-1).astype(last_hidden_states.dtype)
    summed = np.sum(last_hidden_states * mask, axis=1)
    counts = np.clip(np.sum(mask, axis=1), 1e-9, None)
    avg_hidden = summed / counts
    return avg_hidden

class CustomE5OnnxEmbedding(E5OnnxEmbedding):
    ...

    def _post_process_onnx_output(self, output: OnnxOutputContext) -> Iterable[np.ndarray]:
        embeddings, attention_masks = output.model_output, output.attention_mask

        pooled_embeddings = average_pool(embeddings, attention_masks)
        normalized_embeddings = normalize(pooled_embeddings).astype(np.float32)

        return normalized_embeddings

TextEmbedding.EMBEDDINGS_REGISTRY.append(CustomE5OnnxEmbedding)
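For reference, this is how I then use it (a sketch; whether the custom class actually takes precedence over the built-in E5OnnxEmbedding depends on how FastEmbed resolves the registry):

from fastembed import TextEmbedding

model = TextEmbedding(model_name="intfloat/multilingual-e5-large")
# E5 models expect "query: " / "passage: " prefixes on the input texts.
embeddings = list(model.embed(["query: how is the model output pooled?"]))
print(embeddings[0].shape)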

Would you consider changing the pooling method to average pooling?

Separately from this issue: I'm really enjoying FastEmbed, and I appreciate your work on it!

Thanks for your time and consideration!

What Python version are you on? e.g. python --version

  • Python 3.11
  • FastEmbed 0.4.1

Version

0.2.7 (Latest)

What OS are you seeing the problem on?

macOS

Relevant stack traces and/or logs

No response
