
Bugfix: wrong attention mask calculation resulted in wrong embeddings #14496

Conversation

@maziyarpanahi (Member) commented on Jan 6, 2025

This pull request includes several changes to the BGEEmbeddings class in python/sparknlp/annotator/embeddings/bge_embeddings.py and to various Scala files, adding new functionality and improving code quality. The most important changes are support for the CLS token in sentence embeddings, an updated default pretrained model, and refactorings that improve readability and maintainability.

Enhancements to BGEEmbeddings:

  • Added HasClsTokenProperties to BGEEmbeddings to support using the CLS token for sentence embeddings (python/sparknlp/annotator/embeddings/bge_embeddings.py); see the pooling sketch after this list.
  • Updated the default pretrained model name to bge_small_en_v1.5 (python/sparknlp/annotator/embeddings/bge_embeddings.py).
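
For context, the two pooling strategies differ as follows. This is a minimal sketch, not the PR's implementation; the method name and array shapes are assumptions.

```scala
// Minimal sketch: compute a sentence embedding from per-token embeddings
// of shape [seqLen][hiddenSize], either from the CLS token or by mean
// pooling over the non-padding positions.
def sentenceEmbedding(
    tokenEmbeddings: Array[Array[Float]],
    attentionMask: Array[Long],
    useCLSToken: Boolean): Array[Float] = {
  if (useCLSToken) {
    // BERT-style encoders place the CLS token at position 0.
    tokenEmbeddings.head
  } else {
    val hiddenSize = tokenEmbeddings.head.length
    val sum = new Array[Float](hiddenSize)
    tokenEmbeddings.zip(attentionMask).foreach { case (emb, mask) =>
      if (mask == 1L) {
        var i = 0
        while (i < hiddenSize) { sum(i) += emb(i); i += 1 }
      }
    }
    val validCount = math.max(attentionMask.count(_ == 1L), 1).toFloat
    sum.map(_ / validCount)
  }
}
```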

New Class and Methods:

  • Introduced the HasClsTokenProperties class to handle CLS token properties (python/sparknlp/common/properties.py); a rough Scala analogue is sketched below.
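
HasClsTokenProperties itself is defined on the Python side. As a rough illustration only (an assumption, not the PR's code), a Spark ML-style Scala analogue of such a mixin would expose a single boolean parameter:

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical Scala analogue of the Python HasClsTokenProperties mixin.
trait HasClsTokenProperties extends Params {

  /** Whether to use the CLS token to compute the sentence embedding */
  val useCLSToken: BooleanParam = new BooleanParam(
    this,
    "useCLSToken",
    "Whether to use the CLS token for sentence embeddings")

  def setUseCLSToken(value: Boolean): this.type = set(useCLSToken, value)

  def getUseCLSToken: Boolean = $(useCLSToken)

  // Default value is an assumption for illustration.
  setDefault(useCLSToken -> true)
}
```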

Refactorings and Improvements:

  • Refactored the BGE class to include a useCLSToken parameter in various methods and improved tensor handling (src/main/scala/com/johnsnowlabs/ml/ai/BGE.scala).
  • Removed unnecessary blank lines and improved code formatting in the Albert and Bart classes (src/main/scala/com/johnsnowlabs/ml/ai/Albert.scala, src/main/scala/com/johnsnowlabs/ml/ai/Bart.scala).

These changes collectively enhance the functionality of the BGEEmbeddings class and improve the overall code quality in related files.

This pull request also updates the attention mask logic across multiple classes in the com.johnsnowlabs.ml.ai package. The changes modify the condition used to set attention mask values so that padding positions are correctly masked out (a minimal example follows the list below).

Updates to attention mask logic:

  • BGE.scala: changed the attention mask condition from x < 0L to x == 0L in both getSentenceEmbeddingFromOv and getSentenceEmbeddingFromOnnx.
  • E5.scala: updated the condition from x < 0L to x == 0L in getSentenceEmbeddingFromOnnx.
  • MPNet.scala: changed the condition from x < this.paddingTokenId to x == this.paddingTokenId in getSentenceEmbeddingFromOv and getSentenceEmbeddingFromOnnx.
  • Mxbai.scala: changed the condition from x < 0L to x == 0L in getSentenceEmbeddingFromOnnx.
  • Nomic.scala: updated the condition from x < 0L to x == 0L in getSentenceEmbeddingFromOnnx.
  • SnowFlake.scala: changed the condition from x < 0L to x == 0L in both getSentenceEmbeddingFromOv and getSentenceEmbeddingFromOnnx.
  • UAE.scala: changed the condition from x < 0L to x == 0L in both getSentenceEmbeddingFromOpenvino and getSentenceEmbeddingFromOnnx.
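
The effect of the changed condition is easiest to see on a concrete padded sequence. A minimal sketch, assuming a BERT-style vocabulary where the padding token id is 0 (the token ids below are illustrative):

```scala
// Illustrative padded token-id sequence; 0L stands for [PAD].
val tokenIds: Array[Long] = Array(101L, 2023L, 2003L, 102L, 0L, 0L)

// Before the fix: token ids are never negative, so the condition never
// fired and every position, padding included, received mask 1. Padding
// vectors then leaked into the pooled sentence embedding.
val maskBefore: Array[Long] = tokenIds.map(x => if (x < 0L) 0L else 1L)
// -> Array(1, 1, 1, 1, 1, 1)

// After the fix: positions holding the padding token id are zeroed out
// (MPNet compares against this.paddingTokenId instead of the literal 0L).
val maskAfter: Array[Long] = tokenIds.map(x => if (x == 0L) 0L else 1L)
// -> Array(1, 1, 1, 1, 0, 0)
```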

@maziyarpanahi self-assigned this on Jan 6, 2025
@maziyarpanahi added the bug-fix and DON'T MERGE (Do not merge this PR) labels on Jan 6, 2025
@ahmedlone127 (Contributor) left a comment:

LGTM!

@maziyarpanahi changed the base branch from master to release/553-release-candidate on Jan 29, 2025
@maziyarpanahi merged commit 14daf85 into release/553-release-candidate on Jan 29, 2025
4 checks passed