[ML] Fix deberta tokenizer bug caused by bug in normalizer #117189

Merged
maxhniebergall merged 4 commits into main from debertaTokenizerNormalizeFix
Nov 21, 2024

Conversation

maxhniebergall
Member

Fixes a bug in the normalizer that caused offsets to be negative, which caused exceptions for some very rare input combinations.

@maxhniebergall added the >bug, :ml (Machine learning), auto-backport (Automatically create backport pull requests when merged), v9.0.0, v8.17.0, and v8.16.2 labels Nov 20, 2024
@elasticsearchmachine added the Team:ML (Meta label for the ML team) label Nov 20, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @maxhniebergall, I've created a changelog YAML for you.

Member

@dan-rubinstein dan-rubinstein left a comment

LGTM

@maxhniebergall maxhniebergall enabled auto-merge (squash) November 20, 2024 20:08
@@ -194,7 +194,7 @@ Reader normalize(CharSequence str) {
             if (charDelta < 0) {
                 // normalised form is shorter
                 int lastDiff = getLastCumulativeDiff();
-                addOffCorrectMap(normalizedCharPos, lastDiff + charDelta);
+                addOffCorrectMap(normalizedCharPos, lastDiff - charDelta);
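
To see why the sign matters in the hunk above, here is a minimal, self-contained sketch of an offset-correction map. It is not the Elasticsearch normalizer or Lucene's `BaseCharFilter`: the class `OffsetCorrectionSketch`, the helper `originalOffset`, and the example input are all hypothetical. It assumes the convention that an original offset is the normalized offset plus the most recent cumulative diff, so when the normalized form is shorter (`charDelta < 0`) the cumulative diff has to grow.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Elasticsearch normalizer or Lucene's BaseCharFilter.
class OffsetCorrectionSketch {
    // Normalized position -> cumulative diff to add to recover the original position.
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    int getLastCumulativeDiff() {
        return corrections.isEmpty() ? 0 : corrections.lastEntry().getValue();
    }

    void addOffCorrectMap(int normalizedPos, int cumulativeDiff) {
        corrections.put(normalizedPos, cumulativeDiff);
    }

    // Assumed convention: original offset = normalized offset + latest cumulative diff at or before it.
    int correct(int normalizedOffset) {
        Map.Entry<Integer, Integer> entry = corrections.floorEntry(normalizedOffset);
        return entry == null ? normalizedOffset : normalizedOffset + entry.getValue();
    }

    // Made-up input: the 3 chars "xyz" at the start of "xyzcd" normalize to the 1 char "X", giving "Xcd".
    static int originalOffset(int normalizedOffset, boolean useFixedSign) {
        OffsetCorrectionSketch map = new OffsetCorrectionSketch();
        int charDelta = 1 - 3;      // normalized length minus original length = -2
        int normalizedCharPos = 1;  // position just after "X" in the normalized text
        if (charDelta < 0) {
            // normalised form is shorter
            int lastDiff = map.getLastCumulativeDiff();
            int newDiff = useFixedSign ? lastDiff - charDelta : lastDiff + charDelta;
            map.addOffCorrectMap(normalizedCharPos, newDiff);
        }
        return map.correct(normalizedOffset);
    }

    public static void main(String[] args) {
        // 'c' is at index 1 in "Xcd" and at index 3 in "xyzcd".
        System.out.println("fixed sign: " + originalOffset(1, true));
        System.out.println("old sign:   " + originalOffset(1, false));
    }
}
```

In this sketch the `lastDiff - charDelta` form maps normalized offset 1 back to original offset 3, while the `lastDiff + charDelta` form maps it to -1, which is the kind of negative offset the PR description mentions.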
Member

Subtle, nice find! Can we add any tests here as well?

Member Author

Good idea. I added a test and confirmed that it fails without this fix and passes with it.

Member

@davidkyle davidkyle left a comment

LGTM

@maxhniebergall maxhniebergall merged commit 5500a5e into main Nov 21, 2024
17 checks passed
@maxhniebergall maxhniebergall deleted the debertaTokenizerNormalizeFix branch November 21, 2024 14:38
@elasticsearchmachine
Collaborator

💔 Backport failed

| Status | Branch | Result |
| --- | --- | --- |
| | 8.x | |
| | 8.16 | |
| | 8.18 | The branch "8.18" is invalid or doesn't exist |

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 117189`

maxhniebergall added a commit to maxhniebergall/elasticsearch that referenced this pull request Nov 21, 2024
…17189)

* Fix deberta tokenizer bug caused by bug in normalizer which caused offsets to be negative

* Update docs/changelog/117189.yaml
maxhniebergall added a commit to maxhniebergall/elasticsearch that referenced this pull request Nov 21, 2024
…17189)

* Fix deberta tokenizer bug caused by bug in normalizer which caused offsets to be negative

* Update docs/changelog/117189.yaml
maxhniebergall added a commit to maxhniebergall/elasticsearch that referenced this pull request Nov 21, 2024
…17189)

* Fix deberta tokenizer bug caused by bug in normalizer which caused offsets to be negative

* Update docs/changelog/117189.yaml

(cherry picked from commit 5500a5e)
@maxhniebergall
Member Author

💚 All backports created successfully

| Status | Branch | Result |
| --- | --- | --- |
| | 8.17 | |

Questions? Please refer to the Backport tool documentation

elasticsearchmachine pushed a commit that referenced this pull request Nov 21, 2024
…117254)

* Fix deberta tokenizer bug caused by bug in normalizer which caused offsets to be negative

* Update docs/changelog/117189.yaml
elasticsearchmachine pushed a commit that referenced this pull request Nov 21, 2024
…117260)

* Fix deberta tokenizer bug caused by bug in normalizer which caused offsets to be negative

* Update docs/changelog/117189.yaml

(cherry picked from commit 5500a5e)
Labels
auto-backport Automatically create backport pull requests when merged backport pending >bug :ml Machine learning Team:ML Meta label for the ML team v8.16.2 v8.17.0 v8.18.0 v9.0.0

5 participants