[ML] Allow NLP truncate option to be updated when span is set #91224
Pinging @elastic/ml-core (Team:ML)
Hi @davidkyle, I've created a changelog YAML for you.
Looks good, just a comment about clarifying constructors for RobertaTokenization.
@@ -51,7 +51,7 @@ public static RobertaTokenization fromXContent(XContentParser parser, boolean le

     private final boolean addPrefixSpace;

-    private RobertaTokenization(
+    public RobertaTokenization(
Why is there a constructor that doesn't expect doLowerCase for RobertaTokenization? The only public constructor before this change forces lower case to false. This seems like something worth clarifying as we're refactoring the area.
I reverted this change as it was only a convenience for a test. The value of doLowerCase should always be false.
LGTM
💚 Backport successful
* main: (1300 commits)
  update c2id/c2id-server-demo docker image to support ARM (elastic#91144)
  Allow legacy index settings on legacy indices (elastic#90264)
  Skip prevoting if single-node discovery (elastic#91255)
  Chunked encoding for snapshot status API (elastic#90801)
  Allow different decay values depending on the score function (elastic#91195)
  Fix handling indexed envelopes crossing the dateline in mvt API (elastic#91105)
  Ensure cleanups succeed in JoinValidationService (elastic#90601)
  Add overflow behaviour test for RecyclerBytesStreamOutput (elastic#90638)
  More actionable error for ancient indices (elastic#91243)
  Fix APM configuration file delete (elastic#91058)
  Clean up handshake test class (elastic#90966)
  Improve H3#hexRing logic and add H3#areNeighborCells method (elastic#91140)
  Restrict direct use of `ApplicationPrivilege` constructor (elastic#91176)
  [ML] Allow NLP truncate option to be updated when span is set (elastic#91224)
  Support multi-intersection for FieldPermissions (elastic#91169)
  Support intersecting multi-sets of queries with DocumentPermissions (elastic#91151)
  Ensure TermsEnum action works correctly with API keys (elastic#91170)
  Fix NPE in auditing authenticationSuccess for non-existing run-as user (elastic#91171)
  Ensure PKI's delegated_by_realm metadata respect run-as (elastic#91173)
  [ML] Update API documentation for anomaly score explanation (elastic#91177)
  ...

# Conflicts:
#   x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/XPackClientPlugin.java
#   x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/RollupShardIndexer.java
#   x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/TransportRollupIndexerAction.java
#   x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/v2/RollupActionSingleNodeTests.java
Models preconfigured with the tokenization options truncate: NONE and span: X, where X is > 0, return an error when the truncate option is updated. Because the model is configured with a span option, the validation check fails. The workaround is to set span: -1, where the value -1 unsets span.

This change wipes out any preexisting span option when truncate is set to first or second. That in itself is a small change; the rest is testing and a refactoring of the Roberta and BERT tokenizers to share common code.