[ML] Allow NLP truncate option to be updated when span is set #91224

Merged 5 commits on Nov 2, 2022

davidkyle
Member

Models preconfigured with the tokenization options truncate: NONE and span: X, where X > 0, return an error when the truncate option is updated. For example:

POST _ml/trained_models/model/_infer
{
  "docs": [..],
  "inference_config": {
    "question_answering": {
      "question": "Who moved my cheese?",
      "tokenization" : {
        "bert": {
          "truncate": "second"         <-- override the existing truncate option
        }
      }
    }
  }
}

This returns the error:

[span] must not be provided when [truncate] is not [none]

Because the model is configured with a span option, the validation check fails. The workaround is to set span: -1, where the value -1 unsets span.

This change wipes out any preexisting span option when truncate is set to first or second. That in itself is a small change; the rest is testing and a refactoring of the RoBERTa and BERT tokenizers to share common code.
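To make the before and after behaviour concrete, here is a minimal sketch of the update logic. It is not the Elasticsearch implementation: `TokenizationSettings`, `TokenizationUpdate`, `applyKeepingSpan` and `applyClearingSpan` are hypothetical names, the stored span of 128 is an arbitrary illustrative value, and the choice to clear the stored span only when the request does not supply its own span is an assumption.

```java
import java.util.Optional;

// Hypothetical, simplified stand-ins; these are NOT the Elasticsearch classes.
enum Truncate { NONE, FIRST, SECOND }

// Stored tokenization settings; a span of -1 means "span is unset".
record TokenizationSettings(Truncate truncate, int span) {
    static final int UNSET_SPAN = -1;

    TokenizationSettings {
        // Mirrors the validation rule quoted in the error message above.
        if (span != UNSET_SPAN && truncate != Truncate.NONE) {
            throw new IllegalArgumentException(
                "[span] must not be provided when [truncate] is not [none]");
        }
    }
}

// A per-request override; an empty Optional means "keep the stored value".
record TokenizationUpdate(Optional<Truncate> truncate, Optional<Integer> span) {}

final class TruncateSpanSketch {

    // Old behaviour: the stored span is carried over, so overriding truncate
    // on a model configured with a span fails validation.
    static TokenizationSettings applyKeepingSpan(TokenizationSettings current, TokenizationUpdate update) {
        Truncate truncate = update.truncate().orElse(current.truncate());
        int span = update.span().orElse(current.span());
        return new TokenizationSettings(truncate, span);
    }

    // New behaviour: when the request sets truncate to FIRST or SECOND and does
    // not supply a span (assumption), the stored span is wiped out.
    static TokenizationSettings applyClearingSpan(TokenizationSettings current, TokenizationUpdate update) {
        Truncate truncate = update.truncate().orElse(current.truncate());
        int span = update.span().orElse(current.span());
        boolean truncateOverridden = update.truncate().isPresent() && truncate != Truncate.NONE;
        if (truncateOverridden && update.span().isEmpty()) {
            span = TokenizationSettings.UNSET_SPAN;
        }
        return new TokenizationSettings(truncate, span);
    }

    public static void main(String[] args) {
        TokenizationSettings stored = new TokenizationSettings(Truncate.NONE, 128);
        TokenizationUpdate override =
            new TokenizationUpdate(Optional.of(Truncate.SECOND), Optional.empty());
        try {
            applyKeepingSpan(stored, override);
        } catch (IllegalArgumentException e) {
            System.out.println("before: " + e.getMessage());
        }
        System.out.println("after: " + applyClearingSpan(stored, override));
    }
}
```

In this sketch an explicit span in the request is still validated, so sending truncate: second together with a positive span would continue to be rejected.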

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.6.0 labels Nov 1, 2022
@davidkyle davidkyle added >bug :ml Machine learning v8.5.1 and removed needs:triage Requires assignment of a team area label labels Nov 1, 2022
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Nov 1, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Looks good, just a comment about clarifying constructors for RobertaTokenization.

@@ -51,7 +51,7 @@ public static RobertaTokenization fromXContent(XContentParser parser, boolean le

private final boolean addPrefixSpace;

- private RobertaTokenization(
+ public RobertaTokenization(
Contributor

Why is there a constructor that doesn't expect doLowerCase for RobertaTokenization? The only public constructor before this change forces lower case to false. This seems worth clarifying while we're refactoring this area.

Member Author

I reverted this change as it was only a convenience for a test. The value of doLowerCase should always be false.
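As an illustration of that invariant, a constructor can pin the flag so that callers have no way to set it. This is only a sketch: `TokenizationBase` and `RobertaStyleTokenization` are hypothetical names and the real classes take more parameters.

```java
// Hypothetical base class holding an option shared by the tokenizers.
abstract class TokenizationBase {
    private final boolean doLowerCase;

    protected TokenizationBase(boolean doLowerCase) {
        this.doLowerCase = doLowerCase;
    }

    boolean doLowerCase() {
        return doLowerCase;
    }
}

// Hypothetical RoBERTa-style configuration.
final class RobertaStyleTokenization extends TokenizationBase {
    private final boolean addPrefixSpace;

    // The only constructor: doLowerCase is not a parameter, it is always false.
    RobertaStyleTokenization(boolean addPrefixSpace) {
        super(false);
        this.addPrefixSpace = addPrefixSpace;
    }

    boolean addPrefixSpace() {
        return addPrefixSpace;
    }
}
```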

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM

@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
8.5

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Nov 2, 2022
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 3, 2022