Serialize MediaType as an integer to avoid string parsing #16073

andrross · 2024-09-24T21:41:19Z

Related Issues

Resolves #15979

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-09-24T21:48:06Z

❌ Gradle check result for 412c076: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Andrew Ross <andrross@amazon.com>

github-actions · 2024-09-24T23:24:57Z

❌ Gradle check result for ff72418: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-25T17:04:21Z

✅ Gradle check result for ff72418: SUCCESS

codecov · 2024-09-25T17:05:22Z

Codecov Report

Attention: Patch coverage is 75.00000% with 10 lines in your changes missing coverage. Please review.

Project coverage is 71.90%. Comparing base (b3cc802) to head (ff72418).
Report is 107 commits behind head on main.

Files with missing lines	Patch %	Lines
...rg/opensearch/core/xcontent/MediaTypeRegistry.java	66.66%	2 Missing and 1 partial ⚠️
...n/java/org/opensearch/core/xcontent/MediaType.java	71.42%	1 Missing and 1 partial ⚠️
...search/action/search/PutSearchPipelineRequest.java	0.00%	2 Missing ⚠️
...nsearch/search/pipeline/PipelineConfiguration.java	0.00%	2 Missing ⚠️
.../opensearch/core/common/io/stream/StreamInput.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #16073      +/-   ##
============================================
- Coverage     72.01%   71.90%   -0.12%     
+ Complexity    64425    64378      -47     
============================================
  Files          5281     5281              
  Lines        301004   300972      -32     
  Branches      43482    43461      -21     
============================================
- Hits         216779   216416     -363     
- Misses        66423    66840     +417     
+ Partials      17802    17716      -86

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

dblock

Help me understand where these uniqueId's come from? For XContentType I see it uses index? Do those change? What happens when a new media type is inserted into the list of possible media types in a minor version?

reta · 2024-09-27T13:36:55Z

Adding to @dblock concerns, the media type has variable part. Fe "application/json; charset=UTF-8" or "application/json; charset=UTF-16", all of those are "application/json" but have charset deviation, the MediaRegistry knows only about "application/json" (if I am not mistaken). The unique id concept will not cover it since media type is also coming from client requests.

May be what we could do, in order to simplify things, is to cache media types on each node? Fe using hash or something? How would it work (in the nutshell):

write side:

out.write(mediaType.hash());
out.write(mediaType);

read side:

hash = out.read();
mediaTypeStr =  out.readString();
if (cache.contains(hash)) {
   // use from cache
} else {
   MediaType.fromMediaType(mediaTypeStr);
}

andrross · 2024-10-15T21:59:05Z

the media type has variable part. Fe "application/json; charset=UTF-8" or "application/json; charset=UTF-16", all of those are "application/json" but have charset deviation, the MediaRegistry knows only about "application/json" (if I am not mistaken). The unique id concept will not cover it since media type is also coming from client requests.

@reta @dblock This is a fair point, but the code that exists today can only model the variable part of the media type as a ParsedMediaType type. Once we're down to a MediaType instance, then the variable part has been discarded and (for the JSON case) it is going to say "application/json; charset=UTF-8" in all cases because that is hardcoded.

I think there may be a disconnect in the object modelling in that the MediaType interface is quite generic, but in practice the only implementation is as the XContentType enum. This interface was only relatively recently introduced to decouple the specific XContentType enum out of the opensearch-core library. Prior to that the enum ordinal was being serialized over the transport layer. The refactoring in this PR gets us back to that behavior, but retains the decoupling by adding this new uniqueId concept to the interface.

One alternative, which I mentioned in the issue, is to simply add in a short-circuit optimization in the parseMediaType() method where if the string matches one of the hardcoded values it will return without actually doing any parsing. We'd still be paying the penalty of writing the whole (hardcoded) "application/json; charset=UTF-8" string instead of just a single byte though. @reta what do you think?

reta · 2024-10-16T18:12:20Z

Once we're down to a MediaType instance, then the variable part has been discarded and (for the JSON case) it is going to say "application/json; charset=UTF-8" in all cases because that is hardcoded.

I did some debugging and you are very right, the whatever the content type request had is narrowed down to the respective XContent's media type.

I think there may be a disconnect in the object modelling in that the MediaType interface is quite generic, but in practice the only implementation is as the XContentType enum.

This is actually an important point - the media type at least designed to be extensible.

One alternative, which I mentioned in the issue, is to simply add in a short-circuit optimization in the parseMediaType() method where if the string matches one of the hardcoded values it will return without actually doing any parsing.

I think that would be a best option, that we could generalize in MediaTypeRegistry by checking XContentType::mediaType first (since these ones are known and hardcoded).

andrross · 2024-10-16T20:24:44Z

Closing in favor of #16358

github-actions bot added enhancement Enhancement or improvement to existing feature or request Libraries Lucene Upgrades and Libraries, Any 3rd party library that Core depends on, ex: nebula; team is respo Performance This is for any performance related enhancements or bugs labels Sep 24, 2024

andrross added the skip-changelog label Sep 24, 2024

Serialize MediaType as an integer to avoid string parsing

ff72418

Signed-off-by: Andrew Ross <andrross@amazon.com>

andrross force-pushed the media-type-int branch from 412c076 to ff72418 Compare September 24, 2024 21:55

opensearch-ci-bot mentioned this pull request Sep 25, 2024

[AUTOCUT] Gradle Check Flaky Test Report for RecoveryFromGatewayIT #14304

Open

dblock reviewed Sep 26, 2024

View reviewed changes

andrross mentioned this pull request Oct 16, 2024

Skip media type parsing for known string values #16358

Merged

3 tasks

andrross closed this Oct 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serialize MediaType as an integer to avoid string parsing #16073

Serialize MediaType as an integer to avoid string parsing #16073

andrross commented Sep 24, 2024

github-actions bot commented Sep 24, 2024

github-actions bot commented Sep 24, 2024

github-actions bot commented Sep 25, 2024

codecov bot commented Sep 25, 2024 •

edited

Loading

dblock left a comment

reta commented Sep 27, 2024

andrross commented Oct 15, 2024

reta commented Oct 16, 2024 •

edited

Loading

andrross commented Oct 16, 2024

Serialize MediaType as an integer to avoid string parsing #16073

Serialize MediaType as an integer to avoid string parsing #16073

Conversation

andrross commented Sep 24, 2024

Related Issues

Check List

github-actions bot commented Sep 24, 2024

github-actions bot commented Sep 24, 2024

github-actions bot commented Sep 25, 2024

codecov bot commented Sep 25, 2024 • edited Loading

Codecov Report

dblock left a comment

Choose a reason for hiding this comment

reta commented Sep 27, 2024

andrross commented Oct 15, 2024

reta commented Oct 16, 2024 • edited Loading

andrross commented Oct 16, 2024

codecov bot commented Sep 25, 2024 •

edited

Loading

reta commented Oct 16, 2024 •

edited

Loading