Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize MediaType as an integer to avoid string parsing #16073

Closed
wants to merge 1 commit into from

Conversation

andrross
Copy link
Member

Related Issues

Resolves #15979

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Libraries Lucene Upgrades and Libraries, Any 3rd party library that Core depends on, ex: nebula; team is respo Performance This is for any performance related enhancements or bugs labels Sep 24, 2024
Copy link
Contributor

❌ Gradle check result for 412c076: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Andrew Ross <andrross@amazon.com>
Copy link
Contributor

❌ Gradle check result for ff72418: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for ff72418: SUCCESS

Copy link

codecov bot commented Sep 25, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 10 lines in your changes missing coverage. Please review.

Project coverage is 71.90%. Comparing base (b3cc802) to head (ff72418).
Report is 107 commits behind head on main.

Files with missing lines Patch % Lines
...rg/opensearch/core/xcontent/MediaTypeRegistry.java 66.66% 2 Missing and 1 partial ⚠️
...n/java/org/opensearch/core/xcontent/MediaType.java 71.42% 1 Missing and 1 partial ⚠️
...search/action/search/PutSearchPipelineRequest.java 0.00% 2 Missing ⚠️
...nsearch/search/pipeline/PipelineConfiguration.java 0.00% 2 Missing ⚠️
.../opensearch/core/common/io/stream/StreamInput.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16073      +/-   ##
============================================
- Coverage     72.01%   71.90%   -0.12%     
+ Complexity    64425    64378      -47     
============================================
  Files          5281     5281              
  Lines        301004   300972      -32     
  Branches      43482    43461      -21     
============================================
- Hits         216779   216416     -363     
- Misses        66423    66840     +417     
+ Partials      17802    17716      -86     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Help me understand where these uniqueId's come from? For XContentType I see it uses index? Do those change? What happens when a new media type is inserted into the list of possible media types in a minor version?

@reta
Copy link
Collaborator

reta commented Sep 27, 2024

Adding to @dblock concerns, the media type has variable part. Fe "application/json; charset=UTF-8" or "application/json; charset=UTF-16", all of those are "application/json" but have charset deviation, the MediaRegistry knows only about "application/json" (if I am not mistaken). The unique id concept will not cover it since media type is also coming from client requests.

May be what we could do, in order to simplify things, is to cache media types on each node? Fe using hash or something? How would it work (in the nutshell):

  • write side:
out.write(mediaType.hash());
out.write(mediaType);
  • read side:
hash = out.read();
mediaTypeStr =  out.readString();
if (cache.contains(hash)) {
   // use from cache
} else {
   MediaType.fromMediaType(mediaTypeStr);
}

@andrross
Copy link
Member Author

the media type has variable part. Fe "application/json; charset=UTF-8" or "application/json; charset=UTF-16", all of those are "application/json" but have charset deviation, the MediaRegistry knows only about "application/json" (if I am not mistaken). The unique id concept will not cover it since media type is also coming from client requests.

@reta @dblock This is a fair point, but the code that exists today can only model the variable part of the media type as a ParsedMediaType type. Once we're down to a MediaType instance, then the variable part has been discarded and (for the JSON case) it is going to say "application/json; charset=UTF-8" in all cases because that is hardcoded.

I think there may be a disconnect in the object modelling in that the MediaType interface is quite generic, but in practice the only implementation is as the XContentType enum. This interface was only relatively recently introduced to decouple the specific XContentType enum out of the opensearch-core library. Prior to that the enum ordinal was being serialized over the transport layer. The refactoring in this PR gets us back to that behavior, but retains the decoupling by adding this new uniqueId concept to the interface.

One alternative, which I mentioned in the issue, is to simply add in a short-circuit optimization in the parseMediaType() method where if the string matches one of the hardcoded values it will return without actually doing any parsing. We'd still be paying the penalty of writing the whole (hardcoded) "application/json; charset=UTF-8" string instead of just a single byte though. @reta what do you think?

@reta
Copy link
Collaborator

reta commented Oct 16, 2024

Once we're down to a MediaType instance, then the variable part has been discarded and (for the JSON case) it is going to say "application/json; charset=UTF-8" in all cases because that is hardcoded.

I did some debugging and you are very right, the whatever the content type request had is narrowed down to the respective XContent's media type.

I think there may be a disconnect in the object modelling in that the MediaType interface is quite generic, but in practice the only implementation is as the XContentType enum.

This is actually an important point - the media type at least designed to be extensible.

One alternative, which I mentioned in the issue, is to simply add in a short-circuit optimization in the parseMediaType() method where if the string matches one of the hardcoded values it will return without actually doing any parsing.

I think that would be a best option, that we could generalize in MediaTypeRegistry by checking XContentType::mediaType first (since these ones are known and hardcoded).

@andrross
Copy link
Member Author

Closing in favor of #16358

@andrross andrross closed this Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Libraries Lucene Upgrades and Libraries, Any 3rd party library that Core depends on, ex: nebula; team is respo Performance This is for any performance related enhancements or bugs skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CPU intensive string parsing of MediaType field in binary serialization
3 participants