
perf(NODE-6450): Lazy objectId hex string cache #722

Merged · 8 commits · Nov 20, 2024

Conversation

@SeanReece (Contributor) commented Oct 23, 2024

Description

When ObjectId.cacheHexString is enabled there is a performance impact on ObjectId creation and the extra memory used by the hex string is consumed even if .toHexString() is never called on the ObjectId.

Lazily caching the hex string instead provides the following benefits:

  • No performance impact on ObjectId creation
  • Memory use does not increase unless the hex string is requested
  • Duplicate .toHexString() calls remain efficient because the cache is used

Possible Negatives:

  • The hex-string conversion cost moves to the first .toHexString() invocation. Overall cost is unchanged, but it is no longer front-loaded onto ObjectId creation.

Performance

Custom benchmarks with cacheHexString=true

| Test Case | bson ops/sec | bson lazy ops/sec | Improvement (%) |
| --- | --- | --- | --- |
| new ObjectId(string) | 3,386,079 | 4,805,488 | 41.92% |
| new ObjectId() | 5,269,078 | 11,309,317 | 114.64% |
| new ObjectId(string) + toHexString() | 3,334,197 | 4,780,762 | 43.39% |
| new ObjectId() + toHexString() | 5,082,142 | 5,080,183 | 0% |

The new ObjectId(string) + toHexString() test also sees an improvement because we were previously performing an unnecessary toHex conversion when a string is passed in.

Running granular benchmarks with cacheHexString=true

| Test Case | bson throughput | bson lazy throughput | Improvement (%) |
| --- | --- | --- | --- |
| objectid_array_1000._deserialize_bson | 95.1MB | 178.2MB | 87.4% |
| objectid_array_100._deserialize_bson | 87.5MB | 158.6MB | 81.1% |
| objectid_array_10._deserialize_bson | 77.5MB | 132.9MB | 71.5% |

What is changing?

When cacheHexString=true, we no longer call ByteUtils.toHex(buffer) in the constructor; toHexString() already checks for and fills the cache. If a string is passed into the constructor, we can cache it immediately, since doing so has no performance cost and allocates no extra memory.
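The lazy-caching idea described here can be sketched as follows. This is a minimal illustration, not the actual bson internals; the class, helper names, and random-byte generation are all stand-ins:

```typescript
// Illustrative sketch of lazy hex caching; names are not the bson source.
const toHex = (buf: Uint8Array): string =>
  Array.from(buf, b => b.toString(16).padStart(2, '0')).join('');

const fromHex = (hex: string): Uint8Array =>
  Uint8Array.from(hex.match(/../g)!.map(pair => parseInt(pair, 16)));

class LazyObjectId {
  static cacheHexString = true;
  readonly buffer: Uint8Array;
  private __id?: string; // hex cache, left empty until first use

  constructor(id?: string | Uint8Array) {
    if (typeof id === 'string') {
      this.buffer = fromHex(id);
      // A string input is cached immediately: no extra work or allocation.
      if (LazyObjectId.cacheHexString) this.__id = id.toLowerCase();
    } else {
      // Random 12-byte id for the sketch; note there is no toHex() call
      // here -- the conversion cost is deferred to the first toHexString().
      this.buffer =
        id ?? Uint8Array.from({ length: 12 }, () => Math.floor(Math.random() * 256));
    }
  }

  toHexString(): string {
    if (this.__id !== undefined) return this.__id;
    const hex = toHex(this.buffer);
    if (LazyObjectId.cacheHexString) this.__id = hex;
    return hex;
  }
}
```

This makes the trade-off in the PR description concrete: construction never pays for hex conversion, and repeat toHexString() calls hit the cache.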

Is there new documentation needed for these changes?

No.

What is the motivation for this change?

We often pull many ObjectIds into memory (possibly millions) and need to deduplicate arrays of them. To deduplicate ObjectIds efficiently (using a Set) we need the string representation of each ObjectId, and then when serializing documents to JSON we need the string representation again. Converting an ObjectId to a hex string is a somewhat expensive operation that we'd like to avoid when possible.
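The deduplication workflow described here might look roughly like the sketch below. The dedupe helper and HasHex interface are hypothetical, written generically so the idea stands on its own; with cacheHexString enabled, the repeated toHexString() calls would hit the cache:

```typescript
// Hypothetical dedup helper using the hex string as the Set key.
interface HasHex {
  toHexString(): string;
}

function dedupe<T extends HasHex>(ids: T[]): T[] {
  const seen = new Set<string>();
  const out: T[] = [];
  for (const id of ids) {
    // Identity is byte equality, so the hex string is a convenient
    // primitive key; with cacheHexString=true this call is cheap on repeats.
    const hex = id.toHexString();
    if (!seen.has(hex)) {
      seen.add(hex);
      out.push(id);
    }
  }
  return out;
}
```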

Enabling cacheHexString would improve the efficiency of these specific scenarios, but would cause a performance hit and increased memory usage across the board for scenarios that may never need the ObjectId's string representation.

Implementing lazy hex string caching allows us to enable cacheHexString with zero performance impact, while even seeing performance improvements in some scenarios.

Release Highlight

Cache the hex string of an ObjectId lazily

When ObjectId.cacheHexString is set to true, we no longer convert the buffer to a hex string in the constructor, since any call to objectid.toHexString() already fills the cache.

Additionally, if a string is passed into the constructor we can cache this immediately as there is no performance impact and no extra memory that needs to be allocated.

This improves performance in situations where you are parsing ObjectIds from strings (e.g. JSON) and want to avoid recalculating the hex. It also helps when you have ObjectIds coming from BSON and only convert some of them to strings, perhaps after applying a filter to eliminate the rest.

With cacheHexString enabled, deserializing ObjectIds from BSON shows a ~80% throughput improvement, and ObjectIds that were constructed from a string convert to strings ~40% faster!

Thanks to @SeanReece for contributing this improvement!

Double check the following

  • Ran npm run check:lint script
  • Self-review completed using the steps outlined here
  • PR title follows the correct format: type(NODE-xxxx)[!]: description
    • Example: feat(NODE-1234)!: rewriting everything in coffeescript
  • Changes are covered by tests
  • New TODOs have a related JIRA ticket

@aditi-khare-mongoDB (Contributor) commented Oct 30, 2024

Hi @SeanReece, thank you for submitting this pull request!

As of now, we won't explore fully supporting a lazily loaded hex string cache until the next major release (see NODE-6480), for the following reasons:

  1. It changes the behavior of the non-hidden ObjectId.__id property upon construction
  2. Generic deep-equality checkers, such as Lodash's, will no longer work safely on ObjectId

For the time being, when ObjectId.cacheHexString === true, let's still cache the hex string in the constructor. When a hex string is passed to the constructor there's no need to call toHex, but otherwise toHex is still called. This still allows a perf optimization without being a breaking change. What do you think?
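The non-breaking variant suggested here could be sketched like this. All names are illustrative (not the bson source); the point is that the cache is still filled eagerly in the constructor, and the only saving is skipping the conversion when the caller already supplied the string:

```typescript
// Sketch of the eager-but-optimized variant; names are illustrative.
const toHex = (buf: Uint8Array): string =>
  Array.from(buf, b => b.toString(16).padStart(2, '0')).join('');

class EagerCachedObjectId {
  static cacheHexString = true;
  readonly buffer: Uint8Array;
  private __id?: string;

  constructor(id: string | Uint8Array) {
    this.buffer =
      typeof id === 'string'
        ? Uint8Array.from(id.match(/../g)!.map(pair => parseInt(pair, 16)))
        : id;
    if (EagerCachedObjectId.cacheHexString) {
      // The cache is filled eagerly, so deep-equality checkers see the
      // same shape as before; the only saving is skipping toHex when the
      // caller already supplied the hex string.
      this.__id = typeof id === 'string' ? id.toLowerCase() : toHex(this.buffer);
    }
  }

  toHexString(): string {
    return this.__id ?? toHex(this.buffer);
  }
}
```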

@SeanReece (Contributor, Author) commented Oct 30, 2024

Thanks for the review @aditi-khare-mongoDB

Good point about deep equality. That's the same thing holding my other PR (#707) as well 😅

Being able to use # to hide __id would be great, but I noticed a significant performance and memory hit: since we target ES2021 and # was added in ES2022, tslib has to shim every access to __id.

Using a WeakMap instead of # to create a truly private __id shows no performance hit, so that's one option if we don't want to bump our target anytime soon.

Another option is to provide a cacheHexStringLazy option with documentation around it being incompatible with deep equality.

I can split the PR to implement the toHex perf improvement, but the main goal of this PR was to completely avoid any performance / memory impact when caching is not necessary. We have an application that is very memory sensitive and we handle millions of ObjectIds at a time.

Edit: This is what I'm thinking in terms of using weakMap for private members (if moving to ES2022 is not an option). Performance seems to be good with this implementation. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/WeakMap#emulating_private_members
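The WeakMap pattern from the MDN page linked above would look roughly like this in our context. The class and cache names below are stand-ins, not the PR's actual code; the key property is that the cache never appears as an own property of the instance (so Object.keys and deep-equality checkers can't see it) and is garbage-collected together with the instance:

```typescript
// Module-level WeakMap emulating a private member (per the MDN pattern).
const hexCache = new WeakMap<object, string>();

class ObjectIdLike {
  constructor(readonly buffer: Uint8Array) {}

  toHexString(): string {
    let hex = hexCache.get(this);
    if (hex === undefined) {
      hex = Array.from(this.buffer, b => b.toString(16).padStart(2, '0')).join('');
      hexCache.set(this, hex); // cached lazily, outside the instance itself
    }
    return hex;
  }
}
```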

@aditi-khare-mongoDB (Contributor) commented
Hi again @SeanReece, thanks for sharing your thoughts and replying so promptly! Next week, the team will consider your points, discuss the best path forward, and get back to you.

@baileympearson (Contributor) commented
Hey @SeanReece, thanks for bearing with us on this PR. Private properties implemented with WeakMaps work for us, if you're willing to make those changes. We'll need to figure out a way to test the caching behavior, though; the existing tests rely on accessing the TS-private __id variable.

@SeanReece (Contributor, Author) commented
@baileympearson great! I don’t mind implementing this. Good point about the tests. I’ll investigate how to best make these changes and make them testable 🤔

@SeanReece (Contributor, Author) commented Nov 13, 2024

@baileympearson I've updated the code to hide __id using the WeakMap implementation. There is a slight performance hit vs. direct property access, but it's still faster than using # with an ES2021 target, and there are still meaningful performance gains vs. main. I've added a // TODO to convert this to a # private field when the target is updated to ES2022.

As for the tests, I've added an isCached() private function that checks whether the hex string is cached. This also lets us test the lazy behavior by ensuring isCached() only returns true after toHexString() has been called once.

Let me know if you have any questions or concerns about anything here.
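The testing approach described here might look roughly like the following, written against a stand-in class; the PR's actual isCached() helper and its wiring may differ:

```typescript
// Stand-in class with a WeakMap cache and a test-only isCached() helper.
const cache = new WeakMap<object, string>();

class LazyId {
  constructor(readonly buffer: Uint8Array) {}

  toHexString(): string {
    let hex = cache.get(this);
    if (hex === undefined) {
      hex = Array.from(this.buffer, b => b.toString(16).padStart(2, '0')).join('');
      cache.set(this, hex);
    }
    return hex;
  }

  // Test-only visibility into the cache, mirroring the isCached() idea.
  static isCached(id: LazyId): boolean {
    return cache.has(id);
  }
}

// The lazy behavior is observable without touching any private field:
const id = new LazyId(new Uint8Array(12));
if (LazyId.isCached(id)) throw new Error('should not be cached before toHexString()');
id.toHexString();
if (!LazyId.isCached(id)) throw new Error('should be cached after toHexString()');
```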

Perf tests

main vs lazy-weakMap (🎉 WINNERS)

| Test Case | main | lazy-weakMap | Improvement (%) |
| --- | --- | --- | --- |
| objectid_singleFieldDocument._deserialize_bson | 46.3MB | 154.7MB | 234.3% |
| objectid_array_1000._deserialize_bson | 113.2MB | 294.3MB | 160.1% |
| objectid_array_100._deserialize_bson | 106.1MB | 261.8MB | 146.8% |
| objectid_array_10._deserialize_bson | 94.4MB | 180.6MB | 91.2% |
| objectid_singleElementArray._deserialize_bson | 76.5MB | 105.5MB | 37.8% |

@durran durran self-assigned this Nov 15, 2024
@durran durran added the Primary Review In Review with primary reviewer, not yet ready for team's eyes label Nov 15, 2024
@durran previously approved these changes Nov 15, 2024
@durran durran added Team Review Needs review from team and removed Primary Review In Review with primary reviewer, not yet ready for team's eyes labels Nov 15, 2024
@baileympearson (Contributor) left a comment

One small change - otherwise LGTM!

@SeanReece (Contributor, Author) commented
@baileympearson Good catch. I've added it back.

Any idea when v7 is planned for?

@baileympearson (Contributor) commented
@SeanReece We don't have a target date for BSON v7 or driver v7.

Co-authored-by: Bailey Pearson <bailey.pearson@gmail.com>
@durran durran merged commit 7c37580 into mongodb:main Nov 20, 2024
6 of 7 checks passed
@SeanReece SeanReece deleted the lazy-objectid-cache branch November 20, 2024 22:08