Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Use charCodeAt instead of TextEncoder for improved UTF string comparison performance. #8778

Closed
amiller-gh opened this issue Feb 9, 2025 · 8 comments

Comments

@amiller-gh
Copy link

amiller-gh commented Feb 9, 2025

Operating System

MacOS Ventura 13.2.1

Environment (if applicable)

Node.js 18.19.1, Chrome 131.0.6778.266

Firebase SDK Version

11.3.0

Firebase SDK Product(s)

Firestore

Project Tooling

Browser react app, Node.js Express server, Electron app with React frontend.

Detailed Problem Description

We've recently migrated a number of systems from firebase-admin to the firebase client SDKs (internal browser-based dashboards, cloud functions, electron background process, etc) and noticed a major reduction in performance for batch Firestore write operations.

This is the flame graph for a 500 document batch write operation that takes ~868ms:
Image

Unfortunately, we have some operations that need to run hundreds of thousands of writes, and repeat this 500 document batch write many times over. All of this time adds up significantly!

Zooming in a little, we saw that the culprit appears to be a TextEncoder.encode() call from a function called compareUtf8Strings that 1) takes some time to encode the characters, and 2) is quickly thrown out, causing a significant amount of garbage collection.
Image

The current implementation uses TextEncoder to compare strings as UTF-8 integers in byte order:

/** Compare strings in UTF-8 encoded byte order */
export function compareUtf8Strings(left: string, right: string): number {
  // Convert the string to UTF-8 encoded bytes
  const encodedLeft = newTextEncoder().encode(left);
  const encodedRight = newTextEncoder().encode(right);

  for (let i = 0; i < Math.min(encodedLeft.length, encodedRight.length); i++) {
    const comparison = primitiveComparator(encodedLeft[i], encodedRight[i]);
    if (comparison !== 0) {
      return comparison;
    }
  }

  return primitiveComparator(encodedLeft.length, encodedRight.length);
}

A very small change to use charCodeAt instead means we avoid creating thousands of new TextEncoder objects:

/** Compare strings in UTF-16 encoded byte order */
export function compareUtf16Strings(left: string, right: string): number {
  for (let i = 0; i < Math.min(left.length, right.length); i++) {
    const comparison = primitiveComparator(left.charCodeAt(i), right.charCodeAt(i));
    if (comparison !== 0) {
      return comparison;
    }
  }
  return primitiveComparator(left.length, right.length);
}

And reduce the runtime of this operation from 868ms to 66ms!

Image

To get these perf improvements, this change introduces one small, but hopefully inconsequential, change to the way this comparator behaves: TextEncoder converts the string to a UInt8Array, and compares by looking at the individual UTF-8 encoded bytes, but charCodeAt fetches the character as UTF-16, represented by a standard JS 64 bit float.

On first blush, I don't see how this will introduce any practical changes to the return value of this function – please correct me if I'm wrong! I'm not sure if the library was comparing strings as UTF-8 for any particular reason (Mirroring underlying database behavior? Downstream perf implications?), but if not, this may be a very easy win for client side SDK performance.

I've opened up a PR with this change at #8779. I'm boarding a flight right now, so will make sure tests all pass still once I'm back the ground, but wanted to open up this issue to kick off discussion.

Steps and code to reproduce issue

See above.

@amiller-gh amiller-gh added new A new issue that hasn't be categoirzed as question, bug or feature request question labels Feb 9, 2025
@google-oss-bot
Copy link
Contributor

I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.

@dlarocque
Copy link
Contributor

Hi @amiller-gh, thanks for submitting such a detailed issue, and the PR! It's very much appreciated. We are looking into this.

This issue was introduced in 11.3.0 last week (721e5a7), downgrading to 11.2.0 temporarily will resolve this until we publish the next release with a fix.

@milaGGL
Copy link
Contributor

milaGGL commented Feb 10, 2025

Hi @amiller-gh, yes, UTF-8 encoding is introduced to be consistent with the backend string sorting. We are currently looking into more efficient ways (and hopefully without burdening the memory) to do UTF-8 encoding in the sdk. Sorry for the inconvenience.

@amiller-gh
Copy link
Author

No worries, glad I could be helpful! Thank you both.

@dlarocque
Copy link
Contributor

Hey @amiller-gh, I just released 11.3.1. This release reverted the change that was causing this issue. Could you confirm that this issue no longer exists in 11.3.1?

@amiller-gh
Copy link
Author

Hey @dlarocque, confirmed! The perf regression has disappeared. Thanks for the fast turnaround 🙏

This was a sub-set of a number of client side "bulk-import" issues that we've been wrestling with, largely related to how the Firestore server treats client rate limits vs firebase-admin rate limits.

This client side bulk upload problem seems to be a not infrequent issue: #7655 (comment)

Even just uploading 20k-50k documents from the client SDK (not that many for most "bulk import" systems), we run into RESOURCE_EXHAUSTED errors very quickly, even with throttled, batched queries, and the upload slows to a crawl. It's been a very difficult process just to get something presentable that can reliably upload this amount of data without having to go through the server!

Do you think there is an appetite to visit this internally, and can I start a new issue to discuss client vs admin rate limit differences in more detail?

@dlarocque
Copy link
Contributor

@amiller-gh Thanks for confirming that this issue is resolved!

I'm sorry to hear you've been having other issues with Firestore. If you feel that this discussion is different than the one had in #7655, feel free to open a new issue. Other ways to draw more attention to this would be to report the issue to Firebase Support, or submit an idea to UserVoice

@milaGGL
Copy link
Contributor

milaGGL commented Feb 14, 2025

Hi @amiller-gh, could you please help custom build and test this fix out? A lazy encoding is used instead of original "encode them all first" method. Hopefully, this should be able to do the UTF-8 encoded string comparison without dragging the performance down.

#  Clone the web SDK 
git clone https://github.com/firebase/firebase-js-sdk.git
cd firebase-js-sdk

# Checkout the modified branch and build
git checkout origin/mila/fix-string-utf8-comparison
npm install -g yarn
yarn
yarn build

# Create a .tgz file for Firestore
cd packages/firestore
yarn pack

# Cd to your project directory containing package.json and install the package
npm install <path_to_tgz_file_created_by_yarn_pack>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants