Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextDecoder().decode() pollyfill performance issue #316

Closed
samuelbrian opened this issue Oct 19, 2022 · 5 comments
Closed

TextDecoder().decode() pollyfill performance issue #316

samuelbrian opened this issue Oct 19, 2022 · 5 comments

Comments

@samuelbrian
Copy link
Contributor

samuelbrian commented Oct 19, 2022

Since changing from the text-encoding pollyfill to text-encoding-utf-8 (in #311), the performance of text decoding has dropped significantly.

In my testing, it was taking over 3 seconds to read a 175kb file and decode it into a JS string. It does depends on the size of the chunk you're giving to the text decoder.

Part of the problem is that the decoding routine in text-encoding-utf-8 examines each byte in the input buffer, it shifts the byte from the start (see the Stream class here) .
The original text-encoding does this better (see the Stream class here) by reversing the input buffer and operating on the end of the buffer rather than the start.

I've applied the reversing changes to text-encoding-utf-8 (in samuelbrian@99b895c) and it's seems on-par with the original version. But I'm not sure the best way to manage patching the node_modules dependencies, short of forking and maintaining a copy of text-encoding-utf-8.

Even better is using text encoding and decoding from QuickJS which I have done here samuelbrian@f308a6d, and that makes a file read that was taking >3 seconds into a <1 millisecond operation.
I've exposed these as tjs functions rather than implement them as TextEncoder/TextDecoder and remove the pollyfill, mainly because I didn't want to have to deal with the streaming option. But it might be viable to do.

@saghul
Copy link
Owner

saghul commented Oct 19, 2022

Good analysis, thanks!

How about this: implement our own TextDecoder polyfill using the C functions you created under the hood.

Did the original one support streaming? I didn't think it did...

@samuelbrian
Copy link
Contributor Author

The original and UTF-8-only pollyfills do support streaming. I'm relying on it right now to decode while reading single bytes from raw mode stdin.

I think to correctly handle streaming, ignoreBOM, and fatal, it'd have to go deeper and use QuickJS's unicode_to_utf8() and unicode_from_utf8().

@saghul
Copy link
Owner

saghul commented Oct 20, 2022

Oh I see.

So another approach is to go the patch-package route. We can patch a Node module at install time (see the patches directory) and either fix the current dep or revert back and slash other encodings?

@samuelbrian
Copy link
Contributor Author

I never noticed the patches directory. That's perfect.

I did start working on implementing the TextDecoder API in C, only using JS_NewStringLen, and testing against test cases I found in the Firefox source tree.
Streaming and BOM were fine, but handling decoding errors did not work out well.
Fatal was a hack because I was searching for the � in the output string and throwing if it was found, but that would also be triggered if the � was in the input data.
And there were problems handling invalid UTF-8 when it is chunked across streamed decodes.

To do it right, I think I'd need to expose unicode_from_utf8() and also add something like a JS_NewStringUnicode that'd let me create a string that is already decoded.

samuelbrian added a commit to samuelbrian/txiki.js that referenced this issue Nov 4, 2022
The UTF-8-only encoder/decoder was performing poorly due to inefficient
array manipulation.
Applied changes from the original text-encoding package that reduce
the amount of copying.

Relates to saghul#316
saghul pushed a commit that referenced this issue Nov 4, 2022
The UTF-8-only encoder/decoder was performing poorly due to inefficient
array manipulation.
Applied changes from the original text-encoding package that reduce
the amount of copying.

Relates to #316
@saghul
Copy link
Owner

saghul commented Dec 8, 2022

Fixed in #321

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants