Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextDecoding pretty slow #447

Open
lal12 opened this issue Dec 21, 2023 · 14 comments
Open

TextDecoding pretty slow #447

lal12 opened this issue Dec 21, 2023 · 14 comments

Comments

@lal12
Copy link
Contributor

lal12 commented Dec 21, 2023

I recently wondered why TextDecoding was so slow. Several seconds for 50-100 Kbytes (I am running on a low power embedded system with single core performance comparable to RPI 1).

I saw it currently uses a JS only implementation. I am wondering why, the JS_NewStringLen of quickjs already handles utf8 doesn't it?

As a workaround I am currently using

import FFI from 'tjs:ffi';
export function quickDecodeUtf8(buf) {
    return FFI.bufferToString(buf);
}

Which also just does a JS_NewString.

The opposite case of Encoding isn't quite as slow, but it seems JS_ToCString would also handle this case.

Though maybe I am missing something, e.g. some feature TextEncoder/TextDecoder which would not be happening/handled in the quickjs functions.

@saghul
Copy link
Owner

saghul commented Dec 21, 2023

I think you might be right. Maybe running some WPT tests would help clarify if there are any missing corner cases.

@saghul
Copy link
Owner

saghul commented Dec 21, 2023

I remember we discussed that here #316 and didn't go for it, but I'm happy to revisit.

@lal12
Copy link
Contributor Author

lal12 commented Dec 21, 2023

I remember we discussed that here #316 and didn't go for it, but I'm happy to revisit.

Ah, I misread that as only being about the changing of the JS polyfill.

@saghul
Copy link
Owner

saghul commented Dec 21, 2023

I think we could have a JS polyfill that uses the native functions to do the heavy lifting and use JS to implement encodeInto for example.

@lal12
Copy link
Contributor Author

lal12 commented Dec 21, 2023

I copied (and adjusted) the TextDecoder tests from wpt.
The issue seems to be with streaming / multiple calls to TextDecoder.
Which matches what was described in #316.

@lal12
Copy link
Contributor Author

lal12 commented Dec 21, 2023

I will look into this tomorrow, trying to make a C implementation.

@lal12
Copy link
Contributor Author

lal12 commented Jan 12, 2024

Unfortunately it isn't as straight foward as I expected. There are the following options to implement it:

  1. Write a decoder in C and use QuickJS internals to directly build a string
    • Unfortunately requires exposing some quickjs stuff currently not exposed
  2. Write a decoder in C which only checks some stuff, handles some edge cases and then passes it to the quickjs internal decoding.
    • A bit complicated
    • decoding is done twice
    • maybe multiple strings are created and need to be concated -> further overhead
  3. Write a bunch of wrapping code to handle edge cases, but basically let the quickjs decoder do its work.
    • Needs some special "own" decoding for edge cases in streaming mode
    • Needs some regex lookups in fatal mode
    • Is not 100% spec compliant. Quickjs inserts one U+FFFD replacement character where two are expected. I think this is utf8 compliant, but not compliant for TextDecoder spec
  4. Use quickjs decoder for straight foward cases: Only ASCII
    • Requires one string search (probably one regex)
  5. Use quickjs decoder for straight foward cases: Non streaming mode
    • Is not 100% spec compliant. Same as Option 3. BUT only in non streaming mode

I've implemented Option 3 in JS. And have some WIP work in Option 1.
Option 1 would probably be the ideal, however such changes would need to be handled. Probably with a forked quickjs repo.

@saghul
Copy link
Owner

saghul commented Jan 12, 2024

Thanks for being so thorough!

As for Option 1, we do already use a fork, quickjs-ng, so we can discuss exposing what's necessary there.

@saghul
Copy link
Owner

saghul commented Feb 27, 2024

At this point it looks like having some tjs.encodeUTF8 / decodeUTF8 might the simplest way to go. Would that work for you? This would also simplify some of the internals I think, doing new TextEncoder().endode(data) gets old quick 😅

@lal12
Copy link
Contributor Author

lal12 commented Apr 9, 2024

doing new TextEncoder().endode(data) gets old quick 😅

I agree, but I don't think I would be quite happy about that solution, since some JS libs just use TextDecoder internally.
I think the more clean solution would be to have a standard conforming quick TextEncoder/TextDecoder.

The qjs internals could just be copied into the Decoder module. And then

  • use minimal exposure of QJS internals (basically only allowing to create a string from external buffer)
  • add an additional function which (if necessary) copies the string once, but doesn't check for any encoding stuff
  • just letting the QJS internals as is, and doing the utf8 decode twice basically

@saghul
Copy link
Owner

saghul commented Apr 9, 2024

I agree, but I don't think I would be quite happy about that solution, since some JS libs just use TextDecoder internally.

Fair enough.

  • use minimal exposure of QJS internals (basically only allowing to create a string from external buffer)

Without a copy? IIRC some normalization might be applied, so I'm not sure we'd be able to expose that... I can try.

  • add an additional function which (if necessary) copies the string once, but doesn't check for any encoding stuff

This sounds reasonable.

  • just letting the QJS internals as is, and doing the utf8 decode twice basically

Let's leave this as a last resort.

If you can propose patches we can discuss them in quickjs-ng, of which I am a member.

@chqrlie
Copy link

chqrlie commented Apr 9, 2024

Hello, late to the discussion...

Where does the byte stream come from?
It looks like we should support raw byte strings in the QuickJS engine and add some methods to convert between JSStrings and UTF-8 byte data. Uint8Array might be a good candidate, I am not sure.

There is a use case in the engine for this: function source code. We should keep the source code as raw 8-bit data and implement a getter for Function.prototype.source that performs the conversion to JSString. This would allow us to share the source code for functions, nested functions and classes / modules. The source is currently duplicated, using extra memory and sizeable overhead in the serialized data.

@lal12
Copy link
Contributor Author

lal12 commented Apr 10, 2024

@chqrlie
Hmm so your proposal would be to add an implementation into QuickJS and thereby unify two existing implementations in it?

The current implementation of JS_NewStringLen has the issue that it apparently isn't conforming to TextDecoder spec which is tighter than the general utf8 spec.

So to be useful such an implementation would need to be:

  • able to be used in streaming
  • compliant with TextDecoder spec (at least with some options set)
  • able to be used in other use cases of Function.prototype.source and JS_NewStringLen

Correct? I guess I could look into it and create a PR for QuickJS-ng?

@chqrlie
Copy link

chqrlie commented Apr 10, 2024

Sorry, I got carried away... The use case for the source code is not pertinent, as we could use JS_NewStringLen() to create the string from a fragment of the source code kept in its original UTF-8 encoded format.

I am not familiar with the issue you mention regarding TextDecoder and JS_NewStringLen, let me investigate this later today.

Efficient conversions between raw byte strings and JSString would still be a useful addition. I shall propose an API later today too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants