TextDecoding pretty slow #447

lal12 · 2023-12-21T14:29:23Z

I recently wondered why TextDecoding was so slow. Several seconds for 50-100 Kbytes (I am running on a low power embedded system with single core performance comparable to RPI 1).

I saw it currently uses a JS only implementation. I am wondering why, the JS_NewStringLen of quickjs already handles utf8 doesn't it?

As a workaround I am currently using

import FFI from 'tjs:ffi';
export function quickDecodeUtf8(buf) {
    return FFI.bufferToString(buf);
}

Which also just does a JS_NewString.

The opposite case of Encoding isn't quite as slow, but it seems JS_ToCString would also handle this case.

Though maybe I am missing something, e.g. some feature TextEncoder/TextDecoder which would not be happening/handled in the quickjs functions.

The text was updated successfully, but these errors were encountered:

saghul · 2023-12-21T14:52:30Z

I think you might be right. Maybe running some WPT tests would help clarify if there are any missing corner cases.

saghul · 2023-12-21T14:55:02Z

I remember we discussed that here #316 and didn't go for it, but I'm happy to revisit.

lal12 · 2023-12-21T15:10:04Z

I remember we discussed that here #316 and didn't go for it, but I'm happy to revisit.

Ah, I misread that as only being about the changing of the JS polyfill.

saghul · 2023-12-21T15:13:41Z

I think we could have a JS polyfill that uses the native functions to do the heavy lifting and use JS to implement encodeInto for example.

lal12 · 2023-12-21T16:12:21Z

I copied (and adjusted) the TextDecoder tests from wpt.
The issue seems to be with streaming / multiple calls to TextDecoder.
Which matches what was described in #316.

lal12 · 2023-12-21T16:15:28Z

I will look into this tomorrow, trying to make a C implementation.

lal12 · 2024-01-12T11:28:52Z

Unfortunately it isn't as straight foward as I expected. There are the following options to implement it:

Write a decoder in C and use QuickJS internals to directly build a string
- Unfortunately requires exposing some quickjs stuff currently not exposed
Write a decoder in C which only checks some stuff, handles some edge cases and then passes it to the quickjs internal decoding.
- A bit complicated
- decoding is done twice
- maybe multiple strings are created and need to be concated -> further overhead
Write a bunch of wrapping code to handle edge cases, but basically let the quickjs decoder do its work.
- Needs some special "own" decoding for edge cases in streaming mode
- Needs some regex lookups in fatal mode
- Is not 100% spec compliant. Quickjs inserts one U+FFFD replacement character where two are expected. I think this is utf8 compliant, but not compliant for TextDecoder spec
Use quickjs decoder for straight foward cases: Only ASCII
- Requires one string search (probably one regex)
Use quickjs decoder for straight foward cases: Non streaming mode
- Is not 100% spec compliant. Same as Option 3. BUT only in non streaming mode

I've implemented Option 3 in JS. And have some WIP work in Option 1.
Option 1 would probably be the ideal, however such changes would need to be handled. Probably with a forked quickjs repo.

saghul · 2024-01-12T12:11:37Z

Thanks for being so thorough!

As for Option 1, we do already use a fork, quickjs-ng, so we can discuss exposing what's necessary there.

saghul · 2024-02-27T11:06:25Z

At this point it looks like having some tjs.encodeUTF8 / decodeUTF8 might the simplest way to go. Would that work for you? This would also simplify some of the internals I think, doing new TextEncoder().endode(data) gets old quick 😅

lal12 · 2024-04-09T13:35:14Z

doing new TextEncoder().endode(data) gets old quick 😅

I agree, but I don't think I would be quite happy about that solution, since some JS libs just use TextDecoder internally.
I think the more clean solution would be to have a standard conforming quick TextEncoder/TextDecoder.

The qjs internals could just be copied into the Decoder module. And then

use minimal exposure of QJS internals (basically only allowing to create a string from external buffer)
add an additional function which (if necessary) copies the string once, but doesn't check for any encoding stuff
just letting the QJS internals as is, and doing the utf8 decode twice basically

saghul · 2024-04-09T14:57:15Z

I agree, but I don't think I would be quite happy about that solution, since some JS libs just use TextDecoder internally.

Fair enough.

use minimal exposure of QJS internals (basically only allowing to create a string from external buffer)

Without a copy? IIRC some normalization might be applied, so I'm not sure we'd be able to expose that... I can try.

add an additional function which (if necessary) copies the string once, but doesn't check for any encoding stuff

This sounds reasonable.

just letting the QJS internals as is, and doing the utf8 decode twice basically

Let's leave this as a last resort.

If you can propose patches we can discuss them in quickjs-ng, of which I am a member.

chqrlie · 2024-04-09T16:43:43Z

Hello, late to the discussion...

Where does the byte stream come from?
It looks like we should support raw byte strings in the QuickJS engine and add some methods to convert between JSStrings and UTF-8 byte data. Uint8Array might be a good candidate, I am not sure.

There is a use case in the engine for this: function source code. We should keep the source code as raw 8-bit data and implement a getter for Function.prototype.source that performs the conversion to JSString. This would allow us to share the source code for functions, nested functions and classes / modules. The source is currently duplicated, using extra memory and sizeable overhead in the serialized data.

lal12 · 2024-04-10T12:46:52Z

@chqrlie
Hmm so your proposal would be to add an implementation into QuickJS and thereby unify two existing implementations in it?

The current implementation of JS_NewStringLen has the issue that it apparently isn't conforming to TextDecoder spec which is tighter than the general utf8 spec.

So to be useful such an implementation would need to be:

able to be used in streaming
compliant with TextDecoder spec (at least with some options set)
able to be used in other use cases of Function.prototype.source and JS_NewStringLen

Correct? I guess I could look into it and create a PR for QuickJS-ng?

chqrlie · 2024-04-10T13:14:33Z

Sorry, I got carried away... The use case for the source code is not pertinent, as we could use JS_NewStringLen() to create the string from a fragment of the source code kept in its original UTF-8 encoded format.

I am not familiar with the issue you mention regarding TextDecoder and JS_NewStringLen, let me investigate this later today.

Efficient conversions between raw byte strings and JSString would still be a useful addition. I shall propose an API later today too.

lal12 mentioned this issue Jan 13, 2024

add c utf8 TextDecoder/TextEncoder implementation #448

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TextDecoding pretty slow #447

TextDecoding pretty slow #447

lal12 commented Dec 21, 2023 •

edited

Loading

saghul commented Dec 21, 2023

saghul commented Dec 21, 2023

lal12 commented Dec 21, 2023

saghul commented Dec 21, 2023

lal12 commented Dec 21, 2023 •

edited

Loading

lal12 commented Dec 21, 2023

lal12 commented Jan 12, 2024 •

edited

Loading

saghul commented Jan 12, 2024

saghul commented Feb 27, 2024

lal12 commented Apr 9, 2024

saghul commented Apr 9, 2024

chqrlie commented Apr 9, 2024

lal12 commented Apr 10, 2024

chqrlie commented Apr 10, 2024

TextDecoding pretty slow #447

TextDecoding pretty slow #447

Comments

lal12 commented Dec 21, 2023 • edited Loading

saghul commented Dec 21, 2023

saghul commented Dec 21, 2023

lal12 commented Dec 21, 2023

saghul commented Dec 21, 2023

lal12 commented Dec 21, 2023 • edited Loading

lal12 commented Dec 21, 2023

lal12 commented Jan 12, 2024 • edited Loading

saghul commented Jan 12, 2024

saghul commented Feb 27, 2024

lal12 commented Apr 9, 2024

saghul commented Apr 9, 2024

chqrlie commented Apr 9, 2024

lal12 commented Apr 10, 2024

chqrlie commented Apr 10, 2024

lal12 commented Dec 21, 2023 •

edited

Loading

lal12 commented Dec 21, 2023 •

edited

Loading

lal12 commented Jan 12, 2024 •

edited

Loading