performance issue with UTF8ToString #12517
This is a pretty widely used function, so I don't think we can break backwards compatibility here. But using an extra arg or a separate function sounds fine. This will be useful for C++ strings where the length of the string is already known a priori. I do have a question though: do you have a real-world application where the cost of doing strlen in JS has a noticeable effect on performance? It doesn't seem that surprising that this cost dominates in your microbenchmark, since that is all you are doing there.
Do we actually need to compute the length of the string? I'm not sure from the docs whether TextDecoder.decode stops at a 0 or not. If it doesn't, we should ask for an API that does...
Thanks for the reply~ @sbc100 I'm still developing my ScriptEngine library, which is intended to provide a unified C++ API for V8/JSCore/Lua and now WASM. My library doesn't have any real-world application running on WASM for now (but will very soon). The result was measured in my unit tests. But since we have already found the performance problem, I'd like to open a PR to fix it.
Reply to @kripken: according to the comment
- add extra parameter `exactStringLength` to UTF8ToString. This parameter is optional, and if given, the `maxBytesToRead` parameter is ignored. To keep a consistent API flavor, `UTF16ToString` and `UTF32ToString` are also changed.
@LanderlYoung I wonder if that comment is accurate, though. Reading the MDN docs I'm not sure. Also, there may be another Web API for this, as it does seem a pretty obvious need. Worth looking into if someone has time.
introduce new functions:
- UTF8ToStringWithLength
- UTF16ToStringWithLength
- UTF32ToStringWithLength

Decode a string with exactly the given length; any '\0' in between is kept as-is. These functions require an argument `lengthInBytes`, so there is no need to iterate over the heap to find a null terminator, which gives better performance.
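A minimal sketch of what such a length-taking decoder could look like (assuming `TextDecoder` is available; the `heap` parameter here stands in for Emscripten's `HEAPU8`, and the exact signature is illustrative, not necessarily what the PR implements):

```javascript
// Hypothetical sketch: decode exactly `lengthInBytes` bytes, keeping any
// embedded '\0' as-is, with no null-terminator scan over the heap.
// `heap` is a stand-in for Emscripten's HEAPU8 typed array.
const utf8Decoder = new TextDecoder("utf-8");

function UTF8ToStringWithLength(heap, ptr, lengthInBytes) {
  // `subarray` creates a view, not a copy, so this stays cheap.
  return utf8Decoder.decode(heap.subarray(ptr, ptr + lengthInBytes));
}

// Usage: a tiny simulated heap with the bytes "AB\0C" stored at offset 4.
const heap = new Uint8Array(16);
heap.set([65, 66, 0, 67], 4);
console.log(JSON.stringify(UTF8ToStringWithLength(heap, 4, 4))); // "AB\u0000C"
```

Because `decode` is handed an exact-length subarray, the embedded NUL survives and no byte-by-byte scan is needed.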
@kripken As my test shows, the comment is accurate. MDN does not clarify what happens when there is a '\0' inside the string, but the Unicode standard does allow '\0' inside strings, so as far as I can tell the behavior is correct. As for other APIs, I don't have any ideas either. Test code:

```javascript
const decoder = new TextDecoder("utf-8");
const raw = new Int8Array([
  0,  // '\0'
  0,  // '\0'
  0,  // '\0'
  65, // 'A'
  65, // 'A'
  0,  // '\0'
  65, // 'A'
  0,  // '\0'
  0,  // '\0'
]);
const str = "\0\0\0AA\0A\0\0";
const decodedString = decoder.decode(raw);
console.assert(str[0] === '\0');
console.assert(str[3] === 'A');
console.assert(str[7] === '\0');
console.assert(str === decodedString);
console.log("test done.");
```
introduce new functions:
- UTF8ToStringNBytes
- UTF16ToStringNBytes
- UTF32ToStringNBytes

add docs to preamble.js.rst. Decode a string with exactly the given length; any '\0' in between is kept as-is. These functions require an argument `lengthInBytes`, so there is no need to iterate over the heap to find a null terminator, which gives better performance.
I see, thanks @LanderlYoung. @RReverser, maybe you know: is there really no Web API for "get a JS string from UTF-8 data starting at an offset in a typed array, possibly not using the entire typed array"?
@kripken That's what `TextDecoder.decode` on a typed-array subarray gives you. The bigger problem, however, is that TextDecoder and TextEncoder themselves are fairly slow, as they're part of the DOM API (implemented in C++) and any usage, especially for small strings, ends up dominated by JS <-> C++ communication overhead. When I worked on wasm-bindgen, I did some deeper analysis of this problem and managed to get significant speed-ups by avoiding the TextEncoder / TextDecoder code path as long as the string is ASCII-only (and switching to those APIs once a Unicode character is hit). You can check rustwasm/wasm-bindgen#1470 (comment) for some benchmark comparisons and more details - perhaps Emscripten would be able to do the same? (AFAIK it already has code for manual encoding / decoding; it's just currently unused when TextEncoder & TextDecoder are detected, so it's just a matter of merging the code paths together.)
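The ASCII fast-path idea described above could be sketched like this (illustrative only — this is not Emscripten's or wasm-bindgen's actual code; a real implementation would build the ASCII prefix in chunks rather than one character at a time):

```javascript
// Sketch of an ASCII fast path: decode bytes in plain JS while they are
// ASCII, and only fall back to the C++-backed TextDecoder (with its
// call overhead) once a non-ASCII byte shows up.
const utf8Decoder = new TextDecoder("utf-8");

function decodeUTF8(bytes) {
  let out = "";
  let i = 0;
  // Fast path: pure-JS loop, no DOM API call, as long as bytes are ASCII.
  while (i < bytes.length && bytes[i] < 0x80) {
    out += String.fromCharCode(bytes[i++]);
  }
  if (i === bytes.length) return out; // whole string was ASCII
  // Slow path: hand the remaining (possibly multi-byte) tail to TextDecoder.
  return out + utf8Decoder.decode(bytes.subarray(i));
}

const enc = new TextEncoder();
console.log(decodeUTF8(enc.encode("hello"))); // prints "hello"
console.log(decodeUTF8(enc.encode("héllo"))); // prints "héllo"
```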
Oh, and FWIW, as for step 1, one more thing you could try is replacing the manual [...] with something like [...]. Not 100% sure if it would be faster due to the extra [...].
Thanks @RReverser, sounds interesting to investigate those.
This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.
EDIT: this is a bad idea
@liquidaty Well, two big differences are that 1) this doesn't handle UTF-8, it can only handle ASCII characters correctly, and 2) on any long string it will overflow the stack - that's why you should almost never use it. So yes, it's faster, but that's because it does something quite different, and in a more unsafe manner.
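A minimal illustration of the two problems described above (the function names and the chunk size are hypothetical; the unsafe pattern is assumed to be a spread/apply into `String.fromCharCode`):

```javascript
// Problem repro: spreading a large array into String.fromCharCode passes
// every element as a separate argument, which can blow the engine's
// argument/stack limit; it also treats each byte as a UTF-16 code unit,
// so it is only correct for ASCII input in the first place.
function unsafeDecode(bytes) {
  return String.fromCharCode(...bytes); // RangeError on long inputs
}

// A chunked variant avoids the stack overflow (but is still ASCII-only):
function chunkedAsciiDecode(bytes, chunkSize = 0x8000) {
  let out = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    // apply() accepts array-likes, so a subarray view works directly.
    out += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return out;
}

const big = new Uint8Array(1000000).fill(65); // one million 'A's
console.log(chunkedAsciiDecode(big).length); // 1000000
```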
@RReverser ah. duh. terrible idea. Thanks for clarifying! |
My code frequently passes a string from C++ to JS. To do this I used the `UTF8ToString` built-in function. Something like this:

However, I found this code executes slower than expected (a Chrome profile shows 85% of total execution time), so I looked into the code.

Call chain: `UTF8ToString` -> `UTF8ArrayToString`

And `UTF8ArrayToString` basically has this logic (step (1): find the string length by scanning for the null terminator; step (2): decode the bytes). It turns out that step (1) spends as much time as (or even more than) step (2). So I wrote a simple test code.

Test code: Online test code on repl.it

On my Chrome 85 @ macOS 10.15 i9-9900K environment, it shows:

Plus: I also tested `strlen` performance in C (Wasm) and JS: StrlenTest.cc, run with node.js.

Known facts: `strlen` is really fast.

Discuss:

`UTF8ToString` already takes a `maxBytesToRead` argument, but it is not really the string length, so inside it the string length still needs to be calculated. C/C++ code already knows (or can compute really fast) the actual length of a string. So, a simple solution is to pass the string length to `UTF8ToString`.

Possible solutions (the solutions are simple):

1. Change the `maxBytesToRead` semantics to `stringLength` - this breaks backward compatibility.
2. Add a third argument named `stringLength`.

PS: there are also `UTF16ToString` and `UTF32ToString`.
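The two steps described above — (1) scan for the null terminator, (2) decode — can be sketched as follows, together with the exact-length variant the proposal asks for (a simulated `heap` stands in for Emscripten's `HEAPU8`; both function names are illustrative, not the library's actual internals):

```javascript
// Sketch of the two code paths. `heap` simulates Emscripten's HEAPU8.
const decoder = new TextDecoder("utf-8");

// Current behavior: (1) JS-side strlen scan, then (2) decode.
function utf8ToStringScan(heap, ptr) {
  let end = ptr;
  while (heap[end]) end++;                        // step (1): find '\0'
  return decoder.decode(heap.subarray(ptr, end)); // step (2): decode
}

// Proposed behavior: the caller already knows the length, so skip step (1).
function utf8ToStringExact(heap, ptr, len) {
  return decoder.decode(heap.subarray(ptr, ptr + len));
}

const heap = new TextEncoder().encode("hello world\0");
console.log(utf8ToStringScan(heap, 0));      // prints "hello world"
console.log(utf8ToStringExact(heap, 0, 11)); // prints "hello world"
```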