performance issue with UTF8ToString #12517
This is a pretty widely used function, so I don't think we can break backwards compatibility here. But using an extra arg or a separate function sounds fine. This will be useful for C++ strings where the length of the string is already known a priori. I do have a question though: do you have a real-world application where the cost of doing strlen in JS has a noticeable effect on performance? It doesn't seem that surprising that this cost dominates in your microbenchmark, since that is all you are doing there.
Do we actually need to compute the length of the string? I'm not sure from the docs whether TextDecoder.decode stops at a 0 or not. If it doesn't, we should ask for an API that does...
Thanks for the reply~ @sbc100 I'm still developing my ScriptEngine library, which is intended to provide a unified C++ API for V8/JSCore/Lua and now WASM. My library doesn't have any real-world application running on WASM for now (but will very soon). The result was measured in my unit tests. But since we have already found the performance problem, I'd like to open a PR to fix it.
Reply to @kripken: according to the comment
- add extra parameter `exactStringLength` to UTF8ToString. This parameter is optional, and if given, the `maxBytesToRead` parameter is ignored. To keep a consistent API flavor, `UTF16ToString` and `UTF32ToString` are also changed.
@LanderlYoung I wonder if that comment is accurate, though. Reading the MDN docs I'm not sure. Also, there may be another Web API for this, as it does seem a pretty obvious need. Worth looking into if someone has time.
introduce new functions:
- UTF8ToStringWithLength
- UTF16ToStringWithLength
- UTF32ToStringWithLength

Decode a string with exactly the given length; any '\0' in between is kept as-is. These functions require an argument `lengthInBytes`, so there is no need to iterate over the heap to find a null terminator, which gives better performance.
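A minimal sketch of what such a length-taking decoder could look like (assuming `TextDecoder` is available; the `heap` parameter here stands in for Emscripten's `HEAPU8`, and the exact signature is illustrative, not necessarily what the PR implements):

```javascript
// Hypothetical sketch: decode exactly `lengthInBytes` bytes, keeping any
// embedded '\0' as-is, with no null-terminator scan over the heap.
// `heap` is a stand-in for Emscripten's HEAPU8 typed array.
const utf8Decoder = new TextDecoder("utf-8");

function UTF8ToStringWithLength(heap, ptr, lengthInBytes) {
  // `subarray` creates a view, not a copy, so this stays cheap.
  return utf8Decoder.decode(heap.subarray(ptr, ptr + lengthInBytes));
}

// Usage: a tiny simulated heap with the bytes "AB\0C" stored at offset 4.
const heap = new Uint8Array(16);
heap.set([65, 66, 0, 67], 4);
console.log(JSON.stringify(UTF8ToStringWithLength(heap, 4, 4))); // "AB\u0000C"
```

Because `decode` is handed an exact-length subarray, the embedded NUL survives and no byte-by-byte scan is needed.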
@kripken As my test shows, the comment is accurate. MDN does not clarify what happens when there is a '\0' inside the string, but the Unicode standard does allow '\0' inside strings, so as far as I can tell the behavior is correct. As for other APIs, I don't have any ideas either. Test code:

```javascript
const decoder = new TextDecoder("utf-8");
const raw = new Int8Array([
  0,  // '\0'
  0,  // '\0'
  0,  // '\0'
  65, // 'A'
  65, // 'A'
  0,  // '\0'
  65, // 'A'
  0,  // '\0'
  0,  // '\0'
]);
const str = "\0\0\0AA\0A\0\0";
const decodedString = decoder.decode(raw);
console.assert(str[0] === '\0');
console.assert(str[3] === 'A');
console.assert(str[7] === '\0');
console.assert(str === decodedString);
console.log("test done.");
```
introduce new functions:
- UTF8ToStringNBytes
- UTF16ToStringNBytes
- UTF32ToStringNBytes

add docs to preamble.js.rst. Decode a string with exactly the given length; any '\0' in between is kept as-is. These functions require an argument `lengthInBytes`, so there is no need to iterate over the heap to find a null terminator, which gives better performance.
I see, thanks @LanderlYoung. @RReverser, maybe you know: is there really no Web API for "get a JS string from UTF-8 data starting at an offset in a typed array, possibly not using the entire typed array"?
@kripken That's what `TextDecoder.decode` on a typed-array subarray gives you. The bigger problem, however, is that TextDecoder and TextEncoder themselves are fairly slow, as they're part of the DOM API (implemented in C++) and any usage, especially for small strings, ends up dominated by JS <-> C++ communication overhead. When I worked on wasm-bindgen, I did some deeper analysis of this problem and managed to get significant speed-ups by avoiding the TextEncoder / TextDecoder code path as long as the string is ASCII-only (and switching to those APIs once a Unicode character is hit). You can check rustwasm/wasm-bindgen#1470 (comment) for some benchmark comparisons and more details - perhaps Emscripten would be able to do the same? (AFAIK it already has code for manual encoding / decoding; it's just currently unused when TextEncoder & TextDecoder are detected, so it's just a matter of merging the code paths together.)
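The ASCII fast-path idea described above could be sketched like this (illustrative only — this is not Emscripten's or wasm-bindgen's actual code; a real implementation would build the ASCII prefix in chunks rather than one character at a time):

```javascript
// Sketch of an ASCII fast path: decode bytes in plain JS while they are
// ASCII, and only fall back to the C++-backed TextDecoder (with its
// call overhead) once a non-ASCII byte shows up.
const utf8Decoder = new TextDecoder("utf-8");

function decodeUTF8(bytes) {
  let out = "";
  let i = 0;
  // Fast path: pure-JS loop, no DOM API call, as long as bytes are ASCII.
  while (i < bytes.length && bytes[i] < 0x80) {
    out += String.fromCharCode(bytes[i++]);
  }
  if (i === bytes.length) return out; // whole string was ASCII
  // Slow path: hand the remaining (possibly multi-byte) tail to TextDecoder.
  return out + utf8Decoder.decode(bytes.subarray(i));
}

const enc = new TextEncoder();
console.log(decodeUTF8(enc.encode("hello"))); // prints "hello"
console.log(decodeUTF8(enc.encode("héllo"))); // prints "héllo"
```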
Oh, and FWIW, as for step 1, one more thing you could try is replacing the manual [...] with something like [...]. Not 100% sure if it would be faster due to the extra [...].
Thanks @RReverser, sounds interesting to investigate those.
This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.
EDIT: this is a bad idea
@liquidaty Well, two big differences are that 1) this doesn't handle UTF-8, it can only handle ASCII characters correctly, and 2) on any long string it will overflow the stack - that's why you should almost never use it. So yes, it's faster, but that's because it does something quite different, and in a more unsafe manner.
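A minimal illustration of the two problems described above (the function names and the chunk size are hypothetical; the unsafe pattern is assumed to be a spread/apply into `String.fromCharCode`):

```javascript
// Problem repro: spreading a large array into String.fromCharCode passes
// every element as a separate argument, which can blow the engine's
// argument/stack limit; it also treats each byte as a UTF-16 code unit,
// so it is only correct for ASCII input in the first place.
function unsafeDecode(bytes) {
  return String.fromCharCode(...bytes); // RangeError on long inputs
}

// A chunked variant avoids the stack overflow (but is still ASCII-only):
function chunkedAsciiDecode(bytes, chunkSize = 0x8000) {
  let out = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    // apply() accepts array-likes, so a subarray view works directly.
    out += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return out;
}

const big = new Uint8Array(1000000).fill(65); // one million 'A's
console.log(chunkedAsciiDecode(big).length); // 1000000
```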
@RReverser ah. duh. terrible idea. Thanks for clarifying! |
My code frequently passes a string from C++ to JS. To do this I used the `UTF8ToString` built-in function. Something like this:

However, I found this code executes slower than expected (a Chrome profile shows 85% of total execution time), so I looked into the code.

Call chain: `UTF8ToString` -> `UTF8ArrayToString`

And `UTF8ArrayToString` basically has this logic (step (1): find the string length by scanning for the null terminator; step (2): decode the bytes). It turns out that step (1) spends as much time as (or even more than) step (2). So I wrote a simple test code.

Test code: Online test code on repl.it

On my Chrome 85 @ macOS 10.15 i9-9900K environment, it shows:

Plus: I also tested `strlen` performance in C (Wasm) and JS: StrlenTest.cc, run with node.js.

Known facts: `strlen` is really fast.

Discuss:

`UTF8ToString` already takes a `maxBytesToRead` argument, but it is not really the string length, so inside it the string length still needs to be calculated. C/C++ code already knows (or can compute really fast) the actual length of a string. So, a simple solution is to pass the string length to `UTF8ToString`.

Possible solutions (the solutions are simple):

1. Change the `maxBytesToRead` semantics to `stringLength` - this breaks backward compatibility.
2. Add a third argument named `stringLength`.

PS: there are also `UTF16ToString` and `UTF32ToString`.
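The two steps described above — (1) scan for the null terminator, (2) decode — can be sketched as follows, together with the exact-length variant the proposal asks for (a simulated `heap` stands in for Emscripten's `HEAPU8`; both function names are illustrative, not the library's actual internals):

```javascript
// Sketch of the two code paths. `heap` simulates Emscripten's HEAPU8.
const decoder = new TextDecoder("utf-8");

// Current behavior: (1) JS-side strlen scan, then (2) decode.
function utf8ToStringScan(heap, ptr) {
  let end = ptr;
  while (heap[end]) end++;                        // step (1): find '\0'
  return decoder.decode(heap.subarray(ptr, end)); // step (2): decode
}

// Proposed behavior: the caller already knows the length, so skip step (1).
function utf8ToStringExact(heap, ptr, len) {
  return decoder.decode(heap.subarray(ptr, ptr + len));
}

const heap = new TextEncoder().encode("hello world\0");
console.log(utf8ToStringScan(heap, 0));      // prints "hello world"
console.log(utf8ToStringExact(heap, 0, 11)); // prints "hello world"
```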