- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 208
fix: isWellFormed #792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: isWellFormed #792
Conversation
| This is imho not acceptable for fairness reasons. as the test is basically from @cesco69. You should try to give him atleast the credentials for the test. IMHO checkout his PR by doing  | 
      
        
              This comment was marked as off-topic.
        
        
      
    
  This comment was marked as off-topic.
| Ah, ok. :) Yeah, i thought so too. I guess we need to revert it completely | 
      
        
              This comment was marked as off-topic.
        
        
      
    
  This comment was marked as off-topic.
      
        
              This comment was marked as outdated.
        
        
      
    
  This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
      
        
              This comment was marked as outdated.
        
        
      
    
  This comment was marked as outdated.
| I have made a benchmark for test only  TLTR: isWellFormed with a short regex (only ASCII escape) seem faster. I have compared: 
 On my PC (cpu: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz) 
 BENCHMARKI have published the benchmark online 
 index.mjs import { run, bench, summary, do_not_optimize } from "mitata";
/**
 * NEW IMPLEMENTATION OF asString
 * 
 * Uses String.prototype.isWellFormed() to skip regex check for surrogate pairs
 */
const ASCII_ESCAPE = /[\u0000-\u001f\u0022\u005c]/;
const asStringNew = (str) => {
  const len = str.length;
  if (len === 0) {
    return '""';
  } else if (len < 42) {
    // magically escape strings for json
    // relying on their charCodeAt
    // everything below 32 needs JSON.stringify()
    // every string that contain surrogate needs JSON.stringify()
    // 34 and 92 happens all the time, so we
    // have a fast case for them
    let result = "";
    let last = -1;
    let point = 255;
    for (let i = 0; i < len; i++) {
      point = str.charCodeAt(i);
      if (
        point === 0x22 || // '"'
        point === 0x5c // '\'
      ) {
        last === -1 && (last = 0);
        result += str.slice(last, i) + "\\";
        last = i;
      } else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
        // The current character is non-printable characters or a surrogate.
        return JSON.stringify(str);
      }
    }
    return (
      (last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
    );
  } else if (len < 5000 && str.isWellFormed() && ASCII_ESCAPE.test(str) === false) {
    // Only use the regular expression for shorter input. The overhead is otherwise too much.
    return '"' + str + '"';
  } else {
    return JSON.stringify(str);
  }
};
/**
 * OLD IMPLEMENTATION OF asString
 * 
 * Uses a regex to check for surrogate pairs
 */
const STR_ESCAPE = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]/;
const asStringOld = (str) => {
  const len = str.length;
  if (len === 0) {
    return '""';
  } else if (len < 42) {
    // magically escape strings for json
    // relying on their charCodeAt
    // everything below 32 needs JSON.stringify()
    // every string that contain surrogate needs JSON.stringify()
    // 34 and 92 happens all the time, so we
    // have a fast case for them
    let result = "";
    let last = -1;
    let point = 255;
    for (let i = 0; i < len; i++) {
      point = str.charCodeAt(i);
      if (
        point === 0x22 || // '"'
        point === 0x5c // '\'
      ) {
        last === -1 && (last = 0);
        result += str.slice(last, i) + "\\";
        last = i;
      } else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
        // The current character is non-printable characters or a surrogate.
        return JSON.stringify(str);
      }
    }
    return (
      (last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
    );
  } else if (len < 5000 && STR_ESCAPE.test(str) === false) {
    // Only use the regular expression for shorter input. The overhead is otherwise too much.
    return '"' + str + '"';
  } else {
    return JSON.stringify(str);
  }
};
/**
 * BENCHMARKS CASES
 */
const shortString = 'a'.repeat(10)
const longString = 'a'.repeat(1000)
const shortDirtyString = shortString + '\n'
const longDirtyString = longString + '\n'
const surrogate = '\uD800';
const escapeChar = '"';
const shortSurrogateString = 'a'.repeat(10) + surrogate;
const shortQuoteString = 'a'.repeat(10) + escapeChar;
const longSurrogateString = 'a'.repeat(1000) + surrogate;
const longQuoteString = 'a'.repeat(1000) + escapeChar;
const args = { 
  str: [
    '',
    shortString, longString, 
    shortDirtyString, longDirtyString,
    shortSurrogateString, shortQuoteString,
    longSurrogateString, longQuoteString
  ] 
};
/**
 * BENCHMARKS IMPLEMENTATIONS
 */
summary(() => {
  bench("asStringNew", function* xNew(state) {
    const str = state.get('str');
    yield () => do_not_optimize(asStringNew(str));
  }).gc("inner").args(args);
  bench("asStringOld", function* xOld(state) {
    const str = state.get('str');
    yield () => do_not_optimize(asStringOld(str));
  }).gc("inner").args(args);
});
await run({
  format: "mitata",
  colors: true,
  throw: true,
});run with node --expose-gc --allow-natives-syntax index.mjs --iterations=50000RAW RESULTnode --expose-gc --allow-natives-syntax index.mjs --iterations=50000
clk: ~2.96 GHz
cpu: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
runtime: node 20.14.0 (x64-win32)
benchmark                   avg (min … max) p75 / p99    (min … top 1%)
------------------------------------------- -------------------------------
asStringNew                    6.67 ns/iter   7.69 ns ▃ █▂ ▆▅
                       (2.29 ns … 38.26 ns)  20.02 ns ███████▄▃▂▃▂▃▄▃▁▂▁▁▁▁
                  gc(  1.75 ms …   6.64 ms)   0.50  b (  0.17  b… 42.74  b)
asStringNew                  120.74 ns/iter 143.55 ns     █▃▂   ▃▂▆ ▃
                     (51.32 ns … 231.05 ns) 202.03 ns ▂▂▂▅███▄▆██████▆▂▂▁▁▂
                  gc(  1.91 ms …   6.16 ms)  60.07  b (  6.28  b…144.35  b)
asStringNew                    1.04 µs/iter   1.04 µs ▃██▅▂
                      (916.89 ns … 1.57 µs)   1.48 µs █████▂▄▄▁▂▃▄▃▂▂▂▂▂▂▁▂
                  gc(  1.79 ms …   5.84 ms)  78.31  b ( 64.19  b…340.62  b)
asStringNew                  269.22 ns/iter 321.44 ns ██▅
                    (125.76 ns … 875.37 ns) 842.99 ns ███▆▆▆▆▃▄▃▁▁▂▁▂▁▂▂▂▁▂
                  gc(  1.72 ms …   6.89 ms)  62.08  b (  5.14  b…238.20  b)
asStringNew                    6.78 µs/iter   8.08 µs   ▅▂    █▂    ▂▂
                       (4.26 µs … 10.95 µs)   9.87 µs ▄▄██▄▄▄▄██▇▇▄▁██▇▁▁▁▄
                  gc(  2.18 ms …   7.96 ms)   2.23 kb (  2.19 kb…  2.23 kb)
asStringNew                  702.15 ns/iter 826.42 ns     ▇▂▃▃ █
                      (378.81 ns … 1.38 µs)   1.37 µs ███▅████▆█▇▃▆▂▂▁▂▁▃▃▂
                  gc(  1.93 ms …   9.78 ms) 205.19  b (142.14  b…473.12  b)
asStringNew                  201.91 ns/iter 248.00 ns   ▆█▇▄ ▃▅▇  ▃▆▃
                    (100.61 ns … 357.50 ns) 340.36 ns ▂▄█████████████▆▂▁▃▅▃
                  gc(  1.78 ms …   6.80 ms) 168.70  b ( 43.10  b…361.37  b)
asStringNew                    5.95 µs/iter   6.21 µs             █▂
                        (3.06 µs … 8.44 µs)   7.83 µs ▃▁▁▁▃▃▁▃▆▄▃▇██▁▆▄▁▃▃▄
                  gc(  2.09 ms …   7.20 ms) 200.28  b (199.85  b…212.28  b)
asStringNew                    7.87 µs/iter   8.52 µs           █
                       (5.33 µs … 10.24 µs)   9.85 µs ▃▁▁▁▃▅▅▃▃▇█▃▃▅▃▃▁▅▅▁▅
                  gc(  2.33 ms …   7.66 ms)   2.23 kb (  2.23 kb…  2.24 kb)
asStringOld                   11.60 µs/iter  12.30 µs      ▅█ ▆█
                       (8.10 µs … 33.60 µs)  16.00 µs ▃▄▄▄▃██████▇▇▃▅▃▄▂▂▂▄
                  gc(  2.78 ms …  11.86 ms) 519.60  b (504.00  b…  2.64 kb)
asStringOld                  177.89 ns/iter 208.23 ns            ▂▅█▃▆▄
                     (86.43 ns … 252.03 ns) 244.36 ns ▂▇▃▅█▇▆▅▇▄▆████████▆▃
                  gc(  1.86 ms …   8.59 ms)  54.05  b ( 22.36  b…191.69  b)
asStringOld                    1.88 µs/iter   2.22 µs    ▃    ▆▆█▅ ▅▆
                      (981.91 ns … 3.37 µs)   2.94 µs ▆▄▄██▄██████████▃▁▃▁▃
                  gc(  1.83 ms …   9.18 ms)  65.97  b ( 32.11  b…288.79  b)
asStringOld                  275.36 ns/iter 361.18 ns ▄█
                    (168.02 ns … 732.08 ns) 514.65 ns ██▆▅▄▃▅▄▃▃▃▅▃▆▅▄▃▁▁▁▁
                  gc(  1.76 ms …   6.52 ms)  50.09  b ( 48.11  b…163.04  b)
asStringOld                    5.48 µs/iter   6.73 µs  █             ▃
                        (3.36 µs … 7.92 µs)   7.77 µs ██▆▄▆█▁█▁█▁▆▆▄▄█▄▄█▆█
                  gc(  2.24 ms …   7.49 ms)   2.23 kb (  2.20 kb…  2.25 kb)
asStringOld                  732.46 ns/iter 868.70 ns  ▇ ▃    █
                      (368.90 ns … 1.61 µs)   1.56 µs ▇█████▄▇██▅▂▃▂▃▂▃▁▃▂▂
                  gc(  1.81 ms …   5.73 ms) 199.43  b ( 92.16  b…360.21  b)
asStringOld                  191.50 ns/iter 250.49 ns ██
                    (119.07 ns … 360.08 ns) 336.96 ns ██▇▆▅▄▅▂▆▃▅▃▄▄▄▆▅▆▂▁▁
                  gc(  1.76 ms …  10.20 ms) 165.59  b ( 40.81  b…452.06  b)
asStringOld                    4.80 µs/iter   5.75 µs █
                        (3.12 µs … 7.11 µs)   7.04 µs █▃▄▃▆▄▆▄▇▁▆▄▃▆▄▁▃▆▃▆▄
                  gc(  1.97 ms …   4.95 ms) 199.61  b (173.26  b…222.61  b)
asStringOld                    3.43 µs/iter   3.31 µs █
                        (2.92 µs … 6.92 µs)   6.88 µs ██▅▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂
                  gc(  1.76 ms …   7.42 ms)   2.23 kb (  2.21 kb…  2.27 kb)
summary
  asStringNew
   +1.47…+26.67x faster than asStringOldSide note for @Uzlopak in my previous comment I say it is slower, but it is slow only on browser side The built in benchPR MASTER from #707 PS sorry for the bad english | 
| @nigrosimone can you please include a revert of the revert of @climba03003 original commits? I've added tests for all listed regressions | 
Signed-off-by: Nigro Simone <nigro.simone@gmail.com>
Merge latest
| Done | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RSLGTM
To be honest, I cant follow the benchmarks. mitata is not the best benchmark out there. Actually tinybench is afaik the best benchmark out there, which is the reason it is used for benchmarks in nodejs.
If you post that master has 15.000 ops and this PR has 12.000 ops then this PR is slower. But as I said I cant read the mitata benchmarks. Or lets say, I dont want to dig deep into them.
Disclaimer: I am also maintainer of tinybench.
| Ok, I can try to make a benchmark with Tiny bench | 
| @Uzlopak for tinybench asStringNew seem slower Resultnode .\index.mjs
asString
┌─────────┬───────────────┬──────────────────┬───────────────────┬────────────────────────┬────────────────────────┬─────────┐
│ (index) │ Task name     │ Latency avg (ns) │ Latency med (ns)  │ Throughput avg (ops/s) │ Throughput med (ops/s) │ Samples │
├─────────┼───────────────┼──────────────────┼───────────────────┼────────────────────────┼────────────────────────┼─────────┤
│ 0       │ 'asStringNew' │ '7098.5 ± 0.12%' │ '7100.0 ± 100.00' │ '142612 ± 0.04%'       │ '140845 ± 2012'        │ 140875  │
│ 1       │ 'asStringOld' │ '6991.4 ± 0.69%' │ '6900.0 ± 400.00' │ '145945 ± 0.05%'       │ '144928 ± 7941'        │ 143033  Benchmarkimport { Bench } from 'tinybench'
/**
 * NEW IMPLEMENTATION OF asString
 * 
 * Uses String.prototype.isWellFormed() to skip regex check for surrogate pairs
 */
const ASCII_ESCAPE = /[\u0000-\u001f\u0022\u005c]/;
const asStringNew = (str) => {
    const len = str.length;
    if (len === 0) {
        return '""';
    } else if (len < 42) {
        // magically escape strings for json
        // relying on their charCodeAt
        // everything below 32 needs JSON.stringify()
        // every string that contain surrogate needs JSON.stringify()
        // 34 and 92 happens all the time, so we
        // have a fast case for them
        let result = "";
        let last = -1;
        let point = 255;
        for (let i = 0; i < len; i++) {
            point = str.charCodeAt(i);
            if (
                point === 0x22 || // '"'
                point === 0x5c // '\'
            ) {
                last === -1 && (last = 0);
                result += str.slice(last, i) + "\\";
                last = i;
            } else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
                // The current character is non-printable characters or a surrogate.
                return JSON.stringify(str);
            }
        }
        return (
            (last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
        );
    } else if (len < 5000 && str.isWellFormed() && ASCII_ESCAPE.test(str) === false) {
        // Only use the regular expression for shorter input. The overhead is otherwise too much.
        return '"' + str + '"';
    } else {
        return JSON.stringify(str);
    }
};
/**
 * OLD IMPLEMENTATION OF asString
 * 
 * Uses a regex to check for surrogate pairs
 */
const STR_ESCAPE = /[\u0000-\u001f\u0022\u005c\ud800-\udfff]/;
const asStringOld = (str) => {
    const len = str.length;
    if (len === 0) {
        return '""';
    } else if (len < 42) {
        // magically escape strings for json
        // relying on their charCodeAt
        // everything below 32 needs JSON.stringify()
        // every string that contain surrogate needs JSON.stringify()
        // 34 and 92 happens all the time, so we
        // have a fast case for them
        let result = "";
        let last = -1;
        let point = 255;
        for (let i = 0; i < len; i++) {
            point = str.charCodeAt(i);
            if (
                point === 0x22 || // '"'
                point === 0x5c // '\'
            ) {
                last === -1 && (last = 0);
                result += str.slice(last, i) + "\\";
                last = i;
            } else if (point < 32 || (point >= 0xd800 && point <= 0xdfff)) {
                // The current character is non-printable characters or a surrogate.
                return JSON.stringify(str);
            }
        }
        return (
            (last === -1 && '"' + str + '"') || '"' + result + str.slice(last) + '"'
        );
    } else if (len < 5000 && STR_ESCAPE.test(str) === false) {
        // Only use the regular expression for shorter input. The overhead is otherwise too much.
        return '"' + str + '"';
    } else {
        return JSON.stringify(str);
    }
};
/**
 * BENCHMARKS CASES
 */
const shortString = 'a'.repeat(10)
const longString = 'a'.repeat(1000)
const shortDirtyString = shortString + '\n'
const longDirtyString = longString + '\n'
const surrogate = '\uD800';
const escapeChar = '"';
const shortSurrogateString = 'a'.repeat(10) + surrogate;
const shortQuoteString = 'a'.repeat(10) + escapeChar;
const longSurrogateString = 'a'.repeat(1000) + surrogate;
const longQuoteString = 'a'.repeat(1000) + escapeChar;
const strs = [
    '',
    shortString, longString,
    shortDirtyString, longDirtyString,
    shortSurrogateString, shortQuoteString,
    longSurrogateString, longQuoteString
];
/**
 * BENCHMARKS IMPLEMENTATIONS
 */
const bench = new Bench({ name: 'asString', time: 1000 })
bench
    .add('asStringNew', () => {
        for (const str of strs) {
            asStringNew(str)
        }
    })
    .add('asStringOld', () => {
        for (const str of strs) {
            asStringOld(str)
        }
    })
await bench.run()
console.log(bench.name)
console.table(bench.table()) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for fixing the mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
| There are conflicts now. | 
| Guys, this implementation seem slower... | 
| Oh I thought it was fixing something else actually, then yeah maybe let's keep it as is? | 
| I am closing this, as this part of the code made already too much confusion and friction. We should first make some good benchmarks for strings, with various cases. E.g. I think one of the tests is repeated NUL-character (\u0000), which is basically pointless, because it should have the same performance like a string with length of 1 with one NUL character. So we should make some smart test cases and then balance the performance. Also using tinybench, as it is more correct than any other benchmark engine right now on the market. E.g. one of the tables in benchmark.js is mathematically wrong. mitata is programmatically not correct. | 
| Side note, It will probably be very difficult for the committee to approve such a niche idea. A dedicated JSON.isWellFormed method would provide a fast, standardized way to check this case. | 
Checklist
npm run test && npm run benchmark --if-presentand the Code of conduct
see #792 (comment)
fix #794
fix #793