Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MIMEType Performance #38

Open
climba03003 opened this issue Jan 13, 2023 · 24 comments
Open

MIMEType Performance #38

climba03003 opened this issue Jan 13, 2023 · 24 comments

Comments

@climba03003
Copy link

climba03003 commented Jan 13, 2023

I know that MIMEType is recently added and under Experimental flag. But I want to draw a attention on it's performance.
It is not the slowest compare to the userland module, but it can do better.

util#MIMEType x 1,206,781 ops/sec ±0.22% (96 runs sampled)
fast-content-type-parse#parse x 3,752,236 ops/sec ±0.42% (96 runs sampled)
fast-content-type-parse#safeParse x 3,675,645 ops/sec ±1.09% (94 runs sampled)
content-type#parse x 1,452,582 ops/sec ±0.37% (95 runs sampled)
busboy#parseContentType x 924,306 ops/sec ±0.43% (94 runs sampled)
Fastest is fast-content-type-parse#parse

I am considering replacing MIMEType for the Content-Type parsing in fastify but the performance is the main drawback.
The main reason behind using MIMEType is the .essence property did a great job in unifying the Browser and Server in terms of Content-Type guessing / matching and prevent the similar security issue like GHSA-3fjj-p79j-c9hh

Refs fastify/fastify#4502

@metcoder95
Copy link
Member

The implementation is different in how they parse the content-type header, being the utils.MIMEType is slightly more iterative compared to how the fast-content-type-parse parses the same header.

Among some other differences in how the util API is made and exported, we can use the fast-content-type-parse as a baseline for the benchmark comparison while keeping the safeguards provided by utils.MIMEType.

I'll dedicate time to this during the week 🙂

@anonrig
Copy link
Member

anonrig commented Jan 16, 2023

cc @nodejs/undici

@KhafraDev
Copy link
Member

Undici implements its own content-type parser, which like the nodejs one, is 100% spec compatible. As pointed out above it's unlikely that the other userland modules are totally compliant.

@metcoder95
Copy link
Member

Giving some updates, I've been doing some optimizations at both ends, starting with Undici.
I was able to gain a ~20% increase by doing smaller optimizations while being spec-compliant.
As @KhafraDev pointed out, Undici is the implementation that it implements fully compliant accordingly to the spec, meanwhile, Node's is a little bit distant but in essence, is compliant as well.

Results:

util#MIMEType x 1,345,116 ops/sec ±0.29% (97 runs sampled)
undici#parseMIMEType(optimised) x 1,954,188 ops/sec ±0.27% (93 runs sampled)
undici#parseMIMEType(original) x 1,621,463 ops/sec ±0.18% (98 runs sampled)
fast-content-type-parse#parse x 5,750,335 ops/sec ±0.34% (98 runs sampled)
fast-content-type-parse#safeParse x 5,752,805 ops/sec ±0.30% (96 runs sampled)

Fastest is fast-content-type-parse#safeParse,fast-content-type-parse#parse

One important thing that might be good to think about is that being spec compliant might limit the room for optimizations, meaning that most likely any optimization applied at Undici or Node will be able to fully reach or surface fast-content-type-parse.
Though, IMHO this is not bad, as being spec compliant is important.

I'll be aiming to have another round of optimizations, now including MIMEType as well to see how much I can push it at the first iteration.

Out of Scope
Just as an idea, it might be good to push the boundaries of utils#MIMEType to reach Undici's implementation, and as they serve for kinda the same purposes, Undici maybe can make use of it internally. Not so sure if a good idea, but want to just raise it in case it can be helpful 🙂

@anonrig
Copy link
Member

anonrig commented Jan 18, 2023

Thanks @metcoder95 for the awesome work. I didnt get any chance to look over the implementation but curious: would it be faster if we implemented the same exact code using C++?

@KhafraDev
Copy link
Member

It's not a performance concern in undici - it's only used for parsing the content-type in data: urls.

@climba03003
Copy link
Author

would it be faster if we implemented the same exact code using C++?

The MIME util is original written in C++ but turns out change into JavaScript to reduce maintenance burden.

spec-compliant

About specification compliant can be either follow the exact step or optimize the step with output compliant.
The former will have little room for improvement, but the latter will have a great room for it.

content-type and fast-content-type-parse is a example for the latter in providing RFC 7231 spec-compliant output.
MIMEType and undici is the former in providing MIMEType spec-compliant implementation.

@metcoder95
Copy link
Member

Thanks @metcoder95 for the awesome work. I didn't get any chance to look over the implementation but curious: would it be faster if we implemented the same exact code using C++?

Sure thing!
Toward the implementation (or re-implementation) into C++, I would say yes (without adding possible overhead on the communication low-JS), but it translates into the costs that @climba03003 pointed out.

The former will have little room for improvement, but the latter will have a great room for it

+1
I think it will be a matter of trade-offs as always. Depending on what's the exact goal that wants to be achieved (fulfill one or another spec).

Note: Undici uses the WHATWG spec for MIMEType representation and parsing.

I'll continue with the research and open subsequent PRs 🙂

@metcoder95
Copy link
Member

Opened PR as discussed, from the insights gathered of it, I'll proceed with the MIMEType experiments 🙂

Reference: Undici#1871

@metcoder95
Copy link
Member

The PR is merged, I'll use the insights from the assessment to continue with MIMEType improvements on Node 🙂

@metcoder95
Copy link
Member

metcoder95 commented Jan 27, 2023

After having played with the MIMEType class implementation in Node and using different approaches, I was able to come up with the following notes:

  • I think the most evident, MIMEType is fully attached to the spec(link) in its own way
  • The biggest pain for both implementations, but as its most for MIMEType class, is the parsing of the parameters of the MIME string

Baseline:
-> With string 'application/json; charset=utf-8'

util#MIMEType x 1,349,103 ops/sec ±0.16% (92 runs sampled)
undici#parseMIMEType x 2,349,808 ops/sec ±0.14% (99 runs sampled)
fast-content-type-parse#parse x 5,663,227 ops/sec ±0.58% (98 runs sampled)
fast-content-type-parse#safeParse x 5,719,013 ops/sec ±0.44% (95 runs sampled)

-> With string 'application/json'

util#MIMEType x 2,463,796 ops/sec ±0.32% (96 runs sampled)
undici#parseMIMEType x 4,467,917 ops/sec ±0.10% (100 runs sampled)
fast-content-type-parse#parse x 22,576,597 ops/sec ±0.12% (99 runs sampled)
fast-content-type-parse#safeParse x 22,565,488 ops/sec ±0.12% (102 runs sampled)

Scenario - 1

Over the same string 'application/json; charset=utf-8', and with improvements to Node.js implementation, the following results are shown:

util#MIMEType x 1,500,644 ops/sec ±0.39% (94 runs sampled)
undici#parseMIMEType x 2,337,932 ops/sec ±0.13% (97 runs sampled)
fast-content-type-parse#parse x 5,790,442 ops/sec ±0.13% (99 runs sampled)
fast-content-type-parse#safeParse x 5,793,647 ops/sec ±0.13% (97 runs sampled)

It showed ~11% improvement.
The improvements were mostly done at the Regex usage by replacing match to test for boolean scenarios.

Scenario - 2

Over the same string 'application/json', and similar improvements as first scenario to Node.js implementation, the following results are shown:

util#MIMEType x 2,834,890 ops/sec ±0.13% (94 runs sampled)
undici#parseMIMEType x 5,199,808 ops/sec ±0.17% (102 runs sampled)
fast-content-type-parse#parse x 23,378,962 ops/sec ±0.25% (95 runs sampled)
fast-content-type-parse#safeParse x 23,192,733 ops/sec ±0.27% (91 runs sampled)

Similar improvements in the results.

The biggest difference is how Undici parses the parameters of the MIME string, as Node.js uses more a direct approach by doing constant usage of slice, search, and indexOf of the String prototype to iterate, cut, and obtain the different parameters of the MIME string. Meanwhile, Undici does it by a more pragmatic approach not exhausting the String prototype usage but rather a more iterative approach with a callback acting as a predicate to stop the iteration.

My take is that an iterative approach will make things easier and improve the performance, as will also reduce the usage of Primordials which are well known for adding slight overhead.

I'll prepare a PR next week with the proposal so this is more graphic and will make it easier to provide feedback and discuss 🙂

@KhafraDev
Copy link
Member

Since undici's parser consistently outperforms node's, should we consider exposing it? It already passes the entirety of the mimesniff tests.

@metcoder95
Copy link
Member

Since undici's parser consistently outperforms node's, should we consider exposing it? It already passes the entirety of the mimesniff tests.

I think it might be a good option, worth it to explore. Iit might keep the things uniform that's for sure

@metcoder95
Copy link
Member

metcoder95 commented Feb 10, 2023

Hey! To give some heads-up about the progress of the work 🙂
I recently opened node#46607 with the initial changes in pro of trying to improve the perf of the MIMEType class.

Note: Just so you know, it is important to remark that it is in a draft state as cleanup and ensure that tests are required.

After several evaluations I was able to come up with the following results:

Compared against Node v18

                                                                                  confidence improvement accuracy (*)   (**)  (***)
util/mime-parser.js n=100000 strings='application/json; charset="utf-8"'                 ***     12.07 %       ±2.78% ±3.72% ±4.87%
util/mime-parser.js n=100000 strings='text/html ;charset=gbk'                            ***      8.13 %       ±2.18% ±2.91% ±3.79%
util/mime-parser.js n=100000 strings='text/html; charset=gbk'                            ***      4.20 %       ±2.35% ±3.13% ±4.08%
util/mime-parser.js n=100000 strings='text/html;charset= "gbk"'                          ***     10.10 %       ±2.02% ±2.69% ±3.50%
util/mime-parser.js n=100000 strings='text/html;charset=GBK'                             ***     11.53 %       ±2.61% ±3.48% ±4.57%
util/mime-parser.js n=100000 strings='text/html;charset=gbk'                             ***      8.37 %       ±2.23% ±2.96% ±3.86%
util/mime-parser.js n=100000 strings='text/html;charset=gbk;charset=windows-1255'        ***     14.77 %       ±1.66% ±2.21% ±2.88%
util/mime-parser.js n=100000 strings='text/html;x=(;charset=gbk'                         ***     12.14 %       ±3.32% ±4.41% ±5.75%

The improvements were between ~8-12% on average, which is good but still not sure if enough to call it successful.
On ops/sec, the results were unprecise as with the previous baseline of:

 * util#MIMEType x 1,349,103 ops/sec ±0.16% (92 runs sampled)
 * undici#parseMIMEType x 2,349,808 ops/sec ±0.14% (99 runs sampled)
 * undici#parseMIMEType(original) x 1,520,777 ops/sec ±0.26% (100 runs sampled)
 * fast-content-type-parse#parse x 5,663,227 ops/sec ±0.58% (98 runs sampled)
 * fast-content-type-parse#safeParse x 5,719,013 ops/sec ±0.44% (95 runs sampled)

the new baseline varied to:

util#MIMEType x 2,075,831 ops/sec ±0.14% (197 runs sampled)
undici#parseMIMEType x 2,729,673 ops/sec ±0.15% (194 runs sampled)
fast-content-type-parse#parse x 5,751,725 ops/sec ±0.13% (195 runs sampled)
fast-content-type-parse#safeParse x 5,766,105 ops/sec ±0.10% (195 runs sampled)
Fastest is fast-content-type-parse#safeParse

Almost ~50%, which seems to be quite out compared to the results of Node benchmarks. I wanted to put them on the table but wouldn't ensure they are precise as they were varying by ~20% on several iterations.

Looking forward to your feedback and seeing what points can be improved that I could missed 🙂

@anonrig
Copy link
Member

anonrig commented Feb 10, 2023

really good! good job @metcoder95

@metcoder95
Copy link
Member

Hi guys! Is there a way to move nodejs/node#46607 forward? 🙂

@Uzlopak
Copy link
Contributor

Uzlopak commented Jul 29, 2023

Hmm. Didnt know this thread exists. Didnt think that fast-content-type-parse is not spec compliant. Can somebody point to a test suite for spec compliance?

@KhafraDev
Copy link
Member

@Uzlopak
Copy link
Contributor

Uzlopak commented Jul 30, 2023

thx

@metcoder95
Copy link
Member

Hmm. Didnt know this thread exists. Didnt think that fast-content-type-parse is not spec compliant. Can somebody point to a test suite for spec compliance?

BTW, I believe that when referring to fast-content-type-parse is not compliant, it is about to MIMESniff standard from the WHATWG, not the RFC-1341 which I believe the library in fact respects.

@Uzlopak
Copy link
Contributor

Uzlopak commented Sep 27, 2023

LOL, I forgot about this issue and created #120

Dang.

Here my post from the other issue:

Maybe we should check how we can improve this overall?


I think that MIMEType can be improved significantly.

I created rn a PR regarding lazily parsing the MimeParams.

nodejs/node#49889

Also i could improve toASCIILower with this snippet:

const ASCII_LOOKUP = new Array(127).fill(0).map((v, i) => {
  if (i >= 65 && i <= 90) {
    return StringFromCharCode(i + 32);
  }
  return '';
})

function toASCIILower(str) {
  let result = '';
  for (let i = 0; i < str.length; ++i) {
    const code = StringPrototypeCharCodeAt(str, i);

    if (code > 90 || code < 65) {
      result += str[i];
    } else {
      result += ASCII_LOOKUP[code];
    }
  }
  return result;
}

Also why do we use SafeStringPrototypeSearch in encode? Cant we just use RegexPrototypeExec? Cant we manually inline the encode function into toString?

Is there a faster way to iterate the SafeMap? Why do we use a SafeMap and not a NullObject (because of the generator functions?

Can we avoid the generator fns? Does it make sense to avoid the generator fns?

Can we optimize parseTypeAndSubtype? Maybe also lazily parse the values? We need to throw errors in special cases, but other than that, we just have to store the string.

Why do we call in MIMEType toString FunctionPrototypeCall(MIMEParamsStringify, this.#parameters); and not
this.#parameters.toString()?

Questions over questions...

@metcoder95
Copy link
Member

No worries, happy to see there is progress on this front. I really want to continue my work over nodejs/node#46607, maybe now that @Uzlopak is a member, we can make it move forward with reviews and adjustments? 👀

Also why do we use SafeStringPrototypeSearch in encode? Cant we just use RegexPrototypeExec?

SafeStringPrototypeSearch is slightly faster than RegexPrototypeExec: nodejs/node#46607 (comment)

Is there a faster way to iterate the SafeMap? Why do we use a SafeMap and not a NullObject (because of the generator functions?

I think it was mostly commodity, didn't find counters while working on the PR of using a plain object. Can be sealed with NullObject if we want. The spec does not define a requirement for the params to be a proper Map instance, but rather Map-like data structure. We can even store them as we like meanwhile we can parse them as strings correctly.

Can we avoid the generator fns? Does it make sense to avoid the generator fns?

Which generators?

Can we optimize parseTypeAndSubtype? Maybe also lazily parse the values? We need to throw errors in special cases, but other than that, we just have to store the string.

I think we can lazy it at much until is required to provide the type and subtype back; but that won't help much as usually, it is within the hot path so it cannot be delayed too much the parsing.

Why do we call in MIMEType toString FunctionPrototypeCall(MIMEParamsStringify, this.#parameters); and not
this.#parameters.toString()?

Sealing the call I'd say?

@Uzlopak
Copy link
Contributor

Uzlopak commented Oct 8, 2023

nodejs/node#49889 got merged.

@metcoder95
Should we focus now on your PR?

@metcoder95
Copy link
Member

Sure thing, let me put it up-to-date and run the benchmarks agreed. Didn't had the time yet to take another look, I'd try to do it this week 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants