fs: improve readFile performance #27063
Conversation
This increases the maximum buffer size per read to 256kb when using `fs.readFile`. This is important to improve the read performance for bigger files. Refs: nodejs#25741
Couldn't we just change
That would not have the expected behavior: it is now only used in case the
Maybe we should have two constants then, since
It would have the behaviour I'd expect from the commit description! :-) Why is this more conservative? What is safer? If we are going to read 1,000K, and we know it, with this change it will be read in larger 256K chunks. But if we don't know that it will be 1,000K, it will be read in the older, smaller, 8K chunks. Maybe there is a good reason for this, but it's not immediately obvious. I'm actually not clear on why there are two branches at all. It seems like the branch where size isn't known would work for all cases. AFAICT the only optimization is that if the file happens to be exactly a multiple of kReadFileBufferLength, and .size is known, one extra .... OK, I'm not deleting the above, but I think I just realized why there are two branches. Still not sure why the larger buf size isn't used in all cases.
I added another commit that also refactors the code a tiny bit to make it simpler. I also added the constant as suggested.
There is a comment in one branch that checks for
Benchmark: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/323/
There's a slightly different approach we could take on this by either:

a. Providing an option to set the read chunk size... e.g.
b. Providing an option that is effectively a preallocated buffer to fill on each read...

Option a is likely the far better choice and allows users to performance-tune on their own.
I increased the value to 512kb and also doubled the value for unknown file sizes. 256kb already shows significant improvements:
@jasnell I believe we should raise this limit with or without the option. So adding the option could be done in a separate PR and this is a quick win.
master...zbjornson:chunksizeopt this branch has a
However, @davisjam and I actually just had a nice chat and agreed that partitioning may be causing more harm than good (in part due to the increased risk of OOMs), and we were both open to removing partitioning and instead adding a prominent doc warning.
So that the buffer would be as big as the whole file? We would have to guard against wrong sizes (and then potentially still manage multiple buffers), and what would be the benefit of b over a?
As I said, option a is likely better than option b ;-)
If you mean
That says what, but not why. Since you can read until you don't get bytes whether you know the size or not, without some context it's quite mysterious why the code doesn't just always use the
This was not introduced in this PR, so consider it a nit. I can fix, if you want.
The effect of this is to make the buffer size different when reading from pipes vs. from files (technically, for "seekable" vs "non-seekable" fds). I don't understand why both cases wouldn't want the same performance enhancement.
This is not about using an fd or not but about the file type. We know the file size for regular files (S_IFREG type) but not in case it's another file type. So if we know the file size, it'll allocate the full buffer in advance and then read in chunks into that buffer. For other file types we allocate a small buffer, because we could otherwise over-allocate (which costs memory and performance), and create a new buffer on each read until we find the end of the file. At the end we concat those buffers together.
LGTM
@nodejs/fs this could use some further reviews. @sam-github @jasnell @zbjornson I suggest landing this, no matter what the eventual solution to improve the current situation might be. Are you good with that?
LGTM. However I’d make both of them 512KB.
I guess it won't make a huge difference, as the cases where the file does not have a regular file type are not common, but it will definitely take significantly longer to allocate 512kb than the current 8kb, and if we over-allocate, we also do work and use more memory than we should. What about using 64kb or 128kb for unknown file sizes? I guess with that size we cover a lot more files and also improve the performance significantly.
Mine is not a blocker. I'm ok as it is. Maybe you can add a comment with the above?
I added a comment to highlight the difference and updated the buffer size to 64kb for unknown file sizes. |
LGTM. This should give a big performance boost while still "taking turns" on the threadpool.
This increases the maximum buffer size per read to 512kb when using `fs.readFile`. This is important to improve the read performance for bigger files. PR-URL: #27063 Refs: #25741 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Jamie Davis <davisjam@vt.edu>
Landed in eb2d416 🎉 Thanks for the reviews!
This increases the maximum buffer size per read to 256kb when using `fs.readFile`. This is important to improve the read performance for bigger files.

Refs: #25741
Benchmark (512kb):
Checklist
`make -j4 test` (UNIX), or `vcbuild test` (Windows) passes