
Readable.toWeb seems to load file contents to memory #46347

Closed
Dzieni opened this issue Jan 25, 2023 · 21 comments · Fixed by #48847
Labels
confirmed-bug: Issues with confirmed bugs.
stream: Issues and PRs related to the stream subsystem.

Comments

@Dzieni

Dzieni commented Jan 25, 2023

Version

v19.5.0

Platform

Darwin dzieni 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:35 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T8101 arm64

Subsystem

stream

What steps will reproduce the bug?

  • run REPL
  • create a file stream: const nodeStream = fs.createReadStream('/some/large/file')
  • check process.memoryUsage()
  • convert it to a web stream: const webStream = (require('stream').Readable).toWeb(nodeStream)
  • check process.memoryUsage()

How often does it reproduce? Is there a required condition?

Every time; no special conditions are required.

What is the expected behavior?

Memory usage does not grow significantly.

What do you see instead?

Memory usage (specifically the arrayBuffers section) grows by a few orders of magnitude, and the growth appears to be correlated with the file size.

Additional information

No response

@VoltrexKeyva added the stream label Jan 25, 2023
@bnoordhuis
Member

Can you post a complete, ready-to-run test case? Maybe you've hit a bug, maybe you haven't, but it's impossible to tell right now.

@Dzieni
Author

Dzieni commented Feb 2, 2023

@bnoordhuis

// since it's ESM, save it as .mjs

import fs from 'node:fs'
import process from 'node:process'
import {Readable} from 'node:stream'

// we initialize a stream, but do not start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')
// after 10 seconds, it'll get converted to web stream
let randomWebStream

// we check memory usage every second
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
	const {arrayBuffers} = process.memoryUsage()
	console.log(
		`Array buffers memory usage is ${Math.round(
			arrayBuffers / 1024 / 1024
		)} MiB`
	)
	if (arrayBuffers > 256 * 1024 * 1024) {
		// streaming should not lead to such a memory increase
		// therefore, if it happens => bail
		console.log('Over 256 MiB taken, exiting')
		process.exit(0)
	}
}
setInterval(reportMemoryUsage, 1000)

// after 10 seconds we use Readable.toWeb
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
	console.log('converting node stream to web stream')
	randomWebStream = Readable.toWeb(randomNodeStream)
}, 10000)

// after 15 seconds we start consuming the stream
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
	console.log('reading the chunks')
	for await (const chunk of randomWebStream) {
		// do nothing, just let the stream flow
	}
}, 15000)

This produces the same behavior on macOS and Linux:

michal@dzieni ~ % node test.mjs
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 617 MiB
Over 256 MiB taken, exiting

Immediately after calling Readable.toWeb, the memory usage spikes. Since we use /dev/urandom (a file that never ends), the usage grows indefinitely.

You can compare it to the third-party library readable-stream-node-to-web, where it works properly:

// since it's ESM, save it as .mjs

import fs from 'node:fs'
import process from 'node:process'
import nodeToWebStream from 'readable-stream-node-to-web'

// we initialize a stream, but do not start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')
// after 10 seconds, it'll get converted to web stream
let randomWebStream

// we check memory usage every second
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
	const {arrayBuffers} = process.memoryUsage()
	console.log(
		`Array buffers memory usage is ${Math.round(
			arrayBuffers / 1024 / 1024
		)} MiB`
	)
	if (arrayBuffers > 256 * 1024 * 1024) {
		// streaming should not lead to such a memory increase
		// therefore, if it happens => bail
		console.log('Over 256 MiB taken, exiting')
		process.exit(0)
	}
}
setInterval(reportMemoryUsage, 1000)

// after 10 seconds we use nodeToWebStream
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
	console.log('converting node stream to web stream')
	randomWebStream = nodeToWebStream(randomNodeStream)
}, 10000)

// after 15 seconds we start consuming the stream
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
	console.log('reading the chunks')
	for await (const chunk of randomWebStream) {
		// do nothing, just let the stream flow
	}
}, 15000)

In that case, the memory usage is fine:

michal@dzieni ~ % node src/test.mjs 
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
reading the chunks
Array buffers memory usage is 1 MiB
Array buffers memory usage is 6 MiB
Array buffers memory usage is 9 MiB
Array buffers memory usage is 15 MiB
Array buffers memory usage is 5 MiB
Array buffers memory usage is 9 MiB
Array buffers memory usage is 11 MiB
Array buffers memory usage is 4 MiB
Array buffers memory usage is 19 MiB
Array buffers memory usage is 16 MiB
Array buffers memory usage is 1 MiB
Array buffers memory usage is 30 MiB
Array buffers memory usage is 24 MiB
Array buffers memory usage is 6 MiB
Array buffers memory usage is 4 MiB
Array buffers memory usage is 2 MiB
Array buffers memory usage is 1 MiB

@bnoordhuis added the confirmed-bug label Feb 3, 2023
@bnoordhuis
Member

bnoordhuis commented Feb 3, 2023

Thanks, I see what you mean. The stream starts reading before there's something consuming it.

@nodejs/whatwg-stream this is a legitimate bug. You can see it even more clearly when you switch from /dev/urandom to /dev/zero.

edit: bug also exists in v19.6.0.

@debadree25
Member

debadree25 commented Feb 3, 2023

Hello @Dzieni, I think this issue is solved if you pass a strategy while converting the Node stream to a web stream, something like:

randomWebStream = Readable.toWeb(randomNodeStream, {
  strategy: new CountQueuingStrategy({ highWaterMark: 100 }),
});

I tried this on my local machine and memory doesn't seem to overflow.
(Screenshot from 2023-02-04 showing the run with bounded memory usage.)

I think the behavior here is somewhat expected: when the readable stream is created, the pull function (as described at https://nodejs.org/api/webstreams.html#new-readablestreamunderlyingsource--strategy) is called continuously as soon as the ReadableStream is created, and the adapter from web streams defines it as

pull() { streamReadable.resume(); },

This means that if no strategy is defined, streamReadable.resume() will try to consume the whole file, causing the memory to overflow; when we pass a strategy, it ensures backpressure is applied.

Maybe the docs at https://nodejs.org/api/stream.html#streamreadabletowebstreamreadable-options could be updated to cover this scenario, @bnoordhuis
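For a sense of scale (assuming fs.createReadStream's default 64 KiB chunk size): CountQueuingStrategy counts every chunk as size 1, so highWaterMark: 100 bounds the queue to roughly 100 chunks, i.e. about 100 × 64 KiB ≈ 6.25 MiB of buffered data, rather than an unbounded amount.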

@mcollina
Member

mcollina commented Feb 3, 2023

I think this is a bug. The whole point of streams is to manage the flow of data.

@debadree25
Member

I think this is a bug. The whole point of streams is to manage the flow of data.

So something like a default highWaterMark while doing toWeb()?

@mcollina
Member

mcollina commented Feb 3, 2023

No, I think there is an actual bug somewhere. Instead of resume(), this should call read().

@Dzieni
Author

Dzieni commented Feb 6, 2023

@debadree25
Sounds like a workaround that is good enough for me, thanks!

@mcollina added the good first issue label Feb 6, 2023
debadree25 added a commit to debadree25/node that referenced this issue Feb 11, 2023
@lilsweetcaligula
Contributor

lilsweetcaligula commented Feb 14, 2023

Hey guys and gals, could you please confirm my findings? I've looked into this briefly, and I believe I have traced it to /lib/internal/webstreams/readablestream.js.

Now, I am not entirely sure if it's by design, but neither readableByteStreamControllerShouldCallPull nor readableStreamDefaultControllerShouldCallPull seems to check for the presence of a reader:
https://github.com/nodejs/node/blob/main/lib/internal/webstreams/readablestream.js#L2458-L2481
https://github.com/nodejs/node/blob/main/lib/internal/webstreams/readablestream.js#L2223-L2240

Refusing to pull when a stream has no reader seems to alleviate the issue. Something like this seems to do the trick:
1f0e249 (#46643)

Testing against @Dzieni's benchmark (slightly modified to convert to the web stream sooner), here's the output before the change is applied:

Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 109 MiB
Array buffers memory usage is 206 MiB
Array buffers memory usage is 336 MiB
Over 256 MiB taken, exiting

Here's the output after the change is applied:

Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
converting node stream to web stream
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB
Array buffers memory usage is 0 MiB

For completeness, I'm attaching @Dzieni's benchmark with the slight modification that I mentioned along with this message.

// since it's ESM, save it as .mjs

import fs from 'node:fs'
import process from 'node:process'
import {Readable} from 'node:stream'

// we initialize a stream, but do not start consuming it yet
const randomNodeStream = fs.createReadStream('/dev/urandom')
// in a few seconds, it'll get converted to web stream
let randomWebStream

// we check memory usage every second
// since it's a stream, it shouldn't be higher than the chunk size
const reportMemoryUsage = () => {
	const {arrayBuffers} = process.memoryUsage()
	console.log(
		`Array buffers memory usage is ${Math.round(
			arrayBuffers / 1024 / 1024
		)} MiB`
	)
	if (arrayBuffers > 256 * 1024 * 1024) {
		// streaming should not lead to such a memory increase
		// therefore, if it happens => bail
		console.log('Over 256 MiB taken, exiting')
		process.exit(0)
	}
}
setInterval(reportMemoryUsage, 1000)

// after 3 seconds we use Readable.toWeb
// memory usage should stay pretty much the same since it's still a stream
setTimeout(() => {
	console.log('converting node stream to web stream')
	randomWebStream = Readable.toWeb(randomNodeStream)
}, 3000)

// after 30 seconds we start consuming the stream
// memory usage will grow, but the old chunks should be garbage-collected pretty quickly
setTimeout(async () => {
	console.log('reading the chunks')
	for await (const chunk of randomWebStream) {
		// do nothing, just let the stream flow
	}
}, 30000)

[Edit, Feb 14]: updated the draft solution link to point to a specific commit.

@mcollina
Member

@jasnell @KhafraDev could you take a look?

@KhafraDev
Member

It looks like node implements ReadableByteStreamControllerShouldCallPull and ReadableStreamDefaultControllerShouldCallPull correctly, which makes me think it's not an issue with them. I don't know enough about the webstream spec to give a definitive answer though.

@lilsweetcaligula
Contributor

@debadree25 I've been looking into this on and off and I may need a clarification. Could you please clarify: in the code snippet below, should the highWaterMark value of randomWebStream's default controller default to 65536 bytes, or to 65536 chunks of 65536 bytes each?

const randomNodeStream = fs.createReadStream('/dev/urandom')
const randomWebStream = Readable.toWeb(randomNodeStream)

@debadree25
Member

From what I understand, the high watermark would be 65536 "chunks", since here

return { highWaterMark };

we don't set any size function, and by default the size function just returns 1, ref:

if (size === undefined) return () => 1;

so it would be 65536 "chunks", with each chunk regarded as size 1.

The comment in these lines

// When not running in objectMode explicitly, we just fall
// back to a minimal strategy that just specifies the highWaterMark
// and no size algorithm. Using a ByteLengthQueuingStrategy here
// is unnecessary.

mentions ByteLengthQueuingStrategy as unnecessary, but maybe it is actually needed?
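As an illustration (just a sketch, not a proposed patch), a byte-counting strategy can already be passed explicitly through the toWeb() options, so the queue is bounded by bytes rather than by chunk count:

import fs from 'node:fs';
import { Readable } from 'node:stream';
import { ByteLengthQueuingStrategy } from 'node:stream/web';

const nodeStream = fs.createReadStream('/dev/urandom');
// ByteLengthQueuingStrategy sizes each chunk by chunk.byteLength,
// so highWaterMark here means roughly 64 KiB of queued data,
// not 65536 chunks of up to 64 KiB each.
const webStream = Readable.toWeb(nodeStream, {
  strategy: new ByteLengthQueuingStrategy({ highWaterMark: 65536 }),
});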

@lilsweetcaligula
Contributor

lilsweetcaligula commented Feb 19, 2023

@debadree25 Thank you for your reply. As I was inspecting the code yesterday, I just found it somewhat odd - hence I had to clarify it with someone.

Another odd thing I found: if you have code like this:

const randomNodeStream = fs.createReadStream('/dev/urandom', {
	highWaterMark: 5556
})

const randomWebStream = Readable.toWeb(randomNodeStream)

Upon inspection with a debugger, I found that randomWebStream's controller's queue (!) was eventually filled with 5556 chunks (i.e. queueTotalSize === 5556) with seemingly each (!) chunk having a size of 5556 bytes. Changing 5556 to 5557 in the code snippet would give a similar result - 5557 chunks with each chunk having a size 5557 bytes.

That means that if a user does not explicitly pass a highWaterMark to fs.createReadStream, the resulting stream will have a highWaterMark of 65536, which upon conversion to a web stream allows a queue of 65536 chunks of 65536 bytes each, i.e. a total of 4 GiB.

Is this a bug or am I misunderstanding it?

@benjamingr removed the good first issue label Feb 19, 2023
@benjamingr
Member

This isn't a "good first issue"; I don't think "streams" and "good first issue" really mix, except for tests/docs.

@debadree25
Member

@lilsweetcaligula tbf even I am confused here; this would need deeper investigation 😅😅

@mcollina
Member

@benjamingr I think this might be as simple as it gets for streams.

The problem is that in

pull() { streamReadable.resume(); },
we call resume() and in
if (controller.desiredSize <= 0)
  streamReadable.pause();
we call pause() only under certain conditions (which I think are never met, or something similar). This is somewhat dangerous and can cause exactly what we are seeing here.

Note that this should be calling .read() and listening to 'readable' instead.
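Roughly, a pull-based adapter could look like this (a simplified sketch for illustration, not the actual patch; toWebPullBased is a hypothetical name):

// Read one chunk per pull() instead of resuming the whole source stream.
// The controller keeps calling pull() while desiredSize > 0, so backpressure
// falls out of the web streams machinery instead of relying on pause()/resume().
function toWebPullBased(streamReadable) {
  return new ReadableStream({
    start(controller) {
      streamReadable.once('end', () => controller.close());
      streamReadable.once('error', (err) => controller.error(err));
    },
    pull(controller) {
      return new Promise((resolve) => {
        const chunk = streamReadable.read();
        if (chunk !== null) {
          controller.enqueue(chunk);
          resolve();
          return;
        }
        // Nothing buffered yet: wait for the next 'readable' event.
        streamReadable.once('readable', () => {
          const next = streamReadable.read();
          if (next !== null) controller.enqueue(next);
          resolve();
        });
      });
    },
    cancel() {
      streamReadable.destroy();
    },
  });
}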

nodejs-github-bot pushed a commit that referenced this issue Feb 22, 2023
Refs: #46347
PR-URL: #46617
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Harshitha K P <harshitha014@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@rluvaton
Member

From what I understand, the high watermark would be 65536 "chunks", since here

return { highWaterMark };

we don't set any size function, and by default the size function just returns 1, ref:

if (size === undefined) return () => 1;

so it would be 65536 "chunks", with each chunk regarded as size 1.
The comment in these lines

// When not running in objectMode explicitly, we just fall
// back to a minimal strategy that just specifies the highWaterMark
// and no size algorithm. Using a ByteLengthQueuingStrategy here
// is unnecessary.

mentions ByteLengthQueuingStrategy as unnecessary, but maybe it is actually needed?

Hey @mcollina

I think @debadree25 is right in his comment that we should use the ByteLengthQueuingStrategy

Nevertheless, I did try using the 'readable' event and it worked; I'll create a PR shortly.

rluvaton added a commit to rluvaton/node that referenced this issue Feb 26, 2023
targos pushed a commit that referenced this issue Mar 13, 2023
Refs: #46347
PR-URL: #46617
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Harshitha K P <harshitha014@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
danielleadams pushed a commit that referenced this issue Apr 11, 2023
Refs: #46347
PR-URL: #46617
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Harshitha K P <harshitha014@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
CGQAQ added a commit to CGQAQ/node that referenced this issue Jul 20, 2023
CGQAQ added a commit to CGQAQ/node that referenced this issue Jul 21, 2023
CGQAQ added a commit to CGQAQ/node that referenced this issue Jul 21, 2023
@karimfromjordan

Any updates on this?

@Dzieni
Author

Dzieni commented May 7, 2024

In the meanwhile I found another workaround: ReadableStream.from(randomNodeStream) seems to do the job just right, without having to set the queuing strategy explicitly.
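A minimal sketch of that workaround (ReadableStream.from() is available in newer Node.js releases), using the same kind of source stream as before:

import fs from 'node:fs'
import { ReadableStream } from 'node:stream/web'

const nodeStream = fs.createReadStream('/dev/urandom')
// ReadableStream.from() consumes the Readable as an async iterable,
// pulling chunks only as the web stream is read, so nothing piles up in memory.
const webStream = ReadableStream.from(nodeStream)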

aduh95 pushed a commit to CGQAQ/node that referenced this issue May 11, 2024
aduh95 pushed a commit that referenced this issue May 12, 2024
Fixes: #46347
PR-URL: #48847
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
targos pushed a commit that referenced this issue May 12, 2024
Fixes: #46347
PR-URL: #48847
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@marcj

marcj commented Jun 7, 2024

I had the same issue with the memory leak, and after installing Node v22.2 it mostly went away, but not entirely. I still get a massive slowdown/memory leak when I use the ReadableStream in fetch (which many probably try with this toWeb function):

await fetch(url, {
  method: 'POST',
  body: Readable.toWeb(data),
  ...({ duplex: "half" })
});

Unfortunately, I had to switch to got to get rid of the memory leak.

marco-ippolito pushed a commit that referenced this issue Jun 17, 2024
Fixes: #46347
PR-URL: #48847
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
marco-ippolito pushed a commit that referenced this issue Jun 17, 2024
Fixes: #46347
PR-URL: #48847
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
marco-ippolito pushed a commit that referenced this issue Jun 17, 2024
Fixes: #46347
PR-URL: #48847
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
EliphazBouye pushed a commit to EliphazBouye/node that referenced this issue Jun 20, 2024
Fixes: nodejs#46347
PR-URL: nodejs#48847
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
bmeck pushed a commit to bmeck/node that referenced this issue Jun 22, 2024
Fixes: nodejs#46347
PR-URL: nodejs#48847
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>