streaming support #32

Open
RangerMauve opened this issue May 5, 2023 · 4 comments

Comments

@RangerMauve

Hey, we're thinking of using this as part of Webrecorder for working with web archives.

One limitation is that we're dealing with files that are too large to practically store in an ArrayBuffer, so we need to use streaming interfaces.

Would you be interested in adding this functionality to your readers yourself, or would you be interested in a pull request that adds it? I think it should be easy to fit into the existing code base.

@greggman
Owner

greggman commented May 5, 2023

Yes, that would be great. I'm not sure it's quite as easy with workers etc...

It might be good to propose an API before writing it. Maybe the Streams API is a good reference, or at least good in the sense that it's a solution devs are already familiar with? entry.getReader()?
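
For example, something along these lines (entry.getReader() doesn't exist yet; this is only a sketch of what mirroring the Streams API might look like from the caller's side):

const {entries} = await unzip('zipFileWith5GigVideoFile.zip');
// hypothetical: getReader() hands back a Streams-API-style reader for one entry
const reader = entries['5GigVideoFile.mp4'].getReader();
let bytesSeen = 0;
while (true) {
  const {done, value} = await reader.read();  // value would be a Uint8Array chunk
  if (done) break;
  bytesSeen += value.byteLength;              // process the chunk here
}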

Also, maybe this is already clear, but you can pass a blob currently, and only the parts of the zip file needed for an individual entry are in memory. Of course if that one entry is large then yes, you'd need streaming to handle it. Ideally, like the current API, you can stream it to a blob.

In other words, it should be possible to do this:

const {entries} = await unzip('zipFileWith5GigVideoFile.zip');
const blob = await entries['5GigVideoFile.mp4'].blob();
someVideoElement.src = URL.createObjectURL(blob);

I know that's not part of a streaming API but it is something "streaming" internally would allow. I think right now, IIRC, it would run out of memory, though it's been a while since I've touched this code.

@RangerMauve
Author

Yeah, getting back a blob would be ideal. I think the ideal API would be to add an async readBlob(offset, size): Blob method to HTTPRangeReader and BlobReader.
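
A rough sketch of what that could look like on the blob-backed reader (the class shape here is assumed rather than copied from the library; readBlob is the only proposed addition):

class BlobReader {
  constructor(blob) {
    this.blob = blob;
  }
  async getLength() {
    return this.blob.size;
  }
  async read(offset, length) {
    // existing style of API: return the requested bytes as a Uint8Array
    const data = await this.blob.slice(offset, offset + length).arrayBuffer();
    return new Uint8Array(data);
  }
  async readBlob(offset, size) {
    // proposed addition: Blob.slice is cheap and doesn't pull the bytes into memory
    return this.blob.slice(offset, offset + size);
  }
}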

For our use case we actually need to read the entry headers and the central directory as well as the entry contents (uncompressed). We were hoping to mostly reuse the entry parsing logic from this library to get the offsets and reuse the readers to read data from it.

I saw that I could get the entry offset from the _rawEntry property but had a bit of a roadblock when it came to reading streams from the different backends you support.

Specifically, this is going to be part of our official implementation of the IPFS WACZ custom chunking we're doing, which will enable us to deduplicate content across web archive collections. https://github.com/webrecorder/specs/blob/main/wacz-ipfs/latest/index.md

@greggman
Owner

greggman commented May 5, 2023

It doesn't seem like adding readBlob to readers really solves anything. The code using the reader needs to read the bytes, and it can already call reader.read(offset, length), so changing the code to read less than a whole entry at a time doesn't need any changes to Reader.

Off the top of my head:

  1. To support reading an entry as a blob without putting the entire thing in memory, the code that decodes an entry needs to output blobs at, say, 1 meg per blob. It can then return a blob made from those 1-meg blobs: return new Blob(collectionOfBlobChunks).

    So if you call blob = await entry.blob(), there is nothing the user needs to do. The code will work with the same API.

    The work here is refactoring the decompression code so it decompresses from chunks and can return chunks of memory. Once that's done, the main thread and workers can return memory chunks, and the main code can turn them into blobs if you asked for a blob. So there's never more than a few meg in memory.

    You want the code doing the decompression to get passed a chunk size (see the first sketch after this list). If the user asked for arrayBuffer, text, or json, the chunk size should equal the size of the entry (entry.size). Only if they asked for a blob (entry.blob) is the chunk size smaller internally. The reason is that if you always returned smaller chunks, then in the arrayBuffer/text/json case all of the chunks would have to be held in memory and copied into one large entry.size arrayBuffer, which would be slower than decompressing in place into the arrayBuffer you're actually going to return.

  2. Support streaming an entry to the user (not internally)

    This is where I was suggesting copying the Streams API, if only because it's something devs should already have used. Of course, maybe it sucks as an API and there's something simpler to do. The point of a streaming API is that it forces you to read the entry in order, from byte 0 to the last byte.

  3. Support random access

    In this case you'd just do something like:

    const myChunkSize = 1000;
    const length = entry.size;
    for (let i = 0; i < length; i += myChunkSize) {
      const chunkSize = Math.min(myChunkSize, length - i);
      const chunk = await entry.arrayBuffer(i, chunkSize);
    }
    

    A simple implementation: internally it would just do the streaming, return portions, and check that the offset of the next call to arrayBuffer equals the offset of the last read plus its size. If yes, keep streaming. If not, just start over (so super slow, but you shouldn't need to start over). See the second sketch after this list.

    Adding entry.readPortionAsBlob(offset, size) would do the same, just returning a blob.

    It doesn't make sense for entry.text and entry.json to have streaming versions because the data is not parsable by the browser mid-stream. If you want to stream text, you'd need to stream bytes or blobs and do the conversion to text yourself so you can handle the edge case where a codepoint crosses a chunk boundary.
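
A minimal sketch of the chunk-size idea from point 1 (decompressEntryInChunks is made up; it stands in for the refactored decompression code that yields Uint8Array chunks of at most chunkSize bytes):

const ONE_MEG = 1024 * 1024;

async function readEntryData(entry, asBlob, decompressEntryInChunks) {
  // blob() uses small chunks; arrayBuffer/text/json decompress the whole entry at once
  const chunkSize = asBlob ? ONE_MEG : entry.size;
  const chunks = [];
  for await (const chunk of decompressEntryInChunks(entry, chunkSize)) {
    chunks.push(asBlob ? new Blob([chunk]) : chunk);
  }
  return asBlob
    ? new Blob(chunks)   // new Blob(collectionOfBlobChunks)
    : chunks[0];         // a single entry.size Uint8Array, no extra copy
}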
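
And a sketch of the bookkeeping from point 3 (_nextOffset, _chunkStream, _restartStreamAt, and take are all made-up names for internal pieces):

async function readPortion(entry, offset, size) {
  if (entry._nextOffset !== offset) {
    // not the byte right after the last read: restart decompression from the
    // beginning and skip ahead to offset (the super-slow path)
    entry._chunkStream = entry._restartStreamAt(offset);
  }
  const data = await entry._chunkStream.take(size);  // next `size` decompressed bytes
  entry._nextOffset = offset + size;
  return data;
}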

@wojpawlik

Blobs could just be piped through new DecompressionStream("deflate-raw") (nodejs/node#50097). ArrayBuffers would likely need to be chunked first.
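
For the blob case that could look roughly like this (how you get compressedBlob, a Blob covering just one entry's raw deflate bytes, is left out):

const decompressed = compressedBlob
  .stream()
  .pipeThrough(new DecompressionStream('deflate-raw'));
const blob = await new Response(decompressed).blob();  // or read the stream chunk by chunk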
