Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event on end of deflate stream #26603

Closed
kingces95 opened this issue Mar 12, 2019 · 11 comments
Closed

Event on end of deflate stream #26603

kingces95 opened this issue Mar 12, 2019 · 11 comments

Comments

@kingces95
Copy link

kingces95 commented Mar 12, 2019

I expect the zlib InflateRaw stream to notify me somehow (maybe an 'error' event) that it has detected the end of the deflated stream and is now discarding any more bytes written to the stream. The zlib.net/zpipe.c annotated example code says

The outer do-loop ends when inflate() reports that it has reached the end of the input zlib stream, has completed the decompression and integrity check, and has provided all of the output. This is indicated by the inflate() return value Z_STREAM_END.

What zlib actually does is discard any bytes written to InflateRaw after then the end of the deflated stream has been identified; Zlib does not propagate the Z_STREAM_END notification to me.

I believe zlib will encounter Z_STREAM_END here. And here is where zlib needs to raise the error (I think), but only if there are additional bytes written after the end of the stream as that would indicate the stream user doesn't know where the end of the stream is.

Either way, the stream should also return false after encountering Z_STREAM_END instead of "consuming" by tossing the bytes.

I'm attempting to download and unzip a file in one pass. As the bytes are downloaded I detect the zip headers and start decompressing the stream. The stream is decompressed but as I'm not notified the decompression has reached the end of the deflated stream I'm forced to buffer the remainder of the file which defeats my goal of streaming decompression.

For example, see: #26624

See also my first issue (notes from when I was groping around trying to grock): #26332

@kingces95
Copy link
Author

Zlib docs do say

Return codes for the compression/decompression functions. Negative values are errors, positive values are used for special but normal events.

And lists Z_STREAM_END as a possible return value. But I find no tests that cover that case and I cannot figure out where, exactly, those values would be returned from.

@addaleax
Copy link
Member

@kingces95 Did you take a look at #26363? I linked it from your previous issue, and it seems like it is what you want here?

@kingces95
Copy link
Author

kingces95 commented Mar 13, 2019

Oh! Woot!

Although, are we sure that is the right fix? I'm not an expert, but are we sure that every input byte will result in output bytes? If not, then there could be a case where there is input to consume and space to publish the output yet inflate() needs to wait for another byte before it can generate output?

Shouldn't we shuttle the err_ value up from the inflate() call (say, as a third element in the _writeState array) instead of inferring it's value from having space left in the output buffer if there's still space left in the input buffer?

Take this case: If I write bytes one at a time them I'm forced to write one byte past the end of the deflate stream before the push(null) call. So (and even tho I may have suggested that as a way to detect the EOF) that fix feels sketchy.

What does push(null) do?

@madler
Copy link

madler commented Mar 13, 2019

No, not every input byte will result in output bytes.

@kingces95 kingces95 mentioned this issue Mar 13, 2019
4 tasks
@kingces95
Copy link
Author

kingces95 commented Mar 13, 2019

Given Mark Adler's comment, I don't think we can infer Z_STREAM_END given the conditions specified in the comment:

If we have more input that should be written, but we also have output space available, that means that the compression library was not interested in receiving more data, and in particular that the input stream has ended early. This applies to streams where we don't check data past the end of what was consumed; that is, everything except Gunzip/Unzip.

So there is a delta I think would work for me. I added a property result (along side bytesWritten) that can be compared against codes.Z_STREAM_END to detect the end of the deflate stream.

Please lemme know if that makes sense and I'll move the test case I added as a comment into the test suite + some docs. (Or just lemme know if you wanna rock out with it!)

I'm gonna see how hard it would be to add another field bytesDiscarded so that the user doesn't need to do the accounting to see how many bytes they wrote were tossed. Even better would be to expose a buffer discardedBuffer that returns the bytes not used in decompression to the user... but I'm thinking that might be too difficult for me; I suspect the buffering of discarded bytes would have to be done in the .c code instead of the .js code. And anyway, I think I've got that buffering/accounting/backtrack problem handled in my .js unzip logic...

@madler
Copy link

madler commented Mar 13, 2019

As an example, the Python zlib decompressobj object has an eof() function to indicate the end of the deflate stream, and an unused_data() function to return any bytes that were read but not consumed by the inflator after the end of stream.

@kingces95
Copy link
Author

Ah, so if I follow, node could introduce a 'eof' event with an unusedBytes argument that's a buffer of the discarded bytes. The 'eof' event would fire after the last 'data' event. It couldn't implicitly end the stream because that would be breaking but that's ok. The user can just call end() themselves in the 'eof' handler.

Could also deprecate bytesWritten as it could be computed as bytesRead - discarded.length.

Yeah. That'd be pretty.

@addaleax
Copy link
Member

Although, are we sure that is the right fix? I'm not an expert, but are we sure that every input byte will result in output bytes?

No, but it doesn’t have to, either; that’s not what the check is testing. It’s ending the stream when zlib has stopped accepting input and refuses to let avail_in decrease.

Shouldn't we shuttle the err_ value up from the inflate() call (say, as a third element in the _writeState array) instead of inferring it's value from having space left in the output buffer if there's still space left in the input buffer?

I’ve tried that in a first attempt to solve the previous issue that you posted, but it turned out to be a bit icky as a way to end the stream. I don’t exactly remember why.

Take this case: If I write bytes one at a time them I'm forced to write one byte past the end of the deflate stream before the push(null) call. So (and even tho I may have suggested that as a way to detect the EOF) that fix feels sketchy.

Yeah, that is a bit unfortunate indeed.

What does push(null) do?

It ends the readable side of the stream; after all data has been read from zlib, the 'end' event will be emitted.

@kingces95
Copy link
Author

kingces95 commented Mar 13, 2019

Ah! I was looking for the 'finish' event. Noob. I see the 'end' event!

So this fix rests on the fact that inflate() will refuse input once it's returned Z_STREAM_END, not that it hasn't generated output given input.

I think I found the ending the stream given err_ to be icky too because it prevented the last data call from being made...

And while an 'eof' event with an argument containing a buffer with the discarded bytes would be nice, and not having to write a byte past the last byte in the deflate would also be nice, the fix has the benefit of being economical, introduces no new surface area, and won't break anything. And those other features could be emulated. And practically speaking, we'll always write a byte past the end of the stream. Tho we'd have to write another 64k chunk to see that avail_in didn't shrink, right? So the price we pay is we're buffering a chunk we didn't need to. Eh.

Was that more or less the thinking?

@kingces95
Copy link
Author

And for those who find this later, here is my test case!

var zlib = require('zlib');
var assert = require('assert');

const input = Buffer.from('0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789');

var log = console.log.bind(console);

zlib.deflateRaw(input, (err, deflatedBuffer) => {
  assert(!err);

  var bytesRead = 0;
  var buffers = [];

  var stream = zlib.createInflateRaw();
  var position = 0;

  stream.on('data', function(chunk) {
    buffers.push(chunk);
    bytesRead += chunk.length;
    log('decompressed:', 'written', stream.bytesWritten, 'read', bytesRead);
  })

  stream.on('end', function done() {
    var result = Buffer.concat(buffers, bytesRead);
    this.close();
    log('result:', result.toString());
    log('chunks:', buffers.length);
    log('bytesWritten:', stream.bytesWritten);
    log('numberRead:', bytesRead);
    log('discarded:', position - stream.bytesWritten);
  });

  log('download: started')
  stream.write(deflatedBuffer.slice(0, 5)); position += 5;
  stream.write(deflatedBuffer.slice(5, 10)); position += 5;
  stream.write(deflatedBuffer.slice(10)); position += deflatedBuffer.length - 10;
  stream.write(Buffer.from([1,2,3,4,5,6])); position += 6; // discarded

  log('download: delayed')
  setTimeout(() => {
    log('download: resumed')
    log(stream.write(Buffer.from([1,2,3,4,5,6]))); // rejected
    log('download: finished')
    stream.end();
  }, 1000);
});

@kingces95
Copy link
Author

Wow. So excited to give this a spin. Will re-open if I get stuck but I think I've got what I need. Yea node!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants