[Feature] Consider verifying checksum of Openstack Storage upload #236
Conversation
Looks like we can't hash a file without storing it in memory or on disk, which means we'd have to write some code to create a temp file store, and that has problems as well. Not sure I'm in love with that. In memory is a non-starter, as most users are running node on limited-memory VMs. The hash itself should be pretty straightforward if we have the file on disk or in memory.
I believe we could calculate the hash without temporarily storing the file using something like jeffbski/digest-stream, which calculates the hash of a stream. My understanding of MD5 is that it only needs to buffer a 512-bit block of data before it can update its state, discard the data, and wait for the next chunk.
Are we sure that's not buffering? https://github.com/jeffbski/digest-stream/blob/master/lib/digest-stream.js#L28
@kenperkins it flushes the buffer on each digest when the end function is called: https://github.com/jeffbski/pass-stream/blob/master/lib/pass-stream.js#L46-L49
If so, let's write up a prototype/branch :) @rossj, are you on it?
@kenperkins I'm pretty sure that's the case. I'll set up a branch and try some things out.
…t returned ETag header. For feature pkgcloud#236.
```js
var container = options.container,
    success = callback ? onUpload : null,
```
Perhaps @indexzero should have put a comment here, but this was very deliberate: if you pass a callback into mikeal/request, it will buffer the entire contents of the stream into memory.
That's why we had the code set up to check that if the caller provided no callback, we don't pass one down the chain.
You are correct here @kenperkins. @rossj we want to preserve that check.
@kenperkins @jcrugzz I'm pretty confident that mikeal/request does not buffer outgoing data, even if a callback is provided. While #195 mentions both upload and download, the 2 request issues that it references (request/request#639 and request/request#477) only refer to download buffering being a problem. Case in point, I uploaded a 2.8 GB file without issue, and the process's memory usage hovered between 90-160 MB the whole time (comparable to the current 0.9.0 branch).

Currently the code always passes a callback down the chain to request so that it can do the hash check on the response. If this is still a concern, I can probably do this in request.on('response') instead.
@rossj is correct. Buffering only occurs for downloads. See: https://github.com/mikeal/request/blob/master/request.js#L843-L890
Thanks for the followup @rossj. So it looks like we can pass a callback every time on upload.
I finally had a chance to get this branch evaluated on my local machine. Props for getting tests running. I have a few comments that I'll inline.
```js
// The returned checksum does not match what was calculated on client side, remove file
self.removeFile(container, options.remote, function (err) {
  if (err)
```
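The check-and-delete step quoted above could be sketched in isolation like this (everything here is hypothetical illustration; only the `removeFile`, `container`, and `remote` names mirror the diff):

```js
function verifyOrRemove(client, container, remote, localMd5, etag, callback) {
  if (etag === localMd5) {
    return callback(null); // checksums agree, the upload is good
  }
  // Checksums differ: delete the corrupt object so callers never read it
  client.removeFile(container, remote, function (err) {
    callback(err || new Error('MD5 mismatch for ' + remote));
  });
}
```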
This feels brittle to me. Obviously our service should be highly available, but the fact that we upload the file, then check and delete, feels inverted.
I guess there's not much else we can do if you want to avoid buffering the entire file locally to generate the hash as part of the inbound request.
@rossj I think what we'd like to do is close this PR, hoping instead to leverage this code as an example on top of pkgcloud. There are a couple of reasons for it.
Would you be willing to rework this code into an example that we could include in the openstack storage examples?
Hey @kenperkins, I think this makes sense. There are some definite edge cases where there isn't a clear path forward, and those cases are probably best left to the user's implementation. What exactly do you mean by "partition after upload" in your first point? Would CloudFiles potentially move or split the file after upload? I'll take a look at the examples and try to work this in.
Currently, it is up to a user to ensure data integrity by sending an `ETag` header during an upload or checking it in the response. Is there any interest in moving this checking functionality into `pkgcloud`? Using a through stream, we could calculate the checksum during upload and compare at the end. Questions I still have are: