Compute and store file checksums on the fly / file integrity #11811
Yes - for sure. Using a hash over the file content is the most common approach to generate the etag. We have to solve the external storage issue first.
I'd suggest letting the file scanner simply clear the checksum whenever a remote change is detected (through the etag or otherwise). Then the next file access (for example a download) will auto-compute the checksum. The trouble is mostly with random file access.
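A rough sketch of that invalidation idea (the cache-entry structure and function names here are hypothetical, not the actual ownCloud scanner API): the scanner only clears the stored checksum when it notices a remote change, and the expensive recomputation happens lazily on the next full read.

```php
<?php
// Hypothetical sketch: invalidate the checksum cheaply, recompute it lazily.
function onRemoteChangeDetected(array &$cacheEntry, $newEtag) {
    if ($cacheEntry['etag'] !== $newEtag) {
        $cacheEntry['etag'] = $newEtag;
        $cacheEntry['checksum'] = null; // mark dirty; do not download just to hash
    }
}

function checksumOnNextAccess(array &$cacheEntry, $stream) {
    if ($cacheEntry['checksum'] === null) {
        $ctx = hash_init('md5');
        while (!feof($stream)) {
            $chunk = fread($stream, 8192);
            if ($chunk === false || $chunk === '') {
                break;
            }
            // in a real download handler the chunk would also be sent to the client
            hash_update($ctx, $chunk);
        }
        $cacheEntry['checksum'] = hash_final($ctx);
    }
    return $cacheEntry['checksum'];
}
```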
Is there a common interface that each external storage has to implement? Are there conceptual docs somewhere?
This would have a huge performance impact, as discussed several times. I would love to understand the real-life problem that we could solve with this before we try to implement something like that.
@karlitschek here's one example: #10975. We still didn't manage to get to the bottom of that bug, but having checksums would at least prevent such bugs from causing file corruption.
I'd like to hear the arguments about the performance impact again.
"hear" as in "document them here" 😉
Isn't it obvious? ;-) Calculating huge hashes like that has a huge performance impact and totally kills sync performance, especially if you talk about a lot of users/files and big files.
It isn't obvious, as I'm not familiar with how much computing power is required for that. So far I thought the performance argument was mostly because we weren't aware of rolling hashes. Does anyone have experience with this?
The performance penalty can be "fixed" though by moving hash calculation to a background job, as long as the hash isn't needed quickly after upload (i.e. the sync client checking the hash immediately after upload).
The performance impact doesn't go away by putting it into a background job. Please think about a big setup where you have a lot of users and files and the server is constantly under load.
The actual hash calculation is not what causes the performance impact nowadays. The IO operation to read the whole file is the problem, and that is what you avoid with rolling hashes. To give a concrete example with a 2.3 MB file:
and I am not on an up-to-date CPU, and this was not a rolling hash. Yes, we can discuss this on and on, but we should try to get facts like this:
Furthermore we should try to get real information about what the users have:
Maybe we can provide a script that collects this kind of data and ask users to run it and send us the results. That would help.
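A rough idea of what such a data-collection script could look like (plain PHP, entirely hypothetical; which algorithms to time and how to report the results would still need to be decided):

```php
<?php
// Hypothetical measurement script: hash every file under a directory and
// report how many files there are, their total size and the hashing cost,
// so admins can share real-world numbers.
$dir = isset($argv[1]) ? $argv[1] : '.';
$algos = array('md5', 'sha1');

$count = 0;
$totalBytes = 0;
$times = array_fill_keys($algos, 0.0);

$iterator = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir));
foreach ($iterator as $file) {
    if (!$file->isFile()) {
        continue;
    }
    $count++;
    $totalBytes += $file->getSize();
    foreach ($algos as $algo) {
        $start = microtime(true);
        hash_file($algo, $file->getPathname());
        $times[$algo] += microtime(true) - $start;
    }
}

printf("%d files, %.1f MB total\n", $count, $totalBytes / 1000000);
foreach ($times as $algo => $seconds) {
    printf("%s: %.2f s\n", $algo, $seconds);
}
```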
I've done some measurements for doing a rolling md5 hash while uploading over WebDAV:
[table of measurement results (file size vs. upload time) omitted]
Thanks @icewind1991. Were these small files or big files? I'm still a bit worried about cases like the data corruption from this guy here: #10975. I guess the other solution is to invest in the high-level file locking (#11804), which might have a higher priority than the checksum idea.
Erm. The filesize is in the table? ;-)
Ah sorry, I thought it was a group of files that added up to that size. Thanks for the clarification.
Hmm, yet another corruption issue: owncloud/client#2425
Another argument for hash computation: it could solve a lot of redownload / reupload issues. We have seen a lot of examples in the past (only one famous example here: owncloud/client#994), and we're still having new ones with the latest version today (one example with massive reuploads here: owncloud/client#2431). It's highly probable that we'll have similar new issues in the future. Concerning doing the computation as a background job, is it really a common situation that the server is "constantly under load" (#11811 (comment))? Moreover, the server already performs a lot of tasks (thumbnail generation, for example) which are, to me, much less important than guaranteeing data integrity and a correct syncing process.
Redownloads are triggered by etag handling bugs on the server or client. The right fix is to fix those bugs.
@karlitschek I agree: calculating hashes would only be a workaround to prevent the consequences of these bugs, which need to be fixed. But these kinds of bugs do happen, and they likely will again. My point is that it would spare the user from suffering from them.
Here is another example of a bug where a checksum check would have avoided the user having files downloaded in their encrypted form: #11908. If the trouble is mostly about performance, then maybe it should be optional and disabled by default. I wonder if it would be possible to use hardware-accelerated hashes if supported by the CPU and PHP.
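If it were made optional, the server could also pick the cheapest algorithm the local PHP build offers. A small sketch, assuming a preference list that is purely illustrative (xxh3 only exists in the hash extension from PHP 8.1, crc32c from PHP 7.4):

```php
<?php
// Hypothetical helper: prefer a cheap non-cryptographic hash when the PHP
// build provides one, otherwise fall back to md5.
function pickChecksumAlgorithm() {
    foreach (array('xxh3', 'crc32c', 'md5') as $algo) {
        if (in_array($algo, hash_algos(), true)) {
            return $algo;
        }
    }
    return 'md5';
}
```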
Apologies if I seem negative - #523 clearly shows some frustration on my end. ownCloud devs have different priorities from mine, so I moved on: owncloud/client#230 (comment). The primary concern against hashing has been the case of slow or costly remote storage behind the ownCloud server (Amazon S3, etc.). I understand this concern, but I feel it is naive. @PVince81, your suggestion bypasses that concern completely, so maybe there are no excuses left and this can finally be fixed.
Real-life problems that would become solvable: de-duplication, LAN sync, incremental sync. Lots of people ask for those. That is why I think we should start thinking about this.
@karlitschek, @freitag: corruption may happen on the wire or on the storage for whatever reason -- and not necessarily because of a bug in an ownCloud component (though thinking that all bugs can be squashed is a bit naive nonetheless). How much corruption there will be is just a matter of scale. For a single-user deployment it does not matter, but for a large system - definitely yes! For example, many of your customers run with NFS async mode - and guess what happens when the server reboots, runs out of memory or hits a RAM error. Another example: some people have seen really strange bit flips on dual-stack IPv4/IPv6 networks, and these people run large storage systems, so I tend to take their input seriously. Therefore I would very strongly advocate adding checksums to the sync protocol between the client and the server (end-to-end). You may then leave it up to the server to implement it or not, and to the client to discover whether the server supports it. About hashes and performance impact: there are other hashes than md5sum which are much less expensive to compute and have nice properties, such as the ability to hash chunks independently. These issues have been extensively analyzed and tested by many storage experts on the planet, and ownCloud can use existing research to do something smart. It's better to have less reliable but fast hashes than no hashes at all!
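One of the properties mentioned here - hashing chunks independently - could look roughly like this (a sketch only, not a specific standardized tree hash; chunk size and algorithm are arbitrary):

```php
<?php
// Hypothetical sketch of chunk-wise hashing: each fixed-size chunk gets its own
// digest, and the file-level checksum is a hash over the concatenated chunk
// digests. Chunks can be hashed independently (e.g. as they arrive in a chunked
// upload), and a single corrupted chunk can be located and re-transferred alone.
function chunkedChecksum($path, $chunkSize = 1048576, $algo = 'md5') {
    $chunkHashes = array();
    $handle = fopen($path, 'rb');
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $chunkHashes[] = hash($algo, $chunk);
    }
    fclose($handle);

    return array(
        'chunks' => $chunkHashes,
        'file'   => hash($algo, implode('', $chunkHashes)),
    );
}
```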
With all of the code that runs each time a file is updated, including some rather hefty apps (e.g. encryption), I'm not convinced that computing a hash on the fly would be a performance killer. Besides, the extra overhead should be easy to measure, as Icewind has already started doing, so we can make an informed decision. Looking at Icewind's numbers, I'd like to understand the total clock time for adding those files; for example, 11 secs to add a 2 GB file through WebDAV seems pretty fast. ;-) The benefits of the checksums have already been captured above, so I won't repeat them here. I will say that having an etag based on file contents seems like a huge win, since it should further prevent needless transfers between our clients and server. One last thought that is probably obvious but hasn't been mentioned: the additional overhead (which we still need to quantify) would only be incurred when writing a file, not reading it. I suspect that most of our WebDAV / sync requests are GETs (not PUTs), which further limits the overall impact of the overhead.
@moscicki Thank you for your proposal from 22 Nov 2014. I am missing an explanation of why this cannot be used for etags. Please be explicit. Checksums can be used for integrity and to avoid unneeded duplicate file transfers. Please update the proposal to explain why this is not related to etags. Thank you.
@guettli - I don't think anybody has explicitly said that it cannot be used as the etag or that it cannot be a part of the etag. So far all comments are along the lines of "Should it be used?", "It can be used", and "the bugs giving you concern (that you think this will help fix) should be fixed first".
@guettli - the proposal merely states that it has nothing to do with etags - not that checksums cannot be used as one.
Related: #12417
We discussed this some more and, for now, want the simpler solution in #18716.
#18716 was closed; however, the current implementation lacks verification of the checksum on the server. Does it make sense to track that requirement here?
Yeah, let's keep this ticket open.
Closing in favor of #26655, which has more detailed tasks.
To guarantee file integrity between the server and the sync client, we can compute a checksum based on the file contents. The checksum can be calculated on the fly using a rolling checksum function, for example using hash_update (http://php.net/manual/en/function.hash-update.php). The hash can then be used to compare file versions and detect conflicts, broken files, broken transfers, etc.
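A minimal sketch of how that on-the-fly computation could look (the function below is hypothetical, not an existing ownCloud API): the uploaded stream is hashed chunk by chunk with hash_update() while it is written to storage, so no separate read pass over the file is needed.

```php
<?php
// Hypothetical sketch: write an uploaded stream to its target while updating
// a hash context with every chunk, so the checksum comes for free with the write.
function writeWithChecksum($sourceStream, $targetPath, $algo = 'md5') {
    $ctx = hash_init($algo);
    $target = fopen($targetPath, 'wb');

    while (!feof($sourceStream)) {
        $chunk = fread($sourceStream, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk); // roll the hash forward with this chunk
        fwrite($target, $chunk);   // and write the same chunk to storage
    }
    fclose($target);

    return hash_final($ctx); // store next to size/mtime/etag in the file cache
}
```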
The only challenge I see so far is external storage, whenever a file is changed remotely. In this case, if the scanner detects a change, the hash could be marked as dirty and recomputed on the next occasion. The least efficient method would require the scanner to download the file for the sole purpose of computing the hash, which is slow.
Also, should the hash be used for the etag as well?
Let's discuss @dragotin @icewind1991 @DeepDiver1975 @karlitschek @craigpg @MTRichards