
Compute and store file checksums on the fly / file integrity #11811

Closed
PVince81 opened this issue Oct 28, 2014 · 42 comments

@PVince81
Contributor

To guarantee file integrity between the server and the sync client, we can compute a checksum based on the file contents. The checksum can be calculated on the fly using a rolling checksum function, for example using hash_update: http://php.net/manual/en/function.hash-update.php

The hash can then be used to compare file versions and detect conflicts or broken files, broken transfers, etc...
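
To make the rolling idea concrete, here is a minimal sketch using PHP's incremental hashing API (hash_init / hash_update / hash_final); the helper name and chunk size are illustrative assumptions, not existing ownCloud code:

<?php
// Sketch: compute the checksum incrementally while streaming an upload to its
// destination, so the content is read only once and never held fully in memory.
function streamWithChecksum($source, $target, string $algo = 'sha1'): string {
    $ctx = hash_init($algo);
    while (!feof($source)) {
        $chunk = fread($source, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk);   // rolling / incremental hash update
        fwrite($target, $chunk);     // write-through to the target storage
    }
    return hash_final($ctx);         // hex digest, e.g. stored in the file cache
}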

The only challenge I see so far is with external storage, whenever a file is changed remotely. In this case, if the scanner detects a change, the hash could be marked as dirty and recomputed on the next occasion. The least efficient method would require the scanner to download the file for the sole purpose of computing the hash, which is slow.

Also, should the hash be used for the etag as well?

Let's discuss @dragotin @icewind1991 @DeepDiver1975 @karlitschek @craigpg @MTRichards

@DeepDiver1975
Member

Also, should the hash be used for the etag as well?

yes - for sure. Using a hash over the file content is the most common approach to generating the etag.

We have to solve the external storage issue first.

@PVince81
Contributor Author

I'd suggest letting the file scanner simply clear the checksum whenever a remote change is detected (through the etag or otherwise). Then the next file access (for example a download) will auto-compute the checksum. The trouble is mostly with random file access via fseek(), but for most storages a fopen() will already download the whole file to a temp dir instead of streaming, except for SFTP.
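
A minimal sketch of that flow, assuming a hypothetical checksum entry in the file cache (the array-based cache and function names below are purely illustrative, not the actual ownCloud scanner/cache API):

<?php
// Sketch only: when the scanner detects a remote change (etag differs),
// drop the stored checksum so it is treated as unknown.
function markChecksumDirty(array &$cache, string $path): void {
    $cache[$path]['checksum'] = null;
}

// On the next full read (e.g. a download), recompute the checksum while
// streaming the content to the client, so no extra read pass is needed.
function streamAndRecomputeChecksum(array &$cache, $stream, string $path): void {
    $ctx = hash_init('sha1');
    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk);
        echo $chunk;                 // pass the data through to the client
    }
    $cache[$path]['checksum'] = hash_final($ctx);
}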

@dragotin
Contributor

Is there a common interface that each external storage has to implement? Some conceptual docs somewhere?

@karlitschek
Contributor

This would have a huge performance impact, as discussed several times. I would love to understand the real-life problem that we could solve with this before we try to implement something like that.

@PVince81
Contributor Author

@karlitschek here's one example: #10975

We still haven't managed to get to the bottom of that bug. But having checksums would at least prevent such bugs from causing file corruption.

@PVince81
Contributor Author

I'd like to hear the arguments about the performance impact again.
Does computing a rolling hash (on the fly) imply a huge performance hit?

@PVince81
Contributor Author

"hear" as in "document them here" 😉

@karlitschek
Contributor

Isn't it obvious? ;-) Calculating huge hashes like that has a huge performance impact and totally kills sync performance. Especially if you are talking about a lot of users/files and big files.

@PVince81
Contributor Author

It isn't obvious as I'm not familiar with how much computation power is required for that.
But we could run some tests on a slow server and see how big the impact is.

So far I thought that the performance argument was mostly because we weren't aware of rolling hashes (hash_update).

Does anyone have experience with this?

@icewind1991
Contributor

hash_update does not reduce the performance impact of the hash calculation, it only removes the need to have the full file stored in memory or on disk.

The performance penalty can be "fixed" though by moving hash calculation to a background job as long as the hash isn't needed quickly after upload (i.e. the sync client checking the hash immediately after upload)
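
As a sketch of that background-job idea (generic PHP rather than ownCloud's actual job API; the class and cache shape here are hypothetical):

<?php
// Sketch only: a generic queued job, not ownCloud's real background job API.
// The upload handler would enqueue the path; a cron worker runs the job later.
class ComputeChecksumJob {
    public function __construct(private string $localPath) {}

    public function run(array &$cache): void {
        // hash_file() streams the file internally, so memory use stays flat,
        // but the full read I/O still happens -- just outside the upload request.
        $checksum = hash_file('sha1', $this->localPath);
        if ($checksum !== false) {
            $cache[$this->localPath]['checksum'] = $checksum;
        }
    }
}

Note that this only shifts the cost in time; the full read I/O still has to happen somewhere.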

@karlitschek
Contributor

the performance impact doesn't go away by putting it into a background job. Please think about a big setup where you have a lot of users and files and the server is constantly under load.
Additionally, the question is whether the system behaves correctly if the hashes are not yet calculated. So this doesn't solve anything.
And in the scenario where you have a lot of files on a slow external storage, you would have to load everything to calculate the hashes. Just think about a scenario where you have petabytes of data, like some ownCloud users have.

@dragotin
Contributor

The actual hash calculation is not what causes the performance impact nowadays. The IO operation of reading the whole file is the problem, and that goes away if you do rolling hashes.

To give a concrete example of a 2.3 MB file:

1d [kf:~/owncloud.com/mirall] 1.7+ ± time md5sum ~/xxx.log
0f21c2819c8caf09cc6004307b1bc8ad  /home/kf/xxx.log

real    0m0.045s
user    0m0.006s
sys     0m0.001s

and I am not on an up-to-date CPU, and this was not a rolling hash.

Yes, we can discuss this on and on, but we should try to get facts like these:

  • Rolling hash sum calculation would not be a problem
  • The implementation on client and server would have to be able to deal with files that have checksums, files that do not, and situations where they cannot be calculated.

Furthermore we should try to get real information about what the users have:

  • How many files?
  • Which file types?
  • Size histogram
  • How much on external storage, and which external storages
  • ...

Maybe we can provide a script that collects this kind of data and ask users to run it and send us the results. That would help.
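
A rough sketch of what such a collection script could look like (hypothetical, standalone PHP; the size buckets and output format are just assumptions to make the idea concrete):

<?php
// Sketch only: collect anonymous file statistics (count, extensions, size
// histogram) from a data directory, to inform the discussion above.
$root = $argv[1] ?? '/var/www/owncloud/data';
$count = 0;
$byExtension = [];
$sizeBuckets = ['<1MB' => 0, '1-100MB' => 0, '100MB-1GB' => 0, '>1GB' => 0];

$iter = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
);
foreach ($iter as $file) {
    if (!$file->isFile()) {
        continue;
    }
    $count++;
    $ext = strtolower($file->getExtension() ?: '(none)');
    $byExtension[$ext] = ($byExtension[$ext] ?? 0) + 1;
    $size = $file->getSize();
    if ($size < 1e6)     { $sizeBuckets['<1MB']++; }
    elseif ($size < 1e8) { $sizeBuckets['1-100MB']++; }
    elseif ($size < 1e9) { $sizeBuckets['100MB-1GB']++; }
    else                 { $sizeBuckets['>1GB']++; }
}

echo "Total files: $count\n";
print_r($sizeBuckets);
arsort($byExtension);
print_r(array_slice($byExtension, 0, 20, true)); // 20 most common extensions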

@icewind1991
Contributor

I've done some measurements for doing a rolling md5 hash while uploading over WebDAV:

Filesize    without hash    with hash
~200KB      260ms           370ms
~130MB      580ms           833ms
~2GB        11s             13s

@PVince81
Contributor Author

Thanks @icewind1991.

Were these small files or big files?
That does seem like a significant performance impact.

I'm still a bit worried about cases like data corruption from this guy here: #10975
He "fixed" it by installing files_locking. But who knows what other hidden bugs (race conditions?) could cause files to be corrupted ?

I guess the other solution is then to invest in the high-level file locking in #11804, which might have a higher priority than the checksum idea.

@LukasReschke
Member

Erm. The filesize is in the table? ;-)

@PVince81

@PVince81
Contributor Author

Ah sorry, I thought it was a group of files that added up to that size. Thanks for the clarification.

@PVince81
Contributor Author

Hmm yet another corruption issue: owncloud/client#2425

@etiess

etiess commented Nov 3, 2014

Another argument for hash computation: it could solve a lot of redownload / reupload issues. We have seen a lot of examples in the past (just one famous example here: owncloud/client#994). And we're still seeing new ones today with the latest version (one example here: owncloud/client#2431), with massive reuploads. It's highly probable that we'll have similar issues again in the future.

Concerning doing the computation as a background job, is it really a common situation that the server is "constantly under load"? (#11811 (comment)) Moreover, the server is already performing a lot of tasks (thumbnail generation, for example) which are, to me, much less important than guaranteeing data integrity and a correct syncing process.

@karlitschek
Contributor

Redownloads are triggered by etag handling bugs on the server or client. The right fix is to fix these bugs.

@etiess

etiess commented Nov 3, 2014

@karlitschek I agree: calculating hashes would only be a workaround to prevent the consequences of these bugs, which need to be fixed. But these kinds of bugs do happen, and they likely will again. My point is that it would spare the user from suffering from these bugs.

@PVince81
Contributor Author

PVince81 commented Nov 3, 2014

Here is another example of a bug where a checksum check would prevent the user from having files downloaded in their encrypted form: #11908

If the trouble is mostly about performance then maybe it should be optional and disabled by default.
People who want an extra level of robustness could enable it but need to be aware of the performance penalty / extra required resources.

I wonder if it would be possible to use hardware-accelerated hashing if supported by the CPU and PHP.

@etiess

etiess commented Nov 3, 2014

@PVince81 If you haven't read it yet, the contributions in #523 (particularly @zatricky's) are full of information. @zatricky, this issue may be of interest to you.

@zatricky

zatricky commented Nov 4, 2014

Apologies if I seem negative - #523 clearly shows some frustration on my end. ownCloud devs have different priorities than I do, so I moved on. owncloud/client#230 (comment)

The primary concern against hashing has been the case of slow or costly remote storage on ownCloud server infrastructure (Amazon S3/etc). I understand this concern - but I feel it is naive. @PVince81, your suggestion bypasses that concern completely - so maybe there are no excuses now for this to be fixed (finally).

@dragotin
Contributor

dragotin commented Nov 6, 2014

Real-life problems that would become possible to solve: de-duplication, LAN sync, incremental sync. Lots of people ask for those. That is why I think we should start thinking about this.

@moscicki

@karlitschek, @freitag: corruption may happen on the wire or on the storage for whatever reason -- and not necessarily because of a bug in an ownCloud component (but thinking that all bugs can be squashed is a bit naive nonetheless). How much corruption there will be is just a matter of scale. For a single-user deployment it does not matter, but for a large system - definitely yes! For example, many of your customers run with NFS async mode - and guess what happens when the server reboots, runs out of memory, or hits a RAM error. Another example: some people have seen really strange bit flips on dual-stack IPv4/IPv6 networks, and these guys handle large storage systems, so I tend to take their input seriously.

Therefore I would very strongly advocate adding checksums to the sync protocol between the client and the server (end-to-end). You may then leave it up to the server whether to implement it, and to the client to discover whether the server supports it.

About hashes and performance impact: there are hashes other than md5sum which are much less expensive to compute and have nice properties, e.g. you can hash chunks independently. These issues have been extensively analyzed and tested by many storage experts, and ownCloud can use existing research to do something smart. It's better to have less reliable but fast hashes than no hashes at all!
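
To make the chunk-wise idea concrete, here is a small sketch (the algorithm choice, chunk size, and function name are assumptions for illustration, not a protocol proposal):

<?php
// Sketch only: hash fixed-size chunks independently with a cheap algorithm
// (adler32, available in PHP's hash extension), plus one whole-file checksum.
// Per-chunk hashes would allow comparing or re-transferring only changed chunks.
function chunkedChecksums(string $path, int $chunkSize = 10 * 1024 * 1024): array {
    $chunks = [];
    $total = hash_init('sha1');
    $fh = fopen($path, 'rb');
    while (!feof($fh)) {
        // For local files fread() returns the full requested length;
        // network streams may need an accumulation loop instead.
        $data = fread($fh, $chunkSize);
        if ($data === false || $data === '') {
            break;
        }
        $chunks[] = hash('adler32', $data);  // cheap per-chunk checksum
        hash_update($total, $data);          // stronger whole-file checksum
    }
    fclose($fh);
    return ['chunks' => $chunks, 'total' => hash_final($total)];
}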

@craigpg

craigpg commented Nov 15, 2014

With all of the code that runs each time a file is updated, including some rather hefty apps (e.g. encryption), I'm not convinced that computing a hash on the fly would be a performance killer. Besides, the extra overhead should be easily measured, as Icewind has already started, so we can make an informed decision.

Looking at Icewind's numbers, I'd like to understand total clock time for adding those files. For example, 11 secs to add a 2GB file through WebDAV seems pretty fast. ;-)

The benefits of the checksums have already been captured above, so I won't repeat them here. I will say that having an eTag based on file contents seems like a huge win since it should further prevent needless transfers between our clients and server.

One last thought that is probably obvious but hasn't been mentioned: the additional overhead (which we still need to quantify) would only be incurred when writing a file, not reading it. I suspect that most of our WebDAV / sync requests are GETs (not PUTs), which further limits the overall impact of the overhead.

@guettli

guettli commented Mar 25, 2015

@moscicki Thank you for your proposal from 22 Nov 2014. I am missing an explanation of why this cannot be used for ETags. Please be explicit. Checksums can be used for integrity and to avoid unneeded duplicate file transfers. Please update the proposal to explain why this is not related to ETags. Thank you.

@zatricky

@guettli - I don't think anybody has explicitly said that it cannot be used as the etag or that it cannot be a part of the etag. So far all comments are along the lines of "Should it be used?", "It can be used", and "the bugs giving you concern (that you think this will help fix) should be fixed first".

@zatricky

@guettli - The proposal merely states that it has nothing to do with etags - not that checksums cannot be used.

@dragotin
Contributor

related #12417

@guruz
Contributor

guruz commented Oct 15, 2015

We discussed this some more and, for now, want the simpler solution in #18716.

@PVince81 PVince81 added this to the backlog milestone Mar 1, 2016
@dragotin
Contributor

dragotin commented Mar 9, 2016

#18716 was closed; however, the current implementation lacks verification of the checksum on the server. So does it make sense to track that requirement here?

@PVince81
Contributor Author

PVince81 commented Mar 9, 2016

Yeah, let's keep this ticket open

@PVince81
Contributor Author

Closing in favor of #26655 with more detailed tasks

@lock

lock bot commented Aug 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 3, 2019