
Compute and store file checksums on the fly / file integrity #11811

Closed
PVince81 opened this issue Oct 28, 2014 · 42 comments

@PVince81
Contributor

To guarantee file integrity between the server and the sync client, we can compute a checksum based on the file contents. The checksum can be calculated on the fly using a rolling checksum function, for example using hash_update: http://php.net/manual/en/function.hash-update.php

The hash can then be used to compare file versions and detect conflicts or broken files, broken transfers, etc...
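
To make the rolling idea concrete, here is a minimal sketch using PHP's incremental hashing API (hash_init / hash_update / hash_final); the helper name and chunk size are illustrative assumptions, not existing ownCloud code:

<?php
// Sketch: compute the checksum incrementally while streaming an upload to its
// destination, so the content is read only once and never held fully in memory.
function streamWithChecksum($source, $target, string $algo = 'sha1'): string {
    $ctx = hash_init($algo);
    while (!feof($source)) {
        $chunk = fread($source, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk);   // rolling / incremental hash update
        fwrite($target, $chunk);     // write-through to the target storage
    }
    return hash_final($ctx);         // hex digest, e.g. stored in the file cache
}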

The only challenge I see so far is with external storage, whenever a file is changed remotely. In this case, if the scanner detects a change, the hash could be marked as dirty and recomputed on the next occasion. The least efficient method would require the scanner to download the file for the sole purpose of computing the hash, which is slow.

Also, should the hash be used for the etag as well?

Let's discuss @dragotin @icewind1991 @DeepDiver1975 @karlitschek @craigpg @MTRichards

@DeepDiver1975
Member

Also, should the hash be used for the etag as well?

yes - for sure. Using a hash over the file content is the most common approach to generating the etag.

We have to solve the external storage issue first.

@PVince81
Contributor Author

I'd suggest letting the file scanner simply clear the checksum whenever a remote change is detected (through the etag or otherwise). Then the next file access (for example a download) will auto-compute the checksum. The trouble is mostly with random file access via fseek(), but for most storages a fopen() will already download the whole file to a temp dir instead of streaming, except for SFTP.
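
A minimal sketch of that flow, assuming a hypothetical checksum entry in the file cache (the array-based cache and function names below are purely illustrative, not the actual ownCloud scanner/cache API):

<?php
// Sketch only: when the scanner detects a remote change (etag differs),
// drop the stored checksum so it is treated as unknown.
function markChecksumDirty(array &$cache, string $path): void {
    $cache[$path]['checksum'] = null;
}

// On the next full read (e.g. a download), recompute the checksum while
// streaming the content to the client, so no extra read pass is needed.
function streamAndRecomputeChecksum(array &$cache, $stream, string $path): void {
    $ctx = hash_init('sha1');
    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk);
        echo $chunk;                 // pass the data through to the client
    }
    $cache[$path]['checksum'] = hash_final($ctx);
}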

@dragotin
Contributor

Is there a common interface that each external storage has to implement? Some conceptual docs somewhere?

@karlitschek
Contributor

This would have a huge performance impact, as discussed several times. I would love to understand the real-life problem that we could solve with this before we try to implement something like that.

@PVince81
Contributor Author

@karlitschek here's one example: #10975

We still haven't managed to get to the bottom of that bug. But having checksums would at least prevent such bugs from causing file corruption.

@PVince81
Contributor Author

I'd like to hear the arguments about the performance impact again.
Does computing a rolling hash (on the fly) imply a huge performance hit?

@PVince81
Contributor Author

"hear" as in "document them here" 😉

@karlitschek
Contributor

Isn't it obvious? ;-) Calculating huge hashes like that has a huge performance impact and totally kills sync performance. Especially if you are talking about a lot of users/files and big files.

@PVince81
Contributor Author

It isn't obvious as I'm not familiar with how much computation power is required for that.
But we could run some tests on a slow server and see how big the impact is.

So far I thought that the performance argument was mostly because we weren't aware of rolling hashes (hash_update).

Does anyone have experience with this?

@icewind1991
Contributor

hash_update does not reduce the performance impact of the hash calculation, it only removes the need to have the full file stored in memory or on disk.

The performance penalty can be "fixed" though by moving hash calculation to a background job as long as the hash isn't needed quickly after upload (i.e. the sync client checking the hash immediately after upload)
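
As a sketch of that background-job idea (generic PHP rather than ownCloud's actual job API; the class and cache shape here are hypothetical):

<?php
// Sketch only: a generic queued job, not ownCloud's real background job API.
// The upload handler would enqueue the path; a cron worker runs the job later.
class ComputeChecksumJob {
    public function __construct(private string $localPath) {}

    public function run(array &$cache): void {
        // hash_file() streams the file internally, so memory use stays flat,
        // but the full read I/O still happens -- just outside the upload request.
        $checksum = hash_file('sha1', $this->localPath);
        if ($checksum !== false) {
            $cache[$this->localPath]['checksum'] = $checksum;
        }
    }
}

Note that this only shifts the cost in time; the full read I/O still has to happen somewhere.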

@karlitschek
Contributor

the performance impact doesn't go away by putting it into a background job. Please think about a big setup where you have a lot of users and files and the server is constantly under load.
Additionally, the question is whether the system behaves correctly if the hashes are not yet calculated. So this doesn't solve anything.
And in the scenario where you have a lot of files on a slow external storage, you would have to load everything to calculate the hashes. Just think about a scenario where you have petabytes of data, like some ownCloud users have.

@dragotin
Contributor

The actual hash calculation is not what causes the performance impact nowadays. The IO operation of reading the whole file is the problem, and that goes away if you do rolling hashes.

To give a concrete example of a 2.3 MB file:

1d [kf:~/owncloud.com/mirall] 1.7+ ± time md5sum ~/xxx.log
0f21c2819c8caf09cc6004307b1bc8ad  /home/kf/xxx.log

real    0m0.045s
user    0m0.006s
sys     0m0.001s

and I am not on an up-to-date CPU, and this was not a rolling hash.

Yes, we can discuss this on and on, but we should try to get facts like these:

  • Rolling hash sum calculation would not be a problem
  • The implementation on client and server would have to be able to deal with files that have checksums, files that do not, and situations where they cannot be calculated.

Furthermore we should try to get real information about what the users have:

  • How many files?
  • Which file types?
  • Size histogram
  • How much on external storage, and which external storages
  • ...

Maybe we can provide a script that collects this kind of data and ask users to run it and send us the results. That would help.
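
A rough sketch of what such a collection script could look like (hypothetical, standalone PHP; the size buckets and output format are just assumptions to make the idea concrete):

<?php
// Sketch only: collect anonymous file statistics (count, extensions, size
// histogram) from a data directory, to inform the discussion above.
$root = $argv[1] ?? '/var/www/owncloud/data';
$count = 0;
$byExtension = [];
$sizeBuckets = ['<1MB' => 0, '1-100MB' => 0, '100MB-1GB' => 0, '>1GB' => 0];

$iter = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
);
foreach ($iter as $file) {
    if (!$file->isFile()) {
        continue;
    }
    $count++;
    $ext = strtolower($file->getExtension() ?: '(none)');
    $byExtension[$ext] = ($byExtension[$ext] ?? 0) + 1;
    $size = $file->getSize();
    if ($size < 1e6)     { $sizeBuckets['<1MB']++; }
    elseif ($size < 1e8) { $sizeBuckets['1-100MB']++; }
    elseif ($size < 1e9) { $sizeBuckets['100MB-1GB']++; }
    else                 { $sizeBuckets['>1GB']++; }
}

echo "Total files: $count\n";
print_r($sizeBuckets);
arsort($byExtension);
print_r(array_slice($byExtension, 0, 20, true)); // 20 most common extensions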

@icewind1991
Contributor

I've done some measurements for doing a rolling md5 hash while uploading over WebDAV:

Filesize    without hash    with hash
~200KB      260ms           370ms
~130MB      580ms           833ms
~2GB        11s             13s

@PVince81
Contributor Author

Thanks @icewind1991.

Were these small files or big files?
That does seem like a significant performance impact.

I'm still a bit worried about cases like data corruption from this guy here: #10975
He "fixed" it by installing files_locking. But who knows what other hidden bugs (race conditions?) could cause files to be corrupted ?

I guess the other solution is then to invest in the high-level file locking in #11804, which might have a higher priority than the checksum idea.

@LukasReschke
Member

Erm. The filesize is in the table? ;-)

@PVince81

@PVince81
Contributor Author

Ah sorry, I thought it was a group of files that added up to that size. Thanks for the clarification.

@PVince81
Contributor Author

Hmm yet another corruption issue: owncloud/client#2425

@etiess

etiess commented Nov 3, 2014

Another argument for hash computation: it could solve a lot of redownload / reupload issues. We have seen a lot of examples in the past (just one famous example here: owncloud/client#994). And we're still seeing new ones today with the latest version (one example here: owncloud/client#2431), with massive reuploads. It's highly probable that we'll have similar issues again in the future.

Concerning doing the computation as a background job, is it really a common situation that the server is "constantly under load"? (#11811 (comment)) Moreover, the server is already performing a lot of tasks (thumbnail generation, for example) which are, to me, much less important than guaranteeing data integrity and a correct syncing process.

@karlitschek
Contributor

Redownloads are triggered by etag handling bugs on the server or client. The right fix is to fix these bugs.

@etiess

etiess commented Nov 3, 2014

@karlitschek I agree: calculating hashes would only be a workaround to prevent the consequences of these bugs, which need to be fixed. But these kinds of bugs do happen, and they likely will again. My point is that it would spare the user from suffering from these bugs.

@PVince81
Contributor Author

PVince81 commented Nov 3, 2014

Here is another example of a bug where a checksum check would prevent the user from having files downloaded in their encrypted form: #11908

If the trouble is mostly about performance then maybe it should be optional and disabled by default.
People who want an extra level of robustness could enable it but need to be aware of the performance penalty / extra required resources.

I wonder if it would be possible to use hardware-accelerated hashing if supported by the CPU and PHP.

@etiess

etiess commented Nov 3, 2014

@PVince81 If you haven't read it yet, the contributions in #523 (particularly @zatricky's) are full of information. @zatricky, this issue may be of interest to you.

@zatricky

zatricky commented Nov 4, 2014

Apologies if I seem negative - #523 clearly shows some frustration on my end. ownCloud devs have different priorities than I do, so I moved on. owncloud/client#230 (comment)

The primary concern against hashing has been the case of slow or costly remote storage on ownCloud server infrastructure (Amazon S3/etc). I understand this concern - but I feel it is naive. @PVince81, your suggestion bypasses that concern completely - so maybe there are no excuses now for this to be fixed (finally).

@dragotin
Contributor

dragotin commented Nov 6, 2014

Real-life problems that would become possible to solve: de-duplication, LAN sync, incremental sync. Lots of people ask for those. That is why I think we should start thinking about this.

@moscicki

@karlitschek, @freitag: corruption may happen on the wire or on the storage for whatever reason -- and not necessarily because of a bug in an ownCloud component (but thinking that all bugs can be squashed is a bit naive nonetheless). How much corruption there will be is just a matter of scale. For a single-user deployment it does not matter, but for a large system - definitely yes! For example, many of your customers run with NFS async mode - and guess what happens when the server reboots, runs out of memory, or hits a RAM error. Another example: some people have seen really strange bit flips on dual-stack IPv4/IPv6 networks, and these guys handle large storage systems, so I tend to take their input seriously.

Therefore I would very strongly advocate adding checksums to the sync protocol between the client and the server (end-to-end). You may then leave it up to the server whether to implement it, and to the client to discover whether the server supports it.

About hashes and performance impact: there are hashes other than md5sum which are much less expensive to compute and have nice properties, e.g. you can hash chunks independently. These issues have been extensively analyzed and tested by many storage experts, and ownCloud can use existing research to do something smart. It's better to have less reliable but fast hashes than no hashes at all!
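
To make the chunk-wise idea concrete, here is a small sketch (the algorithm choice, chunk size, and function name are assumptions for illustration, not a protocol proposal):

<?php
// Sketch only: hash fixed-size chunks independently with a cheap algorithm
// (adler32, available in PHP's hash extension), plus one whole-file checksum.
// Per-chunk hashes would allow comparing or re-transferring only changed chunks.
function chunkedChecksums(string $path, int $chunkSize = 10 * 1024 * 1024): array {
    $chunks = [];
    $total = hash_init('sha1');
    $fh = fopen($path, 'rb');
    while (!feof($fh)) {
        // For local files fread() returns the full requested length;
        // network streams may need an accumulation loop instead.
        $data = fread($fh, $chunkSize);
        if ($data === false || $data === '') {
            break;
        }
        $chunks[] = hash('adler32', $data);  // cheap per-chunk checksum
        hash_update($total, $data);          // stronger whole-file checksum
    }
    fclose($fh);
    return ['chunks' => $chunks, 'total' => hash_final($total)];
}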

@craigpg

craigpg commented Nov 15, 2014

With all of the code that runs each time a file is updated, including some rather hefty apps (e.g. encryption), I'm not convinced that computing a hash on the fly would be a performance killer. Besides, the extra overhead should be easily measured, as Icewind has already started, so we can make an informed decision.

Looking at Icewind's numbers, I'd like to understand total clock time for adding those files. For example, 11 secs to add a 2GB file through WebDAV seems pretty fast. ;-)

The benefits of the checksums have already been captured above, so I won't repeat them here. I will say that having an eTag based on file contents seems like a huge win since it should further prevent needless transfers between our clients and server.

One last thought that is probably obvious but hasn't been mentioned: the additional overhead (which we still need to quantify) would only be incurred when writing a file, not reading it. I suspect that most of our WebDAV / sync requests are GETs (not PUTs), which further limits the overall impact of the overhead.

@guettli

guettli commented Mar 25, 2015

@moscicki Thank you for your proposal from 22 Nov 2014. I am missing an explanation of why this cannot be used for ETags. Please be explicit. Checksums can be used for integrity and to avoid unneeded duplicate file transfers. Please update the proposal to explain why this is not related to ETags. Thank you.

@zatricky

@guettli - I don't think anybody has explicitly said that it cannot be used as the etag or that it cannot be a part of the etag. So far all comments are along the lines of "Should it be used?", "It can be used", and "the bugs giving you concern (that you think this will help fix) should be fixed first".

@zatricky

@guettli - The proposal merely states that it has nothing to do with etags - not that checksums cannot be used.

@dragotin
Contributor

related #12417

@guruz
Contributor

guruz commented Oct 15, 2015

We discussed this some more and, for now, want the simpler solution in #18716.

@PVince81 PVince81 added this to the backlog milestone Mar 1, 2016
@dragotin
Contributor

dragotin commented Mar 9, 2016

#18716 was closed; however, the current implementation lacks verification of the checksum on the server. So does it make sense to track that requirement here?

@PVince81
Contributor Author

PVince81 commented Mar 9, 2016

Yeah, let's keep this ticket open

@PVince81
Contributor Author

Closing in favor of #26655 with more detailed tasks

@lock

lock bot commented Aug 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 3, 2019