Checksums on files / reliable up/download #11138
Comments
GitMate.io thinks possibly related issues are #56 (File Drop: Create confirmation / show file checksum), #3475 (question about file count), #6129 (When using the REST API to find out checksums of files, a 404 is returned.), #7629 (Create Folders when moving files), and #1371 (No file-uploading possible at all). |
I believe the client checksums files, but I'm not sure whether it syncs the checksums or whether the server verifies them. But that's more meant to check that files are transferred correctly. |
This should be done on server side, as even if all clients would compute & upload a checksum, the webUI would be missing and then we cannot use this at all. |
@icewind1991 we talked about this on NC15 meeting. |
The big problem with this is that it is nearly impossible to generate the checksums for external storages. We can't do anything there and also updating them in a timely manner is nearly out of scope. For a pure "all data goes through NC" this would be no problem, but the external storages are a real problem. |
cc @rullzer @icewind1991 because we discussed quite a lot about this. |
FYI the fact that NC does not have this apparently causes owncloud-client to warn when connecting to a nextcloud instance, which is a problem in Debian because there is no nextcloud-client (yet): https://alioth-lists.debian.net/pipermail/pkg-owncloud-maintainers/2019-January/003498.html |
For NC 24 we're likely adding hashing on the server (I know it's in the plan). But note that the client already does compute, store and check a hash. I do think some users might have bigger expectations of these hashes than is realistic. We are likely only very rarely going to skip syncing a file because of the hash; I just wouldn't trust it completely. So it won't save much data transfer. Then the only thing it really does is, in theory, warn when bits are changed during data transfer or storage. It won't FIX those situations, or tell you what the right file is, and it's super unlikely these things happen. And IF they happen, there IS already a checksum on http(s) transfers, so you're actually not adding any more reliability. Maybe a good feeling and a bit more load on the server and client is all you get. Bugs like the one mentioned, mistakes that cause wrong dates or 0 byte files: these wouldn't be fixed somehow by having these checksums. I'm not saying they have no benefits, just that it's not magic. Honestly I think it would be a good idea to update the topic on top with our actual GOAL: what are we trying to accomplish? Because as it's written now, it seems to be a double-check that a file transfer happened correctly. But the http protocol already does that, so that seems quite pointless. |
Hi @jospoortvliet, thanks for the update. There are many synchronisation tools (rclone being one of them) that are able to compare hashes to decide whether data needs to be transferred. In my scenario NC would need to support MD5 and SHA1 for greatest compatibility. For example, Google Drive only supports MD5, so a direct hash comparison is not possible in that case if NC only supports SHA1. Note you mentioned the "client already does compute, store and check a hash"... I am looking for the hash to be exposed via WebDAV. Thanks, Rob. |
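To make the rclone-style comparison described above concrete, here is a minimal sketch of how a sync tool might decide whether a transfer is needed when the remote side exposes MD5, SHA1, or both. The function names `file_digests` and `needs_transfer` and the shape of `remote_digests` are assumptions made for illustration; they are not part of Nextcloud, rclone, or any WebDAV API.

```python
import hashlib


def file_digests(path, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for a file, reading it only once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def needs_transfer(local_path, remote_digests):
    """True unless a hash type known to both sides shows identical content.

    remote_digests is a hypothetical dict such as {"md5": "..."} holding
    whatever hash types the remote storage happens to report.
    """
    shared = [name for name in ("sha1", "md5") if name in remote_digests]
    if not shared:
        return True  # no common hash type, so equality cannot be shown
    local = file_digests(local_path, algorithms=shared)
    return any(local[name] != remote_digests[name].lower() for name in shared)
```

If the remote only reports MD5 and the local side only knows SHA1, `shared` is empty and the file is transferred again, which is the compatibility gap the comment points at.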
I think I just ran into this.
My use case is updating the tags of my mp3s. Here, one often uses a "preserve timestamp" option so as not to mess with the "order by latest" view in various music players, so the modification time is useless for detecting the change. Furthermore, I removed some tags, which I assume means zeroing out some metadata frame, so the file size stays the same as well. Now I would love to have a prompt for a client-server data conflict, so I know that the data on the server is outdated. I guess that there are more file formats that are prone to this silent not-being-synced issue. |
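The situation described above can be reproduced in a few lines. This is only an illustrative sketch: the file name, the fake tag bytes, and the helper names are made up, and it says nothing about how the Nextcloud client actually detects changes. It shows that a comparison based on size and modification time sees no change, while a content hash does.

```python
import hashlib
import os
import tempfile


def quick_signature(path):
    """The cheap comparison: file size and modification time only."""
    info = os.stat(path)
    return (info.st_size, info.st_mtime_ns)


def content_hash(path):
    """The expensive comparison: a hash over the actual bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def retag_demo():
    """Return (metadata_unchanged, content_unchanged) after an in-place retag."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "track.mp3")
        tag = b"Artist=Someone"  # stand-in for a real tag frame
        with open(path, "wb") as handle:
            handle.write(b"ID3" + tag + b"\xff\xfbaudio")
        info = os.stat(path)
        before = quick_signature(path), content_hash(path)
        # Zero out the tag in place (same size), then restore the timestamps,
        # as a tagger with "preserve timestamp" enabled would.
        with open(path, "r+b") as handle:
            handle.seek(3)
            handle.write(b"\x00" * len(tag))
        os.utime(path, ns=(info.st_atime_ns, info.st_mtime_ns))
        after = quick_signature(path), content_hash(path)
        return before[0] == after[0], before[1] == after[1]
```

Calling `retag_demo()` reports the metadata as unchanged and the content as changed, which is the case where only a hash comparison would trigger a resync.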
I had the bug nextcloud/android#11974 (The version of the Android App with possible fixes is not yet available when I sent this comment) For whatever reason, the Android app or Server conflict resolver did something wrong and this is what I got in several of my photos. It basically replaced the original photo on my smartphone and on the server with a black photo |
Let me guess, are they 0 Bytes? :) Happened to a good chunk of mine for some months until I finally clued together what possibly causes this. I had never expected NC to be the culprit, an app that's massively adopted, the base for many organizations' own clouds using NC as a backend with obvious need for mobile workflows. I completely distrusted my phone's NAND over NC for a while. The fact I need to do manual conflict resolving on unchanged files, that NC zeroes local files before the server has a proper upload... I'm puzzled by all of this. |
See my feedback here: nextcloud/android#11974 (comment) "Performance issues are tolerable; data loss is not." |
Just had an incident because of the lacking integrity check. If one uses NC to store many files this feature is essential. |
I'm jumping on the train. I ran a JPEG check script on my NC files and discovered that a few percent of the images I uploaded during a specific seven-month period were corrupted with a bunch of zeros. They are not fully blank, but they contain zeroed segments, sometimes as big as ~500 bytes. JPEGs will be "heal-able", but for some other files it's going to be tough to put the right placeholder in the blank spot (PNGs are tough, PDFs are hell). I think that the focus on synchronization left us with a risky upload tool. The explanation that "migration would be complicated for big systems" when someone suggests adding hashes is a bit infuriating. A feature flag with opt-in, or a major release with breaking changes, is nothing new to the industry. Plus it somehow closes the door on volunteer work on this matter. But I'm open to disruptive approaches. Maybe a Nextcloud app that gathers clients' hashes on a voluntary basis and checks that the files landing on the server are OK would be good enough, without having to rely on the core synchronous mechanism. |
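For anyone who wants to look for the same pattern, a check along these lines can be sketched as follows. This is not the script mentioned above, which was not posted; `find_zero_runs` and the 256-byte threshold are assumptions chosen for illustration. Long runs of zero bytes are unusual inside compressed image data, but other formats can legitimately contain them, so a hit is only a hint that a file deserves a closer look.

```python
import re


def find_zero_runs(path, min_length=256):
    """Return [(offset, length), ...] for runs of NUL bytes >= min_length.

    Reads the whole file into memory, which is fine for photos but not
    for very large files.
    """
    pattern = re.compile(rb"\x00{%d,}" % min_length)
    with open(path, "rb") as handle:
        data = handle.read()
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]
```

Running it over a synced folder and listing the files with a non-empty result gives a first set of candidates to compare against backups or the originals on the phone.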
Not nearly as complicated as cleaning up after potentially months of hundreds or thousands of people relying on sync working when it was silently corrupting files... Unrecoverably so because folks trusted Nextcloud instances to run on well-backed up drives and their phones having little storage. Too bad good data never reached Nextcloud under certain circumstances. Yikes... PR nightmare for EVERYONE involved. With all due respect. |
I did a first exploration of the code server-side. NC heavily relies on Sabre. There are many app/plugins here and there to tune the default behavior. This is going to take some time to have a global overview of everything related to upload and set a plan (chunked upload, bulk upload...). |
Well what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata or similar features? AFAIK extensibility should allow this and you likely cannot prevent them to do anything bad, you need to trust the apps installed. |
The point of this mechanism is to ensure the transmission process is graceful and error-free. Whatever happens afterward, such as automated or manual editing of media stored within Nextcloud should be a completely different aspect. What we don't want however is during the process of transmission some app interfering with the data as it's still in transmission. I think that was @JulienFS's point. (correct me if I misunderstood you) |
I would advocate for the decision to be left to the end user. My reasoning is: bringing hashes to the core upload mechanism might require a costly major rework, while doing it opportunistically with lightweight plugin-based tweaks might not offer strong enough guarantees for integrity. Having a side mechanism that checks end-to-end integrity might be good enough. In this regard I would say: I'd rather have compressing/whatever apps integrate with the integrity app than let them potentially mess with the integrity of my data. By the same reasoning, I'd rather choose between integrity and post-upload modification than not have a reliable integrity/sync check. Obviously there's more than integrity involved in using hashes: silent file modification and efficient manual resync are also to be considered. At this point I'm trying to figure out if there's any easy path to bring an opt-in solution for people who care about these problems more than about possible post-upload modifications by other apps. I'm not a NC developer, simply a FOSS enthusiast who happens to also be a pro dev. I'm willing to put some effort into solving these things but it might be a big piece of work and I'm unsure if I'll have the resources to achieve it. Finding a workaround here and now would at least offer some of us enough time to build or wait for a core feature without having to leave NC.
That would be the "end game". Obviously keeping NC's modularity while fixing hash-related problems is the ideal goal. I'll keep investigating; maybe adding hashes is not that big a deal, but I prefer to be careful as I'm not familiar with the code base yet. PS: if anyone landing here has some product management skills, feel free to start aggregating things related to the lack of hashes and get me in the dev loop. |
Update from the rabbit hole. I found something disturbing during my little exploration : the file creation and update part on the Sabre side. Basically it uses PHP's At this point I can't be really sure that there's no extra mechanism plugged somehow that would make the situation better, so lets not jump to conclusion. However I'm still trying to figure out how some zeros ended in my files, and a stream interruption on an update combined with a zero-filled write buffer would match the pattern. EDIT : I forgot to mention that the bulk upload feature is not leveraging the same code as Sabre's PUT handler. It's re-implementing on top of some of it. For instance and to my understanding files created with bulk upload don't seem to trigger Sabre events, but NC events instead. |
After digging a little bit more : it's not Sabre's default file handling that is being used. I missed the CachingTree passed to the default implementation constructor, which is the starting point for NC's style file handling. server/apps/dav/lib/Connector/Sabre/File.php Line 139 in 54afea4
Glad to see that the implementation is more complex than a simple
I still need to dig deeper.
At this point it starts to look clear to me that a well-implemented client should be able to do reliable uploads, especially since there's an optional hash check mechanism. I also think that reliability could be increased by not writing to the final file directly: this way we would preserve the existing content and not rely on a reupload by the client to fix the file (but this part still needs investigation). Faster resync could be handled by supplying a client hash to an extra endpoint/HTTP method. It has been said previously that "(you) wouldn't trust a file to exist based on a hash" but I would personally trust a modern secure hash. Obviously we would be drifting a bit from standard DAV, but as long as it is optional it shouldn't be a problem (and I think chunked transfer is not that standard either, same for bulk upload). That would be a reasonable tradeoff between network usage and CPU usage if we don't store the hash. Only people with a storage backend that bills more for reads than for writes might have a problem, but they could simply not opt in (or opt out). We could also send a hash from the client, in addition to the "write final data after a sanity check" change: this way we could ensure data integrity before writing the final content to the final path. I know that there are already plenty of CRC/checksum mechanisms in the underlying layers (TCP has been cited) but here we could safeguard against application-level errors. |
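The "write aside, check, then move into place" idea from the comment above could look roughly like this. It is a sketch in Python rather than the PHP the server is written in, and it does not reflect Nextcloud's actual upload code; `store_verified`, `ChecksumMismatch`, the `.part` suffix, and the choice of SHA-256 are all assumptions for illustration. The staged file is hashed by reading it back from disk, and the existing file is only replaced if the hash matches what the client sent.

```python
import hashlib
import os
import tempfile


class ChecksumMismatch(Exception):
    """Raised when the staged upload does not match the client's hash."""


def sha256_of(path, chunk_size=1024 * 1024):
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def store_verified(stream, final_path, expected_sha256, chunk_size=1024 * 1024):
    """Stage the upload next to final_path, verify it, then swap it in."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, staging_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as staging:
            while chunk := stream.read(chunk_size):
                staging.write(chunk)
            staging.flush()
            os.fsync(staging.fileno())
        # Hash what actually landed on disk, not what passed through memory.
        actual = sha256_of(staging_path)
        if actual != expected_sha256.lower():
            raise ChecksumMismatch(f"expected {expected_sha256}, got {actual}")
        os.replace(staging_path, final_path)  # atomic on the same filesystem
        return actual
    finally:
        # Only still present if something went wrong before the swap.
        if os.path.exists(staging_path):
            os.unlink(staging_path)
```

With this shape, an interrupted or corrupted upload leaves the previous version of the file untouched, and the returned digest could be handed back to the client for its own comparison.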
It looks like bulk upload is not using the DAV file handling but is using the native OC file handling. Here is where I landed by trying to follow the epic thread of hooks and inherited classes and autoloaded stuff : server/lib/private/Files/View.php Line 625 in 54afea4
So the bulk upload doesn't have some handy stuff from NC's Sabre implementation. But it has its own handy stuff. There's an 'in stream' checksum that happens for each embedded file. The problem is that it operates on the raw stream instead of the actual file data, which is fetched from the stream later. If something gets messed up in the actual fetching of data, the previous checksum won't detect it. This is better than nothing, as it somewhat safeguards against dramatic stream handling problems, but it could be enhanced. The chunked upload is on another path than the regular one and doesn't have the checksum compute/check option. I might be wrong but it looks like it blindly pushes the 'oc-checksum' header, if any, to the file info cache without checking it. Thus we could have some checksum already sitting somewhere that could be simply wrong. I don't really know why there are filesystem-related classes for DAV/Sabre and other FS-related classes in the OwnCloud lib, and why one is not using the other. At some point I was afraid that NC had drifted from OC and that we had ended up with a mess, but things look pretty similar upstream. That being said, maybe OC has shifted its resources to its "infinite scale" backend... So the first thing would be to ensure that we have a hash coming from the client for the four upload scenarios :
For all these scenarios, we should perform a last-minute checksum before writing anything to the final destination. It involves writing aside from the final destination during the upload. This is already the case for chunked upload, obviously. In this latter case we would need to read the reassembled file twice: once to get the checksum, then another time to write to the final destination. Ideally in all these scenarios the final computed checksum should be returned to the client; a header would fit. That would be a nice way to check that things didn't go crazy. Now to be extra safe we would probably need an extra round-trip from the client to the server. We could check that what is written on disk is actually what should be written on disk. It might sound paranoid to do this extra check, but with the number of inherited filesystem classes involved in a simple write operation it might be worth having as an option. The endpoint used for this extra check could also be used for periodic or manually triggered sync checks. It could also serve when bootstrapping a synchronized directory, to detect files that already exist remotely. Considering the mess that filesystem handling is at this point, I would strongly advise against server-side hash storing. Maybe there's opportunistic reliable hash storing with some object storage backends, but otherwise it would be way too hard to guarantee that any modification to a file, even through vanilla NC, would properly update the hash. Most likely, before any hash storage is implemented, we should consider refactoring the whole filesystem handling. Anyway "live" hashes should be good enough. I'm going to take some time to think about all of this, then I'm probably going to start writing some code if it fits my schedule. If anyone has suggestions, feel free to share. |
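One way to picture the alternative to blindly caching the client-supplied checksum header is sketched below, again in Python for illustration rather than as server code. It assumes the header value has the form `ALGORITHM:hexdigest` (for example `SHA1:…`); that format, the set of accepted algorithms, and the function names are assumptions, not something confirmed in this thread. The header value is only returned for storage once it has been checked against the bytes on disk.

```python
import hashlib

# Header algorithm names mapped to hashlib names (assumed set).
SUPPORTED = {"MD5": "md5", "SHA1": "sha1", "SHA256": "sha256"}


def parse_checksum_header(value):
    """Split an assumed 'ALGORITHM:hexdigest' value into (hashlib name, digest)."""
    algorithm, separator, digest = value.strip().partition(":")
    if not separator or not digest or algorithm.upper() not in SUPPORTED:
        raise ValueError(f"unsupported checksum header: {value!r}")
    return SUPPORTED[algorithm.upper()], digest.lower()


def checksum_to_store(header_value, stored_path, chunk_size=1024 * 1024):
    """Return a normalized header value only if it matches the stored file."""
    name, claimed = parse_checksum_header(header_value)
    hasher = hashlib.new(name)
    with open(stored_path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            hasher.update(chunk)
    if hasher.hexdigest() != claimed:
        raise ValueError("client checksum does not match the stored file")
    return f"{name.upper()}:{claimed}"
```

The same computed digest could be sent back to the client in a response header, which is the "return the final checksum" step suggested above.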
I don't necessarily need Nextcloud to do the safety checks; I just want to be able to access the checksum from the server, so I can write my own scripts that deal with checksum checks in the way I want. Is there no way at all to retrieve the checksum at the moment (without reading the file into memory and computing it, which would be way too heavy)? It is really frustrating not being able to implement safety checks. I don't understand how the checksum is not already stored; it feels like very basic information that any storage solution should have. Nextcloud always screws up a small percentage of uploads/downloads in a way that it doesn't clean up itself, which makes it pretty much unusable. |
@sk1806 storing hashes would bring another class of problems though. What if the stored hash doesn't match the actual file? This is especially problematic since Nextcloud can deal with many storage backends. And to be fair, with the way hashes are currently dealt with, we could end up storing header-provided hashes without even checking that the file was actually written properly... FTR I still plan to work on this hash thing but my job and my family have been very demanding for the past few months. I really hope that I can get back on that topic by the end of the year. At least at this point I have a good overview of the code base, especially the parts related to file upload. If any dev passing by has some throughput, please reach out; I can surely put someone on a fast track. |
To have reliable up- and downloads, generating a checksum on server is needed.
Upload
Download
On backward compatibility:
Upload:
Download:
Additional for Android Files we can check if files are already uploaded (despite on relying on file names).
(this was shortly discussed in NC15 planning meeting)
Additionally, the ability to request checksum of file via endpoint from #25949: