Pointless re-sync of entire folder #523
Comments
Which client version? |
Latest client for Windows (1.1.1) |
Please reopen this issue within the mirall repo. Thx |
I am reopening this issue here, since without the server providing an MD5 sum, there is nothing we can do in the client. |
Note: we could at least have hash sums for the files that we have exclusive access to. |
So what’s the call on this one? Fixed? Important to work on? Please advise. |
I'd say this is very important. Can we get appropriate labels put on this issue? I had a similar issue where my server and my desktop/laptop etc. are using Dropbox. I want to have all of that moved over to oC - but it's a little crazy that the client will want to re-upload and/or re-download everything. It's much cheaper (bandwidth- and time-wise) to move everything over on all three platforms and have them all simply notice "oh, everything is okay, nothing to do here". Ideally, the checksum should actually be an indexed value in the database - in fact it should probably even be the primary key used to identify content. I believe that the system already supports "move" operations (moving a file to another folder within ownCloud without causing a deletion/re-upload) - and doing this would actually make supporting that concept trivial. Please note, md5sum is a good starting point - but it would be much more appropriate to use a variety of checks and various cryptographic checksums to ensure that everything is consistent across all systems. If filesize and checksum match but the timestamp differs, then only a tiny change should be actioned (fixing the timestamp). All of this information should be stored in the client and server databases when files/folders are added to the repository. Recalculating hash values every time the client starts would be madness. Having this stored within the client database would also improve the "time to first sync" in the case of content having changed while the client wasn't running. |
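A minimal sketch of the comparison described above, assuming each side keeps size, mtime and a checksum per file; the names and the choice of SHA-256 are illustrative, not ownCloud's actual design:

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class Entry:
    """Per-file record a client/server database could keep (hypothetical)."""
    size: int
    mtime: int
    sha256: str


def file_entry(path: str) -> Entry:
    """Stat a local file and compute its checksum (content is read once)."""
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return Entry(size=st.st_size, mtime=int(st.st_mtime), sha256=h.hexdigest())


def plan_action(local: Entry, remote: Entry) -> str:
    """Decide what a sync run would have to do for one file."""
    if local.size == remote.size and local.sha256 == remote.sha256:
        # Content is identical: at most the timestamp needs correcting.
        return "noop" if local.mtime == remote.mtime else "fix_timestamp_only"
    return "transfer_content"
```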
Hello, First of all, thanks to the devs for the amazing ownCloud. I would like to know how far you are with solving this type of issue? I live and work in Africa: Internet access is slow, and most of the time data plans have limited download quotas. So it’s very expensive to re-sync a folder on a new computer with several GB of data. With Dropbox, we just copy the files onto a USB stick and copy them to the new computer before synchronizing it. Then we don’t need to download the data again and the sync works fine. Is this possible with ownCloud? Does the unique ID make this possible? When I copy my files, can I copy the unique ID too? If not, we are forced to download everything again, which is very inconvenient for large amounts of data… I tried to copy the whole folder, but it doesn't seem to work:
I should point out that I do not use ownCloud in a standard way: I don’t want to sync all my data locally, as proposed during the initial setup of the client. So I deleted this first “sync pair” and created new pairs with individual folders. I should also point out that, for testing purposes, I did this on a single computer with the same client (which should sync the same server folder to two locations on the same computer). Several issues have been opened on this subject, although I'm not sure they have the same origin: Thanks for your help! Etienne |
+1 for implementing hash-based verification of the content, since there's nothing like a voting system |
Is there any workaround for now, until an MD5-based solution is developed? Having big troubles due to re-syncs... |
@AykutCevik For the moment unfortunately not. You can follow owncloud/client#994 and post your logs to help the team. |
My apologies for the excess comment traffic - I'd intended to suggest using multiple hashsums but forgot to mention it. I'd suggest using MD5 as the primary hashsum and SHA256 or SHA512 as the second. Though not a security issue per se, I would not use md5sum alone, for the reasons cited here: |
We maintain unique ids of the files on the server in the filesystem cache table and in the client SQLite database. A complete re-sync shouldn't happen unless the server or client databases are changed or deleted somehow. |
@karlitschek: The use case inferred here is one where the indexed value is based on a hash function. Specifically, can the unique id identify a file based on the whole of its content, or is it simply metadata that is independent of the actual file content? Put another way: if I have two files with the same content, will their ids be identical? In this case the answer must be yes, while also ensuring, beyond reasonable doubt, that we do not have two files with different content but the same id. (See http://git-scm.com/book/ch6-1.html#A-SHORT-NOTE-ABOUT-SHA-1) Rsync takes advantage of hashing to reduce the amount of data transferred in a sync. However, its use case is very different, and rsync does not keep a database of files on hand. In our case we want to leverage the client and server databases, which keep these indexed hashes on hand, to ensure that we do not re-upload or re-download content needlessly. I'm making a separate issue for consideration of full rsync-style synchronisation. |
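To illustrate the distinction being asked about, a small hypothetical example (not ownCloud's id scheme): a content-derived id, in the spirit of git's object hashes, gives two byte-identical files the same id, whereas an id assigned independently of the content does not.

```python
import hashlib
import uuid


def content_id(data: bytes) -> str:
    """Content-addressed id: identical bytes always map to the same id."""
    return hashlib.sha256(data).hexdigest()


def metadata_id() -> str:
    """Metadata-style id: assigned independently of the bytes, so copies differ."""
    return uuid.uuid4().hex


photo = b"\x89...imagine 13 GB of family photos here..."
external_copy = bytes(photo)  # e.g. carried over on a USB stick

assert content_id(photo) == content_id(external_copy)  # same content, same id
assert metadata_id() != metadata_id()                  # every assignment is new
```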
The ids are not related to the file content. They are just metadata, as you phrase it. Two identical files will not have the same etag. We do not calculate a content-based fingerprint because we have a multi-backend structure which can make it very hard to read the whole file before syncing. |
Ideally these hashes should be calculated by the client before the initial content upload. The server could have a scrub process to periodically verify these hashes (which would also help satisfy upgrade considerations). By storing these hashes in the database there will be little-to-no pre-sync I/O on the server. Post-sync the server should probably verify the hashes within a short time-frame - but that is a decision I leave to the devs. In this use case of a "pointless re-sync", the only work the server will have to perform will be the SQL queries necessary to see if the hashes being sent by the client already exist in the database. |
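A minimal sketch of that server-side lookup, assuming a hypothetical file_hashes table (this is not ownCloud's filecache schema): the only pre-transfer work is an indexed query on the hash and size the client announces.

```python
import sqlite3

# Hypothetical schema - ownCloud's real filecache table looks different.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_hashes (
    sha256 TEXT    NOT NULL,
    path   TEXT    NOT NULL,
    size   INTEGER NOT NULL,
    PRIMARY KEY (sha256, path)
);
"""


def server_already_has(db: sqlite3.Connection, announced_sha256: str, announced_size: int) -> bool:
    """True if some stored file already matches the hash and size the client announced."""
    row = db.execute(
        "SELECT 1 FROM file_hashes WHERE sha256 = ? AND size = ? LIMIT 1",
        (announced_sha256, announced_size),
    ).fetchone()
    return row is not None


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO file_hashes VALUES (?, ?, ?)", ("ab12cd...", "/photos/cat.jpg", 123456))
print(server_already_has(db, "ab12cd...", 123456))  # True -> no transfer needed
```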
It's not that easy because:
|
Hello @dragotin. What are the backend structures which are not able to calculate hashes? Do Linux, Mac, Android, iPhone and Windows calculate hashes differently? Is there no way to make these calculations compatible? If data changes don't go through the OC server, then it should be possible to verify the hashes periodically or manually, or to verify them if other metadata changed (time, size, ...), or to force any new data to go through the ownCloud server. In any case, this is an unusual way to use OC for me, and it should not compromise the core function of synchronization. As you suggested to me on http://dragotin.wordpress.com/2013/09/11/after-the-1-4-0-owncloud-client-release/ , I tried again (with 5.0.12 and 1.4.1) to copy an entire folder from a synced computer to a new computer to be synced (including csync_journal.db). Then I configured the sync on the new computer. Everything was downloaded again :-(. All logs and csync_journal.db (before and after the sync) are available here: https://www.sugarsync.com/pf/D6476655_61894308_919677 I really think that this issue is much more important than the cases you mentioned. And I also really think that this issue would be solved using hash sums. |
Hello, I want to plug into this discussion. Would you consider hashing the file metadata (mtime, size) instead of the content? Such a hash could be used in the same way as the random etag, but it would have the advantage that you don't need to re-download files you already have. You could also trivially calculate such a hash for secondary mounted storage on the server. On top of that, if you lose the local state db you can simply recreate it (maybe you don't even need it to store etags in this case). However, it would require that mtimes be handled consistently on the clients - a client with a wrong clock could make a file appear on another client with an mtime/ctime in the future, but that would not affect sync correctness in any way. BTW: how do you handle mimetype determination for secondary mounted storage on the server? Do you maybe already read the first 256K bytes? What do you think? kuba
|
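A minimal sketch of the metadata-hash idea proposed above (an assumption about how it could work, not ownCloud's ETag algorithm): the etag is derived deterministically from size and mtime, so either side can recompute it from a plain stat and a lost state db can be rebuilt without re-transferring content.

```python
import hashlib
import os


def metadata_etag(path: str) -> str:
    """Deterministic etag from (size, mtime): cheap to compute, rebuildable from a stat."""
    st = os.stat(path)
    return hashlib.sha1(f"{st.st_size}:{int(st.st_mtime)}".encode()).hexdigest()

# Because the etag depends only on metadata, client and server can agree on it
# without either side reading the file content - provided mtimes are propagated
# consistently, which is exactly the caveat raised above.
```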
Thank you, @dragotin. I have to agree with etiess on the point that, while these hurdles are not easy (and I do appreciate that fact), the win gained from implementing hashing is incalculable. If my tone comes off as aggressive, I apologise; I'm having a hard time getting my ideas across. If I were a PHP developer I'd have had PoC patches in place within a weekend. Unfortunately my "good" dev/engineering skills are limited to SQL and bash; I'm an amateur when it comes to PHP. :-| @etiess, the hash support issue does not appear to be a platform-specific problem. It appears to simply be this:
This certainly adds a small obstacle. zatricky commented:
With the above in mind, I don't see how it would be so hard to implement a "Have we got new files in here?"-type check/scrub on the server. PHP supports inotify*, which could help support this feature with very little I/O except for a first-run verification. (Checking for new files could even run once per minute, while the verifications could run once a day or once a week - see the sketch after this comment.) @moscicki lamented:
I don't see how any performance issues/bandwidth waste would be mitigated by this. See below and please motivate further.
The ctime/filename don't take that much time/bandwidth/IO to look up. The issue with re-downloading the content is still there in the simplest of cases:
1. Have my desktop set up with oC.
2. Copy an 8GB file from my desktop to my laptop.
3. Touch and rename the file.
4. Add the laptop to oC.
5. Wait for Africa's Interwebs to catch up with a carrier pigeon. |
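A rough sketch of the "have we got new files in here?" scrub mentioned in the comment above: a periodic walk compares each file's size and mtime against what the server database already knows, and only files that look changed need their hashes recomputed. The `known` dict is a hypothetical stand-in for that database; a real deployment could use inotify instead of periodic walks, as suggested above.

```python
import os
from typing import Dict, List, Tuple

# (size, mtime) snapshot standing in for what the server database already knows.
Snapshot = Dict[str, Tuple[int, int]]


def scrub(root: str, known: Snapshot) -> List[str]:
    """Return paths that are new or whose metadata changed since the last scrub."""
    suspicious = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            current = (st.st_size, int(st.st_mtime))
            if known.get(path) != current:
                suspicious.append(path)  # only these need their hashes recomputed
                known[path] = current
    return suspicious
```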
@moscicki hashing of file modification date is definitely not a good option. Considering the size might help, but only very little, because the size alone tells you pretty little (it isn't even a proper indicator of whether a file has changed; the only thing you can tell is that if the size changed, then the file has changed; but not the other way round). So in many cases you'd still have to check the file anyway. To put it very clearly: The only safe way to tell whether a file has changed (or two files are different) is to consider the file content - e.g. by comparing hash sums. |
+1 for hash-based syncing. Perhaps it could even be used where the files are uploaded through sync or the web interface, with a fallback to etag and timestamp where that is not the case? Git actually uses hashes on directory contents too, i.e. a hash of the list of file hashes in a directory, to understand when a directory changes. Using this kind of hashing means you check one hash (at the top of the tree) and you can tell whether anything in the tree has changed. This makes server-to-client checks very easy; obviously a more recursive approach is required on the client directory being synced, given the sync client (I assume) doesn't get notified when a file gets changed. Identifying changes via hashing of file contents is definitely a proven way to sync, used by numerous other systems (git, hg, etc., as well as other syncing systems I'm not aware of). I'm no PHP dev either, and I'm sure it's no small task, especially given prior architectural decisions like Samba backends. Thanks for all the countless hours of development, and to the people providing feedback and log files etc. |
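A minimal sketch of the directory-level hashing described above, loosely in the spirit of git's tree objects (the exact scheme here is illustrative, not git's actual object format): a directory's hash is a hash over its children's names and hashes, so a single top-level comparison tells you whether anything underneath changed.

```python
import hashlib
import os


def file_hash(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_hash(directory: str) -> str:
    """Hash of a directory: a hash over its sorted children's names and hashes."""
    h = hashlib.sha256()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        child = tree_hash(path) if os.path.isdir(path) else file_hash(path)
        h.update(f"{name}\0{child}\n".encode())
    return h.hexdigest()

# If tree_hash(root) matches on both sides, nothing underneath needs syncing;
# if it differs, recurse only into the children whose hashes differ.
```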
@RandolfCarter: I think an efficient way of calculating the ETAG while keeping its required uniqueness properties would provide such a great advantage to ownCloud that it is worthwhile to investigate.
@dragotin, @danimo: there is a problem with low mtime resolution on some filesystems for the sync client anyway (e.g. on FAT) - how do you detect local changes there? You do not need to keep times in sync between the clients - you only need to set mtime of the files consistently when file changes are propagated from the server. Of course there is a side-effect: if your clocks are too skewed then files fetched from the server may have mtime in the future. If I upload an existing big data folder for sync - that's fine, I see no problem.
Do you use hash sums to detect local changes on the sync client? I bet not. You take another approach which is "good enough". I would investigate whether the same could not be done for the ETAG in general. On March 28, 2013 I already asked this question on the owncloud@kde.org mailing list: "In other words: a short description of a conceptual model of sync in ownCloud would also allow us to get useful feedback from others - there are tons of smart people out there who would certainly suggest some smart ideas." kuba |
Let me explain why we don't use a hash at the moment. In the current implementation the ETAG is calculated using the metadata of a file, like mtime, name, ... The reason is performance. ownCloud can be used with petabytes of storage, some of which could be accessed and changed independently of ownCloud. Just look at the external filesystem features as an example: you can mount your huge S3, FTP, CIFS, Dropbox, ... storage into ownCloud. If we want to calculate hashes for every file then we have to download every single file at every sync run to check whether the hash/content of the file is the same. This is obviously not possible. Because of that we only look at the metadata for the ETAG. |
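To make that cost concrete: a content hash can only be computed by reading every byte, which for an externally mounted backend means pulling the whole file over the wire just to check it. A minimal sketch, with the remote file reached over HTTP/FTP purely for illustration:

```python
import hashlib
import urllib.request


def remote_md5(url: str) -> str:
    """Hashing a remotely mounted file means downloading every byte of it first."""
    h = hashlib.md5()
    transferred = 0
    with urllib.request.urlopen(url) as resp:  # e.g. a file behind HTTP or FTP
        for chunk in iter(lambda: resp.read(1 << 20), b""):
            h.update(chunk)
            transferred += len(chunk)
    print(f"had to transfer {transferred} bytes just to learn the hash")
    return h.hexdigest()
```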
@moscicki said:
Etags are important but cannot resolve the problem of needlessly re-uploading/downloading content which has already been manually/externally synchronised. Re ensuring mtime comparisons stay in sync, there is no easy answer. Instead, I think oC bypasses this problem by only considering mtime at the time of the initial upload/download. This is, I imagine, why the etags were put in place. We don't need to worry about comparing mtime between client and server as long as:
HOWEVER, if the client database has a different mtime to its filesystem, it knows the content might have changed. The client then regenerates an etag and triggers an upload (which might be undesired). This brings about the scenario where a file is touched but the content is not changed: the updated mtime triggers an unnecessary upload. The other problem we have (and the reason this bug/issue exists) is where we synchronise content externally, the etag does not exist in the client database, and there is currently no way to tell the client that the content it has is identical to the content already on the server. A hash is the only way to rectify this behaviour. With a hash, we see the mtime has changed, which triggers a recalculation of the hash; we see that the content has not changed and we do not pointlessly re-upload the arbitrarily large file (see the sketch after this comment). The only steps we might still take, depending on dev decisions, would be:
+1. It would be helpful to have a reference/concept document, even if it is not necessarily easy to read. |
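A minimal sketch of the touched-but-unchanged check described above (the journal layout here is hypothetical, not the client's actual csync journal): when the filesystem mtime no longer matches the journal, recompute the hash first and upload only if the content really changed.

```python
import hashlib
import os


def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_upload(path: str, journal_mtime: int, journal_hash: str) -> bool:
    """Hypothetical journal check: a touched-but-unchanged file triggers no upload."""
    if int(os.stat(path).st_mtime) == journal_mtime:
        return False   # nothing even looks changed
    if sha256_of(path) == journal_hash:
        return False   # touched or copied in externally, content identical: metadata update only
    return True        # content really changed -> upload
```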
@karlitschek - Ah, thanks. That makes a huge difference in those use cases. I guess the next question is regarding @danimo's comment earlier in the thread:
I don't see a simple way to automatically differentiate between "local" storage (SAN/local disk/RAID) and "remote" storage (S3/FTP/etc). |
Thank you @karlitschek for your explanation: we understand the advantages of the etag and the local db. But owncloud/client#994 shows that it is not robust enough, and that a problem with the DB would cause a massive re-download. I think @zatricky points out a good idea for the case where the etag has changed (for a good reason, or because of corruption of the DB):
Is it worth considering both local DB/etag AND (if the etag has changed) hash? |
Etags which can be calculated (rather than stored) can solve your problem, because then you don't care if you lose the local sync db. I understand Frank's reasons for not hashing the content in the general case. It may also put extra load on the clients (which already occasionally tend to consume 100% CPU). My point is that if we can calculate etags which are unique enough based on metadata, both you and Frank may be happy (to some extent at least ;-)). And I would be happy too. Optimization of specific cases is a different story - we are considering using a storage backend which does content checksums automatically for us. ownCloud attempts a beautiful thing with a very generic framework - it would be ideal if the framework also allowed particular setups to be handled optimally, taking advantage of capabilities available at lower levels. The same goes for storage which allows extended attributes. This also applies to using ownCloud with a local disk backend, assuming that no one else writes into it. Otherwise we will have to live with the least common denominator of all possible use-cases. I know this is NOT easy, but asymptotically IMO this framework probably needs to go in this direction somehow.
However I just did an experiment which shows that mtimes ARE propagated between the clients (linux) albeit in a way which cannot be used for reliable hashing:
So, as one sees in the example above, mtime is propagated but not consistently - client A's mtime is not the same as client B's mtime. That's why it cannot be relied upon and hashed effectively. I'm not sure there is an easy way out in all cases, because it needs to support various filesystem limitations on the client side. However, IMO, if a filesystem has a capability then it should be exploited.
kuba
|
I haven't yet thought through the whole conversation but here are some facts:
The idea of calculating the checksum of a file's content on the client side, to avoid re-upload if the file was not really changed but just touched, is something to consider. However, I wonder if this is not more of an academic than a practical problem; most users probably don't use touch that regularly. On files. |
Thanks for the info, @dragotin. That helps me understand the current behaviour better. My using 'touch' was merely to demonstrate how simple reproducing the problem can be. I've reproduced changing the mtime on the command line without using touch - simply by copying/overwriting files. This covers #5231 as well as this issue's original submitter: http://sprunge.us/bbGB |
Well, copying a file over changes it, right? And as said, calculating the MD5 on the client if the contents really changed is something we can discuss. (Hint: A specific feature request describing exactly that would help). #5231 has a different cause which I will document there in a minute. |
Can we think of any specific issues with the clients calculating the hashes while the server simply "takes note" of the hashes fed to it by the clients? In that case the server doesn't check/verify the hash but simply records it as a value that the other clients can use as a verification. |
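A rough sketch of that division of labour, with an entirely made-up header name (this is not ownCloud's WebDAV API): the client computes the checksum and sends it alongside the upload, and the server merely records the value for other clients to compare against.

```python
import hashlib
import requests  # third-party HTTP client, used only to keep the sketch short


def upload_with_checksum(path: str, url: str, auth) -> None:
    with open(path, "rb") as f:
        data = f.read()
    checksum = hashlib.sha256(data).hexdigest()
    # "X-Content-SHA256" is a made-up header name: the point is that the server
    # only records the value; it never has to read the file to verify it.
    resp = requests.put(url, data=data, auth=auth,
                        headers={"X-Content-SHA256": checksum})
    resp.raise_for_status()
```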
@zatricky proposed:
I think that if we do that, we reverse the situation, but we still have a problem. These hashes would be stored in a database on the server, and if this DB is corrupted for any reason, then the client will decide to upload everything. Both sides (client and server) must be able to rebuild the DB on their own, without downloading/uploading everything. But perhaps I misunderstood you? |
@dragotin proposed:
I assume you think that owncloud/client#110 isn't precise enough? In this issue, @danimo wrote:
So we're in a kind of bad loop :-( Which is the egg and which is the chicken? ;) I can open a new issue if you think that's better: I would open it in core, not in mirall. But before that, I would like to summarize our last discussions and your explanation of the sync algorithm somewhere. Is there already a wiki on it? Or a beginning of an explanation? Should I start one? (In that case, which template could I use?) Thanks! |
@etiess wrote:
It is true that this could still be an issue. If the clients can simply point out "Hey, my etag is still the same, here's the right hash" then that eliminates the upload. HOWEVER, we would still have the same situation if the etags were corrupted on the server. Either way, the fix isn't really intended to deal with corruption (which should be a rarity) and, in the worst-case implementation, only a single client would need to re-upload the content. None of the other clients would need to re-download the content. This would already be a major improvement over the current status. |
Why should it be more common to have a problem with the client's local database than with the database on the server? Servers crash or have problems too, don't they? |
It should be rare on both sides - but it is no problem for the client to recalculate the hash. |
OK, so a solution could be to prioritize the implementation of hash calculation on the client side, and then implement it on the server side later. I still believe that hash calculation demands fewer resources than re-downloading, but perhaps I'm wrong. @dragotin, what do you suggest concerning my proposal above about the new issue and the wiki? |
@etiess: I also see hash calculation as being much less of a burden than re-up/downloading. The client cpu is faster than the network in 99.9% of use-cases (disk shouldn't be compared as it is used for local hashing as well as re-up/download anyway). Network usage is also costly for some. Assuming support from the devs, my proposal going forward is as follows:
Figuring out #5305 is a good step (and perhaps more urgent) in solving some of the issues mentioned here but won't cover everything unfortunately. This isn't a small problem/fix and so deserves some forethought. |
Random gem: Amazon S3 apparently supports getting MD5 hashsums without having to download the content. This of course doesn't cover all back-end cases, and also isn't necessarily easy to get at, depending on how the storage is mounted. |
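For the S3 case, the object's ETag can be fetched with a HEAD request, e.g. via boto3; note that the ETag equals the MD5 of the content only for objects that were not uploaded in multiple parts, so this is a sketch of a special case rather than a general solution.

```python
import boto3  # third-party AWS SDK


def s3_md5_if_available(bucket: str, key: str):
    """HEAD the object; for non-multipart uploads the ETag is the hex MD5 of the body."""
    s3 = boto3.client("s3")
    etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    # Multipart uploads get an ETag like "<hash>-<parts>", which is not a plain MD5.
    return None if "-" in etag else etag
```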
Hello, Well, there's still a debate around MD5, hashsums, ... But concerning this issue in particular, carrying data on a USB disk, I think it has been solved now and should be closed (see #5231). @dragotin, @danimo, @karlitschek, what do you think? Which issue should be used to continue the discussion about hashing? |
Actually, I'm not aware of any problems with the current approach, so there is no need to do expensive hashing, as already discussed many times. |
And just to add one more comment. We worked on this issue and we hope that it is fixed with ownCloud 6. But this needs testing and confirmation. |
@karlitschek I'm just suggesting here that this issue #523 could be closed, as it has been solved by @dragotin in #5231. But if you think it needs testing and confirmation in OC6, I understand. The debate around hashing is another question for me. |
That's actually a very good point. Let's close it now. We can always reopen it if there is a new reproducible bug. Thanks |
[See forum posting @ http://forum.owncloud.org/viewtopic.php?f=3&t=5612]
I have a home OC server (4.5.2) up and running on a Linux box.
I took a copy of my parents' photos (13 GB of data) onto a USB disk, took the disk home, and then copied the photos (using my local network) onto the OC server.
I then set up a sync between the photos directory on my parents' PC and the copy on the OC server at my house.
But the client on my parents' PC is a bit stupid and clearly doesn't check the actual file contents, so it starts to copy all the photos back across to the server.
This is a completely pointless operation.
Surely an md5sum (or similar hash calculation) should be performed to determine if a file copy is required. This is how rsync works under Linux.