
Pointless re-sync of entire folder #523

Closed · mpfj opened this issue Nov 20, 2012 · 45 comments
mpfj commented Nov 20, 2012

[See forum posting @ http://forum.owncloud.org/viewtopic.php?f=3&t=5612]

I have a home OC server (4.5.2) up and running on a Linux box.

I took a copy of my parents' photos (13 GB of data) onto a USB disk, took the disk home and then copied the photos (using my local network) onto the OC server.

I then set up a sync between the photos directory on my parents' PC and the copy on the OC server at my house.

But the client on my parents' PC is a bit stupid and clearly doesn't check the actual file contents, and so starts to copy all the photos back across to the server.

This is a completely pointless operation.

Surely an md5sum (or similar hash calculation) should be performed to determine if a file copy is required. This is how rsync works under Linux.
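
For illustration, here is a minimal sketch of that rsync-style idea in Python. It assumes the server could expose a per-file checksum listing, which is hypothetical - ownCloud does not provide one today:

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a large photo collection never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(local_file: Path, server_checksums: dict) -> bool:
    """Skip the transfer when the server already holds identical content.

    server_checksums maps remote names to md5 hex digests; such a listing is
    hypothetical -- the ownCloud server does not expose one.
    """
    remote = server_checksums.get(local_file.name)
    return remote is None or remote != md5_of(local_file)
```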

ghost assigned dragotin Nov 20, 2012
blizzz (Contributor) commented Nov 20, 2012

Which client version?

mpfj (Author) commented Nov 20, 2012

Latest client for Windows (1.1.1)

DeepDiver1975 (Member) commented

Please reopen this issue within the mirall repo. Thx

danimo (Contributor) commented May 1, 2013

I am reopening this issue here, since without the server providing an MD5 sum, there is nothing we can do in the client.

danimo (Contributor) commented May 1, 2013

Note: we could at least have hash sums for the files that we have exclusive access to.

jancborchardt (Member) commented

So what’s the call on this one? Fixed? Important to work on? Please advise.

zatricky commented Sep 5, 2013

I'd say this is very important. Can we get appropriate labels put on this issue?

I had a similar issue where my server and my desktop/laptop etc. are using Dropbox. I want to have all that moved over to oC - but it's a little crazy that the client will want to re-upload and/or re-download everything. It's much cheaper (bandwidth and time-wise) to move everything over on all three platforms and have them all simply notice "oh, everything is okay, nothing to do here".

Ideally, the checksum should actually be an indexed value in the database - in fact it should probably even be the primary key used to identify content. I believe that the system already supports "move" operations (moving a file to another folder within owncloud without causing a deletion/re-upload) - but doing this would actually make supporting this concept trivial.

Please note, md5sum is a good starting point - but it would be much more appropriate to use a variety of checks and to use various cryptographic checksums to ensure that everything is consistent across all systems:‡

  • timestamp
  • filesize
  • checksum

If filesize and checksum matches but timestamp differs, then only a tiny change should be actioned (fixing the timestamp).
In any other case, an rsync-style differential upload/download should be actioned.

‡ All of this information should be stored in the client and server databases when files/folders are added to the repository. Recalculating hash values every time the client starts would be madness. Having this stored within the client database would also improve the "time to first sync" in the case of content having changed while the client wasn't running.
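
A rough sketch of the decision logic above, assuming a hypothetical per-file database entry that stores mtime, size and a checksum computed at indexing time (not recalculated on every sync run):

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class IndexedEntry:
    """What a client/server database row could hold per file (hypothetical schema)."""
    mtime: int       # epoch seconds
    size: int        # bytes
    checksum: str    # hex digest, computed when the file was indexed


def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def decide(local_path: str, remote: IndexedEntry) -> str:
    """Pick the smallest action that reconciles a local file with the remote entry."""
    st = os.stat(local_path)
    if int(st.st_mtime) == remote.mtime and st.st_size == remote.size:
        return "nothing"              # metadata matches: treat as unchanged
    if st.st_size == remote.size and sha256_of(local_path) == remote.checksum:
        return "fix-timestamp"        # content identical, only the timestamp differs
    return "differential-transfer"    # content differs: rsync-style delta transfer
```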

etiess commented Sep 12, 2013

Hello,

First of all, thanks to the dev for the amazing OwnCloud.

I would like to know how far you are with solving this type of issue?

I live and work in Africa: Internet access is slow, and most of the time data plans have limited download allowances. So it's very expensive to resync a folder with some GB of data on a new computer. With Dropbox, we just copy the files to a USB stick, and copy them to the new computer before synchronizing it. Then we don't need to download the data again and the sync works fine.

How is this possible with owncloud? Does the unique ID make this possible? When I copy my files, can I copy the unique ID too? If not, we are forced to download everything again, which is very uncomfortable for large amounts of data…

I tried to copy the whole folder, but it doesn't seem to work:

  • I copy the entire folder locally, with a new name
  • I set up a new folder sync with mirall (1.4.0 on Windows 8 / owncloud 5.0.11 on Ubuntu Server 13.04)
  • I launch the sync: it downloads everything again for the new sync :-( And it even created conflicted copies of each file :-( (and the conflict files don't have the same size, see attached picture)
    [attached screenshot: capture]

I should point out that I do not use owncloud in a standard way: I don't want to sync all my data locally, as is proposed during the initial setup of the client. So I deleted this first "sync pair", and created new pairs with individual folders.

I should also point out that, for testing purposes, I did this on a single computer, with the same client (which should sync the same folder on the server to two locations on the same computer).

A lot of issues have been opened on this subject, although I'm not sure they all have the same origin:
owncloud/client#110
owncloud/client#49
owncloud/client#779
http://forum.owncloud.org/viewtopic.php?f=14&t=15493&p=40791#p40791

Thanks for your help!

Etienne

bugsyb commented Sep 24, 2013

+1 for implementing hash-based verification of the content, as there's nothing like a voting system.

AykutCevik commented

Is there any workaround for now, until an md5-based solution is developed? Having big troubles due to resyncs...

etiess commented Oct 1, 2013

@AykutCevik For the moment unfortunately not.

You can follow owncloud/client#994 and post your logs to help the team.

zatricky commented Oct 7, 2013

My apologies for the excess comment-traffic - I'd intended to suggest using multiple hashsums but forgot to mention it.

I'd suggest using MD5 as a primary hashsum and using SHA256 or SHA512 as the second.

Though not a security issue, per se, I would not use md5sum alone for the reasons cited here:
https://en.wikipedia.org/wiki/MD5#Security

karlitschek (Contributor) commented

We maintain unique ids of the files on the server in the filesystem cache table and in the client sqlite database. A complete resync shouldn't happen unless the server or the client databases are changed or deleted somehow.
Any signs that this happened?

zatricky commented Oct 8, 2013

@karlitschek: The use case inferred here is where the indexed value is based on a hash function. Specifically, can the unique id identify a file based on the whole of its content or is it simply metadata that is independent of the actual file content?

Put another way: If I have two files with the same content, will their ids be identical?

In this case the answer must be yes, while also ensuring, beyond reasonable doubt, that we do not have two files with different content sharing the same id. (See http://git-scm.com/book/ch6-1.html#A-SHORT-NOTE-ABOUT-SHA-1)

Rsync takes advantage of hashing to reduce the amount of data transferred in a sync. However, the use case is very different, and rsync does not keep a database of files on hand. In our case we want to leverage the client and server databases keeping these indexed hashes on hand to ensure that we do not re-upload or re-download content needlessly.

I'm making a separate issue for consideration of full rsync-style synchronisation.

dragotin (Contributor) commented Oct 8, 2013

The ids are not related to the file content. They are just metadata, as you phrase it. Two identical files will not have the same etag. We do not calculate a content-based fingerprint because we have a multi-backend structure, which can make it very hard to read the whole file before syncing.

zatricky commented Oct 8, 2013

Ideally these hashes should be calculated by the client before the initial content upload. The server could have a scrub process to periodically verify these hashes (which would also help satisfy upgrade considerations). By storing these hashes in the database there will be little-to-no pre-sync I/O on the server. Post-sync the server should probably verify the hashes within a short time-frame - but that is a decision I leave to the devs.

In this use case of a "pointless re-sync", the only work the server will have to perform will be the SQL queries necessary to see if the hashes being sent by the client already exist in the database.
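
For example, the pre-sync work on the server could boil down to a single indexed lookup; the table layout below is a hypothetical illustration, not ownCloud's real filecache schema:

```python
import sqlite3

# Hypothetical server-side table: one row per stored file, checksum indexed so the
# lookup below is the only pre-sync work the server has to do.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE filecache_checksums ("
    " fileid INTEGER PRIMARY KEY,"
    " path TEXT NOT NULL,"
    " checksum TEXT NOT NULL)"
)
db.execute("CREATE INDEX idx_checksum ON filecache_checksums(checksum)")


def already_stored(checksum: str) -> bool:
    """Answer the client's 'do you already have this content?' with one indexed query."""
    row = db.execute(
        "SELECT 1 FROM filecache_checksums WHERE checksum = ? LIMIT 1",
        (checksum,),
    ).fetchone()
    return row is not None
```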

dragotin (Contributor) commented Oct 8, 2013

It's not that easy because:

  • Not all content is uploaded through ownCloud; imagine a samba share that is mounted into ownCloud. People silently add, remove and change files on it.
  • Because not all data changes go through the ownCloud server, each content based fingerprint would have to be verified at access time.

etiess commented Oct 8, 2013

Hello @dragotin

What are the backend structures which are not able to calculate hashes? Do Linux, Mac, Android, iPhone and Windows calculate hashes differently? Is there no way to make these calculations compatible?

If data changes don't go through the OC server, then it should be possible to verify the hashes periodically or manually. Or to verify them if other metadata changed (time, size, ...). Or to force any new data to go through the ownCloud server. In any case, this is an unusual way to use OC for me, and it should not compromise the core function of synchronization.

As you suggested to me on http://dragotin.wordpress.com/2013/09/11/after-the-1-4-0-owncloud-client-release/ , I tried again (with 5.0.12 and 1.4.1) to copy an entire folder from a synced computer to a new computer to be synced (including csync_journal.db). Then I configured the sync on the new computer. Everything was downloaded again :-(. All logs and csync_journal.db (before and after the sync) are available here: https://www.sugarsync.com/pf/D6476655_61894308_919677

I really think that this issue is much more important than the cases you mentioned. And I really think too that this issue would be solved using hash sums.

moscicki commented Oct 8, 2013

Hello,

I want to plug into this discussion. Would you consider hashing of file metadata (mtime, size) instead of the content? Such a hash could be used in the same way as the random etag, however it would have the advantage that you don't need to redownload files you already have. You could also trivially calculate such a hash on secondary mounted storage on the server. On top of that, if you lose the local state db then you can simply recreate it (maybe you don't even need it to store etags in this case).
And then, you could possibly also see if you can cut db use for etags on the server too. This could simplify things considerably and would make the system much more robust.
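
A minimal sketch of such a metadata-only etag; the exact fields hashed here (path, size, whole-second mtime) are my assumption, not anything ownCloud does today:

```python
import hashlib
import os


def metadata_etag(path: str) -> str:
    """Derive a deterministic id from metadata only -- no file content is read.

    This mirrors the suggestion above, not ownCloud's actual ETag scheme; the
    chosen fields are an assumption.
    """
    st = os.stat(path)
    token = f"{path}|{st.st_size}|{int(st.st_mtime)}"
    return hashlib.sha1(token.encode()).hexdigest()
```

Because it can be recomputed from the filesystem alone, a lost local state db could be rebuilt without transferring any data.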

However, it would require that mtimes be handled consistently on the clients - a client with a wrong clock could make a file appear on another client with an mtime/ctime in the future. But that would not affect sync correctness in any way.

BTW: how do you handle mimetype determination for secondary mounted storage on the server? Do you maybe already read the first 256K bytes?

What do you think?

kuba


zatricky commented Oct 8, 2013

Thank you, @dragotin. I have to agree with etiess on the point that, while these hurdles are not easy (and I do appreciate that fact), the win gained from implementing hashing is incalculable. If my tone comes off as aggressive, I apologise. I'm having a hard time getting ideas across. If I were a PHP developer I'd have had PoC patches in place within a weekend. Unfortunately my "good" dev/engineering skills are limited to SQL and bash. I'm an amateur when it comes to PHP. :-|

@etiess, the hash support issue does not appear to be any platform-specific problem. It appears to simply be this:
dragotin doth writ:

Not all content is uploaded through ownCloud; imagine a samba share that is mounted into ownCloud. People silently add, remove and change files on it.

This certainly adds a small obstacle. zatricky commented:

The server could have a scrub process to periodically verify these hashes (which would also help satisfy upgrade considerations)

With the above in mind, I don't see how it would be so hard to implement a "Have we got new files in here?"-type check/scrub on the server. PHP supports inotify* which could help support this feature with very little I/O except on a first-run verification. Checking for new files could even run once per minute, while the verifications could run once a day/once a week.

@moscicki lamented:

Would you consider hashing of file metadata (mtime, size) instead of the content?

I don't see how any performance issues/bandwidth waste would be mitigated by this. See below and please motivate further.

however it would have the advantage that you don't need to redownload files you already have

The ctime/filename don't take that much time/bandwidth/IO to look up. The issue with re-downloading the content is still there in the simplest of cases: 1. Have my desktop set up with oC. 2. Copy an 8GB file from my desktop to my laptop. 3. Touch and rename the file. 4. Add the laptop to oC. 5. Wait for Africa's Interwebs to catch up with a carrier pigeon.

RandolfCarter (Contributor) commented

@moscicki hashing of file modification date is definitely not a good option.
It would be exactly the same as considering it for comparison directly. And that has been used in previous versions of the client (< 1.1) and it didn't work properly - machine times can drift, not all file systems support the same resolution for those times (but I bet @dragotin or @danimo can tell you much more about that). And then also the times would have to be kept in sync; and to consider the times would not help anything for the use case of somebody wanting to set up sync for an existing big data folder; the modification dates would most certainly be different.

Considering the size might help, but only very little, because the size alone tells you pretty little (it isn't even a proper indicator of whether a file has changed; the only thing you can tell is that if the size changed, then the file has changed; but not the other way round). So in many cases you'd still have to check the file anyway.

To put it very clearly: The only safe way to tell whether a file has changed (or two files are different) is to consider the file content - e.g. by comparing hash sums.

tenacioustechie commented

+1 for hash based syncing.

Perhaps it could even be used where the files are uploaded through the sync client or web interface, with a fallback to etag and timestamp where that is not the case?

Git actually uses hashes on directory contents too: a hash of the list of file hashes in a directory, to understand when a directory changes. Using this kind of hashing means you check one hash (at the top of the tree) and you can tell if anything in the tree has changed. This makes server-to-client checks very easy; obviously a more recursive approach is required on the client directory being synced, given the sync client (I assume) doesn't get notified when a file gets changed.
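
Roughly like this simplified sketch (real Git trees also record file modes and use binary object ids, which is omitted here):

```python
import hashlib
from pathlib import Path


def file_hash(p: Path) -> str:
    h = hashlib.sha1()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_hash(directory: Path) -> str:
    """Hash the sorted (name, hash) pairs of the children, roughly like a Git tree.

    If the top-level hash matches, nothing underneath can have changed; only
    mismatching subtrees need to be walked further.
    """
    h = hashlib.sha1()
    for child in sorted(directory.iterdir(), key=lambda c: c.name):
        child_hash = tree_hash(child) if child.is_dir() else file_hash(child)
        h.update(f"{child.name}:{child_hash}\n".encode())
    return h.hexdigest()
```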

Identifying changes via hashing of the file contents is definitely a proven way to sync, which numerous other systems use (git, hg, etc., as well as other syncing systems I'm not aware of).

I'm no php dev either, and I'm sure it's no small task especially given the prior architectural decisions like Samba backends.

Thanks for all the countless hours of development, and to the people providing feedback and log files etc.

moscicki commented

@RandolfCarter:

I think an efficient way of calculating the ETAG, while keeping its required uniqueness properties, would provide such a great advantage to owncloud that it is worthwhile to investigate.

It would be exactly the same as considering it for comparison directly. And that has been used in previous versions of the client (< 1.1) and it didn't work properly - machine times can drift, not all file systems support the same resolution for those times (but I bet @dragotin or @danimo can tell you much more about that). And then also the times would have to be kept in sync; and to consider the times would not help anything for the use case of somebody wanting to set up sync for an existing big data folder; the modification dates would most certainly be different.

Your remark is perfectly valid. However what you want is to compare ETAGs for equality (mtime1==mtime2) and NOT reason about time ordering (mtime1<mtime2). Now, it is not impossible that two different clients touch the file and the resulting mtime will be exactly the same (well, it depends on the resolution, agreed) - but on modern systems where the resolution is in ms - how likely is that (also combined with the file size check)? ETAGs must be unique to identify a file modification on the server, that's all.

@dragotin, @danimo: there is a problem with low mtime resolution on some filesystems for the sync client anyway (e.g. on FAT) - how do you detect local changes there?

You do not need to keep times in sync between the clients - you only need to set mtime of the files consistently when file changes are propagated from the server. Of course there is a side-effect: if your clocks are too skewed then files fetched from the server may have mtime in the future. If I upload an existing big data folder for sync - that's fine, I see no problem.

Considering the size might help, but only very little, because the size alone tells you pretty little (it isn't even a proper indicator of whether a file has changed; the only thing you can tell is that if the size changed, then the file has changed; but not the other way round). So in many cases you'd still have to check the file anyway.

To put it very clearly: The only safe way to tell whether a file has changed (or two files are different) is to consider the file content - e.g. by comparing hash sums.

That's completely clear.

Do you use hash sums to detect local changes on the sync client? I bet not. You take another approach which is "good enough". I would investigate if the same could not be done to ETAG in general.

On March 28, 2013 on owncloud@kde.org mailing list I asked this question already:

"""
I am looking for a complete and up-to-date reference which describes the intended synchronization model in ownCloud. Do you know if there is one (apart from the ocsync source code)? In particular, what happens if time is not up-to-date on all clients, or if the clock on a client is manually adjusted (into the future or into the past) or if a clock drifts in time. This is normal in heterogeneous distributed environments.
"""

In other words: a short description of a conceptual model of sync in owncloud would also allow to get useful feedback from others - there are tons of smart people out there who would certainly suggest some smart ideas.

kuba

karlitschek (Contributor) commented

Let me explain why we don't use a hash at the moment.
The ETAG is a unique id of a file. From a client or sync algorithm perspective this is exactly the same as a hash. If the ETAG is the same then it is the same file; if it's different then it is a different file. Just like a hash. So if this assumption is true then the syncing should work in exactly the same way as with a hash.

In the current implementation the ETAG is calculated using the metadata of a file, like mtime, name, ... The reason is performance. ownCloud can be used with petabytes of storage. Some of it could be accessed and changed independently of ownCloud. Just look at the external filesystem features as an example. You can mount your huge S3, FTP, CIFS, Dropbox, ... storage into ownCloud. If we wanted to calc hashes for every file then we would have to download every single file at every sync run to check if the hash/content of the file is the same. This is obviously not possible. Because of that we only look at the metadata for the ETAG.
I really think that this should be enough. I'm still waiting for a real-life example where a file is changed in a directory but still has the same name, size, mtime, .... This really shouldn't happen.
If there are sync problems then there must be a different reason for them, which we have to debug and fix.

zatricky commented

@moscicki said:

Do you use hash sums to detect local changes on the sync client? I bet not. You take another approach which is "good enough". I would investigate if the same could not be done to ETAG in general.

Etags are important but cannot resolve the problem of needlessly re-uploading/downloading content which has already been manually/externally synchronised.

Re ensuring mtime comparisons stay in sync, there is no easy answer. Instead, I think oC bypasses this problem by only considering mtime at the time of the initial upload/download. This is, I imagine, why the etags were put in place. We don't need to worry about comparing mtime between client and server as long as:

  • server's database mtime matches the server's local filesystem mtime
  • client's database mtime matches the client's local filesystem mtime
  • client's database etag matches the server's database etag

HOWEVER, if the client database has a different mtime to its filesystem, it knows the content might have changed. The client then regenerates an etag and triggers an upload (which might be undesired).
@dragotin / @danimo @karlitschek - I'd appreciate any comment on the accuracy of above. ;)

The above brings about the scenario where a file is touched but the content is not changed. The mtime being updated triggers an unnecessary upload. The other problem we have (and the reason this bug/issue exists) is where we synchronise content externally, the etag does not exist on the client database, and there is currently no way to tell the client that the content it has is identical to the content already on the server. A hash is the only way to rectify this behaviour.

With a hash, we see the mtime has changed, triggering a recalculation of the hash. We see that the content has not changed and we do not pointlessly re-upload the arbitrarily large file. The only steps we might still take, depending on dev decisions, would be:

  • tell the server to update the timestamp
  • tell the server the etag has been updated (might be necessary to trigger mtime updates on all clients)

In other words: a short description of a conceptual model of sync in owncloud would also allow to get useful feedback from others - there are tons of smart people out there who would certainly suggest some smart ideas.

+1. It would be helpful to have a reference/concept document, even if it is not necessarily easy to read.

zatricky commented

@karlitschek - Ah, thanks. That makes a huge difference in those use cases.

I guess the next question is regarding @danimo's comment earlier in the thread:

Note: we could at least have hash sums for the files that we have exclusive access to.

I don't see a simple way to automatically differentiate between "local" storage (SAN/local disk/RAID) and "remote" storage (S3/FTP/etc).

etiess commented Oct 10, 2013

Thank you @karlitschek for your explanation: we understand the advantages of etag and local db. But owncloud/client#994 shows that it is not robust enough, and that a problem with the DB would cause massive download.

I think @zatricky points out a good idea in the case where the etag has changed (for a good reason, or because of a corruption of the DB):

With a hash, we see the mtime has changed, triggering a recalculation of the hash. We see that the content has not changed and we do not pointlessly re-upload the arbitrarily large file

Is it worth considering both local DB/etag AND (if the etag has changed) hash?

moscicki commented

Do you use hash sums to detect local changes on the sync client? I bet not. You take another approach which is "good enough". I would investigate if the same could not be done to ETAG in general.

Etags are important but cannot resolve the problem of needlessly re-uploading/downloading content which has already been manually/externally synchronised.

@zatricky:

Etags which can be calculated can solve your problem because you don't care if you lose the local sync db. I understand Frank's reasons for not hashing the content in the general case. It may also put extra load on the clients (which already occasionally tend to consume 100% CPU). My point is that if we can calculate etags which are unique enough based on the metadata, both you and Frank may be happy (to some extent at least ;-)). And I would be happy too.

@karlitschek:

Optimization of specific cases is a different story - we are considering a storage backend which does content checksums automatically for us. Owncloud attempts a beautiful thing with a very generic framework - it would be ideal if the framework also allowed handling particular setups optimally, taking advantage of capabilities available at lower levels. Also for storage which allows extended attributes. This also applies to using owncloud with a local disk backend, assuming that no one else writes into it. Otherwise we will have to live with the least common denominator of all possible use-cases. I know this is NOT easy but asymptotically IMO this framework probably needs to go in this direction somehow.

Re ensuring mtime comparisons stay in sync, there is no easy answer. Instead, I think oC bypasses this problem by only considering mtime at the time of the initial upload/download. This is, I imagine, why the etags were put in place. We don't need to worry about comparing mtime between client and server as long as:

server's database mtime matches the server's local filesystem mtime
client's database mtime matches the client's local filesystem mtime
client's database etag matches the server's database etag
HOWEVER, if the client database has a different mtime to its filesystem, it knows the content might have changed. The client then regenerates an etag and triggers an upload (which might be undesired).
@dragotin / @danimo @karlitschek - I'd appreciate any comment on the accuracy of above. ;)

This is also how I understood it - but I am a mere user ;-)

However I just did an experiment which shows that mtimes ARE propagated between the clients (linux) albeit in a way which cannot be used for reliable hashing:

  1. created file as client A:

    File: `etag/x'
    Size: 0 Blocks: 0 IO Block: 4096 regular empty file
    Device: 901h/2305d Inode: 25035386 Links: 1
    Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2013-10-11 06:53:51.313837559 +0200
    Modify: 2013-10-11 06:53:51.313837559 +0200
    Change: 2013-10-11 06:53:51.313837559 +0200

  2. synced to the server (same physical host)

  3. stat the file stored on the server:

    File: `/boxstorage/etag/files/x'
    Size: 0 Blocks: 0 IO Block: 4096 regular empty file
    Device: 901h/2305d Inode: 27264315 Links: 1
    Access: (0644/-rw-r--r--) Uid: ( 48/ apache) Gid: ( 48/ apache)
    Access: 2013-10-11 06:53:51.000000000 +0200
    Modify: 2013-10-11 06:53:51.000000000 +0200
    Change: 2013-10-11 06:54:15.324439746 +0200

  4. synced as client B (same physical host) - mtime is set as on the server:

    File: `etag-2/x'
    Size: 0 Blocks: 0 IO Block: 4096 regular empty file
    Device: 901h/2305d Inode: 25035391 Links: 1
    Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2013-10-11 06:53:51.000000000 +0200
    Modify: 2013-10-11 06:53:51.000000000 +0200
    Change: 2013-10-11 06:55:11.403174968 +0200

  5. and further on, upon changes and syncs the mtime propagates between the clients with 1s mtime resolution

So as one sees in the example above mtime is propagated but not consistent - client A mtime is not the same as client B mtime. That's why it cannot be relied upon and hashed effectively. Not sure if there is an easy way out in all cases because it needs to support various filesystem limitations on the client side. However, IMO, if a filesystem has a capability then it should be exploited.

The above brings about the scenario where a file is touched but the content is not changed. The mtime being updated triggers an unnecessary upload. The other problem we have (and the reason this bug/issue exists) is where we synchronise content externally, the etag does not exist on the client database, and there is currently no way to tell the client that the content it has is identical to the content already on the server. A hash is the only way to rectify this behaviour.

OK, touching a file and getting a redownload - it is not optimal but somewhat acceptable for me, at least in the current version of owncloud. Losing the local state db, or getting it corrupted, and then getting a full redownload is really suboptimal.

kuba


dragotin (Contributor) commented

I haven't yet thought through the whole conversation but here are some facts:

  • There is no full re-download of data required if the client database is lost. If the client db is missing, the file name and mtimes are compared and if they are equal, the files are NOT downloaded again.
  • mtimes are propagated as epoch values with one-second precision if the system provides that; there are rumors that Windows file systems only provide two-second precision.
  • note that even if the system clocks of involved systems are not equal, the mtime of an individual file is not affected by that because that is just a number basically.

The idea to calculate the checksum of the file content on the client side, to avoid a re-upload if the file was not really changed but just touched, is something to consider. However, I wonder if this is not more an academic than a practical problem; most users probably don't use touch that regularly. On files.
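
To illustrate the first point above, a simplified model of that name/mtime comparison after the journal is lost; the remote_index structure is just an illustration, not the real csync journal format:

```python
import os


def reconcile_after_journal_loss(local_dir: str, remote_index: dict) -> list:
    """Return the remote names that still need downloading after the journal is lost.

    remote_index maps file names to epoch mtimes as reported by the server.
    """
    to_download = []
    for name, remote_mtime in remote_index.items():
        local_path = os.path.join(local_dir, name)
        if (not os.path.exists(local_path)
                or int(os.stat(local_path).st_mtime) != remote_mtime):
            to_download.append(name)
    return to_download
```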

zatricky commented

Thanks for the info, @dragotin. That helps understand the current behaviour better.

My using 'touch' was merely to demonstrate how simple reproducing the problem can be. I've reproduced changing the mtime in cmdline without using touch. This is simply done by copying/overwriting files. This covers #5231 as well as this issue's original submitter: http://sprunge.us/bbGB

dragotin (Contributor) commented

Well, copying a file over changes it, right? And as said, calculating the MD5 on the client if the contents really changed is something we can discuss. (Hint: A specific feature request describing exactly that would help).

#5231 has a different cause which I will document there in a minute.

zatricky commented

calculating the MD5 on the client if the contents really changed is something we can discuss

Can we think of any specific issues with the clients calculating the hashes while the server simply "takes note" of the hashes fed to it by the clients? In that case the server doesn't check/verify the hash but simply records it as a value that the other clients can use as a verification.

etiess commented Oct 11, 2013

@zatricky proposed:

the clients calculating the hashes while the server simply "takes note" of the hashes fed to it by the clients

I think that if we do that, we reverse the situation, but we still have a problem. These hashes would be stored in a database on the server, and if this DB is corrupted for any reason, then the client will decide to upload everything.

Both sides (client and server) must be able to rebuild the DB on their own, without downloading/uploading everything.

But perhaps I misunderstood you?

etiess commented Oct 11, 2013

@dragotin proposed:

calculating the MD5 on the client if the contents really changed is something we can discuss. (Hint: A specific feature request describing exactly that would help).

I assume you think that owncloud/client#110 isn't precise enough? In this issue, @danimo wrote:

waiting for #523 to be resolved.

So we're in a kind of bad loop :-( Who is the egg, who is the chicken? ;)

I can open a new issue if you think it's better: I would open it in core, not in mirall. But before that, I would like to summarize our last discussions and your explanation of the sync algorithm somewhere. Is there already a wiki on it? Or the beginning of an explanation? Should I start one? (In that case, which template could I use?)

Thanks!

zatricky commented

@etiess wrote:

I think that if we do that, we reverse the situation, but we still have a problem. These hashes would be stored in a database on the server, and if this DB is corrupted for any reason, then the client will decide to upload everything.

It is true that this could still be an issue. If the clients can simply point out "Hey, my etag is still the same, here's the right hash" then that eliminates the upload. HOWEVER, don't we still have the same situation anyway if the etags are corrupted on the server?

Either way, the fix isn't really intended to deal with corruption (which should be a rarity) and, in the worst case implementation, only a single client would need to re-upload the content. None of the other clients would need to re-download the content. This would already be a major improvement over the current status.

etiess commented Oct 11, 2013

Why should it be more frequent to have a problem with the local database of the client than with the database on the server? Servers do crash too or do have problems, don't they?

zatricky commented

It should be rare on both sides - but it is no problem for the client to recalculate the hash.

etiess commented Oct 11, 2013

OK, so a solution could be to consider as a priority the implementation of hash calculation on client side. And then, to implement it on server side later.

Because I still believe that hash calculation demands fewer resources than re-downloading, but perhaps I'm wrong.

@dragotin what do you suggest concerning my proposal above about the new issue and the wiki?

zatricky commented

@etiess: I also see hash calculation as being much less of a burden than re-up/downloading. The client cpu is faster than the network in 99.9% of use-cases (disk shouldn't be compared as it is used for local hashing as well as re-up/download anyway). Network usage is also costly for some. Assuming support from the devs, my proposal going forward is as follows:

  1. Implement server-side db/api support for content-hashing
  2. Implement optional client-side support for content-hashing and client-side support for hash-based decision-making
  3. Implement optional server-side content-hashing

Figuring out #5305 is a good step (and perhaps more urgent) in solving some of the issues mentioned here but won't cover everything unfortunately. This isn't a small problem/fix and so deserves some forethought.

zatricky commented

Random gem: Amazon S3 apparently supports getting MD5 hashsums without having to download the content. This of course doesn't cover all back-end cases and also isn't necessarily easy to get to, depending on how the storage is mounted.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
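
For example, with boto3 the ETag can be read via a HEAD request without fetching the body. Note the ETag equals the MD5 only for single-part, non-KMS-encrypted uploads; multipart ETags contain a '-' and have to be treated as unknown:

```python
import boto3

s3 = boto3.client("s3")


def remote_md5(bucket: str, key: str):
    """Fetch the ETag with a HEAD request -- no object body is transferred.

    Returns None ('unknown') for multipart ETags rather than a misleading value.
    """
    etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    return None if "-" in etag else etag
```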

etiess commented Nov 21, 2013

Hello,

Well, there's still a debate around MD5, hashsums, ...

But concerning this issue in particular, carrying data on a USB disk, I think it has been solved now, and should be closed (see #5231).

@dragotin, @danimo, @karlitschek what do you think? Which issue should be used to continue the discussion about hashing?

karlitschek (Contributor) commented

Actually I'm not aware of any problems with the current approach, so there is no need to do expensive hashing, as discussed many times already.
Please point us to a reproducible issue here then we can fix it. Thanks.

karlitschek (Contributor) commented

And just to add one more comment. We worked on this issue and we hope that it is fixed with ownCloud 6. But this needs testing and confirmation.

etiess commented Nov 21, 2013

@karlitschek I'm just suggesting here that this issue #523 could be closed, as it has been solved by @dragotin in #5231. But if you think it needs testing and confirmation in OC6, I understand.

The debate around hashing is another question for me.

karlitschek (Contributor) commented

That's actually a very good point. Let's close it now. We can always reopen it if there is a new reproducible bug. Thanks
