Normalization of unicode characters causing duplicate files #6338
Comments
@ogoffart Hmm. I wonder whether the normalization happens client-side or server-side.
I can verify this is a problem. When I create a file 'Amélie' (which is intended to have the 'e' as two code points, created like this in python3: `print(b'Ame\xcc\x81lie'.decode())`), the client uploads it and triggers a move to the normalized file path (where the accented e is a single code point) on the next sync run. The PUT that is sent to the server does contain the unnormalized path on Linux (I see normalization calls on OSX though?)
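For readers who want to see the difference between the two forms of the name, here is a minimal sketch using only Python's standard `unicodedata` module. The variable names are illustrative; the byte string is the one from the comment above.

```python
import unicodedata

# The name as created in the comment above: 'e' followed by U+0301 (combining acute accent).
decomposed = b'Ame\xcc\x81lie'.decode()

# The same name after NFC normalization: the accented e becomes the single code point U+00E9.
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), len(composed))  # 7 code points vs. 6 code points
print(decomposed.encode())             # b'Ame\xcc\x81lie'
print(composed.encode())               # b'Am\xc3\xa9lie'
print(decomposed == composed)          # False: different strings, so different file names
```

Both strings render the same way, but they are different byte sequences, which is why a filesystem that does not normalize treats them as two distinct files.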
We normalize on OSX because the filesystem normalizes there, and the ownCloud server usually uses the other normalization. On Windows and Linux, we don't normalize. Maybe we should apply the same rules as we do for case-insensitive filesystems.
@ogoffart Yes, although that's not sufficient. For case-insensitive filesystems looking up 'Foo' and 'foo' will end up pointing to the same file. The unnormalized and normalized file name point to different files though. We definitely need to keep the original name around, and potentially use the normalized form for some comparisons and lookups.
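One way to picture "keep the original name, compare on the normalized form" is the sketch below. It is not the client's code; `nfc_key`, `group_by_normalized` and `find_collisions` are hypothetical helpers invented for illustration, and NFC is assumed as the comparison form.

```python
import unicodedata
from collections import defaultdict

def nfc_key(name: str) -> str:
    """Comparison key only; the original name is never rewritten."""
    return unicodedata.normalize('NFC', name)

def group_by_normalized(names):
    """Map each NFC key to the list of original names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[nfc_key(name)].append(name)
    return dict(groups)

def find_collisions(names):
    """Return only the keys under which more than one distinct original name exists."""
    return {key: originals
            for key, originals in group_by_normalized(names).items()
            if len(originals) > 1}

names = ['Ame\u0301lie', 'Am\u00e9lie', 'foo']
collisions = find_collisions(names)
print(len(collisions))  # 1: the two spellings of the accented name collide
```

Unlike the case-insensitive situation, both colliding names can exist side by side on Linux or Windows, so the normalized key can only flag a conflict; it cannot decide which file is meant.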
Please check how Dropbox handles it. Duplicated files are better than data loss in the first place ... it's a hard challenge to solve. @ckamm it sounds from your description like the file was moved (renamed) and not duplicated?!
@hodyroff Yes, in my test case the local file was renamed to the unicode-normalized name.
@sagamusix Is a macOS client involved? Maybe it syncs those files that are shared with an Apple computer? (APFS normalization: #5650)
No, just a Windows 7 client and Linux server.
CC @PVince81 @DeepDiver1975 regarding server side.
This will be a larger, invasive change that we can't fit into 2.5 anymore.
Testing on Ubuntu 18 with 2.5.0 beta1, I can confirm there is normalization going on. Using python3, enter these commands slowly, so that syncing starts in between:
The filesystem temporarily shows
Where the two are encoded as
After the sync completes, only the latter remains in the filesystem, with correct file size. The client activity log shows a sequence of five events:
EDIT: 20210316 jw: I don't think they should look different. If they do, then it is probably a bug in the font rendering code.
Here is what I think happens in @jnweiger's test. The file is originally written on the file system using the NFD form. The server, however, probably uses the other normalization, so when we do a PROPFIND on the second sync, we get only the NFC form. As a result, the new file name is downloaded, and the other one is removed as the client thinks it is gone. (In principle, there should be a rename instead.) Edit: fixed NFD vs. NFC
server normalizes to NFC, yes
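The sequence described above can be simulated in a few lines. This is only a sketch of the reasoning, not the sync engine's actual algorithm: `naive_diff` and `normalized_diff` are hypothetical functions, and the file name `Amélie.txt` is an assumed example.

```python
import unicodedata

def nfc(name):
    return unicodedata.normalize('NFC', name)

def naive_diff(local_names, remote_names):
    """Compare names byte for byte: returns (to_download, to_delete_locally)."""
    local, remote = set(local_names), set(remote_names)
    return sorted(remote - local), sorted(local - remote)

def normalized_diff(local_names, remote_names):
    """Compare names under NFC: returns (renames, to_download, to_delete_locally)."""
    local_by_key = {nfc(name): name for name in local_names}
    renames, downloads = [], []
    for remote_name in remote_names:
        local_name = local_by_key.pop(nfc(remote_name), None)
        if local_name is None:
            downloads.append(remote_name)              # genuinely new on the server
        elif local_name != remote_name:
            renames.append((local_name, remote_name))  # same file, different normalization
    return renames, downloads, sorted(local_by_key.values())

local = ['Ame\u0301lie.txt']    # written locally in NFD
remote = ['Am\u00e9lie.txt']    # listed by the server in NFC

print(naive_diff(local, remote))       # one download plus one local delete
print(normalized_diff(local, remote))  # a single rename, nothing downloaded or deleted
```

The naive comparison reproduces the observed behaviour (the NFC name is downloaded and the NFD name is removed), while comparing under NFC yields the rename that would be expected in principle.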
@ogoffart Shall we move this to 2.6.1?
@guruz I think so, we don't even have an approach for dealing with the normalization that happens yet.
Expected behaviour
Normalization of composited unicode characters should not happen, or at least not cause duplicate files.
Actual behaviour
I synchronized a file from client to server, and the server decided to re-download the file. Now I had two seemingly identically-named files in the same folder, but what has happened became more obvious once I looked at the directory listing on the command line:
Apparently the original filename used a composited unicode character "í", which was probably normalized on the server. Then the client noticed that the server had a new file which it should download...
Server configuration
Operating system: Bananian (Debian Jessie)
Web server: Apache 2.4
Database: MySQL 5.5
PHP version: 5.6
ownCloud version: 10.0.4
Storage backend (external storage): Local
Client configuration
Client version: Version 2.5.0-nightly20180110 (build 8967)
Operating system: Windows 7
OS language: German
Installation path of client: Default