Normalization of unicode characters causing duplicate files #6338

Open
sagamusix opened this issue Jan 30, 2018 · 16 comments

@sagamusix

Expected behaviour

Normalization of composited Unicode characters should not happen, or at least should not cause duplicate files.

Actual behaviour

I synchronized a file from client to server, and the client then decided to re-download the file from the server. Now I had two seemingly identically named files in the same folder, but what had happened became more obvious once I looked at the directory listing on the command line:

[image: composited]

Apparently the original filename used a composited Unicode character "í", which was probably normalized on the server. Then the client noticed that the server had a new file which it should download...

Server configuration

Operating system: Bananian (Debian Jessie)

Web server: Apache 2.4

Database: MySQL 5.5

PHP version: 5.6

ownCloud version: 10.0.4

Storage backend (external storage): Local

Client configuration

Client version: Version 2.5.0-nightly20180110 (build 8967)

Operating system: Windows 7

OS language: German

Installation path of client: Default

@ckamm
Contributor

ckamm commented Feb 1, 2018

@ogoffart Hmm. I wonder whether the normalization happens client-side or server-side.

@ckamm
Contributor

ckamm commented Feb 1, 2018

I can verify this is a problem. When I create a file 'Amélie' (which is intended to have the 'é' as two code points, created like this in python3: `print(b'Ame\xcc\x81lie'.decode())`), the client uploads it and triggers a move to the normalized file path (where the accented e is a single code point) on the next sync run.

The PUT that is sent to the server does contain the unnormalized path on Linux (I see normalization calls on OSX though?)
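
For reference, the two forms of the name can be compared with Python's standard `unicodedata` module. This is only a minimal sketch of the difference between the decomposed and composed spellings; the variable names are illustrative and nothing here is taken from the client code.

```python
import unicodedata

# 'Amélie' with the accent as a separate combining code point (decomposed, NFD).
decomposed = b'Ame\xcc\x81lie'.decode()

# The same name with the accented e as one precomposed code point (NFC).
composed = unicodedata.normalize('NFC', decomposed)

# The two strings differ in length and do not compare equal as raw strings.
print(len(decomposed), len(composed))
print(decomposed == composed)

# UTF-8 bytes of each form, as they would appear in a path.
print(decomposed.encode())
print(composed.encode())

# Normalizing back to NFD round-trips to the original spelling.
print(unicodedata.normalize('NFD', composed) == decomposed)
```

A byte-wise path comparison therefore treats these as two unrelated names, even though they render the same.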

@ckamm ckamm added the type:bug label Feb 1, 2018
@ogoffart
Contributor

ogoffart commented Feb 1, 2018

We normalize on OSX because the filesystem normalizes there, and the ownCloud server usually uses the other normalization.

On Windows and Linux, we don't normalize.
What happens here is that the server uses the other normalisation. (Or maybe another mac client)

Maybe we should apply the same rules as we do for case-insensitive file systems.

@ckamm
Contributor

ckamm commented Feb 1, 2018

@ogoffart Yes, although that's not sufficient. On case-insensitive filesystems, looking up 'Foo' and 'foo' will end up pointing to the same file. The unnormalized and normalized file names point to different files though.

We definitely need to keep the original name around, and potentially use the normalized form for some comparisons and lookups.
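
A rough sketch of that idea, assuming NFC is used as the comparison key: the index below keeps every original name untouched and only normalizes for lookups. `NameIndex` and `nfc` are hypothetical names for illustration, not existing client classes.

```python
import unicodedata


def nfc(name):
    """Return the NFC form of a name, used only as a comparison key."""
    return unicodedata.normalize('NFC', name)


class NameIndex:
    """Hypothetical index: keyed by NFC form, values are original names."""

    def __init__(self):
        self._by_key = {}

    def add(self, original_name):
        # A list is stored because differently normalized names can be
        # distinct files on disk at the same time.
        self._by_key.setdefault(nfc(original_name), []).append(original_name)

    def lookup(self, name):
        # Returns the original (unnormalized) names matching this name.
        return list(self._by_key.get(nfc(name), []))


index = NameIndex()
index.add(b'Ame\xcc\x81lie'.decode())              # name as written on disk
matches = index.lookup(b'Am\xc3\xa9lie'.decode())  # name as reported in NFC
print([m.encode() for m in matches])
```

Unlike a case-insensitive lookup, a key here can map to more than one real file, so the caller still has to decide what to do when `lookup` returns several names.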

@hodyroff

hodyroff commented Feb 2, 2018

Please check how Dropbox handles it. Duplicated files are better than data loss in the first place ... it's a hard challenge to solve. @ckamm from your description it sounds like the file was moved (renamed) and not duplicated?!

@ckamm
Contributor

ckamm commented Feb 2, 2018

@hodyroff Yes, in my test case the local file was renamed to the unicode-normalized name.

@michaelstingl
Contributor

@sagamusix Is a macOS client involved? Maybe those files are also synced or shared with an Apple computer? (APFS normalization: #5650)

@sagamusix
Author

No, just a Windows 7 client and Linux server.

@guruz
Contributor

guruz commented Feb 2, 2018

CC @PVince81 @DeepDiver1975 regarding server side.

@guruz guruz added this to the 2.5.0 milestone Feb 12, 2018
@ckamm
Contributor

ckamm commented Apr 24, 2018

This will be a larger, invasive change that we can't fit into 2.5 anymore.

@ckamm ckamm modified the milestones: 2.5.0, 2.6.0 Apr 24, 2018
@jnweiger
Contributor

jnweiger commented Aug 9, 2018

Testing on Ubuntu 18 with 2.5.0 beta1, I can confirm there is normalization going on.

Using python3, enter these commands slowly, so that syncing starts in between:

>>> f=open(b'Ame\xcc\x81lie'.decode(), 'w')
>>> f.write("lkhlhlkkl")
9
>>> f.close()

The filesystem temporarily shows

drwxr-xr-x  3 testy testy    4096 Aug  9 15:48  ./
drwx------ 64 testy testy   20480 Aug  9 15:48  ../
-rw-r--r--  1 testy testy       9 Aug  9 15:48  Amélie
-rw-r--r--  1 testy testy       0 Aug  9 15:48  Amélie

Where the two are encoded as

$ echo Amélie | xxd
00000000: 416d 65cc 816c 6965 0a                   Ame..lie.
$ echo Amélie | xxd
00000000: 416d c3a9 6c69 650a                      Am..lie.

After the sync completes, only the latter remains in the filesystem, with the correct file size.
Its name uses a two-byte é, instead of the original three-byte é (e plus combining accent).
They even look slightly different!

The client activity log shows a sequence of five events:
Upload 0 bytes, Download 0 bytes, Upload 9 bytes, Delete, Download 9 bytes.

EDIT: 20210316 jw: I don't think they should look different. If they do, then it is probably a bug in the font rendering code.
In my current Ubuntu Terminal, they look identical as expected.
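
To spot such pairs without relying on how the terminal renders them, a small helper can group names by their NFC form. This is a sketch only; `find_normalization_collisions` is a made-up helper, and the sample listing stands in for the output of something like `os.listdir()` on the sync folder.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Group names by NFC form and return only groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize('NFC', name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Sample listing: the NFD name, the NFC name, and an unrelated file.
listing = [
    b'Ame\xcc\x81lie'.decode(),
    b'Am\xc3\xa9lie'.decode(),
    'notes.txt',
]

collisions = find_normalization_collisions(listing)
for key, group in collisions.items():
    # Print raw UTF-8 bytes so the difference is visible regardless of font.
    print(key.encode(), [name.encode() for name in group])
```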

@ogoffart
Contributor

ogoffart commented Dec 4, 2018

Here is what I think happens in @jnweiger's test.

The file is originally written to the file system using the NFD form.
On the first sync, it is uploaded like that and given a file id, which is written to the database.

However, the server probably uses the other normalization, so when we do a PROPFIND on the second sync, we get only the NFC form. As a result, the file is downloaded under the new name, and the other one is removed because the client thinks it is gone. (In principle, there should be a rename instead.)

Edit: fixed NFD vs. NFC
Edit: Just tested on my machine; it is a RENAME, and the two files never exist at the same time.
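
The difference between the two outcomes can be illustrated with a toy comparison of a local listing against a remote listing. This is a simplified sketch, not the client's discovery logic: `naive_diff` and `normalized_diff` are hypothetical functions, file ids and the database are ignored, and local names are assumed to be unique after NFC normalization.

```python
import unicodedata


def nfc(name):
    return unicodedata.normalize('NFC', name)


def naive_diff(local_names, remote_names):
    """Compare raw names: differently normalized names look unrelated."""
    local, remote = set(local_names), set(remote_names)
    return {
        'download': sorted(remote - local),
        'remove_local': sorted(local - remote),
    }


def normalized_diff(local_names, remote_names):
    """Match names by NFC key, so a changed spelling shows up as a rename."""
    # Assumption: no two local names share the same NFC form.
    local_by_key = {nfc(name): name for name in local_names}
    renames, downloads = [], []
    for remote in remote_names:
        local = local_by_key.pop(nfc(remote), None)
        if local is None:
            downloads.append(remote)
        elif local != remote:
            renames.append((local, remote))
    return {
        'rename': renames,
        'download': downloads,
        'remove_local': sorted(local_by_key.values()),
    }


local = [b'Ame\xcc\x81lie'.decode()]   # NFD name written by the user
remote = [b'Am\xc3\xa9lie'.decode()]   # NFC name reported by the server

naive = naive_diff(local, remote)
matched = normalized_diff(local, remote)
print(naive)
print(matched)
```

With raw comparison the NFC name is treated as a new remote file and the NFD name as gone; with keyed comparison the same pair is reported as a single rename.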

@PVince81
Contributor

PVince81 commented Dec 4, 2018

The server normalizes to NFC, yes.

@guruz
Contributor

guruz commented Jun 4, 2019

@ogoffart Shall we move this to 2.6.1?

@ckamm
Contributor

ckamm commented Jun 4, 2019

@guruz I think so, we don't even have an approach for dealing with the normalization that happens yet.

@ckamm ckamm modified the milestones: 2.6.0, 2.6.1 Jun 4, 2019
@michaelstingl michaelstingl modified the milestones: 2.6.1, 2.6.2 Nov 26, 2019
@TheOneRing TheOneRing modified the milestones: 2.6.2, 2.9.0 Mar 17, 2021
@jnweiger
Contributor

@TheOneRing TheOneRing modified the milestones: 2.11, Backlog Feb 18, 2022