
Glob patterns and umlauts on HFS vs. APFS #845

Closed
chrmarti opened this issue Mar 5, 2018 · 8 comments
Labels
bug A bug. wontfix A feature or bug that is unlikely to be implemented or fixed.

Comments

@chrmarti
Contributor

chrmarti commented Mar 5, 2018

What version of ripgrep are you using?

ripgrep 0.8.1 (rev c8e9f25)
+SIMD -AVX

What operating system are you using ripgrep on?

OSX 10.12.6 and 10.13.3

If this is a bug, what are the steps to reproduce the behavior?

The glob patterns do not match umlauts because HFS uses a normalized encoding.

Create a corpus:

#include <stdio.h>

int main(void)
{
    /* 'a' followed by U+00FC (ü) in composed (NFC) UTF-8 form */
    FILE *f = fopen("a\xc3\xbc", "ab+");
    if (f) fclose(f);
    /* 'b' followed by 'u' + U+0308 in decomposed (NFD) UTF-8 form */
    f = fopen("b\x75\xcc\x88", "ab+");
    if (f) fclose(f);
    return 0;
}

Also create a third file through the command line for comparison:

touch cü

ls shows:

aü   bü   cü

If this is a bug, what is the actual behavior?

Running rg --files -g '*ü' shows no output on OSX 10.12.6 with HFS; on OSX 10.13.3 with APFS it shows:

cü
aü

If this is a bug, what is the expected behavior?

Ideally, on both versions of OSX we would get all three files as matches. The problem is that HFS normalizes all three variations of the filename to the 3-byte decomposed sequence ('u' followed by U+0308), whereas APFS leaves the representation of the filename as it is given.
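For reference, the two representations involved can be inspected with a short script. This is an illustrative sketch using Python's `unicodedata` module, not part of the original report:

```python
import unicodedata

composed = "\u00fc"        # ü as a single codepoint (what NFC produces)
decomposed = "u\u0308"     # 'u' + COMBINING DIAERESIS (what NFD, and HFS, produce)

# Visually identical, but different byte sequences in UTF-8:
print(composed.encode("utf-8"))    # b'\xc3\xbc'  (2 bytes)
print(decomposed.encode("utf-8"))  # b'u\xcc\x88' (3 bytes)

# Normalization maps one form to the other:
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```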

Found via microsoft/vscode#43691.

/cc @joaomoreno

@BurntSushi
Owner

Yeah ripgrep doesn't handle normalization at all. From your example, it looks like HFS specifically used the decomposed normal form. This doesn't just impact glob matching, but actual search as well. In particular, if you search for a composed Unicode codepoint, you won't find files that contain the same glyph in decomposed form and vice versa. This is because ripgrep doesn't really know which normal form to use.

In this case, if we know that HFS always uses a specific type of Unicode normal form, then in theory, we could do the following:

  • Detect if we're searching an HFS system.
  • If so, apply the same normalization procedure to the glob itself that is applied to file paths on the HFS system.

This is fraught with complications and unknowns. Some of them are significant:

  • If you write [X] as a character class where X is a composed codepoint, then decomposing that is basically not possible without explicit support from the regex engine, which it does not have (and it is deeply non-trivial to add). This is because there is a fundamental assumption that character classes match exactly one codepoint.
  • Even if we could translate the glob into decomposed normal form, the detection of HFS isn't clear to me. I don't know the first thing about it. In particular, this detection must be granular. That is, a glob converted to normal form can only be used on file paths on an HFS file system. If you search across additional HFS file systems, then you'd need to compile an additional glob.

In other words, the only real way to solve this problem is to build normalization support into the regex engine itself. This is basically a rewrite and drastically alters the performance profile of regular expressions. If you read UTS#18, you can see that the Unicode people are well aware of this, which is why features like canonical equivalence are pushed into "extended" level 2 support, which very few regex engines support at all.

Unless there is a simple fix I'm missing---perhaps even one that partially fixes the issue---then I suspect this is a wontfix bug sadly.

@BurntSushi BurntSushi added the bug A bug. label Mar 5, 2018
@BurntSushi
Owner

Unless there is a simple fix I'm missing

One possibility is to decompose codepoints in a glob that aren't part of a character class. It would be a frustrating half-measure, and it would still be problematic with respect to needing to detect the HFS file system. One possible way to solve that would be to expose a flag that enables the normalization pass unconditionally, for when you know you're only searching an HFS file system. This wouldn't work if you search across multiple file systems, and a flag like that seems like poor UX to me.
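That half-measure could look roughly like the following hypothetical sketch, which NFD-normalizes only the glob text outside `[...]` character classes. The bracket parsing here is deliberately naive (real glob syntax has escapes and more edge cases); it is not ripgrep's implementation:

```python
import unicodedata

def decompose_glob_literals(glob: str) -> str:
    # NFD-normalize literal glob text, but keep [...] classes verbatim,
    # because decomposing inside a class changes its meaning.
    out, i = [], 0
    while i < len(glob):
        if glob[i] == "[":
            j = glob.find("]", i + 1)
            if j == -1:
                out.append(glob[i:])  # unterminated class: keep as-is
                break
            out.append(glob[i:j + 1])
            i = j + 1
        else:
            out.append(unicodedata.normalize("NFD", glob[i]))
            i += 1
    return "".join(out)

print(decompose_glob_literals("*\u00fc"))   # literal ü is decomposed
print(decompose_glob_literals("[\u00fc]x")) # class left untouched
```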

@chrmarti
Contributor Author

chrmarti commented Mar 6, 2018

Thanks for the great analysis! Reading UTS#18, this seems like a more general problem than just HFS, because other filesystems don't bother with normalization at all. UTS#18 suggests using NFD (or NFKD) for regular-expression matching (HFS uses NFD).

Maybe the regex engine could have a mode where it normalizes all inputs (the glob and the filename to match against) and character classes to NFD. That would also deal with the case where a filesystem without normalization (which seems to be most of them, apart from HFS) ends up with a mixture of representations of the same visual character. Maybe that would leave the regex engine's core untouched, but I'm just guessing.
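The normalize-both-sides idea can be sketched outside any real engine. This is an illustrative Python sketch in which `fnmatch` stands in for a glob engine; as the discussion below notes, it is only safe for literal glob text, not character classes:

```python
import fnmatch
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def glob_match(pattern: str, name: str) -> bool:
    # Normalize both the pattern and the candidate filename to NFD
    # before matching, so either representation of ü matches.
    return fnmatch.fnmatchcase(nfd(name), nfd(pattern))

# A composed pattern matches a decomposed filename, and vice versa:
print(glob_match("*\u00fc", "au\u0308"))  # True
print(glob_match("*u\u0308", "b\u00fc"))  # True
```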

As an aside: We noticed that while OSX's APFS does not change the filename's representation, it still does not allow you to create two files with filenames that have the same normalized representation. So it is aware of the equivalence.

@BurntSushi
Owner

BurntSushi commented Mar 6, 2018

Maybe the regex engine could have a mode where it normalizes all inputs (glob and filename to match against) and character classes to NFD.

This is basically what I suggested above, and is exactly problematic for the reasons stated, specifically with regard to character classes. Note that UTS#18 S2.1 (on canonical equivalents) is specifically suggesting that the end user construct their pattern such that it uses NFD, likely for exactly the reasons I mentioned. Unicode hints at this with the last bullet point, which is critical: "Applying the matching algorithm on a code point by code point basis, as usual."

Translating the input to NFD is definitely not something that should be in the regex engine itself, mostly because it doesn't really confer any advantages. It would be something that ripgrep would do as a pre-processing step outside the regex engine if we were to pursue that path. As I hinted above, this would drastically alter the performance profile of ripgrep. Unicode normalization is decidedly not cheap.

this seems like a more general problem than just with HFS because other filesystems don't bother with normalization at all

Indeed! On Unix (and probably Windows too), it is entirely possible to create a file that contains a composed codepoint in its name, and then use the decomposed codepoint in a glob (and vice versa), and that would result in a match failure even though the text strings look the same to an end user. It is a frustrating UX, no doubt about it. Presumably HFS is trying to fix that.

As an aside: We noticed that while OSX's APFS does not change the filename's representation, it still does not allow you to create two files with filenames that have the same normalized representation. So it is aware of the equivalence.

Now that is interesting!

@chrmarti
Contributor Author

chrmarti commented Mar 6, 2018

Sounds good. I'll look into mitigating this on VS Code's side. We might get away with supplying two exclusion patterns when the NFD form differs from the user input.
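The two-pattern workaround might be sketched like this (a hypothetical helper for illustration, not VS Code's actual code):

```python
import unicodedata

def exclusion_variants(glob: str) -> list:
    # Return the user's glob plus its NFD form, deduplicated,
    # so both filename representations get excluded.
    nfd = unicodedata.normalize("NFD", glob)
    return [glob] if nfd == glob else [glob, nfd]

print(exclusion_variants("*\u00fc"))  # two variants: composed and decomposed
print(exclusion_variants("*.txt"))    # already normalized: one variant
```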

@BurntSushi
Owner

@chrmarti Aye. Just be careful! If you replace [ü] (where ü is composed) with [ü] (where ü is decomposed), then the latter will match either the u or the umlaut independently of one another.

If you just have a literal ü, then the normalization pass is fine either way because it's just a concatenation.
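That character-class pitfall is easy to demonstrate. The following is an illustrative sketch using Python's `re` module, not ripgrep's regex engine:

```python
import re
import unicodedata

u_composed = "\u00fc"                                    # ü as one codepoint
u_decomposed = unicodedata.normalize("NFD", u_composed)  # 'u' + U+0308

# A class built from the decomposed form contains TWO members,
# the letter 'u' and the combining diaeresis, matched independently.
decomposed_class = re.compile("[" + u_decomposed + "]")

print(bool(decomposed_class.fullmatch("u")))         # True: bare 'u' matches!
print(bool(decomposed_class.fullmatch("\u0308")))    # True: bare mark matches!
print(bool(decomposed_class.fullmatch(u_composed)))  # False: composed ü doesn't
```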

@BurntSushi BurntSushi added the wontfix A feature or bug that is unlikely to be implemented or fixed. label Mar 7, 2018
@maxnoe

maxnoe commented Jan 10, 2020

I came across this again today. It would be great if ripgrep could optionally normalize all input text.

In my case it was not filenames: the actual file contents were in a different normal form, and I had to search for both the decomposed and the composed form.

@BurntSushi
Owner

Nothing has changed since my comments above, so I don't see this happening. Sorry.
