Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix($shared-utils): Replace diacritics with String.normalize #1855

Merged

Conversation

larionov
Copy link
Contributor

@larionov larionov commented Sep 13, 2019

Summary

fix #1815

What kind of change does this PR introduce? (check at least one)

  • Bugfix
  • Feature
  • Code style update
  • Refactor
  • Docs
  • Build-related changes
  • Other, please describe:

If changing the UI of default theme, please provide the before/after screenshot:

Does this PR introduce a breaking change? (check one)

  • Yes (Possibly?)
  • No

If yes, please describe the impact and migration path for existing applications:

The slugs in the existing URL's will change, so the external links referencing existing pages might break.

The PR fulfills these requirements:

  • When resolving a specific issue, it's referenced in the PR's title (e.g. fix #xxx[,#xxx], where "xxx" is the issue number)

You have tested in the following browsers: (Providing a detailed version will be better.)

  • Chrome
  • Firefox
  • Safari
  • Edge
  • IE

If adding a new feature, the PR's description includes:

  • A convincing reason for adding this feature
  • Related documents have been updated
  • Related tests have been updated

To avoid wasting your time, it's best to open a feature request issue first and wait for approval before working on it.

Other information:
So diacritics has some errors in the replacements sets, and it wasn't updated for quite some time. I did a bit of research and looks like using String.normalize might work.
It is supported by node > 6 but not supported by IE, and the polyfill is huge, but,
@f3ltron, do I understand correctly that slugify util is only ever run on the server side and never in the browser? If so then it's not a problem then.

We can normalize the string (https://www.unicode.org/reports/tr15/#Compatibility_Equivalence_Figure) and in the process it will show accents as separate characters which can be removed with the regex (thanks @glebtv).

I think this will also help if, say, someone typed the links in the .md files manually, and they look the same but internally different - they will reference the same page anyway.

@larionov larionov changed the title fix($shared-utils): replaced diacritics with String.normalize (fix #1815) fix($shared-utils): replaced diacritics with String.normalize (#1815) Sep 13, 2019
@flozero
Copy link
Collaborator

flozero commented Sep 13, 2019

Nice job @larionov i will have a look when i got time !

@flozero
Copy link
Collaborator

flozero commented Sep 13, 2019

it should

fix #1815

Copy link
Collaborator

@kefranabg kefranabg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @larionov,

Thanks for this PR, seems great when reading tests.

I'm not an expert on this topic so I'll just ask questions for now:

  • /[\u0300-\u036F]/g this will remove all chars code from 768 to 879. Why this specific range? Are you sure this is correct?
  • Are you sure str.normalize('NFKD') is correct? Could you explain?

@larionov
Copy link
Contributor Author

larionov commented Sep 29, 2019

@kefranabg I'm not an expert either but:
First of all I'm not sure what is the whole purpose of removing the diacritics from the slug is, I assumed it's for the readability and browsers having trouble with some specific characters? But so far while testing I didn't see any browser having trouble with them.

At that point I thought that the purpose of this is to make sure the urls are unambiguous and it's easy to type it and make sure you have the same url if it was re-encoded somewhere else somehow.

For that reason I chose the NFKD decomposition:
https://www.unicode.org/reports/tr15/#Compatibility_Composite_Figure

It's purpose is exactly that:

Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors. The visual appearances of the compatibility equivalent forms typically constitute a subset of the expected range of visual appearances of the character (or sequence of characters) they are equivalent to. However, these variant forms may represent a visual distinction that is significant in some textual contexts, but not in others.

So it will make the ï look like i for example, and also extract all diacritics into separate characters.

300-36F is the range of the combining diacritics marks in the unicode table: https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

I got it from here: https://stackoverflow.com/questions/990904/remove-accents-diacritics-in-a-string-in-javascript#answer-37511463

Sorry for the late answer!

@about-code
Copy link

about-code commented Oct 7, 2019

I extended the tests with additional German Umlauts (besides ß) and found that their diacritics are still being replaced.

For example try

Term expected actual
Höhlen höhlen hohlen
Hütten hütten hutten
Häuser häuser hauser

@about-code
Copy link

PS: I am following this PR and its issue due to about-code/glossarify-md#27. As a workaround for users of glossarify-md I currently recommend to replace vuepress' internal slugger via vuepress' markdown.slugify config option. glossarify-md internally uses github-slugger, for example. Passing it to vuepress helps mitigating the issues reported for glossarify-md.

@kefranabg kefranabg changed the title fix($shared-utils): replaced diacritics with String.normalize (#1815) fix($shared-utils): Replace diacritics with String.normalize Oct 7, 2019
@kefranabg
Copy link
Collaborator

Looks good to me, but I'd like one more code review as I'm not 100% confident with this

@kefranabg kefranabg requested review from ulivz and bencodezen October 7, 2019 18:36
@about-code
Copy link

Looks good to me [...]

@kefranabg What has changed since me reporting that German diacritics are not handled correctly? In case it was a misunderstanding: please note that while I wrote, to have extended the tests I meant to have done so only locally to verify the PR. I haven't been pushing the changes yet, that's why tests still run green. But I don't think the current implementation is already sufficient with respect to unicode characters in general. @larionov I would push my changes with German characters, if you don't mind. This is going to make tests fail but I am hardly an expert on the topic, either, and don't know what the unicode character range would have to look like to cover all diacritics.

@larionov
Copy link
Contributor Author

larionov commented Oct 8, 2019

@about-code I think the only way to solve this problem is to make your package to use the same slugger as vuepress does, and you can actually import shared-utils from your project and use it there too.

The German characters difference is not the only difference between your slugger and this one and it doesn't make sense to make it equal character by character, it will break in the end anyway, sooner or later.

@about-code
Copy link

... but I am not sure what the exact purpose of the fix is: is this all about just the cyrillic letter or about changing the way diacritics are handled in general? If it's just about the cyrillic character then just forget about my comments.

@about-code
Copy link

about-code commented Oct 8, 2019

@larionov thanks for your quick response. Just saw it.

Edit:

I think the only way to solve this problem is to make your package to use the same slugger as vuepress does

Since the slugger in my project is a transitive dependency I can't do it this way - only the other way around. But I'm happy with the fact, that vuepress allows for replacing its slugger via its markdown.slugify config. Just joined the discussion, since it would have been even better, of course, if such replacement wouldn't be (or would less likely be) required anymore to use glossarify-md with vuepress. But that's an acceptable constraint.

❤️

@kefranabg
Copy link
Collaborator

@larionov sorry for the delay, could you resolve conflicts? @ulivz could you do an extra check before we merge this?

@osipxd
Copy link

osipxd commented Jul 29, 2020

Can it be merged? This PR has been open for almost a year and the problem it solves still exists.

@bencodezen
Copy link
Member

Hey @osipxd! Thanks for pinging this thread again. I noticed that there's a merge conflict on this since there's a removeDiacritics coming in the new slugify.ts. Did this solve the problem? If not we definitely need to merge this.

@osipxd
Copy link

osipxd commented Aug 3, 2020

I noticed that there's a merge conflict on this since there's a removeDiacritics coming in the new slugify.ts. Did this solve the problem?

removeDiacritics is the source of the bug - andrewrk/node-diacritics#32
But it looks like node-diacritics is abandoned and this bug will not be fixed.

@larionov larionov force-pushed the fix/replace-diacritics-with-string-normalize branch 2 times, most recently from d94c91f to b35e451 Compare August 19, 2020 04:10
larionov pushed a commit to larionov/vuepress that referenced this pull request Aug 19, 2020
@larionov larionov force-pushed the fix/replace-diacritics-with-string-normalize branch from b35e451 to d3c0664 Compare August 19, 2020 04:12
larionov pushed a commit to larionov/vuepress that referenced this pull request Aug 19, 2020
@larionov
Copy link
Contributor Author

larionov commented Aug 19, 2020

@bencodezen @osipxd @ulivz done

@larionov larionov force-pushed the fix/replace-diacritics-with-string-normalize branch from d3c0664 to 10716f6 Compare August 19, 2020 04:28
@bencodezen bencodezen merged commit a03e93d into vuejs:master Aug 19, 2020
@bencodezen
Copy link
Member

Thank you @larionov! Really appreciate our work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cyrillic "л" is replaced by Latin "n" in the anchor link
7 participants