Filter by Unicode General_Category #35

jablko · 2020-12-17T23:07:02Z

Would you consider adding these changes to #25/custom-regex-filter? This PR isn't meant to supersede #25, it's a request to pull this PR into that PR. It builds on #25 but filters based on Unicode General_Category vs. Block/Sequence_Property.

What I did was:

Add a script (test/General_Category/index.js) that loops over each category, as compiled by the unicode-12.1.0 package (excluding union categories, Cased_Letter, Letter, etc.) and writes Markdown files consisting of a heading containing all the code points in that category, and JSON test cases.
After the Markdown files are committed, the script does its best to scrape the actual slug from GitHub and update the test cases.
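
As a rough illustration of the first step, the sketch below builds one Markdown heading and one JSON test case from a list of code points. The function name `fixtureFor`, the heading layout, and the test-case shape are assumptions made for this example, not the contents of test/General_Category/index.js, and the code points are hard-coded rather than read from the unicode-12.1.0 package.

```javascript
// Hypothetical helper: turn a category name and its code points into
// a Markdown heading plus a JSON-serializable test case.
function fixtureFor(category, codePoints) {
  // Convert each numeric code point to its string form.
  const text = codePoints.map((cp) => String.fromCodePoint(cp)).join('')
  return {
    markdown: `# ${category} ${text}\n`,
    // `expected` stays null until the real slug has been scraped from GitHub.
    testCase: { name: category, input: `${category} ${text}`, expected: null }
  }
}

// A small sample of Dash_Punctuation: U+002D, U+2010 and U+2014.
const sample = fixtureFor('Dash_Punctuation', [0x2d, 0x2010, 0x2014])
console.log(JSON.stringify(sample.testCase))
```

The `expected` field is left empty on purpose: in the workflow described above it can only be filled in after the Markdown file has been committed and rendered.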

I started with categories because there are fewer of them than blocks and I did find that most are either all stripped or all kept in the slug. Notable exceptions are:

Dash_Punctuation (all stripped except for U+002D)
Space_Separator (I suspect all stripped except for U+0020 and U+00A0 but haven't confirmed)
Other_Symbol

Other_Symbol is all stripped except for code points with the Alphabetic property: e.g. U+00A6 (and emojis) are stripped but U+24D0 is kept. So in the second commit I reversed the logic in script/generate-regex.js, which now lists categories to keep rather than categories to strip:

Start by stripping everything
Keep the Binary_Property/Alphabetic set
List the roughly six categories outside Binary_Property/Alphabetic that are also kept
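
The three steps can be sketched with Unicode property escapes. This is only an approximation under stated assumptions: `buildStripRegex` is a hypothetical helper, the two extra categories passed to it are placeholders rather than the list used in script/generate-regex.js, and the result depends on the Unicode tables of the JavaScript runtime instead of the unicode-12.1.0 data.

```javascript
// Hypothetical helper: build a "strip" regex as the complement of a keep list.
function buildStripRegex(extraKeptCategories) {
  // Step 2: keep everything with the binary property Alphabetic.
  // Step 3: also keep the listed General_Category values.
  const kept = ['\\p{Alphabetic}']
    .concat(extraKeptCategories.map((name) => `\\p{General_Category=${name}}`))
    .join('')
  // Step 1: strip everything that is not kept ("-" and space are kept as well).
  return new RegExp(`[^${kept}\\- ]`, 'gu')
}

// Placeholder category list, assumed for illustration only.
const strip = buildStripRegex(['Decimal_Number', 'Connector_Punctuation'])

function slugPart(text) {
  return text.toLowerCase().replace(strip, '').replace(/ /g, '-')
}

console.log(slugPart('a\u00A6b')) // U+00A6 (Other_Symbol, not Alphabetic) is stripped
console.log(slugPart('\u24D0')) // U+24D0 (Other_Symbol, Alphabetic) is kept
```

Writing the keep list as a negated character class means any code point not explicitly kept is stripped, which is the "start by stripping everything" default.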

The resulting regex conforms to all but four of the new test cases:

Line_Separator, Paragraph_Separator and Space_Separator: I presume they fail because .trim() is more aggressive than GitHub but I'm not sure that's worth fixing? There's no regression compared to today's behavior and the difference from GitHub's behavior is purely academic?
Surrogate: I'm not sure what's going on here, I can't strip U+DFFF no matter what I put in the regex. I did a little debugging/Googling but concluded it also wasn't worth it.
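
To make the `.trim()` point concrete, the snippet below checks which separator code points JavaScript's `String.prototype.trim()` removes. It only shows the JavaScript side; what GitHub does with each code point is not verified here. The helper name `removedByTrim` is made up for this example.

```javascript
// Code points from the separator categories discussed above.
const separators = {
  'U+0020 SPACE': '\u0020',
  'U+00A0 NO-BREAK SPACE': '\u00A0',
  'U+2028 LINE SEPARATOR': '\u2028',
  'U+2029 PARAGRAPH SEPARATOR': '\u2029',
  'U+3000 IDEOGRAPHIC SPACE': '\u3000'
}

// True when trim() removes a string consisting only of this code point.
function removedByTrim(ch) {
  return ch.trim() === ''
}

const trimmed = Object.entries(separators)
  .filter(([, ch]) => removedByTrim(ch))
  .map(([name]) => name)

console.log(trimmed) // every entry above is listed
```

Because `trim()` treats all Space_Separator code points plus U+2028 and U+2029 as whitespace, a heading made only of such characters becomes empty before any regex is applied.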

As mentioned in #25 (comment) the motivation for this is e.g. https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/README.ja.md#型定義ファイルとは何ですか-またどのように入手できますか

github-slugger (master and today's custom-regex-filter branch) doesn't match GitHub's slug because it doesn't strip U+FF1F (Halfwidth and Fullwidth Forms block):

-型定義ファイルとは何ですか？-またどのように入手できますか？
+型定義ファイルとは何ですか-またどのように入手できますか


beauroberts · 2021-01-21T07:59:12Z

any chance of merging this @Flet?


I reverse engineered GitHub’s slugging algorithm, somewhat based on #25 and #35. To do that, I created two scripts:

* `generate-fixtures.mjs`, which generates a markdown file, in part from manual fixtures and in part from the Unicode General Categories, creates a gist, crawls the gist, removes it, and saves fixtures annotated with the expected result from GitHub
* `generate-regex.mjs`, which generates the regex that GitHub uses for characters to ignore

The regex is about 2.5kb minzipped. This increases the file size of this project a bit, but matching GitHub is worth it in my opinion.

I also investigated regex `\p{}` classes in `/u` regexes. They work mostly fine, with two caveats: a) they don’t work everywhere, so using them would be a major release, b) GitHub does not implement the same Unicode version as browsers. I tested with Unicode 13 and 14, and they include characters that GitHub handles differently.

In the end, GitHub’s algorithm is mostly fine: strip non-alphanumericals, allow `-`, and turn ` ` (space) into `-`.

Finally, I removed the trim functionality, because it is not implemented by GitHub. To assert this, make a heading like so in a readme: `# &#x20;`. This is a space encoded as a character reference, meaning that markdown does not see it as the whitespace between the `#` and the content. In fact, this makes it the content. And GitHub creates a slug of `-` for it.

Further work: I think it would be nice to release this as is. Then, afterwards, I’d like to modernize the project, add GH Actions to generate the build, add types, and move to ESM.

/cc @Flet @jablko

Closes GH-22. Closes GH-25. Closes GH-35. Closes GH-38.

Co-authored-by: Dan Flettre <flettre@gmail.com>
Co-authored-by: Jack Bates <jack@nottheoilrig.com>
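
The description "strip non-alphanumericals, allow `-`, turn space into `-`, and do not trim" can be sketched as follows. This is an approximation for illustration: `approxStrip` and `approxSlug` are hypothetical names, and the `\p{}` classes stand in for the generated regex, which the message above says differs from runtime Unicode tables for some characters.

```javascript
// Approximation only: the project ships a generated regex rather than \p{} classes.
// Letters, marks, numbers, connector punctuation, "-" and space survive.
const approxStrip = /[^\p{L}\p{M}\p{N}\p{Pc}\- ]/gu

function approxSlug(heading) {
  // No trim(): leading and trailing spaces become "-" like any other space.
  return heading.toLowerCase().replace(approxStrip, '').replace(/ /g, '-')
}

console.log(approxSlug('Hello, World!')) // "hello-world"
console.log(approxSlug(' ')) // "-", matching the character-reference heading case
```

Dropping the trim step is what lets a heading whose whole content is a single space produce `-` instead of an empty slug.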

jablko added 2 commits December 18, 2020 07:37

Filter by Unicode General_Category

ae2c3b3

Keep all Binary_Property/Alphabetic

df6c9cb

jablko force-pushed the patch-1 branch from 090cbfe to df6c9cb Compare December 18, 2020 14:38

Uppercase_Letter requires ICU >= 62

9326cb4

jablko mentioned this pull request Jan 13, 2021

[README] Correct Unicode slugs DefinitelyTyped/DefinitelyTyped#50583

Merged

jablko added a commit to jablko/DefinitelyTyped that referenced this pull request Jan 20, 2021

Don't cherry pick Flet/github-slugger#35

feb0636

sandersn pushed a commit to DefinitelyTyped/DefinitelyTyped that referenced this pull request Jan 20, 2021

[README] Correct Unicode slugs (#50583)

b005fb7

* [README] Correct Unicode slugs * Don't cherry pick Flet/github-slugger#35

kaznovac pushed a commit to kaznovac/DefinitelyTyped that referenced this pull request Mar 2, 2021

[README] Correct Unicode slugs (DefinitelyTyped#50583)

bea5366

* [README] Correct Unicode slugs * Don't cherry pick Flet/github-slugger#35

UziTech mentioned this pull request Apr 7, 2021

Default heading anchors can become really ugly👾 markedjs/marked#1993

Closed

wooorm mentioned this pull request Aug 22, 2021

Fix to match GitHub’s algorithm on unicode #38

Merged

wooorm closed this in #38 Aug 24, 2021
