Filter by Unicode General_Category #35

Closed
wants to merge 3 commits into from

Conversation

jablko
Contributor

@jablko jablko commented Dec 17, 2020

Would you consider adding these changes to #25/custom-regex-filter? This PR isn't meant to supersede #25, it's a request to pull this PR into that PR. It builds on #25 but filters based on Unicode General_Category vs. Block/Sequence_Property.

What I did was:

  • Add a script (test/General_Category/index.js) that loops over each category compiled by the unicode-12.1.0 package (excluding union categories such as Cased_Letter and Letter). For each category it writes a Markdown file whose heading contains every code point in that category, plus JSON test cases.
  • After the Markdown files are committed, the script does its best to scrape the actual slug from GitHub and update the test cases.
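
A rough sketch of the fixture generation described above, not the actual test/General_Category/index.js. The helper name `buildFixture` and the fixture shape are hypothetical, and the code points are passed in by hand here, whereas the real script takes them from the unicode-12.1.0 package.

```javascript
// Hypothetical helper: turn one General_Category into a Markdown heading
// plus a JSON-serializable test case. The fixture shape is an assumption.
function buildFixture(category, codePoints) {
  // Join one code point at a time; spreading a very large category into
  // String.fromCodePoint(...) could exceed the argument limit.
  const text = codePoints.map((cp) => String.fromCodePoint(cp)).join('')
  return {
    markdown: `# ${text}\n`,
    testCase: {
      name: category,
      input: text,
      // Filled in later, once the slug has been scraped from GitHub.
      expected: null
    }
  }
}

// Tiny hand-picked sample; the full category is larger.
const sample = buildFixture('Dash_Punctuation', [0x002d, 0x2010, 0x2013])
console.log(JSON.stringify(sample.testCase))
```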

I started with categories because there are fewer of them than blocks, and I found that most are either entirely stripped or entirely kept in the slug. Notable exceptions are:

  • Dash_Punctuation (all stripped except for U+002D)
  • Space_Separator (I suspect all stripped except for U+0020 and U+00A0 but haven't confirmed)
  • Other_Symbol

Other_Symbol is all stripped except for code points with the Alphabetic property: U+00A6 (and emoji) are stripped, but U+24D0 is kept. So in the second commit I reversed the logic in script/generate-regex.js, which now lists the categories to keep rather than the ones to strip:

  • Start by stripping everything
  • Keep the Binary_Property/Alphabetic set
  • List the roughly six categories outside Binary_Property/Alphabetic that are also kept
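
The keep-list approach could look roughly like the sketch below. It uses `\p{}` property escapes in place of the unicode-12.1.0 data that script/generate-regex.js works from, so the result depends on the JavaScript engine's Unicode version. `buildStripRegex` is a hypothetical name, and the two extra categories are placeholders chosen for illustration; the actual kept categories are not listed here.

```javascript
// Assumption: placeholder categories for illustration only.
const extraKeptCategories = ['Decimal_Number', 'Connector_Punctuation']

// Build a negated character class: anything NOT in a kept set is stripped.
function buildStripRegex(categories) {
  const kept = ['\\p{Alphabetic}']
    .concat(categories.map((c) => `\\p{General_Category=${c}}`))
    .join('')
  // ASCII hyphen and space are also kept, so they survive the strip step.
  return new RegExp(`[^${kept}\\- ]`, 'gu')
}

const strip = buildStripRegex(extraKeptCategories)
console.log('a\u00a6b'.replace(strip, '')) // U+00A6 (Other_Symbol) is removed
console.log('\u24d0'.replace(strip, '')) // U+24D0 (Alphabetic) is kept
```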

The resulting regex conforms to all but four of the new test cases:

  • Line_Separator, Paragraph_Separator and Space_Separator: I presume they fail because .trim() is more aggressive than GitHub, but I'm not sure that's worth fixing. There's no regression compared to today's behavior, and the difference from GitHub's behavior seems purely academic.
  • Surrogate: I'm not sure what's going on here. I can't strip U+DFFF no matter what I put in the regex. I did a little debugging and Googling but concluded it wasn't worth pursuing either.
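
For reference, a small illustration of how aggressive `.trim()` is: JavaScript's `String.prototype.trim` removes line terminators and all Unicode space separators, not only ASCII whitespace. The sample string is made up for the demonstration.

```javascript
// U+2028 (Line_Separator), U+3000 (a Space_Separator) and
// U+2029 (Paragraph_Separator) around an ordinary letter.
const padded = '\u2028\u3000a\u2029'

// trim() removes all three separators, leaving only the letter.
console.log(padded.trim().length) // 1

// A heading consisting only of U+00A0 is trimmed to the empty string.
console.log('\u00a0'.trim() === '') // true
```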

As mentioned in #25 (comment) the motivation for this is e.g. https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/README.ja.md#型定義ファイルとは何ですか-またどのように入手できますか

github-slugger (master and today's custom-regex-filter branch) doesn't match GitHub's slug because it doesn't strip U+FF1F (FULLWIDTH QUESTION MARK, in the Halfwidth and Fullwidth Forms block):

-型定義ファイルとは何ですか？-またどのように入手できますか？
+型定義ファイルとは何ですか-またどのように入手できますか
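
The mismatch can be reproduced in miniature. Both regexes below are illustrative stand-ins, not the ones github-slugger uses: an ASCII-only punctuation filter leaves U+FF1F in place, while a filter based on the Other_Punctuation category removes it.

```javascript
const heading = '型定義ファイルとは何ですか\uff1f'

// Illustrative ASCII-only filter: does not know about fullwidth forms.
const asciiOnly = /[?!.,:;]/g
// Illustrative Unicode-aware filter: U+FF1F is Other_Punctuation (Po).
const unicodeAware = /\p{Po}/gu

console.log(heading.replace(asciiOnly, '').endsWith('\uff1f')) // true
console.log(heading.replace(unicodeAware, ''))
```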

jablko added a commit to jablko/DefinitelyTyped that referenced this pull request Jan 20, 2021
sandersn pushed a commit to DefinitelyTyped/DefinitelyTyped that referenced this pull request Jan 20, 2021
* [README] Correct Unicode slugs

* Don't cherry pick Flet/github-slugger#35
@beauroberts

any chance of merging this @Flet?

kaznovac pushed a commit to kaznovac/DefinitelyTyped that referenced this pull request Mar 2, 2021
* [README] Correct Unicode slugs

* Don't cherry pick Flet/github-slugger#35
wooorm added a commit that referenced this pull request Aug 22, 2021
I reverse engineered GitHub’s slugging algorithm.
Somewhat based on #25 and #35.

To do that, I created two scripts:

* `generate-fixtures.mjs`, which generates a markdown file, in part
  from manual fixtures and in part on the Unicode General Categories,
  creates a gist, crawls the gist, removes it, and saves fixtures
  annotated with the expected result from GitHub
* `generate-regex.mjs`, which generates the regex that GitHub uses for
  characters to ignore.

The regex is about 2.5kb minzipped.
This increases the file size of this project a bit.
But matching GitHub is worth it in my opinion.
I also investigated regex `\p{}` classes in `/u` regexes. They work
mostly fine, with two caveats:
a) they don’t work everywhere, so would be a major release,
b) GitHub does not implement the same Unicode version as browsers.
I tested with Unicode 13 and 14, and they include characters that
GitHub handles differently.
In the end, GitHub’s algorithm is mostly fine: strip
non-alphanumericals, allow `-`, and turn ` ` (space) into `-`.

Finally, I removed the trim functionality, because it is not
implemented by GitHub.
To assert this, make a heading like so in a readme: `# &#x20;`.
This is a space encoded as a character reference, meaning that the
markdown does not see it as the whitespace between the `#` and the
content.
In fact, this makes it the content.
And GitHub creates a slug of `-` for it.

Further work: I think it would be nice to release this as is.
Then, afterwards, I’d like to modernize the project, add GH Actions
to generate the build, add types, and move to ESM.

/cc @Flet @jablko

Closes GH-22.
Closes GH-25.
Closes GH-35.

Co-authored-by: Dan Flettre <flettre@gmail.com>
Co-authored-by: Jack Bates <jack@nottheoilrig.com>
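
A minimal sketch of the algorithm as summarized in the commit message above: strip non-alphanumerics, allow `-`, turn spaces into `-`, and do not trim. The `\p{}` class is only a stand-in for the generated regex (the message notes that `\p{}` classes do not match GitHub exactly), and `slugSketch` is a hypothetical name. Lowercasing is assumed here; the message does not mention it.

```javascript
// Assumption: stand-in for the generated regex of characters to ignore.
const ignored = /[^\p{Alphabetic}\p{Nd}\- ]/gu

function slugSketch(heading) {
  return heading
    .toLowerCase() // assumed, not stated in the commit message
    .replace(ignored, '') // strip everything that is not kept
    .replace(/ /g, '-') // each space becomes a hyphen; nothing is trimmed
}

console.log(slugSketch('Hello, World!')) // hello-world
console.log(slugSketch(' ')) // -
```

Because nothing is trimmed, a heading whose content is a single space yields `-`, matching the `# &#x20;` experiment described in the message.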
@wooorm wooorm closed this in #38 Aug 24, 2021
wooorm added a commit that referenced this pull request Aug 24, 2021
I reverse engineered GitHub’s slugging algorithm.
Somewhat based on #25 and #35.

Closes GH-22.
Closes GH-25.
Closes GH-35.
Closes GH-38.

Co-authored-by: Dan Flettre <flettre@gmail.com>
Co-authored-by: Jack Bates <jack@nottheoilrig.com>