Lozenge ◊ not stripped? #22

Closed
fasiha opened this issue Apr 2, 2019 · 1 comment · Fixed by #38

fasiha commented Apr 2, 2019

It's my own fault for making headings using the lozenge character (◊, U+25CA), but

var GithubSlugger = require("github-slugger")
var slugger = new GithubSlugger();
console.log(slugger.slug('◊sent'))

prints ◊sent instead of sent like GitHub does.

Flet (owner) commented Jun 25, 2019

I will check this out!

wooorm added a commit that referenced this issue Aug 22, 2021
I reverse engineered GitHub’s slugging algorithm.
Somewhat based on #25 and #35.

To do that, I created two scripts:

* `generate-fixtures.mjs`, which generates a markdown file, in part
  from manual fixtures and in part from the Unicode General Categories,
  creates a gist, crawls the gist, removes it, and saves fixtures
  annotated with the expected result from GitHub
* `generate-regex.mjs`, which generates the regex that GitHub uses for
  characters to ignore.
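
As a rough illustration of the second step, here is a minimal sketch of how a list of code points could be collapsed into a character-class regex. It is not the actual `generate-regex.mjs`: the function name `toCharacterClass`, the sample input, and the use of the `u` flag with `\u{…}` escapes are all assumptions made for this example.

```javascript
// Hypothetical helper, not taken from the project: builds a character class
// from a list of code points (for example, ones observed to be stripped).
function toCharacterClass(codePoints) {
  // Deduplicate and sort numerically so adjacent code points can be merged.
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b)
  const ranges = []

  for (const point of sorted) {
    const last = ranges[ranges.length - 1]
    if (last && point === last[1] + 1) {
      last[1] = point // Extend the current range.
    } else {
      ranges.push([point, point]) // Start a new range.
    }
  }

  // Assumption: `\u{…}` escapes, which need the `u` flag.
  const escape = (point) => '\\u{' + point.toString(16).toUpperCase() + '}'
  const body = ranges
    .map(([start, end]) =>
      start === end ? escape(start) : escape(start) + '-' + escape(end)
    )
    .join('')

  return new RegExp('[' + body + ']', 'gu')
}

// Sample input chosen for illustration: `!`, `"`, `#`, and the lozenge.
const ignore = toCharacterClass([0x25ca, 0x21, 0x22, 0x23])
console.log(ignore.source) // [\u{21}-\u{23}\u{25CA}]
console.log('◊sent!'.replace(ignore, '')) // sent
```

Merging adjacent code points into ranges is what keeps such a generated regex small.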

The regex is about 2.5kb minzipped.
This increases the file size of this project a bit.
But matching GitHub is worth it in my opinion.
I also investigated regex `\p{}` classes in `/u` regexes. They work
mostly fine, with two caveats:
a) they don’t work everywhere, so adopting them would require a major release,
b) GitHub does not implement the same Unicode version as browsers.
I tested with Unicode 13 and 14, and they include characters that
GitHub handles differently.
In the end, GitHub’s algorithm is mostly fine: strip
non-alphanumeric characters, allow `-`, and turn ` ` (space) into `-`.
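
The rule above can be sketched with `\p{}` classes. This is only an approximation and not the generated regex: as noted, the Unicode version in the JavaScript engine can disagree with GitHub for some characters. The function name `approximateSlug`, the choice of categories (letters, marks, numbers, connector punctuation), and the lowercasing step are assumptions for illustration.

```javascript
// Approximation only: keeps letters, marks, numbers, connector punctuation
// (such as `_`), spaces, and `-`, then turns each space into `-`.
// Assumption: the heading is lowercased first.
function approximateSlug(heading) {
  return heading
    .toLowerCase()
    .replace(/[^\p{L}\p{M}\p{N}\p{Pc} -]/gu, '')
    .replace(/ /g, '-')
}

console.log(approximateSlug('◊sent')) // sent
console.log(approximateSlug('Hello World!')) // hello-world
```

The lozenge (U+25CA) is a math symbol, so it falls outside the kept categories and is removed.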

Finally, I removed the trim functionality, because it is not
implemented by GitHub.
To assert this, make a heading like so in a readme: `# &#x20;`.
This is a space encoded as a character reference, meaning that the
markdown does not see it as the whitespace between the `#` and the
content.
In fact, this makes it the content.
And GitHub creates a slug of `-` for it.
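
A minimal sketch of the difference, reduced to the space-to-hyphen step. The helper names `slugWithTrim` and `slugWithoutTrim` are hypothetical and exist only for this comparison.

```javascript
// Hypothetical helpers: only the space handling is modelled here.
function slugWithTrim(content) {
  return content.trim().replace(/ /g, '-')
}

function slugWithoutTrim(content) {
  return content.replace(/ /g, '-')
}

// The heading content after `&#x20;` is decoded: a single space.
const content = ' '

console.log(JSON.stringify(slugWithTrim(content))) // ""
console.log(JSON.stringify(slugWithoutTrim(content))) // "-"
```

Only the untrimmed variant yields the `-` slug described above.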

Further work: I think it would be nice to release this as is.
Then, afterwards, I’d like to modernize the project, add GH Actions
to generate the build, add types, and move to ESM.

/cc @Flet @jablkojablko

Closes GH-22.
Closes GH-25.
Closes GH-35.

Co-authored-by: Dan Flettre <flettre@gmail.com>
Co-authored-by: Jack Bates <jack@nottheoilrig.com>
wooorm added a commit that referenced this issue Aug 24, 2021
Closes GH-22.
Closes GH-25.
Closes GH-35.
Closes GH-38.

Co-authored-by: Dan Flettre <flettre@gmail.com>
Co-authored-by: Jack Bates <jack@nottheoilrig.com>