refactor(utils): use Remark to parse Markdown server-side #5670

MattiaPrimavera · 2021-10-08T18:10:23Z

Motivation

#hacktoberfest
Fix Unresolved link inside HTML comment #5659

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

I did the following to verify my changes work:

- Added expectations and verified the snapshot change in the `transforms absolute links in versioned docs` test (`linkify.test.ts`), confirming that md file links contained in HTML comments are not in the expected result (comments are stripped out before looking for broken md links).
- Ran `npm run start:website:watch` locally and added the following lines to the `website/docs/installation.md` file:
```



```
I saved the file and verified that the warning is no longer shown.

Related PRs

N/A

netlify · 2021-10-08T18:16:42Z

✔️ [V2]
Built without sensitive environment variables

🔨 Explore the source changes: 9c86c8a

🔍 Inspect the deploy log: https://app.netlify.com/sites/docusaurus-2/deploys/61c2e643b2c8400007f6f73c

😎 Browse the preview: https://deploy-preview-5670--docusaurus-2.netlify.app

github-actions · 2021-10-08T18:16:51Z

⚡️ Lighthouse report for the changes in this PR:

| Category | Score |
| --- | --- |
| 🟢 Performance | 96 |
| 🟢 Accessibility | 98 |
| 🟢 Best practices | 100 |
| 🟢 SEO | 100 |
| 🟢 PWA | 95 |

Lighthouse ran on https://deploy-preview-5670--docusaurus-2.netlify.app/

packages/docusaurus-utils/src/markdownLinks.ts

Co-authored-by: Alexey Pyltsyn <lex61rus@gmail.com>

packages/docusaurus-plugin-content-docs/src/markdown/__tests__/__fixtures__/docs/doc2.md

Jarod42 · 2021-10-11T12:43:32Z

packages/docusaurus-utils/src/markdownLinks.ts

@@ -32,20 +32,25 @@ export type ReplaceMarkdownLinksReturn<T extends ContentPaths> = {
  brokenMarkdownLinks: BrokenMarkdownLink<T>[];
 };

+const stripHtmlComments = (fileString: string) => {
+  return fileString.replace(/<!--.*?-->/gs, '');
+};


Note that it would also remove that text when it is inside another block, such as a string or a fenced code block.

Even worse when start-comment and end-comment are in separated blocks.

Agree

> Even worse when start-comment and end-comment are in separated blocks.

Not sure what you mean here?

My first comment was about fenced code blocks (or any "block"), for example:

```md
html comment have the following form "
```

for which the removal might not matter much to the tools.

But cases that span 2 separate blocks, like:

```md
start of html comment is ''
```

would be more problematic.
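
Both cases can be reproduced with the regex from the diff. The sketch below is only an illustration: the sample strings are made up for this example (they are not the reviewer's original snippets), and the regex is built with the `RegExp` constructor, with the same pattern and flags as in the diff, so that the block also compiles under older TypeScript targets.

```typescript
// Same pattern and flags as stripHtmlComments in the diff (/<!--.*?-->/gs).
const HTML_COMMENT = new RegExp('<!--.*?-->', 'gs');

const stripHtmlComments = (fileString: string): string =>
  fileString.replace(HTML_COMMENT, '');

// Three backticks, built at runtime so this example can itself live in a fence.
const FENCE = '`'.repeat(3);

// Case 1 (hypothetical sample): a comment shown as an example in a code block.
const fencedSample = [
  FENCE + 'md',
  'An HTML comment looks like <!-- this -->',
  FENCE,
].join('\n');

// Case 2 (hypothetical sample): the opening and closing markers are mentioned
// in two separate inline code spans, with a real link in between.
const splitSample = [
  'The start of an HTML comment is `<!--`.',
  '',
  'See [the guide](./guide.md).',
  '',
  'The end of an HTML comment is `-->`.',
].join('\n');

const fencedResult = stripHtmlComments(fencedSample);
const splitResult = stripHtmlComments(splitSample);

console.log(fencedResult); // the example comment is gone from the code block
console.log(splitResult); // everything between the two markers is gone
```

In the second case the link to `./guide.md` disappears together with the text around it, so it would be neither resolved nor reported as broken.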

slorber

Thanks for your contribution

This method is called replaceMarkdownLinks and its role is not really to remove HTML comments, just ignore links that are inside HTML comments.

In particular, an HTML comment inside a code block shouldn't be removed for example, but your algo does remove it and the tests should cover use-cases like this

slorber · 2021-10-12T16:14:37Z

Agree

> Even worse when start-comment and end-comment are in separated blocks.

Not sure what you mean here?

MattiaPrimavera · 2021-10-12T16:25:04Z

Yes thank you @slorber and @Jarod42 . I'll add both the tests mentioned and fix the code!

Josh-Cena · 2021-10-30T06:31:31Z

Another false positive is links within inline code. `[link](./tutorial-basics/congratulations.md)` is converted to [link](/docs/tutorial-basics/congratulations), which is even weirder because it's actually rendered.

Fixing these very niche cases is much more non-trivial than what the current impl does. Just wondering: is it possible to move the entire logic to a Remark plugin instead? Or at least use Remark to parse the MD file? Or... we actually bite the bullet and give up on using simple regex tests?

MattiaPrimavera · 2021-10-31T18:33:03Z

> Fixing these very niche cases is much more non-trivial than what the current impl does

Indeed :) At least it does not seem that trivial if there's a constraint to parse every line of the md file only once (for performance requirements, for instance).

> is it possible to move the entire logic to a Remark plugin instead ?

If the logic is not going to be moved to another plugin and there's no such performance requirement, one option might be to first collect all the links that should be ignored in the md file being analysed into a `mdLinksToBeIgnored` list: extract all fenced blocks, all inline code spans, then all HTML comments into a `newContent: string`, then match all md links in it. Then, when `replaceMarkdownLinks` matches a link that is in the `mdLinksToBeIgnored` list, the function can skip transforming that link. This means each md file will be parsed at least twice: one to calculate mdLinksToBeIgnored and the other one for replaceMarkdownLinks. What do you think ?

Josh-Cena · 2021-10-31T23:07:55Z

> This means each md file will be parsed at least twice: one to calculate mdLinksToBeIgnored and the other one for replaceMarkdownLinks. What do you think ?

That doesn't sound unaffordable to me. From a computational perspective, it's still O(n) :D We need to have something first in order to improve on it.

I don't quite understand how you are going to implement that though, so I'm asking for the code first. From my imagination, we would split the text with backtick fences and HTML open and close comments, go through each chunk, and only process those chunks that are not within these fences...
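
A rough sketch of that idea, under a simplified notion of what counts as a fence. Rather than splitting into chunks first, it scans once with a single alternation that tries protected spans (fenced code, inline code, HTML comments) before links, and returns protected spans untouched. `replaceLinksOutsideProtectedSpans` and the `resolve` callback are hypothetical names, not the existing `replaceMarkdownLinks` API, and the link pattern is deliberately narrower than a real one.

```typescript
// Returns the new target for a Markdown file link, or undefined to leave it.
type LinkResolver = (target: string) => string | undefined;

// Group 1: a protected span (fenced block, inline code span, or HTML comment).
// Groups 2 and 3: link text and target of a [text](path.md) or (path.mdx) link.
const PROTECTED_OR_LINK =
  /(`{3}[\s\S]*?`{3}|`[^`\n]*`|<!--[\s\S]*?-->)|\[([^\]\n]*)\]\(([^)\s]+\.mdx?)\)/g;

function replaceLinksOutsideProtectedSpans(
  content: string,
  resolve: LinkResolver,
): string {
  return content.replace(
    PROTECTED_OR_LINK,
    (match: string, protectedSpan: string | undefined, text: string, target: string) => {
      if (protectedSpan !== undefined) {
        return match; // inside code or a comment: leave as written
      }
      const resolved = resolve(target);
      return resolved === undefined ? match : `[${text}](${resolved})`;
    },
  );
}

// Hypothetical input covering the cases raised in this thread.
const FENCE_MARK = '`'.repeat(3);
const mdSample = [
  'See [intro](./intro.md).',
  '<!-- [old](./old.md) -->',
  'Inline `[code](./code.md)` stays.',
  FENCE_MARK + 'md',
  '[fenced](./fenced.md)',
  FENCE_MARK,
].join('\n');

const rewritten = replaceLinksOutsideProtectedSpans(mdSample, (target) =>
  target === './intro.md' ? '/docs/intro' : undefined,
);
console.log(rewritten);
```

This still misses tilde fences, indented code blocks, fences longer than three backticks and nested constructs, which is the argument for using a real parser instead.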

What I suggest: refactor this algorithm entirely using remark: https://github.com/remarkjs/remark

Use remark-parse to parse it to MDAST, find & replace all links, and then use remark-stringify to serialize it back to Markdown text. Not sure about efficiency, but at least it's better than patching our makeshift algo here and there
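
The middle step (find and replace links on the tree) can be sketched without any dependency. The block below is an illustration on a hand-written tree, not this PR's implementation: `MdNode` is a reduced stand-in for the nodes that `remark-parse` produces (`link`, `inlineCode`, `code`, `html` and `text` are real mdast node types), while `rewriteLinks` and the `resolve` callback are hypothetical names. Parsing and serializing with remark are left out.

```typescript
// Reduced subset of mdast node shapes, enough for this illustration.
type MdNode =
  | {type: 'root' | 'paragraph'; children: MdNode[]}
  | {type: 'text' | 'inlineCode' | 'code' | 'html'; value: string}
  | {type: 'link'; url: string; children: MdNode[]};

// Walks the tree in place and rewrites the url of every link node that the
// resolver knows about. Nodes that only carry a `value` are never inspected.
function rewriteLinks(
  node: MdNode,
  resolve: (url: string) => string | undefined,
): void {
  if (node.type === 'link') {
    node.url = resolve(node.url) ?? node.url;
  }
  if ('children' in node) {
    node.children.forEach((child) => rewriteLinks(child, resolve));
  }
}

// Hand-written tree: one real link, one link-looking inline code span,
// and one link-looking HTML comment.
const mdTree: MdNode = {
  type: 'root',
  children: [
    {
      type: 'paragraph',
      children: [
        {type: 'link', url: './intro.md', children: [{type: 'text', value: 'intro'}]},
        {type: 'text', value: ' and '},
        {type: 'inlineCode', value: '[link](./intro.md)'},
      ],
    },
    {type: 'html', value: '<!-- [old](./intro.md) -->'},
  ],
};

rewriteLinks(mdTree, (url) => (url === './intro.md' ? '/docs/intro' : undefined));
const serializedTree = JSON.stringify(mdTree);
console.log(serializedTree);
```

Because inline code, code blocks and HTML comments are separate node types whose content is an opaque `value`, links written inside them are never visited, which covers the false positives discussed in this thread.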

@MattiaPrimavera do you think you can handle this? Or do you need help from my side?

slorber

Looks like a nice POC 👍. I wonder about the performance implications, though: it looks to me like it could be slower due to duplicated Remark parsing 😓. If we find a way to move this logic to the loader, that could help deduplicate.

slorber · 2021-12-22T10:53:30Z

packages/docusaurus-plugin-content-blog/src/__tests__/__snapshots__/feed.test.ts.snap

@@ -23,7 +23,7 @@ exports[`blogFeed atom shows feed item for each post 1`] = `
        <id>/mdx-blog-post</id>
        <link href=\\"https://docusaurus.io/myBaseUrl/blog/mdx-blog-post\\"/>
        <updated>2021-03-05T00:00:00.000Z</updated>
-        <summary type=\\"html\\"><![CDATA[HTML Heading 1]]></summary>
+        <summary type=\\"html\\"><![CDATA[Heading 2]]></summary>


looks like a different behavior here, don't know for sure what is best though 🤷‍♂️

slorber · 2021-12-22T10:57:49Z

packages/docusaurus-utils/src/markdownParser.ts

+  options: MarkdownParserOptions = {remarkPlugins: []},
+): string | undefined {
+  const {remarkPlugins = []} = options;
+  const mdast = remark().use(mdx).use(remarkPlugins).parse(fileString);


We now need to parse the whole doc as an AST before being able to extract an excerpt.

Is there some kind of streaming API that would let us avoid this full MDAST transformation?

For very large files (like our changelog), we don't really want to transform the doc to MDAST 2 or 3 times; we should do it at most once, or this could have some performance impact.
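
This does not answer the streaming question, but the "at most once" part could be approximated by sharing one parse result per file between callers. A minimal sketch, assuming the callers can agree on a file path as the key: `createCachedParser` is a hypothetical helper, and `parse` stands in for whatever produces the MDAST (it is injected so the block has no dependencies).

```typescript
// Wraps a parse function so that the same file content is parsed only once.
// The cache keeps one entry per file path and re-parses when content changes.
function createCachedParser<T>(parse: (content: string) => T) {
  const cache = new Map<string, {content: string; tree: T}>();
  return (filePath: string, content: string): T => {
    const hit = cache.get(filePath);
    if (hit !== undefined && hit.content === content) {
      return hit.tree;
    }
    const tree = parse(content);
    cache.set(filePath, {content, tree});
    return tree;
  };
}

// Placeholder parser that only counts how often it runs.
let parseCalls = 0;
const cachedParse = createCachedParser((content: string) => {
  parseCalls += 1;
  return {type: 'root', length: content.length};
});

const firstTree = cachedParse('CHANGELOG.md', '# Changelog');
const secondTree = cachedParse('CHANGELOG.md', '# Changelog'); // served from cache
const thirdTree = cachedParse('CHANGELOG.md', '# Changelog\n\nmore'); // content changed

console.log(parseCalls);
```

The cached tree is shared, so a caller that mutates it (for example to rewrite links) would need to clone it first.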

slorber · 2021-12-22T10:58:57Z

packages/docusaurus-utils/src/markdownParser.ts

+  const mdast = remark()
+    .use(mdx)
+    // .use(remarkPlugins) // We don't pass plugins here. Let's see if there's any use-case where this is useful
+    .parse(content);


same, wonder if we can avoid processing the whole doc here

Josh-Cena · 2022-01-04T13:41:58Z

Hi @MattiaPrimavera Thanks for your time and effort put into this! Unfortunately, our final resolution has deviated significantly from your initial implementation, and continuing to work on your fork will be quite intractable. I've sent #6261 to properly address this and I'll close this one. Thanks again, and hope you can keep up with this great work in your future path with open source :D

fix(packages/docusaurus-utils): fix warning for unresolved link inside html comment

a01e8ec

MattiaPrimavera requested review from lex111 and slorber as code owners October 8, 2021 18:10

lex111 reviewed Oct 9, 2021

packages/docusaurus-utils/src/markdownLinks.ts (outdated)

Update packages/docusaurus-utils/src/markdownLinks.ts

e65254b

Co-authored-by: Alexey Pyltsyn <lex61rus@gmail.com>

MattiaPrimavera requested a review from lex111 October 9, 2021 13:21

KyrietS mentioned this pull request Oct 10, 2021

Add action to check for and generate missing documentation premake/premake-core#1728

Merged

6 tasks

Jarod42 reviewed Oct 11, 2021

packages/docusaurus-plugin-content-docs/src/markdown/__tests__/__fixtures__/docs/doc2.md

fix(packages/docusaurus-utils): test non-greedy regex

46399b8

Jarod42 reviewed Oct 11, 2021

slorber requested changes Oct 12, 2021

facebook-github-bot added the CLA Signed Signed Facebook CLA label Oct 13, 2021

Josh-Cena changed the title ~~fix(packages/docusaurus-utils): fix warning for unresolved link inside html comment~~ fix(utils): fix warning for unresolved link inside html comment Oct 30, 2021

Josh-Cena added the pr: bug fix This PR fixes a bug in a past release. label Oct 30, 2021

Josh-Cena mentioned this pull request Dec 16, 2021

Comments can end up displaying inside description meta #6108

Open

7 tasks

Merge branch 'main' into unresolved-link-html-comment

cc89cef

MattiaPrimavera requested a review from Josh-Cena as a code owner December 22, 2021 02:48

Josh-Cena linked an issue Dec 22, 2021 that may be closed by this pull request

Comments can end up displaying inside description meta #6108

Open

7 tasks

Josh-Cena changed the title ~~fix(utils): fix warning for unresolved link inside html comment~~ fix(utils): use Remark to parse Markdown server-side Dec 22, 2021

Use Remark

9c86c8a

Josh-Cena changed the title ~~fix(utils): use Remark to parse Markdown server-side~~ refactor(utils): use Remark to parse Markdown server-side Dec 22, 2021

slorber reviewed Dec 22, 2021

Josh-Cena mentioned this pull request Dec 22, 2021

Markdown links inside comments shouldn't be resolved #6160

Closed

7 tasks

Josh-Cena mentioned this pull request Jan 4, 2022

refactor: use Remark to parse Markdown on server-side #6261

Closed

Josh-Cena closed this Jan 4, 2022
