Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(utils): use Remark to parse Markdown server-side #5670

Conversation

MattiaPrimavera
Copy link

Motivation

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

I did the following to verify my changes work:

  • add expectations and verified the snapshot change in transforms absolute links in versioned docs test (linkify.test.ts) to confirm md file links contained in html comments are not in the expected result (since comments are stripped out before looking for md broken links)

  • run npm run start:website:watch locally and add the following lines in website/docs/installation.md file :

    <!-- [doc4](md-link-in-single-line-comment.md) -->
    
    <!--
      [doc5](md-link-in-multi-line-comment.md)
    -->
    

    I saved the file and verified the warning is not shown anymore

Related PRs

N/A

@netlify
Copy link

netlify bot commented Oct 8, 2021

✔️ [V2]
Built without sensitive environment variables

🔨 Explore the source changes: 9c86c8a

🔍 Inspect the deploy log: https://app.netlify.com/sites/docusaurus-2/deploys/61c2e643b2c8400007f6f73c

😎 Browse the preview: https://deploy-preview-5670--docusaurus-2.netlify.app

@github-actions
Copy link

github-actions bot commented Oct 8, 2021

⚡️ Lighthouse report for the changes in this PR:

Category Score
🟢 Performance 96
🟢 Accessibility 98
🟢 Best practices 100
🟢 SEO 100
🟢 PWA 95

Lighthouse ran on https://deploy-preview-5670--docusaurus-2.netlify.app/

Co-authored-by: Alexey Pyltsyn <lex61rus@gmail.com>
@@ -32,20 +32,25 @@ export type ReplaceMarkdownLinksReturn<T extends ContentPaths> = {
brokenMarkdownLinks: BrokenMarkdownLink<T>[];
};

const stripHtmlComments = (fileString: string) => {
return fileString.replace(/<!--.*?-->/gs, '');
};
Copy link

@Jarod42 Jarod42 Oct 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that it would also remove that text when it is inside another block as string or fenced codeblock.

Even worse when start-comment and end-comment are in separated blocks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

Even worse when start-comment and end-comment are in separated blocks.

Not sure what you mean here?

Copy link

@Jarod42 Jarod42 Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first comment was for fenced codeblock (or any "block") for example:

```md
html comment have the following form <!-- comment with possibly [link_in_block_considered_in_comment](link.md) -->"
```

for which the removal for the tools might not be important.

but cases like 2 separates blocks:

```md
start of html comment is '<!--'
```
[link_falsely_considered_as_in_comment](link.md) 
```md
end of html comment is '-->'
```

would be more problematic.

Copy link
Collaborator

@slorber slorber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution

This method is called replaceMarkdownLinks and its role is not really to remove HTML comments, just ignore links that are inside HTML comments.

In particular, an HTML comment inside a code block shouldn't be removed for example, but your algo does remove it and the tests should cover use-cases like this

@@ -32,20 +32,25 @@ export type ReplaceMarkdownLinksReturn<T extends ContentPaths> = {
brokenMarkdownLinks: BrokenMarkdownLink<T>[];
};

const stripHtmlComments = (fileString: string) => {
return fileString.replace(/<!--.*?-->/gs, '');
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

Even worse when start-comment and end-comment are in separated blocks.

Not sure what you mean here?

@MattiaPrimavera
Copy link
Author

Yes thank you @slorber and @Jarod42 . I'll add both the tests mentioned and fix the code!

@facebook-github-bot facebook-github-bot added the CLA Signed Signed Facebook CLA label Oct 13, 2021
@Josh-Cena Josh-Cena changed the title fix(packages/docusaurus-utils): fix warning for unresolved link inside html comment fix(utils): fix warning for unresolved link inside html comment Oct 30, 2021
@Josh-Cena Josh-Cena added the pr: bug fix This PR fixes a bug in a past release. label Oct 30, 2021
@Josh-Cena
Copy link
Collaborator

Another false positive is links within inline code. `[link](./tutorial-basics/congratulations.md)` is converted to [link](/docs/tutorial-basics/congratulations), which is even weirder because it's actually rendered.

Fixing these very niche cases is much more non-trivial than what the current impl does. Just wondering: is it possible to move the entire logic to a Remark plugin instead? Or at least use Remark to parse the MD file? Or... we actually bite the bullet and give up on using simple regex tests?

@MattiaPrimavera
Copy link
Author

Fixing these very niche cases is much more non-trivial than what the current impl does

Indeed :) At least it does not seem that trivial if there's a constraint to parse every line of the md file only once (for performance requirements for instance).

is it possible to move the entire logic to a Remark plugin instead ?

In case the logic is not going to be moved in another plugin and there's no such a performance requirement, an option might be to first calculate all the links that should be ignored in the md file analysed (extract all fenced blocks, all ``, then all html comments in a newContent: string, then match all md links) in a mdLinksToBeIgnored list, then when replaceMarkdownLinks matches a link that is in mdLinksToBeIgnored list, the function may avoid to transform that link. This means each md file will be parsed at least twice: one to calculate mdLinksToBeIgnored and the other one for replaceMarkdownLinks. What do you think ?

@Josh-Cena
Copy link
Collaborator

Josh-Cena commented Oct 31, 2021

This means each md file will be parsed at least twice: one to calculate mdLinksToBeIgnored and the other one for replaceMarkdownLinks. What do you think ?

That doesn't sound unaffordable to me. From a computational perspective, it's still O(n) :D We need to have something first in order to improve on it.

I don't quite understand how you are going to implement that though, so I'm asking for the code first. From my imagination, we would split the text with backtick fences and HTML open and close comments, go through each chunk, and only process those chunks that are not within these fences...


What I suggest: refactor this algorithm entirely using remark: https://github.com/remarkjs/remark

Use remark-parse to parse it to MDAST, find & replace all links, and then use remark-stringify to serialize it back to Markdown text. Not sure about efficiency, but at least it's better than patching our makeshift algo here and there

@MattiaPrimavera do you think you can handle this? Or do you need help from my side?

@Josh-Cena Josh-Cena linked an issue Dec 22, 2021 that may be closed by this pull request
7 tasks
@Josh-Cena Josh-Cena changed the title fix(utils): fix warning for unresolved link inside html comment fix(utils): use Remark to parse Markdown server-side Dec 22, 2021
@Josh-Cena Josh-Cena changed the title fix(utils): use Remark to parse Markdown server-side refactor(utils): use Remark to parse Markdown server-side Dec 22, 2021
Copy link
Collaborator

@slorber slorber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice POC 👍 , wonder about the performance implications but it looks to me it could be slower due to duplicated Remark parsing 😓 if we find a way to move this logic to the loader, that could help deduplicate

@@ -23,7 +23,7 @@ exports[`blogFeed atom shows feed item for each post 1`] = `
<id>/mdx-blog-post</id>
<link href=\\"https://docusaurus.io/myBaseUrl/blog/mdx-blog-post\\"/>
<updated>2021-03-05T00:00:00.000Z</updated>
<summary type=\\"html\\"><![CDATA[HTML Heading 1]]></summary>
<summary type=\\"html\\"><![CDATA[Heading 2]]></summary>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a different behavior here, don't know for sure what is best though 🤷‍♂️

image

options: MarkdownParserOptions = {remarkPlugins: []},
): string | undefined {
const {remarkPlugins = []} = options;
const mdast = remark().use(mdx).use(remarkPlugins).parse(fileString);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now need to parse the whole doc as AST before being able to extract an excerpt.

Is there some kind of streaming API that could permit to avoid this full MDAST transformation?

For very large files (like our changelog), we don't really want to transform 2 or 3 times this doc to MDAST, but should at most do it only once, or this can have some performance impact.

const mdast = remark()
.use(mdx)
// .use(remarkPlugins) // We don't pass plugins here. Let's see if there's any use-case where this is useful
.parse(content);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, wonder if we can avoid processing the whole doc here

@Josh-Cena
Copy link
Collaborator

Hi @MattiaPrimavera Thanks for your time and effort put into this! Unfortunately, our final resolution has deviated significantly from your initial implementation, and continuing to work on your fork will be quite intractable. I've sent #6261 to properly address this and I'll close this one. Thanks again, and hope you can keep up with this great work in your future path with open source :D

@Josh-Cena Josh-Cena closed this Jan 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed Signed Facebook CLA pr: bug fix This PR fixes a bug in a past release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Comments can end up displaying inside description meta Unresolved link inside HTML comment
6 participants