-
-
Notifications
You must be signed in to change notification settings - Fork 8.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(utils): use Remark to parse Markdown server-side #5670
refactor(utils): use Remark to parse Markdown server-side #5670
Conversation
✔️ [V2] 🔨 Explore the source changes: 9c86c8a 🔍 Inspect the deploy log: https://app.netlify.com/sites/docusaurus-2/deploys/61c2e643b2c8400007f6f73c 😎 Browse the preview: https://deploy-preview-5670--docusaurus-2.netlify.app |
⚡️ Lighthouse report for the changes in this PR:
Lighthouse ran on https://deploy-preview-5670--docusaurus-2.netlify.app/ |
Co-authored-by: Alexey Pyltsyn <lex61rus@gmail.com>
packages/docusaurus-plugin-content-docs/src/markdown/__tests__/__fixtures__/docs/doc2.md
Show resolved
Hide resolved
@@ -32,20 +32,25 @@ export type ReplaceMarkdownLinksReturn<T extends ContentPaths> = { | |||
brokenMarkdownLinks: BrokenMarkdownLink<T>[]; | |||
}; | |||
|
|||
const stripHtmlComments = (fileString: string) => { | |||
return fileString.replace(/<!--.*?-->/gs, ''); | |||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice that it would also remove that text when it is inside another block as string or fenced codeblock.
Even worse when start-comment and end-comment are in separated blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
Even worse when start-comment and end-comment are in separated blocks.
Not sure what you mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first comment was for fenced codeblock (or any "block") for example:
```md
html comment have the following form <!-- comment with possibly [link_in_block_considered_in_comment](link.md) -->"
```
for which the removal for the tools might not be important.
but cases like 2 separates blocks:
```md
start of html comment is '<!--'
```
[link_falsely_considered_as_in_comment](link.md)
```md
end of html comment is '-->'
```
would be more problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution
This method is called replaceMarkdownLinks
and its role is not really to remove HTML comments, just ignore links that are inside HTML comments.
In particular, an HTML comment inside a code block shouldn't be removed for example, but your algo does remove it and the tests should cover use-cases like this
@@ -32,20 +32,25 @@ export type ReplaceMarkdownLinksReturn<T extends ContentPaths> = { | |||
brokenMarkdownLinks: BrokenMarkdownLink<T>[]; | |||
}; | |||
|
|||
const stripHtmlComments = (fileString: string) => { | |||
return fileString.replace(/<!--.*?-->/gs, ''); | |||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
Even worse when start-comment and end-comment are in separated blocks.
Not sure what you mean here?
Another false positive is links within inline code. Fixing these very niche cases is much more non-trivial than what the current impl does. Just wondering: is it possible to move the entire logic to a Remark plugin instead? Or at least use Remark to parse the MD file? Or... we actually bite the bullet and give up on using simple regex tests? |
Indeed :) At least it does not seem that trivial if there's a constraint to parse every line of the md file only once (for performance requirements for instance).
In case the logic is not going to be moved in another plugin and there's no such a performance requirement, an option might be to first calculate all the links that should be ignored in the md file analysed (extract all fenced blocks, all ``, then all html comments in a |
That doesn't sound unaffordable to me. From a computational perspective, it's still O(n) :D We need to have something first in order to improve on it. I don't quite understand how you are going to implement that though, so I'm asking for the code first. From my imagination, we would split the text with backtick fences and HTML open and close comments, go through each chunk, and only process those chunks that are not within these fences... What I suggest: refactor this algorithm entirely using remark: https://github.com/remarkjs/remark Use remark-parse to parse it to MDAST, find & replace all links, and then use remark-stringify to serialize it back to Markdown text. Not sure about efficiency, but at least it's better than patching our makeshift algo here and there @MattiaPrimavera do you think you can handle this? Or do you need help from my side? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a nice POC 👍 , wonder about the performance implications but it looks to me it could be slower due to duplicated Remark parsing 😓 if we find a way to move this logic to the loader, that could help deduplicate
@@ -23,7 +23,7 @@ exports[`blogFeed atom shows feed item for each post 1`] = ` | |||
<id>/mdx-blog-post</id> | |||
<link href=\\"https://docusaurus.io/myBaseUrl/blog/mdx-blog-post\\"/> | |||
<updated>2021-03-05T00:00:00.000Z</updated> | |||
<summary type=\\"html\\"><![CDATA[HTML Heading 1]]></summary> | |||
<summary type=\\"html\\"><![CDATA[Heading 2]]></summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
options: MarkdownParserOptions = {remarkPlugins: []}, | ||
): string | undefined { | ||
const {remarkPlugins = []} = options; | ||
const mdast = remark().use(mdx).use(remarkPlugins).parse(fileString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now need to parse the whole doc as AST before being able to extract an excerpt.
Is there some kind of streaming API that could permit to avoid this full MDAST transformation?
For very large files (like our changelog), we don't really want to transform 2 or 3 times this doc to MDAST, but should at most do it only once, or this can have some performance impact.
const mdast = remark() | ||
.use(mdx) | ||
// .use(remarkPlugins) // We don't pass plugins here. Let's see if there's any use-case where this is useful | ||
.parse(content); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, wonder if we can avoid processing the whole doc here
Hi @MattiaPrimavera Thanks for your time and effort put into this! Unfortunately, our final resolution has deviated significantly from your initial implementation, and continuing to work on your fork will be quite intractable. I've sent #6261 to properly address this and I'll close this one. Thanks again, and hope you can keep up with this great work in your future path with open source :D |
Motivation
Have you read the Contributing Guidelines on pull requests?
Yes
Test Plan
I did the following to verify my changes work:
add expectations and verified the snapshot change in
transforms absolute links in versioned docs
test (linkify.test.ts
) to confirm md file links contained in html comments are not in the expected result (since comments are stripped out before looking for md broken links)run
npm run start:website:watch
locally and add the following lines inwebsite/docs/installation.md
file :I saved the file and verified the warning is not shown anymore
Related PRs
N/A