Reference link autocomplete #893
I think this is a fairly simple regex addition in completion.ts line 496. The current code:

```ts
else if (/\[[^\]]*?\]\[[^\]]*$/.test(lineTextBefore)) {
    /* ┌───────────────────────┐
       │ Reference link labels │
       └───────────────────────┘ */
```

The above regex essentially checks for `[ ][ ]`. The addition:

```ts
else if (/\[[^\]]*?\]\[[^\]]*$/.test(lineTextBefore) || /\[[^\]]*?$/.test(lineTextBefore)) {
    /* ┌───────────────────────┐
       │ Reference link labels │
       └───────────────────────┘ */
```

I suppose that's a bit redundant, and simply checking for `[ ]` would suffice and make more sense:

```ts
else if (/\[[^\]]*?$/.test(lineTextBefore)) {
    /* ┌───────────────────────┐
       │ Reference link labels │
       └───────────────────────┘ */
```

I've never done stuff with VSCode extensions or committed to a lovely repo such as this before, but I'm happy to have a go at this. |
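To make the behavioral difference concrete, the two patterns can be compared directly (a sketch; the sample `lineTextBefore` values are illustrative):

```javascript
// Comparing the current two-bracket pattern with the proposed
// single-bracket pattern against typical in-progress reference links.
const twoBracket = /\[[^\]]*?\]\[[^\]]*$/; // fires only for "[text][" (full reference link)
const oneBracket = /\[[^\]]*$/;            // fires for any unclosed "[" (collapsed/shortcut too)

console.log(twoBracket.test("See [text][")); // true
console.log(twoBracket.test("See ["));       // false - no completion today
console.log(oneBracket.test("See ["));       // true  - completion with the proposed change
```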
I actually think this is a bug, not a feature request, now. It should check for only one `[ ]`, not two `[ ][ ]`, when autocompleting reference links. Two is valid (see spec example 523); it's like a link name to a link ref to a link, but it would work fine if the regex matched only one set of square brackets. |
Looks like you are right as a single
😉 |
This is, I guess, related to #260, etc. As you can see around our codebase, there are so many regular expressions, which are based on presumptions and can only work on the most straightforward forms with local knowledge. When trying to support non-trivial cases, strange scenarios come up very fast. vscode-markdown/src/completion.ts Line 496 in 51ed4fb
"Recognizing reference definitions" and "Inferring the user is typing a reference link" are both challenging:
To support more forms, we need to settle many tricky cases first, and figure out a strategy to balance user experience. I'm now wondering if it's worth creating our own Markdown parser, since it is really hard to find an existing solution (in JavaScript) that meets the needs of an editor. I tried markdown-it, but have actually almost given up now. Its architecture blocks fine source mapping; thus, it is suitable only for very limited scenarios. I would not say the architecture of markdown-it is a failure. But I'd argue that breaking a CommonMark document into a token stream is neither simpler nor more convenient than directly constructing a syntax tree. Good luck. |
For my own understanding, are you targeting CommonMark first/only, or do you want compatibility with other popular variants like GitHub Markdown? Sounds like that's a separate, wider issue you'd like to solve re the regexes? Happy to start spending some time thinking about this, but it does sound complex. Do you want to raise a separate issue for tracking/discussion on this? |
Markdown All in One focuses on CommonMark and GitHub Flavored Markdown (GFM), plus a few extensions urgently needed by the community. https://markdown-all-in-one.github.io/docs/decisions/markdown-syntax-and-flavors.html
Issues have characteristics in common. Although this product acts like an initiator with a bunch of islands, I still try to consider things in an overall view first. Sorry for not being clear enough. You can see two main points from those threads (typically, #890, #877, #809):
Back to the reference link completion: it was introduced in early 2019, has not been maintained for years, and also suffers from flawed presumptions. Additionally, when and how to show completion is also implicitly in question. Maybe your idea (show as soon as hitting

There is a piece of history. From the beginning, we have matched raw content against regular expressions to analyze documents. You can still find some regexps dating back to the first release. I remember somewhere, Yu Zhang told me that we'd better support the most easy-to-use and common cases first, and ignore the corner cases until someone reports them. That was fine before version 3.0.0, when nominally only GFM was supported, and overly complex things could be confidently delayed. Besides, around version 1, this product was actually more like an attachment of VS Code's

However, with #660 introducing
Naturally, we turned to try
Finally, I guess the way out is to build a standalone system from scratch. It's evidently complex itself, not to mention that we would probably need a shim layer to maintain compatibility with
I feel there is already a thread. Let's check the issue tracker. |
That's a lot to go through, so I will digest it further this evening. I'm not fully into the code yet, so take my suggestions here as a n00b possibly talking rubbish. I've fixed this bug in my local repo (need to push to GitHub), but it sounds like I'll need to spend some time testing various edge cases. |
Thank you both for the discussion ❤. I agree (as @Lemmingh said) that many regexp-based approaches are not capable of dealing with more and more Markdown "exceptions". And we perhaps need to turn to a syntax parser in the future, although it is a long way to go. For now, I think it is acceptable to include the simple "patch" from @gavbarnett, as it does help some users (good) and at the same time doesn't introduce any difficulties for our future refactoring (not bad). (The only concern is that people will see some completions when they type the first |
Will try to get a pull request in soon. The other odd side effect is when creating checkbox lists |
As long as the false positives are not very annoying, I feel it'll be OK. After all, I don't expect "matching raw content against regular expressions" to give accurate results. @yzhang-gh
We need statistics, which are not available now. By counting, we could learn a user's preferences, and only show completion for the syntax they like. For now, my suggestion is similar to #893 (comment):
This might mitigate some false positives within paragraphs. I see a lot of |
Another thought: we can do some refactoring, then show completion for shortcut reference links only on explicit request ( |
Agree
Sounds good, although I guess it won't help a lot? (When we see the first
It is an option. The thing is I don't think this feature is worth the effort 🤣. |
Right, the difference between collapsed and shortcut is just a trailing And you remind me of "Beyond Markdown", where jgm listed reference link as one of the six pain points. 😂 (Edited by @yzhang-gh: the link to Beyond Markdown) |
This is exactly where I think breaking the regex down into chunks would be handy for re-use. Naive example code (this would need some thought to iron out the kinks, such as "optional whitespace (including up to one line ending)"):

```js
/* Regex sub-patterns */
// These could be either strings or actual regexes.
const reg = {
    "whitespace": "[ \\t\\r\\n\\f\\v]",
    "url": "[TBD]",
    "linkLabel": "[TBD]",
    "title": "[TBD]",
    // etc.
};

// Build a pattern for checking.
const regLinkRefDefinition = regexConcat(
    reg.whitespace, "{0,3}",
    reg.linkLabel, ":",
    reg.whitespace, "{0,1}",
    reg.url,
    "(", reg.whitespace, reg.title, ")?"
); // This is probably invalid regex syntax, but it shows the idea.
   // It makes the intent more explicit at least.
```

The spec rule this is aiming at:

> A link reference definition consists of a link label, indented up to three spaces, followed by a colon (`:`), optional whitespace (including up to one line ending), a link destination, optional whitespace (including up to one line ending), and an optional link title, which if it is present must be separated from the link destination by whitespace. No further non-whitespace characters may occur on the line.

Regex doesn't seem to natively support concatenation, but there are plenty of examples on Stack Exchange for how to do this. It basically sorts out some escape characters, joins the strings, and converts to a new regex.
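A minimal working version of that concatenation idea might look like the following (`regexConcat` and the sub-patterns here are hypothetical and heavily simplified, not the spec's actual rules):

```javascript
// Hypothetical helper: join regex *source* fragments into one RegExp.
// (No escaping is done; fragments are trusted regex source, which is
// exactly the hazard discussed further down the thread.)
function regexConcat(...parts) {
  return new RegExp(parts.join(""));
}

const reg = {
  spaceOrTab: "[ \\t]",
  linkLabel: "\\[[^\\]]+\\]", // simplified; the spec's rule is much stricter
};

// A (very) simplified link reference definition: up to three characters of
// indent, a label, a colon, optional spaces, then a destination.
const regLinkRefDefinition = regexConcat(
  "^", reg.spaceOrTab, "{0,3}",
  reg.linkLabel, ":",
  reg.spaceOrTab, "*",
  "\\S+"
);

console.log(regLinkRefDefinition.test('[foo]: /url'));      // true
console.log(regLinkRefDefinition.test('not a definition')); // false
```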
Not sure I follow? Why would it require a space beforehand? I think if it's valid CM then the autocompletion should be offered, and the user can just type something else; I don't think it really gets in the way. In my view there is a difference between editing (where some guessing is required, as mostly you need to be fast and work line by line) and syntax validation/HTML rendering (where you can read the full file and construct a tree, as you say). While editing, you're never sure which branch of the tree the user is about to take, so you have to offer all possible options (ranked in some way). |
Just to add that we did a similar thing for the GFM table regexp: vscode-markdown/src/tableFormatter.ts Lines 54 to 69 in 51ed4fb
|
That's good. I've made a start at listing the regexes from the spec as best I can; it's a long slog this way (my regex skills are slowly improving). I've been breaking it up per section in the spec for now, and making each expression bounded in brackets so they can be combined when needed. (This is not directly related to this issue, so I might keep it separate when I do a pull request.) I'm especially happy with cmAtxHeadings for now 😄. Clearly this doesn't solve all problems (just got to Setext headings 😱; surely only crazy people who want to watch the world burn would use them!) but it could be a nice foundation to work from.

```js
/*
    CommonMark 0.29 regexList
    use with the "u" flag, i.e. /pattern/gu
*/
const cmPre = {
    // Uses Unicode escapes and Unicode property classes to be as close to the spec as possible.
    // Characters and lines
    "lineEnding": String.raw`(\u000D\u000A|\u000D|\u000A)`,
    "whitespace": String.raw`[\u0020\u0009\u000A\u000B\u000C\u000D]`,
    "unicodeWhitespace": String.raw`[\p{Z}\u0009\u000D\u000A\u000C]`,
    "space": String.raw`[\u0020]`,
    "spaceOrTab": String.raw`[\u0020\u0009]`,
    "asciiPunctChar": String.raw`[\u0021-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E]`,
    "punctChar": String.raw`[\u0021-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E\p{Pc}\p{Pd}\p{Pe}\p{Pf}\p{Pi}\p{Po}\p{Ps}]`,
    // Tabs (a literal tab, or its four-space expansion)
    "tab": String.raw`(\u0020\u0020\u0020\u0020|\u0009)`,
    // Insecure characters
    "insecureChar": String.raw`[\u0000]`
};
const cmBlocks = {
    "thematicBreak": String.raw`(^${cmPre.space}{0,3}((\*${cmPre.spaceOrTab}?){3,}|(-${cmPre.spaceOrTab}?){3,}|(_${cmPre.spaceOrTab}?){3,})${cmPre.spaceOrTab}*)`
};
const cmAtxHeadings = {
    "hClosing": String.raw`((${cmPre.spaceOrTab}+[#]+${cmPre.spaceOrTab}*)?$)`,
    "hContent": String.raw`((.*?)?)`, // suspect this could be better
    "h": String.raw`(^${cmPre.space}{0,3}[#]{1,6}${cmPre.spaceOrTab}+)`,
    "h1": String.raw`(^${cmPre.space}{0,3}[#]{1}${cmPre.spaceOrTab}+)`,
    "h2": String.raw`(^${cmPre.space}{0,3}[#]{2}${cmPre.spaceOrTab}+)`,
    "h3": String.raw`(^${cmPre.space}{0,3}[#]{3}${cmPre.spaceOrTab}+)`,
    "h4": String.raw`(^${cmPre.space}{0,3}[#]{4}${cmPre.spaceOrTab}+)`,
    "h5": String.raw`(^${cmPre.space}{0,3}[#]{5}${cmPre.spaceOrTab}+)`,
    "h6": String.raw`(^${cmPre.space}{0,3}[#]{6}${cmPre.spaceOrTab}+)`
};

// Example isolating hContent (the title string) from any heading, removing spaces & tabs:
// pattern = String.raw`(?<=` + cmAtxHeadings.h + String.raw`)` +
//           cmAtxHeadings.hContent +
//           String.raw`(?=` + cmAtxHeadings.hClosing + String.raw`)`
```

here's what that final example pattern should look like: |
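As a quick sanity check of that final idea, the sub-patterns can be reassembled into a lookbehind/lookahead pattern (a sketch using local copies of the snippets above; edge cases untested):

```javascript
// Pull the heading text out of an ATX heading using the composed sub-patterns.
const space = String.raw`[\u0020]`;
const spaceOrTab = String.raw`[\u0020\u0009]`;

const h = `(^${space}{0,3}[#]{1,6}${spaceOrTab}+)`;
const hContent = "((.*?)?)";
const hClosing = `((${spaceOrTab}+[#]+${spaceOrTab}*)?$)`;

// Lookbehind skips the opening hashes; lookahead stops before the optional closing sequence.
const pattern = new RegExp(`(?<=${h})${hContent}(?=${hClosing})`, "u");

console.log("## Hello ##".match(pattern)[0]); // "Hello"
console.log("# Title".match(pattern)[0]);     // "Title"
```

Note this relies on variable-length lookbehind, which V8 (Node, VS Code) supports.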
That will be good, as we plan to release a new version very soon. As for the regexes, I think we can have a standalone |
Markdown is designed to be natural language with formatting marks, like those ancient manuscripts. Its mysterious rules are designed to mimic how a human reads a piece of writing. Parsing Markdown is more like analyzing natural language. Although I'm not familiar with natural language processing, I can see Markdown's parsing strategy (block then inline) is hierarchical and recursive from the beginning. It is different from the lexical-semantic workflow that we use to parse programming languages.

When working with normal programming languages, we divide source code into tokens in the first phase, and can predict what should come next as soon as we start. With Markdown, we had better take the node as the basic unit, and it's still difficult to infer what could be there even after analyzing the whole document. Matching a whole structure is exactly matching raw content, which has proved to be full of limitations, and is why the current implementation is so error-prone.

If we're talking about patching existing code, I don't mind. However, #893 (comment) and #893 (comment) appear to be about compiler design. Then, I am against it. Writing regexp literals is clearer and makes it easier to spot errors, although it limits many advanced possibilities. Concatenating strings to build a regexp is honestly useful in many scenarios. However, it requires extreme care due to an unsolvable hazard, which is also the biggest reason that TC39 rejected the Escaping Proposal. I do not think it's worth reviewing a short piece of code again and again just to ensure its runtime behavior is correct. Personally, I wish there could be a Concatenation Proposal for manipulating the RegExp internal tree, but I guess TC39 would reject it for complexity.
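The hazard with string concatenation can be shown in a toy case (illustrative):

```javascript
// Naive concatenation of regex sources can silently change meaning:
// a quantifier fragment binds to the last token of the previous fragment.
const left = "ab";                         // intended: the literal text "ab"
const right = "{2}";                       // intended: a quantifier for something else
const combined = new RegExp(left + right); // becomes /ab{2}/, i.e. "a" then "bb"

console.log(combined.test("abb"));  // true  - surprising
console.log(combined.test("abab")); // false - "ab" twice, yet no match
```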
Never mind, I was daydreaming. I thought since shortcut reference links are dangerous, we could make them hard to appear. However, with only a |
Embarrassingly, Markdown is strange, as said above. CommonMark forces you to scan the entire document at least twice to know its content. Indeed, some of our current functions implicitly scan the document four or five times in a run, which, IMO, is an unnecessary cost. Even after that, you cannot be confident enough about what is going on, because it's Markdown. Thus, you are right: how to guess is an art. CommonMark regards everything as valid. A sequence that does not fit into known syntax is textual content. This is where most of the difficulties lie.
Agree. |
Agree. Wondering if the extension should help enforce a preferred style guide then? Doesn't really help with completion though. I'd not looked into the TC39 Escaping Proposal before; it makes for interesting reading. But we could implement an insecure function for this/for concatenation if we're careful about the regex chunks. Agree that using regexes instead of strings is better. This is all an interesting discussion, but I'm aware we've gone down a few rabbit holes from the initial |
Do you mean parser design? We can discuss and manage them at https://github.com/markdown-all-in-one/markdown-language-service. |
Sorry, missed that! Just shows how long this browser tab's been open! Agree, don't reinvent the wheel. OK, will hop over there. I think the above 2 issues can remain here for now, as they have merit anyway and are relatively straightforward. But a full parser design may be the better longer-term approach. |
Is there a set of tests for autocompletion in the code already? If so, how do I use it? If not, I'll just test manually and list my tests. |
No, so we need to test it by ourselves. |
Sorry this is taking a while; I've only got a few hours in the evenings for it. I'm currently trying to tighten up the regex to match more accurately in a few places. Getting there! I think I might have to change some bits to a regex with multiline (& global) flags rather than going line by line, as link ref definitions can be multiline. Currently able to match all but 2 examples of link reference definitions in the CommonMark spec. One I think I can solve (link ref definitions within code blocks should be ignored); the other (link ref definitions with a whitespace line break in the title text should be ignored) I may leave as a problem for the next person. |
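For the multiline/global direction, whole-document scanning could be sketched like this (the pattern is a simplification for illustration, not the spec's full link reference definition rule):

```javascript
// Sketch: scan a whole document for link reference definitions with one
// global + multiline regex, instead of testing line by line.
const doc = [
  "Some text.",
  '[foo]: /url "title"',
  "[bar]: /other",
].join("\n");

// Simplified: optional indent, label, colon, destination, optional quoted title.
const linkRefDef = /^ {0,3}\[([^\]]+)\]:\s*(\S+)(?:\s+"([^"]*)")?\s*$/gm;

for (const m of doc.matchAll(linkRefDef)) {
  console.log(m[1], "->", m[2]); // label -> destination
}
```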
Proposal
When adding a new link, the option to autocomplete from reference links should be implemented.
Example:
If the bottom reference line exists anywhere in the file, then starting a new link text should provide the option to autocomplete from the reference link text.
This would seem similar to, but not the same as, what is currently implemented.