-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JavaScript] Highlighting issues #1600
Comments
Good finds. The second is definitely a bug. The syntax's implementation of the automatic semicolon insertion algorithm wasn't looking for the The first is a bit trickier. In most places, |
Thank you for the fast fix!
I am pretty sure there must be an easier solution than doubling the syntax. Ok, I agree that the first test case may be a low priority bug, but it is still a bug and it breaks highlighting to the end of the file. But can you please also look at other examples I posted,I mean for example |
Take a look at how the ECMAScript grammar is formally specified; its productions (corresponding loosely to sublime-syntax contexts) are parameterized by
Ordinarily, there ought to be reasonable fallback behavior in such cases, but here I don't see how that could work. Open to suggestions. (a
)=>a
(
a)=>a These are unfixable. JavaScript's syntax in these places is not deterministic context-free. All that Sublime can do is make a reasonable guess and try to recover gracefully from a wrong guess. If the highlighting fails ungracefully, that can and should be fixed. I have proposed a syntax extension that would allow Sublime to parse arbitrary (not-necessarily-deterministic) context-free languages, but it's not a trivial feature. I'm hoping in the next month or so to come up with a reference implementation that might help to illuminate the issue. for(a
of/a/)a This is also, in principle, not deterministic context-free. It might be possible to create a cover grammar that addresses this, but it would also likely involve a fair bit of duplication. I'll take a look at the ECMAScript grammar and see what can be done. ~class
extends
/\//
{}
/a
~class extends class{}{}/a Have you tried #1581? |
It doesn't mean it cannot be implemented. I don't see the problem: if something cannot be parsed using the current sublime's syntax parser, then extend the set of parser's capabilities. Refactor the sublime's syntax notation, introduce new functionalities which will provide a way of parsing JavaScript syntax.
My suggestion is: implement new syntax notation for the sublime highlighter (
Sorry, but I reject to believe in that. Everything can be fixed, the only question is the cost of the fix.
Sounds awesome, but only when it becomes part of sublime's core (I don't want to install 3rd party plugins in order to properly highlight JavaScript) To summarize, these test cases are not "unfixable". It may be currently hard to fix them because of sublime's poor syntax notation capabilities, but I hope it will be improved in the near future. Of course, according to the fact that some of these test cases are almost nowhere seen in real code, I don't insist this to be fixed immediatelly, but I also disagree that such test cases can be simply ignored. Sublime is the most overpowered text editor out there, so it should parse JavaScript properly. These issues may be fixed in a month, in a year, or in 10 or 20 years, but they should be fixed and not closed as wontfix, since they represent a valid syntax.
What about this? The code looks disgusting. I don't think this is acceptable... |
This repository is for the default packages. Suggestions for new core functionality, including extensions to the sublime-syntax format, should be made in the core issues tracker as they are outside the scope of this repository. The corner cases in question cannot be fixed by any change within the scope of this repository. I have already linked the issue on the core repository that would allow those cases to be addressed here.
That proposal cannot be implemented by a third-party plugin. The parser is built into the Sublime core and is not extensible via plugins.
Are you sure that your interpretation of that snippet is correct? |
Yes, I am sure. Please copy-paste the following snippet in the browser console and you will see (
a)=>{}
/a\/*/
someFunction();
function someFunction(){
var str = "This is a valid JavaScript code, not a comment";
Thom1729_said: 'If the highlighting fails ungracefully, that can and should be fixed.';
console.log('123')
}
//*/ |
I've tracked down the issue. Arrow function expressions are specified strangely in the formal grammar so that if the This is a fixable bug. It is not related to the arrow function parameter list issue; the existing fallthrough logic for arrow function parameter lists misparsed as parenthesized expressions works just fine. The problem begins after the block. If the arrow function were a Here's a simpler example: x=>{}
/a/g; Somewhat alarmingly, both Babel and Uglify parse this incorrectly. At least Sublime is in good company! On the other hand, I would expect most unsophisticated syntax highlighters (such as GitHub's) to mark this correctly, because it would trigger the obvious dumb heuristics for regexp detection. This is a nice find. How did you run into this? Not in production code, I hope. Due to the complexity of the bug, it may be a few days before I have the time to design a solution. |
In a good, but not in the best. For example, CodeMirror gets it right. |
I'm not sure that I'd consider CodeMirror to be "the best" company. The syntax highlighting seems to be a dumb tokenizer. As I mentioned, I would expect most such systems to highlight this example correctly by pure guesswork. Unfortunately, such heuristics will inevitably guess wrong in many cases. See PR #1009 for a discussion and some links. Most syntax highlighters use simple heuristics, with generally tolerable results. It would be unreasonable to expect such a highlighter to reliably handle complex syntax. At the opposite extreme, some editors build a real abstract syntax tree. This is theoretically ideal, but the drawbacks are significant. Implementing a language requires implementing a full parser and tokenizer or porting an existing one, so most such editors support only one language or a small handful of languages. In addition, full-blown AST parsers are much more resource-intensive than dumb heuristic highlighters and may not be suited to real-time code editing. Sublime's syntax highlighting engine is intermediate between these extremes. It implements a real deterministic context-free parser, allowing it to exactly parse many common languages and to do a very good job with nearly all common languages. But the engine is very fast -- guaranteed linear time, I believe, outside of edge cases. Incremental parsing is supported, which is very important for a real-time editor. This efficiency comes at a cost: incremental parsing limits lookahead, while the linear time guarantee limits the parser to deterministic context-free languages. JavaScript is context-free, but not deterministic context-free. Python is not context-free at all. Extending Sublime's engine to exactly handle more general languages is not easy in the first place, but any such extension must also preserve reasonable time complexity guarantees for most syntaxes. This is a difficult problem to solve. I think that it's worth solving, which is why I wrote my proposal for nondeterministic parsing, and why I'm working on my own parser implementation to use as a testbed for such extensions. It's important to maintain perspective. Sublime's syntax highlighting engine is already the best in the business by a comfortable margin. |
I know you wrote the proposal, you've said it four times. But I don't see the problem about this issue: once the proposal is implemented in the core, there may be an option like "Enable perfect parsing". If a user doesn't check that option, the highlighter will work in linear time complexity, but some edge cases will not be highlighted correctly. Otherwise, if a user checks it, all edge cases will be correctly highlighted, but linear complexity will not be guaranteed. I think it is the best idea so far, all users will be satisfied: ordinar users will use basic parsing and advanced users will use advanced parsing. But anyway, sublime should definitely provide both ways. btw, I've reported the babel issue, I'm posting the link here for reference; babel/babel#7993 |
It would be a lot harder than you might think to make such a feature optional. Either a syntax definition uses the new feature, or it doesn't. In order to make the feature optional, either there would need to be two parallel syntax definitions (a maintenance headache and inevitable source of bugs) or the syntax definition would somehow have to be crafted to (mostly) work without the feature. That sounds unlikely. Moreover, I don't think that "ordinary" and "advanced" users have significantly different needs when it comes to syntax highlighting accuracy/performance tradeoffs. The feature would have to provide reasonable performance guarantees to be useful at all. |
ok, here is my point of view: There are text editors like VisualStudio or Eclipse which highlight the code while typing and do it perfectly (no mishighlights at all) and do it very fast (I suppose in linear time). I cannot talk about VisualStudio, because it is not open-source, but Eclipse has a parser (a serious parser, not just a highlighter or tokenizer) which runs in a separate thread and highlights the code while user types. Eclipse supports wide range of languages and provides an easy way for adding new languages. As I said, I'm not a mathematician and I'm not familiar with the technical definition of a language, but here is my understanding: language can be considered as a set of arrays of bytes. Every element of that set represent a valid code in the given language (a code which doesn't throw syntax error when ran). According to that definition (and I suppose it is not far away from the real definition), I think the proposal Thom is talking about is not sufficient. To demonstrate what I mean, let's consider the following example: example: let's define a new language. A string is considered to be a valid code in that language if and only if it's sha512 hash starts with byte 0x00. So, my question is: how would sublime properly highlights valid scripts as valid and invalid scripts as invalid in such language? I mean, how the syntax notation (even with the Thom's extension) would be able to distinguish them? In most languages that is not the case, but I don't see a reason not to support even these languages. Native sublime's packages may not include such langauges, but it would be good to allow syntax package writers to support arbitrary languages. Therefore, here is my proposal: ProposalSorry for my inability to write this in a technical form, but anyway, here is my idea. Implement a new syntax notation in the core which will be able to parse arbitrary language and distinguish it's parts (expressions, functions, whatever it has). Suppose we have a simplified c-like language:
(the above code is invalid on purpose to demonstrate the ability of this proposal to detect it (redefinition of the variable So we basically only have variable declarations of
(this is a pseuodo-code, don't be surprised if you find a typo) So basically, the parser always starts from the The interesting part here is ProblemsThere are several problems with this method (but I think it is still not unsolvable). One of the problems is that someone may write syntax notation in such form that the parser will spin in a loop forever (suppose we swap lines Another problem is the exponential growth of the time needed for parsing some statements. A good example is the examle Thom provided in their proposal: (
a = (
b = (
c = (
d
) => null
) => null
) => null
) => null This will trigger a loop with exponential complexity which for large test cases, even if runs in a separate thread, will not complete in a reasonably short time, causing the rest of the syntax to be mishighlighted. But, as thom said:
I think it can be applied in my proposal too. Another issue with my proposal is the rollback algorithm. Since The conclusionI think this idea is worth considering. I am anyway writing my own parser with such syntax notation for parsing Esolang, so even if sublime doesn't want to implement this, I will publish my own highlighter which is, of course not that fast, but at least it perfectly highlights all scripts in the given language, and more important: it will be opensource. My proposal allows generating of the abstract syntax tree (which is also the result of parsing a non-terminal) which perfectly fits into the requirements: we can exactly tell what is a function, what is a variable, what is an expression and so on. This would provide a lot of new functionalities (highlighting non-declared variables as invalid or assigning a
It sounds reasonable to me. A lot of users are satisfied with the basic syntax highlighter. I really don't see a problem in writing a syntax definition in two different formats. Some languages doesn't even need this proposal, so we can keep the old syntax notation for these languages. But, in my opinion, it would be a good idea to provide a way for writing a syntax notation for arbitrary esolangs. Even if nobody supports this, I don't think it would be a big disaster, because I am anyway working on my own parser which will be able to parse any given language, but the only problem is because I am sure sublime would do it probably a lot more neat and do a lot more optimizations, that is why I think it would be a good idea to implement it in the core. (Just to mention, this proposal is a general idea, it has nothing to do with the JavaScript test cases from this issue. Such test cases should and must be fixed anyway) |
Sublime doesn't need to highlight every computable language. Ideally, it should be able to highlight every actual programming language. The ability to compute SHA512 hashes just isn't relevant to this objective because it isn't required to parse any programming language. Most programming languages belong to a class called the context-free languages. These languages can be represented by a context-free grammar, commonly represented in EBNF or a similar form. There are well-behaved, efficient algorithms for parsing context-free languages, particularly a subset called the deterministic context-free languages. The actual parsers used by compilers and interpreters use a range of similar algorithms that efficiently parse these classes of languages. Sublime's parser is essentially a deterministic pushdown automaton. Sublime syntax definitions are essentially state transition tables for a DPDA. A DPDA can be run very fast -- with some caveats, in linear time. Just as importantly, the engine can be written in efficient C code and compiled to assembly. This is extremely important for a real-time syntax highlighter. In addition, Sublime's implementation does not look across line boundaries, which allows it to implement optimizations that minimize reparsing as the text changes. Eclipse and Visual Studio allow users to essentially write their own parsing engines. These extensions are compiled to native code (or to VM code). This allows users to parse anything that's parseable by any means. There is no free lunch here; parsing more complex languages is slower. It is mathematically impossible for Visual Studio or any other software to parse JavaScript in linear time. (No one holds up those editors as models of efficiency and responsiveness in the first place.) This approach has several additional downsides:
You're proposing a format, but not an algorithm. The format supports nondeterministic context-free languages. What algorithm parses them? There are several possible answers, none of which runs in linear time. I wrote the nondeterminism proposal the way that I did because the algorithm is far more important than the definition syntax; the meat of the proposal isn't the name In your example, you have arbitrary JavaScript code. For Sublime, it would probably be Python. Sublime's engine does not run in Python; the overhead of invoking the Python runtime mid-parse would be tremendous. Doing so for every identifier would be disastrous. Suppose that the code should be written in C++ and compiled, and you introduce a whole new class of problems. You yourself point out that this feature would require a total reparse on every change, and your intuition that this is unfixable is basically correct. Finally, I would note that finding type errors and such is generally the job of linters. There are high-quality linters available for most common languages, and robust linter integration is available via the Sublime-Linter package. These linters often lag behind the user's input. It would be disastrous if the linter's work held up highlighting. |
I explained the algorithm and described it very briefly, but I also mentioned that it is very unoptimal and not for real-time parsing. But anyway, forget about it, the main thread here is how to fix the test cases which break the highlight completely (14 out of 17 test cases I posted break the highlight). When you are idle, please open some prs with the appropriate fixes. I am sure not all 17 issues need your proposal. |
The following require nondeterministic parsing:
In these cases, the best we can do is make reasonable guesses and fail gracefully. Fixing Your example 7 is fixed by #1581. Example 12 is partially fixed by #1581. Several others are marked as already fixed by various PRs. The remaining issues are:
|
Thanks for those fixes! I think the 18th is easy too (see this). |
Regarding item 14, @Thom1729, that’s likely a vestige of previously targeting older versions of the language. @user-982634 brought your proposed delayed-scope-assignment/non-deterministic-parsing idea to my attention (after I mistook them for someone who used to troll our repo, oops). It’s a very convincing solution! |
See Bubblegum |
It has already been fixed. |
Regarding above issue with regexp:
- meta_include_prototype: false
- meta_scope: string.regexp.js
- - match: "/"
+ - match: '[^\]?/'
scope: punctuation.definition.string.end.js |
@trongthanh LGTM. Probably it has been fixed in the current master branch. At least, I don't think I am using a custom JS syntax. For people who are interesting in reproducing: module.exports = {
external: [
/\/.*/,
'str'
],
rules: []
}; |
I can't reproduce either. I'll try to reproduce with an older branch if possible, but I may not have the chance today. In any case, the latest master seems to work. |
@Thom1729 is your summary of remaining issues in #1600 (comment) still up to date? OP includes many different issues, some of which are marked as solved but not as many as your summary suggests and most of the talk in this issue is related to syntax parsing in general, not the issues at hand. While they are a very good read for the interested reader, they aren't exactly related to the reported issues and make following the status of these kind of difficult. I suggest moving the unresolved issues, grouping those that are unfixable with the current system for archival purposes and those that can be fixed into separate issues. |
@Thom1729 & @jfcherng, I just tried on my home laptop running Ubuntu and couldn't reproduce it. It's also ST3 dev build 3210. So it's pretty strange. I'll take a closer look on the machine has the issue (ST dev build 3210 on macOS) and try to find what's the real cause. Edited: OK, It's so weird that when I removed the patched JavaScript.sublime-syntax on my problematic machine, the issue is also not reproducible. Maybe there were some intermittent config changes I made to ST3 combined with JSCustom plugin that cause the issue. So never mind. |
I've created a separate issue for everything in here that isn't fixed. It should be fine to close this issue. |
~await/a
a
instanceof/a/
a
in/a/
for(a
of/a/)a
~{*a(){yield
/a/}}
async a=>{for await(a of a)a}
~class
extends
/\//
{}
/a
try{}catch{}if(a)a
(a
)=>a
(
a)=>a
(
a
)=>a
~class extends class{}{}/a
a=>{}
/a/
if(0)
goto/a
a:
break a
/a/
a:for(;;)continue a
/a/
'\u{1}'
The text was updated successfully, but these errors were encountered: