-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enh(autodetect) multiple autodetect fixes #2745
Conversation
119031e
to
f9f3bc8
Compare
Moving this to 10.4 also since it'll probably conflict with the UTF8 stuff and I still want to merge in our long lost CSS branch also... |
8297e73
to
d64762e
Compare
6478b01
to
e18cd94
Compare
This is definitely going to make more sense if you view the commits one at a time vs as a whole. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extensive work! Generally looks good, see some comments.
@@ -49,7 +49,7 @@ export default function(hljs) { | |||
var AT_PROPERTY_RE = /@-?\w[\w]*(-\w+)*/ // @-webkit-keyframes | |||
var IDENT_RE = '[a-zA-Z-][a-zA-Z0-9_-]*'; | |||
var RULE = { | |||
begin: /(?:[A-Z_.-]+|--[a-zA-Z0-9_-]+)\s*:/, returnBegin: true, end: ';', endsWithParent: true, | |||
begin: /([*]\s?)?(?:[A-Z_.\-\\]+|--[a-zA-Z0-9_-]+)\s*(\/\*\*\/)?:/, returnBegin: true, end: ';', endsWithParent: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What /**/
is meant for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Common css hack from the old days.
src/languages/tcl.js
Outdated
begin: '\\$(\\{)?(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*\\(([a-zA-Z0-9_])*\\)', | ||
end: '[^a-zA-Z0-9_\\}\\$]' | ||
begin: '\\$(::)?[a-zA-Z_]+((::)?[a-zA-Z0-9_]+)*' + regex.optional(ARRAY_ACCESS), | ||
}, | ||
{ | ||
begin: '\\$(\\{)?(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*', | ||
end: '(\\))?[^a-zA-Z0-9_\\}\\$]' | ||
begin: '\\$\\{(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*' + regex.optional(ARRAY_ACCESS) + '\\}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... This doesn't look right to me. Consider e. g. $foo($bar)
: array index can be a full blown expression. You can have more fun:
set foo bar
set ${foo}(1) 42 ; # this is not captured by the grammar
puts $bar(1)
It has nothing to do with autodection fixes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you actually know TCL are you just looked it up? I’ll go back and review this again because I think maybe I see some issues now but I’m not sure. It be really helpful to have some additional test to show what supposed to be possible here. On some of these languages I’m really shooting blind just looking at the examples I have and what the existing rules seem to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used to knew it... wrote several programs in Tck/Tk long ago. Now I can hardly remember anything.
- can start with `*` (css hacks) - can include a comment after attribute name before : (css hacks)
- `value` is too common variable name to score points as keyword - reduce 2x relevance for beginKeywords - bump csharp relevance slightly
- operators get 0 relevance (consistency: no other grammars score them) - "name" gets 0 relevance since almost any identifier will match This reduces false positives in the language-detection.el rosetta data set significantly.
- Add relevance for groovy shebang line - Ternary should not grant extra relevance
- "name" gets 0 relevance since almost any identifier will match Applying same logic as used with Clojure.
- only count => in `fn` context - prevent beginKeywords double relevancy - reduce relevance of `match`
- add `__FILE__` to keywords - add `proc` and `lambda` Kernel methods to build_ins - stricter rule for identifying method definition - highlight variables - `|` style params now get no relevance (can be too many other things) - add SHEBANG rule - make Ruby REPL matching a little stricter
- built-ins should only match if they are a call - fix function detection
- I looked but couldn't find any reference to this.
- There is no reason to do this every other language gets credit for simple strings.
- This is found in other langauges and isn't a strong signal.
- also fix pgsql markup test
cc6b05e
to
9d06fe9
Compare
These changes were made using a modified test suite from https://github.com/andreasjansson/language-detection.el
Before:
After:
So a 4-5% measurable improvement and there are probably also additional improvements "hidden" as in matches that are still not correct, but likely to be better than they were before - though that's a hard one to measure. (it's very hard to tell the difference between Lisp-like languages, etc)
There were a few common areas of focus:
illegal
to allow auto-detect to work for languages that were being flagged as illegalbeginKeywords
rules... I'm wondering if this shouldn't be done at the parser level itself.||
in Ruby matching params when it could be OR or a concat operator in SQL, etc...)