-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enh(php) Switch highlighter to partially case-insensitive #3426
enh(php) Switch highlighter to partially case-insensitive #3426
Conversation
Actually let me rewrite the keyword lists to arrays first so the PR will be easier to follow... then you would just need a
Assuming the magic constants are case-insensitive. |
Great but there are more valid cases like: Is there an option to overwrite global case sensitivity with the |
But do people do that? That seems very UGLY/bad practice to me. (esp the latter)
We don't have any easy way to do that... all the grammar regex must be
|
People rather use lowercase. I saw some cases of uppercase: TRUE, FALSE, NULL. So we can ignore it. |
Oh I wonder if it's more common for literals, we could just support those if those are common... |
Can you give me commit access to your fork? |
06493e5
to
ba2b233
Compare
Do you think we could match function dispatch similar to how we do in CPP so we just catch all invocations? https://github.com/highlightjs/highlight.js/blob/main/src/languages/cpp.js#L401 |
Looks good. Invocations of? |
Any function... if it's a function we can highlight it as |
That what I thought. Currently, the highlighter does not support function invocation, only declarations. Do you want this to be part of this MR? |
Well you had "function names (case-insensitive)" on your list... so that's the first thing I jumped too... overall we're trying to move away from complex parsing an do simpler pattern matching... so for many languages the same rule can be used to detect function definition and function invocation.
The same |
If 99% of the time people use lower case keywords I'm ok with just trying that and seeing how it works (with the exception of the literals)... would that complete |
This should be fine. Will look at it tomorrow. |
957e4b3
to
629912a
Compare
@joshgoebel Ready for review. |
How so? |
Ugh, ok - that's a lot... :-) We can keep it as-is and see if it works ok I suppose. |
629912a
to
6b39cb6
Compare
src/languages/php.js
Outdated
const result = []; | ||
items.forEach(item => { | ||
item.replace(/\|0$/, ""); | ||
result.push(item); | ||
}); | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use map
for transforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this needs to match \d+
not just 0
... so it's future proof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is all I see. I might step in after and make a negative look-ahead list helper... but this looks pretty clean right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joshgoebel done
I think this is all I see. I might step in after and make a negative look-ahead list helper... but this looks pretty clean right now.
Good idea
6b39cb6
to
f5c885f
Compare
src/languages/php.js
Outdated
const FUNCTION_INVOKE = { | ||
relevance: 0, | ||
match: [ | ||
/(?:->|::|\s|\(|\\)/, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is just an attempt to eliminate false positives - if so what would an example or two be? Why do we need these at all vs just relying on [ident](
...?
If this is critical please provide commented code examples of all the variants you're using here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is just an attempt to eliminate false positives - if so what would an example or two be? Why do we need these at all vs just relying on
[ident](
...?If this is critical please provide commented code examples of all the variants you're using here.
All the correct cases are in the Function invoke
tests. I also added a comment and test for First-class Callable Syntax,
which was working correctly before this PR (because there was no function invoke). Now it would be a false positive, so I added all the rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following... Isn't this just an issue with the non-function guard? If I do this, all tests pass:
diff --git a/src/languages/php.js b/src/languages/php.js
index 9f45841a..297bd55b 100644
--- a/src/languages/php.js
+++ b/src/languages/php.js
@@ -321,7 +321,8 @@ export default function(hljs) {
relevance: 0,
match: [
// to prevent First-class Callable Syntax from being confused as the function title
- /(?:->|::|\s|\(|\\)/,
+ // /(?:->|::|\s|\(|\\)/,
+ /\b/,
// to prevent keywords from being confused as the function title
regex.concat("(?!fn\\b|function\\b|", normalizeKeywords(KWS).join("\\b|"), "|", normalizeKeywords(BUILT_INS).join("\\b|"), "\\b)"),
/\w+/,
@@ -375,7 +376,8 @@ export default function(hljs) {
FUNCTION_INVOKE,
{
// swallow composed identifiers to avoid parsing them as keywords
- begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/
+ begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*(?!\s*\()(?![a-zA-Z0-9_\x7f-\xff])/,
+ // scope:"wrong"
},
CONSTRUCTOR_CALL,
{
IE, fixing the guard and using only a word boundary to match the start of function invokes... thoughts? This using multi-match as "look-behind" is dangerous and I really don't want to encourage it at all if it can be easily avoided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joshgoebel Strange because I applied your fix and they fall.
1 failing
1) php
should not cause polynomial backtracking:
Error: php/[10]/begin: Polynomial backtracking. By repeating any character that matches /\xa0/i, an attack string can be created.
[a-zA-Z0-9_\x7f-\xff]*(?!\s*
^~~~~~~~~~~~~~~~~~~~~~ ^~~
Full pattern:
/(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*(?!\s*\()(?![a-zA-Z0-9_\x7f-\xff])/m
After changing \s*
to \s
I have a below error and a few offers.
-DateTimeImmutable::createFromMutable(<span class="hljs-keyword">new</span> <span class="hljs-title class_">\DateTime</span>(<span class="hljs-string">'now'</span>));
+DateTimeImmutable::<span class="hljs-function title_ invoke__">createFromMutable</span>(<span class="hljs-keyword">new</span> <span class="hljs-title class_">\DateTime</span>(<span class="hljs-string">'now'</span>));
This regex do not detect static function invoke:
DateTimeImmutable::createFromMutable();
or method invoke
$date->format();
and we do not want to higlight First-class Callable Syntax as a function
$test();
P.S. I'm run tests inside Docker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not cause polynomial backtracking:
That can be ignored [for now] and fixed. I didn't have any markup test errors. Your'e saying you're seeing the same result right - all tests pass with my patch , other than the security issue?
So if we can fix that, then can this be simplified as I'm suggesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can names TRULY contain non-breaking spaces (0xA0)? That sounds like a bug to me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not cause polynomial backtracking:
That can be ignored [for now] and fixed. I didn't have any markup test errors. Your'e saying you're seeing the same result right - all tests pass with my patch , other than the security issue?
So if we can fix that, then can this be simplified as I'm suggesting?
It will be annoying
Can names TRULY contain non-breaking spaces (0xA0)? That sounds like a bug to me...
Yes, but I never saw anyone using it. Looks bad for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joshgoebel fixed it with *
because only space is valid after the name of the function and before opening bracket.
Did the same in the constructor code.
I hope that's all, thx for this solution.
2eb5d1f
to
f9afb8f
Compare
src/languages/php.js
Outdated
@@ -374,7 +374,8 @@ export default function(hljs) { | |||
FUNCTION_INVOKE, | |||
{ | |||
// swallow composed identifiers to avoid parsing them as keywords | |||
begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/ | |||
begin: /(::|->)+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*(?! *\()(?![a-zA-Z0-9_\x7f-\xff])/, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure they all needed to change, unless the [ ]
is more precise? (it may likely be) IIRC the issue was with this line alone because \xA0*
overlaps with \s*
.... which is fixed easily enough by changing the look-ahead \s to a space, if that works for our use cases. :-) The character groups in the regex here also could have been fixed. (I might just do this)
So, is this more precise? (in which case I'll likely change them to [ ]
for readability.
Any final thoughts? Thanks for all the hard work on this. I think the finish line is just ahead. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure they all needed to change, unless the
[ ]
is more precise? (it may likely be) IIRC the issue was with this line alone because\xA0*
overlaps with\s*
.... which is fixed easily enough by changing the look-ahead \s to a space, if that works for our use cases. :-) The character groups in the regex here also could have been fixed. (I might just do this)So, is this more precise? (in which case I'll likely change them to
[ ]
for readability.Any final thoughts? Thanks for all the hard work on this. I think the finish line is just ahead. :)
new
\DateTime
();
This is also allowed so I will introduce const SEPARATOR [ \t\n]
. This should cover all the cases. Probably this will be better in all those places, if non-breaking space may be part of the name and \s
select it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also allowed
I imagined it might be. :-)
so I will introduce const SEPARATOR
I think "WHITESPACE" might be better (and add a comment saying it purposely excludes 0xA0
), etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "WHITESPACE" might be better (and add a comment saying it purposely excludes
0xA0
), etc...
Done @joshgoebel
a8c3799
to
d6e220e
Compare
- also add `title.function.invoke` to the official list (Rust was already using)
444581b
to
9c711ed
Compare
@wkania Thanks for all the effort. I might release 11.4 soon just to see how this case change plays out. |
Changes
After discussion related to PHP partial case-insensitive .
Checklist
CHANGES.md