(Perl) No support for m style regex with arbitrary delimiters #2952

Closed
GwenDragon opened this issue Jan 7, 2021 · 11 comments · Fixed by #2960

@GwenDragon

Describe the issue
Regex detection after the m keyword fails when the pattern contains a slash, for example: $d .= '/' if $d !~ m(/$);

Which language seems to have the issue?
Perl

Are you using highlight or highlightAuto?
initHighlightingOnLoad

Sample Code to Reproduce
https://jsfiddle.net/0t6zh59g/

my %domain_data;
foreach my $d (@domains) {
    my $label = $d;
    $d .= '/' if $d !~ m(/$);
    $label =~ s#^https?://##;
    $label =~ s#[^A-Za-z]#_#g;
    next if $label eq "_";
    $domain_data{$d} = [ $label, 0 ];
}

Expected behavior
Only /$ is highlighted as a regex.

Additional context
Perl's match operator allows an unescaped / when a different delimiter is used; the HLJS regex parser should detect this.
Other valid regexes are, for example, m|/$| or m#/$# or m{/$}.
The same issue happens with $d .= '/' if $d !~ qr(/$); and with other regexes passed to the Regexp Quote-Like Operators.
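
To illustrate the expected behavior, here is a rough Python sketch of a pattern that picks out the regex body after m or qr for a few delimiter styles. The names QUOTE_LIKE and pattern_body are made up for this example, and this is not how highlight.js implements its Perl grammar. It assumes no whitespace between the operator and the delimiter, no nested brackets, and it does not try to tell an operator apart from a variable such as $m{key}.

```python
import re

# Hypothetical sketch: match m/qr followed by a bracket pair or by the same
# punctuation character on both sides. Backslash escapes are skipped.
QUOTE_LIKE = re.compile(
    r"""\b(?:m|qr)
        (?: \((?P<paren>(?:\\.|[^\\)])*)\)
          | \{(?P<brace>(?:\\.|[^\\}])*)\}
          | \[(?P<square>(?:\\.|[^\\\]])*)\]
          | <(?P<angle>(?:\\.|[^\\>])*)>
          | (?P<d>[^\w\s(){}\[\]<>])(?P<same>(?:\\.|(?!(?P=d))[^\\])*)(?P=d)
        )""",
    re.VERBOSE,
)


def pattern_body(code):
    """Return the text between the delimiters of the first m/qr match, or None."""
    match = QUOTE_LIKE.search(code)
    if match is None:
        return None
    for name in ("paren", "brace", "square", "angle", "same"):
        if match.group(name) is not None:
            return match.group(name)
    return None


if __name__ == "__main__":
    for sample in ("$d !~ m(/$);", "m|/$|", "m#/$#", "m{/$}", "$d !~ qr(/$);"):
        print(sample, "->", pattern_body(sample))
```

In every sample above only /$ is reported as the regex body, which is what the highlighter would be expected to mark.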

@GwenDragon GwenDragon added bug help welcome Could use help from community language labels Jan 7, 2021
@joshgoebel
Member

joshgoebel commented Jan 7, 2021

  • And in a case like m#/$# how would one escape a literal # in the regex?
  • Are there limitations on which characters may follow m to denote a regex "enclosure"?
  • If not, are certain characters more common by convention?

@GwenDragon
Author

GwenDragon commented Jan 7, 2021

This is valid Perl code.

use 5.010;
my $test = qq"test";
if ($test =~ m#test#) {
	say "m#test#: ", 1;
}
if ($test =~ m.test.) {
	say "m.test.: ", 1;
}
if ($test =~ m+test+) {
	say "m+test+: ", 1;
}

Are there limitations on which characters may follow m to denote a regex "enclosure"?

//EDIT:
I do not know the BNF of Perl.
I guess all non-alphanumeric characters and brackets: <>(){}[], ! and so on.
https://perldoc.perl.org/perlre#The-Basics
https://perldoc.perl.org/perlop#Gory-details-of-parsing-quoted-constructs

@joshgoebel
Member

joshgoebel commented Jan 7, 2021

Ok. But in a case like m#...# how would one escape a literal # inside the regex? Or is that not possible?

In a normal regex evidently a backslash escape may be used (assuming our grammar is correct):

/\//

So how do I write:

m###

Where the inside # is part of the regex?

@joshgoebel joshgoebel changed the title (Perl) Regex hilihght fails if regex contains / (Perl) No support for m style regex with arbitrary delimiters Jan 7, 2021
@GwenDragon
Author

Ok. But in a case like m#...# how would one escape a literal # inside the regex? Or is that not possible?

OK, in this case I guess m#\##

@joshgoebel
Member

That would be simple enough. Could you confirm your guess?

@GwenDragon
Author

The usual approach is to escape the opening or closing character if it is part of the regex.

I will come back in the next few days once I know more about the special syntax of Perl quoting.

@joshgoebel
Member

When searching for single-character delimiters, escaped delimiters and \ are skipped. For example, while searching for terminating /, combinations of \ and / are skipped. If the delimiters are bracketing, nested pairs are also skipped. For example, while searching for a closing ] paired with the opening [, combinations of \, ], and [ are all skipped, and nested [ and ] are skipped as well. However, when backslashes are used as the delimiters (like qq\ and tr\), nothing is skipped. During the search for the end, backslashes that escape delimiters or other backslashes are removed (exactly speaking, they are not copied to the safe location).

I was following this right up until the end. What is the last part trying to say? I thought I was following along with "skipped" until they said "nothing is skipped" and then changed to using "removed".

@HaraldJoerg

The last sentence describes a separate process, and the sentence before that is a special case: if you use \ as a delimiter, then you can't escape characters with \. The first occurrence of \ terminates the construct, since \\ is not "skipped". This contradicts the rule stated in the first sentence, which is why the text says "However".
The last sentence can be best explained with an example:
print ('qwertz' =~ s z\zzyzr) # prints 'qwerty'
The delimiter is z. The search string contains an escaped delimiter \z which is skipped while searching for the end (as written in the first sentence). The last sentence says that the \ is removed before this escaped delimiter, so that the search string actually is a literal z and not an end-of-string assertion \z (I actually have some doubts that backslashes that escape other backslashes are removed in that step, but don't want to dig deeper since this shouldn't be relevant for syntax highlighting).
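
For illustration, the search rule from the quoted perlop paragraph could be sketched roughly like this in Python. find_closing and BRACKETS are names made up for this sketch, not anything from Perl or highlight.js. It only covers the "search for the end" step: escaped characters are skipped, bracketing delimiters nest, and a backslash delimiter skips nothing. The later removal of backslashes is left out.

```python
# Hypothetical sketch of the delimiter search described in perlop.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def find_closing(text, start):
    """Return the index of the delimiter that closes text[start], or -1."""
    opener = text[start]
    closer = BRACKETS.get(opener, opener)
    depth = 0
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and opener != "\\":
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        if ch == closer:
            if depth == 0:
                return i
            depth -= 1  # closes a nested bracket pair
        elif ch == opener:
            depth += 1  # only reachable for bracketing delimiters
        i += 1
    return -1


if __name__ == "__main__":
    code = "s z\\zzyzr"             # the example s z\zzyzr
    end = find_closing(code, 2)     # the opening z is at index 2
    print(code[3:end])              # text of the search pattern
```

With the example above, the scan starting at the first z stops at index 5, so the search pattern text is \z and the next z starts the replacement part.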

@HaraldJoerg

On using # as a delimiter:

That would be simple enough. Could you confirm your guess?

Indeed, m#\## works as intended to escape a literal #.
However, # has another quirk (and I don't recommend using it as a delimiter): while m#\## is a valid pattern match, m #\## is a lone m followed by the comment #\##.

Here's a compilation of some common and some annoying Perl delimiters. Perl's substitution s/a/b/ is challenging because it allows for an odd number of delimiters, in particular with quotes as delimiters. GitHub's highlighting seems to get almost all of them right:

use 5.020;
use strict;
use warnings;

sub saeaoagr () {
    print "foo";
    qr/x/;
}

# Those are the most popular
say ("fee" =~ s/e/o/gr  . "bar");
say ("fee" =~ s!e!o!gr  . "bar");
say ("fee" =~ s|e|o|gr  . "bar");
say ("fee" =~ s{e}{o}gr . "bar");
say ("fee" =~ s(e)(o)gr . "bar");
say ("fee" =~ s[e][o]gr . "bar");

# Those have syntactic significance
say ("fee" =~ s?e?o?gr  . "bar");
say ("fee" =~ s'e'o'gr  . "bar");  # ' # quote to fix

# Those are valid, but infrequent (and weird)
say ("fee" =~ s"e"o"gr  . "bar");  # " # quote to fix
say ("fee" =~ s aeaoagr . "bar");
say ("fee" =~ s#e#o#gr  . "bar");

# Those must not be confused with the previous two
say ("fee" =~ saeaoagr  . "bar");  # calls saeaoagr()
say ("fee" =~ s #e#o#gr              that's a comment, not a regex
     (e)(o)gr . "bar");            # and here's the regex.
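
The last two groups could be handled by a check like the following Python sketch. opening_delimiter is a hypothetical helper, not part of any existing grammar. It assumes the caller has already recognized s, m, qr, tr or y as a complete token start, and it ignores contexts where the letter is not an operator at all (for example a hash key before =>).

```python
def opening_delimiter(text, op_end):
    """Return the index of the opening delimiter after an operator, or -1.

    op_end is the index just past the operator letters (hypothetical
    convention for this sketch).
    """
    i = op_end
    if i >= len(text):
        return -1
    if text[i].isalnum() or text[i] == "_":
        # The word continues, so this is an identifier such as saeaoagr.
        return -1
    saw_space = False
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            saw_space = True
            i += 1
        elif ch == "#" and saw_space:
            # After whitespace, # starts a comment: skip to the next line.
            newline = text.find("\n", i)
            if newline == -1:
                return -1
            i = newline + 1
        else:
            return i
    return -1


if __name__ == "__main__":
    for sample in ("s#e#o#gr", "s aeaoagr", "saeaoagr", "s #e#o#gr note\n (e)(o)gr"):
        print(repr(sample), "->", opening_delimiter(sample, 1))
```

For s aeaoagr the delimiter is the a at index 2, for saeaoagr nothing is found, and for the commented case the delimiter is the ( on the following line.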

@joshgoebel
Member

joshgoebel commented Jan 9, 2021

Are all those delimiters valid for all regex "types"/operations (not sure what to call them)?

s, tr, y, m, qr ?

@HaraldJoerg

Yes, they are. Perl calls the whole lot "Quote and Quote-like operators".

@joshgoebel joshgoebel self-assigned this Jan 13, 2021
@joshgoebel joshgoebel added this to the 10.6 milestone Jan 13, 2021