Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Can't set autoupdate hash regex to match a line ending #4688

Closed
stevehipwell opened this issue Jan 25, 2022 · 16 comments
Closed

[Bug] Can't set autoupdate hash regex to match a line ending #4688

stevehipwell opened this issue Jan 25, 2022 · 16 comments
Labels

Comments

@stevehipwell
Copy link

Bug Report

Current Behavior

I can't add a $ to the end of the autoupdate hash regex to make it match the specific artifact rather than another artifact that starts with the same string.

Expected Behavior

I'd expect to be able to match an artifact even if there is another artifact earlier in the checksums with the basename as a prefix.

Additional context/output

The cosign app in the main bucket is currently broken and it should be fixable by setting "regex": "^$sha256\\s+$basename$" but this isn't working. The SBOM artifact starts with the artifact name and appears higher in the checksums file.

Possible Solution

System details

n/a

@stevehipwell
Copy link
Author

FYI @rashil2000 I know you've been working in this area recently.

@rashil2000
Copy link
Member

There was a recent PR on a related issue #4619. Also see #4180.

Other than that, I have not touched the regex handling.

@stevehipwell
Copy link
Author

@rashil2000 I saw those two when I was checking if this issue had already been logged. I think this is a separate issue although based on the docs it should just work.

@rashil2000
Copy link
Member

I see. Thanks for the report!

@stevehipwell
Copy link
Author

Based on further testing it appears that the regex isn't running in multiline mode so on a file with multiple lines the ^ and $ characters aren't matching line beginnings and endings respectively. If this is intentional it should probably be made clearer in the docs.

@niheaven
Copy link
Member

niheaven commented Jan 25, 2022

Based on further testing it appears that the regex isn't running in multiline mode so on a file with multiple lines the ^ and $ characters aren't matching line beginnings and endings respectively. If this is intentional it should probably be made clearer in the docs.

Hmm, this is the internal design of PowerShell's Regular Expressions, not Scoop's. PowerShell's RegEx has many design different with other languages, you could check https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_regular_expressions?view=powershell-5.1 for reference.

When using anchors in PowerShell, you should understand the difference between Singleline and Multiline regular expression options.

Multiline: Multiline mode forces ^ and $ to match the beginning end of every LINE instead of the beginning and end of the input string.
Singleline: Singleline mode treats the input string as a SingleLine. It forces the . character to match every character (including newlines), instead of matching every character EXCEPT the newline \n.

@rashil2000
Copy link
Member

rashil2000 commented Jan 25, 2022

@stevehipwell Can you update the Wiki with this information?

@niheaven
Copy link
Member

And it's better to redirect user to MS's help page, since PS's RegEx is neither PCRE nor Perl 😂

@stevehipwell
Copy link
Author

@niheaven AFAK this is very much a Scoop issue, Scoop chose to run the regexes in singleline mode despite processing multiline files, I'd suggest that this is a bug that should be fixed with the default regex patterns improved to leverage multiline mode.

@niheaven
Copy link
Member

niheaven commented Jan 25, 2022

PowerShell's default is neither Multiline nor Singleline mode (ref: Singleline Mode, Multiline Mode), and we don't change it since it would meet most of our demands.

By default, $ matches only the end of the string, and . matches every character except for the newline character \n or \u000A.

If we change it to some specific mode, I don't know how many manifest's RegEx may be influenced, a RegEx switch is more better choice.

PS, many manifests use (?sm) as switch, and this should be only (?s) to change .'s behavior. The (?sm) switch is a mistake written by myself when I'm not familiar with PS's RegEx.

@stevehipwell
Copy link
Author

PowerShell's default is neither Multiline nor Singleline mode (ref: Singleline Mode, Multiline Mode), and we don't change it since it would meet most of our demands.

I don't think it meets most requirements as the ^ and $ tokens should be used per line to make the regex work correctly; in the meanwhile you have to hack not-characters (e.g.[^_]) to stop the regex breaking when new artifacts types are added with the artifact name as a prefix. This is especially problematic when the order of the checksums is random.

@niheaven
Copy link
Member

This tends to be a bug of $basename since we should use it to identify EXACT base name. PowerShell's anchors are a bit confused, in cosign, a RegEx likes $sha256.*?$basename$ should not match anything, but indeed it matched.

I'll need some times to deeply check the logic of scoop and PowerShell's RegEx mechanism to figure out what's wrong.

@stevehipwell
Copy link
Author

@niheaven I think you'll need multiline on to make the matches absolute when the input is a multiline file. For cosign I'd have expected ^$sha256\s+$basename$ (I'm not escaping \ here for easier readability) to work based on the docs; without multiline on I had to use $sha256\s+$basename[^_] but this will still match part of longer hashes and artifacts which have the basename as a prefix and are seperated by a character other than _, which I've explicitly blocked, if they come first in the file.

As things like SBOMs start to be added to GH artifacts this problem is likely to start showing up more often; I'm surprised this hasn't been a problem in the past but I guess in a lot of cases additional artifacts are generally lower in the order (not that this should be part of the implementation).

@niheaven
Copy link
Member

In cosign case, $sha256\s+$basename\r?\n should be better and is recommended by PowerShell.

Changing RegEx's matching mode is a breaking change for scoop, and we need more cases to support this type of change and reviewing existed manifests in all buckets (at least official ones). For now, use option switch (?imnsx-imnsx) as a temporally solution.

@stevehipwell
Copy link
Author

@niheaven I hadn't tested the multiline modifier, for cosign (?im)^$sha256\s+$basename$ works perfectly; I should have reminded myself how PowerShell does modifiers rather than assuming it was a separate arg.

I think the Scoop default regexs need reviewing, I can't think of a case where not having (?im) set is a desired feature? The only reason to keep a regex in singleline mode is if matching multiline strings, the reason not to is to be able to clearly separate out individual lines with the ^ & $ tokens.

@niheaven
Copy link
Member

The default is neither Singleline nor Multiline, they both need trigger.

I'm afraid changing default behavior to multiline may break some existing RegEx, and if not so, default changing is just one or two lines added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants