Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsmParser: parse source comment using scanner instead of regex #15209

Merged
merged 3 commits into from
Jul 1, 2024

Conversation

clonker
Copy link
Member

@clonker clonker commented Jun 20, 2024

Fixes #15207
Fixes #13496

@clonker clonker marked this pull request as ready for review June 20, 2024 13:04
Copy link
Collaborator

@matheusaaguiar matheusaaguiar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nitpicks and small suggestions.

libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
@matheusaaguiar
Copy link
Collaborator

matheusaaguiar commented Jun 20, 2024

Also, just to confirm/point to, I see that we have some tests in test/libyul/Parser.cpp which seems to cover parsing. Not sure if we need or would benefit from adding something more there.

@cameel cameel added refactor and removed refactor labels Jun 20, 2024
Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a changelog entry for the bugfix.

Also note that github is a bit picky about the fixes/closes annotation. fixes issue #15207 did not work, The PR is not connected to the issue and won't close it automatically. You need Fixes #15207, i.e. without anything in between the word and the issue number.

Also, just to confirm/point to, I see that we have some tests in test/libyul/Parser.cpp which seems to cover parsing. Not sure if we need or would benefit from adding something more there.

The repro is a bit large and otherwise trivial, so I think it's ok to skip it.

I think we're missing some coverage for whitespace and newline handling behavior, because the PR passes all tests even though the way it handles whitespace seems to have changed.

libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
@clonker clonker force-pushed the asm_parser_use_scanner branch 2 times, most recently from 7cf8e50 to b3fd10e Compare June 21, 2024 16:21
Changelog.md Outdated Show resolved Hide resolved
liblangutil/Scanner.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Show resolved Hide resolved
liblangutil/Scanner.h Outdated Show resolved Hide resolved
@aarlt
Copy link
Member

aarlt commented Jun 25, 2024

I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes where mainly happening because of the regexp stuff.

@cameel
Copy link
Member

cameel commented Jun 26, 2024

I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes where mainly happening because of the regexp stuff.

Looks like it worked. Which is actually odd, because this PR doesn't fix #13496, does it? I mean, if it does then great, maybe we should close it too, but it's also possible that it got somehow fixed on develop and we didn't realize :)

@clonker
Copy link
Member Author

clonker commented Jun 27, 2024

I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes where mainly happening because of the regexp stuff.

Looks like it worked. Which is actually odd, because this PR doesn't fix #13496, does it? I mean, if it does then great, maybe we should close it too, but it's also possible that it got somehow fixed on develop and we didn't realize :)

It does segfault for me in the issue you linked above on develop and not on this branch when, e.g., invoking solc --via-ir --asm test.sol. I assume it is because of the debug comments:

image

Invoking the same with --debug-info none does not segfault on develop, either.

@aarlt
Copy link
Member

aarlt commented Jun 27, 2024

I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes where mainly happening because of the regexp stuff.

Looks like it worked. Which is actually odd, because this PR doesn't fix #13496, does it? I mean, if it does then great, maybe we should close it too, but it's also possible that it got somehow fixed on develop and we didn't realize :)

It does segfault for me in the issue you linked above on develop and not on this branch when, e.g., invoking solc --via-ir --asm test.sol. I assume it is because of the debug comments:

image

Invoking the same with --debug-info none does not segfault on develop, either.

yep exactly! its because of the debug comments. I ran into this when I was working on the debug attribute parsing stuff. I would say we should merge this soon so I can rework my PR.

@aarlt
Copy link
Member

aarlt commented Jun 27, 2024

In general it is a bug in std::regexp in some GCC versions that created that segfault.

Changelog.md Outdated Show resolved Hide resolved
Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need some some adjustments in the scanner and a few more tests. Other than that it looks pretty good now.

Comment on lines 837 to 838
if (c == '\\')
scanEscape();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this code correctly, this does not quite match the behavior of the original regex. The code here will just skip invalid escape sequences and stray \ at the end of input. If we want to match the original regex, we'd have to instead:

  • not interpret unrecognized escapes, just keep them as is,
  • report an unterminated string if we reach \ at the end of input (well, I guess reporting an illegal escape instead would be fine too, as long as it's still an error).

Copy link
Member Author

@clonker clonker Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's true, it wasn't reflecting the RegEx behavior - although snippets are really just skipped during parsing, the important bit is the correct detection of the tail. That being said, it absolutely makes sense to interpret the 'special comment' literal string sensibly. Now everything that is escaped and interpretable is being interpreted as such and the rest is appended verbatim. I have added a couple tests to the scanner for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, the most sensible behavior for such escapes would actually be to report an error, but that's breaking. They should never happen in Yul coming from the codegen and are only possible in user-supplied Yul, so maybe it's acceptable, buy I wouldn't do it in this PR. Maybe this is something that we should consider later though. Really, neither skipping them, nor keeping them "as is" is ideal.

libyul/AsmParser.cpp Outdated Show resolved Hide resolved
libyul/AsmParser.cpp Outdated Show resolved Hide resolved
test/libyul/Parser.cpp Show resolved Hide resolved
@cameel
Copy link
Member

cameel commented Jun 27, 2024

It does segfault for me in the issue you linked above on develop and not on this branch when, e.g., invoking solc --via-ir --asm test.sol. I assume it is because of the debug comments:

Ah, right. I focused on parsing the long hex string (which should not be affected by this PR) but overlooked the fact that such a string will also end up in a snippet as a part of debug info. In that case, yeah, this PR does fix it.

cameel
cameel previously approved these changes Jun 28, 2024
Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good so I'm approving already.

I still have some final suggestions, mostly regarding wording and comments. The only bigger one would to change how unterminated escapes are handled, though that's not incorrect, just leaves some unexpected artifacts in the (invalid) literal.

Changelog.md Outdated Show resolved Hide resolved
Changelog.md Outdated Show resolved Hide resolved
test/libyul/Parser.cpp Outdated Show resolved Hide resolved
liblangutil/Scanner.cpp Outdated Show resolved Hide resolved
Comment on lines 862 to 863
if (c == '\\')
scanEscape();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more consistent with the other branch to treat unterminated \ as IllegalEscapeSequence. Currently you continue, run scanEscape() and then error with IllegalStringEndQuote right after the loop.

It does work, but it's still weird to let scanEscape() run when we're at the end of input and m_char is 0, which I guess is assumed to never actually be read. The function will actually take that 0 and put into the literal and that literal is still accessible via currentLiteral() even when we error out.

Also, you should either handle the return value from scanEscape() or assert that it's not supposed to fail:

Suggested change
if (c == '\\')
scanEscape();
if (c == '\\')
{
if (isSourcePastEndOfInput())
return setError(ScannerError::IllegalEscapeSequence);
bool validEscape = scanEscape();
solAssert(validEscape);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! Done that, tests now reflect a case where there's a dangling backslash R"("test\)" and a missing string end quote R"("test)".

{
if (isSourcePastEndOfInput())
return setError(ScannerError::IllegalEscapeSequence);
bool const validEscape = scanEscape(false /* _rejectInvalidEscapes */);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just that I understand correctly: the only thing you need to do here is to check for escaped quotes, right? Couldn't you do that here directly instead of the entire refactoring of scanEscape? We don't actually do anything with the source quote in the snippets other than discarding them, so not sure why we'd go out of our way to deal with all sorts of escape sequences here...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change was based off of this comment. I am completely fine with just checking for escaped quotes as well - all depends on the intended use of the scanner mode beyond this pr I guess.

Copy link
Member

@ekpyron ekpyron Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scanEscape(false) is generally rather weird, though, isn't it? Either you want to properly handle all kinds of escape sequences or you don't, but handling some properly and ignoring others is strange, isn't it :-)?
I think the simpler solution to just only handle escaped quotes here matches more directly what we need now, and keeps the PR simpler - and if we ever want to use this for anything else we can revisit this then...

Copy link
Member Author

@clonker clonker Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't even need to check for escaped quotes, this should be enough for the purpose of just skipping over it

if (c == '\\')
{
	if (isSourcePastEndOfInput())
		return setError(ScannerError::IllegalEscapeSequence);
	advance();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and that's exactly what the regex did, right? Let's just do that then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was a code snippet runinng to EOF here actually even an error before the change :-)?

Copy link
Member Author

@clonker clonker Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would have reported an error because of missing end quote (eg "literal\ - the \ is no longer part of the match so the regex would have found "literal). Or did you mean in the current develop implementation? There as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, when scanning string literals hitting EOF is an error. But I meant, whether it was an error if a source snippet of a comment in Yul, parsed with the regex, it was an error if the quote didn't terminate :-).
Technically, it it wasn't an error then, making it an error now is breaking.
There is error 1544_error even before this PR, but will this ever be hit in this case, i.e. will the regex even match that? Ah, the terminating quote there is "? right? So we do get that error in that case also on develop? Then it's fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the previous condition for error 1544 is rather convoluted, but looks like that's what it did.

Comment on lines +855 to +859
/// @src 1 : 111:
/// 222 "
/// abc\"def
///
/// " @src 0:333:444
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where and how does this actually happen now - and how does it prevent

/// @src 1:1:1 "
let x := 42
/// "

from swallowing the non-comment?

Copy link
Member Author

@clonker clonker Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the example it'll just skip all escaped stuff, so the snippet amounts to abcdef. the non-comment is not swallowed because it will never reach that part of the asmparser, there's an outer loop taking care of that, so only 1:1:1 " would arrive in parseSrcComment and error out with unterminated quote

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so more precisely the input to parseSrcComment in the test you referenced up there is (with escapes)

1\t: 111:\n222 "\nabc\"def\n\n" @src 0:333:444

for which (1,111,222) is identified as components and abcdef as snippet (the rest is now skipped by scanner as discussed); everything else goes into the tail (being @src 0:333:444) which is returned to an outer loop.

The outer loop will match it as source location comment and parseSrcComment is called again, this time with 0:333:444 as input.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the point is that Scanner::scanSingleLineDocComment() just merges the contents of all consecutive lines starting with /// (but not more than that) - alright!

Copy link
Member

@ekpyron ekpyron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say I'm reasonably convinced by this now and Kamil approved it earlier, so I'm fine with merging after squashing the commits a bit!

@clonker clonker merged commit 8d788fb into ethereum:develop Jul 1, 2024
72 checks passed
@clonker clonker deleted the asm_parser_use_scanner branch July 1, 2024 17:29
@aarlt
Copy link
Member

aarlt commented Jul 1, 2024

Just to leave it here: the segmentation fault was related to a bug in GCC's std::regex implementation. This bug is at least reproducible with GCC 9, 10, 11, 12, 13 & 14.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants