Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow splitting string literals into multiple parts #7524

Merged
merged 1 commit into from
Nov 26, 2019

Conversation

ghallak
Copy link
Contributor

@ghallak ghallak commented Oct 7, 2019

This should work for both normal and hex string literals, and the splitting can be done either on the same line or on multiple lines.

Closes #7292

Copy link
Collaborator

@erak erak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Changelog entry and also adjust the documentation.

@ghallak ghallak requested a review from erak October 16, 2019 14:05
Copy link
Collaborator

@erak erak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear that the current implementation might lead to some confusion. Please consider the following example:

function f() public pure returns (bytes32) {
    bytes32 escapeCharacters = hex"0000"
    "deaf"
    "feed"
    "beef"
    "feed"
    "fade";
    return escapeCharacters;
}

Is it apparent that we're dealing with a hexadecimal literal here? I think it could be confused with a string literal and would therefor suggest to require a new line to start with another hex:

function f() public pure returns (bytes32) {
    bytes32 escapeCharacters = hex"0000"
    hex"deaf"
    hex"feed"
    hex"beef"
    hex"feed"
    hex"fade";
    return escapeCharacters;
}

In addition to that, I think only allowing whitespaces or newlines in hexstrings after an even amount of nibbles should be allowed. Please also see #7374, which applies these rules to underscores.

@ghallak
Copy link
Contributor Author

ghallak commented Oct 17, 2019

@erak I have modified in this commit 9eb1096 Scanner::scanHexString to accept a sequence of hex string literals in a way that will solve both of the problems that you've mentioned above.

I tried to do the same in Scanner::scanString but that caused the command line test standard_yul_embedded_object_name to fail, because in this line:

"content": "object \"NamedObject\" { code { let x := dataoffset(\"DataName\") sstore(add(x, 0), 0) } data \"DataName\" \"abc\" object \"OtherObject\" { code { revert(0, 0) } } }"

\"DataName\" \"abc\" is being parsed as a single string DataNameabc which is not correct.

Now, both the scanner (to allow multipart hex string literals) and the parser (to allow multipart regular string literals) are modified. Can this be done better (probably by modifying either the scanner or the parser, not both)?

@erak
Copy link
Collaborator

erak commented Oct 23, 2019

@ghallak Great, thanks. I think the implementation is fine. I've tested your changes and found out, that hex"aa" hex"ZZ" "cc" is still allowed, but we need to return an error here.

Could you also please add more syntax test for hex string literals?

@Marenz
Copy link
Contributor

Marenz commented Nov 11, 2019

Tests are failing

@@ -498,7 +498,7 @@ terminate the string literal. Newline only terminates the string literal if it i
Hexadecimal Literals
--------------------

Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.
Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``), and the can also be split into multiple consecutive parts (``hex"00112233" "44556677"`` is equivalent to ``hex"0011223344556677"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually prefer

Suggested change
Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``), and the can also be split into multiple consecutive parts (``hex"00112233" "44556677"`` is equivalent to ``hex"0011223344556677"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.
Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``), and the can also be split into multiple consecutive parts (``hex"00112233" hex"44556677"`` is equivalent to ``hex"0011223344556677"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chriseth Do you mean that hex"00112233" hex"44556677" should be an allowed syntax? Are you suggesting that each part of a hex string should start with hex?

In the original issue description here #7292 (comment) the hex"00112233" "44556677" was given as an example, and from what I understood from the original issue description and from the previous comments on this PR, I have tried to implement the following:

  • For regular strings, no parts should start with hex.

Right:

"aa" "bb" "cc"
"dd"

Wrong:

"aa" "bb" hex"cc"
"dd"
  • For hex strings on a single line, only the first part should start with hex and the rest of the strings on the same line should not.

Right:

hex"aa" "22"

Wrong:

hex"aa" hex"22"
  • For hex strings on multiple lines, only the first part on each line should start with hex.

Right:

hex"11" "22"
hex"33" "44"

Wrong:

hex"11" "22"
"33" "44"

What do you think of the above rules?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghallak newlines are fragile creatures, so I would like to avoid having to distinguish newlines from other whitespace as much as possible, so I think it would be better to force people to use a hex prefix even for non-newline whitespace. What do the others think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adjusted the docs here 7a2b6dd

@@ -1614,9 +1614,18 @@ ASTPointer<Expression> Parser::parsePrimaryExpression()
}
break;
case Token::StringLiteral:
{
string literal = m_scanner->currentLiteral();
while (m_scanner->peekNextToken() == Token::StringLiteral)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this whole change could be simplified and shortened by introducing Token::HexStringLiteral and using this loop here to concatenate string literals and hex string literals while checking that their type stays the same. This would also allow us to create a better error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've introduced Token::HexStringLiteral here af7d002 and made the relevant changes.

Comment on lines +123 to +126
if (currentToken() == Token::HexStringLiteral)
expectToken(Token::HexStringLiteral, false);
else
expectToken(Token::StringLiteral, false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chriseth Is there a better way to do this? Maybe a function like expectToken that works for multiple tokens instead of a single one?

Comment on lines +1628 to +1629
if (m_scanner->currentToken() == Token::Illegal)
fatalParserError(to_string(m_scanner->currentError()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chriseth I had to add this so that in order to get the error message Expected even number of hex-nibbles within double-quotes. instead of Expected ';' but got 'Illegal' for this test. Is it necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the solution is to just not call next() above in case you have an illegal token.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe you can transform the loop above into a do-while loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried both of these solution but neither of them worked.

I think that in order for them to work this line which results in the error message Expected even number of hex-nibbles within double-quotes should be executed, and in order for this line to be executed, the function Parser::parsePrimaryExpression should be called after it parses the hex string but that can't happen, and this line is being hit, which is producing the error message Expected ';' but got 'Illegal'.

In case I don't call next() when an illegal token is found, I'll get the following message instead Expected ';' but got 'HexStringLiteral'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - now I understand. I think it's fine like that!

@chriseth
Copy link
Contributor

Could you squash everything into a single commit, please?

@ghallak
Copy link
Contributor Author

ghallak commented Nov 26, 2019

@chriseth Done here fa2541a

Copy link
Collaborator

@erak erak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor changes that I'd like to see done. Looks food otherwise. Thanks @ghallak :)

Changelog.md Outdated
@@ -2,6 +2,7 @@

Language Features:
* Allow to obtain the selector of public or external library functions via a member ``.selector``.
* Parser: Allow splitting string literals into multiple parts.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Parser: Allow splitting string literals into multiple parts.
* Parser: Allow splitting (hexadecimal) string literals into multiple parts.

I think we should mention both.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Parser: Allow splitting string literals into multiple parts.
* Parser: Allow splitting string and hexadecimal string literals into multiple parts.

Or perhaps the long version?

@@ -498,7 +498,7 @@ terminate the string literal. Newline only terminates the string literal if it i
Hexadecimal Literals
--------------------

Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.
Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``), and the can also be split into multiple consecutive parts (``hex"00112233" hex"44556677"`` is equivalent to ``hex"0011223344556677"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``), and the can also be split into multiple consecutive parts (``hex"00112233" hex"44556677"`` is equivalent to ``hex"0011223344556677"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.
Hexadecimal literals are prefixed with the keyword ``hex`` and are enclosed in double or single-quotes (``hex"001122FF"``), and they can also be split into multiple consecutive parts (``hex"00112233" hex"44556677"`` is equivalent to ``hex"0011223344556677"``). Their content must be a hexadecimal string and their value will be the binary representation of those values.

@@ -221,6 +221,7 @@ namespace langutil
K(FalseLiteral, "false", 0) \
T(Number, nullptr, 0) \
T(StringLiteral, nullptr, 0) \
T(HexStringLiteral, nullptr, 0) \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
T(HexStringLiteral, nullptr, 0) \
T(HexStringLiteral, nullptr, 0) \

@@ -0,0 +1,8 @@
contract test {
function f() public pure returns (bytes32) {
bytes32 escapeCharacters = hex"aa" hex"ax";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps move the invalid hex"ax" to another invalid hex string test and replace this by a "valid", but too short hex string? It took me a second to realize that the x in the second part makes this a hex string with uneven nibbles. But I'm also fine with the current test.

@ghallak
Copy link
Contributor Author

ghallak commented Nov 26, 2019

@erak All the requested changes were implemented here f7d33ea.

Copy link
Collaborator

@erak erak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! @chriseth Do you want to have another look?

@chriseth chriseth merged commit ba8ff17 into ethereum:develop Nov 26, 2019
@chriseth
Copy link
Contributor

Thanks a lot for your help, @ghallak!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow splitting string literals over multiple lines
4 participants