Unescape strings at tokenizer level #1885

Draft · wants to merge 4 commits into main from escape_unescape
Conversation

@Shaikh-Ubaid (Collaborator) commented Jun 8, 2023

@Shaikh-Ubaid Shaikh-Ubaid force-pushed the escape_unescape branch 2 times, most recently from 6680c9e to 9f2ceda on June 8, 2023 19:34
@Shaikh-Ubaid Shaikh-Ubaid marked this pull request as ready for review June 8, 2023 19:39
@Shaikh-Ubaid (Collaborator Author) commented:

The step Run mamba-org/provision-with-micromamba@main seems to fail for three jobs at the CI.

@Shaikh-Ubaid Shaikh-Ubaid requested a review from certik June 8, 2023 19:50
@Shaikh-Ubaid Shaikh-Ubaid force-pushed the escape_unescape branch 3 times, most recently from 27b1579 to 4c266ab on June 9, 2023 02:11
@Shaikh-Ubaid Shaikh-Ubaid marked this pull request as draft June 9, 2023 02:24
@Shaikh-Ubaid (Collaborator Author) commented:

(lp) ubaid@ubaids-MacBook-Pro lpython % cat examples/expr2.py                     
r"""
Text
123\n\t
"""
(lp) ubaid@ubaids-MacBook-Pro lpython % python -m tokenize examples/expr2.py      
0,0-0,0:            ENCODING       'utf-8'        
1,0-4,3:            STRING         'r"""\nText\n123\\n\\t\n"""'
4,3-4,4:            NEWLINE        '\n'           
5,0-5,0:            ENDMARKER      ''             
(lp) ubaid@ubaids-MacBook-Pro lpython % lpython examples/expr2.py --show-ast        
(Module
    [(Expr
        (ConstantStr
            "\nText\n123\n\t\n"
            ()
        )
    )]
    []
)

The example above uses a raw string. In raw strings, escape sequences are treated as regular characters, so the current LPython output is not what we expect. Previously, unescaping was done at the parser level, where we skipped unescaping for strings whose PREFIX_STRING (see semantics.h) is r.
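For reference, evaluating the same literal in CPython shows the expected behaviour (plain Python, purely for illustration):

# CPython: in a raw string, \n and \t stay as two characters each.
s = r"""
Text
123\n\t
"""
print(repr(s))     # '\nText\n123\\n\\t\n'
print(len(r"\n"))  # 2: a backslash and the letter 'n'; no newline is produced

Under this behaviour, the ConstantStr above would be expected to keep the backslash sequences, i.e. something like "\nText\n123\\n\\t\n".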

@certik (Contributor) commented Jun 9, 2023

Good point. I see two ways forward:

  • unescape at the parser level
  • add a special token for raw strings of the type r"xxx", and then handle unescaping correctly. Modify the parser accordingly.

@Shaikh-Ubaid (Collaborator Author) commented:

add a special token for raw strings of the type r"xxx", and then handle unescaping correctly. Modify the parser accordingly.

It seems that adding a token for raw strings is possible, but it leads to many new tokens being added (raw strings, bytes, raw bytes, formatted strings, raw formatted strings, etc.):

raw_str1 = ("r" | "R") (string1 | string2);
raw_str2 = ("r" | "R") (string3 | string4);
bytes1 = ("b" | "B") (string1 | string2);
bytes2 = ("b" | "B") (string3 | string4);
raw_bytes1 = ("rb" | "rB" | "Rb" | "RB" | "br" | "bR" | "Br" | "BR") (string1 | string2);
raw_bytes2 = ("rb" | "rB" | "Rb" | "RB" | "br" | "bR" | "Br" | "BR") (string3 | string4);
...

For the moment, it seems better to proceed with unescaping at the parser level.

@certik (Contributor) commented Jun 12, 2023

I think if you use 'rb' then it is case insensitive in re2c. But it's fine.

@certik (Contributor) commented Jun 12, 2023

Looks good to me. @Shaikh-Ubaid let me know when this is ready for review.

@Shaikh-Ubaid (Collaborator Author) commented Jun 12, 2023

For the moment, it seems better to proceed with unescaping at the parser level.

There is another concern here. In LFortran, when unescaping strings we need to know the quote used to create the string (' or "). I think this information is not available at the parser level, so it seems unescaping needs to happen at the tokenizer level.
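As an illustration of why the quote matters (sketched in Python; this is not LFortran's actual code): Fortran represents a quote character inside a string by doubling the delimiter, so collapsing the doubled quotes requires knowing which quote delimited the literal.

# Hypothetical helper, for illustration only.
def unescape_fortran(body: str, quote: str) -> str:
    # Inside a '...' literal, '' collapses to a single quote; inside "..." it does not.
    return body.replace(quote + quote, quote)

assert unescape_fortran("it''s", "'") == "it's"
assert unescape_fortran('say ""hi""', '"') == 'say "hi"'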

"b'\n\n\\n'"
"b'\n\\n\\\\n'"
@Shaikh-Ubaid (Collaborator Author) commented:

In the example above, the given string is

b'''
\n\\n'''

Given this example, it seems that all strings (including raw strings, bytes, raw bytes, etc.) need unescaping at the tokenizer level, and then we need to re-escape them if they are raw strings/bytes/raw bytes, etc. (I think at the parser level).
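For reference, CPython evaluates that literal as follows (plain Python; this appears to correspond to the first reference string above):

value = b'''
\n\\n'''
print(value)  # b'\n\n\\n': a newline from the source line break, an unescaped \n, then a literal backslash and 'n'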

@certik (Contributor) commented:

Why do raw strings have to be escaped?

@Shaikh-Ubaid (Collaborator Author) commented:

I think you are right, Sir. I will look into it.

@@ -499,7 +499,7 @@
(Expr
(JoinedStr
[(ConstantStr
"\\n"
"\n"
@certik (Contributor) commented:

I think this is a bug.

self.emit("} else {", 2)
self.emit( 's.append("[]");', 3)
self.emit("}", 2)
else:
self.emit('s.append("\\"" + get_escaped_str(x.m_%s) + "\\"");' % field.name, 2)
self.emit('s.append("\\"" + str_escape_c(x.m_%s) + "\\"");' % field.name, 2)
@certik (Contributor) commented:

I would keep this change.

|| token == yytokentype::TK_RAW_STRING
|| token == yytokentype::TK_BYTES
|| token == yytokentype::TK_RAW_BYTES) {
t = t + " " + "\"" + str_escape_c(yystype.string.str()) + "\"";
@certik (Contributor) commented:

I would keep this change (the str_escape_c).
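For context, a rough guess at what a C-style escape helper such as str_escape_c does when printing tokens and AST nodes (illustrative Python, not the actual libasr implementation):

def str_escape_c(s: str) -> str:
    # Turn control characters back into two-character escape sequences for display.
    table = {"\n": "\\n", "\t": "\\t", "\\": "\\\\", '"': '\\"'}
    return "".join(table.get(ch, ch) for ch in s)

print(str_escape_c("Text\n"))  # Text\n  (the newline is displayed as backslash + 'n')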

@@ -184,7 +184,7 @@
(data (;0;) (i32.const 4) "\0c\00\00\00\01\00\00\00")
(data (;1;) (i32.const 12) " ")
(data (;2;) (i32.const 16) "\18\00\00\00\01\00\00\00")
(data (;3;) (i32.const 24) "\0a ")
(data (;3;) (i32.const 24) "\n ")
@certik (Contributor) commented:

Let's keep this change.

@Shaikh-Ubaid (Collaborator Author) commented:

See #1902 (comment).

Support raw strings token

Support bytes and raw bytes

Add support for unicode, fmt, raw_fmt strings
@Shaikh-Ubaid Shaikh-Ubaid changed the title from "Support Escaping/Unescaping strings in tokenizer" to "Unescape strings at tokenizer level" on Jun 13, 2023
@Shaikh-Ubaid Shaikh-Ubaid marked this pull request as ready for review June 13, 2023 23:26
@Shaikh-Ubaid (Collaborator Author) commented:

Good point. I see two ways forward:

  • unescape at the parser level
  • add a special token for raw strings of the type r"xxx", and then handle unescaping correctly. Modify the parser accordingly.

This PR currently unescapes at the tokenizer level. (LFortran also unescapes at the tokenizer level, since it needs to know the quote used to create the string literal.)

The concern with the current approach of unescaping at the tokenizer level in LPython is that we need to create many new tokens (raw strings, unicode strings, formatted strings, raw formatted strings, bytes, raw bytes), and these also need support at the parser level.

@Shaikh-Ubaid (Collaborator Author) commented:

In the current approach, we unescape all strings and bytes unless they are marked raw, in which case we do not unescape (so any escape sequences present in the raw string/bytes are treated as regular characters).
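A minimal sketch of this rule, assuming a small escape table (illustrative Python; the actual tokenizer is C++ and handles more escape sequences):

ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}

def unescape(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def string_token_value(body: str, is_raw: bool) -> str:
    # Raw strings/bytes keep their escape sequences as regular characters.
    return body if is_raw else unescape(body)

assert string_token_value(r"123\n\t", is_raw=True) == "123\\n\\t"
assert string_token_value(r"123\n\t", is_raw=False) == "123\n\t"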

@Shaikh-Ubaid (Collaborator Author) commented:

Ready.

@Shaikh-Ubaid Shaikh-Ubaid requested a review from certik June 13, 2023 23:40
@@ -129,7 +129,7 @@
(Expr
(ConstantStr
"Text"
()
"u"
@certik (Contributor) commented:

This change fixes a bug?

@Shaikh-Ubaid (Collaborator Author) commented Jun 14, 2023:

It seems that the prefixes "u" and "U" were meaningful in Python 2. In Python 3, all string literals are Unicode by default, so the "u" and "U" prefixes add little or no value (Reference: https://chat.openai.com/share/1270fc0a-fb93-4638-822d-3d1619488027).

Given the above, I think it might not be a bug fix. Previously, we passed the kind only for the "u" prefix; now we also pass the kind value for the capital "U" prefix.
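A quick CPython check of this:

# In Python 3 the u/U prefix is accepted for backwards compatibility but has no effect.
assert u"Text" == U"Text" == "Text"
assert type(U"Text") is str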

@certik (Contributor) commented Jun 14, 2023

Thanks for this. I think you might have found and fixed another bug (#1885 (comment)).

That bug aside, it does seem this approach is more complicated than doing it in the parser. Let's keep this PR open for a while and think about it.

@czgdp1807 let us know your opinion on this approach as well.

@czgdp1807 (Collaborator) left a review comment:

What's the status of this? Do we need this?

@Shaikh-Ubaid (Collaborator Author) commented:

What's the status of this? Do we need this?

We need your opinion on it. Please let us know what we should do.

We currently handle unescaping at the parser level for LPython (and at the tokenizer level for LFortran). This PR adds support for unescaping at the tokenizer level for LPython.

Either approach works for me. Please feel free to close or merge it as required.

@czgdp1807 (Collaborator) commented:

If there is no significant benefit to moving unescaping to the tokenizer level in LPython, then I wouldn't work on it for the time being. There are higher-priority things to be done (such as bug fixing and performance improvements so that LPython performs better than its competitors), so let's work on those.

@czgdp1807 czgdp1807 marked this pull request as draft August 28, 2023 04:58