The language reference doesn't explain anything about string literals containing newlines #19399

Closed
nodakai opened this issue Nov 29, 2014 · 4 comments

Comments

@nodakai
Contributor

nodakai commented Nov 29, 2014

"ab\ncd"

denotes a Unicode string U+0061 U+0062 U+000A U+0063 U+0064.

"ab
cd"

also denotes the same Unicode string.

On the other hand,

"ab\
cd"

and

"ab\
    cd"

both denote the Unicode string U+0061 U+0062 U+0063 U+0064. The Rust lexer ignores an "escaped newline" optionally followed by a sequence of "whitespace" characters, where "whitespace" is defined by the function below in libsyntax/parser/lexer/mod.rs. (Update: my complaint below about Rust lacking a definition of "whitespace" was incorrect, and I retract it.)

pub fn is_whitespace(c: Option<char>) -> bool {
    match c.unwrap_or('\x00') { // None can be null for now... it's not whitespace
        ' ' | '\n' | '\t' | '\r' => true,
        _ => false
    }
}

This predicate follows neither the traditional definition of "space" (from the C language) nor Unicode's definition of "whitespace". So if we use the Unicode ideographic space U+3000 (colloquially known in Japan as the "full-width space"), the space-munching logic does not skip it. For example,

"ab\
 cd"

denotes a Unicode string U+0061 U+0062 U+3000 U+0063 U+0064. Of course, such a decision is totally up to language designers, but it is desirable to give a clear explanation about it.
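The cases above can be checked with a small program. This is a minimal sketch, assuming a current Rust toolchain in which these literal rules still hold. The names `is_lexer_whitespace`, `raw_newline`, `continued`, `continued_ideographic`, and `code_points` are hypothetical helpers made up for this illustration, and U+3000 is written as the escape `\u{3000}` so that the example does not depend on an invisible character surviving copy and paste.

```rust
/// Mirrors the four-character set of the lexer predicate quoted above.
/// (Hypothetical helper, not the compiler's own function.)
fn is_lexer_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\n' | '\t' | '\r')
}

/// A raw line break inside the quotes is kept as U+000A.
fn raw_newline() -> &'static str {
    "ab
cd"
}

/// A backslash directly before the line break drops the break and the
/// ASCII whitespace that follows it.
fn continued() -> &'static str {
    "ab\
        cd"
}

/// Same continuation, but the next character is U+3000 (written as an
/// escape here), so it stays in the string.
fn continued_ideographic() -> &'static str {
    "ab\
        \u{3000}cd"
}

/// Formats each character of `s` as a `U+XXXX` code point.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    println!("{}", code_points(raw_newline()).join(" "));
    println!("{}", code_points(continued()).join(" "));
    println!("{}", code_points(continued_ideographic()).join(" "));

    // U+3000 is whitespace according to Unicode, but not according to
    // the four-character predicate.
    println!("lexer:   {}", is_lexer_whitespace('\u{3000}'));
    println!("unicode: {}", '\u{3000}'.is_whitespace());
}
```

The last two lines show the mismatch this comment is about: `char::is_whitespace` follows the Unicode `White_Space` property, while the four-character predicate does not.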

As for a character literal, it's interesting that the lexer rejects some kinds of "space" characters:

            '\t' | '\n' | '\r' | '\'' if delim == '\'' => {
                let last_pos = self.last_pos;
                self.err_span_char(
                    start, last_pos,
                    if ascii_only { "byte constant must be escaped" }
                    else { "character constant must be escaped" },
                    first_source_char);
                return false;
            }

For example, this Rust code is rejected:

println!("{}", '
');
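
For contrast, the escaped spellings of the same characters are accepted. A minimal sketch, assuming a current Rust toolchain; `escaped_chars` is a hypothetical helper name used only for this illustration.

```rust
/// Escaped spellings of the four characters that the lexer branch quoted
/// above refuses to accept unescaped inside a character literal.
fn escaped_chars() -> [char; 4] {
    ['\t', '\n', '\r', '\'']
}

fn main() {
    let chars = escaped_chars();
    for c in chars.iter() {
        // `{:?}` prints the escaped form, so the output stays readable.
        println!("{:?} is U+{:04X}", c, *c as u32);
    }

    // A plain space is not in the rejected set, so it needs no escape.
    let space = ' ';
    println!("{:?} is U+{:04X}", space, space as u32);
}
```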
@nodakai
Contributor Author

nodakai commented Nov 29, 2014

Sorry, I overlooked that Section 3.4 Whitespace already explained the peculiarity of Rust's definition of "whitespace".

Still, I believe most of the above comment remains valid. In particular, the rejection of some character literals disagrees with the EBNF definition of the syntax:

char_lit : '\x27' char_body '\x27' ;

char_body : non_single_quote
          | '\x5c' [ '\x27' | common_escape | unicode_escape ] ;

I don't think we should try to be too formal with something like EBNF from the beginning. Illustrative examples will be much more useful for day-to-day use. But the language reference must at least mention every feature of Rust in some way.

@steveklabnik
Member

We have a real grammar now, so I'm considering this closed. Thanks!

@nodakai
Contributor Author

nodakai commented Feb 17, 2015

@steveklabnik I assume you had src/doc/grammar.md in mind when you referred to "a real grammar." But as of now it doesn't explain either of the two points above, so this issue remains relevant.

@steveklabnik
Member

I thought the grammar was actually tested, so it's accurate, which was the primary complaint, right?
