RFC: Syntax for raw string literals #9411
Comments
I think functionality equivalent to the C++11 syntax is best, but ideally not as noisy. We also need to consider how text editor syntax files will handle it, but I don't think it will be too much of a problem. |
@thestinger Do you have any suggestions? Functionally equivalent to C++11 requires something of the form The only adjustment I can think of would be to remove |
How about
|
@stevenashley The lexer will see the |
Ah, of course. I can't think of a substitute for |
How about this:
As far as I know we don't allow
Alternatively, we could also throw away the
Or we make both forms valid: |
@Kimundi That looks like a reinvention of Lua's |
It's probably also worth re-mentioning the various use-cases which have a desire for some sort of syntax that is not what we have today:
Those are the use cases that I could think of, others may have more |
Would case 4 become |
I suppose under the C++ syntax that's what it would be, which is arguably just as confusing as four backslashes. |
@kballard I would say it's better than Lua's syntax here.
Looking at @alexcrichton's use cases:
|
Those all look totally reasonable to me. |
Me also. I prefer Kimundi's proposed syntax over C++11 syntax. Nicely done. |
(All of these mean the token language is no longer regular, right?) |
@Kimundi: Regular expression:
That said, I am not as averse to this syntax as I was initially. While I think it looks weird, and it would feel weird every time I type it, I would be ok with using it. @huonw I believe you are correct. Is that particularly important? |
The restriction to sequences of I would personally prefer C++11 (or any variant that does not restrict the user-selected token sequence to such an impoverished alphabet), and instead leave restrictions (e.g. to The theoretician in me wants to say "here's a compromise: the end user sequence is strings drawn from a two-element alphabet, for example the regexp But it is not a big deal to me; it's certainly not as important as just having some choose-your-own delimiter option, even if it did end up being solely drawn from strings of (one last note: I realized after I wrote this that I misrepresented Kimundi's proposal slightly, since Kimundi's proposal is not a mere restriction of the C++11 proposal, so it's not as if we could start with C++11 and just add a lint. But I think the rest of my note holds up. Especially the last part, where I said it's not a big deal to me. :) ) |
@pnkfelix: All fair points, however I think in practice you'd never need to have more than one or two @kballard: Likewise, in that example there would be no need for more than one
Personally, I'm wary of the "any string as delimiter" approach: It can more easily lead to inconsistencies and style issues because every literal might use a different one. Restricting it to one character at least restricts the possible variations to one dimension, the length, and that people will tend to make as short as possible. ;) |
@huonw's point (that a choose-your-own-delimiter implies non-regular token language) might be important, depending on what our lowest common denominator is for tool support. E.g. if some IDE only supports regular tokens for its syntax highlighting. (Or a better example: If we don't want to put in the effort necessary to figure out how to handle non-regular languages on all the major IDEs that we hope to support.) I'll try to bring this up at the weekly meeting on Tuesday, solely to determine whether a regular token language is a hard constraint or not. (That is, I hope to avoid a bikeshed during the meeting...) |
@huonw Yep. Raw strings are not embeddable within a regular language as it means that the string terminator must also be regular. A document containing a terminator would be unable to be embedded. I don't think it is a big problem as they are parsable by any regex engine supporting back references and non-greedy matching. For example: A regex that parses |
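The "same delimiter at both ends" matching that a back-reference provides can also be done by hand with a counter. Below is a minimal sketch, assuming the hash-counted form (`r"..."`, `r#"..."#`, and so on) that this thread later settles on. The name `scan_raw_string` and its return shape are made up for illustration and are not the rustc lexer's API.

```rust
/// Scans one raw string literal at the start of `src`.
/// Returns the literal's body and the number of bytes consumed,
/// or `None` if `src` does not start with a complete raw string.
fn scan_raw_string(src: &str) -> Option<(&str, usize)> {
    // Opening delimiter: `r`, then zero or more `#`, then `"`.
    let rest = src.strip_prefix('r')?;
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let body_and_tail = rest[hashes..].strip_prefix('"')?;

    // Closing delimiter: `"` followed by the same number of `#`.
    // Remembering `hashes` is the part a plain regular expression cannot do.
    let closer = format!("\"{}", "#".repeat(hashes));
    let end = body_and_tail.find(closer.as_str())?;

    let consumed = 1 + hashes + 1 + end + closer.len();
    Some((&body_and_tail[..end], consumed))
}

fn main() {
    // The outer `r##"..."##` is only there to spell the test input.
    let src = r##"r#"a "quoted" b"# + rest"##;
    let (body, consumed) = scan_raw_string(src).expect("well-formed literal");
    println!("{} ({} bytes)", body, consumed);
}
```

The scanner stops at the first quote followed by enough hashes, so a quote followed by fewer hashes is treated as ordinary content.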
@pnkfelix @huonw: You could also just hack around that: Of course, that only "really" works if the failure case is something inconsequential like syntax highlighting failure. |
@Kimundi Given the number of non-regular languages out there (lots of languages have some equivalent of either raw strings or heredocs), I would be surprised if any tools would need hacks like that at all. |
@kballard Right, just wanted to throw that out there as fallback workaround. :) |
Because @pnkfelix alluded to it, and I also got a comment along those lines on IRC: Even though I'd personally not be in favor of allowing it at all, if we'd want to allow arbitrary delimiter strings anyway, then that'd still be compatible with my proposal: Just allow any string not containing Would certainly give good opportunities for self-documenting literals:
|
@Kimundi I think allowing spaces (or any whitespace) is a mistake. Makes it harder to tell what's intentional and what's a typo in the source. |
Ruby also uses
a = 5
puts "Value of a: #{a}"
# => "Value of a: 5"
puts 'Value of a: #{a}'
# => "Value of a: #{a}" |
@steveklabnik That syntax is incompatible with parsing lifetimes. If it weren't, I'd have already submitted a PR for supporting 4-char codes using |
@kballard awesome, just wanted to make sure that all of the other implementations were covered in what we're looking at. |
According to the weekly meeting 2013-09-24, the regular language issue is a non-issue (because of a desire to allow comment nesting, which already makes it non-regular). |
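Why nesting rules out a regular token language: the recognizer has to count how many comments are currently open, and an unbounded counter is what a finite automaton lacks. A small sketch of that counter follows; `block_comment_len` is a hypothetical helper for illustration, not compiler code.

```rust
/// Returns the byte length of the `/* ... */` comment at the start of `src`,
/// treating nested `/* ... */` pairs as part of the same comment.
/// Returns `None` if `src` does not start with a comment or it never closes.
fn block_comment_len(src: &str) -> Option<usize> {
    if !src.starts_with("/*") {
        return None;
    }
    let bytes = src.as_bytes();
    let mut depth = 0usize; // number of currently open comments
    let mut i = 0;
    while i + 1 < bytes.len() {
        match (bytes[i], bytes[i + 1]) {
            (b'/', b'*') => {
                depth += 1;
                i += 2;
            }
            (b'*', b'/') => {
                depth -= 1;
                i += 2;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => i += 1,
        }
    }
    None
}

fn main() {
    println!("{:?}", block_comment_len("/* a /* b */ c */ x"));
    println!("{:?}", block_comment_len("/* a /* b */"));
}
```

A raw string with a counted delimiter needs the same kind of state, so it adds no new class of difficulty for the lexer.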
I see this as a twofold issue, as 'raw' string literals are really separated into two groups from what I can tell. The use cases described so far are: regexes, which have lots of backslashes; Windows paths, which have lots of backslashes; giant blobs of raw text, which may contain literally anything, as such blobs are often generated by other programs or are programs in an unknown-to-Rust other language; and format! string directives, which have lots of backslashes.

So for three of the four cases the only important attribute is a readable way to hold backslashes (which means that regular -style escaping will not suffice). There are a few good proposals which solve this problem; my favorite is r"foo""bar" syntax where only the " char is handled specially (with doubling as the escape). The listed drawback to this approach is that it "Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML)."

No scheme will pass through verbatim every character in the source sequence for all sequences. The workaround of ensuring that any single character always passes through verbatim except if its context is composed of other characters which comprise the end delimiter is more complicated than an unconditional (character-based rather than sequence-based) escaping scheme and harder to quickly check. Using r"foo""bar" syntax will also allow, if a user does insert text containing single double quotes, a compile-time failure so that they can fix the string. It's not a 100% solution since someone wanting two adjacent double-quotes (who would only get one back) would not be warned, but it's a very simple syntax which shouldn't take long for users to learn, especially if their likely first mistake of using one quote instead of two would cause a compile-time error.
I don't see a strong case for embedding large blobs of text in source, as that practice is poor form in general: editors rarely provide much support for working with arbitrary languages embedded in strings and the approach is increasingly awkward as the blobs grow in size. I would advise against encouraging this antipattern with language workarounds, especially considering that they do not fully solve the problem of escaping (either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step). Using an include! macro to reference separate files to insert data blobs seems a better approach, as it does fully avoid the problems in delimiters/escaping, and the data blobs can then be constructed separately and statically checked for correctness in their native language as part of the build process without having to extract them from Rust code for that purpose. As a language which likes to demonstrate what can be achieved with types I think it would be a shame to see big text blobs being considered idiomatic Rust; we should rather discourage stringly-typed data. |
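The doubled-quote scheme described in this comment can be sketched as a decoder over the text between the outer quotes. `decode_doubled` is a hypothetical helper, not a proposed implementation. It reports a lone quote as an error, standing in for the compile-time failure mentioned above, and it also shows the silent case where two adjacent quotes come back as one.

```rust
/// Decodes the body of a doubled-quote raw string: `""` becomes `"`,
/// every other character passes through unchanged.
/// A lone `"` is rejected with its byte offset, because in real source it
/// would have ended the literal early.
fn decode_doubled(body: &str) -> Result<String, usize> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c != '"' {
            out.push(c);
            continue;
        }
        // A quote is only valid when immediately followed by a second quote.
        if chars.peek().map(|&(_, next)| next) == Some('"') {
            chars.next();
            out.push('"');
        } else {
            return Err(i);
        }
    }
    Ok(out)
}

fn main() {
    println!("{:?}", decode_doubled("foo\"\"bar")); // doubled quote
    println!("{:?}", decode_doubled("foo\"bar")); // lone quote
}
```

Backslashes need no treatment at all under this scheme, which is the property the comment cares about. The cost is that pasted text containing quotes has to be edited before it is valid.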
I'd really prefer to have format string syntax and regex syntax that simply use another escape character (like I'd prefer to keep the amount of different options of string literals (and I suppose the complexity of a correct lexer) as low as possible in this case. :( |
Schemes that use user-controlled delimiters can pass verbatim every possible sequence, merely by modifying the delimiter appropriately. @ben0x539 We already have fewer string literals than most languages (that is to say, we have one string literal). |
@kballard: right, as I discussed--by allowing a wide variety (such as is the case with user-defined ones) of schemes we can get around the fact somewhat, but that will not obviate the need to either update the scheme or escape the contents when making changes: "either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step". In the interest of keeping the language simple and elegant I would think a solution involving a finite, small number of valid formats for contained text would be ideal, unless there is an as-yet-unmentioned good reason to be placing and maintaining large blobs of a different language inside Rust code. |
@sp3d "constantly worry about updating delimiters"? You make it sound like the contents of these raw strings change frequently, and with zero predictability. If I'm embedding a section of raw HTML, I'm pretty sure I can come up with a delimiter that's highly unlikely to show up. The problem with the |
We could always do something like r(delim)textdelim, for example, |
@kballard @Kimundi okay, the team gave the go-ahead to implement Kimundi's proposed r" with hash-tally delimited raw-strings. So now the fun begins; I'd be happy to help shepherd a PR through. |
@pnkfelix Huzzah! I'm quite happy to do the implementation myself |
I'm currently also looking at the lexer code. If I'm right, this can be done with only local changes to one function. Working on the change atm. |
@Kimundi Yes I'm pretty sure it can be done in |
@Kimundi I have most of a patch already, can we talk/compare notes on irc or so? |
This branch parses raw string literals as in #9411.
Closed by #9674, nice work everyone! |
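For readers arriving later, here is a short sketch of the hash-delimited form as it works in current Rust. The constant names are invented for the example. Each pair holds the same text, once with escapes and once as a raw string.

```rust
// No quotes in the body: plain `r"..."` is enough.
const PATH_ESCAPED: &str = "C:\\Users\\me\\notes.txt";
const PATH_RAW: &str = r"C:\Users\me\notes.txt";

const REGEX_ESCAPED: &str = "\\d+\\.\\d+";
const REGEX_RAW: &str = r"\d+\.\d+";

// A body containing `"` needs one `#` on each side.
const QUOTED_ESCAPED: &str = "say \"hi\"";
const QUOTED_RAW: &str = r#"say "hi""#;

// A body containing `"#` needs two `#` on each side.
const HASHED_ESCAPED: &str = "a \"# b";
const HASHED_RAW: &str = r##"a "# b"##;

fn main() {
    assert_eq!(PATH_ESCAPED, PATH_RAW);
    assert_eq!(REGEX_ESCAPED, REGEX_RAW);
    assert_eq!(QUOTED_ESCAPED, QUOTED_RAW);
    assert_eq!(HASHED_ESCAPED, HASHED_RAW);
    println!("{}", PATH_RAW);
}
```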
These strings can contain arbitrary characters and do not process *any* escape sequences. The only special characters are line endings which are normalized to \n as in regular strings. Everything else is represented verbatim.

After careful consideration and studying this thread [1], I have decided to inherit Rust's syntax for raw strings. Seriously, it's very good:

- Double quotes as a 'this is a string' marker.
- Low level of syntactic noise in simple cases.
- Arbitrary sequences of characters can be embedded by using a sufficient number of # characters for padding.
- Only one dimension of variance: padding length. This gives us consistent syntax and makes it easier for humans to recognize the raw strings in text.

Thank you, Kimundi, for your brilliance.

Though, the usage of # for padding may be reconsidered in Sash as I intend to use # in so-called 'multipart identifiers' to adopt mixfix call syntax. It may be better to choose some other character to not overload the #.

Also, raw strings do report bare CR characters as regular strings do.

[1] rust-lang/rust#9411
|
@boosh You can use an arbitrary number of |
@jonas-schievink Ah great, thank you! I thought it was strange it hadn't been considered 👍 |
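Because the only thing that varies is the padding length, the shortest workable delimiter for any given text can be computed mechanically. Here is a sketch, assuming Rust's rule that a raw string opened with n hashes ends at the first quote followed by n hashes. `hashes_needed` and `to_raw_literal` are hypothetical helpers, and the bare-CR restriction is ignored.

```rust
/// Smallest number of `#` that lets `body` sit inside a raw string:
/// one more than the longest run of `#` that directly follows a `"`,
/// or zero if the body contains no `"` at all.
fn hashes_needed(body: &str) -> usize {
    let bytes = body.as_bytes();
    let mut needed = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'"' {
            let run = bytes[i + 1..].iter().take_while(|&&b| b == b'#').count();
            needed = needed.max(run + 1);
            i += 1 + run;
        } else {
            i += 1;
        }
    }
    needed
}

/// Wraps `body` in the shortest raw string literal that can hold it.
fn to_raw_literal(body: &str) -> String {
    let hashes = "#".repeat(hashes_needed(body));
    format!("r{0}\"{1}\"{0}", hashes, body)
}

fn main() {
    for body in ["foo", "say \"hi\"", "a\"#b"] {
        println!("{}", to_raw_literal(body));
    }
}
```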
Excellent choice. For my own language design I've looked at every syntax out there as well as coming up with several of my own, and this is the least verbose and complex, while also allowing any string to be delimited. Fortunately, bad arguments, like the one for doubling I actually somewhat prefer Kimundi's second proposal, which is a superset of the one adopted: rX"text"X, where X is either empty or is any sequence starting with |
A raw string literal is a string literal that does not interpret any embedded sequence, meaning no backslash-escapes. A lot of languages (certainly most that I've used) support some syntax for raw string literals. They're useful for embedding any string that wants to have a bunch of backslashes in it (typically because the function the string is passed to wants to interpret them itself), such as regular expressions. Unfortunately, Rust does not have a raw string literal syntax.
There's been a discussion on the mailing list for the past few days about this. I will try to put a quick summary here.
There are two questions at stake. The first is, should Rust have a raw string literal syntax? The second is, if so, what particular syntax should be used? I think the answer to the first is definitely Yes. It's useful enough, and has enough overwhelming precedent in other languages, that we should add it. The question of concrete syntax is the harder one.
The syntaxes that have been proposed so far, along with their Pros and Cons:
C++11 syntax, e.g. R"delim(raw text)delim".
Pros:
Cons:

Python syntax, e.g. r"foo".
Pros:
Cons:
r"foo\"" evaluates to the string foo\" (with the literal backslash).

D syntax, e.g. r"raw text", `raw text`, or q"(raw text)" / q"delim\nraw text\ndelim".
Pros:
Cons:

C#/SQL/something else, using a simple raw string syntax such as r"text" where doubling up the quote inserts a single quote, as in r"foo""bar".
Pros:
Cons:

Perl quote-like operators, e.g. q{text}. Unfortunately, most viable delimiters will result in an ambiguous parse.

Ruby quote-like operators, e.g. %q{text}. Unfortunately, this also is ambiguous (with the % token).

Lua syntax, e.g. [=[text]=].
Pros:
Cons:
println!([[Hello, {}!]], "world") in an introduction to Rust would be awfully confusing (see previous point about being non-string-like).

Go syntax, e.g. `raw text`. This is one of the variants of D strings as well.
Pros:
Cons:
foo in doc comments.

A new syntax using ASCII control characters STX and ETX.
Pros:
Cons:

A syntax proposed over IRC is delim"raw text"delim.
Pros:
Cons:
Some form of Heredoc syntax was also suggested, but heredocs are really primarily concerned with embedding multiline input, not raw input. They also have issues around dealing with indentation and the first/last newline.
During this discussion, only two Rust team members (that I'm aware of) chimed in. Alex Crichton raised issues with the Lua syntax, and threw out the suggestion of Go's syntax, though only as something to consider rather than a recommendation. Felix Klock expressed a preference for C++11 syntax, and more generally stated that he wants a syntax with user-delimited sequences. There was also at least one community member in favor of C++11 syntax.
My own preference at this point is for C++11 syntax as well. At the very least, something similar to C++11 syntax, that shares all of its properties, but there seems to be no value in inventing a new syntax when there's precedent in C++11.