Unified String Literals #3475

Open
wants to merge 8 commits into
base: master

pitaj
Contributor

@pitaj pitaj commented Aug 15, 2023

Rendered

This RFC proposes to unify the syntax of the existing string literal and raw string literal forms, supporting both the use of escape sequences and avoiding the need to escape backslashes and quotation marks. This proposal also uses the new syntax to improve format string ergonomics, reducing the need for double-brace escapes.

@joshtriplett
Member

I really like this approach.

However, the format placeholder escaping seems like a substantial layering violation. I don't see an obvious implementation of this that doesn't turn a language concern (string lexing) into a library concern (format string placeholders). And the standard library is not the only thing in the ecosystem that handles placeholders.

@programmerjake
Member

an idea for how string literals with format placeholders would work with proc-macros and concat! and similar: a string literal is a list of segments (that are always concatenated back together when it's used in rust code to produce a &str -- I'm referring to that as its value) where each spot where a segment ends and the next starts is a format placeholder start (all # used to indicate a placeholder start are removed). e.g.:

| source code | segments | &str value |
|---|---|---|
| "abc {xyz}" | ["abc {", "xyz}"] | "abc {xyz}" |
| #"abc #{xyz}"# | ["abc {", "xyz}"] | "abc {xyz}" |
| #"abc {xyz}"# | ["abc {xyz}"] | "abc {xyz}" |
| "abc {{xyz}" | ["abc {{xyz}"] | "abc {{xyz}" |
| #"abc ##{xyz}"# | ["abc #{", "xyz}"] | "abc #{xyz}" |
| #"#{"# | ["{", ""] | "{" |

concat! concatenates input strings: if any input strings were surrounded by #s, then it preserves format placeholder starts and produces an output string literal surrounded by at least 1 # (so nested concat! behave like one giant flattened concat!). otherwise, when all inputs have no surrounding #, for backwards compatibility concat! concatenates &str values as usual and then parses the resulting concatenated string literal into segments rather than preserving format placeholder starts.

| source code | segments | &str value | macro-expanded token for proc-macros |
|---|---|---|---|
| concat!("{", "{") | ["{{"] | "{{" | "{{" |
| concat!("{", #"{"#) | ["{", "{"] | "{{" | #"#{{"# |
| concat!(#"{"#, "{") | ["{{", ""] | "{{" | #"{#{"# |
| concat!(#"{"#, #"{"#) | ["{{"] | "{{" | #"{{"# |
| concat!(#"a{}b#{c}"#, "x{y}z") | ["a{}b{", "c}x{", "y}z"] | "a{}b{c}x{y}z" | #"a{}b#{c}x#{y}z"# |
| concat!("{", "a}") | ["{", "a}"] | "{a}" | "{a}" |
| concat!("{", "{a}") | ["{{a}"] | "{{a}" | "{{a}" |
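To make the segment model concrete, here is a minimal sketch of the splitting rule for guarded literals only, written in today's Rust. The name split_segments and the idea of passing the guard count as a plain usize are assumptions for illustration; nothing like this exists in proc_macro or in the RFC text. The unguarded case, with its {{ escapes, is deliberately left out.

```rust
/// Hypothetical helper: splits the content of a guarded literal into segments.
/// A segment ends right after every `{` that is preceded by `guards` hashes,
/// and those hashes are dropped from the segment text.
fn split_segments(content: &str, guards: usize) -> Vec<String> {
    // With zero guards every `{` would match, which ignores `{{` escapes.
    assert!(guards >= 1, "this sketch only covers guarded literals");
    // The placeholder-start marker, e.g. "#{" for one guard.
    let marker = format!("{}{{", "#".repeat(guards));
    let mut segments = Vec::new();
    let mut rest = content;
    while let Some(pos) = rest.find(&marker) {
        // Keep the text before the marker plus the brace itself.
        segments.push(format!("{}{{", &rest[..pos]));
        rest = &rest[pos + marker.len()..];
    }
    segments.push(rest.to_string());
    segments
}
```

Under these assumptions, split_segments("abc ##{xyz}", 1) keeps the extra # as literal text in the first segment, matching the corresponding row of the first table.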

@VitWW

VitWW commented Aug 15, 2023

Adding a hash to escaping is a good idea!

If a normal string uses { and } to open and close a placeholder,
it would be nice for a guarded string to hash BOTH sides, #{ and }# (symmetrical to #" and "#), rather than the half-hashed #{ and }

format!("Hello, {}!", "world");    // => Hello, world!

// these are good brackets
format!(#"Hello, #{}#!"#, "world"); // => Hello, world!

// not these!
format!(#"Hello, #{}!"#, "world"); // => Hello, world!

@steffahn
Member

steffahn commented Aug 15, 2023

However, the format placeholder escaping seems like a substantial layering violation. I don't see an obvious implementation of this that doesn't turn a language concern (string lexing) into a library concern (format string placeholders). And the standard library is not the only thing in the ecosystem that handles placeholders.

Looks like, string lexing already is also a library concern, given that proc_macro directly does not expose values of literal tokens. See e.g. here, here, here in syn.

That said, many macros build on top of syn, at which point then, a string literal does look like a single uniform type, independent of the underlying syntax being ordinary or raw strings, and with a .value() property that just returns the underlying String contents, unescaped.


edit… That is not to say there aren’t any problems. For example, the RFC is not clear about how/whether

println!(concat!(#"output: #{}"#), 42);

should work. And how about any of these:

fn main() {
    let x = 42;
    println!(concat!(#"{x} #{x}"#, ""), x = x);
    println!(concat!(#"{x} #{x"#, "}"), x = x);
    println!(concat!(#"{x} #{"#, "x}"), x = x);
    println!(concat!(#"{x} #"#, "{x}"), x = x);
    println!(concat!(#"{x} "#, "#{x}"), x = x);
    println!(concat!(#"{x}"#, " #{x}"), x = x);
    println!(concat!(#"{x"#, "} #{x}"), x = x);
    println!(concat!(#"{"#, "x} #{x}"), x = x);
    println!(concat!(#""#, "{x} #{x}"), x = x);
}

edit2 I’ll be re-reading @programmerjake's ideas on this point again, now that I’ve actually noticed the potential issue myself in the first place.

@ehuss ehuss added the T-lang Relevant to the language team, which will review and decide on the RFC. label Aug 15, 2023
@programmerjake
Member

@steffahn If following my proposed concat! semantics, those would print:

fn main() {
    let x = 42;
    println!(concat!(#"{x} #{x}"#, ""), x = x); // prints: {x} 42
    println!(concat!(#"{x} #{x"#, "}"), x = x); // prints: {x} 42
    println!(concat!(#"{x} #{"#, "x}"), x = x); // prints: {x} 42
    println!(concat!(#"{x} #"#, "{x}"), x = x); // prints: {x} #42
    println!(concat!(#"{x} "#, "#{x}"), x = x); // prints: {x} #42
    println!(concat!(#"{x}"#, " #{x}"), x = x); // prints: {x} #42
    println!(concat!(#"{x"#, "} #{x}"), x = x); // prints: {x} #42
    println!(concat!(#"{"#, "x} #{x}"), x = x); // prints: {x} #42
    println!(concat!(#""#, "{x} #{x}"), x = x); // prints: 42 #42
}

@pitaj
Contributor Author

pitaj commented Aug 15, 2023

Essentially, my plan was the following regarding format placeholders:

Format placeholders are not a string lexing question at all. #"this string has #{} in it"# is just a string literal that in any other context resolves to the string this string has #{} in it.

The concat question is interesting. I am inclined to define that concat! always returns a bare string literal, so all of @steffahn's examples would print 42 #{x}. We can lint those cases where it is used with a guarded string literal with a guarded placeholder in format string position.

When a macro or format_args is parsing a format string, it simply needs to know the prefix used:

  • if #*, the placeholder is always #*{} and doubled curly braces are passed through literally
  • otherwise, the placeholder is always {} and curly braces are escaped by doubling
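As a rough illustration of the two rules above, the following sketch rewrites the content of a guarded format string into an equivalent classic format string that today's format_args! would accept. It uses the #{} spelling discussed at this point in the thread. The function names are hypothetical, and the guard count is passed in by hand because no existing API reports it.

```rust
/// Doubles every brace so that a classic format string treats it as literal text.
fn push_escaped(out: &mut String, text: &str) {
    for ch in text.chars() {
        if ch == '{' || ch == '}' {
            out.push(ch);
        }
        out.push(ch);
    }
}

/// Hypothetical translation: `guards` hashes followed by `{` open a placeholder,
/// every other brace is literal. Returns None if a placeholder is never closed.
fn to_classic_format(content: &str, guards: usize) -> Option<String> {
    assert!(guards >= 1, "this sketch only covers guarded literals");
    let marker = format!("{}{{", "#".repeat(guards));
    let mut out = String::new();
    let mut rest = content;
    while let Some(pos) = rest.find(&marker) {
        // Literal text before the placeholder: escape its braces.
        push_escaped(&mut out, &rest[..pos]);
        let after = &rest[pos + marker.len()..];
        let end = after.find('}')?;
        // Copy the placeholder body (name, format spec) unchanged.
        out.push('{');
        out.push_str(&after[..end]);
        out.push('}');
        rest = &after[end + 1..];
    }
    push_escaped(&mut out, rest);
    Some(out)
}
```

For example, the guarded content set {#{}, #{}, ...}. with one guard becomes set {{{}, {}, ...}}. in classic form, which shows how much doubling the guarded spelling avoids.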

When first introduced, this new syntax without the r prefix will be treated as an unexpected literal kind by syn and other proc macros. For macros which parse directly (like indoc) they will just have to handle this manually like with the other string forms.

syn will need to carefully consider the API here. On the one hand, most macros are not dealing with format strings so want just a simple .value() no matter the string type. On the other hand, macros dealing with format strings may want syn to encode the semantic difference in the API. Then again, they should have plenty of time while this feature is unstable to ensure correctness.

Personally, I think it makes the most sense for syn to treat this the same as the other string literals (bundled into LitStr), but provide an extra .guard_prefix() accessor returning Option<u8> or Option<&str> for format string use cases.
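A sketch of what such an accessor might compute, assuming it is handed the literal's source text (which is what proc macros can already recover from a token). The free function guard_prefix below is hypothetical and stands in for a method on LitStr; syn has no such API today, and the sketch does not validate the closing delimiter.

```rust
/// Hypothetical stand-in for a `.guard_prefix()` accessor.
/// Returns the number of leading hashes of a guarded literal such as #"..."#,
/// or None for plain ("...") and raw (r#"..."#) literals.
fn guard_prefix(literal_src: &str) -> Option<u8> {
    let hashes = literal_src.bytes().take_while(|&b| b == b'#').count();
    // A guarded literal needs at least one hash directly followed by a quote.
    if hashes == 0 || !literal_src[hashes..].starts_with('"') {
        return None;
    }
    u8::try_from(hashes).ok()
}
```

A format-string macro could branch on this value, while macros that only need the string contents could ignore it.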

@VitWW

VitWW commented Aug 15, 2023

Oh, I think there is some confusion between "formatted strings" (like f"some {x} value", f#"another #{x} value") and macros (like format!("some {} value", x), println!(#"some {} value"#, x)).
This proposal of hashed formatting is good for "formatted strings", but not for macros.
For macros we must have similar macros with different formatting (like format2!("some #{} value", x), println2!(#"some #{} value"#, x)).

@programmerjake
Member

@VitWW we can use the exact same macros because macros can see the original literal's syntax

@programmerjake
Member

I think it would be cleaner in terms of layering for ###{ sequences to be a form of escape sequence that translates to { plus an indication that a formatting placeholder starts there. then, all formatting macros need to do is parse a placeholder everywhere there's a placeholder start indication. concat! would preserve those indications (unless restricted by backwards compatibility).

@pitaj
Contributor Author

pitaj commented Aug 15, 2023

Probably easier to store the placeholder indices in metadata somewhere.

But I'd still prefer to just not involve the lexer at all with placeholders, if possible.

@quaternic

After the proposed new syntax additions, the RFC is also proposing that the current syntax for raw string literals would be removed in a future edition, and the discussed drawbacks are largely about that.

Is that removal necessary or even desirable? As far as I can see, the new syntax does not add a perfect replacement for raw string literals: A literal that doesn't and cannot contain escape sequences within it. The proposed syntax only has literals where the possible escape sequences are made arbitrarily long.

Pathological example:

let current_raw =        r"\######n";
let proposed_raw = #######"\######n"#######;
let escaped =            "\\######n";

The problem with proposed_raw is that I now have to count the hashes to be sure that it's actually r"\######n" and not a weirdly expressed "\n". When I see the r, it's immediately clear that what I see is what I get, since no escape is possible.

I believe the current r"..." provides the valuable information that there are definitely no escapes within the string. When I see a raw string, I can immediately assume that what I see is what I get without thinking about escape sequences and scanning it for backslashes possibly followed by hashes.

It seems to me like the two, "guarded" and "raw" could just be considered independent features of string literals: the r-prefix makes it raw by disabling all escape handling within the literal, the additional #s differentiate the syntax of the literal from its contents by guarding the delimiter- and escape-sequences.

@programmerjake
Member

Pathological example:

let current_raw =        r"\######n";
let proposed_raw = #######"\######n"#######;
let escaped =            "\\######n";

it seems to me that you could just write it: "\######n" since we could define it such that the escapes wouldn't activate unless the number of # is equal.

@pitaj
Contributor Author

pitaj commented Aug 16, 2023

Another option is to use a prefix that makes the difference in quantity more obvious:

let proposed_raw = ##########"\######n"##########;

@pitaj
Contributor Author

pitaj commented Aug 16, 2023

we could define it such that the escapes wouldn't activate unless the number of # is equal

I don't really like this. As it is, it can be easy to use one too few # and accidentally end up with a literal \###n in your string. With the "at least" rule, we can throw an "unexpected escape" error, preventing that accident mode.
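The "at least" rule can be sketched as a small unescaping function in current Rust. This is only an illustration under stated assumptions: the name unescape_guarded is made up, the guard count is passed in explicitly, and only a handful of escapes (n, t, backslash, quote) are modeled.

```rust
/// Hypothetical unescaper for the content of a literal with `guards` hashes.
/// An escape starts with a backslash followed by exactly `guards` hashes;
/// a backslash followed by fewer hashes is ordinary text, and a backslash
/// followed by more hashes is rejected as an unexpected escape.
fn unescape_guarded(content: &str, guards: usize) -> Result<String, String> {
    let intro = format!("\\{}", "#".repeat(guards));
    let mut out = String::new();
    let mut rest = content;
    while let Some(pos) = rest.find(&intro) {
        out.push_str(&rest[..pos]);
        let mut chars = rest[pos + intro.len()..].chars();
        let decoded = match chars.next() {
            Some('n') => '\n',
            Some('t') => '\t',
            Some('\\') => '\\',
            Some('"') => '"',
            // Includes the "one hash too many" case, where the next char is '#'.
            Some(other) => return Err(format!("unexpected escape `{}{}`", intro, other)),
            None => return Err("unterminated escape".to_string()),
        };
        out.push(decoded);
        rest = chars.as_str();
    }
    out.push_str(rest);
    Ok(out)
}
```

Applied to the pathological example, the content \######n is left untouched with seven guards, decodes to a newline with six, and is an error with five.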

@quaternic

Agreed, an escape sequence followed by an # should be an error to get reasonable feedback from the compiler.

Another option is to use a prefix that makes the difference in quantity more obvious:

let proposed_raw = ##########"\######n"##########;

That helps the specific case if the code was written with that in mind, but most likely many literals would only have as many guards as are necessary, and no more. It's not hard to imagine e.g. clippy would point that out, (goes to check), and it turns out it does: needless_raw_string_hashes (There's some discussion on whether it should be a warning in PR#112373 adding an equivalent to the compiler, which was closed in favor of the clippy lint.)

That also doesn't change the core issue that while you can make the escape sequence bigger to make it more obvious that there aren't any, you still need to scan the string contents for possible escapes.

I can see the case for not adding r"..." to the language if we already had #-guarded literals that can do essentially the same things, only needing extra guards in some unusual cases. But replacing r"..." when it's ~always been part of the language is not as easy to justify.

For that reason it seems better to focus the proposal on allowing non-raw literals to use guarded escapes. I still believe that would unify the literals in a simple way: The prefixed r (for "raw") is simply an opt-out from having any escape sequences at all.

@VitWW

VitWW commented Aug 16, 2023

@programmerjake
I mean that formatting doesn't require escaping.
There's nothing wrong with having format!("{}", x); format!(#"{}"#, x); format!(r#"{}"#, x); ...
So we somehow need to support both the {} and #{} alternatives.

The easiest way is to have "alternative" macros, like format2!("##{}", "##", x); format2!("%{}", "%", x);, where the second argument is the prefix before the opening bracket.

In this way

format!("{}", x) == format2!("{}", "", x) == format2!("%{}", "%", x) == format2!("#{}", "#", x)

So format2! is an extended version of format!

@programmerjake
Member

imho format2 has unnecessary flexibility and user-unfriendliness due to having user-specified escapes that don't just use the surrounding # from the string. plus, format2 seems like a whole new thing that everything dealing with format args will have to support, whereas ##"##{x}"## works transparently without all the other format-args-taking macros needing to change. (e.g. needing print2/write2/println2/format_args2/assert_eq2/assert2/panic2/etc. and all the third-party libraries that use format_args in their macros...)

@steffahn
Member

steffahn commented Aug 16, 2023

Another thing that could be done, instead of format2, is to use currently-invalid format-string syntax to specify the prefix within the format string. E.g. to make some arbitrary choice, using {(…prefix…)} at the beginning of the format string.

assert_eq!(
    format!(#"{(#)}The natural numbers, denoted "N", are the set {#{}, #{}, ...}."#, 1, 2),
    r#"The natural numbers, denoted "N", are the set {1, 2, ...}."#,
);

This, too, could support custom prefixes, and would be independent of string literal syntax

assert_eq!(
    format!("{(%)}The natural numbers, denoted \"N\", are the set {%{}, %{}, ...}.", 1, 2),
    r#"The natural numbers, denoted "N", are the set {1, 2, ...}."#,
);

but it avoids the need for a new alternative macro, and keeps compatibility with format_args-taking macros. (At least most of them… I don’t know whether any format_args-taking/using macros prepend anything to the format string, or append or insert something somewhere while expecting parameters to be parsed as before.)
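As a sketch of how a format-string parser might pick up such a declaration, the function below splits an optional leading {(prefix)} off the format string. The name split_prefix_decl is hypothetical, and the {(…)} spelling is only the arbitrary choice made in the examples above, not an accepted syntax.

```rust
/// Hypothetical helper: if the format string starts with `{(prefix)}`,
/// returns (prefix, remainder); otherwise returns an empty prefix and the
/// whole string, which corresponds to today's plain `{}` placeholders.
fn split_prefix_decl(fmt: &str) -> (&str, &str) {
    if let Some(body) = fmt.strip_prefix("{(") {
        if let Some(end) = body.find(")}") {
            // Skip the two bytes of the closing ")}".
            return (&body[..end], &body[end + 2..]);
        }
    }
    ("", fmt)
}
```

This also makes the concern about prepending visible: a macro that puts text in front of the user's format string would hide the declaration from a parser written this way.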

@programmerjake
Member

iirc all of the assert* macros prepend stuff to the format string.

@steffahn
Member

Testing the macro expansion of assert and assert_eq in the playground looks to me like they prepend nothing. assert doesn't prepend anything to the output and assert_eq passes a format_args using the original format string to this function.

Clarify behavior of format placeholders
Specify behavior of concat on guarded strings
Further address the removal of raw strings
Add alternatives for `concat` and `#\`
@pitaj
Contributor Author

pitaj commented Aug 17, 2023

Updated the RFC to discuss the concerns brought up so far, including concat! and the removal of raw strings.

@quaternic

That helps the specific case if the code was written with that in mind, but most likely many literals would only have as many guards as are necessary, and no more. It's not hard to imagine e.g. clippy would point that out, (goes to check), and it turns out it does: needless_raw_string_hashes (There's some discussion on whether it should be a warning in PR#112373 adding an equivalent to the compiler, which was closed in favor of the clippy lint.)

  1. these cases are pretty rare
  2. that lint doesn't cover this form of string literal yet
  3. even if it did, that lint can easily be modified to accept the example I gave

That said, I'm not going to die on this hill. That could be left to a future RFC, but I'm going to leave it in the RFC unless there's heavy consensus otherwise.

@mattheww

I think this RFC should say more about the proposed behaviour in different editions.

#"..."# isn't one of the forms that was reserved in the 2021 edition by RFC 3101, so it currently lexes as three tokens.

That suggests that the new syntax shouldn't be introduced until the 2024 edition.

@pitaj
Contributor Author

pitaj commented Aug 17, 2023

That's a good point. My initial thoughts are:

  • reserve the syntax in edition 2024
  • stabilize it soon after that (leaving time for format macros to catch up)
  • remove raw strings in edition 2027

We could introduce a different prefix so this could be used on previous editions, but I don't think it's worth it.

not sure how to express this without some form of
parameterization on the number of prefix `#`s
@m-ou-se
Member

m-ou-se commented Aug 24, 2023

Note that format_args looks at the processed string to find placeholders. It doesn't look for literal { and } in the source code:

// This works fine.
let a = 1;
println!("\x7ba\x7d"); // This is just "{a}". This prints: 1

So if you really want to propose using #{} rather than {} depending on how the string looks in source code, you'll also have to think about cases like #"#\#x7b}"# and #"\#x23{}"# and so on.

@m-ou-se
Member

m-ou-se commented Aug 24, 2023

I think we should keep format_args!() independent of how the strings are represented in source code. Otherwise you get inconsistency like this:

let a = 1;
println!(  "{a}" ); // prints: 1
println!( #"{a}"#); // prints: {a}
println!(r#"{a}"#); // prints: 1

@m-ou-se
Member

m-ou-se commented Aug 24, 2023

(Tip: If you do end up changing the RFC to remove the format_args part, it might make sense to open a new PR to start a clean github thread, as this thread is pretty much entirely focussed on format_args.)

@pitaj
Contributor Author

pitaj commented Aug 24, 2023

@tmccombs

However, I'm less happy with the changes to the format macro. For one thing, it feels a little too magical. I don't like that the macros now depend on the precise syntax used to create the string literal.

What's magical about it? Macros have always worked at the syntax level, and even have to extract the contents of the string manually.

And even though the proc_macro crate doesn't currently expose the value of the string, what about when it does? It will also now need to expose information about how many #, if any, there were.

Is this supposed to be a difficult issue? Seems pretty simple to expose an API for it. Or they can just continue to look at the span like they already are.

I also think that escaping braces and backslash escaping should be orthogonal. As proposed, if you want to avoid having to escape quotes or backslashes in a format string that also means you have to prefix placeholders with one or more #, which may not be desirable.

Would keeping raw strings satisfy this desire? I don't see why we should add yet another syntax when we can use the guarding prefix for this. In my eyes, {} is a kind of contextual escape sequence, so it makes sense.

My vote would be to leave changes to the formatting macros off of this RFC, and maybe have a separate RFC for a mechanism to avoid having to escape braces in format strings.

If we introduce this new syntax without the changes to formatting macros, there's probably no way to achieve the same ergonomics. You'd either need a new set of macros, a special metadata placeholder, or wait for f-strings. And there would certainly be no way to get this syntax while maintaining backwards compatibility.

@pitaj
Contributor Author

pitaj commented Aug 24, 2023

@m-ou-se

So if you really want to propose using #{} rather than {} depending on how the string looks in source code, you'll also have to think about cases like #"#\#x7b}"# and #"\#x23{}"# and so on.

In the current proposal, all format_args needs from the source code is the prefix. It can use just the content of the string when actually processing the formatting.

I think we should keep format_args!() independent of how the strings are represented in source code. Otherwise you get inconsistency ...

I don't think that's inconsistent. r strings are just different. And if they are eventually removed as proposed, no problem at all. I think it's actually more consistent with how other escapes will work.


I'll let everyone know right now that I am highly invested in keeping the formatting changes in this RFC, because my perception of the ergonomic benefits in case of literal { (not uncommon in my experience) is higher than any proposed detriment I've seen so far.

If there was a way to split this up I would, but since the formatting behavior is tied to the guarding, it's imperative that they be introduced at the same time.

@m-ou-se
Member

m-ou-se commented Aug 24, 2023

So if you really want to propose using #{} rather than {} depending on how the string looks in source code, you'll also have to think about cases like #"#\#x7b}"# and #"\#x23{}"# and so on.

In the current proposal, all format_args needs from the source code is the prefix. It can use just the content of the string when actually processing the formatting.

So that would mean that #"\#x23{}"# should be treated as a format placeholder (identical to "{}")? 👀

@pitaj
Contributor Author

pitaj commented Aug 24, 2023

So that would mean that #"\#x23{}"# should be treated as a format placeholder (identical to "{}")? 👀

Yes, just like "\x7B\x7D" would be today.

@ijackson

Rust's literal syntax is already quite complicated[1] and it seems to me that this makes it even more complicated. [1] I'm currently trying to write some code that, for Reasons, needs to reimplement string quote matching. This is super hard right now (so many corner cases with suffixes and prefixes) and this proposal would make it harder.

@tmccombs

What's magical about it? Macros have always worked at the syntax level, and even have to extract the contents of the string manually.

Think about it from the perspective of a rust developer who isn't super familiar with how proc macros work. Having the type of string you use impacting how you type placeholders is a little surprising.

Is this supposed to be a difficult issue? Seems pretty simple to expose an API for it. Or they can just continue to look at the span like they already are.

It complicates the API and is another thing macro authors need to worry about. I wouldn't characterize it as a "difficult issue", but it is something to keep in mind.

Would keeping raw strings satisfy this desire? I don't see why we should add yet another syntax when we can use the guarding prefix for this.

Somewhat. But consider something like:

eprintln!(#"Error: "#{}"\#n#{}"#, error.msg, error.stack_trace);

If I use raw strings, I can't escape the newline, and the following line can't be indented. If I use an unguarded string, I have to escape the double quotes. If I use a guarded string, I have to put "#" in front of the placeholders. Perhaps this is a little contrived, but I don't think overly so.

In my eyes, {} is a kind of contextual escape sequence, so it makes sense.

Thinking about this some more, I think that my disagreement with this statement is the crux of my dislike for the format string changes proposed. I don't see the placeholder {} as a kind of escape sequence, but as something distinct and using the number of guard "#"s for two different purposes feels wrong to me. But that is just my personal opinion.

If we introduce this new syntax without the changes to formatting macros, there's probably no way to achieve the same ergonomics. You'd either need a new set of macros, a special metadata placeholder, or wait for f-strings. And there would certainly be no way to get this syntax while maintaining backwards compatibility.

That's only true after this feature has stabilized. But yes, I agree. If this were stabilized without the changes to format strings, that would limit the options for that going forward.

All that said, I don't absolutely love it, but I am not entirely opposed to the idea anymore.

Finally, I'd like to suggest an alternative to the #{} placeholder syntax: in a guarded string with n hash characters, a placeholder would have to use n+1 matched braces, for example #"{{}}"#, ##"{{{}}}"## etc. I'm not sure whether I like that or the original proposal better, but thought I'd throw it out there.

@Aloso

Aloso commented Aug 27, 2023

Somewhat. But consider something like:

eprintln!(#"Error: "#{}"\#n#{}"#, error.msg, error.stack_trace);

This can't be parsed. The #"Error: "#{}"\#n#{}"# literal would be tokenized like this:

#"Error: "#
{}
"\#n#{}"
#

So it is basically impossible to put a formatting placeholder in double quotes, unless you escape them with \#".

@pitaj
Contributor Author

pitaj commented Aug 27, 2023

So it is basically impossible to put a formatting placeholder in double quotes, unless you escape them with \#".

Which is another great argument to require a leading backslash, making it an actual escape sequence. No ambiguity.

@VitWW

VitWW commented Aug 27, 2023

Somewhat. But consider something like:

eprintln!(#"Error: "#{}"\#n#{}"#, error.msg, error.stack_trace);

That's why the sub-proposal from @steffahn and me, a custom formatting prefix that is independent of the string type, is a much better alternative:

format!("{(%)}The natural numbers, denoted \"N\", are the set {%{}, %{}, ...}.", 1, 2)

@tmccombs

How about something like this:

println!( {#"e is "{{}}""#}, e)

Where you can put braces around the format string to require additional braces for placeholders.

@pitaj pitaj marked this pull request as ready for review September 3, 2023 05:35
@pitaj
Contributor Author

pitaj commented Sep 3, 2023

Alright I've changed the placeholder syntax from #{} to {#} which solves the quotes-around-placeholder issue. I've also added a good amount of alternatives discussion that hopefully covers the bases.

Still not sure about the lexing notation but I figure this is ready for review.

@nikomatsakis
Contributor

I agree that this approach by Swift seems nice, but...

  • If we keep raw strings around, it just feels like too much -- I feel like we're strictly adding complexity and it doesn't seem necessary. I'd be happier with the RFC if we deprecated raw strings or removed them in a new edition (with cargo fix, obviously).
  • Alternatively, I'm unconvinced by \# making sense -- maybe we keep raw strings, as a way to skip escape sequences, but we have # as an orthogonal, independent thing for managing the end sequence.
  • Speaking from personal experience, I find the f"" and m"" use cases much more compelling. Escaping quotation marks definitely comes up, but indentation and format strings come up a lot.
  • On the other hand, @scottmcm pointed out that this approach eliminates the need to add "r" variants of everything (e.g., byte strings, etc); this even shows up with f, since one of the reasons I want f"fo" is to be able to get back a String, and I can imagine wanting to be able to get back a String that includes { without needing to escape it or double them. Interesting.

So TL;DR

  • I think we should either double down on this syntax and remove raw strings or
  • We should limit it to changing the terminator and not have it change escaping.

Both make some sense. I am somewhat surprising myself by leaning towards the latter.

All that said, I think that if we're going to go diving in this area, it would feel better to me if we decided on some kind of set of changes all at once so that we can say "Rust has new string literals that are way better" (these could be distinct RFCs, though, as smaller, targeted RFCs generally feel better).

@tmandry
Member

tmandry commented Oct 5, 2023

  • We should limit it to changing the terminator and not have it change escaping.

If we go this route I would much prefer an alternative syntax like """ (three or more double quotes to start the string, paired with the same number of double quotes to end it).

That being said, I would like there to be a solution for not having to double up {, and # gives us a clear way to do that. So maybe the hashes (or even some other sigil, like C#'s $ or @) are the best overall solution.

@nikomatsakis
Contributor

nikomatsakis commented Oct 5, 2023

I thought more about this. I was thinking that, if we really wanted to "dare to ask for more", at least from my perspective, I would want all multiline strings to strip indentation by default (with some way to opt out). I realize this is more about the "code strings" RFC, but I'm commenting here because opting out of that sort of thing seems like it may be a remaining role for raw strings. (I would also want rustfmt to indent inside strings by default, as a result.)

Obviously making this change would require an edition.

@Aloso

Aloso commented Oct 5, 2023

Alternatively, I'm unconvinced by \# making sense -- maybe we keep raw strings, as a way to skip escape sequences, but we have # as an orthogonal, independent thing for managing the end sequence.

That would also be my preferred solution. It just doesn't happen very often that you want \ to mean a literal backslash, but also need escape sequences in the same string. If you want raw strings (e.g. for regular expressions), r"" or r#""# already solve your use case. If you just don't want to escape ", a #""# string type that may still contain escape sequences like \n would be best.

I've been thinking about how formatting could be simplified. If f-strings are added, that would be an opportunity to change how formatting works.

For example, :? isn't needed if there's a wrapper to delegate the Debug impl to Display. So "{foo:?}" could be written as "{Dbg(foo)}". Arbitrary expressions in format strings are needed to make this ergonomic. {foo:#?} could become {Alt(Dbg(foo))}, and so on. This is longer, but easier to understand, since the format_args! syntax is quite complex. With f-strings, string interpolation could be re-imagined from the ground up. It could even support nested strings, e.g. f"foo: {foo.join(", ")}", like JavaScript's template strings.
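The wrapper idea already works in today's Rust, so it can be sketched directly. The names Dbg and Alt come from the paragraph above; their exact definitions here are an assumption about what was meant, not an existing std API.

```rust
use std::fmt;

/// Displays the wrapped value through its Debug impl, like `{:?}`.
struct Dbg<T>(T);

impl<T: fmt::Debug> fmt::Display for Dbg<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Reuse the caller's formatter so flags such as `#` are passed through.
        fmt::Debug::fmt(&self.0, f)
    }
}

/// Displays the wrapped value with the alternate flag set, like `{:#}`.
struct Alt<T>(T);

impl<T: fmt::Display> fmt::Display for Alt<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:#}", self.0)
    }
}
```

With these definitions, format!("{}", Dbg(foo)) matches format!("{:?}", foo), and format!("{}", Alt(Dbg(foo))) matches format!("{:#?}", foo), since the alternate flag set by Alt reaches the Debug impl through the shared formatter.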

If we want something like this, it indeed doesn't make sense to special-case escaping in format_args! strings, since they'll become obsolete at some point.

However, I don't think f-strings should return a String. I'd rather be able to pass f-strings around, interpolate them, and print them with zero allocations. So f-strings would replace format_args!, and other macros like format! and println! could be replaced with normal functions:

println(f"foo: {foo}");

@pitaj
Contributor Author

pitaj commented Oct 5, 2023

@nikomatsakis thanks for taking a look.

If we keep raw strings around, it just feels like too much -- I feel like we're strictly adding complexity and it doesn't seem necessary. I'd be happier with the RFC if we deprecated raw strings or removed them in a new edition (with cargo fix, obviously).
...
these could be distinct RFCs, though, as smaller, targeted RFCs generally feel better

The RFC did originally include removal of raw strings. Because it seemed a little controversial, and such an RFC could be independently done later, I left it for future possibilities (maybe I should strengthen the wording there). I would like these strings to replace raw strings altogether (thus "unified" in the title).

In my view, the primary feature of raw strings is that they are copy-pasteable. As long as you have the right amount of wrapping #s, you can copy and paste text from anywhere and not worry about needing to escape anything. Unified strings would have this capability, while also allowing the user an escape hatch for including escape sequences and format placeholders where needed. (best of both worlds)
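For reference, this is how the copy-paste property and its limitation look with the raw strings that exist today (the function names are only for illustration):

```rust
/// The same text written as a raw literal (pasted as-is) and as an escaped literal.
pub fn pasted_and_escaped() -> (&'static str, &'static str) {
    let raw = r#"say "hi" from C:\temp\new"#;
    let escaped = "say \"hi\" from C:\\temp\\new";
    (raw, escaped)
}

/// A body that contains `"#` needs one more `#` in the delimiter.
pub fn needs_two_hashes() -> &'static str {
    r##"the sequence "# appears here"##
}

/// Raw literals have no escape hatch: `\n` stays a backslash followed by `n`.
pub fn raw_newline_len() -> usize {
    r"\n".len()
}
```

The last function is the gap the escape hatch is meant to close: once a raw string needs a real newline or other escape, the whole literal has to be rewritten in the escaped form.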

Alternatively, I'm unconvinced by \# making sense -- maybe we keep raw strings, as a way to skip escape sequences, but we have # as an orthogonal, independent thing for managing the end sequence.

I think the best argument against this is that raw byte strings (br) and raw C strings (cr) are practically useless the moment you need to include non-UTF-8 bytes. Even with # for the end sequence, you have to fall back to non-raw literals with escaped backslashes and formatting placeholders. They may be less common than quotes, but that doesn't mean that they are uncommon or shouldn't get a similar treatment. And applying to all syntactical elements of string literals is more consistent than just quotes.
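A small illustration of that fallback using current byte string literals (function names are made up for the example):

```rust
/// Raw byte string: `\xFF` is four literal bytes, not the single byte 0xFF.
pub fn raw_bytes() -> &'static [u8] {
    br"C:\dir\xFF"
}

/// Non-raw byte string: 0xFF is expressible, but every literal backslash must be doubled.
pub fn escaped_bytes() -> &'static [u8] {
    b"C:\\dir\\\xFF"
}
```

The raw form cannot express the 0xFF byte at all, so including it means switching to the second form and escaping everything else by hand.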

this approach eliminates the need to add "r" variants of everything (e.g., byte strings, etc); this even shows up with f, since one of the reasons I want f"fo" is to be able to get back a String, and I can imagine wanting to be able to get back a String that includes { without needing to escape it or double them.

Yes I agree. The composability of this syntax is one of its strongest features.

I was thinking that, if we really wanted to "dare to ask for more", at least from my perspective, I would want all multiline strings to strip indentation by default (with some way to opt out).
...
Obviously making this change would require an edition.

I don't think this is possible as an edition change, since it would change the semantics of string literals silently between editions. We allow someone to change the edition in Cargo.toml (or the edition flag passed to rustc) and make the necessary changes manually - this would break that.
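To make the concern concrete, here is the current behaviour that stripping indentation by default would change (a sketch with illustrative function names):

```rust
/// Today a multi-line literal keeps the newline and the leading spaces of the next line.
pub fn indented() -> &'static str {
    "first
    second"
}

/// A trailing backslash removes the newline and the whitespace that follows it.
pub fn continued() -> &'static str {
    "first \
    second"
}
```

If indentation were stripped by default, the first literal would produce a different value for the same source text, with no compile error to flag it.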

@tmccombs

tmccombs commented Oct 5, 2023

I don't think this is possible as an edition change,

It could be done for just the new types of multi-line strings (so not the existing raw strings).

@pitaj
Contributor Author

pitaj commented Oct 5, 2023

@tmccombs but that wouldn't even require an edition change

@tmandry
Member

tmandry commented Oct 6, 2023

I don't think this is possible as an edition change, since it would change the semantics of string literals silently between editions. We allow someone to change the edition in Cargo.toml (or the edition flag passed to rustc) and make the necessary changes automatically - this would break that.

No, we require running cargo fix --edition to make edition-specific migrations, and this should be possible to do with that. Disjoint capture in closures is an example of a migration that changed the semantics of existing code and included a migration to restore the original semantics.
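As a rough sketch of that precedent: in the 2021 edition a closure that only uses pair.count captures just that field, while earlier editions capture all of pair. Where the difference could matter, the migration inserts a statement of the form shown below to restore whole-variable capture. The types and names here are made up for illustration.

```rust
pub struct Pair {
    pub name: String,
    pub count: u32,
}

/// Builds a closure that reads only `pair.count` but still owns all of `pair`.
pub fn make_counter(pair: Pair) -> impl FnOnce() -> u32 {
    move || {
        // Shape of the statement the migration inserts: mentioning `pair` as a
        // whole forces whole-struct capture, as in editions before 2021.
        let _ = &pair;
        pair.count
    }
}
```

The code compiles the same way in either edition, which is the point: the inserted line pins the old capture semantics so that switching editions does not silently change behaviour such as drop order.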

@pitaj
Contributor Author

pitaj commented Oct 6, 2023

I don't think this is possible as an edition change, since it would change the semantics of string literals silently between editions. We allow someone to change the edition in Cargo.toml (or the edition flag passed to rustc) and make the necessary changes automatically - this would break that.

No, we require running cargo fix --edition to make edition-specific migrations, and this should be possible to do with that. Disjoint capture in closures is an example of a migration that changed the semantics of existing code and included a migration to restore the original semantics.

Just realized I wrote "automatically" when I meant "manually". Not sure if that changes your reply.

No, we require running cargo fix --edition to make edition-specific migrations

I was not aware of that. Where is this requirement documented? It was my impression that changing semantics like this isn't allowed (except in rare cases) because changing the edition manually is explicitly supported.

@workingjubilee
Member

workingjubilee commented Oct 10, 2023

We try not to churn code, but the only ironclad rule so far is that there is only one stdlib, so that stdlib must work with all editions, which largely prevents adding edition-dependent hacks in std.

We could rewrite almost all of how Rust parses in the next edition, hypothetically.

- guarding prefix -> guarding sequence
- fixed a few places that were still written for the prior placeholder syntax
- invalid format string error message discussion
@pitaj
Contributor Author

pitaj commented Nov 12, 2023

Pushed a minor update that strengthens some of the wording around removing raw strings and fixes some places I missed when changing where the guarding goes in placeholders.

Labels
T-lang Relevant to the language team, which will review and decide on the RFC.