[SUGGESTION] String interpolation in Syntax 2 #159

Closed
msadeqhe opened this issue Dec 7, 2022 · 19 comments
@msadeqhe

msadeqhe commented Dec 7, 2022

Currently, ()$ is used for string interpolation in Syntax 2. However, {} is already used for formatting in C++20 (std::format), and without introducing any new symbols it could be extended to support expressions for string interpolation, similar to f-strings in Python. Besides C++20, {} is also used for string interpolation in C# and Python, two popular programming languages.

Also, ()$ is not an operator. It is an expression block inside a string, more like a language construct than a postfix unary operator, and it shouldn't be treated as one. It could therefore be $() instead of ()$, or just (), or better still just {}, because { and } appear in strings less often than ( and ).

@msadeqhe msadeqhe changed the title from "[SUGGESTION]" to "[SUGGESTION] String interpolation in Syntax 2" Dec 7, 2022
@switch-blade-stuff

switch-blade-stuff commented Dec 7, 2022

I do agree that {} would be better than ()$ for the following reasons:

  • Parity with the existing C++20 formatting syntax, making it easy to transfer existing code to Cpp2, and inter-operate with existing Cpp1 code.
  • Since the syntax is the same as std::format, formatted strings can be just replaced with a call to std::format, which would also be a more efficient implementation than string concats, since format uses a single buffer.
  • As msadeqhe said, () are much more likely to appear (even with the following $) than {}. Just as an example: a string that contains a price calculation formula such as final price being (price * tax_rate * discount)$. Here, $ denotes the currency, while (...) is the formula.
  • Since {} are symmetric, they are easier to work with on a subconscious level. To me, at least, symmetric brackets look much less noisy than the asymmetric syntax. { and } are also close together on the keyboard.

Additionally, I think that string formatting should be opt-in, for example via a string literal prefix such as F"{capture}":

  1. Formatting is not zero-cost, it may require string construction and copies, and as such should not be the default when a zero-cost option is available.
  2. String literals should by default be compile-time. String formatting is not compile-time (in the current implementation it generates a std::string) and as such cannot be used in constexpr code and requires memory allocation for every literal.

IMO, string literals should always be a compile-time constant, and never a dynamic type by default. As such, formatting should be opt-in to indicate the extra cost of formatting, and to indicate that the resulting literal types are different.
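The type and cost difference described above can be sketched in plain Cpp1. This is only an illustration of the argument, not cppfront output; the names `greeting` and `greet` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// A plain literal viewed through std::string_view is a compile-time constant:
// no allocation, no copy, and usable in constant expressions.
constexpr std::string_view greeting = "Hello";
static_assert(greeting.size() == 5, "evaluated at compile time");

// Formatting assembles a std::string at run time (hypothetical helper).
// The result owns its storage and has a different type than the literal.
std::string greet(std::string_view name) {
    std::string out(greeting);
    out += ' ';
    out += name;  // std::string accepts string_view operands since C++17
    assert(out.size() == greeting.size() + 1 + name.size());  // sanity check
    return out;
}
```

Calling `greet("world")` yields a `std::string` holding `Hello world`, whereas `greeting` never leaves static storage.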

@hsutter
Owner

hsutter commented Dec 9, 2022

The opt-in (e.g., F prefix) part is related to #45.

@hsutter
Owner

hsutter commented Dec 9, 2022

Thanks! I'm open to changing this syntax, but note that it's essential not to look at string interpolation in isolation. I view it as one place we do "capture" which to me means "take an expression in context and store a copy of its value for use later":

I was writing such a long answer that I turned it into a Design note: Capture.

I'll keep this open for now because of the suggestion of having a string literal prefix to opt into interpolation. However, I think that suggestion is also related to raw string literals, which are currently an opt-out in the other direction (strings already allow special handling, and if you want to disable it you opt out by saying "no, instead of the default I want a non-preprocessed raw string here", whereas this suggestion is to do the opposite for interpolation and opt into enabling it by saying "no, instead of the default I want an interpolated non-raw string here").

@hsutter hsutter self-assigned this Dec 9, 2022
@gregmarr
Contributor

gregmarr commented Dec 9, 2022

However, I think that suggestion is also related to raw string literals, which are currently an opt-out in the other direction (strings already allow special handling, and if you want to disable it you opt out by saying "no, instead of the default I want a non-preprocessed raw string here", whereas this suggestion is to do the opposite for interpolation and opt into enabling it by saying "no, instead of the default I want an interpolated non-raw string here").

This seems a little weird to me, but I'm having a hard time describing exactly why that is. I guess it's because of years of history for us veterans that normal C++1 string literals just by default have compile-time processing for some escape sequences, simple termination at the first ", and no embedded newlines. It's thus an opt-in with extra syntax to get a more complicated user-defined termination condition, and embedded newlines. It's a bit of an effort for us to view it as "everything from the start sequence to the end sequence is raw".

My thoughts right now are that one way to view the opt-in request for variable interpretation in string literals is that it should be "don't pay for what you don't use", and thus you need to opt into the extra expense of runtime string interpretation. I'm not sure which method is best for that. Perhaps it's just a simple conversion of the C# syntax to a postfix syntax: ""$ makes the string an interpreted string, and inside that string, {} is an interpretation marker, similar to std::format.

On the other hand, maybe selecting string literal behavior is one place where a prefix is required because it affects parsing. If you need a R"foo( as a prefix to be able to know when to stop parsing the raw literal, then maybe $"" for "this is a string with runtime captures" would affect parsing too. Could it support something like this:

str := $"This is an interpreted string with an embedded double quote in the replacement: {foo.replace('"', ' ')}";

The $ prefix then means that any { in the string needs to balanced by a } before the capture is completed, and normal string processing resumes.
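The brace-balancing rule described above could look roughly like this in a scanner. This is a sketch under assumptions, not cppfront's lexer: `find_capture_end` is a hypothetical helper, and it only understands single-quoted and double-quoted runs with backslash escapes inside the capture.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: given the index of a '{' inside the body of a $"..."
// literal, return the index of the matching '}' (or npos if unbalanced).
// Quoted characters and strings inside the capture are skipped, so a '"' or
// '}' inside them ends neither the capture nor the string.
std::size_t find_capture_end(std::string_view s, std::size_t open) {
    assert(open < s.size() && s[open] == '{');
    int depth = 0;
    for (std::size_t i = open; i < s.size(); ++i) {
        const char c = s[i];
        if (c == '\'' || c == '"') {
            // Skip a quoted run, honoring backslash escapes.
            const char quote = c;
            for (++i; i < s.size() && s[i] != quote; ++i) {
                if (s[i] == '\\') ++i;
            }
            if (i >= s.size()) return std::string_view::npos;
        } else if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (--depth == 0) return i;
        }
    }
    return std::string_view::npos;
}
```

Once the matching `}` is found, ordinary string processing would resume at the next character, which is what lets the embedded `'"'` in the example above pass through without terminating the literal.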

@switch-blade-stuff

switch-blade-stuff commented Dec 10, 2022

IMO, Cpp2 should still follow the "don't pay for what you don't use" philosophy.
Using std::string is much less efficient than a normal literal, since memory allocation and string copying are involved. It also cannot be used in a constexpr context, so doing something like compile-time string parsing is out of the question. What's more, most of the time you use strings, you do not perform any kind of formatting anyway.

To me, it does not really make sense to make the expensive, non-compile-time, less-used option the default, and the more-used, compile-time, zero-cost option the opt-in. Why are we forcing the user to opt out of a feature to get the more convenient, non-allocating, non-copying version?

Additionally, this would require the user to worry about the lifetime of a literal (i.e. you cannot treat literals as static constants), making it hard to use with old compile-time APIs that expect std::string_view or const char*, since you need to keep the string alive to avoid dangling references.

If we want to prevent the use of const char* for strings, we can make string literals produce a std::string_view (and I thought this is what we already encourage in Cpp1 world, no?). Personally, this is already how I use string literals in Cpp1.

As for {}, like @gregmarr mentioned, it would also bring parity with other languages that use this formatting syntax like C# and Python in addition to being less common in text, easier to use, and allowing for std::format backend.

@hsutter
Owner

hsutter commented Dec 13, 2022

I've been mulling this feedback over, and here's where I landed as a path to pursue for now:

one way to view the opt-in request for variable interpretation in string literals is that it should be "don't pay for what you don't use",

Right, and the current implementation does follow the zero-overhead "don't pay for what you don't use" principle: If you don't write an interpolation, there is no overhead. That is, "xyzzy" is emitted as "xyzzy", and "xy(x)$zzy" is emitted as "xy" + cpp2::to_string(x) + "zzy".
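To make the two emitted forms concrete, here is a compilable Cpp1 sketch. `sketch::to_string` is a stand-in I am assuming for `cpp2::to_string` (the real helper ships with cppfront and handles more types), and the two wrapper functions are hypothetical.

```cpp
#include <cassert>
#include <string>

namespace sketch {
// Stand-in for cpp2::to_string (assumption: the real one covers more types).
inline std::string to_string(int v) { return std::to_string(v); }
inline std::string to_string(std::string const& s) { return s; }
}  // namespace sketch

// Cpp2 "xyzzy" is emitted unchanged: still a plain literal, no overhead.
inline const char* no_interpolation() { return "xyzzy"; }

// Cpp2 "xy(x)$zzy" is emitted as a concatenation, which yields a std::string.
inline std::string with_interpolation(int x) {
    return "xy" + sketch::to_string(x) + "zzy";
}
```

Only the second form pays for string construction, which is the sense in which the design is described as zero-overhead when no interpolation is written.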

The main advantage I can see for allowing a prefix or suffix outside the string to enable interpolation is to make it a bit more visible, but I do think the (var)$ in the body is pretty visible.

And I worry that if we allow interpolation only inside a $"string" or a F"string", we'd really be (needlessly?) making the default "string" a "semi-raw" string... halfway between a string that does interpolation and the other special-character processing for tabs and newlines, and one that does none of it.

I'll keep thinking about this, and I'll especially be on the lookout for experience with the current design as I and others write more code with it. As always with experiments, the decision to pursue a particular design path is always tentative, to see where it leads and be open to new information discovered by trying it out; if the path turns out to lead to issues, we backtrack with that new information and try a different branch in the design space tree walk (i.e., the design space tree walk is typically depth-first). For now, I'll keep pressing down this path to see where it leads as a reasonable direction to experiment with, so I'll close this for now and we can reopen it when there's new data. Thanks for understanding, and again for the input!

@hsutter hsutter closed this as completed Dec 13, 2022
@gregmarr
Contributor

Right, and the current implementation does follow the zero-overhead "don't pay for what you don't use" principle: If you don't write an interpolation, there is no overhead. That is, "xyzzy" is emitted as "xyzzy", and "xy(x)$zzy" is emitted as "xy" + cpp2::to_string(x) + "zzy".

There is a minor overhead on the compile side of looking for replacements in the string. That's probably small enough not to worry about, but I guess we'll see.

The main advantage I can see for allowing a prefix or suffix outside the string to enable interpolation is to make it a bit more visible, but I do think the (var)$ in the body is pretty visible.

I'm still trying to wrap my head around differences in the types between "This is a string with no replacements." and "This is a string with (replacements)$". The first is a compile time constant, and the second is not. I'm not sure if the contents of the string itself is enough of an indicator of the type and performance difference. I guess we'll see as the project goes along.

@hsutter
Owner

hsutter commented Dec 13, 2022

I'm still trying to wrap my head around differences in the types between "This is a string with no replacements." and "This is a string with (replacements)$". The first is a compile time constant, and the second is not.

That reminds me of something I was thinking but forgot to write down: We already have a similar situation with capture in C++ today with lambdas, where a no-capture lambda []{ f(); } can be assigned to a pointer to function, but a capturing lambda [x]{ f(x); } can't. I haven't heard about it being a pain point...? And any mistake will be caught at latest at compile time, and typically in the IDE/code-editor with a red squiggle (shift-left FTW!)... my favorite IDE, Godbolt CE, red-squiggles it nicely.
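The lambda analogy can be checked directly in Cpp1. A minimal sketch; `f`, `call_through_pointer`, and `call_capturing` are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>

inline int f(int x) { return x + 1; }

inline int call_through_pointer() {
    // No capture: the closure converts implicitly to a plain function pointer.
    int (*fp)() = [] { return f(1); };
    return fp();
}

inline int call_capturing(int x) {
    // With a capture the closure carries state, so no such conversion exists.
    // Writing `int (*fp)() = [x] { return f(x); };` would be a compile error.
    auto closure = [x] { return f(x); };
    static_assert(!std::is_convertible_v<decltype(closure), int (*)()>,
                  "capturing lambdas do not convert to function pointers");
    return closure();
}
```

As with the two kinds of string, the difference is in the type and is diagnosed at compile time rather than at run time.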

Does that help?

@gregmarr
Contributor

gregmarr commented Dec 13, 2022

It's simply that whether or not a "string literal" is a compile-time constant string or a runtime-assembled variable string is based on which characters are used in the string, which has never been the case in the past. It's not necessarily bad, but it's definitely different.

I haven't heard about it being a pain point...? And any mistake will be caught at latest at compile time, and typically in the IDE/code-editor with a red squiggle

It's definitely a pain point based on questions I've seen about how to make it work. However, it's not a security or correctness or even really an unexpected error pain point. Trying to assign a capturing lambda to a function pointer is an error that will be caught every time.

I think as I've been writing this I've narrowed down my primary concerns in this area to ownership and performance issues. As long as these runtime generated strings can only be assigned to something that owns them and is going to destroy them properly when they go out of scope, such as a std::string, then it's not a safety issue, and just perhaps a performance issue. As long as both s2 and s3 below fail to compile, then the safety issue should be fine.

main: () -> int = {
    world : std::string_view = "world";
    s1 : std::string_view = "Hello (world) $";
    s2 : std::string_view = "Hello (world)$";
    s3 : std::string_view = "Hello " + world;
}

FYI, I noticed while compiling this that currently s2 has an unneeded string combine here because it tacks on the empty string at the end. That can be eliminated.

std::string_view s2 { "Hello " + cpp2::to_string(world) + "" }; 
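For context on why the ownership concern matters: in plain Cpp1, a std::string converts implicitly to std::string_view, so an initialization shaped like the emitted line above is accepted by the compiler even though the temporary string is destroyed at the end of the full expression. Whether cppfront adds a check on top of this is not stated here. The sketch below only demonstrates the conversion and the safe owning pattern; `make_message` and `message_length` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>

// The implicit conversion that makes a dangling view possible in Cpp1.
static_assert(std::is_convertible_v<std::string, std::string_view>,
              "std::string converts implicitly to std::string_view");

// Safe pattern: a std::string owns the runtime-assembled text ...
inline std::string make_message(std::string_view world) {
    return "Hello " + std::string(world);
}

// ... and views are created only from an owner that outlives them.
inline std::size_t message_length(std::string const& owner) {
    std::string_view view = owner;  // fine: owner outlives view
    return view.size();
}
```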

@switch-blade-stuff

today with lambdas, where a no-capture lambda []{ f(); } can be assigned to a pointer to function, but a capturing lambda [x]{ f(x); } can't. I haven't heard about it being a pain point...?

@hsutter I think that it isn't an issue with lambdas since the captures are placed at the start (i.e. they are a prefix), and as such are easy to see, so you do not need to examine the body of a lambda to see if it captures anything.

Examining strings for interpolation syntax may require more effort than that, especially for lengthy strings. Considering that parentheses are fairly common in text, confusing () for ()$ is fairly easy, aside from the fact that ()$ might happen in text naturally.

But I do agree, it would be better to experiment with both the implicit and the explicit approaches to see if one is better than the other.

If we do pursue the implicit approach though, I would advocate to look at the {} syntax even more, as braces appear in normal text very rarely and as such stand out much more than parentheses and will be less likely to cause accidental interpolation (and are faster to type than ()$!).

hsutter added a commit that referenced this issue Dec 15, 2022
…comment in #159

Don't add `+ ""` or `"" + ` when interpolations are at the very beginning or end of the string, or adjacent.
@hsutter
Owner

hsutter commented Dec 15, 2022

@gregmarr Yes, I'd long noticed the + "" and it just didn't bug me enough to fix it, but now that you mention it of course there's an easy find/replace fix... done, thanks!

@switch-blade-stuff Understood, but again when considering an alternate interpolation syntax, just remind people to also look at the other three places where capture happens (expression-scope functions, postconditions, and in the future source generation) and make sure it works well for all of them, not just strings. :) String interpolation shouldn't be that special, I think. Thanks!

@gregmarr
Contributor

You're welcome. However, you need some whitespace before your \"\" + match attempt assuming we are still using \" to embed a ":

s := "\"(world)$"

results in

"\"" + cpp2::to_string(world)

which will end up being replaced to

"\ cpp2::to_string(world)
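One way to sidestep this class of bug is to never emit empty pieces in the first place, instead of removing `"" + ` textually afterwards. The sketch below is my own illustration and not how cppfront is implemented: `lower_interpolation` is a hypothetical function that takes the body of a Cpp2 string literal (without the outer quotes) and returns the Cpp1 expression text, assuming capture expressions have balanced parentheses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

std::string lower_interpolation(std::string_view body) {
    std::vector<std::string> parts;
    std::string literal;  // literal text collected so far
    auto flush = [&] {
        // Emit a quoted piece only if it is non-empty: no `+ ""` ever appears.
        if (!literal.empty()) {
            parts.push_back('"' + literal + '"');
            literal.clear();
        }
    };
    std::size_t i = 0;
    while (i < body.size()) {
        if (body[i] == '\\' && i + 1 < body.size()) {
            // Copy an escape such as \" as a unit so it stays inside its piece.
            literal += body[i];
            literal += body[i + 1];
            i += 2;
            continue;
        }
        if (body[i] == '(') {
            // Find the matching ')' and check whether '$' follows it.
            std::size_t j = i;
            int depth = 0;
            for (; j < body.size(); ++j) {
                if (body[j] == '(') ++depth;
                else if (body[j] == ')' && --depth == 0) break;
            }
            if (j + 1 < body.size() && body[j + 1] == '$') {
                flush();
                parts.push_back("cpp2::to_string(" +
                                std::string(body.substr(i + 1, j - i - 1)) + ")");
                i = j + 2;
                continue;
            }
        }
        literal += body[i++];
    }
    flush();
    if (parts.empty()) return "\"\"";
    std::string out = parts[0];
    for (std::size_t k = 1; k < parts.size(); ++k) out += " + " + parts[k];
    return out;
}
```

Because each quoted piece is built as a unit, the `\"` in the example above stays inside its own literal and there is nothing left to pattern-match afterwards.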

look at the other three places where capture happens (expression-scope functions, postconditions, and in the future source generation) and make sure it works well for all of them, not just strings. :) String interpolation shouldn't be that special, I think. Thanks!

I had an interesting thought. What if an interpolated string was a shortcut for std::format, replacing the arg id with the actual expression, so you could apply format-spec.

    // Cpp2 (proposed)
    rgb := 0;
    s := "The color is {rgb:#06x}."$;

    // Cpp1 equivalent
    auto rgb = 0;
    auto s = std::format("The color is {:#06x}.", rgb);

Taking some examples from cppreference

    pi := 3.14f;
    s1 := "{pi:10f}"$; // s1 = "  3.140000" (width = 10)
    s3 := "{pi:.5f}"$; // s3 = "3.14000" (precision = 5)

@switch-blade-stuff

switch-blade-stuff commented Dec 15, 2022

What if an interpolated string was a shortcut for std::format, replacing the arg id with the actual expression, so you could apply format-spec.

This is the same idea I had earlier, and imo it would be a good thing to use std::format since modern C++ would be using it for formatting (hopefully), so it would give parity with Cpp1, and would be more efficient than string concatenation.

@gregmarr
Contributor

This is the same idea I had earlier

So it is. Maybe that's why I had been thinking about it. :)

@hsutter
Owner

hsutter commented Dec 15, 2022

@gregmarr

s := "\"(world)$"

Fixed, thanks!

@hsutter
Owner

hsutter commented Dec 15, 2022

@switch-blade-stuff Hmm, as long as the format-spec was a suffix and only legal in a string interpolation (not other captures, and this way capture would still be the same everywhere just a subset would be legal in the other contexts) that could be interesting. I'll put it in the queue of future things to look at. Thanks!

Also to xref: Related to #133 and #134

Azmah-Bad pushed a commit to Azmah-Bad/cppfront that referenced this issue Feb 24, 2023
…comment in hsutter#159

Don't add `+ ""` or `"" + ` when interpolations are at the very beginning or end of the string, or adjacent.
@svew

svew commented Jun 28, 2023

Apologies if this is obvious, but is string interpolation even capturing?

My understanding is that string interpolation occurs at the definition, not at some later point, so no "capture" should be occurring because the inputted values are used immediately. It's like a special function that takes a variadic number of arguments, not a lambda where you need to specify if you want to copy a value as it is at definition rather than what it is when called.

If my understanding is incorrect and it is indeed capturing somehow, I'd also propose then that there's technically two concepts here that don't need to be bundled into one:

  1. The declaration of "I want to embed some value into my string"
  2. Where that value comes from

I think that the argument around whether (name)$ formatting is good for string interpolation falls under (1), whereas (2) would be the question of capturing. In short, my syntax suggestion would be:

auto mystr = "My name is {name$}, and I'm {10} years old!";

Where we use the {} brackets to declare "I want to embed some value in my string", and then use a postfix $ to say we want to capture that variable in our string interpolation. However, if we're passing in a constant, we wouldn't add a capturing declaration, since we wouldn't do so within a lambda context.

I'm also personally much more inclined towards using braces than parens; it's the standard that C++1 and all other modern languages have adhered to. If we're talking about metrics of "reducing the amount that programmers have to learn", then by using braces we have one less quirk for programmers moving to or from C++2 to remember. String interpolation is already a special-use syntax.

@JohelEGP
Contributor

@hsutter
Owner

hsutter commented Jun 28, 2023

Quick ack:

I try to explain this in the ~3 minutes of the talk starting at 1:30:53 -- maybe the way I say it there could be helpful?

is string interpolation even capturing?

I think so, it's capturing by value where the string/lambda is declared and then using the value later. The main difference is that as a convenience it converts the captured thing to a string as needed (so it's like capturing an integer i via to_string(i)$ without having to write the to_string() part).

Example:

main: () = {
    name: std::string;

    // lambda capture of (name)$
    name = "xyzzy";
    lambda := :()= std::cout << "Hello " << (name)$ << "\n";  // capture by value at definition
    name = "plugh";
    lambda();  // use captured value: prints "xyzzy"

    // string capture of (name)$
    name = "xyzzy";
    message := "Hello (name)$\n";  // capture by value at definition
    name = "plugh";
    std::cout << message;          // use captured value: prints "xyzzy"
}
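The same behavior can be reproduced in Cpp1 as a rough analogue of the example above. This is a sketch rather than cppfront's actual output, and `capture_demo` is a hypothetical name: the lambda copies `name` when it is defined, and the string reads `name` when it is defined, so a later assignment affects neither.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Returns {what the lambda yields, what the string holds} after name changes.
inline std::pair<std::string, std::string> capture_demo() {
    std::string name = "xyzzy";
    auto lambda = [name] { return "Hello " + name + "\n"; };  // copy made here
    std::string message = "Hello " + name + "\n";             // value read here
    name = "plugh";  // a later change affects neither result
    return {lambda(), message};
}
```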

My understanding is that string interpolation occurs at the [string's] definition, not at some later point,

Right. Same as lambda capture occurs at the lambda's definition.

so no "capture" should be occurring because the inputted values are used immediately.

In string capture the values are read immediately where the string is defined, and captured (copies are stored) for later use when the string is used.

not a lambda where you need to specify if you want to copy a value as it is at definition rather than what it is when called.

In lambda capture the values are read immediately where the lambda is defined, and captured (copies are stored) for later use when the lambda is used.

Does that help?
