Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[suggestion] Compiler should call ToString on non-string types used with string.operator+ #10966

Open
jamesqo opened this issue Apr 29, 2016 · 16 comments
Labels
Area-Compilers Code Gen Quality Room for improvement in the quality of the compiler's generated code Feature - Tuples Tuples Feature Request help wanted The issue is "up for grabs" - add a comment if you are interested in working on it
Milestone

Comments

@jamesqo
Copy link
Contributor

jamesqo commented Apr 29, 2016

Background

There's a lot of code out there that uses string concatenation to append arbitrary objects to strings, like so:

public class MutablePerson
{
    public string Name { get; set; }
    public int Age { get; set; }

    public override void ToString()
    {
        return this.Name + " is " + this.Age + " years old.";
    }
}

At compile-time, Roslyn tranforms calls to string.operator+ to string.Concat to avoid allocating intermediary strings, so what we end up with is this:

public override void ToString()
{
    return string.Concat(this.Name, " is ", this.Age, " years old.");
}

Unfortunately, string.Concat only accepts object parameters, meaning that Age (which is an int) is implicitly boxed when it is passed into the function. Of course, boxing is no good for performance, so if the developer wants to avoid this, he/she has to explicitly call ToString on the variable, e.g.

return this.Name + " is " + this.Age.ToString() + " years old.";

and the compiler will do the right thing and omit the box instruction.

An example of where this really became a pain was dotnet/corefx#8025, where a bunch of ToString calls had to be added to avoid boxing on each item.

Proposal

Instead of calling the string.Concat overloads that accept objects, the compiler should exclusively generate code targeting the ones accepting a string. For value types, the code generated would coerce the variable into a string via ToString:

"foo" + someStruct + "bar"; // what was written
string.Concat("foo", someStruct.ToString(), "bar"); // what was emitted

Existing strings could simply be concatenated directly, without a ToString call (same behavior as today):

"quux" + someString; // what was written
string.Concat("quux", someString); // what was emitted

And for other reference types, a ToString would be emitted with an extra null check:

"baz" + someClass; // what was written
string.Concat("baz", someClass?.ToString()); // what was emitted
@svick
Copy link
Contributor

svick commented Apr 29, 2016

Here is a similar issue about avoiding allocations for string.Format(): dotnet/corefx#1514. (Though string.Format() has different requirements, since it supports custom format strings.)

I think the solution proposed there, adding generic overloads, would help here too, even if only for cases where the number of concatenated objects is small. It could potentially be even more efficient, for example it could convert some common types (like int) to string without allocating anything, if the result was written directly to the resulting string, instead of allocating an intermediate string for the int (see dotnet/corefx#4632).

@miloush
Copy link

miloush commented Apr 29, 2016

Be careful here with operators:

class Test
{
    public static implicit operator string(Test t)
    {
        return "op";
    }

    public override string ToString()
    {
        return "to";
    }
}

static void Main(string[] args)
{
    Test someClass = null;
    string a = "baz" + someClass; // what was written
    string b = string.Concat("baz", someClass?.ToString()); // what was emitted

    someClass = new Test();
    string c = "baz" + someClass; // what was written
    string d = string.Concat("baz", someClass?.ToString()); // what was emitted
}
  • a = bazop
  • b = baz
  • c = bazop
  • d = bazto

@jamesqo
Copy link
Contributor Author

jamesqo commented Apr 29, 2016

@svick

It could potentially be even more efficient, for example it could convert some common types (like int ) to string without allocating anything

Unfortunately, that optimization can only applied to certain types (specifically these) since the string representation of numbers is dependent on some things at runtime (e.g. CultureInfo), so unfortunately can't be optimized away.

I think the solution proposed there, adding generic overloads, would help here too, even if only for cases where the number of concatenated objects is small.

I understand where you're coming from, but why should we need generic overloads in this case? If something like this:

int age = 17;
string message = "You are " + age + " years old";

was transformed to this:

int age = 17;
string message = string.Concat("You are ", age.ToString(), " years old");

then we wouldn't even need to avoid boxing; it would call the string.Concat(string, string, string) overload without having to convert age to an object.

@jamesqo
Copy link
Contributor Author

jamesqo commented Apr 29, 2016

@miloush Good point; isn't that true for the current way it's handled in the compiler as well, though? I think detection of operator overloading comes before the string.Concat rewriting.

@AdamSpeight2008
Copy link
Contributor

String.Format internally is a call to StringBuilder.AppendFormat()

@svick
Copy link
Contributor

svick commented Apr 29, 2016

@jamesqo I didn't mean that Roslyn itself could convert the int to string at compile time, I meant that string.Concat<string, int, string> could somehow find out how many characters the int needs, allocate a string of the right length and then fill the values in, all without allocating any intermediary objects.

So, your solution saves boxing the int, mine would also save allocating the string just for the int.

But as long as that doesn't exist, calling ToString() is a good optimization.

@jamesqo
Copy link
Contributor Author

jamesqo commented Apr 30, 2016

@svick

I meant that string.Concat<string, int, string> could somehow find out how many characters the int needs, allocate a string of the right length and then fill the values in, all without allocating any intermediary objects.

Ah, I see what you mean now. That would probably be tricky with how it's currently implemented, though; all of the ToString methods for numbers are implemented natively in coreclr, and we'd have to refactor them out into two different methods-- one returning the count of chars needed by the number, and one that writes them out to a char buffer. Additionally, we'd need to special case for each type, e.g. add a bunch of if (typeof(T) == typeof(Foo)) checks to string.Concat.

@AdamSpeight2008
Copy link
Contributor

@jamesqo once we can pattern-match on type a lot of these if (typeof(T) == typeof(Foo)) can "simplified"

@jamesqo
Copy link
Contributor Author

jamesqo commented Apr 30, 2016

@AdamSpeight2008 Yeah, but the point still stands as it'd be very tricky to get the theoretical length of what ToString for the object would be and write it out without allocating an intermediary string, and the optimization would only apply to certain types/a limited number of arguments. Maybe in the future, if we get variadic template arguments, like C++ has:

template <typename... Args>
void myfun(Args... params) { }

then yeah, maybe it might be worth it. For now, I think this is the best we can do though since it's a pure win over how it's currently implemented.

@AdamSpeight2008
Copy link
Contributor

We could get the maximum .ToString() of most of core types. Eg .MaxValue.ToString.Length + 1

ApproxMaxCharLength<T>( obj : T ) : Int
{ 
  return match( typeof( T ) ) with
  {
    | string -> ((string)obj).Length; 
    |   bool ->  5; //  false
    |   byte ->  3; //  256
    |  sbyte ->  4; // -128
    |   char ->  1; // 
    | uint16 ->  6; //  65535
    |  int16 ->  6; // -32768
    | uint32 -> 11; // -2147483648 
    |  int32 -> 10; //  4294967295
    | uint64 -> 20; //  18446744073709551615
    |  int64 -> 20; // –9223372036854775808
    |    ... ->  0;
    // not done the floating point ones.
  }
}  

An initial first approximation of the string length, could be calculated thus.

var intialSize = args.Sum( ApproxMaxCharLength ) + TextOfFormatString.Length; 

I think this would be an over allocation. Note this is exclude any variation caused my the culture and additional alignment and format specifiers on the arg hole eg { x , -16 : X4 }

At which point (don't done any deep analysis) would the extra processing to get a good estimate of the size of string to allocation, outweighs the processing cost of just blindly done the building of the string.

@svick
Copy link
Contributor

svick commented May 2, 2016

@AdamSpeight2008 I wonder how useful would approximating the length be.

If you can figure out all the lengths accurately, you only need one allocation. If you approximate them, you need two: the initial one based on the approximation, and then the final one of the right length, where you copy the characters from the initial one.

Getting accurate lengths would be relatively expensive (you basically run the same algorithm as ToString(), only instead of writing chars, you increment a counter), but it might still be worth it to save the allocation.

On the other hand that temporary string could be allocated on the stack (as long as it's relatively short), which could make the approximate approach worth it again.

@AdamSpeight2008
Copy link
Contributor

AdamSpeight2008 commented May 2, 2016

@svick
If the approximate length is greater or equal, I think it be a win as we don't have to reallocate a new array. In the case it is greater, it would have to reallocate and copy to new array, just like today.

@AdamSpeight2008
Copy link
Contributor

AdamSpeight2008 commented May 2, 2016

@svick
The approximation routine can also check to see it is likely to throw an error, thus not allocating anything. Also if the format string doesn't contain any argholes eg{0} then doesn't allocate, just return the original format string. Wrote a little approximation routine, based on the StringBuilder.AppendFormat implementation, that excludes the IFormatable aspect.

Args:= {"ABCD", 12}
FS := "a:= {0} b:= {1}"
Approx: 24 chars
Actual: 15 chars
Args:= {"ABCD", 12}
FS := "a:= {0,-16} b:= {1,4,X4}"
Approx:= 46 chars
Actual:= 36 chars

Altering the algorithm to pick the larger of the alignment or maxlength(args(Index)) results in

Args:= {"ABCD", 12}
FS := "a:= {0} b:= {1}"
Approx: 24 chars
Actual: 15 chars
Args:= {"ABCD", 12}
FS := "a:= {0,-16} b:= {1,4,X4}"
Approx:= 38 chars
Actual:= 36 chars

https://gist.github.com/AdamSpeight2008/ef6834483148afcac59234eef36ef36f

@gafter
Copy link
Member

gafter commented Apr 21, 2017

@jaredpar I don't know why you labeled this Language Design. It appears to be a suggestion for improved IL generation in the compiler (i.e. an "optimization").

@gafter gafter added Area-Compilers Concept-Code Quality Improvement Feature Request Feature - Tuples Tuples help wanted The issue is "up for grabs" - add a comment if you are interested in working on it labels Apr 21, 2017
@gafter gafter added this to the Unknown milestone Apr 21, 2017
@jaredpar
Copy link
Member

@gafter added language design because the spec is specific about how the conversion should work here:

Otherwise, any non-string argument is converted to its string representation by invoking the virtual ToString method inherited from type object

My original reading made me worry this is potential violating the spec because moving from a virtual call to non-virtual (struct case). Thinking through it though I don't think it's actually a problem.

But there are two other problems that need to be addressed.

Mutations

The optimization cannot be to someStruct.ToString because that is a potential compat break. Have to assume that ToString mutates and hence must do it on a copy so that the mutation remains unobservable.

Side Effects

Consider this code:

"hello" + e1 + (new Widget()) + "world"
// translates to 
string.Concat("hello", e1, e2, "world")

This cannot be optimized to the following:

string.Concat("hello", e1.ToString(), (new Widget()).ToString(), "world")

Imagine the case where e1.ToString() throws an exception. This would be observable as the new Widget() code no longer runs as a part of the concat operation. Before it ran because the ToString occurred inside string.Concat.

In order to do this optimization side effects need to be taken into account. It has roughly the problem space as await spilling does. In the face of side effects like this I'm not sure the optimization is worth the complexity. May have to limit it to cases where there are no side effects.

@gafter gafter added Code Gen Quality Room for improvement in the quality of the compiler's generated code and removed Concept-Code Quality Improvement labels Sep 17, 2019
@canton7
Copy link
Contributor

canton7 commented May 1, 2020

I think this is resolved by #35006? With the order-of-evaluation stuff tracked by #38641.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compilers Code Gen Quality Room for improvement in the quality of the compiler's generated code Feature - Tuples Tuples Feature Request help wanted The issue is "up for grabs" - add a comment if you are interested in working on it
Projects
None yet
Development

No branches or pull requests

8 participants