-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid unnecessary boxing with String.Concat #415
Conversation
{ | ||
// The lowered expressions coming in will be boxing conversions if the input is a value type. | ||
// When passed to String.Concat, the resulting object will have ToString called on it. | ||
// We can optimize away the boxing allocation by just calling ToString on the value type directly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a breaking change and a deviation from C# spec in some cases. If the value type is mutable and its method ToString() is mutating, then currently this mutation is not observable because it's made on a boxed copy. With this change it will directly mutate the value, that can have observable effects. This looks like a safe optimization only for known immutable value types like int, Guid, DateTime etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. The majority of the cases I've seen this occur with are in fact with chars, ints, bytes, and just a handful of other core types, so I think the decrease in value for the change would be fairly negligable. If you agree this is worthwhile to pursue, I'll go ahead and make that change...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we should use a whitelist of most common immutable value types. Any breaking changes should be first approved by C# LDM or Roslyn Compat Council.
I'll go and add a test that asserts the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VladimirReshetnikov do you consider it a breaking change to call ToString on the immutable value types? Or are you just referring to the current behavior which would call it on any struct?
CC @gafter for his feedback. The change itself looks good modular @VladimirReshetnikov comment about mutations. |
Thanks, guys. I've updated the PR based on @VladimirReshetnikov 's feedback. |
// As such, we special case core types we know to be safe (and the most | ||
// important ones based on usage, anyway). | ||
|
||
switch (operand.Type.SpecialType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can safely add enums, Guid, DateTimeOffset, TimeSpan, and nullable versions of all whitelisted types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all enums map to SpecialType.System_Enum? I wasn't sure if that referred to the base Enum type or to any enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, SpecialType.System_Enum indicates the abstract class System.Enum. Enums (concrete value types) have their property TypeKind == TypeKind.Enum
Updated based on the additional feedback. |
/// <summary> | ||
/// Checks if a type is an immutable value type primitive. | ||
/// </summary> | ||
public static bool IsImmutableValueTypePrimitive(this SpecialType specialType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need this method? If it is a value type and has any SpecialType, then it should be safe. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gafter That places an undocumented restriction on the future expansion of the SpecialType
enumeration. While it's unlikely to ever be a problem, that's certainly a dangerous position to put yourself in when it can be avoided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @sharwell
Better to find a place down the road where we forgot to do this optimization vs. finding out we accidentally started doing it to a type where it is not safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like I should leave it as is for now. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we really trying to "defend" against the possibility that we'll add a platform special value type that has a mutating ToString() method? That seems really, really unlikely. I'd rather see this method removed and have the simpler test in the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gafter, I don't feel strongly about it. I removed the extension per your feedback.
Nice! 👍 |
// important ones based on usage, anyway). | ||
|
||
TypeSymbol typeSymbol = operand.Type; | ||
if (typeSymbol.SpecialType.IsImmutableValueTypePrimitive() || typeSymbol.TypeKind == TypeKind.Enum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider including TimeSpan, DateTimeOffset, Guid, Version and Nullable<T> for immutable T.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good suggestion. I was aiming to keep this fairly lean; is it cheap to look up such arbitrary types? If so, I can do so; otherwise, I'd suggest we leave this as is for now, and it can always be updated later with more types if necessary.
d2bbbf5
to
3535ade
Compare
Thanks, all. I rebased/squashed to address merge conflict with @VladimirReshetnikov 's additional test so that the Merge pull request button works in the GitHub UI. |
LGTM as well. JaredPar from a phone From: Vladimir Reshetnikov notifications@github.com [:+1:] Reply to this email directly or view it on GitHubhttps://github.com//pull/415#issuecomment-74013669. |
While a bit out of scope for Roslyn, I think it might also be beneficial to have a generic version of |
Does this change result in different output for code like the following? string s = "Value: " + 3.1 + (Thread.CurrentCulture = {some other culture}); In the past, the |
In expressions like Since I think such code is very common, would it make sense to add special overloads of |
@svick: Yes, I allude to that in my original PR description, at least in the case of literal chars. In your mentioned case, I think there's a limit to how many such overloads you can reasonably add to the libraries. That'd be two overloads for the two-parameter version (string/char, char/string), 6 for the three-parameter version (string/string/char, string/char/string, char/string/string, string/char/char, char/string/char/, char/char/string), etc., and that's just for the case of char; similar needs arise for Int32 and so on. @aelij above mentioned adding generic overloads; that would help to avoid the boxing (at the expense of potentially needing many generic instantiations), and is something folks like @KrzysztofCwalina are considering for this and other such cases for the future, but it also wouldn't address the string allocation you allude to (unless type tests were done in the implementation to special case each of the specific cases, which would theoretically be a possibility). @sharwell: Yes, that would change the formatting; this can be seen with the following: var dot = CultureInfo.GetCultureInfo("en-US");
var comma = CultureInfo.GetCultureInfo("fr-FR");
Thread.CurrentThread.CurrentCulture = dot;
Console.WriteLine("Value: " + (object)3.1 + ((Thread.CurrentThread.CurrentCulture = comma) != null ? "" : ""));
Thread.CurrentThread.CurrentCulture = dot;
Console.WriteLine("Value: " + 3.1.ToString() + ((Thread.CurrentThread.CurrentCulture = comma) != null ? "" : "")); prints Value: 3,1
Value: 3.1 So, yes, there is a potential behavioral change here. The thought of someone changing culture in the middle of string concatenation bothers me even more than the thought of someone mutating a mutable struct in the middle of a ToString call :), but nevertheless it is a behavioral change. @theoy , @pharring, thoughts? Up to you guys what we do here. I know such corner-cases behavioral changes have been taken for Roslyn... thoughts on this one? |
@stephentoub Yeah, I agree that adding all those overloads would be way too much. But I think that having just the two two-parameter overloads would improve the common case while not adding too much.
How come? How can you get the characters of |
To avoid allocations you would just need to be able to determine its resulting length, and you can determine the number of base-10 digits in an Int32 with some math, at least for invariant culture (it's possible there's some culture I'm not aware of where this wouldn't hold.) |
Goodness, changing the culture as a side-effect of a string concat! To steal an expression from someone else: @sharwell just Nikoved it (in reference to @VladimirReshetnikov, who is the person most likely to come up with bizarre edge cases). |
Bizarre indeed. The C# spec seems to support @stephentoub change over the current behavior. It implies the
It wouldn't be the first time we deviated from the spec though. |
Why did we remove the method to explicitly check the |
I completely agree, and I see I failed to mention this in my reply. The unfortunate part of this is we are now following the spec, but only for a select few types. The way I interpret the spec is we should only ever call the
Then the resulting strings get passed to the appropriate overload of 💡 The optimization in this pull request should be restricted to eliminating boxing operations without also changing the order of operations. I recommend first updating the compiler to evaluate string concatenation arguments in the correct order, and then re-evaluating this optimization in that context. Edit: this will allow further optimizations such as the fact that it's always safe to evaluate |
👍 |
@stephentoub @sharwell Sam, this is an amazing example! To preserve the current behavior we would need to evaluate arguments to string.Concat from left to right, then invoke ToString() on those arguments that are immutable value types from left to right, and then invoke string.Concat. It might require some shuffling of values on the CIL evaluation stack. Interestingly, the C# spec does not specify an order in which ToString() methods of operands of a string concatenation operator are invoked. We can imagine that ToString() of one operand changes the current culture, and it affects ToString() of the other operand. Or they can set some global flags, or throw different exceptions. We likely want a test plan for this, and profiling results of real applications demonstrating a significant performance win that would warrant these changes. Any breaking changes need an approval form the Compat Council or LDM. |
@jaredpar Sorry, I do not follow you argument. Do you mean that the current behavior (without proposed changes) deviates from the spec. Can you give a concrete example? |
@VladimirReshetnikov, it sounds like you're saying this PR should just be closed (at least for now). Is that the consensus? |
I'd suggest to first add more test coverage around string concatenation and investigate if the current behavior differs from the spec. |
@VladimirReshetnikov, that's fine, but I don't have the time for that right now. If someone else would like to and has the time, great; we can leave this open pending that outcome. If not, and if that's required to move forward with this PR, then we should close it. |
Potential compromise: Only invoke this optimization when the arguments are side effect free. Essentially doesn't call into a method or property. It should be completely safe to do the optimization in that case. |
@jaredpar, thanks, I'll take a look at doing something along that line of thought. |
Maybe C# 7 can add a |
@jaredpar, I spent some time on that idea, got it all coded up, and then realized that it would end up being a significantly larger change to get it working correctly. As noted in some of the comments, the existing rewriter for this is very local, to the point where in these calls we may not actually have the full set of arguments supplied by the developer. I think this would require finding all such sibling expressions, and that's a much more invasive change. |
Maybe i missed it, but why don't we just apply this optimization to things we know are pure? i.e. the standard primitive types like bool, char, and all the numerics? You'd get the allocation savings, and you wouldn't have to worry about any changes in behavior. |
@CyrusNajmabadi System.Double is not pure by this definition, since it uses the current culture to decide what character to use for the decimal point. |
@CyrusNajmabadi, it's already limited as you say. The concern now is that this changes order of operations, and a subsequent expression in the concatenation could change the current culture in a way that would observably change the results. |
@CyrusNajmabadi @gafter "pure" describes an object (or operation) which does not affect other objects. In this case, we are also concerned with the other way around - i.e. objects which cannot be affected by any other object (or operation). While the primitive types are all pure, the |
@stephentoub Prior to resolving the ordering issue, we should be able to implement this optimization for types which are immutable, pure, and not affected by other code. Notably:
|
Thanks, @sharwell. I'd considered that and dismissed it in a moment of thinking that it'd be so limited to not be worthwhile, but after reading your comment and thinking about it again, I agree it would make sense for at least char. I'll revise for that. |
b454e9b
to
d90c626
Compare
Ok, take 2 :) I took @sharwell's suggestion and limited the optimization to just the special value types that don't use culture (or any other state external to the value) in ToString, such that no subsequent expression in the Concat could affect the result of ToString. Since char is verified as such, I also added in the optimization to convert a char literal used in the Concat to be a string literal; this saves not only the boxing at run time but also the string allocation at run time, and depending on the shape of the Concat, it can also result in further constant folding. (Please let me know if there's some reason that this optimization should be backed out, and I'll do so.) |
I'm glad you took the opportunity to improve the compile-time treatment of |
// calling ToString on subsequent arguments). For value types, it could mean mutating | ||
// the original instead of the boxed copy. Therefore, we can only apply the | ||
// optimization in cases where we know ToString to be non-mutating. | ||
// - It's possible that subsequent expressions in the concatenation mutate state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Since the current behavior is inconsistent with the C# spec, you should include a note here that the attention to preserving this behavior is for strict preservation of the backwards-compatible behavior. Also, I still think we should wait for the "compatibility committee" (or whoever) to review the situation and determine whether or not we should change the Roslyn behavior to actually follow the spec.
❓ How long will it take for the compatibility evaluation to be complete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compatibility council is not currently considering this issue because there is no proposal to break compatibility. If such a proposal is made it would be evaluated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to proceed with the current, more-limited optimization, to get something in place now. If we want to submit a follow-up proposal to break compat by expanding the scope, that'll be a relatively trivial PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When string concatenation encounters non-strings (e.g. path + '/', name + someInt32, etc.), it calls overloads of String.Concat that accept objects. These overloads then just call ToString on each of the objects, mapping to the C# spec which states that "any non-string argument is converted to its string representation by invoking the virtual ToString method inherited from type object." When any of the individual items being concatenated is a value type, that object first gets boxed to be passed to String.Concat as an object, only to then have ToString called on the boxed object. This commit changes the local rewriter for string concatenation to test whether an argument is of an appropriate type, and if it is, to call ToString on it directly, rather than first boxing it. This can then affect which overload of Concat is used, as the type of the argument has changed to be String. The primary benefit of this is saving the allocation per value-type item. There are some secondary benefits, as well; for example, as there is a four-string overload of Concat but no four-object overload of Concat, if this optimization is able to force all of the items to be strings, the four-string overload can be used rather than allocating an object array for the inputs and then another string array for the resulting strings (inside of Concat). This commit also includes an additional optimization specific to const chars; rather than doing a ToString call at run time, we can do it at compile-time and emit a literal string instead of a literal char, saving both the boxing and the string allocation at run time.
d90c626
to
70d2dee
Compare
@sharwell, I cleaned up the comment you cited a bit. @gafter, I realized that the changes I'd made to the various 2-3-many arg rewriters were unnecessary, and I could do the change in a single place in the entry rewriter. This not only simplified the change, it also helped to improve constant folding in some cases and alleviated an array allocation in the many-arg rewriter. I've pushed the updated commit. I'm happy with this if others are. @gafter? @jaredpar? (@sharwell's official proposal to allow for more optimizations can then be considered separately.) |
👍 |
1 similar comment
👍 |
Thanks, all. |
Avoid unnecessary boxing with String.Concat
When string concatenation encounters non-strings (e.g. path + '/', name + someInt32, etc.), it calls overloads of String.Concat that accept objects. These overloads then just call ToString on each of the objects, mapping to the C# spec which states that "any non-string argument is converted to its string representation by invoking the virtual ToString method inherited from type object." When any of the individual items being concatenated is a value type, that object first gets boxed to be passed to String.Concat as an object, only to then have ToString called on the boxed object.
This commit changes the local rewriter for string concatenation to test whether an argument is a value type, and if it is, to call ToString on it directly, rather than first boxing it. This can then affect which overload of Concat is used, as the type of the argument has changed to be String. The primary benefit of this is saving the allocation per value-type item. There are some secondary benefits, as well; for example, as there is a four-string overload of Concat but no four-object overload of Concat, if this optimization is able to force all of the items to be strings, the four-string overload can be used rather than allocating an object array for the inputs and then another string array for the resulting strings (inside of Concat).
(Beyond this optimization, there's the potential for another, which converts literal char arguments to be literal string arguments, avoiding the need for ToString's allocation altogether. But as that doesn't stick to the letter of the spec as I read it, I've not done so.)
There are lots of places where such an optimization is beneficial. Just searching through the Reference Source, there are many examples where code is unnecessarily boxing due to this. Just a few examples:
Roslyn itself bumps up against this, for example:
roslyn/src/Features/Core/Diagnostics/DiagnosticAnalyzerService.DiagnosticAnalyzersAndStates.cs
Line 180 in 6f35e33
roslyn/src/Features/Core/Completion/CommonCompletionUtilities.cs
Line 173 in 03e3045
roslyn/src/Samples/CSharp/ImplementNotifyPropertyChanged/Impl/ExpansionChecker.cs
Line 102 in b648430
roslyn/src/Features/Core/ExtractMethod/MethodExtractor.cs
Line 165 in 03e3045
roslyn/src/Features/Core/LanguageServices/AnonymousTypeDisplayService/AbstractAnonymousTypeDisplayService.cs
Line 83 in 47ff630
Etc.