[Proposal] Localizable String Interpolation #7529
-
This seems like 99% tooling and 1% language request, which to me feels really awkward. Does that 1% need a language feature, or can it be built out of facilities already provided by the language, such as custom string interpolation handlers? I'd hate to admit it, but it seems like interceptors paired with an analyzer and source generator could also accomplish this with existing syntax, including all of the tooling support around the resource files.
-
https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#string-formatting feels relevant to this proposal.
-
There are so many moving parts in this system, and it could be greatly simplified and done without language, compiler, or tooling changes, depending on what needs to be achieved here. But I really think that before we discuss how and what needs to change, it's better to have a general discussion about localization, where people share how they are solving it today in their products, how it's done on other platforms, and why existing solutions in .NET aren't sufficient. Once the topic has been researched and enough feedback gathered, we can talk about what should be done in the .NET ecosystem to make it better. So IMO delving into details and possible implementation(s) right now just makes it a futile discussion.
-
Localizable String Interpolation
In C# 6, string interpolation using `$"..."` was introduced. It's the bees' knees (or for readers in en-US-CA, "hella sick"). Over time significant improvements have been made, like the introduction of `DefaultInterpolatedStringHandler` for high-performance interpolation, and in .NET 8 the `CompositeFormat` class was introduced that approaches the problem from a different direction, allowing pre-parsing of old-style `string.Format` format strings. This means that users seeking high-performance non-localized logging, debug messages, etc. need look no further than `$"..."`, and users of classic `string.Format` with its `{0}`s and `{1:R}`s can easily benefit from performance improvements without losing any existing localization functionality.
But what about localizing interpolated strings? And how can we localize them easily and efficiently, without lots of boxing, temporary allocations or redundant parsing?
- `$"..."` for clear and straightforward code. However, the literal components of the string are lowered directly into the compiled code, as is the order of those literals. This makes the resulting string near-impossible to localize without abandoning the use of `$"..."`. Were you to use the `FormattableString` type, that would unlock the facility to localize the string (yourself) by looking up its `.Format` in a string table and using `.GetArguments` or `.GetArgument` to fetch its arguments, but this involves boxing and multiple temporary allocations, so the performance is not great.
- `CompositeFormat` API combined with existing Resource infrastructure would allow a disciplined development team to take each of their `$"..."` literals and hand-convert it into an old-style `string.Format` format string, then put those format strings into `.resx` files and look up the appropriate format at runtime, much like how exception messages are localized in the BCL today. The use of old-style format strings comes with many disadvantages that remain even with the aid of `CompositeFormat`, unfortunately - mandatory boxing for strings with more than a small number of insertions, more room for error due to positional instead of semantic expansions, and less implicit context for translators due to positional expansions.
An ideal solution would combine our existing tooling and features with some new smarts to enable localizing `$"..."` interpolated literals with minimal changes to existing code, which would enable every C# developer to easily add localization to their software in an incremental fashion without losing development velocity or introducing new bugs. If done right, this new solution could also enable new scenarios for high-performance string formatting, logging, and localization.
The examples I provide below are real-world examples from my main domain of expertise, game development, so they don't necessarily map 1:1 to localization scenarios in realms like web development or enterprise software, but hopefully they will be comprehensible.
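To make the two existing routes concrete, here is a minimal sketch of localizing through `FormattableString` and through `CompositeFormat` (.NET 8 or later). The `StatusQuo` helper, the German string table, and its contents are hypothetical and exist only for illustration; a real application would load the table from resources.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical string table keyed by the invariant (positional) format string.
var germanTable = new Dictionary<string, string>
{
    ["hello {0}, you are customer number {1}"] = "hallo {0}, Sie sind Kunde Nummer {1}",
};

string name = "Ada";
int customerCount = 3;

// Route 1: FormattableString. The compiler keeps the format and the arguments
// separate, but the arguments are boxed into an object[].
FormattableString message = $"hello {name}, you are customer number {customerCount}";
Console.WriteLine(StatusQuo.Localize(message, germanTable, CultureInfo.InvariantCulture));

// Route 2: CompositeFormat. The hand-converted positional format is parsed once
// and can be reused; the generic Format overloads avoid boxing for a few arguments.
CompositeFormat parsed = CompositeFormat.Parse(germanTable[message.Format]);
Console.WriteLine(string.Format(CultureInfo.InvariantCulture, parsed, name, customerCount));

static class StatusQuo
{
    // Looks up a translated format by the original format text,
    // falling back to the original when no translation exists.
    public static string Localize(
        FormattableString source,
        IReadOnlyDictionary<string, string> table,
        IFormatProvider provider)
    {
        string format = table.TryGetValue(source.Format, out var localized)
            ? localized
            : source.Format;
        return string.Format(provider, format, source.GetArguments());
    }
}
```

Both routes expose only `{0}` and `{1}` to the translator, which is the loss of semantic context described above.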
What An Ideal Solution Would Offer
First, the mandatory:
- `$"..."` string literals, more or less as-is, albeit with a little work to make them localizable. This might mean assigning the literal to a local of a specific type (using implicit conversion to tell Roslyn to get to work), or annotating it in some way. But an interpolated literal like `$"hello {name}, you are our {customerCount++}th customer today!"` should be localizable without too much work from the developer.
- `DefaultInterpolatedStringHandler` and `CompositeFormat`, not to mention existing technologies like `.resx` files, though ideally all of this would happen automatically. Having to introduce an entire new set of APIs and types to support this would be undesirable.
Then the nice-to-haves:
- A `$"..."` literal would transform into a sort of closure at compile time, containing a localizable reference to the format along with each value needed to format it. This closure would be a struct, allowing its use and indefinite storage without a heap allocation. (Some use cases would involve unavoidable boxing of the closure, however.)
- `{expr1} {expr2}` instead of `{0} {1}`. This provides valuable context for a translator, makes it easier to identify a given string when skimming a string table or inspecting state in the debugger, and makes structurally-identical-but-semantically-different strings distinct from each other.
- In this case we would want to be able to capture the expression `Data.FinalDamage > 0 ? $"{Data.FinalDamage} {Data.Type.ToKeyword().Text}" : "no"` as a strongly typed literal instance, and then utilize it in the construction of `logMessage` without ever calling `ToString` on it.
- `(cond ? $"a" : $"b")` would work, by turning the ternary expression into an instance of a single closure type that selects a different format string based on the value of `cond`. This would be hard to do automatically, but is a very valuable tool to have. For the example above, this would allow `damageText` to always have the same type regardless of which arm is selected.
The Vague Shape Of A Possible Solution
I've recently prototyped an implementation of some of these ideas, without the benefit of any changes to the C# language or Roslyn compiler stack, to feel out the benefits and identify challenges. So far in testing with a moderately complex video game, it has successfully reduced allocation counts and improved performance, while also making it easy to allow hot-swapping between languages at run-time or loading a modified string table after application start. Based on this experience, here's my rough proposal for what a good solution would look like:
- `$"..."`. I don't know what the syntax for this would look like, so for the purposes of this proposal I will use the intentionally-not-feasible `$🌎"..."`. Changing an existing interpolation literal to this new syntax would cause its type to change from `string` to an unnamed type conforming to certain protocols, much like the unnamed type generated by `new { a = 1 }`.
- `CompositeFormat` instance) in a localization table
- `System.Text.LocalizedInterpolation.Localized("TheValueOfA", $🌎"a's value is {a}.")` we would get a string table entry somewhere - perhaps somewhere resource-y - like:
- `CompositeFormat`-like representation I'll call an `InterpolationFormat`. This can probably actually be done via `CompositeFormat` with a little extra work. An `InterpolationFormat` can be used as the driver to convert a given localized interpolation literal into text, i.e.
  would under the hood use the data from the parsed `InterpolationFormat` to step through and either append string literals or call the literal's `EmitValue` method as appropriate.
- `ToString` implementation and implement the relevant interfaces like `IFormattable` or whatever's appropriate, which will look up the right format string in the appropriate string table based on the thread's current/default culture. This allows taking one of these literals and passing it to methods that only accept `String` quite easily (it could even expose an implicit conversion operator to `string`, but probably shouldn't).
- `switch` expressions in this proposed model. Maybe whatever source generator or compiler machinery handles this would be able to recognize common scenarios, and if it fails to reduce the ternary-or-switch to a single closure type, you would get an error much like you do for ambiguous implicitly-typed ternary expressions today. In my prototype the implementation of the `StringTableKey` getter on the interpolated literal type is itself a `switch` expression that selects a different key depending on the value inside the closure. For the `onTurnText` example from above, we might have something like:
  with corresponding string table entries for each arm:
  and a generated key selector like:
  Note that in this case it's fine for a given arm to not use all the values available in the closure, and the closure type for a switch with multiple arms needs to make sure it captures every interpolation expression that might appear in the string table.
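As a rough, hand-written approximation of the closure idea, the sketch below shows one struct serving both arms of a ternary, with a key selector and an emitter. `StringTableKey` and `EmitValue` are names used in this proposal; everything else (`ILocalizedInterpolation`, `DamageTextClosure`, `Formatter`, the table keys and templates, and looking up values by placeholder name) is an assumption made for illustration and is not the prototype's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical string table; placeholder names match the captured expressions.
var english = new Dictionary<string, string>
{
    ["DamageText.Some"] = "{amount} {kind}",
    ["DamageText.None"] = "no",
};

var hit = new DamageTextClosure(amount: 12, kind: "fire");
var miss = new DamageTextClosure(amount: 0, kind: "fire");
Console.WriteLine(Formatter.Format(english, hit));
Console.WriteLine(Formatter.Format(english, miss));

// Assumed protocol that a generated closure type would conform to.
interface ILocalizedInterpolation
{
    string StringTableKey { get; }
    void EmitValue(string name, StringBuilder output);
}

// Stand-in for what might be generated for
// (amount > 0 ? $"{amount} {kind}" : "no"): one struct covers both arms.
readonly struct DamageTextClosure : ILocalizedInterpolation
{
    private readonly int _amount;
    private readonly string _kind;

    public DamageTextClosure(int amount, string kind)
    {
        _amount = amount;
        _kind = kind;
    }

    // Key selector that encodes the original condition.
    public string StringTableKey => _amount > 0 ? "DamageText.Some" : "DamageText.None";

    // Appends one captured value; an arm is free to ignore some of them.
    public void EmitValue(string name, StringBuilder output)
    {
        switch (name)
        {
            case "amount": output.Append(_amount); break;
            case "kind": output.Append(_kind); break;
            default: throw new FormatException($"Unknown placeholder '{name}'.");
        }
    }
}

static class Formatter
{
    // Naive driver: re-scans the template on every call and does not support
    // escaped braces or format specifiers. A pre-parsed format would replace this loop.
    public static string Format<T>(IReadOnlyDictionary<string, string> table, T closure)
        where T : struct, ILocalizedInterpolation
    {
        string template = table[closure.StringTableKey];
        var output = new StringBuilder();
        int position = 0;
        while (position < template.Length)
        {
            int open = template.IndexOf('{', position);
            if (open < 0)
            {
                output.Append(template, position, template.Length - position);
                break;
            }
            int close = template.IndexOf('}', open + 1);
            if (close < 0) throw new FormatException("Unterminated placeholder.");
            output.Append(template, position, open - position);
            closure.EmitValue(template.Substring(open + 1, close - open - 1), output);
            position = close + 1;
        }
        return output.ToString();
    }
}
```

The generic constraint on `Format<T>` keeps the struct unboxed, and because the key is selected inside the closure, both arms share a single static type.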
- `switch`-expression:
Q&A
This section attempts to answer common questions and clarify some points.
- `TheValueOfA` in your examples?
  `switch` over string literals.
- `$"Hello, {firstName} {lastName}!"`, in some cultures you would reverse the order of the first and last name. If the template is `$"Hello, {0} {1}!"` that semantic meaning is lost, and it is harder to tell whether it is `{firstName} {lastName}` or perhaps `{prefix} {lastName}`, i.e. "Mr. Smith".
- `$"Hello, {prefix} {lastName}!"`, the words and grammar used when localizing depend on context. In some cultures, you use different words and different grammar depending on the status of both the speaker and the listener. This also makes it valuable to keep the semantic meaning. (This also provides a use case for ternaries/switch expressions as described above.)
- `$"{count} item(s) found"`, in some cultures the word 'items' may be translated into one of many different words depending on the nature of the item being counted. Given that, it is valuable to be able to assign a semantic name to this string, and switch expressions could also be useful.
Proposal In Action
Let's examine a sample scenario and walk through the process of how a developer would localize it and maintain it. Our hypothetical end user Sarah Developer is building some client software that updates a local file, and has started writing an error handler. When an error occurs while saving changes, she selects an appropriate error message to show to the user in a MessageBox based on the cause of the error:
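A hypothetical handler of this kind might look like the sketch below. The exception cases, the message text, and the `ErrorMessages.Describe` helper are all invented for illustration; in the real application the result would be passed to a `MessageBox` rather than the console.

```csharp
using System;
using System.IO;

Console.WriteLine(ErrorMessages.Describe(new UnauthorizedAccessException(), "notes.db"));
Console.WriteLine(ErrorMessages.Describe(new IOException("disk full"), "notes.db"));
Console.WriteLine(ErrorMessages.Describe(new InvalidOperationException(), "notes.db"));

static class ErrorMessages
{
    // Selects a user-facing message based on the cause of the failure.
    // Every literal below is compiled into the assembly in a single language.
    public static string Describe(Exception error, string fileName) => error switch
    {
        UnauthorizedAccessException => $"You don't have permission to modify {fileName}.",
        IOException io => $"Could not write to {fileName}: {io.Message}",
        _ => $"An unknown error occurred while saving {fileName}.",
    };
}
```

Each arm is an ordinary `$"..."` literal of type `string`, which is the starting point for the localization steps that follow.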
Everything is going well, and then Sarah gets a request to prepare the error handler for localization so that the company's users can see these error messages in their preferred language. Using this proposal, her first step is to turn each of the interpolated string literals into localizable ones, and then update the use of the message based on its new type:
At this point, the type of `errorMessage` has changed from `string` to what we'll call `__interpolated_string_1`, a struct. The `switch` expression now lowers to the creation of a struct-typed closure:
and the generated closure type contains a key selector that encodes the logic of the now-missing `switch` expression, along with an emitter method:
Alongside all of this, some part of the build pipeline (either the compiler or an analyzer) has generated an invariant string table containing the literals that used to live with the rest of the assembly's string literals (this is probably an embedded resource), and it looks something like this:
At runtime, some part of the stack is responsible for loading this data - let's assume we're using `.resx` and `ResourceManager`. During this loading process the template strings could be validated in advance (instead of at time of first use), and we could pre-parse them using `CompositeFormat` for better performance - but those details aren't important right now. The lowered code above followed the creation of the closure with a call to `ToString`, in order to show the message in a `MessageBox`. An unoptimized `ToString` implementation might look something like this:
At this point it's now possible to call `ToString` on our localized interpolation string closure and get a string out of it, and the temporary allocations seen in the sample code are all possible to optimize out. Let's move on to the important question: How do you localize it?
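The load-time validation and pre-parsing mentioned above could be sketched as follows, assuming .NET 8 or later. `StringTableLoader` and the table entries are hypothetical, and the templates use positional placeholders because that is what `CompositeFormat.Parse` understands today; named placeholders would need the extra work this proposal describes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical raw table, as it might arrive from a resource file.
var raw = new Dictionary<string, string>
{
    ["SaveFailed"] = "Could not save {0}: {1}",
    ["Broken"] = "Oops {0",
};

var (parsed, errors) = StringTableLoader.Load(raw);
Console.WriteLine($"{parsed.Count} template(s) parsed, {errors.Count} rejected");
foreach (string error in errors)
{
    Console.WriteLine(error);
}

static class StringTableLoader
{
    // Parses every template once, up front, so that malformed entries are
    // reported when the table is loaded instead of at first use.
    public static (Dictionary<string, CompositeFormat> Parsed, List<string> Errors) Load(
        IReadOnlyDictionary<string, string> raw)
    {
        var parsed = new Dictionary<string, CompositeFormat>();
        var errors = new List<string>();
        foreach (var (key, template) in raw)
        {
            try
            {
                parsed[key] = CompositeFormat.Parse(template);
            }
            catch (FormatException ex)
            {
                errors.Add($"{key}: {ex.Message}");
            }
        }
        return (parsed, errors);
    }
}
```

Collecting the errors rather than throwing on the first one lets a single bad translation be reported without preventing the rest of the table from loading.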
Our protagonist Sarah Developer has finished updating her code to use localized interpolation, and checked the compiler-generated .resx file into source control. (This duplication is unpleasant, but is already present if you look at how .resx localization works in a repository like dotnet/runtime. Critically, it is automatic.) She hands the .resx file off to the localization team to be localized.
The localization team immediately comes back and asks: "Why don't these strings have identifiers or descriptions?" A fantastic question. Sarah reads the relevant documentation and realizes that she can easily assign this interpolated string a name, and does so by updating the original code:
This results in a new, clearer string table, something like the following:
This string table can be handed off to the localization team who, consulting documentation, know that they can make a culture-specific version of it containing translated text. However, they ask Sarah for more detail on one of the messages. She obliges by updating her code once more, perhaps like this:
The Localized method has an optional parameter the developer can use to provide in-line commentary just like a code comment, and it flows through to the string table. Now the resulting invariant string table is truly ready for the localization team (I won't include it here again). Let's assume it works like existing resx string tables, so the invariant string is also there in the table the localizers create, so it looks something like this:
This new string table can get checked in to source control next to the auto-generated one, and at build time gets bundled up with all the application's other resources into a satellite assembly or embedded resource.
A week passes, and Sarah gets reports that the UnknownError message is occurring frequently for cases where a record was deleted while the user was editing it. She is asked to add a specialized error message for this scenario, so she updates the code to add a new switch arm. She also looks through her open issues and notices a request to revise one of the other error messages, and does so:
This updates the invariant string table, and when she prepares to commit to revision control, the diff for the string table looks something like this:
If Sarah's team uses automated localization tooling, it may automatically update all the other string tables based on this diff or file issue tickets. If not, she can glance at this diff and notify the localization team of the necessary changes.