Include "simple" UTF-8 validation and transcoding logic for interpreted and low-footprint scenarios #48006

GrabYourPitchforks · 2021-02-08T19:18:04Z

See comment at #47928 (comment) for more information. Essentially, the current UTF-8 transcoding and validation logic is "fast" logic, but there may be scenarios (interpreted, low-footprint, etc.) where we'd prefer "simple" logic instead. Imagine a Foo.SpeedOpt.cs and Foo.Simple.cs.

@marek-safar Do you know offhand of any scenarios where this might be useful? What runtimes would we compile in each mode?

ghost · 2021-02-08T19:18:07Z

Tagging subscribers to this area: @tarekgh, @krwq, @eiriktsarpalis, @layomia
See info in area-owners.md if you want to be subscribed.

Issue Details

See comment at #47928 (comment) for more information. Essentially, the current UTF-8 transcoding and validation logic is "fast" logic, but there may be scenarios (interpreted, low-footprint, etc.) where we'd prefer "simple" logic instead. Imagine a Foo.SpeedOpt.cs and Foo.Simple.cs.

@marek-safar Do you know offhand of any scenarios where this might be useful? What runtimes would we compile in each mode?

Author:	GrabYourPitchforks
Assignees:	-
Labels:	`area-System.Text.Encoding`, `up-for-grabs`
Milestone:	-

marek-safar · 2021-02-09T13:02:14Z

I'm wondering if we should control the size optimizations with a feature switch instead of hardcoding it for specific platforms. We used browser, android, ios, tvos as size-sensitive targets but even on these platforms, someone could prefer speed over size.

Related PR #47918

@jkotas @stephentoub @eerhardt thoughts?

stephentoub · 2021-02-09T14:58:47Z

I'd be ok with that as long as the speed-focused targets trim out the size-focused paths as well and not just the other way around, and as long as it doesn't negatively impact performance for the speed-focused targets, e.g. with extra cruft being left in that can't be successfully and cheaply removed by the JIT.

jkotas · 2021-02-09T16:43:16Z

We should quantify how much we are talking about here. A lot of the vectorized code should be removed by linker already on platforms that do not have hardware accelerated Vectors or hardware intrinsics. There is some small tax left after that, but I do not think it will add up to much. We are paying similar tax by e.g. artificially splitting methods between platform neutral and platform specific parts that is much more pervasive pattern and that it may be more worthy to do something about.

GrabYourPitchforks · 2021-02-09T18:35:58Z

@jkotas If I get codegen numbers for various changes would that help? My numbers would be for x64 since that's what I build. But presumably we could extrapolate to wasm and other archs.

jkotas · 2021-02-09T19:06:05Z

I am not sure how you can reliably extrapolate from x64 codegen numbers to impact on wasm.

GrabYourPitchforks · 2021-02-09T19:37:59Z

It would be something like "the [non-vectorized] x64 codegen went from xxx bytes to yyy bytes." It's not directly comparable to wasm, but I think as a first-order approximation any reduction as a percentage of the original footprint would be consistent across archs.

pavelsavara · 2021-03-10T20:42:44Z

Hi @jkotas, @GrabYourPitchforks,

There is first attempt PR on simple implementation, which actually passed all unit tests un the CI. It also includes your code review feedback so far.

When I build it locally on my ubuntu with ./build.sh -os Browser -c Release -arch wasm I can see following difference in sizes of artifacts/bin/ILLinkTrimAssembly/net6.0-Browser-Release-wasm/trimmed-runtimepack/System.Private.CoreLib.dll

Original larger size: 3044864
New smaller size: 3038208
Difference: 6656 bytes.

Open questions

shall we also simplify Utf16Utility.GetPointerToFirstInvalidChar()
is it worth merging it ?
note that System.Linq.csproj has OptimizeForSize with Browser, Android, iOS, tvOS but for Utf8 we used just Browser, tvOS in System.Private.CoreLib.Shared.projitems after @jkotas comment

Thanks
Pavel

GrabYourPitchforks · 2021-03-10T21:30:28Z

shall we also simplify Utf16Utility.GetPointerToFirstInvalidChar()

Unfortunately I don't think that would have significant payoff. Much of the complicated code is hidden behind an "is SIMD enabled?" switch, and that entire branch should be pruned for wasm builds. The code that remains should already be fairly small.

is it worth merging it ?

I wish I could answer that question. 🙂 There are other people (@eerhardt?) who have a deeper understanding of the size-on-disk requirements. Maybe they could chime in either here or in the original linked issue?

eerhardt · 2021-03-10T21:55:49Z

@pavelsavara - would it be possible to get size measurements (before and after) of your change in a default blazorwasm app? Follow the steps in #49373 (comment) for how to do this.

In general a 6KB uncompressed IL savings turns into roughly 2KB .br compressed savings. But the question is how much of that savings will still be there after trimming the entire app? If the change was "free", I'd definitely think a 1-2KB .br size savings would be "worth it". But as the expense/maintenance cost goes up, the savings need to go up as well.

pavelsavara · 2021-03-11T13:25:30Z

Very good estimate @eerhardt :-)

Old size

 396526 Mar 11 13:36 System.Private.CoreLib.dll.br
 473588 Mar 11 13:36 System.Private.CoreLib.dll.gz
1165312 Mar 11 13:36 System.Private.CoreLib.dll

New size

 394156 Mar 11 13:33 System.Private.CoreLib.dll.br
 470186 Mar 11 13:33 System.Private.CoreLib.dll.gz
1158656 Mar 11 13:33 System.Private.CoreLib.dll

Brotli Difference: 2370 bytes

eerhardt · 2021-03-11T17:17:33Z

is it worth merging it ?

I'd definitely take a 2KB size reduction in the default Blazor WASM app!

So I guess it is up to the UTF-8 owners (@GrabYourPitchforks?) to decide if the maintenance cost isn't unbearable. If the answer is "the maintenance cost is really great here", I don't think I'd push super hard for 2KB. But if the answer is "yeah, this should be OK to maintain", I'd like to see us move forward with the change.

ghost · 2021-03-11T17:18:49Z

Tagging subscribers to 'size-reduction': @eerhardt, @SamMonoRT, @marek-safar, @tannergooding, @CoffeeFlux
See info in area-owners.md if you want to be subscribed.

Issue Details

See comment at #47928 (comment) for more information. Essentially, the current UTF-8 transcoding and validation logic is "fast" logic, but there may be scenarios (interpreted, low-footprint, etc.) where we'd prefer "simple" logic instead. Imagine a Foo.SpeedOpt.cs and Foo.Simple.cs.

@marek-safar Do you know offhand of any scenarios where this might be useful? What runtimes would we compile in each mode?

Author:	GrabYourPitchforks
Assignees:	-
Labels:	`area-System.Text.Encoding`, `size-reduction`, `untriaged`, `up-for-grabs`
Milestone:	-

GrabYourPitchforks · 2021-03-11T18:30:14Z

I don't think it adds significant maintenance burden. It's a clean change with separate implementations kept in separate files. And we have good regression test coverage here.

GrabYourPitchforks · 2021-03-17T18:45:01Z

Resolved by #49372.

GrabYourPitchforks added area-System.Text.Encoding help wanted [up-for-grabs] Good issue for external contributors labels Feb 8, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 8, 2021

pavelsavara mentioned this issue Mar 9, 2021

Include "simple" UTF-8 validation and transcoding logic for interpreted and low-footprint scenarios #49372

Merged

eerhardt added the size-reduction Issues impacting final app size primary for size sensitive workloads label Mar 11, 2021

GrabYourPitchforks removed untriaged New issue has not been triaged by the area owner help wanted [up-for-grabs] Good issue for external contributors labels Mar 17, 2021

GrabYourPitchforks closed this as completed Mar 17, 2021

SingleAccretion mentioned this issue Mar 30, 2021

Use OptimizeForSize condition in SPC only on Browser #50413

Merged

pavelsavara mentioned this issue Apr 15, 2021

[wasm] Major performance regression serializing and deserializing json #50260

Closed

ghost locked as resolved and limited conversation to collaborators Apr 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include "simple" UTF-8 validation and transcoding logic for interpreted and low-footprint scenarios #48006

Include "simple" UTF-8 validation and transcoding logic for interpreted and low-footprint scenarios #48006

GrabYourPitchforks commented Feb 8, 2021

ghost commented Feb 8, 2021

marek-safar commented Feb 9, 2021

stephentoub commented Feb 9, 2021 •

edited

Loading

jkotas commented Feb 9, 2021 •

edited

Loading

GrabYourPitchforks commented Feb 9, 2021

jkotas commented Feb 9, 2021

GrabYourPitchforks commented Feb 9, 2021

pavelsavara commented Mar 10, 2021 •

edited

Loading

GrabYourPitchforks commented Mar 10, 2021

eerhardt commented Mar 10, 2021

pavelsavara commented Mar 11, 2021

eerhardt commented Mar 11, 2021

ghost commented Mar 11, 2021

GrabYourPitchforks commented Mar 11, 2021

GrabYourPitchforks commented Mar 17, 2021

Include "simple" UTF-8 validation and transcoding logic for interpreted and low-footprint scenarios #48006

Include "simple" UTF-8 validation and transcoding logic for interpreted and low-footprint scenarios #48006

Comments

GrabYourPitchforks commented Feb 8, 2021

ghost commented Feb 8, 2021

marek-safar commented Feb 9, 2021

stephentoub commented Feb 9, 2021 • edited Loading

jkotas commented Feb 9, 2021 • edited Loading

GrabYourPitchforks commented Feb 9, 2021

jkotas commented Feb 9, 2021

GrabYourPitchforks commented Feb 9, 2021

pavelsavara commented Mar 10, 2021 • edited Loading

GrabYourPitchforks commented Mar 10, 2021

eerhardt commented Mar 10, 2021

pavelsavara commented Mar 11, 2021

eerhardt commented Mar 11, 2021

ghost commented Mar 11, 2021

GrabYourPitchforks commented Mar 11, 2021

GrabYourPitchforks commented Mar 17, 2021

stephentoub commented Feb 9, 2021 •

edited

Loading

jkotas commented Feb 9, 2021 •

edited

Loading

pavelsavara commented Mar 10, 2021 •

edited

Loading