Reduce struct size from 56 bytes to the recommended 16 #20

Merged on Oct 28, 2016

atifaziz
Contributor

ByteSize weighs in at 56 bytes, which is far from the recommendation of 16 bytes for value types (to avoid excessive copying and bloating other value types). This PR turns KiloBytes, GigaBytes, TeraBytes and PetaBytes into pure computed properties because these are trivial calculations and not all of them may be needed by the user. This brings the size of ByteSize to 16 bytes and also avoids the cost of computing them during initialization if they're never used.
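A minimal sketch of the before and after shapes described above, not the library's actual source. The type names `StoredUnitsSize`, `ComputedUnitsSize` and `LayoutDemo` are hypothetical, the 1024-based divisors are an assumption, and `MegaBytes` is included on the assumption that it accounts for the seventh 8-byte field of the 56-byte layout. `Unsafe.SizeOf<T>()` is built into modern .NET; on .NET Framework it is assumed to come from the System.Runtime.CompilerServices.Unsafe package.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical "before" shape: every unit is a stored field.
// One long plus six doubles = 7 x 8 = 56 bytes per instance.
public readonly struct StoredUnitsSize
{
    public readonly long Bits;
    public readonly double Bytes, KiloBytes, MegaBytes, GigaBytes, TeraBytes, PetaBytes;

    public StoredUnitsSize(double bytes)
    {
        Bits = (long)Math.Ceiling(bytes * 8);
        Bytes = bytes;
        // All of these divisions run on every construction, used or not.
        KiloBytes = bytes / 1024d;
        MegaBytes = bytes / (1024d * 1024);
        GigaBytes = bytes / (1024d * 1024 * 1024);
        TeraBytes = bytes / (1024d * 1024 * 1024 * 1024);
        PetaBytes = bytes / (1024d * 1024 * 1024 * 1024 * 1024);
    }
}

// Hypothetical "after" shape: only Bits and Bytes are stored (16 bytes).
// The larger units are computed on demand from Bytes.
public readonly struct ComputedUnitsSize
{
    public readonly long Bits;
    public readonly double Bytes;

    public ComputedUnitsSize(double bytes)
    {
        Bits = (long)Math.Ceiling(bytes * 8);
        Bytes = bytes;
    }

    public double KiloBytes => Bytes / 1024d;
    public double MegaBytes => Bytes / (1024d * 1024);
    public double GigaBytes => Bytes / (1024d * 1024 * 1024);
    public double TeraBytes => Bytes / (1024d * 1024 * 1024 * 1024);
    public double PetaBytes => Bytes / (1024d * 1024 * 1024 * 1024 * 1024);
}

public static class LayoutDemo
{
    // Managed size of T in bytes, as the runtime lays it out.
    public static int SizeOf<T>() => Unsafe.SizeOf<T>();
}
```

With these definitions, `LayoutDemo.SizeOf<StoredUnitsSize>()` reports 56 and `LayoutDemo.SizeOf<ComputedUnitsSize>()` reports 16, while callers still read `KiloBytes` and friends as ordinary properties.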

Owner

omar commented Oct 28, 2016

Thank you for this. Could you link to where the recommendation of 16 bytes is?

Side note: I knew I recognized your name from somewhere. I'm a big fan of ELMAH, thanks for making it!

omar merged commit 8dde88a into omar:master on Oct 28, 2016
Owner

omar commented Oct 28, 2016

This change is now on NuGet as v1.2.2 - https://www.nuget.org/packages/ByteSize/1.2.2.

atifaziz deleted the struct-size branch on October 28, 2016 05:35
@atifaziz
Contributor Author

Thanks for merging and publishing an update so quickly!

Contributor Author

atifaziz commented Oct 28, 2016

Could you link to where the recommendation of 16 bytes is?

It's actually more of a guideline, and anything above it should be measured to understand the impact.

Since you asked for a link mentioning that magical 16-byte number, see the following passage in “Choosing Between Class and Struct”:

X AVOID defining a struct unless the type has all of the following characteristics:

  • It logically represents a single value, similar to primitive types (int, double, etc.).
  • It has an instance size under 16 bytes.
  • It is immutable.
  • It will not have to be boxed frequently.

Here's also a good old blog entry that goes into overall optimization details around value types:

How are value types implemented in the 32-bit CLR? What has been done to improve their performance?

Also bear in mind that when you embed one value type in another, like in Nullable<T> or KeyValuePair<TKey, TValue>, then the size is compounded.
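A small sketch of how the size compounds when a value type is embedded in another. `Payload56` and `Payload16` are hypothetical stand-ins with the same field counts as ByteSize before and after this PR, and `EmbeddedSizes` is an illustrative helper. As above, `Unsafe.SizeOf<T>()` is assumed to be available.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical stand-ins: same field counts as ByteSize before and after the PR.
public struct Payload56 { public long A; public double B, C, D, E, F, G; }
public struct Payload16 { public long A; public double B; }

public static class EmbeddedSizes
{
    // Size of Nullable<T>: the payload plus a "has value" flag and any padding.
    public static int OfNullable<T>() where T : struct => Unsafe.SizeOf<T?>();

    // Size of KeyValuePair<T, T>: two full copies of the payload.
    public static int OfPair<T>() => Unsafe.SizeOf<KeyValuePair<T, T>>();
}
```

For example, `EmbeddedSizes.OfPair<Payload56>()` is 112 bytes against 32 for `EmbeddedSizes.OfPair<Payload16>()`, and `EmbeddedSizes.OfNullable<Payload56>()` is larger than the 56-byte payload it wraps.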

Benchmarks

On measuring the performance impact of this change, here are the numbers from before:

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-6600U CPU 2.60GHz, ProcessorCount=4
Frequency=2742185 ticks, Resolution=364.6727 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0

Type=Benchmark  Mode=Throughput  Toolchain=Clr  
Runtime=Clr  
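 Method | Platform |       Jit |         Median |     StdDev |
------- |--------- |---------- |--------------- |----------- |
   Test |      X64 | LegacyJit |    244.7583 us |  2.6329 us |
   Test |      X64 |    RyuJit |    172.1453 us |  1.2531 us |
   Test |      X86 | LegacyJit | 11,551.8453 us | 26.6448 us |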

And after reducing the value size (which means less copying and less computation in the constructor), the impact is an approximately threefold increase in performance (between roughly 2.7x and 3.3x, depending on the platform and JIT):

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-6600U CPU 2.60GHz, ProcessorCount=4
Frequency=2742185 ticks, Resolution=364.6727 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0

Type=Benchmark  Mode=Throughput  Toolchain=Clr  
Runtime=Clr  
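 Method | Platform |       Jit |        Median |     StdDev |
------- |--------- |---------- |-------------- |----------- |
   Test |      X64 | LegacyJit |    91.9229 us |  0.7212 us |
   Test |      X64 |    RyuJit |    51.7332 us |  0.2708 us |
   Test |      X86 | LegacyJit | 3,761.4194 us | 16.2540 us |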
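
The speedup per configuration can be checked directly from the two sets of medians. This is only an arithmetic sketch; the `Speedup` class and its member names are hypothetical, and the numbers are copied from the benchmark output above.

```csharp
using System;

public static class Speedup
{
    // Medians in microseconds, copied from the before and after benchmark runs.
    public static readonly (string Job, double Before, double After)[] Medians =
    {
        ("X64 LegacyJit", 244.7583, 91.9229),
        ("X64 RyuJit", 172.1453, 51.7332),
        ("X86 LegacyJit", 11551.8453, 3761.4194),
    };

    // How many times faster the "after" run is than the "before" run.
    public static double Ratio(double before, double after) => before / after;

    public static void Print()
    {
        foreach (var (job, before, after) in Medians)
            Console.WriteLine($"{job}: {Ratio(before, after):F2}x");
    }
}
```

Calling `Speedup.Print()` gives ratios of about 2.66, 3.33 and 3.07 for the three rows, which is where the "approximately threefold" figure comes from.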

Benchmark Code

// ReSharper disable CheckNamespace

using System;
using System.Linq;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using ByteSizeLib;

static class Program
{
    static void Main() =>
        Console.WriteLine(BenchmarkRunner.Run<Benchmark>());
}

[Config(typeof(Config))]
public class Benchmark
{
    [Benchmark]
    public static void Test()
    {
        var total = ByteSize.FromBytes(8);
        for (var i = 0; i < 10000; i++)
            total = Add(total, total);
    }

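    // NoInlining forces a real call, so both ByteSize arguments are
    // passed by value (copied) on every iteration.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static ByteSize Add(ByteSize a, ByteSize b) => a + b;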

    sealed class Config : ManualConfig
    {
        public Config()
        {
            var jobs =
                from jit in Job.AllJits
                from runtime in new[] { Job.Clr.Runtime }
                select jit.With(runtime);
            Add(jobs.ToArray());
        }
    }
}

Disassemblies

Following is the disassembly of the 32-bit x86 JIT-ed code of Benchmark.Test:

00C2046D  sub         esp,70h  
; ⁞
;       var total = ByteSize.FromBytes(8);
00C20481  lea         edi,[ebp-78h]  
00C20484  xor         eax,eax  
00C20486  lea         ecx,[eax+0Eh]  
00C20489  rep stos    dword ptr es:[edi]  
00C2048B  fld         qword ptr ds:[0C20570h]  
00C20491  sub         esp,8  
00C20494  fstp        qword ptr [esp]  
00C20497  lea         ecx,[ebp-78h]  
00C2049A  call        dword ptr ds:[785468h]  
00C204A0  lea         edi,[ebp-40h]  
00C204A3  lea         esi,[ebp-78h]  
00C204A6  mov         ecx,0Eh  
00C204AB  rep movs    dword ptr es:[edi],dword ptr [esi]  
;       for (var i = 0; i < 10000; i++)
00C204AD  xor         esi,esi  
;           total = Add(total, total);
00C204AF  lea         eax,[ebp-40h]  
00C204B2  sub         esp,38h  
00C204B5  movq        xmm0,mmword ptr [eax]  
00C204B9  movq        mmword ptr [esp],xmm0  
00C204BE  movq        xmm0,mmword ptr [eax+8]  
00C204C3  movq        mmword ptr [esp+8],xmm0  
00C204C9  movq        xmm0,mmword ptr [eax+10h]  
00C204CE  movq        mmword ptr [esp+10h],xmm0  
00C204D4  movq        xmm0,mmword ptr [eax+18h]  
00C204D9  movq        mmword ptr [esp+18h],xmm0  
00C204DF  movq        xmm0,mmword ptr [eax+20h]  
00C204E4  movq        mmword ptr [esp+20h],xmm0  
00C204EA  movq        xmm0,mmword ptr [eax+28h]  
00C204EF  movq        mmword ptr [esp+28h],xmm0  
00C204F5  movq        xmm0,mmword ptr [eax+30h]  
00C204FA  movq        mmword ptr [esp+30h],xmm0  
00C20500  lea         eax,[ebp-40h]  
00C20503  sub         esp,38h  
00C20506  movq        xmm0,mmword ptr [eax]  
00C2050A  movq        mmword ptr [esp],xmm0  
00C2050F  movq        xmm0,mmword ptr [eax+8]  
00C20514  movq        mmword ptr [esp+8],xmm0  
00C2051A  movq        xmm0,mmword ptr [eax+10h]  
00C2051F  movq        mmword ptr [esp+10h],xmm0  
00C20525  movq        xmm0,mmword ptr [eax+18h]  
00C2052A  movq        mmword ptr [esp+18h],xmm0  
00C20530  movq        xmm0,mmword ptr [eax+20h]  
00C20535  movq        mmword ptr [esp+20h],xmm0  
00C2053B  movq        xmm0,mmword ptr [eax+28h]  
00C20540  movq        mmword ptr [esp+28h],xmm0  
00C20546  movq        xmm0,mmword ptr [eax+30h]  
00C2054B  movq        mmword ptr [esp+30h],xmm0  
00C20551  lea         ecx,[ebp-40h]  
00C20554  call        dword ptr ds:[784D94h]  
;       for (var i = 0; i < 10000; i++)
00C2055A  inc         esi  
00C2055B  cmp         esi,2710h  
00C20561  jl          00C204AF  

After applying this PR, both the code and the allocated stack space are considerably smaller (sub esp,70h vs. sub esp,10h, or 112 vs. 16 bytes):

0297046C  sub         esp,10h  
; ⁞
;       var total = ByteSize.FromBytes(8);
02970482  fld         qword ptr ds:[2970500h]  
02970488  sub         esp,8  
0297048B  fstp        qword ptr [esp]  
0297048E  call        729A4B83  
02970493  sub         esp,8  
02970496  fstp        qword ptr [esp]  
02970499  call        728B1CD4  
0297049E  mov         ecx,eax  
029704A0  fld         dword ptr ds:[2970508h]  
029704A6  lea         eax,[ebp-14h]  
029704A9  mov         dword ptr [eax],ecx  
029704AB  mov         dword ptr [eax+4],edx  
029704AE  fstp        qword ptr [eax+8]  
;       for (var i = 0; i < 10000; i++)
029704B1  xor         esi,esi  
;           total = Add(total, total);
029704B3  lea         eax,[ebp-14h]  
029704B6  sub         esp,10h  
029704B9  movq        xmm0,mmword ptr [eax]  
029704BD  movq        mmword ptr [esp],xmm0  
029704C2  movq        xmm0,mmword ptr [eax+8]  
029704C7  movq        mmword ptr [esp+8],xmm0  
029704CD  lea         eax,[ebp-14h]  
029704D0  sub         esp,10h  
029704D3  movq        xmm0,mmword ptr [eax]  
029704D7  movq        mmword ptr [esp],xmm0  
029704DC  movq        xmm0,mmword ptr [eax+8]  
029704E1  movq        mmword ptr [esp+8],xmm0  
029704E7  lea         ecx,[ebp-14h]  
029704EA  call        dword ptr ds:[10C4D94h]  
;       for (var i = 0; i < 10000; i++)
029704F0  inc         esi  
029704F1  cmp         esi,2710h  
029704F7  jl          029704B3  
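
As a rough way to read the two x86 listings: each movq load/store pair copies 8 bytes, so the number of pairs needed to pass one argument by value is the struct size divided by 8. The sketch below only does that arithmetic; `CopyCost` and its members are hypothetical names, not part of ByteSize or the JIT.

```csharp
using System;

public static class CopyCost
{
    // Each movq load/store pair in the listings moves 8 bytes.
    public const int BytesPerMove = 8;

    // Load/store pairs needed to copy one struct argument onto the stack.
    public static int PairsPerArgument(int structSize) => structSize / BytesPerMove;

    // Add(total, total) passes two arguments by value on every iteration.
    public static int PairsPerCall(int structSize, int argumentCount = 2) =>
        PairsPerArgument(structSize) * argumentCount;

    // Converts a hex frame size such as "70" (from sub esp,70h) to decimal bytes.
    public static int FrameBytes(string hex) => Convert.ToInt32(hex, 16);
}
```

`CopyCost.PairsPerCall(56)` is 14 and `CopyCost.PairsPerCall(16)` is 4, matching the 14 and 4 movq pairs visible inside the loop before and after the change; `CopyCost.FrameBytes("70")` and `CopyCost.FrameBytes("10")` give the 112 and 16 byte frame sizes.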

The effects are similar for 64-bit. Here's what the JIT compiled before:

00007FFA795708C2  sub         rsp,108h  
; ⁞
;       var total = ByteSize.FromBytes(8);
00007FFA795408DF  xor         ecx,ecx  
00007FFA795408E1  lea         rax,[rsp+98h]  
00007FFA795408E9  vxorpd      xmm1,xmm1,xmm1  
00007FFA795408EE  vmovdqu     xmmword ptr [rax],xmm1  
00007FFA795408F3  vmovdqu     xmmword ptr [rax+10h],xmm1  
00007FFA795408F9  vmovdqu     xmmword ptr [rax+20h],xmm1  
00007FFA795408FF  mov         qword ptr [rax+30h],rcx  
00007FFA79540903  lea         rcx,[rsp+98h]  
00007FFA7954090B  vmovsd      xmm1,qword ptr [7FFA79540A20h]  
00007FFA79540914  call        00007FFA79540148  
00007FFA79540919  vmovdqu     xmm0,xmmword ptr [rsp+98h]  
00007FFA79540923  vmovdqu     xmmword ptr [rsp+0D0h],xmm0  
00007FFA7954092D  vmovdqu     xmm0,xmmword ptr [rsp+0A8h]  
00007FFA79540937  vmovdqu     xmmword ptr [rsp+0E0h],xmm0  
00007FFA79540941  vmovdqu     xmm0,xmmword ptr [rsp+0B8h]  
00007FFA7954094B  vmovdqu     xmmword ptr [rsp+0F0h],xmm0  
00007FFA79540955  mov         rcx,qword ptr [rsp+0C8h]  
00007FFA7954095D  mov         qword ptr [rsp+100h],rcx  
;       for (var i = 0; i < 10000; i++)
00007FFA79540965  xor         esi,esi  
;           total = Add(total, total);
00007FFA79540967  lea         rcx,[rsp+0D0h]  
00007FFA7954096F  vmovdqu     xmm0,xmmword ptr [rsp+0D0h]  
00007FFA79540979  vmovdqu     xmmword ptr [rsp+60h],xmm0  
00007FFA79540980  vmovdqu     xmm0,xmmword ptr [rsp+0E0h]  
00007FFA7954098A  vmovdqu     xmmword ptr [rsp+70h],xmm0  
00007FFA79540991  vmovdqu     xmm0,xmmword ptr [rsp+0F0h]  
00007FFA7954099B  vmovdqu     xmmword ptr [rsp+80h],xmm0  
00007FFA795409A5  mov         rdx,qword ptr [rsp+100h]  
00007FFA795409AD  mov         qword ptr [rsp+90h],rdx  
00007FFA795409B5  vmovdqu     xmm0,xmmword ptr [rsp+0D0h]  
00007FFA795409BF  vmovdqu     xmmword ptr [rsp+28h],xmm0  
00007FFA795409C6  vmovdqu     xmm0,xmmword ptr [rsp+0E0h]  
00007FFA795409D0  vmovdqu     xmmword ptr [rsp+38h],xmm0  
00007FFA795409D7  vmovdqu     xmm0,xmmword ptr [rsp+0F0h]  
00007FFA795409E1  vmovdqu     xmmword ptr [rsp+48h],xmm0  
00007FFA795409E8  mov         rdx,qword ptr [rsp+100h]  
00007FFA795409F0  mov         qword ptr [rsp+58h],rdx  
00007FFA795409F5  lea         rdx,[rsp+60h]  
00007FFA795409FA  lea         r8,[rsp+28h]  
00007FFA795409FF  call        00007FFA79540098  
;       for (var i = 0; i < 10000; i++)
00007FFA79540A04  inc         esi  
00007FFA79540A06  cmp         esi,2710h  
00007FFA79540A0C  jl          00007FFA79540967  

This reduces to the following after the PR (stack allocation went from sub rsp,108h to sub rsp,50h, or 264 to 80 bytes):

00007FFA795704B2  sub         rsp,50h  
; ⁞
;       var total = ByteSize.FromBytes(8);
00007FFA795604C6  vmovsd      xmm0,qword ptr [7FFA79560548h]  
00007FFA795604CF  call        00007FFAD8D11D04  
00007FFA795604D4  vcvttsd2si  rcx,xmm0  
00007FFA795604D9  vmovsd      xmm0,qword ptr [7FFA79560550h]  
00007FFA795604E2  mov         qword ptr [rsp+40h],rcx  
00007FFA795604E7  vmovsd      qword ptr [rsp+48h],xmm0  
;       for (var i = 0; i < 10000; i++)
00007FFA795604EE  xor         esi,esi  
;           total = Add(total, total);
00007FFA795604F0  lea         rcx,[rsp+40h]  
00007FFA795604F5  lea         rdx,[rsp+30h]  
00007FFA795604FA  mov         r8,qword ptr [rsp+40h]  
00007FFA795604FF  mov         qword ptr [rdx],r8  
00007FFA79560502  vmovsd      xmm0,qword ptr [rsp+48h]  
00007FFA79560509  vmovsd      qword ptr [rdx+8],xmm0  
00007FFA7956050F  lea         rdx,[rsp+20h]  
00007FFA79560514  mov         r8,qword ptr [rsp+40h]  
00007FFA79560519  mov         qword ptr [rdx],r8  
00007FFA7956051C  vmovsd      xmm0,qword ptr [rsp+48h]  
00007FFA79560523  vmovsd      qword ptr [rdx+8],xmm0  
00007FFA79560529  lea         rdx,[rsp+30h]  
00007FFA7956052E  lea         r8,[rsp+20h]  
00007FFA79560533  call        00007FFA79560098  
;       for (var i = 0; i < 10000; i++)
00007FFA79560538  inc         esi  
00007FFA7956053A  cmp         esi,2710h  
00007FFA79560540  jl          00007FFA795604F0  

Owner

omar commented Oct 28, 2016

Thanks for the thorough response. It makes sense that the only data we need to store is the total number of bits.
