Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Optimization] DXC is missing common factor optimization in some cases #6593

Closed
lizhengxing opened this issue May 7, 2024 · 1 comment
Closed
Assignees
Labels
bug Bug, regression, crash

Comments

@lizhengxing
Copy link
Collaborator

Description
DXC is missing common factor optimization in some cases. Here's an example code:

cbuffer TemporalAAData : register ( b10 )
{
    float2 viewportRelativeSize ;
    float4 outputDimensions ;
}

Texture2D < float4 > HistoryColor : register ( t2 ) ;
SamplerState s_Linear : register ( s1 ) ;
RWTexture1D < float3 > outColorBuffer : register ( u0 ) ;

float4 Test ( in Texture2D < float4 > tex , in SamplerState linearSampler , in float2 uv )
{
    float2 samplePos = uv;
    float2 texPos1 = floor ( samplePos - 0.5f ) + 0.5f ;

    float2 texPos0 = texPos1 - 1 ;
    float2 texPos3 = texPos1 + 2 ;
    float2 texPos12 = texPos1 + samplePos;

    // DXC should recognize (outputDimensions . zw * viewportRelativeSize) is a common factor.
    texPos0 *= outputDimensions . zw * viewportRelativeSize ;
    texPos3 *= outputDimensions . zw * viewportRelativeSize ;
    texPos12 *= outputDimensions . zw * viewportRelativeSize ;

    float4 result = 0.0f ;
    result += tex . SampleLevel ( linearSampler , float2 ( texPos0 . x , texPos0 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos12 . x , texPos0 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos3 . x , texPos0 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos0 . x , texPos12 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos12 . x , texPos12 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos3 . x , texPos12 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos0 . x , texPos3 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos12 . x , texPos3 . y ) , 0.0f );
    result += tex . SampleLevel ( linearSampler , float2 ( texPos3 . x , texPos3 . y ) , 0.0f );
    return result ;
}

[ numthreads ( 8 , 8 , 1 ) ] void cs_main ( uint3 GroupID : SV_GroupID , uint GroupIndex : SV_GroupIndex , uint3 GTID : SV_GroupThreadID , uint3 DispatchThreadID : SV_DispatchThreadID )
{
    uint2 pixelCoord = GTID . xy ;

    outColorBuffer [ pixelCoord.x ] = Test ( HistoryColor , s_Linear , outputDimensions . zw ) . rgb;
}

Steps to Reproduce
Run the hlsl code above with the command line below:
dxc.exe -T cs_6_3 -E cs_main dxc-common-factor-optimization.hlsl

Actual Behavior
After compiling the example hlsl, you will see those dxils in the result.

  %24 = fmul fast float %7, %14
  %25 = fmul fast float %24, %22
  %26 = fmul fast float %8, %15
  %27 = fmul fast float %26, %23
  %28 = fmul fast float %7, %16
  %29 = fmul fast float %28, %22
  %30 = fmul fast float %8, %17
  %31 = fmul fast float %30, %23
  %32 = fmul fast float %7, %18
  %33 = fmul fast float %32, %22
  %34 = fmul fast float %8, %20
  %35 = fmul fast float %34, %23

The common factor %7 * %22 is calculated 3 times.

If we rewrite the relevant code like this:

    float2 temp = outputDimensions . zw * viewportRelativeSize;
    texPos0 *= temp ;
    texPos3 *= temp ;
    texPos12 *= temp ;

The generated dxils are:

  %24 = fmul fast float %22, %7
  %25 = fmul fast float %23, %8
  %26 = fmul fast float %24, %14
  %27 = fmul fast float %25, %15
  %28 = fmul fast float %24, %16
  %29 = fmul fast float %25, %17
  %30 = fmul fast float %24, %18
  %31 = fmul fast float %25, %20

The common factor %7 * %22 is only calculated 1 time.

DXC should recognize the common factor and generate the optimized dxils.

Environment

  • DXC version: 1.8 - 1.8.0.14576 (main, 381750e)
  • Host Operating System: Windows 11
@lizhengxing lizhengxing added bug Bug, regression, crash needs-triage Awaiting triage labels May 7, 2024
@lizhengxing lizhengxing self-assigned this May 7, 2024
lizhengxing added a commit that referenced this issue May 7, 2024
This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can recognize the a*c is a common factor and redundant.

However, this upstream change might overlook some obvious common factors. One case has been observed is:
  %2 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %3 = extractvalue %dx.types.CBufRet.f32 %2, 3
  %4 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %5 = extractvalue %dx.types.CBufRet.f32 %4, 1
  %6 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %7 = extractvalue %dx.types.CBufRet.f32 %6, 3
  %8 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %9 = extractvalue %dx.types.CBufRet.f32 %8, 1

  %11 = fmul fast float %3, %10
  %12 = fmul fast float %11, %5

  %14 = fmul fast float %7, %13
  %15 = fmul fast float %14, %9   ---> %3*%5 == %7*%9 --> they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC don't know (%3, %7) and (%7, %9) are redundant when running Reassociate pass. The redundancy of (%3, %7) and (%7, %9) will be eliminated after GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and it will run GVN pass after it to remove the redundancy generared in the second run of Reassociate pass.

One issue of this PR is changing the order of floating point operations will cause the precision issue and very different result of some edge cases.

So this PR adds the disable-aggressive-reassociation flag (it's false by default) to the non-upstream change. It can be used to mitigate and debug the issue caused by this PR.

Fixes #6593.
lizhengxing added a commit that referenced this issue May 8, 2024
This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common factor and redundant.

This is part 1 of the fix for #6593.
@damyanp damyanp removed the needs-triage Awaiting triage label May 8, 2024
@damyanp damyanp added this to the May 2024 Release milestone May 8, 2024
@damyanp damyanp moved this to Triaged in HLSL Triage May 8, 2024
lizhengxing added a commit that referenced this issue May 15, 2024
This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common factor and redundant.

This is part 1 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 15, 2024
This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common factor and redundant.

This is part 1 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 15, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 15, 2024
This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common factor and redundant.

This is part 1 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 15, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 15, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %2 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %3 = extractvalue %dx.types.CBufRet.f32 %2, 3
  %4 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %5 = extractvalue %dx.types.CBufRet.f32 %4, 1
  %6 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %7 = extractvalue %dx.types.CBufRet.f32 %6, 3
  %8 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %9 = extractvalue %dx.types.CBufRet.f32 %8, 1
  ....
  %11 = fmul fast float %3, %10
  %12 = fmul fast float %11, %5
  ....
  %14 = fmul fast float %7, %13
  %15 = fmul fast float %14, %9 ---> %3*%5 == %7*%9 --> they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%3, %7) and (%7, %9) are redundant when running Reassociate pass. The redundancy of (%3, %7) and (%7, %9) will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable enable-aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 16, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 16, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %4, 1
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)  ---> %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %6, 3                                                     ---> %Float4_1.w is redundant with %Float4_0.w
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)  ---> %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %8, 1                                                     ---> %Float2_1.y is redundant with %Float2_0.y
  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y    ---> (%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y) --> they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 16, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 17, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 17, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 17, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 21, 2024
This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common factor and redundant.

This is part 1 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 21, 2024
This PR pulls the upstream change, Reassociate: add global reassociation
algorithm
(llvm/llvm-project@b8a330c),
into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common
factor and redundant.

This is part 1 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 21, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 21, 2024
)

This PR (#6598)
pulls the upstream global reassociation algorithm change in DXC and can
reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders,
some shaders got worse compilation results and couldn't benefit from
this upstream change.

This PR adds a flag for the upstream global reassociation change. It
would be easier to roll back if a shader get worse compilation result
due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 21, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this issue May 21, 2024
Although DXC applied the upstream change, Reassociate: add global
reassociation algorithm
(llvm/llvm-project@b8a330c) in this PR
(#6598), it still
might overlook some obvious common factors.

One case has been observed is:
```
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3 

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y 

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

```
The upstream change can't identify this common factor because DXC
doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y)
are redundant when running Reassociate pass. Those redundancies will be
eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run
Reassociate pass again after GVN pass and then run GVN pass again to
remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision
issue. In case some shaders get unexpected results due to this PR, use
"-opt-disable aggressive-reassociation" to disable this PR and roll
back.

This is part 3 of the fix for
#6593.
lizhengxing added a commit that referenced this issue May 21, 2024
This PR pulls the upstream change, Reassociate: add global reassociation
algorithm
(llvm/llvm-project@b8a330c),
into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common
factor and redundant.

This is part 1 of the fix for #6593.

(cherry picked from commit 6f9c107)
@lizhengxing
Copy link
Collaborator Author

Close it since all the fixes are merged in DXC main branch.

lizhengxing added a commit that referenced this issue Sep 3, 2024
The aggressive reassociation optimization for this common factor issue (#6593) can reduce redundant calculations obviously.

But it also exposed some issues for the final codegen and caused perf regressions for some shaders in runtime tests.

Fixing those codegen issues is non-trival, as a workaround, this PR disables the aggressive reassociation optimization by default and will get the perf loss back.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug, regression, crash
Projects
Archived in project
Development

No branches or pull requests

2 participants