Investigate why CTS Nightly started failing #5161

Closed
jkwak-work opened this issue Sep 26, 2024 · 12 comments · Fixed by #5197
Assignees
Labels
goal:quality & productivity Quality issues and issues that impact our productivity coding day to day inside slang kind:ci & infra Continuous integration and infrastructure issue kind:regression

Comments

@jkwak-work
Collaborator

The VK-CTS nightly CI started failing on Sep 20.
The output log suggests that the process crashed while running the CTS tests.
It appears to always crash at the following test:

Test case 'dEQP-VK.glsl.texture_gather.offset.min_required_offset.2d_array.depth32f.size_npot.compare_greater.mirrored_repeat_clamp_to_edge'..
  Pass (Pass)

Test case 'dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.clamp_to_edge_repeat'..
Error: Process completed with exit code 1.
@jkwak-work jkwak-work added goal:quality & productivity Quality issues and issues that impact our productivity coding day to day inside slang kind:regression kind:ci & infra Continuous integration and infrastructure issue labels Sep 26, 2024
@jkwak-work jkwak-work self-assigned this Sep 26, 2024
@jkwak-work
Collaborator Author

I was not able to reproduce the issue on my local machine or on the runner machine.
When I tried to reproduce it on the runner machine, I also used the slang.dll built by the runner action.
I am puzzled about how to reproduce it.

I am going to disable a few tests and trigger the CI.
Since CI is the only way to reproduce the issue at the moment, this may allow me to narrow down or bisect where the problem is.

I can also set up a local runner on my machine and see if I can reproduce it through the CI process.
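
The narrowing-down idea above can be sketched as a plain bisection over the test list. This is a minimal illustration, not the procedure actually used in this issue: `bisect_culprit` and the `fails` callback are hypothetical names, and the sketch assumes exactly one culprit test and a deterministic failure, which may not hold for an intermittent CI-only problem.

```python
from typing import Callable, List, Sequence


def bisect_culprit(tests: Sequence[str],
                   fails: Callable[[List[str]], bool]) -> str:
    """Return the one test whose presence makes a run fail.

    Assumption: exactly one culprit exists and `fails` is deterministic.
    `fails` is a hypothetical callback that runs the given subset
    (for example by triggering a CI job) and reports whether it failed.
    """
    if not tests:
        raise ValueError("test list must not be empty")
    candidates = list(tests)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # If the first half still fails, the culprit is in it;
        # otherwise it must be in the remaining tests.
        candidates = first_half if fails(first_half) else candidates[mid:]
    return candidates[0]
```

Each call to `fails` corresponds to one CI run, so a list of N tests needs roughly log2(N) runs under these assumptions.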

@jkwak-work
Collaborator Author

The failing CTS nightly tests can be found here

@aleino-nv
Collaborator

Do we have some kind of asserts enabled in Release mode?

I ask because the process exit code on Windows is never just 1 on a "crash" (e.g. segfault) -- it's usually some 'large' number.

This makes me think the exit is controlled. Maybe even exit(1) is being called.
If so, then probably there is a more detailed error message somewhere. Do we capture stderr? Is the full (.xml format) dEQP log available?
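
As a rough illustration of the point about exit codes: an unhandled exception on Windows normally surfaces as an NTSTATUS value such as 0xC0000005 (access violation), not a small integer. The sketch below is only a heuristic, and the function names are made up for this example.

```python
# A few well-known NTSTATUS codes that show up as process exit codes.
KNOWN_CRASH_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def looks_like_windows_crash(exit_code: int) -> bool:
    """Heuristic: True if the code is in the NTSTATUS error-severity range."""
    # Some tools report the exit code as a signed 32-bit value,
    # so normalise it to unsigned before comparing.
    code = exit_code & 0xFFFFFFFF
    return code >= 0xC0000000


def describe_exit_code(exit_code: int) -> str:
    """Return a short human-readable description of an exit code."""
    code = exit_code & 0xFFFFFFFF
    if code in KNOWN_CRASH_CODES:
        return KNOWN_CRASH_CODES[code]
    if looks_like_windows_crash(code):
        return "unrecognised NTSTATUS error 0x%08X" % code
    return "ordinary exit code %d" % code
```

Under this heuristic, an exit code of 1 is classified as an ordinary exit, which is consistent with the suspicion that the exit was controlled or externally triggered.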

@jkwak-work
Collaborator Author

I was able to reproduce the issue, but only through the CI workflow.
The following three tests are related to the problem:

dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.clamp_to_edge_repeat
dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.mirrored_repeat_clamp_to_edge
dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.repeat_mirrored_repeat

When I removed them from the test list, the CI went green.
When I ran only those three tests, the CI also passed, which is strange.

I believe this is related to the recent upgrade of the SPIR-V headers.
I will check whether the issue can be observed with and without the specific commit.

@jkwak-work
Collaborator Author

jkwak-work commented Sep 27, 2024

> Do we have some kind of asserts enabled in Release mode?
>
> I ask because the process exit code on Windows is never just 1 on a "crash" (e.g. segfault) -- it's usually some 'large' number.
>
> This makes me think the exit is controlled. Maybe even exit(1) is being called. If so, then probably there is a more detailed error message somewhere. Do we capture stderr? Is the full (.xml format) dEQP log available?

That is an interesting perspective. I agree that it may not be a crash. I will try to get the TestResults.qpa printed when reproduced.

@jkwak-work
Collaborator Author

I suspected that the following merge was the cause, but I was still able to reproduce the issue with commits before it.

9d40ce4 Update spirv-tools version (#5089)

I am starting to think that some external tools might have been updated on the system and that this might be causing the problem.

There is another issue 5175 that appears to be related to the Vulkan SDK installed on the same runner machine.
That kind of thing could explain what I have seen so far.

As for the exit code 1, I checked TestResults.qpa when the issue was reproduced.
You can see the log in a new step called "Dump TestResults.qpa if failed" in the CI log.
The log file was simply truncated in the middle of the output.

 <Text>Note: texture level&apos;s size is (2, 2)</Text>
 <Image Name="InputTextureLevel6" Width="1" Height="1" Format="RGBA8888" CompressionMode="PNG" Description="Input texture, level 6">
  iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVQImWM4eVLv
  OwAG1wK40eeSHwAAAABJRU5ErkJggg==
</Image>
 <Text>Note: texture level&apos;s size is (1, 1)</Text>
 <Text>Texture base level is 0</Text>
 <Text>s and t wrap modes are VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE and VK_SAMPLER_ADDRESS_MODE_REPEAT, respectively</Text>
 <Text>Minification and magnification filter modes are VK_FILTER_NEAREST and VK_FILTER_NEAREST, respectively (note that they should have no effect on gather result)</Text>
 <Text>Using texture swizzle [default swizzle state]</Text>
 <Section Name="Iteration0" Description="Iteration 0">
  <Text>Texture coordinates run from (-0.3, -0.4) to (1.5, 1.6)

When it ran without errors, the log was exactly the same but continued as follows:

  <Text>Texture coordinates run from (-0.3, -0.4) to (1.5, 1.6)</Text>
  <ImageSet Name="VerifyResult" Description="Verification result">
   <Image Name="Rendered" Width="64" Height="64" Format="RGBA8888" CompressionMode="PNG" Description="Rendered image">
    iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAB30lEQVR4nO2aMYoq
    QRRFzx86cBWKS3ABJsZGHYiYiOCIKzByFcazAAPBxEAUF6C4AgMDE7fgZD97ZdXI
    /M9McMV6J7jBgaYul+6kqT/7/X6PmsOBAxwiNx4zhnHkOh060IncdMoUpv88o1aj
    BrVUv/2k7yuR/QDF+7u6AlCFauq60P2fZ0cwSl0TmqlrQOPB4wU8wwRVi0DX4ntG
    FoGmRaBhcU/2n4APoC6gxgdQF1BTlGVZqkvQatGCVuQuFy5widz5zBnOkWu3aUM7
    cvU6dahHbr1mDev06OzfAB9AXUCND6AuoKb4/FRXAHrQS90SlqnbwjZ1c5inrg/9
    1F3h+uDoAp5hgp5FYGkR2FoE5haBvkXganFP9p+AD6AuoMYHUBdQU6xWq5W6BLsd
    O9hFbrFgAYvIzWbMYBa54ZAhDCN3OnGCU+QqFSpQSY/O/g3wAdQF1PgA6gJqio8P
    dQVgA5vUTWCSuiMcU1fCl396N7ilbgCDB0cX8AwTbCwCE4vA0SJQWgRuFoGBxT3Z
    fwI+gLqAmuwH8PsBfj8gc3wAdQE1PoC6gBq/H/Cr4i+AD6AuoMYHUBdQ4/cD/H5A
    5vgA6gJqfAB1ATV+P+BXxV8AH0BdQI0PoC6gxu8H+P2AzPkLqhOcqFNmwIgAAAAA
    SUVORK5CYII=
</Image>
</ImageSet>
</Section>
 <Section Name="Iteration1" Description="Iteration 1">
  <Text>Texture coordinates run from (-0.3, -0.4) to (1.5, 1.6)</Text>
  <ImageSet Name="VerifyResult" Description="Verification result">
   <Image Name="Rendered" Width="64" Height="64" Format="RGBA8888" CompressionMode="PNG" Description="Rendered image">
    iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAABsElEQVR4nO2aoa0C
    QRRFDz8rqAJCA5tQwBo0CoHAYAjZIqiCBtYiSDAIEkIBbAkIxJqtAffdDDOf5Ccg
    Ltl5R1xxzZzcAIbXu16vV9TUNTXUQVeWlFAG3XzOHOZBt9mwgc2/bwyHDGEY1z/v
    +HaJ5Afo5XmeqyUYDBjAIOiqigqqoFuvWcM66JqGBpqgKwoKKIJuPGYM4/jp5D8B
    NoBaQI0NoBZQk81magVgApO4+/PTDtzhHndTmMbdCEZxd4LTi6cz+IYJJi48jQvP
    3YVn6sIzcuE5uXgm+a+ADaAWUGMDqAXUZI/H46GWYLFgAYugOxw4wCHozmfOcA66
    3Y4d7IJuuWQJy6BrW1po46eT/wTYAGoBNTaAWkBNdjyqFYALXOJuD/u428I27law
    irsb3OKuD/0XT2fwDRNcXHj2LjxbF56VC8/Nhafv4pnkvwI2gFpAjQ2gFlDTsz9H
    E8cGUAuosQHUAmrsPuAj8Q5gA6gF1NgAagE1dh9g9wGJYwOoBdTYAGoBNXYf8JF4
    B7AB1AJqbAC1gBq7D7D7gMSxAdQCamwAtYAauw94x7dLJD/AL+GomXAi8LbRAAAA
    AElFTkSuQmCC
</Image>
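
A truncated log like the one above could be detected automatically. The following is a minimal sketch that assumes the `#beginTestCaseResult`, `#endTestCaseResult` and `#terminateTestCaseResult` marker lines that dEQP writes around each test case in a .qpa file; `unfinished_cases` is a hypothetical helper, not part of the CTS tooling.

```python
from typing import List, Optional


def unfinished_cases(qpa_text: str) -> List[str]:
    """List test cases that were started but never closed in a .qpa log.

    Assumption: each case is bracketed by a '#beginTestCaseResult <name>'
    line and either '#endTestCaseResult' or '#terminateTestCaseResult'.
    """
    open_case: Optional[str] = None
    unfinished: List[str] = []
    for raw_line in qpa_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#beginTestCaseResult"):
            if open_case is not None:
                # A new case began before the previous one was closed.
                unfinished.append(open_case)
            open_case = line[len("#beginTestCaseResult"):].strip()
        elif line.startswith(("#endTestCaseResult", "#terminateTestCaseResult")):
            open_case = None
    if open_case is not None:
        # The log ended while a case was still open: likely truncation.
        unfinished.append(open_case)
    return unfinished
```

A non-empty result points at the test that was running when the log stopped being written.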

When I searched the vk-gl-cts source code, I didn't find any lines that exit with the value 1.
Exit code 1 is typical when a process is terminated by an external tool.
This makes me wonder whether the GitHub runner script is killing the process under a certain condition.

I will see if I can get more information with the following workflow settings.

env:
  ACTIONS_RUNNER_DEBUG: true
  ACTIONS_STEP_DEBUG: true

@jkwak-work
Collaborator Author

I could not reproduce the issue once I enabled debug logging for the GitHub CI workflow.
#5194

env:
  ACTIONS_RUNNER_DEBUG: true
  ACTIONS_STEP_DEBUG: true

That seems like a bug in the CI runner process.
I am going to enable the logging and call it done.
If the issue persists, I will reopen this issue.

@jkwak-work
Collaborator Author

The same CTS failure was observed again today, in the same way as before.
I submitted a change to print more debugging information with the following settings, but they didn't seem to do anything.

    ACTIONS_RUNNER_DEBUG: true
    ACTIONS_STEP_DEBUG: true

I am going to disable the following tests, which are where the crash always happens:

dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.clamp_to_edge_repeat
dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.mirrored_repeat_clamp_to_edge
dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.repeat_mirrored_repeat

I will create a new low-priority issue so that those tests are not forgotten.
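
Disabling the three tests could look roughly like the sketch below. It assumes the CI reads a plain-text case list with one test name per line; that layout and the `filter_case_list` helper are assumptions for illustration, not a description of how the Slang CI actually stores its test list.

```python
from typing import Iterable, List

# The three test names are taken from the comment above.
DISABLED_TESTS = [
    "dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.clamp_to_edge_repeat",
    "dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.mirrored_repeat_clamp_to_edge",
    "dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.repeat_mirrored_repeat",
]


def filter_case_list(lines: Iterable[str], disabled: Iterable[str]) -> List[str]:
    """Return the case names from `lines` that are not in `disabled`.

    Blank lines are dropped; surrounding whitespace is ignored.
    """
    disabled_set = set(disabled)
    kept = []
    for line in lines:
        name = line.strip()
        if name and name not in disabled_set:
            kept.append(name)
    return kept
```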

@jkwak-work jkwak-work reopened this Oct 17, 2024
@jkwak-work
Collaborator Author

I disabled the three tests mentioned in my previous comment.
PR 7

@jkwak-work
Collaborator Author

Closing the issue for now.

@jkwak-work
Collaborator Author

jkwak-work commented Oct 17, 2024

I think there is an actual problem with the tests that occasionally cause the crash.
When they are run with validation enabled, the validation fails.

D:\sbf\git\slang\test_cts\build\Release\bin>set DISABLE_CTS_SLANG_SERVER_MODE=1

D:\sbf\git\slang\test_cts\build\Release\bin>set SLANG_RUN_SPIRV_VALIDATION=1

D:\sbf\git\slang\test_cts\build\Release\bin>deqp-vk.exe --deqp-case=dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.clamp_to_edge_repeat

Writing test log into TestResults.qpa
dEQP Core vulkan-cts-1.3.8.0-77-g706d4bdefac330cb5c30234902231ab61686b83e (0x706d4bde) starting..
  target implementation = 'Default'

Test case 'dEQP-VK.glsl.texture_gather.offset.implementation_offset.2d.rgba8.size_pot.clamp_to_edge_repeat'..
1
Disabled SLANG SERVER MODE
error: line 53: Expected Image Operand ConstOffset to be a const object
  %29 = OpImageGather %v4float %24 %27 %int_0 ConstOffset %21

(0): internal error 99999: Validation of generated SPIR-V failed. SPIRV generated:
; SPIR-V
; Version: 1.5
; Generator: Khronos; 40
; Bound: 37
; Schema: 0
               OpCapability Shader
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %main "main" %11 %offset %u_sampler %31 %entryPointParam_main_o_color %v_texCoord
               OpExecutionMode %main OriginUpperLeft

               ; Debug Information
               OpSource Slang 1
               OpName %v_texCoord "v_texCoord"  ; id %9
               OpName %SLANG_ParameterGroup_offset_std140 "SLANG_ParameterGroup_offset_std140"  ; id %13
               OpMemberName %SLANG_ParameterGroup_offset_std140 0 "u_offset"
               OpName %offset "offset"  ; id %17
               OpName %u_sampler "u_sampler"  ; id %26
               OpName %entryPointParam_main_o_color "entryPointParam_main.o_color"  ; id %34
               OpName %main "main"  ; id %2

               ; Annotations
               OpDecorate %v_texCoord Location 0
               OpDecorate %SLANG_ParameterGroup_offset_std140 Block
               OpMemberDecorate %SLANG_ParameterGroup_offset_std140 0 Offset 0
               OpDecorate %offset Binding 1
               OpDecorate %offset DescriptorSet 0
               OpDecorate %u_sampler Binding 0
               OpDecorate %u_sampler DescriptorSet 0
               OpDecorate %entryPointParam_main_o_color Location 0

               ; Types, variables and constants
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
      %float = OpTypeFloat 32
    %v2float = OpTypeVector %float 2
%_ptr_Input_v2float = OpTypePointer Input %v2float
%_ptr_Private_v2float = OpTypePointer Private %v2float
        %int = OpTypeInt 32 1
      %v2int = OpTypeVector %int 2
%SLANG_ParameterGroup_offset_std140 = OpTypeStruct %v2int
%_ptr_Uniform_SLANG_ParameterGroup_offset_std140 = OpTypePointer Uniform %SLANG_ParameterGroup_offset_std140
      %int_0 = OpConstant %int 0
%_ptr_Uniform_v2int = OpTypePointer Uniform %v2int
         %22 = OpTypeImage %float 2D 2 0 0 1 Unknown
         %23 = OpTypeSampledImage %22
%_ptr_UniformConstant_23 = OpTypePointer UniformConstant %23
    %v4float = OpTypeVector %float 4
%_ptr_Private_v4float = OpTypePointer Private %v4float
%_ptr_Output_v4float = OpTypePointer Output %v4float
 %v_texCoord = OpVariable %_ptr_Input_v2float Input
         %11 = OpVariable %_ptr_Private_v2float Private
     %offset = OpVariable %_ptr_Uniform_SLANG_ParameterGroup_offset_std140 Uniform
  %u_sampler = OpVariable %_ptr_UniformConstant_23 UniformConstant
         %31 = OpVariable %_ptr_Private_v4float Private
%entryPointParam_main_o_color = OpVariable %_ptr_Output_v4float Output

               ; Function main
       %main = OpFunction %void None %3
          %4 = OpLabel
          %7 = OpLoad %v2float %v_texCoord
               OpStore %11 %7
         %20 = OpAccessChain %_ptr_Uniform_v2int %offset %int_0
         %21 = OpLoad %v2int %20
         %24 = OpLoad %23 %u_sampler
         %27 = OpLoad %v2float %v_texCoord
         %29 = OpImageGather %v4float %24 %27 %int_0 ConstOffset %21
               OpStore %31 %29
               OpStore %entryPointParam_main_o_color %29
               OpReturn
               OpFunctionEnd

Failed to compile: 80004005
  InternalError (Compiling GLSL to SPIR-V failed at vkPrograms.cpp:701)

DONE!

Test run totals:
  Passed:        0/1 (0.0%)
  Failed:        1/1 (100.0%)
  Not supported: 0/1 (0.0%)
  Warnings:      0/1 (0.0%)
  Waived:        0/1 (0.0%)
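
The validation error above says that the `ConstOffset` image operand of `OpImageGather` must be a constant, while `%21` in the dump is the result of an `OpLoad` from the uniform block. As a rough aid for spotting this pattern in other disassemblies, the sketch below scans the text form for `ConstOffset` operands that are not defined by an `OpConstant*` or `OpSpecConstant*` instruction. It is a text-level approximation written for this example, not a replacement for the SPIR-V validator, and it only handles the case where `ConstOffset` is the sole image operand.

```python
import re
from typing import Dict, List, Tuple

# "%id = OpSomething ..." at the start of a line defines a result id.
DEFINITION_RE = re.compile(r"^\s*(%\S+)\s*=\s*(Op\w+)")
# "ConstOffset %id" as the only image operand of an image instruction.
CONST_OFFSET_RE = re.compile(r"\bConstOffset\s+(%\S+)")


def non_constant_const_offsets(disassembly: str) -> List[Tuple[str, str]]:
    """Return (id, defining opcode) pairs for non-constant ConstOffset operands."""
    lines = disassembly.splitlines()
    defining_op: Dict[str, str] = {}
    for line in lines:
        match = DEFINITION_RE.match(line)
        if match:
            defining_op[match.group(1)] = match.group(2)

    problems = []
    for line in lines:
        match = CONST_OFFSET_RE.search(line)
        if not match:
            continue
        operand = match.group(1)
        opcode = defining_op.get(operand, "<undefined>")
        if not opcode.startswith(("OpConstant", "OpSpecConstant")):
            problems.append((operand, opcode))
    return problems
```

Run against the dump above, this would flag `%21`, whose defining instruction is `OpLoad`.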

@jkwak-work
Collaborator Author

I created a new issue for the validation failure.
