-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue]: __builtin_amdgcn_workgroup_size_x incorrectly returns 0 on Code Object Model 5 #119
Comments
I made some progress in tracking this down and I think I see why __builtin_amdgcn_workgroup_size_x is returning the wrong results. Looking at AMDGPULowerKernelAttributes.cpp in llvm I noticed that __builtin_amdgcn_implicitarg_ptr points to a region which is not being populated. Instead if I use the dispatch ptr which points to an instance of hsa_kernel_dispatch_packet_t I can get things to work. So the in memory representation of that should look like - typedef struct hsa_signal_s {
uint64_t handle;
} hsa_signal_t;
typedef struct hsa_kernel_dispatch_packet_s {
uint16_t header;
uint16_t setup;
uint16_t workgroup_size_x;
uint16_t workgroup_size_y;
uint16_t workgroup_size_z;
uint16_t reserved0;
uint32_t grid_size_x;
uint32_t grid_size_y;
uint32_t grid_size_z;
uint32_t private_segment_size;
uint32_t group_segment_size;
uint64_t kernel_object;
void* kernarg_address;
uint64_t reserved2;
hsa_signal_t completion_signal;
} hsa_kernel_dispatch_packet_t; And I was able to indeed verify this by printing this out for every workitem: To verify that this hypothesis was correct, using this snippet of code that reads the workgroup size at the right offset indeed fixes the issue and causes the kernel to calculate correctly - hsa_kernel_dispatch_packet_t* ptr = (hsa_kernel_dispatch_packet_t*)__builtin_amdgcn_dispatch_ptr();
// Read uint16_t starting at offset 4 bytes from ptr
uint16_t workgroup_size_x;
memcpy(&workgroup_size_x, (uint8_t*)ptr + 4, sizeof(uint16_t)); |
Think I was able to track this down further. Code Object Version 5 uses implicitarg ptr for __builtin_amdgcn_workgroup_size_x which is not being populated. Code Object Version 4 uses the dispatch pointer to answer __builtin_amdgcn_workgroup_size_x and hence works. To get this to work I had to set -mcode-object-version=4 in my call to clang. Based on MetadataStreamerMsgPackV5::emitHiddenKernalArgs I verified that the struct pointed to by the implicitarg ptr which should probably look like #pragma pack(push, 1)
struct ImplicitArg {
uint32_t hidden_block_count_x;
uint32_t hidden_block_count_y;
uint32_t hidden_block_count_z;
uint16_t hidden_group_size_x;
uint16_t hidden_group_size_y;
uint16_t hidden_group_size_z;
uint16_t hidden_remainder_x;
uint16_t hidden_remainder_y;
uint16_t hidden_remainder_z;
uint64_t hidden_tool_correlation_id;
uint64_t hidden_reserved_1;
uint64_t hidden_global_offset_x;
uint64_t hidden_global_offset_y;
uint64_t hidden_global_offset_z;
uint16_t hidden_grid_dims;
uint16_t hidden_reserved_2;
uint16_t hidden_reserved_3;
uint16_t hidden_reserved_4;
uint64_t hidden_printf_buffer;
uint64_t hidden_hostcall_buffer;
uint64_t hidden_multigrid_sync_arg;
uint64_t hidden_heap_v1;
uint64_t hidden_default_queue;
uint64_t hidden_completion_action;
uint32_t hidden_dynamic_lds_size;
uint32_t hidden_private_base;
uint32_t hidden_shared_base;
};
#pragma pack(pop) Is zeroed in all cases for all my workitems Open questions:
|
After studying llvm/libc/utils/gpu/loader/amdgpu/Loader.cpp I was able to make this work even on Code Object Version 5. Will blog with a simple standalone example over the next day or two. |
@deepankarsharma I have not looked at "hansa". I don't understand why the kernel above is calling builtins directly. The compiler expects the runtime that launches the kernels it builds to obey a contract which include setting up the implicit kernel arguments as dictated by the code object metadata. I assume "hansa" has not done so? The metadata is described at https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata . |
hansa is an experimental effort on my part to see if one can use upstream clang to do gpgpu out of the box on linux distros using amd gpus/apus. Issue with implicit kernel arguments and other setup was exactly what I ran into. Fortunately I now have vector add example working out of the box. |
https://github.com/deepankarsharma/hansa/blob/main/main.cpp contains a working example of launching a simple kernel that works out of the box on clang on both a cutting edge distro (archlinux with clang v18.1.8) that uses code model v5 and an older distro (ubuntu 22.04 with clang 15). Closing issue. |
llvm#122426) …181)" This reverts commit a531800.
Problem Description
Machine Details:
NAME="Ubuntu"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU:
model name : AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
GPU:
Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Marketing Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Name: gfx1103
Marketing Name: AMD Radeon Graphics
Name: amdgcn-amd-amdhsa--gfx1103
Operating System
Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CPU
AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
GPU
AMD Radeon RX 7900 XTX
ROCm Version
ROCm 6.1.0
ROCm Component
clang-ocl
Steps to Reproduce
In the following kernel __builtin_amdgcn_workgroup_size_x always returns zero on APU. This causes the indexing calculation to be incorrect. If I replace that call with a hardcoded value of 64 (which is what the code in main.cpp is setting it) then the calculation comes out to be correct.
Trace from the running program -
You can run the above kernel using the following steps
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module version 6.7.0 is loaded
HSA System Attributes
Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5137
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 15509504(0xeca800) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx1103
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5567(0x15bf)
ASIC Revision: 7(0x7)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2700
BDFID: 1024
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 35
SDMA engine uCode:: 17
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1103
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
The text was updated successfully, but these errors were encountered: