
Too low 4GB allocation limit on Intel Arc GPUs (CL_DEVICE_MAX_MEM_ALLOC_SIZE) #627

Closed
ProjectPhysX opened this issue Mar 14, 2023 · 19 comments
Labels: merged (change was merged), question

Comments

@ProjectPhysX

ProjectPhysX commented Mar 14, 2023

The CL_DEVICE_MAX_MEM_ALLOC_SIZE on Intel Arc GPUs is currently set to 4GB (A770 16GB) and 3.86GB (A750). Trying to allocate larger buffers makes the cl::Buffer constructor return error -61 (CL_INVALID_BUFFER_SIZE). Suppressing the error by setting buffer flag (1<<23) during allocation turns compute results into nonsense when the buffer size is larger than 4GB.

This is likely related to 32-bit integer overflow in index calculation.

A 4GB limit on buffer allocation is not contemporary in 2023, especially on a 16GB GPU; it's not 2003 anymore, when computers were limited to 32-bit addressing. A lot of software needs to allocate larger buffers in order to fully use the available VRAM. FluidX3D, for example, needs up to 82% of VRAM in a single buffer; with the allocation limit at 25% on the A770 16GB, only 4.9GB of the 16GB can be used by the software.

The limit should be removed altogether by setting CL_DEVICE_MAX_MEM_ALLOC_SIZE = CL_DEVICE_GLOBAL_MEM_SIZE = 100% of physical VRAM capacity, and by making sure that array indices are computed with 64-bit integers. Nvidia and AMD have both allowed full-VRAM allocation in a single buffer for a long time already.
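
For context, a minimal reproducer sketch of the failing allocation (OpenCL C++ bindings; the 5 GiB size and first-GPU selection are arbitrary assumptions for illustration):

```cpp
// Sketch: attempt a >4GB allocation on the first available GPU; on Arc this
// currently fails with error -61 (CL_INVALID_BUFFER_SIZE).
#include <CL/opencl.hpp>
#include <iostream>

int main() {
    cl::Context context(CL_DEVICE_TYPE_GPU); // first available GPU
    cl_int error = CL_SUCCESS;
    const size_t bytes = 5ull << 30; // 5 GiB, above the reported 4GB limit
    cl::Buffer buffer(context, CL_MEM_READ_WRITE, bytes, nullptr, &error);
    std::cout << "clCreateBuffer error: " << error << std::endl; // -61 expected on Arc
    return 0;
}
```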

@jinz2014

Which Linux kernel is installed?

@ProjectPhysX
Author

@jinz2014 kernel 6.2.0-060200-generic, on Ubuntu 22.04.

@jinz2014

#617
I think the developers are aware of the issue. Looking forward to a solution in the future.

@BA8F0D39

BA8F0D39 commented Mar 15, 2023

Not an Intel employee, but I have found many bugs related to memory transfers:

  1. The Linux kernel driver causes random memory corruption on the A770 even when allocating less than 4GB.
  2. The blocking memory-transfer functions in the compute runtime unblock prematurely and return invalid data because the transfer hasn't completed yet.
  3. Having more than one independent memory transfer in flight sometimes causes memory corruption, e.g. copying A to B at the same time as copying C to D.
  4. Reading from VRAM is 2x slower than writing to it; no other GPU behaves this way.
  5. Segmentation fault when transferring GPU data from one CPU thread to another.
  6. The compute runtime prevents you from allocating all of the VRAM, while Vulkan and OpenGL allow it for some reason. Gaming under Linux is unaffected by most of these bugs.

Should we file a bug at https://bugzilla.kernel.org?

@jinz2014

@BA8F0D39 Thank you for your summary. Are there links/reproducers available for each item in your list?

@BA8F0D39

BA8F0D39 commented Mar 21, 2023

@jinz2014

  1. If you run the compute benchmarks at https://github.com/intel/compute-benchmarks, the memory-transfer benchmarks report infinite memory bandwidth, which is impossible. Some of the blocking memory functions evidently unblock prematurely, without completing their transfer first, causing the benchmarks to show infinite bandwidth.
                                          MapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         22.151         22.163          0.11%         22.101         22.165  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)            inf    2581110.154            inf    2581110.154            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)         22.150         22.162          0.12%         22.097         22.165  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         22.143         22.162          0.13%         22.093         22.163  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)            inf            inf            inf    2581110.154            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Read useEvents=1)         22.400         22.403          0.03%         22.388         22.404  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         22.402         22.403          0.02%         22.388         22.404  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)            inf   10324440.615            inf   10324440.615            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)         22.401         22.402          0.02%         22.387         22.403  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         22.401         22.403          0.02%         22.387         22.403  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)            inf   10324440.615            inf   10324440.615            inf  [GPU]         [GB/s]
                                             ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=None)          8.908          8.911          0.08%          8.893          8.915  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Usm)         21.459         21.351          0.84%         21.281         21.703  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Map)         21.427         21.422          1.17%         21.101         21.722  [CPU] 
  2. Having multiple memory transfers in flight at the same time in the compute benchmarks also corrupts memory.
[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/49
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[... StreamMemoryTest.Test/51 through /95 (every second test) fail with the same ZE_RESULT_NOT_READY assertion at stream_memory_l0.cpp:173 ...]

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/72
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[... UsmCopyTest.Test/76 through /128 (every fourth test) fail with the same ZE_RESULT_NOT_READY assertion at usm_copy_l0.cpp:101 ...]
  3. clpeak shows that reading from VRAM is 2x slower than writing to it; no other GPU behaves this way.
    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 21.64
      enqueueReadBuffer               : 8.92
      enqueueWriteBuffer non-blocking : 22.81
      enqueueReadBuffer non-blocking  : 9.10
      enqueueMapBuffer(for read)      : 20.58
        memcpy from mapped ptr        : 22.62
      enqueueUnmap(after write)       : 23.62
        memcpy to mapped ptr          : 22.44
  4. If you run the compute-benchmarks multithread tests, they all fail when using more than one CPU thread.
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=1 synchronize=0)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=1 synchronize=1)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=300 synchronize=0)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=300 synchronize=1)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=1 synchronize=0)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=1 synchronize=1)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=300 synchronize=0)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=300 synchronize=1)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=1 synchronize=0)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=1 synchronize=1)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=300 synchronize=0)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=300 synchronize=1)                                        ERROR

@MichalMrozek
Contributor

  1. If you run the compute benchmarks at https://github.com/intel/compute-benchmarks, the memory-transfer benchmarks report infinite memory bandwidth, which is impossible. Some of the blocking memory functions evidently unblock prematurely, without completing their transfer first, causing the benchmarks to show infinite bandwidth.

Those MapBuffer tests have the WriteInvalidate flag, which means no memory transfer is needed at all, as the host will be overwriting the contents. That's why the reported value is so high: there is no transfer, so the operation takes very little time. This shows that the driver properly optimizes those API calls.
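
For illustration, a hedged sketch of what that mapping path looks like in user code (a helper under assumed `queue`/`buffer` handles, not the benchmark's actual code):

```cpp
#include <CL/opencl.hpp>

// CL_MAP_WRITE_INVALIDATE_REGION promises the host will overwrite the whole
// region, so the runtime may return the mapping without transferring device
// contents to the host -- hence the near-instant "bandwidth" in the benchmark.
void overwrite_buffer(cl::CommandQueue& queue, cl::Buffer& buffer, size_t size) {
    void* ptr = queue.enqueueMapBuffer(buffer, CL_TRUE,
                                       CL_MAP_WRITE_INVALIDATE_REGION, 0, size);
    // ... fill all `size` bytes at ptr here ...
    queue.enqueueUnmapMemObject(buffer, ptr); // make contents visible to the device
}
```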

@FreddieWitherden

The limit should be removed altogether by setting CL_DEVICE_MAX_MEM_ALLOC_SIZE = CL_DEVICE_GLOBAL_MEM_SIZE = 100% of physical VRAM capacity, and by making sure that array indices are computed with 64-bit integers. Nvidia and AMD have both allowed full-VRAM allocation in a single buffer for a long time already.

I am unsure how practical this is. A lot of optimizations in the compiler are based around the buffer size being limited to 4 GiB. From what I can gather from the ISA, a lot of the memory instructions only support base + 32-bit-offset address calculations. If larger allocations are permitted, the compiler will not be able to emit these instructions and will have to fall back to those which accept 64-bit addresses. This requires extra registers in the kernel and emulated 64-bit arithmetic.
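
To illustrate the trade-off (illustrative OpenCL C only, not actual compiler output): while buffers are capped at 4 GiB a 32-bit index suffices and maps to cheap base + 32-bit-offset addressing; beyond that, the index arithmetic itself must be 64-bit:

```c
// 32-bit index: valid only while every buffer is < 4 GiB.
kernel void scale_u32(global float* x) {
    const uint i = get_global_id(0);   // 32-bit offset from the buffer base
    x[i] *= 2.0f;
}

// 64-bit index: required once buffers may exceed 4 GiB; costs extra
// registers and, without native 64-bit adds, emulated arithmetic.
kernel void scale_u64(global float* x) {
    const ulong i = get_global_id(0);  // full 64-bit address arithmetic
    x[i] *= 2.0f;
}
```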

@iamhumanipromise

I have a laptop with a 9th Gen Core i9 (with 9th Gen graphics as well) and 32GB RAM. Random Discord conversations have mentioned using it for Stable Diffusion. That being said, the 4GB limit seems to apply there as well.

So is this some sort of carryover from GFX8/GFX9 hardware? Does the allocation size change when using a dedicated GPU vs. an iGPU?

@MaciejPlewka
Contributor

MaciejPlewka commented Apr 13, 2023

It's possible to use allocations greater than 4GB. Please take a look at this guide: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

@iamhumanipromise

iamhumanipromise commented Apr 14, 2023

That's the programmer's guide. I'm specifically talking about after the Python environment has been launched and I'm executing already-generated code without this flag enabled.

Looks like I have to file an issue with that project in this case. Since this issue has been closed: is there no way to make this work without the developer modifying it (or forking, modifying, etc.)?

@ProjectPhysX
Author

ProjectPhysX commented Apr 15, 2023

It's possible to use allocations greater than 4GB. Please take a look at this guide: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

I just tried it again with my A750, the latest driver 22.49.25018.23, and kernel 6.2.11-060211-generic on Ubuntu 22.04.2 LTS, and it is still broken. Passing CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL = (1 << 23) to the cl::Buffer constructor / clCreateBuffer suppresses buffer allocation error -61, but simulation results then become nonsense.

To reproduce:

git clone https://github.com/ProjectPhysX/FluidX3D.git
cd FluidX3D
chmod +x make.sh

Change src/opencl.hpp line 213 to

device_buffer = cl::Buffer(device.get_cl_context(), CL_MEM_READ_WRITE|(1<<23), capacity(), nullptr, &error); // (1<<23) = CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL

Set the benchmark grid resolution in src/setup.cpp from 256³ to 384³ by commenting line 1137 and uncommenting line 1138.

Compile and run with:

./make.sh

If it's broken, it will show impossibly large performance/bandwidth numbers, like:

| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|   28144 |   4306 GB/s |       497 |         6281  10% |                  0s |

If it works, the reported bandwidth will be realistic, like:

| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2500 |    382 GB/s |       149 |         9994  40% |                  0s |

@bashbaug
Contributor

Hi @ProjectPhysX, please note that you fundamentally need two changes to make >4GB allocations "work":

  1. You need to relax allocation limits using CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL. It looks like you are doing this in the steps above and it seems like this is working because you are no longer seeing the CL_INVALID_BUFFER_SIZE allocation error.
  2. You need to tell the compiler that you are using >4GB allocations by passing the -cl-intel-greater-than-4GB-buffer-required program build option. This isn't mentioned in the steps above, so I'm guessing it's missing and this is what is causing the nonsense simulation results. Note, this isn't free, which is why this option is not enabled by default.

If you want to play around with >4GB allocations without hacking around in your code, please consider trying the OpenCL Intercept Layer, specifically the RelaxAllocationLimits control, which will automatically do both of these steps for you.
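
Putting both steps together, a minimal sketch with the OpenCL C++ bindings (the trivial kernel and 6 GiB size are placeholders; the flag value and build option come from the guide and comments above):

```cpp
#include <CL/opencl.hpp>
#include <string>

#ifndef CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL
#define CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL (1 << 23)
#endif

int main() {
    cl::Context context(CL_DEVICE_TYPE_GPU);
    cl_int error = CL_SUCCESS;

    // Step 1: relax the allocation limit for this buffer.
    const size_t bytes = 6ull << 30; // 6 GiB
    cl::Buffer buffer(context,
                      CL_MEM_READ_WRITE | CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL,
                      bytes, nullptr, &error);

    // Step 2: tell the compiler that kernels may index into >4GB buffers.
    const std::string source =
        "kernel void touch(global float* x) { x[get_global_id(0)] = 1.0f; }";
    cl::Program program(context, source);
    error = program.build("-cl-intel-greater-than-4GB-buffer-required");
    return 0;
}
```

With the Intercept Layer, both steps are applied without code changes; if I recall correctly its controls can be set via environment variables (e.g. CLI_RelaxAllocationLimits=1), but check its documentation for the exact mechanism.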

@ProjectPhysX
Author

  2. You need to tell the compiler that you are using >4GB allocations by passing the -cl-intel-greater-than-4GB-buffer-required program build option. This isn't mentioned in the steps above, so I'm guessing it's missing and this is what is causing the nonsense simulation results. Note, this isn't free, which is why this option is not enabled by default.

Thanks, it works! I totally missed that build flag. It would be better to enable >4GB allocations by default, though.

@ElliottDyson

ElliottDyson commented Jan 12, 2024

Any updates on enabling this by default?

I understand there seem to be some performance compromises if this is done? Just out of curiosity, are these being worked on?

If this has been found not to be possible, could a method be implemented so that, instead of erroring, the CPU sends 4GB chunks at a time followed by the remainder, so that more than 4GB can still be sent in "one go", whether for PyTorch or any other application that needs more than 4GB of allocation?
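
For what it's worth, chunked host-to-device transfers are already expressible in user code; a hedged sketch below (assuming the large buffer itself can be created with the unrestricted-size flag, since the 4GB cap applies to the allocation, not to individual transfers):

```cpp
#include <CL/opencl.hpp>
#include <algorithm>
#include <cstddef>

// Sketch: upload a large host array into one device buffer in 1 GiB pieces.
// Chunking the transfer does not by itself fix kernel-side >4GB indexing.
void chunked_upload(cl::CommandQueue& queue, cl::Buffer& buffer,
                    const char* host, size_t total) {
    const size_t chunk = 1ull << 30; // 1 GiB per transfer
    for (size_t offset = 0; offset < total; offset += chunk) {
        const size_t n = std::min(chunk, total - offset);
        queue.enqueueWriteBuffer(buffer, CL_TRUE, offset, n, host + offset);
    }
}
```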

I understand that custom-compiling, say, PyTorch with custom compiler flags is something that can be done, as you have suggested. However, some of us aren't developers, which makes that quite a hurdle.

Thank you

@dbenedb

dbenedb commented Jan 21, 2024

I'm seconding @ElliottDyson's request. The 4GB allocation limit makes Intel GPUs like the Arc A770 useless for any Stable Diffusion work.

@ElliottDyson

I'm seconding @ElliottDyson's request. The 4GB allocation limit makes Intel GPUs like the Arc A770 useless for any Stable Diffusion work.

I'm not sure how visible this thread is since it's been closed. Perhaps we should open a new one that references this? I'm not sure what typical GitHub etiquette is for something like this though, which is why I haven't done so yet.

@lorenzeszz

With Intel UHD graphics the limit is only 1.86GB on a 32GB-RAM system. The OpenVINO GPU Plugin is useless for me.

@simonlui

simonlui commented Oct 25, 2024

I'm going to add another use case that has hit this wall and indisputably needs a larger buffer size than the enforced 4GB limit. Video diffusion has now started to hit its stride with bigger models like Mochi, but even when quantized to 8-bit, I cannot generate more than 7 frames with ComfyUI and a corresponding wrapper plugin: the generated data overflows some memory limit somewhere, and I get either a PI_ERROR_OUT_OF_HOST_MEMORY or an UNKNOWN PI error when running settings that a 4060 Ti with 16GB of VRAM is able to handle.
