[L0] Enable Immediate Command List by default given Intel DG2 #1951
nrspruit commented on Aug 7, 2024
- Enabled Immediate Command List usage per queue by default on Intel DG2 hardware.
- Removed the default setting of false on Windows.
Compute Benchmarks level_zero run (with params: ):
Compute Benchmarks level_zero run (): Summary
Charts

| Benchmark | This PR | Baseline |
|---|---|---|
| api_overhead_benchmark_sycl SubmitKernel out of order | 26.281 μs | 23.082 μs |
| api_overhead_benchmark_sycl SubmitKernel in order | 25.33 μs | 22.972 μs |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 330.433 μs | 298.574 μs |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 204.698 μs | 222.377 μs |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 6.699 μs | 6.408 μs |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.091 μs | 3.116 μs |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.875 μs | 2.806 μs |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 2.403 μs | 2.322 μs |
| miscellaneous_benchmark_sycl VectorSum | 858.246 μs | 859.353 μs |
| Velocity-Bench Hashtable | 330.956108 M keys/sec | 328.705328 M keys/sec |
| Velocity-Bench Bitcracker | 35.6949 s | 35.7419 s |
| Velocity-Bench CudaSift | 218.517 ms | 218.846 ms |
| Velocity-Bench Easywave | 239 ms | 246.0 ms |
| Velocity-Bench QuickSilver | 117.23 MMS/CTT | 117.06 MMS/CTT |
| Velocity-Bench Sobel Filter | 612.759 ms | 610.354 ms |
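The "This PR" and "baseline" figures above are easier to compare as relative changes. The following is a minimal sketch of that arithmetic, not part of the benchmark tooling; the names `percentChange`, `Metric`, and `isRegression` are hypothetical. Whether a larger value is better is not encoded in the charts, so the caller has to supply it (lower is better for the timings, higher is better for a throughput such as M keys/sec).

```cpp
#include <cassert>
#include <cmath>

// Relative change of the PR value against the baseline, in percent.
// A positive result means the PR value is larger than the baseline.
inline double percentChange(double pr, double baseline) {
  assert(baseline != 0.0 && "baseline must be non-zero");
  return (pr - baseline) / baseline * 100.0;
}

// Hypothetical tag: the direction of "better" must come from the caller.
enum class Metric { LowerIsBetter, HigherIsBetter };

// True when the PR value is strictly worse than the baseline for the
// given metric direction.
inline bool isRegression(double pr, double baseline, Metric m) {
  const double delta = percentChange(pr, baseline);
  return m == Metric::LowerIsBetter ? delta > 0.0 : delta < 0.0;
}
```

For example, `percentChange(26.281, 23.082)` gives roughly +13.9% for SubmitKernel out of order (slower), while `percentChange(204.698, 222.377)` gives roughly -8.0% for QueueInOrderMemcpy from Host to Device (faster). These are single-run figures, so small differences may be noise.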
Details

SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Device destinationPlacement=Device size=1KB count=100)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Host destinationPlacement=Device size=1KB count=100)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

QueueMemcpy(api=sycl sourcePlacement=Device destinationPlacement=Device size=1KB)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

StreamMemory(api=sycl type=Triad size=10KB useEvents=0 contents=Zeros memoryPlacement=Device)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=StreamMemory --csv --noHeaders --iterations=10000 --type=Triad --size=10240 --memoryPlacement=Device --useEvents=0 --contents=Zeros
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Device dst=Device size=1KB ioq=0)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Host dst=Host size=1KB ioq=1)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

VectorSum(api=sycl numberOfElementsX=512 numberOfElementsY=256 numberOfElementsZ=256)
Environment Variables:
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

hashtable
Environment Variables:
Command: /home/test-user/bench_workdir/hashtable/hashtable_sycl --no-verify
Output: hashtable - total time for whole calculation: 0.405545 s

bitcracker
Environment Variables:
Command: /home/test-user/bench_workdir/bitcracker/bitcracker -f /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000
Output: ---------> BitCracker: BitLocker password cracking tool <--------- ==================================
Force-pushed: 0c07008 to 2a381d5
Force-pushed: 2a381d5 to 1912fb4
As discovered in oneapi-src#1951 (comment), there is an issue running the command-buffer CTS tests on L0 with command-lists enabled. After investigation, the cause was that the CTS tests were not using the output event parameter of the command-buffer enqueue. In the L0 adapter code, the path that registers an event with the queue, so that the queue can wait on the workload, was guarded by this event being set. Therefore `urQueueFinish` was not working as expected, because the queue was not aware of this work to wait on. Fixed by always registering the work to wait on with the queue, and skipping only the propagation of the created event when the user doesn't pass an output event, rather than not creating the event at all.
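The failure mode described in this commit message can be illustrated with a small model. This is a sketch only: `Event`, `Queue`, `enqueueGuarded`, and `enqueueAlwaysRegistered` are hypothetical stand-ins and do not reproduce the L0 adapter's real types or functions. `Queue::finish` plays the role of `urQueueFinish` by "waiting" on every event the queue knows about.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an adapter event.
struct Event {
  bool signaled = false;
};

// Hypothetical stand-in for a queue that tracks outstanding work.
struct Queue {
  std::vector<std::shared_ptr<Event>> pending;

  // Models a queue finish: marks every registered event as complete and
  // returns how many pieces of work the queue actually waited on.
  std::size_t finish() {
    const std::size_t waited = pending.size();
    for (auto &e : pending)
      e->signaled = true;
    pending.clear();
    return waited;
  }
};

// Pattern before the fix: the event is created and registered only when
// the caller asks for an output event, so the queue may never see the work.
inline void enqueueGuarded(Queue &q, std::shared_ptr<Event> *outEvent) {
  if (outEvent) {
    auto e = std::make_shared<Event>();
    q.pending.push_back(e);
    *outEvent = e;
  }
}

// Pattern after the fix: the event is always created and registered;
// only the propagation to the caller depends on outEvent being set.
inline void enqueueAlwaysRegistered(Queue &q,
                                    std::shared_ptr<Event> *outEvent) {
  auto e = std::make_shared<Event>();
  q.pending.push_back(e);
  if (outEvent)
    *outEvent = e;
}
```

With `enqueueGuarded(q, nullptr)`, a following `q.finish()` waits on nothing, which mirrors the CTS failure; with `enqueueAlwaysRegistered(q, nullptr)` the queue waits on the enqueued work even though the caller never received an event.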
Force-pushed: 1912fb4 to e2cc90c
-pre-commit PR for oneapi-src/unified-runtime#1951
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Failures are seen in intel/llvm#15054 for memcpy 2D; until that is resolved, this PR is in draft.
@nrspruit Is this PR a WIP?
Hello @omarahmed1111, yes. This patch exposes a problem in an L0 driver, so it is pending and will most likely need to be updated before it can be merged. I will remove the 0.10.x until we can get this resolved.
Force-pushed: e2cc90c to a6d049f
- Enabled Immediate Command list usage per queue given Intel DG2 HW.
- Removed default setting of false on windows.
- Added check to only enable this default given a minimum driver version.

Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
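The three points of this commit message can be read as a small decision function. The sketch below is an assumption-laden illustration, not the adapter's code: `DeviceInfo`, `ImmCmdListMode`, `chooseDefaultMode`, the user-override parameter, and especially `kHypotheticalMinDriverBuild` are invented for the example, and the real minimum driver version is not stated in this conversation.

```cpp
#include <cassert>
#include <optional>

// Hypothetical modes; only the two needed for the example are modelled.
enum class ImmCmdListMode { NotUsed, PerQueue };

// Hypothetical device description.
struct DeviceInfo {
  bool isDG2;           // true when the device is Intel DG2 hardware
  unsigned driverBuild; // assumed monotonically increasing build number
};

// Placeholder threshold, NOT the real minimum driver version.
constexpr unsigned kHypotheticalMinDriverBuild = 1000;

// Picks the default mode. An explicit user setting (assumed to exist,
// since the commit only changes the default) always wins. The host OS is
// deliberately not consulted, reflecting the removal of the
// Windows-specific default of false.
inline ImmCmdListMode
chooseDefaultMode(const DeviceInfo &dev,
                  std::optional<ImmCmdListMode> userOverride) {
  if (userOverride)
    return *userOverride;
  if (dev.isDG2 && dev.driverBuild >= kHypotheticalMinDriverBuild)
    return ImmCmdListMode::PerQueue;
  return ImmCmdListMode::NotUsed;
}
```

Under these assumptions, a DG2 device with an older driver keeps the previous behaviour, which is the purpose of the added version check.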
Force-pushed: a6d049f to 6cd6338
E2E failure is unrelated to this change.
Based on the CI results in intel/llvm#15031 and intel/llvm#15054, there are no failures caused by this change. The Jenkins jobs are failing randomly.
Awaiting one more re-review before this is ready to merge.
[L0] Enable Immediate Command List by default given Intel DG2