[L0] Enable Immediate Command List by default given Intel DG2 #1951

nrspruit · 2024-08-07T21:59:55Z

Enabled Immediate Command list usage per queue given Intel DG2 HW.
Removed default setting of false on windows.

github-actions · 2024-08-07T22:02:24Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/10292370779

github-actions · 2024-08-07T22:14:55Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/10292370779
Job status: success. Test status: success.

Summary

Benchmark	This PR	baseline
api_overhead_benchmark_sycl SubmitKernel out of order	26.281	`23.082`
api_overhead_benchmark_sycl SubmitKernel in order	25.33	`22.972`
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	330.433	`298.574`
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	`204.698`	222.377
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	6.699	`6.408`
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	`3.091`	3.116
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.875	`2.806`
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	2.403	`2.322`
miscellaneous_benchmark_sycl VectorSum	`858.246`	859.353
Velocity-Bench Hashtable	330.956108	`328.705328`
Velocity-Bench Bitcracker	`35.6949`	35.7419
Velocity-Bench CudaSift	`218.517`	218.846
Velocity-Bench Easywave	`239`	246.0
Velocity-Bench QuickSilver	117.23	`117.06`
Velocity-Bench Sobel Filter	612.759	`610.354`

Charts

api_overhead_benchmark_sycl SubmitKernel out of order

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl SubmitKernel out of order
    todayMarker off
    dateFormat  X
    axisFormat %s

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=0<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)

        This PR (26.281 μs)   : crit, 0, 26

        baseline (23.082 μs)   :  0, 23

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_sycl SubmitKernel in order

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl SubmitKernel in order
    todayMarker off
    dateFormat  X
    axisFormat %s

    section SubmitKernel(api=sycl<br>Profiling=0<br>Ioq=1<br>DiscardEvents=0<br>NumKernels=10<br>KernelExecTime=1<br>MeasureCompletion=0)

        This PR (25.33 μs)   : crit, 0, 25

        baseline (22.972 μs)   :  0, 22

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QueueInOrderMemcpy(api=sycl<br>IsCopyOnly=0<br>sourcePlacement=Device<br>destinationPlacement=Device<br>size=1KB<br>count=100)

        This PR (330.433 μs)   : crit, 0, 330

        baseline (298.574 μs)   :  0, 298

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QueueInOrderMemcpy(api=sycl<br>IsCopyOnly=0<br>sourcePlacement=Host<br>destinationPlacement=Device<br>size=1KB<br>count=100)

        This PR (204.698 μs)   : crit, 0, 204

        baseline (222.377 μs)   :  0, 222

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QueueMemcpy(api=sycl<br>sourcePlacement=Device<br>destinationPlacement=Device<br>size=1KB)

        This PR (6.699 μs)   : crit, 0, 6

        baseline (6.408 μs)   :  0, 6

    -   : 0, 0

    -   : 0, 0

memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240
    todayMarker off
    dateFormat  X
    axisFormat %s

    section StreamMemory(api=sycl<br>type=Triad<br>size=10KB<br>useEvents=0<br>contents=Zeros<br>memoryPlacement=Device)

        This PR (3.091 μs)   : crit, 0, 3

        baseline (3.116 μs)   :  0, 3

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section ExecImmediateCopyQueue(api=sycl<br>IsCopyOnly=1<br>MeasureCompletionTime=0<br>src=Device<br>dst=Device<br>size=1KB<br>ioq=0)

        This PR (2.875 μs)   : crit, 0, 2

        baseline (2.806 μs)   :  0, 2

    -   : 0, 0

    -   : 0, 0

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024
    todayMarker off
    dateFormat  X
    axisFormat %s

    section ExecImmediateCopyQueue(api=sycl<br>IsCopyOnly=1<br>MeasureCompletionTime=0<br>src=Host<br>dst=Host<br>size=1KB<br>ioq=1)

        This PR (2.403 μs)   : crit, 0, 2

        baseline (2.322 μs)   :  0, 2

    -   : 0, 0

    -   : 0, 0

miscellaneous_benchmark_sycl VectorSum

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title miscellaneous_benchmark_sycl VectorSum
    todayMarker off
    dateFormat  X
    axisFormat %s

    section VectorSum(api=sycl<br>numberOfElementsX=512<br>numberOfElementsY=256<br>numberOfElementsZ=256)

        This PR (858.246 μs)   : crit, 0, 858

        baseline (859.353 μs)   :  0, 859

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Hashtable

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Hashtable
    todayMarker off
    dateFormat  X
    axisFormat %s

    section hashtable

        This PR (330.956108 M keys/sec)   : crit, 0, 330

        baseline (328.705328 M keys/sec)   :  0, 328

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Bitcracker

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Bitcracker
    todayMarker off
    dateFormat  X
    axisFormat %s

    section bitcracker

        This PR (35.6949 s)   : crit, 0, 35

        baseline (35.7419 s)   :  0, 35

    -   : 0, 0

    -   : 0, 0

Velocity-Bench CudaSift

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench CudaSift
    todayMarker off
    dateFormat  X
    axisFormat %s

    section cudaSift

        This PR (218.517 ms)   : crit, 0, 218

        baseline (218.846 ms)   :  0, 218

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Easywave

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Easywave
    todayMarker off
    dateFormat  X
    axisFormat %s

    section easywave

        This PR (239 ms)   : crit, 0, 239

        baseline (246.0 ms)   :  0, 246

    -   : 0, 0

    -   : 0, 0

Velocity-Bench QuickSilver

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench QuickSilver
    todayMarker off
    dateFormat  X
    axisFormat %s

    section QuickSilver

        This PR (117.23 MMS/CTT)   : crit, 0, 117

        baseline (117.06 MMS/CTT)   :  0, 117

    -   : 0, 0

    -   : 0, 0

Velocity-Bench Sobel Filter

---
config:
    gantt:
        rightPadding: 10
        leftPadding: 120
        sectionFontSize: 10
        numberSectionStyles: 2
---
gantt
    title Velocity-Bench Sobel Filter
    todayMarker off
    dateFormat  X
    axisFormat %s

    section sobel_filter

        This PR (612.759 ms)   : crit, 0, 612

        baseline (610.354 ms)   :  0, 610

    -   : 0, 0

    -   : 0, 0

Details

SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=0 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),26.281,26.341,5.63%,23.607,460.934,[CPU],[us]

SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
SubmitKernel(api=sycl Profiling=0 Ioq=1 DiscardEvents=0 NumKernels=10 KernelExecTime=1 MeasureCompletion=0),25.330,25.568,6.15%,22.364,433.367,[CPU],[us]

QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Device destinationPlacement=Device size=1KB count=100)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Device destinationPlacement=Device size=1KB count=100),330.433,332.998,2.91%,300.883,732.033,[CPU],[us]

QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Host destinationPlacement=Device size=1KB count=100)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueInOrderMemcpy(api=sycl IsCopyOnly=0 sourcePlacement=Host destinationPlacement=Device size=1KB count=100),204.698,197.360,17.86%,190.390,577.874,[CPU],[us]

QueueMemcpy(api=sycl sourcePlacement=Device destinationPlacement=Device size=1KB)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
QueueMemcpy(api=sycl sourcePlacement=Device destinationPlacement=Device size=1KB),6.699,6.406,19.81%,6.019,116.584,[CPU],[us]

StreamMemory(api=sycl type=Triad size=10KB useEvents=0 contents=Zeros memoryPlacement=Device)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=StreamMemory --csv --noHeaders --iterations=10000 --type=Triad --size=10240 --memoryPlacement=Device --useEvents=0 --contents=Zeros

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
StreamMemory(api=sycl type=Triad size=10KB useEvents=0 contents=Zeros memoryPlacement=Device),3.091,3.097,2.84%,0.686,3.368,[CPU],[GB/s]

ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Device dst=Device size=1KB ioq=0)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Device dst=Device size=1KB ioq=0),2.875,2.866,7.42%,2.668,60.993,[CPU],[us]

ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Host dst=Host size=1KB ioq=1)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
ExecImmediateCopyQueue(api=sycl IsCopyOnly=1 MeasureCompletionTime=0 src=Host dst=Host size=1KB ioq=1),2.403,2.397,6.06%,2.280,43.808,[CPU],[us]

VectorSum(api=sycl numberOfElementsX=512 numberOfElementsY=256 numberOfElementsZ=256)

Environment Variables:

Command:

/home/test-user/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

Output:

TestCase,Mean,Median,StdDev,Min,Max,Type
VectorSum(api=sycl numberOfElementsX=512 numberOfElementsY=256 numberOfElementsZ=256),858.246,858.902,0.45%,821.607,871.997,[GPU],bw [GB/s]

hashtable

Environment Variables:

Command:

/home/test-user/bench_workdir/hashtable/hashtable_sycl --no-verify

Output:

hashtable - total time for whole calculation: 0.405545 s
330.956108 million keys/second

bitcracker

Environment Variables:

Command:

/home/test-user/bench_workdir/bitcracker/bitcracker -f /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00403515 s
bitcracker - total time for whole calculation: 35.6949 s

cudaSift

Environment Variables:

Command:

/home/test-user/bench_workdir/cudaSift/cudaSift

Output:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1235 1272 33.5324% 1 2