[OpenMP] Memory transfer optimizations #180
@llvm/issue-subscribers-openmp Author: Johannes Doerfert (jdoerfert)
When we have memory transfers from the host to a device, or any long-running (I/O) method that can be split into a begin part and a wait part, we can try to hide the latency. (For now this is focused on memory transfers in OpenMP target offloading, but the scheme should apply to CUDA and other languages as well.)

Given a blocking cross-device memory transfer such as `blocking_memcpy_host2device(Dst, Src, N)`, we first want to split it into two parts, the "issue" and the "wait", something like: `handle = async_issue_memcpy_host2device(Dst, Src, N); wait(handle, Dst, Src, N)`. Then we want to move the two calls apart, so that the issue is executed earlier and the wait later. The code we can legally move in between may then be executed while the memcpy is performed, effectively reducing the latency. Note that this also works if we start with an async version.
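The split-and-move idea can be sketched on the host alone. The function names below are the placeholders used in this issue, not real OpenMP runtime entry points. The "device" copy is an assumption made for illustration: it is simulated with a plain copy running on a `std::async` worker thread. `independent_work` and `transfer_then_compute` are hypothetical helpers standing in for code that touches neither `Dst` nor `Src` and can therefore legally sit between the issue and the wait.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for the blocking transfer (assumption: a host-side copy, no real device).
void blocking_memcpy_host2device(int *Dst, const int *Src, std::size_t N) {
  std::copy_n(Src, N, Dst);
}

// Hypothetical handle type; a real runtime would use its own handle.
using Handle = std::future<void>;

// "Issue" half: start the copy and return immediately.
Handle async_issue_memcpy_host2device(int *Dst, const int *Src, std::size_t N) {
  return std::async(std::launch::async, blocking_memcpy_host2device, Dst, Src, N);
}

// "Wait" half: block until the copy started by the matching issue has finished.
// Dst, Src and N are kept in the signature to mirror the issue text; unused here.
void wait(Handle &H, int * /*Dst*/, const int * /*Src*/, std::size_t /*N*/) {
  H.get();
}

// Hypothetical work that reads neither Dst nor Src, so it may overlap the copy.
long independent_work(const std::vector<int> &Other) {
  long Sum = 0;
  for (int V : Other)
    Sum += V;
  return Sum;
}

// After the transformation: issue hoisted up, wait sunk to just before Dst is needed.
long transfer_then_compute(std::vector<int> &Dst, const std::vector<int> &Src,
                           const std::vector<int> &Other) {
  assert(Dst.size() >= Src.size());
  Handle H = async_issue_memcpy_host2device(Dst.data(), Src.data(), Src.size());
  long Sum = independent_work(Other); // may run while the copy is in flight
  wait(H, Dst.data(), Src.data(), Src.size());
  return Sum; // Dst is only safe to read from this point on
}
```

Any access to `Dst` or write to `Src` has to stay outside the issue/wait window; only code that is independent of both buffers may be moved into it.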