-
Notifications
You must be signed in to change notification settings - Fork 801
[SYCL] [FPGA] Add latency control feature to FPGA extension docs #4917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,38 +4,55 @@ | |
| The Intel FPGA `lsu` class is implemented in `sycl/ext/intel/fpga_lsu.hpp` which | ||
| is included in `sycl/ext/intel/fpga_extensions.hpp`. | ||
|
|
||
| The class `cl::sycl::ext::intel::lsu` allows users to explicitly request that the | ||
| The class `sycl::ext::intel::lsu` allows users to explicitly request that the | ||
| implementation of a global memory access is configured in a certain way. The | ||
| class has two member functions, `load()` and `store()` which allow loading from | ||
| and storing to a `multi_ptr`, respectively, and is templated on the following | ||
| 4 optional paremeters: | ||
| 4 optional parameters: | ||
|
|
||
| 1. **`cl::sycl::ext::intel::burst_coalesce<B>`, where `B` is a boolean**: request, | ||
| 1. **`sycl::ext::intel::burst_coalesce<B>`, where `B` is a boolean**: request, | ||
| to the extent possible, that a dynamic burst coalescer be implemented when | ||
| `load` or `store` are called. The default value of this parameter is `false`. | ||
| 2. **`cl::sycl::ext::intel::cache<N>`, where `N` is an integer greater or equal to | ||
| 2. **`sycl::ext::intel::cache<N>`, where `N` is an integer greater or equal to | ||
| 0**: request, to the extent possible, that a read-only cache of the specified | ||
| size in bytes be implemented when when `load` is called. It is not allowed to | ||
| use that parameter for `store`. The default value of this parameter is `0`. | ||
| 3. **`cl::sycl::ext::intel::statically_coalesce<B>`, where `B` is a boolean**: | ||
| 3. **`sycl::ext::intel::statically_coalesce<B>`, where `B` is a boolean**: | ||
| request, to the extent possible, that `load` or `store` accesses, is allowed to | ||
| be statically coalesced with other memory accesses at compile time. The default | ||
| value of this parameter is `true`. | ||
| 4. **`cl::sycl::ext::intel::prefetch<B>`, where `B` is a boolean**: request, to the | ||
| 4. **`sycl::ext::intel::prefetch<B>`, where `B` is a boolean**: request, to the | ||
| extent possible, that a prefetcher be implemented when `load` is called. It is | ||
| not allowed to use that parameter for `store`. The default value of this | ||
| parameter is `false`. | ||
|
|
||
| Currently, not every combination of parameters is allowed due to limitations in | ||
| the backend. The following rules apply: | ||
| 1. For `store`, `cl::sycl::ext::intel::cache` must be `0` and | ||
| `cl::sycl::ext::intel::prefetch` must be `false`. | ||
| 2. For `load`, if `cl::sycl::ext::intel::cache` is set to a value greater than `0`, | ||
| then `cl::sycl::ext::intel::burst_coalesce` must be set to `true`. | ||
| 3. For `load`, exactly one of `cl::sycl::ext::intel::prefetch` and | ||
| `cl::sycl::ext::intel::burst_coalesce` is allowed to be `true`. | ||
| 4. For `load`, exactly one of `cl::sycl::ext::intel::prefetch` and | ||
| `cl::sycl::ext::intel::cache` is allowed to be `true`. | ||
| 1. For `store`, `sycl::ext::intel::cache` must be `0` and | ||
| `sycl::ext::intel::prefetch` must be `false`. | ||
| 2. For `load`, if `sycl::ext::intel::cache` is set to a value greater than `0`, | ||
| then `sycl::ext::intel::burst_coalesce` must be set to `true`. | ||
| 3. For `load`, exactly one of `sycl::ext::intel::prefetch` and | ||
| `sycl::ext::intel::burst_coalesce` is allowed to be `true`. | ||
| 4. For `load`, exactly one of `sycl::ext::intel::prefetch` and | ||
| `sycl::ext::intel::cache` is allowed to be `true`. | ||
|
|
||
| Member functions `load()` or `store()` can take in a property_list as argument, | ||
| which contains the following two properties of latency control: | ||
|
|
||
| 1. **`sycl::ext::intel::latency_anchor_id<N>`, where `N` is an integer**: | ||
| represents ID of the current function call when it performs as an anchor. The ID | ||
| must be unique within the application, with a diagnostic required if that | ||
| condition is not met. | ||
| 2. **`sycl::ext::intel::latency_constraint<A, B, C>`** contains control | ||
| parameters when the current function performs as a non-anchor, where: | ||
| - **`A` is an integer**: The ID of the target anchor defined on a different | ||
| instruction through a `latency_anchor_id` property. | ||
| - **`B` is an enum value**: The type of control from the set | ||
| {`latency::exact`, `latency::min`, `latency::max`}. | ||
| - **`C` is an integer**: The relative clock cycle difference between the | ||
| target anchor and the current function call, that the constraint should | ||
| infer subject to the type of the control (exact, min, max). | ||
|
|
||
| ## Implementation | ||
|
|
||
|
|
@@ -54,8 +71,38 @@ public: | |
| #if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem) | ||
| return *__builtin_intel_fpga_mem((_T *)Ptr, | ||
| _burst_coalesce | _cache | | ||
| _dont_statically_coalesce | _prefetch, | ||
| _cache_val); | ||
| _dont_statically_coalesce | _prefetch, | ||
| _cache_val, | ||
| _reserved_default_anchor_id, 0, 0, 0); | ||
| #else | ||
| return *Ptr; | ||
| #endif | ||
| } | ||
|
|
||
| template <typename _T, access::address_space _space, typename props> | ||
| static _T load(sycl::multi_ptr<_T, _space> Ptr, props p) { | ||
| check_space<_space>(); | ||
| check_load(); | ||
| #if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem) | ||
| static constexpr latency _control_type = props::get_property<latency_constraint>().type; | ||
| int32_t _type; | ||
| if (_control_type == latency::none) { | ||
| _type = 0; | ||
| } else if (_control_type == latency::exact) { | ||
| _type = 1; | ||
| } else if (_control_type == latency::max) { | ||
| _type = 2; | ||
| } else { // _control_type == latency::min | ||
| _type = 3; | ||
| } | ||
| return *__builtin_intel_fpga_mem((_T *)Ptr, | ||
| _burst_coalesce | _cache | | ||
| _dont_statically_coalesce | _prefetch, | ||
| _cache_val, | ||
| props::get_property<latency_anchor_id>().anchor_id, | ||
| props::get_property<latency_constraint>().target_anchor, | ||
| _type, | ||
| props::get_property<latency_constraint>().cycle); | ||
| #else | ||
| return *Ptr; | ||
| #endif | ||
|
|
@@ -68,8 +115,38 @@ public: | |
| #if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem) | ||
| *__builtin_intel_fpga_mem((_T *)Ptr, | ||
| _burst_coalesce | _cache | | ||
| _dont_statically_coalesce | _prefetch, | ||
| _cache_val) = Val; | ||
| _dont_statically_coalesce | _prefetch, | ||
| _cache_val, | ||
| _reserved_default_anchor_id, 0, 0, 0) = Val; | ||
| #else | ||
| *Ptr = Val; | ||
| #endif | ||
| } | ||
|
|
||
| template <typename _T, access::address_space _space, typename props> | ||
| static void store(sycl::multi_ptr<_T, _space> Ptr, _T Val, props p) { | ||
| check_space<_space>(); | ||
| check_store(); | ||
| #if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem) | ||
| static constexpr latency _control_type = props::get_property<latency_constraint>().type; | ||
| int32_t _type; | ||
| if (_control_type == latency::none) { | ||
| _type = 0; | ||
| } else if (_control_type == latency::exact) { | ||
| _type = 1; | ||
| } else if (_control_type == latency::max) { | ||
| _type = 2; | ||
| } else { // _control_type == latency::min | ||
| _type = 3; | ||
| } | ||
| *__builtin_intel_fpga_mem((_T *)Ptr, | ||
| _burst_coalesce | _cache | | ||
| _dont_statically_coalesce | _prefetch, | ||
| _cache_val, | ||
| props::get_property<latency_anchor_id>().anchor_id, | ||
| props::get_property<latency_constraint>().target_anchor, | ||
| _type, | ||
| props::get_property<latency_constraint>().cycle) = Val; | ||
| #else | ||
| *Ptr = Val; | ||
| #endif | ||
|
|
@@ -83,37 +160,45 @@ public: | |
| ```c++ | ||
| #include <sycl/ext/intel/fpga_extensions.hpp> | ||
| ... | ||
| cl::sycl::buffer<int, 1> output_buffer(output_data, 1); | ||
| cl::sycl::buffer<int, 1> input_buffer(input_data, 1); | ||
| sycl::buffer<int, 1> output_buffer(output_data, 1); | ||
| sycl::buffer<int, 1> input_buffer(input_data, 1); | ||
|
|
||
| Queue.submit([&](cl::sycl::handler &cgh) { | ||
| auto output_accessor = output_buffer.get_access<cl::sycl::access::mode::write>(cgh); | ||
| auto input_accessor = input_buffer.get_access<cl::sycl::access::mode::read>(cgh); | ||
| Queue.submit([&](sycl::handler &cgh) { | ||
| auto output_accessor = output_buffer.get_access<sycl::access::mode::write>(cgh); | ||
| auto input_accessor = input_buffer.get_access<sycl::access::mode::read>(cgh); | ||
|
|
||
| cgh.single_task<class kernel>([=] { | ||
| auto input_ptr = input_accessor.get_pointer(); | ||
| auto output_ptr = output_accessor.get_pointer(); | ||
|
|
||
| using PrefetchingLSU = | ||
| cl::sycl::ext::intel::lsu<cl::sycl::ext::intel::prefetch<true>, | ||
| cl::sycl::ext::intel::statically_coalesce<false>>; | ||
| sycl::ext::intel::lsu<sycl::ext::intel::prefetch<true>, | ||
| sycl::ext::intel::statically_coalesce<false>>; | ||
|
|
||
| using BurstCoalescedLSU = | ||
| cl::sycl::ext::intel::lsu<cl::sycl::ext::intel::burst_coalesce<false>, | ||
| cl::sycl::ext::intel::statically_coalesce<false>>; | ||
| sycl::ext::intel::lsu<sycl::ext::intel::burst_coalesce<false>, | ||
| sycl::ext::intel::statically_coalesce<false>>; | ||
|
|
||
| using CachingLSU = | ||
| cl::sycl::ext::intel::lsu<cl::sycl::ext::intel::burst_coalesce<true>, | ||
| cl::sycl::ext::intel::cache<1024>, | ||
| cl::sycl::ext::intel::statically_coalesce<true>>; | ||
| sycl::ext::intel::lsu<sycl::ext::intel::burst_coalesce<true>, | ||
| sycl::ext::intel::cache<1024>, | ||
| sycl::ext::intel::statically_coalesce<true>>; | ||
|
|
||
| using PipelinedLSU = cl::sycl::ext::intel::lsu<>; | ||
| using PipelinedLSU = sycl::ext::intel::lsu<>; | ||
|
|
||
| int X = PrefetchingLSU::load(input_ptr); // int X = input_ptr[0] | ||
| int Y = CachingLSU::load(input_ptr + 1); // int Y = input_ptr[1] | ||
|
|
||
| BurstCoalescedLSU::store(output_ptr, X); // output_ptr[0] = X | ||
| PipelinedLSU::store(output_ptr + 1, Y); // output_ptr[1] = Y | ||
|
|
||
| // latency controls | ||
| // Load is anchor 1 | ||
| int Z = PrefetchingLSU::load(input_ptr + 2, | ||
| property_list{latency_anchor_id<1>}); | ||
| // Store occurs 5 cycles after the anchor 1 read | ||
| BurstCoalescedLSU::store(output_ptr + 2, Z, | ||
| property_list{latency_constraint<1, latency::exact, 5>}); | ||
|
||
| }); | ||
| }); | ||
| ... | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having
Nas an integer looks very low-level.I have the feeling it represents more a name like a pipe, kernel or
specialization_constantwhich can be lowered by the runtime to an ID.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This
Nis just a symbol to demonstrate the syntax of this property. It is a compile-time constant.