-
Notifications
You must be signed in to change notification settings - Fork 802
[SYCL][Docs] Add design for moving to the new offloading model #8658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The community has moved to using a different model for performing offloading behaviors during link. This document provides details of moving from our current offloading model which performs all of the device link and compilation behaviors in the driver and moves these actions into a separate tool which performs a more 'conglomerate' link step.
bader
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdtoguchi, the document looks good.
I think it's ready for review, so I suggest convert this draft to a pull request.
🔥🚀
sycl/doc/design/OffloadDesign.md
Outdated
| to be controlled by the `clang-linker-wrapper`. There are controlling options | ||
| that are available to the user which need to be provided to the wrapper to | ||
| understand which device libraries are not desired by the end user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"not desired"? Can you give an example, please?
I thought "device library" is a separate file, so we can just remove the name of the library from the command line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how much it is used in practice, but the behavior is controlled by the existing -fsycl-device-lib=arg. so maybe the wording of 'not desired' is incorrect and should be more just a statement of control that is available that can be used.
|
@bader, thanks for the comments. I will integrate them and add information about the decision to allow for the ability to read the older format before marking this as ready. |
|
|
||
| It is expected that all new binaries generated with the updated offloading | ||
| model will represent the embedded fat object format, moving away from the | ||
| `clang-offload-bundler` usage. We will not support a mixing of fat object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a problem for users that want to use existing third party libraries with newer libraries. I'm not sure if that's a use case that is worth addressing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ability to read in the older format should allow for the mixing of old and new libraries. It is not expected to be able to have both old and new format objects in a single archive however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the situation I'm slightly worried about if a third party vendor doesn't upgrade the format. Is there a mechanism for warning the user when mixing the two types, and is there a way to upgrade an existing library to the newer format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that backwards compatibility is not guaranteed to be preserved forever anyway and there are cases when the application has to be anyway recompiled in order to work with newer runtime.
It is not expected to be able to have both old and new format objects in a single archive however.
@mdtoguchi, I wonder, though, why can't we support that? If we anyway scan each archive member to handle it in accordance with its type, why can't we detect old/new format at that stage and act on it in a correct way? Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could support, we would need to get down to within the archive level and unbundle/extract on an individual basis there. Would there be instances where folks would want to use old objects, compile new objects and put them into an archive, or even just adding new objects to an old archive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think mostly just linking old archives with new archives, but it's a contrived case that we may be able to live without.
sycl/doc/design/OffloadDesign.md
Outdated
|
|
||
| | Option Name | Purpose | | ||
| |------------------------------|----------------------------------------------| | ||
| | `--fpga-tool-deps=<arg>` | Comma separated list of dependency files used for FPGA hardware compiles using `aoc` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this list only ever needed for FPGA compile? Or can it be generalized to something like --backend-tool-deps and populated only when necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I think of this, the dependency file that is generated during the compilation should be added to the final binary before the link is called. This would mean that this option is not necessary. It also implies we need to call out the ability to add the dependency file to the fat binary so it can be used during the link/call to aoc.
sycl/doc/design/OffloadDesign.md
Outdated
| #### spir64_fpga support | ||
|
|
||
| Compilation behaviors involving AOT for FPGA involve an additional call to | ||
| the either `aoc` (for Hardware) or `opencl-aot` (for Simulation). This call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opencl-aot is only needed for what we call emulation (x86 emulation of the user kernel). aoc is used for both hardware and simulation (cycle-accurate simulation of the generated Verilog code) platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - I will correct this.
| command will be processed by a new options to the wrapper, | ||
| `--fpga-tool-arg=<arg>` | ||
|
|
||
| The FPGA target also has support for additional generated binaries that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to clarify something here. With this new model, it sounds like it's possible to package binaries like aoco, aocr, or aocx into the fat object files that the link stage can pick up, correct? If yes, then I'm not sure if Diagram 1 is exactly right, since it sounds like you may have an "Offline Compile" stage there. Unless you're implying that there could be a AOT compile stage as part of the "Device Compile".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case of using aocx/aoco/aocr files in object files/archives is not explicitly called out and the diagram is more a reflection of the general AOT compilation case. The usage of the FPGA specific device types should be spelled out more succinctly.
sycl/doc/design/OffloadDesign.md
Outdated
| @@ -0,0 +1,326 @@ | |||
| # Implementation design for offloading model | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a global comment here, so the conversation can be threaded.
It seems like there is a lot of overlap with this design and the change we plan to do for AOT for the optional kernel features. I think this design document should at least describe how those AOT changes will be reflected in this new offload design. We could still implement them as separate phases if we want, though I suspect it will be easier to do some of the work together.
In particular, I think we should consider adding the following "preparatory work" to this design, which will make the optional kernel features support easier:
-
Fully support the interface where specific device names are provided via the
-fsycl-targetscommand line option. Currently, we allow only a single device name like-fsycl-targets=intel_gpu_pvc. We should expand this to allow multiple device names such as-fsycl-targets=intel_gpu_pvc,intel_gpu_dg1. -
Currently, we invoke
ocloconce for all GPU targets. This should be changed to invoke it separately for each GPU device on the-fsycl-targetscommand line option. -
Currently, the output from
ocloc(i.e. native code for all GPU targets) is bundled as one offload region whose type isPI_DEVICE_BINARY_TYPE_NATIVE. Now that we invokeoclocseparately for each GPU device, will will have a separate offload region for each GPU device named on the-fsycl-targetscommand line option. We need some more device-specific label for these offload regions other thanPI_DEVICE_BINARY_TYPE_NATIVE. -
The SYCL runtime needs to be adjusted to find the appropriate offload region for the target device.
I think this will affect the offload design in several ways, but one that comes to mind is the --gen-tool-arg option that you propose. This is probably not granular enough because clang-linker-wrapper will invoke ocloc several times, once for each GPU device. Do we need some way to specify different arguments for each invocation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points - current proposal is very singular in nature. In order to support the multiple targets, this will need to be broken down into meaningful pieces where we can specify an arbitrary number of different devices and given each device, a separate ocloc call is performed. This should also provide a way of passing different additional options to each individual ocloc call, effectively enabling support for items like -Xsycl-target-backend=intel_gpu_pvc <opts>
sycl/doc/design/OffloadDesign.md
Outdated
| `-Xsycl-target-backend=spir64_x86_64 <opts>` command will be processed by a new | ||
| option to the wrapper, `--cpu-tool-arg=<arg>` | ||
|
|
||
| ### Integration of the sycl-aspect-filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intend to leave this section empty?
FWIW, I think we will change the "optional kernel features" design, so that we do not need the sycl-aspect-filter tool. Instead, I think sycl-post-link can do the filtering itself and emit different LLVM bitcode files for each device when in AOT mode.
Maybe this section should just be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was kind of a placeholder, but I didn't do anything with it. I will remove.
artemrad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this change include the development of developer tools that will allow us to extract intermediate files from the executable?
@artemrad, there is an existing tool named |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Mike for the documentation. Looks good overall. One comment. llvm-spirv translation stage is missing inside the Linker Wrapper. Please add that here and also mention it in the text.
Thanks
ajaykumarkannan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, I think the proposal looks good to me.
| |--------|---------------|----------------|----------------------------| | ||
| | CPU | spir64_x86_64 | opencl-aot | `--cpu-tool-arg=<arg>` | | ||
| | GPU | spir64_gen | ocloc | `--gen-tool-arg=<arg>` | | ||
| | FPGA | spir64_fpga | aoc/opencl-aot | `--fpga-tool-arg=<arg>` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does clang-linker-wrapper know whether it needs to invoke ocloc, opencl-aot, etc.? I guess it must look for the presence of the --gen-tool-arg, --cpu-tool-arg, etc. options? What if there are no "arguments" that need to be passed to that tool? Is this an issue, or do we always have some arguments that need to be specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the offline compiler is selected by the target triple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some clarifying information. The clang-linker-wrapper is responsible for full discovery of targets based on findings within the binaries. 431dff7
| *Example: clang-linker-wrapper options* | ||
|
|
||
| Each OCLOC call will be represented as a separate device binary that is | ||
| individually wrapped and linked into the final executable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does clang-linker-wrapper know how many times to invoke ocloc? How does it know which GPU target is associated with each ocloc invocation?
Is the idea that clang-linker-wrapper will count the number of --gen-tool-arg options and parse the value to find the "-device pvc" part?
Would it make sense to change the syntax to make it easier to find the device name, something like:
--gen-tool-arg=pvc "-options extraopt_pvc"
Something else to keep in mind ... my thought is that we will change sycl-post-link to emit a separate device image for each GPU target. This is another reason why clang-linker-wrapper will need to know the set of GPU targets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - we will need to provide a way to differentiate more than just the arch (spir64_gen vs spir64_x86_64). I like your idea of taking the known syntax of -opt=<target> "opts" which we currently use for -Xsycl-target-backend
| It is expected that the wrap information that is generated to be wrapped | ||
| around the device binary will match current wrapping information that is used | ||
| for the exiting offload model. The wrapping in the old model is using the | ||
| `clang-offload-wrapper` tool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to implement the AOT part of "optional kernel features", we need to somehow annotate each device binary with an identifier for its GPU target. For example, a device binary created for --gen-tool-arg=pvc needs to be annotated as a "PVC" binary. My thought was to use the GMDID for this annotation. For example, the GMDID for pvc is 12.60.7 (see this listing).
I think we currently annotate AOT binaries with PI_DEVICE_BINARY_TYPE_NATIVE. Is this part of the "wrapping" you describe here? Is there anything about this design that will make it difficult to change this to the GMDID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would have to provide the GMDID mapping in the driver (and continue to update as new ID values come along) so when a user wants intel_gpu_pvc the GMDID will be used and embedded into corresponding information tagged with the device binary within the fat object. When the clang-linker-wrapper is extracting, it will know to use the information as tagged accordingly to set the proper device with the ocloc call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than embedding the list of GMDID's in the driver, I think we can include the GMDID for each target in the device config file that we are designing (#9371). Once that happens, the driver will use the config file information to determine the set of legal targets for -fsycl-targets and also get the associated GMDID from there.
Clean up some smaller topics regarding fat objects and move them Add offload-packager usage information
aelovikov-intel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but it's somewhat outside of my area of expertise. @sergey-semenov , @steffenlarsen do you want to take a look as well?
sycl/doc/design/OffloadDesign.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The device libraries that are linked in is provided by the driver. The driver | |
| is responsible for letting the `clang-linker-wrapper` know what device libraries | |
| are required to be linked in as well as the location. | |
| A list of device libraries that need to linked in with user code is provided by the driver. The driver is also responsible for letting the `clang-linker-wrapper` know the location of the device libraries. |
sycl/doc/design/OffloadDesign.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this option? Can you please specify example of files do we expect to pass here? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asudarsa, the libraries here are the ones that are part of the first llvm-link call which does not use --only-needed. Basically device libraries that are required in full. Perhaps this isn't needed for SYCL though but more for OpenMP. I'll clean this out here.
sycl/doc/design/OffloadDesign.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| | `--device-library-location=<arg>` | The location in which the device libraries reside to be used during compilation | | |
| | `--device-library-location=<arg>` | The location in which the device libraries reside | |
sycl/doc/design/OffloadDesign.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| *Table: Options to control device libraries* | |
| *Table: Options to pass device libraries to the clang-linker-wrapper* |
sycl/doc/design/OffloadDesign.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC
| The device libraries are controlled via the `-fno-sycl-device-lib=arg` option | |
| where the driver determines based on this option which libraries to tell the | |
| linker wrapper to pull in. | |
| The driver also passes `-fno-sycl-device-lib=arg` option to the clang-linker-wrapper. This option is used to determine the exact set of device libraries that need to be pulled in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be passed to the linker wrapper. As the driver controls what device libraries are to be linked in, the driver can determine which ones are associated to the -fno-sycl-device-lib option.
sycl/doc/design/OffloadDesign.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to repeat this option here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This corresponds to the 'list' of device libraries which are now controlled by the driver. I will remove this.
asudarsa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few changes requested. Please address. Thanks
asudarsa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks
|
@intel/llvm-gatekeepers, can this be merged? |
The community has moved to using a different model for performing offloading behaviors during link. This document provides details of moving from our current offloading model which performs all of the device link and compilation behaviors in the driver and moves these actions into a separate tool which performs a more 'conglomerate' link step.