Skip to content

Conversation

@mdtoguchi
Copy link
Contributor

The community has moved to using a different model for performing offloading behaviors during link. This document provides details of moving from our current offloading model which performs all of the device link and compilation behaviors in the driver and moves these actions into a separate tool which performs a more 'conglomerate' link step.

The community has moved to using a different model for performing offloading
behaviors during link.  This document provides details of moving from our
current offloading model which performs all of the device link and compilation
behaviors in the driver and moves these actions into a separate tool which
performs a more 'conglomerate' link step.
Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdtoguchi, the document looks good.
I think it's ready for review, so I suggest convert this draft to a pull request.
🔥🚀

Comment on lines 149 to 151
to be controlled by the `clang-linker-wrapper`. There are controlling options
that are available to the user which need to be provided to the wrapper to
understand which device libraries are not desired by the end user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"not desired"? Can you give an example, please?
I thought "device library" is a separate file, so we can just remove the name of the library from the command line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how much it is used in practice, but the behavior is controlled by the existing -fsycl-device-lib=arg. so maybe the wording of 'not desired' is incorrect and should be more just a statement of control that is available that can be used.

@mdtoguchi
Copy link
Contributor Author

@bader, thanks for the comments. I will integrate them and add information about the decision to allow for the ability to read the older format before marking this as ready.

@mdtoguchi mdtoguchi marked this pull request as ready for review March 28, 2023 01:41
@mdtoguchi mdtoguchi requested a review from a team as a code owner March 28, 2023 01:41
@mdtoguchi mdtoguchi requested a review from a team as a code owner March 28, 2023 17:08

It is expected that all new binaries generated with the updated offloading
model will represent the embedded fat object format, moving away from the
`clang-offload-bundler` usage. We will not support a mixing of fat object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a problem for users that want to use existing third party libraries with newer libraries. I'm not sure if that's a use case that is worth addressing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ability to read in the older format should allow for the mixing of old and new libraries. It is not expected to be able to have both old and new format objects in a single archive however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the situation I'm slightly worried about if a third party vendor doesn't upgrade the format. Is there a mechanism for warning the user when mixing the two types, and is there a way to upgrade an existing library to the newer format?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that backwards compatibility is not guaranteed to be preserved forever anyway and there are cases when the application has to be anyway recompiled in order to work with newer runtime.

It is not expected to be able to have both old and new format objects in a single archive however.

@mdtoguchi, I wonder, though, why can't we support that? If we anyway scan each archive member to handle it in accordance with its type, why can't we detect old/new format at that stage and act on it in a correct way? Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could support, we would need to get down to within the archive level and unbundle/extract on an individual basis there. Would there be instances where folks would want to use old objects, compile new objects and put them into an archive, or even just adding new objects to an old archive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mostly just linking old archives with new archives, but it's a contrived case that we may be able to live without.


| Option Name | Purpose |
|------------------------------|----------------------------------------------|
| `--fpga-tool-deps=<arg>` | Comma separated list of dependency files used for FPGA hardware compiles using `aoc` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this list only ever needed for FPGA compile? Or can it be generalized to something like --backend-tool-deps and populated only when necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I think of this, the dependency file that is generated during the compilation should be added to the final binary before the link is called. This would mean that this option is not necessary. It also implies we need to call out the ability to add the dependency file to the fat binary so it can be used during the link/call to aoc.

#### spir64_fpga support

Compilation behaviors involving AOT for FPGA involve an additional call to
the either `aoc` (for Hardware) or `opencl-aot` (for Simulation). This call
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opencl-aot is only needed for what we call emulation (x86 emulation of the user kernel). aoc is used for both hardware and simulation (cycle-accurate simulation of the generated Verilog code) platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I will correct this.

command will be processed by a new options to the wrapper,
`--fpga-tool-arg=<arg>`

The FPGA target also has support for additional generated binaries that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to clarify something here. With this new model, it sounds like it's possible to package binaries like aoco, aocr, or aocx into the fat object files that the link stage can pick up, correct? If yes, then I'm not sure if Diagram 1 is exactly right, since it sounds like you may have an "Offline Compile" stage there. Unless you're implying that there could be a AOT compile stage as part of the "Device Compile".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case of using aocx/aoco/aocr files in object files/archives is not explicitly called out and the diagram is more a reflection of the general AOT compilation case. The usage of the FPGA specific device types should be spelled out more succinctly.

@@ -0,0 +1,326 @@
# Implementation design for offloading model
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a global comment here, so the conversation can be threaded.

It seems like there is a lot of overlap with this design and the change we plan to do for AOT for the optional kernel features. I think this design document should at least describe how those AOT changes will be reflected in this new offload design. We could still implement them as separate phases if we want, though I suspect it will be easier to do some of the work together.

In particular, I think we should consider adding the following "preparatory work" to this design, which will make the optional kernel features support easier:

  • Fully support the interface where specific device names are provided via the -fsycl-targets command line option. Currently, we allow only a single device name like -fsycl-targets=intel_gpu_pvc. We should expand this to allow multiple device names such as -fsycl-targets=intel_gpu_pvc,intel_gpu_dg1.

  • Currently, we invoke ocloc once for all GPU targets. This should be changed to invoke it separately for each GPU device on the -fsycl-targets command line option.

  • Currently, the output from ocloc (i.e. native code for all GPU targets) is bundled as one offload region whose type is PI_DEVICE_BINARY_TYPE_NATIVE. Now that we invoke ocloc separately for each GPU device, will will have a separate offload region for each GPU device named on the -fsycl-targets command line option. We need some more device-specific label for these offload regions other than PI_DEVICE_BINARY_TYPE_NATIVE.

  • The SYCL runtime needs to be adjusted to find the appropriate offload region for the target device.

I think this will affect the offload design in several ways, but one that comes to mind is the --gen-tool-arg option that you propose. This is probably not granular enough because clang-linker-wrapper will invoke ocloc several times, once for each GPU device. Do we need some way to specify different arguments for each invocation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points - current proposal is very singular in nature. In order to support the multiple targets, this will need to be broken down into meaningful pieces where we can specify an arbitrary number of different devices and given each device, a separate ocloc call is performed. This should also provide a way of passing different additional options to each individual ocloc call, effectively enabling support for items like -Xsycl-target-backend=intel_gpu_pvc <opts>

`-Xsycl-target-backend=spir64_x86_64 <opts>` command will be processed by a new
option to the wrapper, `--cpu-tool-arg=<arg>`

### Integration of the sycl-aspect-filter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend to leave this section empty?

FWIW, I think we will change the "optional kernel features" design, so that we do not need the sycl-aspect-filter tool. Instead, I think sycl-post-link can do the filtering itself and emit different LLVM bitcode files for each device when in AOT mode.

Maybe this section should just be deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was kind of a placeholder, but I didn't do anything with it. I will remove.

Copy link
Contributor

@artemrad artemrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change include the development of developer tools that will allow us to extract intermediate files from the executable?

@mdtoguchi
Copy link
Contributor Author

Does this change include the development of developer tools that will allow us to extract intermediate files from the executable?

@artemrad, there is an existing tool named clang-offload-extract which we can retain to allow for extraction of embedded target images. clang-offload-packager can be used to unbundle/extract from generated fat objects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mike for the documentation. Looks good overall. One comment. llvm-spirv translation stage is missing inside the Linker Wrapper. Please add that here and also mention it in the text.

Thanks

Copy link
Contributor

@ajaykumarkannan ajaykumarkannan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I think the proposal looks good to me.

|--------|---------------|----------------|----------------------------|
| CPU | spir64_x86_64 | opencl-aot | `--cpu-tool-arg=<arg>` |
| GPU | spir64_gen | ocloc | `--gen-tool-arg=<arg>` |
| FPGA | spir64_fpga | aoc/opencl-aot | `--fpga-tool-arg=<arg>` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does clang-linker-wrapper know whether it needs to invoke ocloc, opencl-aot, etc.? I guess it must look for the presence of the --gen-tool-arg, --cpu-tool-arg, etc. options? What if there are no "arguments" that need to be passed to that tool? Is this an issue, or do we always have some arguments that need to be specified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the offline compiler is selected by the target triple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added some clarifying information. The clang-linker-wrapper is responsible for full discovery of targets based on findings within the binaries. 431dff7

*Example: clang-linker-wrapper options*

Each OCLOC call will be represented as a separate device binary that is
individually wrapped and linked into the final executable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does clang-linker-wrapper know how many times to invoke ocloc? How does it know which GPU target is associated with each ocloc invocation?

Is the idea that clang-linker-wrapper will count the number of --gen-tool-arg options and parse the value to find the "-device pvc" part?

Would it make sense to change the syntax to make it easier to find the device name, something like:

--gen-tool-arg=pvc "-options extraopt_pvc"

Something else to keep in mind ... my thought is that we will change sycl-post-link to emit a separate device image for each GPU target. This is another reason why clang-linker-wrapper will need to know the set of GPU targets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - we will need to provide a way to differentiate more than just the arch (spir64_gen vs spir64_x86_64). I like your idea of taking the known syntax of -opt=<target> "opts" which we currently use for -Xsycl-target-backend

It is expected that the wrap information that is generated to be wrapped
around the device binary will match current wrapping information that is used
for the exiting offload model. The wrapping in the old model is using the
`clang-offload-wrapper` tool.
Copy link
Contributor

@gmlueck gmlueck May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to implement the AOT part of "optional kernel features", we need to somehow annotate each device binary with an identifier for its GPU target. For example, a device binary created for --gen-tool-arg=pvc needs to be annotated as a "PVC" binary. My thought was to use the GMDID for this annotation. For example, the GMDID for pvc is 12.60.7 (see this listing).

I think we currently annotate AOT binaries with PI_DEVICE_BINARY_TYPE_NATIVE. Is this part of the "wrapping" you describe here? Is there anything about this design that will make it difficult to change this to the GMDID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would have to provide the GMDID mapping in the driver (and continue to update as new ID values come along) so when a user wants intel_gpu_pvc the GMDID will be used and embedded into corresponding information tagged with the device binary within the fat object. When the clang-linker-wrapper is extracting, it will know to use the information as tagged accordingly to set the proper device with the ocloc call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than embedding the list of GMDID's in the driver, I think we can include the GMDID for each target in the device config file that we are designing (#9371). Once that happens, the driver will use the config file information to determine the set of legal targets for -fsycl-targets and also get the associated GMDID from there.

@mdtoguchi mdtoguchi requested a review from a team as a code owner July 11, 2023 17:44
Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but it's somewhat outside of my area of expertise. @sergey-semenov , @steffenlarsen do you want to take a look as well?

Comment on lines 186 to 188
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The device libraries that are linked in is provided by the driver. The driver
is responsible for letting the `clang-linker-wrapper` know what device libraries
are required to be linked in as well as the location.
A list of device libraries that need to linked in with user code is provided by the driver. The driver is also responsible for letting the `clang-linker-wrapper` know the location of the device libraries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this option? Can you please specify example of files do we expect to pass here? Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asudarsa, the libraries here are the ones that are part of the first llvm-link call which does not use --only-needed. Basically device libraries that are required in full. Perhaps this isn't needed for SYCL though but more for OpenMP. I'll clean this out here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `--device-library-location=<arg>` | The location in which the device libraries reside to be used during compilation |
| `--device-library-location=<arg>` | The location in which the device libraries reside |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*Table: Options to control device libraries*
*Table: Options to pass device libraries to the clang-linker-wrapper*

Comment on lines +198 to +200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC

Suggested change
The device libraries are controlled via the `-fno-sycl-device-lib=arg` option
where the driver determines based on this option which libraries to tell the
linker wrapper to pull in.
The driver also passes `-fno-sycl-device-lib=arg` option to the clang-linker-wrapper. This option is used to determine the exact set of device libraries that need to be pulled in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be passed to the linker wrapper. As the driver controls what device libraries are to be linked in, the driver can determine which ones are associated to the -fno-sycl-device-lib option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to repeat this option here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This corresponds to the 'list' of device libraries which are now controlled by the driver. I will remove this.

Copy link
Contributor

@asudarsa asudarsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few changes requested. Please address. Thanks

Copy link
Contributor

@asudarsa asudarsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks

@mdtoguchi
Copy link
Contributor Author

@intel/llvm-gatekeepers, can this be merged?

@aelovikov-intel aelovikov-intel merged commit 41b02b1 into intel:sycl Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants