
AdaptiveCpp for microcontrollers and Miosix OS #1720

Open · wants to merge 10 commits into develop

Conversation

@NiccoloN commented Feb 15, 2025

This PR makes the necessary changes to AdaptiveCpp to enable the omp.library-only backend on multicore microcontrollers, in particular for the Miosix embedded OS. This work was conducted at HEAPLab, Politecnico di Milano, to bring SYCL into the world of embedded systems and microcontrollers.

add COMPILE_TOOLS option to cmake
add COMPILE_FOR_MICROCONTROLLERS definition to config.hpp.in
add compile-for-microcontrollers setting to acpp core config file
add custom rng for miosix as it doesn't support tls
…for microcontrollers

fix a bug that causes a linker/compiler error if flag lines are empty
…s where long is 32B: int was always assumed to be 32B, so when long is 32B the same function gets defined twice through macro expansion
… the new definition COMPILE_FOR_MICROCONTROLLERS
@illuhad (Collaborator) left a comment:

Hi!

Some general questions here:

  • If we add a new mode with a certain restricted functionality like you propose, it's pretty much guaranteed that after some time of ongoing development in other areas this feature will be unintentionally broken by other changes without explicit testing. Are you willing to maintain this mode? Is there a way to add it to CI?
  • Miosix seems to support file systems (according to https://miosix.org/), so why do we need to disable all of the std::filesystem stuff? AdaptiveCpp requires C++17, and std::filesystem is part of C++17.
  • What is the overall goal here, functionality-wise? You say that you want to support omp.library-only; however, this mode is intended more for debugging. It's not going to perform well at all (which is expected and by design) for any non-trivial SYCL nd_range kernel. omp.accelerated or generic targets will perform much better (orders of magnitude better!). Do you plan to add support for these?
  • Strategically, we've moved away from individual feature enabling/disabling flags towards fixed compiler support profiles (see ACPP_COMPILER_FEATURE_PROFILE) to simplify things for users. How would this microcontroller mode fit into this? I'd very much like to fold the mode into these profiles rather than reintroduce specific feature enabling/disabling flags with specific tradeoffs in terms of dependencies and feature support.

@@ -52,6 +52,8 @@ void *omp_allocator::raw_allocate(size_t min_alignment, size_t size_bytes,
// but it's unclear if it's a Mac, or libc++, or toolchain issue
#ifdef __APPLE__
return aligned_alloc(min_alignment, size_bytes);
#elif defined(_MIOSIX)
return malloc(size_bytes);
@illuhad (Collaborator):

Does Miosix guarantee that every allocation is aligned to the maximum possible value? Otherwise, this looks incorrect if min_alignment > 1.


malloc() and operator new on Miosix (and Linux/Windows/macOS) return memory aligned for the C/C++ data type with the strictest alignment requirement, which from memory should be long double, so no other logic is needed to achieve this. Otherwise, on CPU architectures that do not support unaligned memory access, the CPU will fault when accessing that memory; we do have some microcontroller targets that fault on unaligned accesses, so we know.
I read the AdaptiveCpp code, and the only uses of the alignment we could find either pass the value 32 for cache alignment or use alignof on data structures, so it seems safe.

@illuhad (Collaborator):

malloc() and operator new on Miosix (and Linux/Windows/macOS) return memory aligned for the C/C++ data type with the strictest alignment requirement, which from memory should be long double, so no other logic is needed to achieve this.

Are there no vector load/stores on your hardware?

I read the AdaptiveCpp code, and the only uses of the alignment we could find either pass the value 32 for cache alignment or use alignof on data structures, so it seems safe.

Well... the alignment is exposed to users in the form of sycl::aligned_alloc. If you just ignore the alignment, then technically the SYCL specification is violated, even if there is perhaps no strong reason to have data aligned to more than long double. It causes an observable change in API behavior compared to what users are technically guaranteed.

Now, we currently also kind of handwave this on the CUDA and HIP backends, where we also don't take this parameter into account - but there we at least know that we generally get page-aligned memory, so there's less risk of not doing what the user wants.


ARM Cortex-M is a family of instruction sets for RISC CPUs of varying complexity and performance, but still limited to microcontroller-class performance.
The highest-performance CPU we have access to, and thus support, is the Cortex-M7. Being a RISC CPU, memory access is done only through load/store instructions, and the widest load/store it can perform is for a 64-bit value (the ldrd/strd instructions), so the strictest alignment we need is 8 bytes. malloc() is already overkill, as it provides 16-byte alignment. This architecture does have some vector instructions, but they are very limited, like doing 4 separate byte adds in a 32-bit register or similar. Not fancy.
The architecture we're currently targeting for SYCL is way more limited. It's one of the few dual-core microcontrollers, but the cores are Cortex-M0: no vector instructions, no hardware floating point, no hardware integer divide, limited hardware integer multiply. The Raspberry Pi microcontroller does not even have external RAM or caches; 264 KByte of RAM is all you've got. We were pleased to find that a simple matrix multiply example with Miosix + AdaptiveCpp requires only 48 KB of RAM, and there's room for improvement, so AdaptiveCpp can become really usable on this platform.
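For completeness, if min_alignment ever needs to be honored on a platform without aligned_alloc or posix_memalign, a minimal sketch built only on malloc/free could look like the pair below. This is purely illustrative: aligned_malloc/aligned_free are made-up names, not AdaptiveCpp or Miosix API, and wiring this into raw_allocate would also require a matching change in the corresponding free path.

#include <cstdlib>
#include <cstdint>
#include <cstring>

// Illustrative aligned allocation on top of plain malloc/free.
// Assumes `alignment` is a power of two.
void* aligned_malloc(std::size_t alignment, std::size_t size) {
  // Over-allocate: worst-case padding plus room to stash the original pointer.
  void* raw = std::malloc(size + alignment + sizeof(void*));
  if (!raw)
    return nullptr;
  std::uintptr_t base = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
  std::uintptr_t aligned =
      (base + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
  // Remember where the allocation really started, just below the aligned block.
  std::memcpy(reinterpret_cast<void**>(aligned) - 1, &raw, sizeof(void*));
  return reinterpret_cast<void*>(aligned);
}

void aligned_free(void* p) {
  if (!p)
    return;
  void* raw;
  std::memcpy(&raw, reinterpret_cast<void**>(p) - 1, sizeof(void*));
  std::free(raw);
}

Whether that extra bookkeeping is worth it, given the 8-byte requirement described above, is exactly the trade-off discussed in this thread.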

x ^= x << 25;
x ^= x >> 27;
uint64_t result = x * 2685821657736338717ull;
asm volatile("cpsie i":::"memory");
@illuhad (Collaborator):

Can you provide some background on what these asm instructions do? Is the seed dynamic, or set statically to x?


This is some temporary code to disable/enable interrupts on ARM Cortex-M CPUs. Eventually we'll evaluate the performance and switch to a mutex or some other solution once we figure out how to make the kernel headers available when compiling AdaptiveCpp. Currently Miosix has no support for thread-local storage, so the original code does not compile; besides, std::mt19937 is a huge data structure, taking nearly 5 KBytes of RAM, and having one instance per thread is definitely not a good idea on microcontrollers.
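For context, here is a minimal sketch of the pattern described above: one global xorshift64*-style generator guarded by masking interrupts, instead of per-thread std::mt19937 state. This is illustrative only; it is ARM Cortex-M specific, and the seed constant and function name are made up rather than taken from the PR.

#include <cstdint>

// One shared generator state instead of one std::mt19937 per thread
// (Miosix currently has no thread-local storage).
static std::uint64_t rng_state = 0x853c49e6748fea9bull; // illustrative non-zero seed

std::uint64_t next_random() {
  asm volatile("cpsid i":::"memory");  // mask interrupts on this core
  std::uint64_t x = rng_state;
  x ^= x >> 12;                        // xorshift64* state update
  x ^= x << 25;
  x ^= x >> 27;
  rng_state = x;
  std::uint64_t result = x * 2685821657736338717ull;  // xorshift64* output multiplier
  asm volatile("cpsie i":::"memory");  // unmask interrupts again
  return result;
}

Note that masking interrupts only serializes against code running on the same core; on a dual-core part, accesses to the shared state from the other core would presumably still need a mutex or spinlock, in line with the planned switch away from this temporary code.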

@illuhad (Collaborator):

It's fine not to use mt19937, but I'm concerned about potential collisions between accessor ids when multiple threads submit work. If the accessor ids (which these random numbers are used for) collide, then your program will potentially do unpredictable things, as the wrong kernels will be fed the wrong data pointers as arguments.


That's a point to explore, then. Microcontrollers are often used for mission- or even safety-critical applications, so we don't like to hear "if two random number generators happen to collide, your program does unpredictable things".
Right now, we're more interested in checking whether there is interest in upstreaming support for microcontrollers. If there is, we can spend a little more time on this issue and propose, for microcontroller targets, a solution that does not depend on randomness to work correctly, even if it costs us some performance.
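As a purely illustrative sketch (not something this PR implements) of an id source that does not depend on randomness, a global atomic counter yields deterministic, collision-free ids; whether such a scheme fits the id format AdaptiveCpp actually expects for accessors is a separate question.

#include <atomic>
#include <cstdint>

// Monotonic id source: two concurrently submitting threads can never collide.
// On cores without native 64-bit atomics, std::atomic<uint64_t> typically
// falls back to a library/lock-based implementation: still correct, just slower.
std::uint64_t next_unique_id() {
  static std::atomic<std::uint64_t> counter{1};
  return counter.fetch_add(1, std::memory_order_relaxed);
}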

@illuhad (Collaborator):

That's a point to explore, then. Microcontrollers are often used for mission- or even safety-critical applications, so we don't like to hear "if two random number generators happen to collide, your program does unpredictable things".

That's understandable, but it only matters for the buffer-accessor memory management model. I think especially on platforms where resource usage is a constraint, you'd probably rather want to use the SYCL 2020 USM memory management model, which gives you more control.

The buffer-accessor model does a lot of things implicitly, including memory deallocation via a garbage collection mechanism which you probably don't want.

But I think it is important to still get it to the point where it is in principle functional, even if it is just for the reason of being able to run the AdaptiveCpp unit tests.

Right now, we're more interested in checking whether there is interest in upstreaming support for microcontrollers

Well, I can turn that question around :-) Are you interested in having your work upstream?
So far most AdaptiveCpp users have a background in HPC, but that does not mean we're not open to expanding to other communities. It takes someone from those communities with the necessary motivation to drive this, though. It's an open source project, so ultimately those who use it and contribute determine the direction :)


Actually, we're quite interested in using the buffer-accessor model on microcontrollers.
First, for code portability among embedded targets: scaling up from microcontrollers, we find embedded single-board computers with GPUs but without fine-grained SVM, so the buffer-accessor model seems to be the best way to write embedded SYCL code. Second, Miosix uses C++ as its systems programming language, not C, so we're quite used to the RAII model of using objects to manage memory.

Also, maybe I'm wrong, but the buffer-accessor model uses reference counting, not garbage collection, so it should not be heavier than, say, std::shared_ptr; am I missing something?

And yes, we'll be working on improving SYCL support for microcontrollers through AdaptiveCpp, and we're interested in upstreaming the code; that's the reason for this first PR.

@illuhad (Collaborator) commented Feb 20, 2025:

Actually, we're quite interested in using the buffer-accessor model on microcontrollers.
First, for code portability among embedded targets: scaling up from microcontrollers, we find embedded single-board computers with GPUs but without fine-grained SVM, so the buffer-accessor model seems to be the best way to write embedded SYCL code. Second, Miosix uses C++ as its systems programming language, not C, so we're quite used to the RAII model of using objects to manage memory.

Fine-grained SVM is only a fallback for the USM OpenCL extension when using the OpenCL backend. It's only lightly tested and might not work in all scenarios, as fine-grained SVM does not really expose sufficient control for SYCL.

On microcontrollers, you wouldn't want to use it because OpenCL always involves JIT, and on a CPU every regular pointer is a USM pointer anyway.
On some kind of embedded/single-board/mobile GPU it makes more sense of course - but even there, the fine-grained SVM issue is going to be solved by SVM 2.0, which the OpenCL working group is currently working on.
I expect this to be widely supported in the future so that we can rely on USM also on such hardware.

See KhronosGroup/OpenCL-Docs#1282

Most SYCL code written today uses the USM model, so you will want to at least support USM device allocations anyway.
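For reference, a minimal sketch of the USM style recommended here, using only standard SYCL 2020 API (the queue setup and sizes are illustrative):

#include <sycl/sycl.hpp>
#include <cstddef>

int main() {
  sycl::queue q;
  constexpr std::size_t n = 1024;

  // Explicit allocation instead of buffers/accessors: the user controls lifetime.
  int* data = sycl::malloc_shared<int>(n, q);

  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = static_cast<int>(i[0]) * 2;
  }).wait();

  sycl::free(data, q);  // deterministic deallocation, no deferred cleanup
  return 0;
}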

Also, maybe I'm wrong, but the buffer-accessor model uses reference counting, not garbage collection, so it should not be heavier than, say, std::shared_ptr; am I missing something?

Yes. There are cases where the underlying data object of a buffer has to outlive the user-facing buffer object. In particular, this is the case if the buffer destructor does not block (it does not always block, even though a lot of SYCL introductory material kind of suggests this).
In this case, SYCL needs to manage the lifetime of the data storage internally. This requires some mechanism for the SYCL runtime to notice when every kernel operating on the buffer has finished, so that data is only freed once it is safe to do so.
This can either be achieved by inserting callbacks after kernels (which, from experience, introduces unacceptable latencies in many cases), or by deallocating lazily using some asynchronous mechanism. The latter is what AdaptiveCpp does and what I refer to as "garbage collection".

A consequence of this is that, in general, you cannot rely on data being freed in the buffer-accessor model once the reference count drops to zero.
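To make the "garbage collection" point concrete, here is a highly simplified sketch of the idea (not AdaptiveCpp's actual runtime code): instead of freeing when the reference count drops to zero, pending deallocations are queued together with the last event that may still touch the storage, and the queue is swept opportunistically.

#include <sycl/sycl.hpp>
#include <vector>

// Each entry pairs a pending deallocation with the last event using the storage.
struct pending_free {
  void* ptr;
  sycl::event last_use;
};

std::vector<pending_free> deferred;

// Called opportunistically (e.g. on later submissions), not when the refcount hits 0.
void sweep(sycl::queue& q) {
  for (auto it = deferred.begin(); it != deferred.end();) {
    auto status =
        it->last_use.get_info<sycl::info::event::command_execution_status>();
    if (status == sycl::info::event_command_status::complete) {
      sycl::free(it->ptr, q);  // the storage is assumed to be a USM allocation here
      it = deferred.erase(it);
    } else {
      ++it;
    }
  }
}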

There are a lot of other problems in the buffer-accessor model, which might be even worse on small GPUs that might be available in the hardware that you are interested in. This includes higher register pressure and high kernel submission latencies.
See https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/performance.md#sycl-memory-management-usm-vs-buffers

These defects are well known in SYCL but require fundamental changes to fix. I would not recommend that any new code invest in the buffer-accessor model; AdaptiveCpp actively warns about its use starting from version 24.10.

And yes, we'll be working on improving SYCL support for microcontrollers through AdaptiveCpp, and we're interested in upstreaming the code; that's the reason for this first PR.

Okay, that's good to hear :)

@fedetftpolimi commented:

Hello, I'm the Miosix maintainer here, collaborating with @NiccoloN on this feature. Let me answer some questions:

As for maintaining this mode, we can check from time to time that the AdaptiveCpp microcontroller support keeps working, as we'll likely be using this in the following years. If it stops working, we'll provide patches.
As for adding some checks to the CI, we can try to add a workflow to compile AdaptiveCpp for Miosix, as well as compile some test benchmarks. This should catch most regressions. However, there is no easy way to run compiled Miosix code in the CI. That would require dedicated hardware.

As for the filesystem:

  1. Regarding C++17 support, Miosix is currently transitioning from C++14 to C++17, and some parts of std::filesystem are still unimplemented, but this will be fixed soon.
  2. However, Miosix can only support a filesystem on boards that have a suitable disk device, which usually means an SD card connector and supporting hardware. This cannot be taken for granted in the microcontroller space. The board we're targeting right now, the Raspberry Pi Pico (https://www.raspberrypi.com/products/raspberry-pi-pico/), has no such SD card connector. We may add a read-only filesystem using part of the internal flash memory, but that's the best we can do.
  3. In any case, when not targeting GPUs, I can't see why AdaptiveCpp needs to access files on disk to run SYCL code. Also consider that JIT can never be supported on microcontrollers: they have very limited RAM (from a few KBytes (!) to a few MBytes), so it's not possible to run a compiler on the microcontroller.

The overall goal is that microcontrollers are becoming multi-core, and thus it makes sense to use threads to exploit the available parallelism. In this context, extending SYCL support to also encompass microcontrollers allows greater code portability and reuse. Currently our PR does not support nd_range, as we don't have support for coroutines on Miosix (unlike Linux, Miosix does continuous stack-overflow checks, so any attempt to move the stack pointer to another buffer to implement coroutines without informing the OS will cause a segfault). Regarding the option to use omp.accelerated on microcontrollers, we may try as soon as we finish our work of patching clang to compile code for Miosix. However, does omp.accelerated require JIT? If it does, then library-only remains the only way.

@NiccoloN (Author) commented:

Hello!

About the fourth point: we could think of replacing the COMPILE_FOR_MICROCONTROLLERS flag with a new dedicated omp profile using ACPP_COMPILER_FEATURE_PROFILE, e.g. omp.micro. However, I would like to keep the possibility of compiling both with and without clang compiler support (which is WIP on our side), so we could introduce two new profiles: omp.micro.library-only and omp.micro.full. In any case, I would leave the COMPILE_TOOLS flag separate from this, as it could come in handy on its own. Let me know if this could do the job for you.

@illuhad (Collaborator) commented Feb 18, 2025:

As for maintaining this mode, we can check from time to time that the AdaptiveCpp microcontroller support keeps working, as we'll likely be using this in the following years. If it stops working, we'll provide patches.
As for adding some checks to the CI, we can try to add a workflow to compile AdaptiveCpp for Miosix, as well as compile some test benchmarks. This should catch most regressions. However, there is no easy way to run compiled Miosix code in the CI. That would require dedicated hardware.

That's great. We don't need to actually run things, but at least compile testing would be good. I think this should catch most issues around the restricted feature set of this platform, which is the main concern.

As for the filesystem:

Thanks for explaining!

Regarding C++17 support, Miosix is currently transitioning from C++14 to C++17, and some parts of std::filesystem are still unimplemented, but this will be fixed soon.
However, Miosix can only support a filesystem on boards that have a suitable disk device, which usually means an SD card connector and supporting hardware. This cannot be taken for granted in the microcontroller space. The board we're targeting right now, the Raspberry Pi Pico (https://www.raspberrypi.com/products/raspberry-pi-pico/), has no such SD card connector. We may add a read-only filesystem using part of the internal flash memory, but that's the best we can do.

Does this mean that once std::filesystem support is implemented in Miosix, we could build AdaptiveCpp as usual against std::filesystem, as long as no std::filesystem features are actually used at runtime?
I think this would be far preferable to the current solution, as it would avoid riddling the code base with #ifdefs.

In any case, when not targeting GPUs, I can't see why AdaptiveCpp needs to access files on disk to run SYCL code. Also consider that JIT can never be supported on microcontrollers: they have very limited RAM (from a few KBytes (!) to a few MBytes), so it's not possible to run a compiler on the microcontroller.

I understand that JIT may not be possible. However, there may still be other use cases for std::filesystem.

  • Currently, backends are loaded dynamically using dlopen. This is primarily for deployment flexibility, since it allows skipping backends whose dependencies (e.g. driver availability) are unmet. I understand that this is not really needed if you only want to target the CPU and only care about one particular microcontroller anyway.
  • We do have a per-application persistent storage (appdb) that we use to store information about kernels. At the moment this is only used by the JIT mechanism, but in the future it is likely that we will also want to store statistical data about kernels there for tuning purposes. This could also be relevant for the other compilation flows.

Currently our PR does not support nd_range as we don't have support for coroutines on Miosix

If you don't support nd_range, then I think the value proposition of SYCL for you is going to be severely limited. The code that you can express without nd_range, if you only want to target CPUs, is basically exactly the same as #pragma omp parallel for, tbb::parallel_for, or even std::for_each(std::execution::par_unseq, ...).
All of these are likely easier to deploy than SYCL - especially std::for_each, which already comes with the compiler. So why use SYCL in the first place?
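To make that comparison concrete, here is a sketch of the same element-wise loop in both styles (illustrative function names; the SYCL version assumes data is a USM pointer):

#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>

// Basic (non-nd_range) SYCL parallel_for: no work-group control, no local memory.
void scale_sycl(sycl::queue& q, float* data, std::size_t n) {
  q.parallel_for(sycl::range<1>{n},
                 [=](sycl::id<1> i) { data[i] *= 2.0f; }).wait();
}

// Roughly the same expressiveness with the standard library alone.
void scale_std(std::vector<float>& v) {
  std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                [](float& x) { x *= 2.0f; });
}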

however does omp.accelerated require JIT?

It does not. It runs additional LLVM transformations on the host IR to implement barriers without fibers.

Any non-trivial SYCL program will need nd_range. Without nd_range, you lose

  • ability to select work group sizes
  • ability to use local memory
  • ability to use SYCL 2020 group algorithm library
  • ability to use SYCL 2020 reductions
  • ability to use the more interesting algorithms from our acpp::algorithms library
  • ability to use AdaptiveCpp C++ standard parallelism offloading

Local memory in omp.accelerated is currently implemented using thread_local, so this will be an issue, but we can solve it once we get to that point.

I think that if we add additional targets like microcontrollers to upstream AdaptiveCpp, then I would at least like core SYCL functionality such as nd_range to be supported. With few exceptions, like specific extensions around JIT compilation that only work with the JIT compiler by design, all other features are currently supported in all modes and backends. I don't really want to maintain separate feature support tables for different targets ;)
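For illustration, here is a minimal sketch of the kind of kernel that needs nd_range: explicit work-group sizes, local memory and a SYCL 2020 group reduction. The names and sizes are illustrative; n is assumed to be a multiple of wg_size, the pointers are assumed to be USM allocations, and the local buffer is used only as staging to keep the example short.

#include <sycl/sycl.hpp>
#include <cstddef>

// One partial sum per work group: needs nd_range for the group size,
// local memory and the group algorithm library.
void partial_sums(sycl::queue& q, const float* in, float* out,
                  std::size_t n, std::size_t wg_size) {
  q.submit([&](sycl::handler& cgh) {
    sycl::local_accessor<float, 1> scratch{sycl::range<1>{wg_size}, cgh};
    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{wg_size}},
        [=](sycl::nd_item<1> it) {
          scratch[it.get_local_id(0)] = in[it.get_global_id(0)];
          sycl::group_barrier(it.get_group());
          float sum = sycl::reduce_over_group(it.get_group(),
                                              scratch[it.get_local_id(0)],
                                              sycl::plus<float>{});
          if (it.get_local_id(0) == 0)
            out[it.get_group(0)] = sum;
        });
  }).wait();
}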

About the fourth point: we could think of replacing the COMPILE_FOR_MICROCONTROLLERS flag with a new dedicated omp profile using ACPP_COMPILER_FEATURE_PROFILE, e.g. omp.micro. However, I would like to keep the possibility of compiling both with and without clang compiler support (which is WIP on our side), so we could introduce two new profiles: omp.micro.library-only and omp.micro.full. In any case, I would leave the COMPILE_TOOLS flag separate from this, as it could come in handy on its own. Let me know if this could do the job for you.

Thanks for your thoughts. I've thought about it a bit more and have come around on this issue: I think the compiler feature profiles are there to describe what the compiler is capable of, not what the runtime can actually do on the target hardware.
Since you will not be compiling on the actual microcontroller (I assume), you could still have a full compiler profile with all features enabled, but then a binary built with the generic JIT compiler won't be able to run on the microcontroller.

Here's what I'd like to see for this PR to go forward:

  • Add some CI for compile testing.
  • Add nd_range support using omp.accelerated, or at least work towards it, so that all core functionality is available.
  • Try to reduce COMPILE_FOR_MICROCONTROLLERS usage as much as possible. For example, the ability not to load the OpenMP backend dynamically is not necessarily microcontroller-specific, and as mentioned, perhaps we could allow building against std::filesystem as long as it is not actually used.

@fedetftpolimi commented:

Currently, backends are loaded dynamically using dlopen. This is primarily for deployment flexibility, since it allows skipping backends whose dependencies (e.g. driver availability) are unmet. I understand that this is not really needed if you only want to target the CPU and only care about one particular microcontroller anyway.

Well, dlopen and related functions do not exist on Miosix, and calling them, or even including the dlfcn.h header, results in a compile-time error. Microcontrollers have no hardware support for virtual memory: there's no MMU and RAM is not divided into pages, which is how traditional OSes support loading and sharing libraries among processes.
That's why our patch switches to compiling static libraries and statically linking the support for the OpenMP backend.

We'll still work on supporting std::filesystem, as we'll likely want that functionality for our application code, but I'm not sure that also compiling AdaptiveCpp's filesystem code for microcontrollers:

  • would result in fewer #ifdefs, as we'd still need to #ifdef out the dynamic linking code
  • would not result in code that compiles but then fails at runtime, because it reaches code that calls dlopen or tries to write files on boards incapable of supporting a read-write filesystem, defeating the purpose of the CI.

So why use SYCL in the first place?

Code portability in the embedded world. Imagine you want to write parallel code once and then have it run both on embedded boards with a GPU and on microcontrollers. In any case, you've convinced me: we need to support nd_range efficiently to be able to write code once and deploy it to both microcontrollers and GPU-capable single-board computers.

Here's what I'd like to see for this PR to go forward

I guess we'll go back to coding then, and come up with a more complete proposal. In any case, if possible, we'd like to begin the upstreaming process before the code is 100% complete and fine-tuned: have a PR with some form of preliminary support accepted, and then make additional improvements in subsequent PRs, rather than maintaining a fork for a long time and doing one single huge PR at the end.
