-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AdaptiveCpp for microcontrollers and Miosix OS #1720
base: develop
Are you sure you want to change the base?
Conversation
add COMPILE_TOOLS option to cmake add COMPILE_FOR_MICROCONTROLLERS definition to config.hpp.in add compile-for-microcontrollers setting to acpp core config file
…r microcontrollers
add custom rng for miosix as it doesn't support tls
…for microcontrollers fix a bug that causes a linker/compiler error if flag lines are empty
…s where long is 32B: int was always assumed to 32B and so when long is 32B the same function gets defined twice through macro expansion
… the new definition COMPILE_FOR_MICROCONTROLLERS
31145bf
to
1a7041d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi!
Some general questions here:
- If we add a new mode with a certain restricted functionality like you propose, it's pretty much guaranteed that after some time of ongoing development in other areas this feature will be unintentionally broken by other changes without explicit testing. Are you willing to maintain this mode? Is there a way to add it to CI?
- Miosix seems to support file systems (according to https://miosix.org/), so why do we need to disable all of the
std::filesystem
stuff? AdaptiveCpp requires C++17, andstd::filesystem
is a part of C++ 17. - What is the overall goal here, functionality-wise? You say that you want to support
omp.library-only
, however this mode is intended more for debugging. It's not going to perform well at all (which is expected and by design) for any non-trivial SYCLnd_range
kernel.omp.accelerated
orgeneric
targets will perform much better (orders of magnitude better!). Do you plan to add support for these? - Strategically, we've made the move away from individual feature enabling/disabling flags towards moving to fixed compiler support profiles (see
ACPP_COMPILER_FEATURE_PROFILE
) due to the simplification for users. How would this microcontroller mode fit into this? I'd very much like to fold the mode into this, rather then introduce again specific feature enabling/disabling flags with specific tradeoffs in terms of dependencies and feature support.
@@ -52,6 +52,8 @@ void *omp_allocator::raw_allocate(size_t min_alignment, size_t size_bytes, | |||
// but it's unclear if it's a Mac, or libc++, or toolchain issue | |||
#ifdef __APPLE__ | |||
return aligned_alloc(min_alignment, size_bytes); | |||
#elif defined(_MIOSIX) | |||
return malloc(size_bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Miosix guarantee that every allocation is aligned to the maximum possible value? Otherwise, this looks incorrect if min_alignment > 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
malloc() and operator new on Miosix (and Linux/Windows/MacOS) return memory aligned to the C/C++ data type with at least the strictest alignment, which from memory should be long double, no need for any other logic to achieve this goal. Otherwise on CPU architectures not supporting unaligned memory access, the CPU will fault when accessing that memory. We do have some microcontroller targets that fault on unaligned accesses so we know.
I read AdaptiveCpp code and the only use we could find of the alignment seems to either pass the value 32 for cache alignment, or use alignof on data structures, so it seems safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
malloc() and operator new on Miosix (and Linux/Windows/MacOS) return memory aligned to the C/C++ data type with at least the strictest alignment, which from memory should be long double, no need for any other logic to achieve this goal.
Are there no vector load/stores on your hardware?
I read AdaptiveCpp code and the only use we could find of the alignment seems to either pass the value 32 for cache alignment, or use alignof on data structures, so it seems safe.
Well... the alignment is exposed to users in the form of sycl::aligned_alloc
. If you just ignore the alignment, then technically this means that the SYCL specification is violated even if there is perhaps not a strong reason to have data aligned to more than long double. It causes an observable change in the API behavior compared to what users are technically guaranteed.
Now, we currently also kind of handwave this on CUDA and HIP backends, where we also don't take this parameter into account - but there at least we know that we generally get page-aligned memory, so it's less of a risk of not doing what the user wants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARM Cortex-M is a family of instruction sets for RISC CPUs of different complexity and performance, but still limited to microcontroller-class performance.
The highest performance CPU we have access to and thus support is the Cortex-M7. Being a RISC CPU, memory access is done only by load/store instructions and the widest load/store it can perform is for a 64bit value (ldrd/strd instruction), so the strictest alignment we need is 8 bytes. malloc() is already overkill as it provides 16 byte alignment. This architecture does have some vector instructions but is very limited, like doing 4 separate byte adds in a 32 bit register or similar. Not fancy.
The architecture we're currently targeting for SYCL is even way, way more limited. It's one of the few dual-core microcontrollers but the cores are Cortex-M0. No vector instructions, no hardware floating point, no hardware integer divide, limited hardware integer multiply. The Raspberry Pi microcontroller does not even have external RAM nor caches. 264KByte of RAM is all you've got. Luckily we were pleased that a simple matrix multiply example with Miosix + AdaptiveCpp requires only 48KB RAM and there's room for improvement, so AdaptiveCpp can become really usable on this platform.
x ^= x << 25; | ||
x ^= x >> 27; | ||
uint64_t result = x * 2685821657736338717ull; | ||
asm volatile("cpsie i":::"memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide some background on what these asm instructions do? Is the seed dynamic, or set statically to x
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is some temporary code to disable/enable interrupts on ARM Cortex-M CPUs. Eventually we'll evaluate the performance and switch to a mutex or some other solution when we figure out how to make the kernel headers available when compiling AdaptiveCpp. Currently Miosix has no support for thread-local storage so the original code does not compile, and besides, std::mt19937 is a huge data structure, taking nearly 5KBytes of RAM, and having one instance for each thread is definitely not a good idea for microcontrollers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine not using mt19937, but I'm concerned about potential collisions for accessor ids when multiple threads submit work. If the accessor ids (which these random numbers are used for) collide, then your program will potentially do unpredictable things as the wrong kernels will be fed with the wrong data pointers as argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's one point to explore then. Microcontrollers are often used for mission- or even safety- critical applications, so we don't like to hear "if two random number generators happen to collide your program does unpredictable things".
Right now, we're more interested in checking whether there's is interest in upstreaming support for microcontrollers. If there is, we can spend a little more time on this issue and propose for microcontroller targets a solution that, even if it costs us some performance, does not depend on randomness to work correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's one point to explore then. Microcontrollers are often used for mission- or even safety- critical applications, so we don't like to hear "if two random number generators happen to collide your program does unpredictable things".
That's understandable, but it only matters for the buffer-accessor memory management model. I think especially on platforms where resource usage is a constraint, you'd probably rather want to use the SYCL 2020 USM memory management model, which gives you more control.
The buffer-accessor model does a lot of things implicitly, including memory deallocation via a garbage collection mechanism which you probably don't want.
But I think it is important to still get it to the point where it is in principle functional, even if it is just for the reason of being able to run the AdaptiveCpp unit tests.
Right now, we're more interested in checking whether there's is interest in upstreaming support for microcontrollers
Well, I can turn that question around :-) Are you interested in having your work upstream?
So far most AdaptiveCpp users have a background in HPC. But I think that does not mean that we're not open to expanding to other communities. It takes someone from those communities with the necessary motivation though to drive this. It's an open source project, so ultimately, those that use it & contribute determine the direction :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we're quite interested in using the buffer accessor model on microcontrollers.
First, for code portability among embedded targets, as scaling up from microcontrollers we find embedded single-board computers with GPUs but without fine-grained SVM, thus the buffer accessor model seems to be the best way to write embedded SYCL code. Second, Miosix uses C++ as system programming language, not C, so we're quit used to the RAII model of using objects to manage memory.
Also, maybe I'm wrong, but the buffer accessor model uses reference counting, not garbage collection so it should not be more heavy than, say, std::shared_ptr, am I missing something?
And yes, we'll be working on improving SYCL support for microcontrollers through AdaptiveCpp and we're interested in upstreaming the code, that's the reason for this first PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we're quite interested in using the buffer accessor model on microcontrollers.
First, for code portability among embedded targets, as scaling up from microcontrollers we find embedded single-board computers with GPUs but without fine-grained SVM, thus the buffer accessor model seems to be the best way to write embedded SYCL code. Second, Miosix uses C++ as system programming language, not C, so we're quit used to the RAII model of using objects to manage memory.
Fine-grained SVM is only a fallback for for the USM OpenCL extension when using the OpenCL backend. It's only lightly tested, and might not work for all scenarios as fine-grained SVM does not really expose sufficient control for SYCL.
On microcontroller, you wouldn't want to use it because OpenCL always involves JIT, and on CPU every regular pointer is a USM pointer anyway.
On some kind of embedded/single-board/mobile GPU it makes more sense of course - but even there, the fine-grained SVM issue is going to be solved by SVM 2.0, which the OpenCL working group is currently working on.
I expect this to be widely supported in the future so that we can rely on USM also on such hardware.
See KhronosGroup/OpenCL-Docs#1282
Most SYCL code written today uses the USM model, so you will want to at least support USM device allocations anyway.
Also, maybe I'm wrong, but the buffer accessor model uses reference counting, not garbage collection so it should not be more heavy than, say, std::shared_ptr, am I missing something?
Yes. There are cases where the underlying data object of buffer
has to outlive the user-facing buffer
object. Particularly, this is the case if the buffer destructor does not block (it does not always block, even though a lot of SYCL introductory materials kind of suggests this).
In this case, SYCL needs to manage lifetime of the data storage internally. This requries some mechanism for the SYCL runtime to notice when any kernel operating on the buffer
has finished, so that it can ensure that data is only freed once it is safe to do so.
This can either be achieved by inserting callbacks after kernels (which from experience, introduces unacceptable latencies in many cases), or by deallocating lazily using some asynchronous mechanism. This is what AdaptiveCpp does and what I refer to by "garbage collection".
A consequence of this is that you cannot rely on data to be freed in the buffer-accessor model once the reference count goes to 0 in general.
There are a lot of other problems in the buffer-accessor model, which might be even worse on small GPUs that might be available in the hardware that you are interested in. This includes higher register pressure and high kernel submission latencies.
See https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/performance.md#sycl-memory-management-usm-vs-buffers
These defects are well-known in SYCL, but require fundamental changes to fix. I would not recommend for any new code to invest into the buffer-accessor model. AdaptiveCpp starting from version 24.10 actively warns if you use it.
And yes, we'll be working on improving SYCL support for microcontrollers through AdaptiveCpp and we're interested in upstreaming the code, that's the reason for this first PR.
Okay, that's good to hear :)
Hello, I'm the Miosix maintainere here, I'm collaborating with @NiccoloN on this feature. Let's answer some questions: As for maintaining this mode, we can check from time to time that the AdaptiveCpp microcontroller support keeps working, as we'll likely be using this in the following years. If it stops working, we'll provide patches. As for the filesystem:
The overall goal is that microcontrollers are becoming multi-core and thus it makes sense to use threads to exploit the available parallelism. In this context, extending SYCL support to also encompass microcontrollers allows greater code portability and reuse. Currently our PR does not support |
Hello! About the fourth point: we could think to replace the COMPILE_FOR_MICROCONTROLLERS flag with a new dedicated omp profile using ACPP_COMPILER_FEATURE_PROFILE, e.g. omp.micro. However I would like to leave the possibility to compile both with or without clang compiler support (which is WIP from our side). So we could introduce 2 new profiles: omp.micro.library-only and omp.micro.full. In any case I would leave the flag COMPILE_TOOLS separate from this, which could come handy by its own. Let me know it this could do the job for you. |
That's great, we don't need to actually run things, but at least compile testing would be good. I think this should catch most issues around the restricted feature set of this platform, which is the main concern.
Thanks for explaining!
Does this mean that once
I understand that JIT may not be possible. However, there may still be other use cases for
If you don't support
It does not. It runs additional LLVM transformations on the host IR to implement barriers without fibers. Any non-trivial SYCL program will need
local memory in I think that if we add some additional targets like microcontrollers in upstream AdaptiveCpp, then I would add least like core SYCL functionality like
Thanks for your thoughts. I've thought about it a bit more and kind of came around on this issue: I think the compiler feature profiles are there to describe what the compiler is capable of, not what the runtime can actually do on the target hardware. Here's what I'd like to see for this PR to go forward:
|
Well, We'll still work on suppporting std::filesystem as we'll likely want to use this functionality for our application code, but I'm not sure compiling AdaptiveCpp filesystem code also for microcontrollers:
Code portability in the embedded world. Imagine you want to write parallel code once and then have it run both on embedded boards with a GPU, and on microcontrollers. In any case, you convinced me, we need to support
I guess we'll go back coding then, and come up with a more complete proposal. In any case, if possible, we'd like to begin the upstream process before the code is 100% complete and fine tuned. Like, have a PR with some form of preliminary support accepted, and then do additional improvements in subsequent PRs rather than maintaining a fork for a long time and then do just a single huge PR at the end. |
The PR makes the necessary changes to AdaptiveCpp to enable the omp library-only backend on multicore microcontrollers, in particular for the Miosix embedded OS. This work has been conducted at HEAPLab, Politecnico di Milano, to bring sycl into the world of embedded systems and microcontrollers.