Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use aligned_alloc() as default allocator for halide_malloc() on most platforms #7189

Closed
wants to merge 16 commits into from

Conversation

steven-johnson
Copy link
Contributor

@steven-johnson steven-johnson commented Nov 30, 2022

C++17 includes C11, which includes aligned_alloc as part of the standard library. We should move to prefer using it over malloc() in halide_malloc(), since we currently have to do an overallocate-and-adjust trick with malloc() to get what we want.

We can't move to it everywhere, alas, because:

  • Older versions of Android (before Android 9) and OSX (before 10.15) don't support them regardless of your compiler version
  • Microsoft has stated that they are unlikely to ever support it, as they map malloc() directly onto Win32 APIs and apparently those APIs don't offer any aligned-allocation options, oddly enough
  • Hexagon/QuRT I left as-is, because I'm not sure about whether aligned_alloc is offered there and if so, what version. (Would welcome input from @pranavb-ca et al on this)

That said, on ~all modern Unixy platforms, it should be available and arguably preferable, so this PR makes it the default, with a no_aligned_alloc feature to allow dropping back to using plain malloc()-with-hackery. (This option is ignored for Windows and QuRT, of course.)

It's worth pointing out that this subtlely modifies the (implicit) contract of halide_malloc: in addition to guaranteeing that the returned pointer is aligned to halide_malloc_alignment(), the aligned_alloc() allocator also requires/guarantees that the memory returned is an integral number of align-sizes long (rounding up, obviously). In order to avoid subtle behavior differences between the two malloc-based allocators remaining (posix and qurt) I modified them to also make the same guarantee. (I guess I should document this and make sure that all the overrides of halide_malloc() I know of in Google and other places make this same guarantee...)

I considered adding a new halide_aligned_alloc() runtime function instead, and then migrating all the calls we make for buffer memory there (since that's the only place we really care about this alignment), leaving halide_malloc() to be used for whatever places we need to allocate memory that doesn't need special alignment... but (1) I'm not sure if there are enough of the latter to be worthwhile, and (2) we'd have to 'dumb down' halide_malloc() to make the exercise worthwhile, and that would leave us risk of hard-to-repro crashes if we got one of the allocations wrong.

This will likely need some torture testing, since it's possible that some platforms offer aligned_alloc but with inferior performance.

Not sure how useful this will be, but this will allow us to experiment to see if it pays for itself.
@abadams
Copy link
Member

abadams commented Nov 30, 2022

On platforms where it's available, why would we ever not use aligned_alloc?

Is the target feature just for testing it out?

@steven-johnson
Copy link
Contributor Author

steven-johnson commented Nov 30, 2022

On platforms where it's available, why would we ever not use aligned_alloc?

  • On some platforms we can't determine for sure... e.g. Android, where it is only available with SDK level 28 (or so), regardless of C++17. On OSX, it's apparently only available as of macOS 10.15.
  • It's also possible that a platform has implemented aligned_alloc() but it's dramatically slower than malloc().

Is the target feature just for testing it out?

For now, yes. (Long-term it would be preferable to move to making it mandatory on all platforms that support it, of course.)

check(c(in2, offset4, out4));
});
const float megapixels = (W * H) / (1024.f * 1024.f);
printf("Benchmark: %dx%d -> %f mpix/s for %s\n", W, H, megapixels / time, t.to_string().c_str());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a performance test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe eventually? Need to see real-world examples first to see where we can expect to use it.

@steven-johnson
Copy link
Contributor Author

Thinking out loud... maybe the right option is actually to make aligned_alloc the default and have a no_aligned_alloc feature to turn it off? As mentioned above, some runtimes don't support it, but for Android, that's before API 28 (Android 9, 2018), and for OSX, it's before 10.15 (2019). While I'm sure there are developers still targeting those older systems, it's not clear how many are using Halide, and if they are, they'd encounter a failure at link time, so the error would be obvious (and easy to fix).

@steven-johnson steven-johnson changed the title Add aligned_alloc() option (#7160) Use aligned_alloc() as default allocator on most platforms Dec 1, 2022
@steven-johnson
Copy link
Contributor Author

PTAL


void run_test(Target t) {
// Must call this so that the changes in cached runtime are noticed
Halide::Internal::JITSharedRuntime::release_all();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised we don't invalidate the cache if the target changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too. Want me to open an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steven-johnson
Copy link
Contributor Author

PTAL

@steven-johnson steven-johnson changed the title Use aligned_alloc() as default allocator on most platforms Use aligned_alloc() as default allocator for halide_malloc() on most platforms Dec 1, 2022
steven-johnson added a commit that referenced this pull request Dec 1, 2022
…latforms

(See also #7189)

Modify H::R::Buffer to default to using `aligned_alloc()` instead of `malloc()`, except:
- If user code passes a non-null `allocate_fn` or `deallocate_fn`, we always use those (and/or malloc/free)
- If the code is compiling under MSVC, never use `aligned_alloc` (Windows doesn't support it)
- If HALIDE_RUNTIME_BUFFER_USE_ALIGNED_ALLOC is defined to be 0, never use `aligned_alloc` (this is to allow for usage on e.g. older Android and OSX versions which don't provide `aligned_alloc()` in the stdlib, regardless of C++ versions.)

Also, as with #7189, this ensures that the allocated space has the start of the host data as 128-aligned, and also now ensures that the size allocated 128-aligned (rounding up as needed).
@steven-johnson steven-johnson added the release_notes For changes that may warrant a note in README for official releases. label Dec 2, 2022
@pranavb-ca
Copy link
Contributor

@steven-johnson - Thanks for this. You are right, the pointer math tricks can be avoided. I think we can replace this with memalign on QuRT. I am working from India at present but only for a few days more before I am out on vacation, so, I'll get to this for Qurt once I am back in early Jan.

Currently, halide_malloc_alignment() is WEAK_INLINE, so even if you attempt to override it, it won't have any effect (the call sites in halide_malloc() will have inlined the implementation selected by LLVM_Runtime_Linker).

This promotes halide_malloc_alignment() to a documented piece of the runtime you can override, with the caveat that it must return the same value every time it is called. To reduce overhead in halide_malloc(), we copy the value into a global at startup time and use that instead.

Also, added the (oops) missing documentation to HalideRuntime.h about the new alignment requirements.
@steven-johnson
Copy link
Contributor Author

attn @abadams: please take a look at 54caa29

steven-johnson added a commit that referenced this pull request Dec 6, 2022
Currently, halide_malloc_alignment() is WEAK_INLINE, so even if you attempt to override it, it won't have any effect (the call sites in halide_malloc() will have inlined the implementation selected by LLVM_Runtime_Linker).

This promotes halide_malloc_alignment() to a documented piece of the runtime you can override, with the caveat that it must return the same value every time it is called. To reduce overhead in halide_malloc(), we copy the value into a global at startup time and use that instead.

By itself, this isn't that useful, but it's essential to have in order for #7189 to be robust; I'm pulling this change into a separate PR so that it can land separately, enabling downstream users to be able to update their halide_malloc() implementations ahead of the "real" change.
@abadams
Copy link
Member

abadams commented Dec 6, 2022

halide_malloc_alignment was weak_inline so that it would resolve to a constant at compile time, which makes all the masking etc operations on constants. Making it overridable adds a function call overhead to halide_malloc and possibly makes the alignment math more expensive. I don't think it's worth it. Users that want custom alignment should just override halide_malloc entirely.

Edit: Ignore the above, I only read the title and reacted to that and didn't look at the details of the PR.

@steven-johnson
Copy link
Contributor Author

Closing this in favor of #7206

steven-johnson added a commit that referenced this pull request Dec 7, 2022
…latforms (#7190)

Use aligned_alloc() as default allocator for HalideBuffer.h on most platforms

(See also #7189)

Modify H::R::Buffer to default to using `aligned_alloc()` instead of `malloc()`, except:
- If user code passes a non-null `allocate_fn` or `deallocate_fn`, we always use those (and/or malloc/free)
- If the code is compiling under MSVC, never use `aligned_alloc` (Windows doesn't support it)
- If HALIDE_RUNTIME_BUFFER_USE_ALIGNED_ALLOC is defined to be 0, never use `aligned_alloc` (this is to allow for usage on e.g. older Android and OSX versions which don't provide `aligned_alloc()` in the stdlib, regardless of C++ versions.)

Also, as with #7189, this ensures that the allocated space has the start of the host data as 128-aligned, and also now ensures that the size allocated 128-aligned (rounding up as needed).
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
…latforms (halide#7190)

Use aligned_alloc() as default allocator for HalideBuffer.h on most platforms

(See also halide#7189)

Modify H::R::Buffer to default to using `aligned_alloc()` instead of `malloc()`, except:
- If user code passes a non-null `allocate_fn` or `deallocate_fn`, we always use those (and/or malloc/free)
- If the code is compiling under MSVC, never use `aligned_alloc` (Windows doesn't support it)
- If HALIDE_RUNTIME_BUFFER_USE_ALIGNED_ALLOC is defined to be 0, never use `aligned_alloc` (this is to allow for usage on e.g. older Android and OSX versions which don't provide `aligned_alloc()` in the stdlib, regardless of C++ versions.)

Also, as with halide#7189, this ensures that the allocated space has the start of the host data as 128-aligned, and also now ensures that the size allocated 128-aligned (rounding up as needed).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_notes For changes that may warrant a note in README for official releases.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants