Add support for nested multiversioned functions #8
An idea that I've been toying with is exposing the dispatcher to the module with a signature like fn __dispatch_foo(features: &'static [&'static str]) -> fn() -> (). This would work in tandem with a new attribute, static_dispatch:

#[target_clones("[x86|x86_64]+avx")]
fn foo() { /* snip */ }

#[target_clones("[x86|x86_64]+avx")]
#[static_dispatch(foo)]
fn bar() { foo(); }

The generated code for bar would then be:

fn __dispatch_bar(features: &'static [&'static str]) -> fn() -> () {
    fn __clone_0() {
        /* embedded via static_dispatch */
        fn foo() {
            __dispatch_foo(&["avx"])();
        }
        /* the original definition of bar */
        foo();
    }
    fn __clone_1() {
        /* embedded via static_dispatch */
        fn foo() {
            __dispatch_foo(&[])();
        }
        /* the original definition of bar */
        foo();
    }
    /* statically dispatch clones here */
}

fn bar() {
    /* dynamically dispatch here via __dispatch_bar */
}

It appears that const propagation is good enough to turn the static dispatching into inlined functions (!), based on this test: https://rust.godbolt.org/z/gg-XQu

@TethysSvensson, any opinion on this implementation? My only qualm with it is that it exposes an implementation detail ( |
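A minimal, portable sketch of the __dispatch_* idea, under stated assumptions: the clone names (foo_avx, foo_default, bar_avx, bar_default) are hypothetical, the clones carry no #[target_feature] attribute so the snippet builds on any target, and the feature list is a plain slice of strings. Passing a constant slice from inside a clone is what gives const propagation the chance to fold the choice into a direct call; whether it actually does so is up to the optimizer, as in the godbolt test above.

```rust
// Hypothetical clones of `foo`. A real expansion would put
// #[target_feature(enable = "avx")] on the AVX clone.
fn foo_avx() -> &'static str {
    "avx clone"
}

fn foo_default() -> &'static str {
    "default clone"
}

/// Maps CPU features the caller already knows are available to one of
/// foo's clones. With a constant argument, this branch can be folded away.
#[inline(always)]
fn __dispatch_foo(features: &[&str]) -> fn() -> &'static str {
    if features.contains(&"avx") {
        foo_avx
    } else {
        foo_default
    }
}

/// AVX clone of `bar`: static dispatch with a constant feature list.
fn bar_avx() -> &'static str {
    __dispatch_foo(&["avx"])()
}

/// Default clone of `bar`: no extra features are known.
fn bar_default() -> &'static str {
    __dispatch_foo(&[])()
}

/// Dynamic dispatch for `bar`: features are detected once, at the entry point.
pub fn bar() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return bar_avx();
        }
    }
    bar_default()
}
```

Only bar performs runtime detection; everything called from inside one of its clones receives a compile-time constant.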
I'm not completely convinced that having What if If we can live with having a compile-time error in this case, then we can do something where this:

#[target_clones("x86_64+avx")]
fn foo() { /* snip */ }

#[target_clones("x86_64+avx")]
#[static_dispatch(foo)]
pub fn bar() { foo(); }

turns into this:

fn foo() { /* same dispatcher logic as currently */ }
mod foo {
    use super::*;
    #[target_feature(enable = "avx")]
    pub(super) unsafe fn x86_64_avx() {
        fn foo() { unsafe { foo::x86_64_avx() } }
        fn safe_wrapper() { /* snip */ }
        safe_wrapper();
    }
    pub(super) unsafe fn default_impl() {
        fn foo() { unsafe { foo::default_impl() } }
        fn safe_wrapper() { /* snip */ }
        safe_wrapper();
    }
}

pub fn bar() { /* same dispatcher logic as currently */ }
pub mod bar {
    use super::*;
    #[target_feature(enable = "avx")]
    pub unsafe fn x86_64_avx() {
        fn foo() { unsafe { foo::x86_64_avx() } }
        fn bar() { unsafe { bar::x86_64_avx() } }
        fn safe_wrapper() { foo(); }
        safe_wrapper();
    }
    pub unsafe fn default_impl() {
        fn foo() { unsafe { foo::default_impl() } }
        fn bar() { unsafe { bar::default_impl() } }
        fn safe_wrapper() { foo(); }
        safe_wrapper();
    }
}

This is very close to what we are already doing, except that the paths and visibility have been changed. |
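To check that this shape is well-formed Rust, here is a small compilable sketch of the same expansion, assuming an x86_64 target for the AVX clones (they are cfg-gated so the snippet also builds elsewhere) and using made-up return values so the dispatch path is observable. A function and a module can share the name foo because functions live in the value namespace and modules in the type namespace. Inside bar's AVX clone, foo's AVX clone is reached statically through its module path rather than through the dispatcher.

```rust
/// Dispatcher for `foo`: runtime detection, then a call into the module.
pub fn foo() -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified at runtime just above.
            return unsafe { foo::x86_64_avx() };
        }
    }
    // SAFETY: the default clone has no feature requirements.
    unsafe { foo::default_impl() }
}

/// Same name as the function above; modules live in the type namespace.
pub mod foo {
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx")]
    pub unsafe fn x86_64_avx() -> u32 {
        1 // illustrative marker: the AVX clone ran
    }

    pub unsafe fn default_impl() -> u32 {
        0 // illustrative marker: the default clone ran
    }
}

/// Dispatcher for `bar`, built the same way.
pub fn bar() -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified at runtime just above.
            return unsafe { bar::x86_64_avx() };
        }
    }
    // SAFETY: the default clone has no feature requirements.
    unsafe { bar::default_impl() }
}

pub mod bar {
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx")]
    pub unsafe fn x86_64_avx() -> u32 {
        // Static dispatch: this clone only runs when AVX is available, so it
        // can call foo's AVX clone directly without detecting features again.
        unsafe { super::foo::x86_64_avx() + 10 }
    }

    pub unsafe fn default_impl() -> u32 {
        unsafe { super::foo::default_impl() + 10 }
    }
}
```

Because both dispatchers detect AVX the same way, bar() always returns foo() + 10.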
That was my original idea too; it's definitely very similar to what we have now. You make a good point about feature mismatches where multiple choices are valid. It sounds like the recommended usage should be to use the same feature sets on every interacting function in the crate. I think part of the reason I wanted something more complicated is that the list of clones could get very long, but that's definitely a separate issue (I've opened #10 to try to deal with that). One issue that I'm just thinking of: what if |
How about this then? pub: The functions inside the module all have the same As far as I can tell, this solves the problem. If a safe function |
I agree with your comments on

mod foo {
    use super::*;
    pub fn default() {
        fn bar(/* how do we determine these args? */) -> /* or the return type? */ {
            bar::default(args)
        }
    }
}

Since we're in a proc macro, I think the way to go will be to parse the body and replace all instances of

#[target_clones("x86_64+avx")]
fn foo() {
    let baz = #[static_dispatch] bar();
}
|
I am not sure that solution actually solves since the I think this could be solved with a sufficient number of wrappers, but it becomes quite cumbersome at some point. |
We could, for instance, do something like this:

#[target_clones("x86_64+avx")]
fn foo(x: i32, y: i32) -> i32 {
    if x <= 0 {
        y
    } else {
        1 + foo(x - 1, y)
    }
}

#[target_clones("x86_64+avx")]
#[static_dispatch(foo)]
fn bar(x: i32) -> i32 {
    foo(x, x)
}

Becomes:

fn foo(x: i32, y: i32) -> i32 { /* dispatcher logic */ }
mod foo {
    type FnType = fn(i32, i32) -> i32;
    #[inline(always)]
    pub(super) unsafe fn avx() -> FnType {
        #[target_feature(enable = "avx")]
        unsafe fn avx(x: i32, y: i32) -> i32 {
            #[inline(always)]
            fn safe(x: i32, y: i32) -> i32 {
                if x <= 0 {
                    y
                } else {
                    1 + (unsafe { super::foo::avx() })(x - 1, y)
                }
            }
            safe(x, y)
        }
        #[inline(always)]
        fn foo(x: i32, y: i32) -> i32 {
            unsafe { avx(x, y) }
        }
        foo
    }
    #[inline(always)]
    pub(super) unsafe fn default_impl() -> FnType {
        unsafe fn default_impl(x: i32, y: i32) -> i32 {
            #[inline(always)]
            fn safe(x: i32, y: i32) -> i32 {
                if x <= 0 {
                    y
                } else {
                    1 + (unsafe { super::foo::default_impl() })(x - 1, y)
                }
            }
            safe(x, y)
        }
        #[inline(always)]
        fn foo(x: i32, y: i32) -> i32 {
            unsafe { default_impl(x, y) }
        }
        foo
    }
}

fn bar(x: i32) -> i32 { /* dispatcher logic */ }
mod bar {
    type FnType = fn(i32) -> i32;
    #[inline(always)]
    pub(super) unsafe fn avx() -> FnType {
        #[target_feature(enable = "avx")]
        unsafe fn avx(x: i32) -> i32 {
            #[inline(always)]
            fn safe(x: i32) -> i32 {
                (unsafe { super::foo::avx() })(x, x)
            }
            safe(x)
        }
        #[inline(always)]
        fn bar(x: i32) -> i32 {
            unsafe { avx(x) }
        }
        bar
    }
    #[inline(always)]
    pub(super) unsafe fn default_impl() -> FnType {
        unsafe fn default_impl(x: i32) -> i32 {
            #[inline(always)]
            fn safe(x: i32) -> i32 {
                (unsafe { super::foo::default_impl() })(x, x)
            }
            safe(x)
        }
        #[inline(always)]
        fn bar(x: i32) -> i32 {
            unsafe { default_impl(x) }
        }
        bar
    }
}

The idea here is to go through the body and replace every instance of This is really ugly IMO. However, I have thought about it for a while and have not yet come up with anything better that achieves all of the following properties:
If you think this is the right approach, I think I have time to implement it, once we agree on what we want. On the other hand, if you are looking for something to hack on, don't let me stop you. 😉 |
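One way the /* dispatcher logic */ placeholders in the function-pointer proposal above might be filled in is sketched below for foo alone. Assumptions: an x86_64 target for the AVX path (cfg-gated), runtime detection via the standard is_x86_feature_detected! macro, and a default clone collapsed into a single safe function since it needs no target features. The dispatcher obtains a plain fn pointer from foo::avx() or foo::default_impl() and calls it, while recursive calls inside each clone go back through the same module function, so the chosen clone stays selected. This only shows that the shape compiles and computes the right result; whether the wrappers get inlined away is the open question in this thread.

```rust
/// Dispatcher for `foo`: pick a clone at runtime, then call it via a fn pointer.
pub fn foo(x: i32, y: i32) -> i32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified at runtime just above.
            return (unsafe { foo::avx() })(x, y);
        }
    }
    // SAFETY: the default clone has no feature requirements.
    (unsafe { foo::default_impl() })(x, y)
}

mod foo {
    pub(super) type FnType = fn(i32, i32) -> i32;

    /// Returns a safe fn pointer to the AVX clone. Unsafe because the caller
    /// must guarantee AVX is available before invoking the returned pointer.
    #[cfg(target_arch = "x86_64")]
    #[inline(always)]
    pub(super) unsafe fn avx() -> FnType {
        #[target_feature(enable = "avx")]
        unsafe fn avx(x: i32, y: i32) -> i32 {
            #[inline(always)]
            fn safe(x: i32, y: i32) -> i32 {
                // Recursive call: statically dispatched back into this clone.
                if x <= 0 { y } else { 1 + (unsafe { super::foo::avx() })(x - 1, y) }
            }
            safe(x, y)
        }
        #[inline(always)]
        fn foo(x: i32, y: i32) -> i32 {
            // SAFETY: only reachable via `avx()`, whose caller vouched for AVX.
            unsafe { avx(x, y) }
        }
        foo
    }

    #[inline(always)]
    pub(super) unsafe fn default_impl() -> FnType {
        fn default_impl(x: i32, y: i32) -> i32 {
            if x <= 0 { y } else { 1 + (unsafe { super::foo::default_impl() })(x - 1, y) }
        }
        default_impl
    }
}
```

For example, foo(3, 10) recurses three times and returns 13 on either path.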
I agree that replacing the invoked function at the call site is probably the only easy (for users, not for us) and safe way to do this. A few comments:
I've done a bit of work in the

#[target_feature(enable = "avx")]
unsafe fn avx(/* args */) {
    #[inline(always)]
    fn foo(/* args */) { /* body */ }
    foo(/* args */)
}
fn default(/* args */) {
    #[inline(always)]
    fn foo(/* args */) { /* body */ }
    foo(/* args */)
}

I believe this should have the same safety guarantee. If you'd like, we can merge my changes to master before you look at the actual static dispatch component. |
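A concrete, compilable instance of this wrapper pattern, with a hypothetical sum function standing in for the user's body and an x86_64-only AVX clone (cfg-gated). The body is written once as a safe #[inline(always)] inner function, and each clone is a thin shell that calls it; once the body is inlined into the shell it is typically compiled with the shell's target features, while the user's code never sits inside an unsafe context. As noted in the next reply, this depends on the inner call really being inlined, which is not guaranteed for recursive bodies.

```rust
/// AVX clone: a thin shell around the safe body.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(x: &[f32]) -> f32 {
    // The user's body, kept in safe code. When inlined here, it is compiled
    // with AVX enabled.
    #[inline(always)]
    fn body(x: &[f32]) -> f32 {
        x.iter().sum()
    }
    body(x)
}

/// Default clone: the same body without extra features.
fn sum_default(x: &[f32]) -> f32 {
    #[inline(always)]
    fn body(x: &[f32]) -> f32 {
        x.iter().sum()
    }
    body(x)
}

/// Dynamic dispatcher.
pub fn sum(x: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified at runtime just above.
            return unsafe { sum_avx(x) };
        }
    }
    sum_default(x)
}
```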
Doing it that way will not allow optimizations to work for recursive functions, because |
And I believe that Syn has some visitor functionality to go through the AST recursively for you and only look for specific things. I see your point about global replacement. I am not sure I can think of a better solution than the one you propose. |
I'm not sure I understand the inlining issue; in this example it seems to inline just fine: https://rust.godbolt.org/z/DAk5P5. That said, I'm not terribly concerned about it, and I'll revert that change if that third function is necessary. If Syn has visitor functionality, that would be great. One thought about call-site replacement is that something like this doesn't work:

let f = #[static_dispatch] foo;

If the user absolutely needs that, however, I figure they can use a closure. |
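A sketch of the closure workaround, written as the hand-expanded output for one clone of a hypothetical caller. The user would write let f = |x| #[static_dispatch] foo(x); and the macro would only have to rewrite the call inside the closure body, exactly as for any other call site. The module foo and its default_impl clone are stand-ins; inside an AVX clone of the caller, the closure would call foo's AVX clone instead.

```rust
mod foo {
    /// Hypothetical default clone of `foo`.
    pub(super) unsafe fn default_impl(x: i32) -> i32 {
        x * 2
    }
}

/// What the default clone of a caller could expand to, given
/// `let f = |x| #[static_dispatch] foo(x);` in the original body.
fn caller_default_clone(values: &[i32]) -> Vec<i32> {
    // The argument is just the closure parameter, so no user-written
    // expression ends up inside the generated `unsafe` block.
    let f = |x: i32| unsafe { foo::default_impl(x) };
    values.iter().copied().map(f).collect()
}
```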
Actually, now that I think about it, that example probably works fine as long as you don't specify |
I am pretty sure I had an issue with inlining previously using something similar to what you propose. However I cannot replicate the problem right now, so it's probably fine. |
Why wouldn't that example work? |
I think this is a bad idea, as it allows the user to have unsafe code in the arguments without causing compilation issues, e.g. |
Very good point. Returning the function pointer is probably a better way to do that. |
I think I found the counterexample now. This used to optimize correctly, but no longer does:

#[multiversion::target_clones("x86_64+avx")]
pub fn square(i: i32, x: &mut [f32]) {
    if i <= 1 {
        for v in x {
            *v *= *v;
        }
    } else {
        square(i - 1, &mut x[1..]);
        square(i - 2, &mut x[2..]);
    }
}
|
In the current master, the innermost function will no longer be inlined, despite the |
Looks like you're right; it appears to work in some trivial examples, but that's it. I think this is caused by rust-lang/rust#53117. I've added the recursion helper back. |
I have made a branch implementing this. Feel free to use it directly or take partial inspiration from it. It does what we agreed upon. The main downside is that it currently breaks the |
I've done some testing, and unfortunately I'm not sure this method is going to work. The compiler has a really hard time keeping track of the CPU features and inlining. If the returned function is transmuted from the unsafe fn pointer, as you currently have in your branch, the function isn't inlined when static-dispatched. If the function is wrapped one more time (or the recursion helper is returned; these seem to have the same effect), the function is properly inlined when static-dispatched, but somehow the CPU features are lost in the version used by the dynamic dispatcher.

I think a better solution would be to drop the indirection via function pointers and go back to exposing the functions directly. Since we can't transmute at the call site (for static dispatch), I think we might be able to use an unsafe block, taking extra care to make sure the arguments are evaluated outside the unsafe block. Using syn, we can count the number of arguments and generate something like:

{
    let __arg_0 = /* some expr */;
    let __arg_1 = /* some other expr */;
    {
        #[allow(unused_unsafe)]
        unsafe { foo(__arg_0, __arg_1) }
    }
}
|
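A small runnable illustration of why the arguments are bound to __arg_N locals before the unsafe block: any unsafe operation the user writes in an argument expression then sits outside the generated unsafe block, so the compiler still demands the user's own unsafe block instead of silently accepting it. foo_clone and caller are hypothetical names, and the call being expanded is foo(v[0], v.len() as i32). Source evaluation order is preserved, since the let statements run left to right just as call arguments would.

```rust
/// Hypothetical feature-specific clone of `foo`.
unsafe fn foo_clone(a: i32, b: i32) -> i32 {
    a + b
}

/// Hand-expanded form of a statically dispatched call `foo(v[0], v.len() as i32)`.
fn caller(v: &[i32]) -> i32 {
    // Arguments are evaluated in safe code, in source order. A raw-pointer
    // dereference written here would be rejected by the compiler unless the
    // user wrapped it in their own `unsafe` block.
    let __arg_0 = v[0];
    let __arg_1 = v.len() as i32;
    // Only the call itself is inside the generated `unsafe` block.
    // SAFETY (assumed): the macro only emits this call from a clone whose
    // target features cover foo_clone's requirements.
    unsafe { foo_clone(__arg_0, __arg_1) }
}
```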
I'm now remembering that this solution doesn't work properly if the original function is unsafe... |
Maybe we can hack something up using rust-lang/rust#64035? E.g. having a macro that generates an |
I think something like that would definitely work. I wonder if we could even just make a single macro that produces all of the various versions (or I actually really like the idea of being able to report with Did you have a specific idea of how it would work? |
For the record, I think I discovered why it didn't work. The compiler needs to put "breaks" before code that may require feature detection, in order to prevent accidentally performing speculative execution on an unsupported function before feature detection is complete. I believe this is why inlining was lost when transmuting a function pointer (I suppose speculative execution stops at a |
Added in #12. |
In this example, foo should be statically dispatched when invoked in bar, since the CPU features have already been established when dispatching bar. It would also be nice if this even worked when functions have mismatched feature sets (x86+sse+avx should be able to statically dispatch x86+sse functions).
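The mismatched-feature case amounts to picking, at compile time, the most specialized clone whose required features are all guaranteed by the caller's own clone. A sketch of that selection rule, assuming each clone is described by a name and a list of required features, ordered from most to least specialized (the table and names are hypothetical, and the architecture part such as x86 is ignored):

```rust
/// Hypothetical clone table for `foo`: (clone name, required features),
/// ordered from most to least specialized. The last entry must require nothing.
const FOO_CLONES: &[(&str, &[&str])] = &[
    ("sse_avx", &["sse", "avx"]),
    ("sse", &["sse"]),
    ("default", &[]),
];

/// Returns the first clone whose required features are all present in
/// `known`, the features the calling clone was compiled for.
fn select_clone(known: &[&str]) -> &'static str {
    FOO_CLONES
        .iter()
        .find(|(_, required)| required.iter().all(|f| known.contains(f)))
        .map(|(name, _)| *name)
        .expect("the default clone requires no features")
}
```

With this rule, a caller compiled for sse+avx would pick foo's sse_avx clone if one exists and fall back to the sse clone otherwise, without any runtime detection.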