Implement x86 AVX intrinsics #3192
Thanks for the PR! Unfortunately I am exceptionally busy right now and honestly I am not sure if I'll be able to review a PR of this size before the Christmas break. Sorry for that.
Perhaps if @eduardosm splits out the "mere code movement" parts with rounding and sqrt into a separate PR so that the maskload, etc. stuff can be reviewed separately?
Splitting up is always helpful, yeah. Reviewing has a strictly superlinear time complexity, I suspect somewhere between
Yeah. I was also thinking "hm, I could help examine the tricky parts and doublecheck stuff, but Ralf knows how stuff is/should be organized better than me (obviously), so I have no comment on 'where should
Done, created #3214 with the "just moving things" part.
Move some x86 intrinsics code to helper functions in `shims::x86`
To make them reusable for intrinsics of other x86 features. Split from #3192
☔ The latest upstream changes (presumably #3214) made this pull request unmergeable. Please resolve the merge conflicts.
Thanks! Mostly looks good, I have some nits and questions though.
I didn't look at the tests (yet); I assume they cover all the new intrinsics and in particular the corner cases.
@rustbot author
@rustbot ready
All right, finally went through all the ops and comments. Sorry for the long wait!
src/shims/x86/mod.rs
Outdated
///
/// Each 128-bit chunk is treated independently (i.e., the value for
/// the is i-th 128-bit chunk of `dest` is calculated with the i-th
/// 128-bit blocks of `left` and `right`).
Doesn't that mean the output should be shorter than the input, since each 128-bit chunk produces one result?
The output has the same size as each input, since each 128-bit chunk of the input produces a 128-bit chunk for the output.
I guess I don't understand what is meant by performing an operation "horizontally".
Given two vectors, I can see how one can do an operation "pointwise" (that's what it would be called in mathematics, anyway): the i-th output element is computed as `op(left[i], right[i])`. Naturally, then the output is the same length as either input. Is that what you mean by "horizontally"? If not, what do you mean? I thought it may be a sort of fold, like summing all elements, but then you get one output element per input chunk, so that doesn't match the comment. I was trying to reverse engineer the code but failed.^^
Okay, I stared at the code some more, and I think it's composing neighboring elements?
- concatenate `left ++ right` to form `input`
- then compute `output[i]` as `op(input[2*i], input[2*i + 1])`
Is that correct?
That's correct
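To make the confirmed reading concrete, here is a minimal sketch of a "horizontal" operation on plain slices. It is not Miri's code: `horizontal_op` and its `chunk_len` parameter (the number of elements in one 128-bit chunk) are hypothetical names chosen for illustration. Each chunk of `left` is concatenated with the matching chunk of `right`, and neighbouring pairs are combined, so the output has the same length as each input.

```rust
/// Hypothetical helper (not Miri's actual implementation): applies `op` to
/// neighbouring pairs, treating each chunk of `chunk_len` elements independently.
fn horizontal_op<T: Copy>(
    left: &[T],
    right: &[T],
    chunk_len: usize,
    op: impl Fn(T, T) -> T,
) -> Vec<T> {
    assert_eq!(left.len(), right.len());
    assert!(chunk_len % 2 == 0 && left.len() % chunk_len == 0);

    let mut dest = Vec::with_capacity(left.len());
    for (l, r) in left.chunks(chunk_len).zip(right.chunks(chunk_len)) {
        // input = (chunk of left) ++ (chunk of right)
        let input: Vec<T> = l.iter().chain(r.iter()).copied().collect();
        // output[i] = op(input[2*i], input[2*i + 1])
        for pair in input.chunks(2) {
            dest.push(op(pair[0], pair[1]));
        }
    }
    dest
}
```

With four elements per chunk (as for `f32` in a 128-bit chunk), calling it with addition on two 8-element inputs yields 8 outputs: the first half of each output chunk comes from `left`, the second half from `right`.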
(FYI, I will assume the ball is in your court until you do
@rustbot ready
src/shims/x86/avx.rs
Outdated
let control = this.project_index(&control, i)?;

// Each 128-bit lane is shuffled independently. Since each lane contains
// two 64-bit elements, only the second bit from `right` is used (yes, the
`right` does not exist any more, should probably be `control` now?
src/shims/x86/avx.rs
Outdated
// second instead of the first, ask Intel). To read the value from the current
// lane, add the destination index truncated to a multiple of 2.
let src_i = ((this.read_scalar(&control)?.to_u64()? >> 1) & 1)
    .checked_add(i & !1)
Similar to above, `i & !1` is the `chunk_base`, isn't it?
It is. Since each element is 64 bits and each chunk is 128 bits, a chunk has two elements, so we chop off the lowest bit of the index to get the chunk base.
Okay, then let-bind it like above please. :)
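As an illustration of the index arithmetic discussed here, the following is a simplified model on plain slices, not the Miri shim itself. The function name `permutevar_pd_model` is hypothetical; it assumes 64-bit elements, two per 128-bit chunk, and that only bit 1 of each control element selects the source, as the quoted comment describes.

```rust
/// Hypothetical model of the per-chunk shuffle (not Miri's actual code).
/// `data` and `control` hold 64-bit elements; every two elements form one
/// 128-bit chunk, and each chunk is shuffled independently.
fn permutevar_pd_model(data: &[f64], control: &[u64]) -> Vec<f64> {
    assert_eq!(data.len(), control.len());
    assert!(data.len() % 2 == 0);

    (0..data.len())
        .map(|i| {
            // Clearing the lowest bit of `i` gives the index of the first
            // element of the chunk that `i` belongs to.
            let chunk_base = i & !1;
            // Only bit 1 of the control element is used; bit 0 is ignored.
            let select = ((control[i] >> 1) & 1) as usize;
            data[chunk_base + select]
        })
        .collect()
}
```

Binding `chunk_base` to a name, as requested in the review, keeps the intent of `i & !1` visible at the point where the source index is formed.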
src/shims/x86/avx.rs
Outdated
for i in 0..dest_len {
    let control = this.project_index(&control, i)?;

    // Each 128-bit lane is shuffled independently. Since each lane contains
Should these `lane` be `chunk`?
Oops, I missed those.
@rustbot author
src/shims/x86/avx.rs
Outdated
this.copy_op(
    &this.project_index(&data, src_i)?,
    &this.project_index(&dest, i)?,
    /*allow_transmute*/ false,
)?;
The signature of `copy_op` changed, so this needs a rebase.
@rustbot ready
@bors r+
☀️ Test successful - checks-actions
Blocked on #3214