-
Notifications
You must be signed in to change notification settings - Fork 329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Load/Store, masked set and counting operations #2430
base: master
Are you sure you want to change the base?
Load/Store, masked set and counting operations #2430
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
@@ -429,6 +429,10 @@ for comparisons, for example `Lt` instead of `operator<`. | |||
the result, with `t0` in the least-significant (lowest-indexed) lane of each | |||
128-bit block and `tK` in the most-significant (highest-indexed) lane of | |||
each 128-bit block: `{t0, t1, ..., tK}` | |||
* <code>V **SetOr**(V no, M m, T a)</code>: returns N-lane vector with lane |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming to MaskedSetOr() and MaskedSet()?
@@ -1029,6 +1037,12 @@ Per-lane variable shifts (slow if SSSE3/SSE4, or 16-bit, or Shr i64 on AVX2): | |||
```HighestValue<MakeSigned<TFromV<V>>>()``` is returned in the | |||
corresponding result lanes. | |||
|
|||
* <code>bool **AllOnes**(D, V v)</code>: returns whether all bits in `v[i]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be understood that this checks if all lane values are 1.0. Would AllBits1 or AllBitsOne be more clear? Same below.
@@ -1508,6 +1522,9 @@ aligned memory at indices which are not a multiple of the vector length): | |||
|
|||
* <code>Vec<D> **LoadU**(D, const T* p)</code>: returns `p[i]`. | |||
|
|||
* <code>Vec<D> **MaskedLoadU**(D, M m, const T* p)</code>: returns `p[i]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MaskedLoad already exists for this purpose, we can remove this :) Note that all the load ops except Load()
itself do not require alignment.
@@ -1538,6 +1555,10 @@ aligned memory at indices which are not a multiple of the vector length): | |||
lanes from `p` to the first (lowest-index) lanes of the result vector and | |||
fills the remaining lanes with `no`. Like LoadN, this does not fault. | |||
|
|||
* <code> Vec<D> **LoadHigher**(D d, V v, T* p)</code>: Loads `Lanes(d)/2` lanes from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename to *Upper for consistency.
This could be misunderstood to load the upper half of a vector starting at p
, it seems instead to load directly from p. How about something like InsertIntoUpper, and perhaps v as the last argument?
* <code>void **StoreTruncated**(Vec<DFrom> v, DFrom d, To* HWY_RESTRICT | ||
p)</code>: Truncates elements of `v` to type `To` and stores on `p`. It is | ||
similar to performing `TruncateTo` followed by `StoreN` where | ||
`max_lanes_to_store` is `Lanes(d)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why we mention the StoreN here? It seems like natural store semantics that it stores an entire vector (in this case after truncating to smaller elements), so we could remove the mention of max_lanes and "does not modify".
Generally we define the D argument that comes before a pointer to describe what is being stored. So it would be D, not DFrom here.
Also, we have CompressStore which emphasizes that Compress happens first. Seems that TruncateStore would be a more consistent naming?
// ------------------------------ SetOr/SetOrZero | ||
|
||
#define HWY_SVE_SET_OR(BASE, CHAR, BITS, HALF, NAME, OP) \ | ||
HWY_API HWY_SVE_V(BASE, BITS) NAME(HWY_SVE_V(BASE, BITS) inactive, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename inactive to no
per convention.
#endif | ||
template <class D, typename T, class V = VFromD<D>(), HWY_IF_LANES_GT_D(D, 1)> | ||
HWY_API V LoadHigher(D d, V a, T* p) { | ||
const V b = LoadU(d, p); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users may be surprised that there's a full-length load here, this might crash.
Alternative:
Half<D> dh;
const VFromD<decltype(dh)> b = LoadU(dh, p);
return Combine(d, b, LowerHalf(a));
template <class V> | ||
HWY_API bool AllOnes(V a) { | ||
DFromV<V> d; | ||
return AllTrue(d, Eq(Not(a), Zero(d))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likely a bit more efficient: we can BitCast to unsigned, then compare to Set(du, hwy::HighestValue(du).
Introduces:
"OrZero" operations will return zero where the mask is false whereas standard masking returns the corresponding lane of a passed vector.
All introduced operations are implemented in generic_ops-inl.h and in arm_sve-inl.h where there is a performance gain to be made. Testing is also performed for all new operations.