[API Proposal]: Example usages of a VectorSVE API #88140
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Background and motivation

Adding a vector API for Arm SVE/SVE2 would be useful. SVE is a mandatory feature from Armv9.0 onwards and is an alternative to NEON. Code written in SVE is vector-length agnostic and will automatically scale to the vector length of the machine it is running on, so only a single implementation per routine is required. Use of predication in SVE enables loop heads and tails to be skipped, making code shorter, simpler and easier to write. This issue provides examples of how such an API might be used.

API Proposal

None provided.

API Usage

/*
Sum all the values in an int array.
*/
public static unsafe int sum_sve(ref int* srcBytes, int length)
{
VectorSVE<int> total = Sve.Create((int)0);
int* src = srcBytes;
VectorSVEPred pred;
/*
WhileLessThan comes in two variants:
VectorSVEPred WhileLessThan(int val, int limit)
VectorSVEComparison WhileLessThan(VectorSVEPred out predicate, int val, int limit)
A VectorSVEComparison can be tested using the SVE condition codes (none, any, last, nlast etc).
`if (cmp.nlast) ....`
`if (Sve.WhileLessThan(out pred, i, length).first) ....`
`if (cmp)` is the same as doing `if (cmp.any)`
Ideally the following will not be allowed:
auto f = Sve.WhileLessThan(out pred, i, length).first
*/
/*
Always using a function call for the vector length instead of assigning to a variable will allow
easier optimisation to INCW (which is faster than incrementing by a variable).
*/
for (int i = 0; Sve.WhileLessThan(out pred, i, length); i += Sve.VectorLength<int>())
{
VectorSVE<int> vec = Sve.LoadUnsafe(pred, ref *src, i);
/*
This is the standard sve `add` instruction which uses a merge predicate.
For each lane in the predicate, add the two vectors. For all other lanes use the first vector.
*/
total = Sve.MergeAdd(pred, total, vec);
}
// No tail call required.
return Sve.AddAcross(total).ToScalar();
}
/*
Sum all the values in an int array, without predication.
For performance reasons, it may be better to use an unpredicated loop, followed by a tail.
Ideally, the user would write the predicated version and the Jit would optimise to this if required.
*/
public static unsafe int sum_sve_unpredicated_loop(ref int* srcBytes, int length)
{
VectorSVE<int> total = Sve.Create((int)0);
int* src = srcBytes;
VectorSVEPred all_true = Sve.CreatePred(true);
int i = 0;
for (i = 0; i+Sve.VectorLength<int>() <= length; i+= Sve.VectorLength<int>() )
{
VectorSVE<int> vec = Sve.LoadUnsafe(ref *src, i);
total = Sve.MergeAdd(all_true, total, vec); // All-true predicate: effectively an unpredicated add.
}
// Predicated tail.
VectorSVEPred pred = Sve.WhileLessThan(i, length);
VectorSVE<int> vec = Sve.LoadUnsafe(pred, ref *src, i);
total = Sve.MergeAdd(pred, total, vec);
return Sve.AddAcross(total).ToScalar();
}
/*
Count all the non zero elements in an int array.
*/
public static unsafe int CountNonZero_sve(ref int* srcBytes, int length)
{
VectorSVE<int> total = Sve.Create((int)0);
int* src = srcBytes;
VectorSVEPred pred;
VectorSVE<int> ones = Sve.Create(1);
for (int i = 0; Sve.WhileLessThan(out pred, i, length); i += Sve.VectorLength<int>())
{
VectorSVE<int> vec = Sve.LoadUnsafe(pred, ref *src, i);
VectorSVEPred cmp_res = Sve.CompareGreaterThan(pred, vec, 0);
total = Sve.MergeAdd(cmp_res, total, ones); // Add 1 for each lane where the comparison matched.
}
// No tail call required.
return Sve.AddAcross(total).ToScalar();
}
/*
Count all the non zero elements in an int array, without predication.
*/
public static unsafe int CountNonZero_sve_unpredicated_loop(ref int* srcBytes, int length)
{
Vector128<int> total = Vector128<int>.Zero;
Vector128<int> zero = Vector128<int>.Zero;
Vector128<int> one = Vector128.Create(1);
int* src = srcBytes;
// Comparisons require predicates. Therefore for a truly non-predicated version, use Neon.
int vector_length = 16/sizeof(int);
int i = 0;
for (; i+vector_length <= length; i+=vector_length)
{
Vector128<int> vec = AdvSimd.LoadVector128(src);
Vector128<int> gt = AdvSimd.CompareGreaterThan(vec, zero);
Vector128<int> bits = AdvSimd.And(gt, one);
total = AdvSimd.Add(bits, total);
src += vector_length;
}
int ret = AdvSimd.Arm64.AddAcross(total).ToScalar();
// Predicated tail.
VectorSVEPred pred = Sve.WhileLessThan(i, length);
VectorSVE<int> tail_vec = Sve.LoadUnsafe(pred, ref *src);
VectorSVEPred cmp_res = Sve.CompareGreaterThan(pred, tail_vec, 0);
ret += Sve.PredCountTrue(cmp_res); // INCP
return ret;
}
/*
Count all the elements in a null terminated array of unknown size.
*/
public static unsafe int CountLength_sve(ref int* srcBytes)
{
int* src = srcBytes;
VectorSVEPred pred = Sve.CreatePred(true);
int ret = 0;
while (true)
{
VectorSVE<int> vec = Sve.LoadUnsafeUntilFault(pred, ref *src); // LD1FF
/*
Reading the fault predicate via RDFFRS will also set the condition flags:
VectorSVEComparison GetFaultPredicate(VectorSVEPred out fault, VectorSVEPred pred)
*/
VectorSVEPred fault_pred;
if (Sve.GetFaultPredicate(out fault_pred, pred).last)
{
// Last element is set in fault_pred, therefore the load did not fault.
/*
Like WhileLessThan, comparisons come in two variants:
VectorSVEPred CompareEquals(VectorSVEPred pred, VectorSVE a, VectorSVE b)
VectorSVEComparison CompareEquals(VectorSVEPred out cmp_result, VectorSVEPred pred, VectorSVE a, VectorSVE b)
*/
// Look for any zeros across the entire vector.
VectorSVEPred cmp_zero;
if (Sve.CompareEquals(out cmp_zero, pred, vec, 0).none)
{
// No zeroes found. Continue loop.
ret += Sve.VectorLength<int>();
src += Sve.VectorLength<int>();
}
else
{
// Zero found. Count up to it and return.
VectorSVEPred matches = Sve.PredFillUpToFirstMatch(pred, cmp_zero); // BRKB
ret += Sve.PredCountTrue(matches); // INCP
return ret;
}
}
else
{
// Load caused a fault.
// Look for any zeros across the vector up until the fault.
VectorSVEPred cmp_zero;
if (Sve.CompareEquals(out cmp_zero, fault_pred, vec, 0).none)
{
// No zeroes found. Clear faulting predicate and continue loop.
ret += Sve.PredCountTrue(fault_pred); // INCP
src += Sve.PredCountTrue(fault_pred);
Sve.ClearFaultPredicate(); // SETFFR
}
else
{
// Zero found. Count up to it and return.
VectorSVEPred matches = Sve.PredFillUpToFirstMatch(pred, cmp_zero); // BRKB
ret += Sve.PredCountTrue(matches); // INCP
return ret;
}
}
}
}
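/*
Illustration only (not part of the proposal): a caller of the sketches above. The
`ref int*` signature is taken directly from the hypothetical examples in this issue.
*/
public static unsafe int SumArray(int[] data)
{
fixed (int* p = data)
{
int* ptr = p;
return sum_sve(ref ptr, data.Length);
}
}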
Alternative Designs

No response

Risks

No response

References

SVE Programming Examples
|
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
|
cc: @JulieLeeMSFT |
Thanks for opening the issue! This is definitely a space we want to improve; I just want to lay out some baseline information and expectations. It's worth noting that this is not actionable as an API proposal in its current state. Any API proposal needs to follow the API proposal template, including defining the full public surface area to be added. It's also worth noting that any work towards SVE/SVE2 is going to require:
1. Hardware that supports SVE being available, both for CI validation and for local development.
2. Working out how SVE's scalable, predicated programming model fits alongside the existing fixed-size vector APIs.
3. Working through the restrictions native SVE places on how its vector types can be declared and used, and what that means for AOT.
For 1, there are only some mobile chips (Samsung/Qualcomm) and the AWS Graviton3 that currently support SVE or Armv9 at all. The latter is the only one that would be viable for CI, and it would likely require some non-trivial effort to make happen. It may still be some time and require higher market saturation of SVE/SVE2 before work/progress can be made (the same was true for AVX-512). Having more easily accessible hardware that supports the ISAs (both for CI and local development) can accelerate this considerably.

For 2, SVE is a very different programming model from the existing fixed-sized vectors exposed by `Vector64<T>`/`Vector128<T>`/`Vector256<T>`. It is entirely possible that SVE will be in a similar boat, and in order for the BCL and general user-base to best take advantage of it, we will need to consider a balance between the general programming model most user-code needs for other platforms and what exists in hardware directly. That is, the simplest way to get SVE support available, without requiring all users to go rewrite their code and add completely separate code paths just for the subset of hardware with SVE support, would be to have the general-purpose vector APIs light up on it implicitly.

For 3, native SVE has many limitations around where the SVE vector types can be declared and how they are allowed to be used. Getting the same restrictions in managed would be very complex and not necessarily "pay for play". For the JIT, this is fine since things can be determined at runtime and match exactly. For AOT, this represents a problem since the sizes and other information isn't known until runtime, which can complicate a number of scenarios. |
@tannergooding Thanks for the initial feedback, just some quick responses:
There are actually some more that are available. If Linux would be an acceptable target I can send you an email with other options than the ones you have stated here.
Agreed 100%, though the purpose of this ticket is to start the discussions on the SVE intrinsics that would form the basis of such an API. Before doing a full SVE design and proposal, it's better for us to get some input from you folks on how you'd like these kinds of situations to be handled. From our last discussion we highlighted that SVE would need some additional JIT design work for workloads that Vector won't ever cover, such as SME. Therefore I think it's best to focus this issue on how we can even provide core scalable vector support for current and future Arm architectures, rather than how to support them in the generic wrappers. I believe one is the prerequisite to the other?
Agreed! One of the questions we had raised last time was what are the constraints under which AOT operates. If AOT is equivalent to a native compiler's `-march`/`-mcpu` style targeting, that's one case. If AOT is supposed to be platform agnostic, then the solution could be the one we discussed 2 years ago, where instead of assuming VLA, we assume a minimum vector length that the user selects. I.e. if they want full compatibility we select VL128 and simply predicate all SVE operations to this. This fixes the AOT issues since we'd never use more than VL-sized vectors. But really, to move forward with SVE support we need your input. SVE is much more than just a vector length play, and without support for it in .NET the ability to improve will be quite limited. SVE will, and does, have capabilities that Advanced SIMD will never have. |
Linux is great! We just really need to ensure there are options we can use for both development and validation. This namely means they can't be exclusively mobile devices (i.e. phones). Having a list of viable devices would be good; all the sources I've found for SVE or Armv9 have shown to be phones, android devices, the latest version of Graviton, or not yet released for general use.
It depends on the general approach taken. We could very likely expose some platform-specific surface area as well. This model is actually how AVX-512 was designed to be exposed for .NET 8. That is, we planned to ship the general-purpose support first. We went with this approach because it allowed easier access to the acceleration for a broader audience, worked well when considering all platforms (not just x86/x64), allowed existing code to improve without explicit user action, and allowed developers to iteratively extend their existing algorithms without having to add massive amounts of new platform specific code.
The default for most AOT would be platform agnostic. For example,
Predicated instructions have an increased cost over non-predicated, correct? In particular, we have at least 3 main platforms to consider support around:
There are also other platforms we support (Arm32, x86), platforms that are being brought online by the community (LoongArch, RISC-V), etc. Most of these platforms have some amount of SIMD support and giving access to the platform specific APIs gives the greatest amount of power in your algorithms. Exposing this support long term is generally desirable. However, an algorithm supporting every one of these platforms is also very expensive, and most of the time the algorithms are fairly similar to each other regardless of the platform being targeted, with often only minor differences between them.

This is why the cross platform APIs were introduced in .NET 7, as it allowed us to take what was previously a minimum of 3 (Arm64, Fallback, x64) but sometimes more (different vector sizes, per ISA lightup, other platforms, etc) paths and merge them down to just 2 paths (Vector, Fallback) while still maintaining the perf or only losing perf within an acceptable margin.

To that end, the general direction we'd like to maintain is that developers can continue writing (for the most part) more general cross platform code for the bulk of their algorithms. They would then utilize the platform specific intrinsics for opportunistic light-up within these cross platform algorithms where it makes a significant impact. Extreme power users could continue writing individual algorithms that are fine-tuned for each platform/architecture, but it would not necessarily be the "core" scenario we expect most users to be targeting (namely due to the complexity around doing this).

I believe that ensuring I do think we should be able to make

Given the samples in the OP, there's this callout:
I actually think it's the inverse: given the cross-platform considerations, we want users to write code unpredicated with a tail loop. This is what they already have to do for the existing fixed-size vectors. Instead, we could express the code such as:

public static unsafe int sum_sve_unpredicated_loop(int* srcBytes, int length)
{
Vector<int> total = Vector<int>.Zero;
(int iterations, int remainder) = int.DivRem(length, Vector<int>.Length);
for (int i = 0; i < iterations; i++)
{
total += Vector.Load(srcBytes);
srcBytes += Vector<int>.Length;
}
// Predicated tail.
Vector<int> pred = Vector.CreateTrailingMask(remainder);
total += Vector.MaskedLoad(pred, srcBytes, Vector<int>.Zero);
return Vector.Sum(total);
}

We would still of course expose the platform-specific APIs as well. There are notably a few other ways this could be done too, but the general point is that this results in very natural and easy-to-write code that works on any of the platforms, including on hardware that only supports fixed-size vectors.

There is a note that the JIT handles this difference from hardware in lowering by doing some tracking to know if the value is actually a mask. This decision also doesn't restrict our ability from adding such support in the future; it just greatly simplified what was required today and made it much faster and more feasible to get the general support out while also keeping the implementation cost low.

We could also add support for fully predicated loops, optimizing them to be rewritten as unpredicated with a post-predicated handler. But that is more complicated and I believe would be better handled as a future consideration after we get the core support out. I'd like to see us add SVE support, and so I'd hope going along the avenues we've found to work and help with adoption and implicit light-up would also apply here. |
Agreed, this seemed the closest category. Given that this won't turn into a full API proposal, I could just drop the category and move to a normal issue?
For now, yes.
Having AOT Arm64 target Armv8.0 feels sensible. For ReadyToRun, it would then tier appropriately to what is available. For the "no JIT, only AOT" option, maybe a command line option at compile time - essentially the same as the gcc/llvm `-march=` option.
Without explicit VectorMask types you're losing type checking. Could VectorMask just be a wrapper over Vector? On downlevel platforms, the tail is being turned into a scalar loop? Therefore on simple loops, the mask is effectively optimised away, right?
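A minimal sketch of the kind of wrapper being asked about here (entirely hypothetical; no such type exists in the libraries today):

public readonly struct VectorMask<T> where T : struct
{
// Downlevel, the mask is just an ordinary all-bits-set / all-bits-clear vector.
private readonly Vector<T> _bits;
public VectorMask(Vector<T> bits) => _bits = bits;
public Vector<T> AsVector() => _bits;
public static VectorMask<T> operator &(VectorMask<T> left, VectorMask<T> right) => new VectorMask<T>(left._bits & right._bits);
}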
That makes sense when coming at it from the generic side. From an implementation side, I'm thinking that predication makes things easier. The user writes a single predicated loop. There is enough information in the loop for the JIT to 1) create a predicated loop with no tail, or 2) create an unpredicated main loop and a predicated tail, or 3) create an unpredicated main loop and a scalar tail. I understand users are not used to writing with predication yet. But, in the example, the user still has to use predication for the tail. Is that why Another option maybe would be to never have |
We have this already, and allow targeting specific versions (
Not cheaply/easily, and not without requiring users to go and explicitly opt into the use of masking. The general point is that on downlevel hardware (without mask registers), it is fairly trivial for the JIT to recognize and internally do the correct type tracking. It is also fairly trivial for the JIT to insert the relevant implicit conversions if an API expects a mask and was given a vector, or if it expects a vector and was given a mask. One of the reasons this is cheap and easy to do is because it is completely non-observable to whoever is writing the algorithm, and so the JIT can do this in a very "pay for play" manner. If we inverse things, such that we expose masks as a distinct type, that cost shifts onto users on downlevel hardware.
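To illustrate the pattern being described, using the existing fixed-size APIs (a sketch with real Vector128 methods; the fold into a masked/predicated instruction is the JIT behaviour described above, not something visible in the source):

static Vector128<float> ClampNegativesToZero(Vector128<float> values)
{
// Today the comparison returns an all-bits-set / all-bits-clear Vector128<float>.
Vector128<float> mask = Vector128.GreaterThanOrEqual(values, Vector128<float>.Zero);
// On hardware with masking/predication the JIT can track `mask` as a mask register
// and fold the compare + select; elsewhere it stays a plain bitwise select.
return Vector128.ConditionalSelect(mask, values, Vector128<float>.Zero);
}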
Users do different things for different scenarios. Some users simply defer back to the scalar path. Some will "backtrack" and do 1 more iteration when the operation is idempotent. Some use explicit masked operations to ensure the operation remains idempotent.
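As a sketch of the "backtrack" option for an idempotent operation, using the existing fixed-size APIs (the method and its shape are illustrative, not a proposed API):

static unsafe void ClampToZero(int* data, int length)
{
int i = 0;
for (; i + Vector128<int>.Count <= length; i += Vector128<int>.Count)
{
Vector128<int> v = Vector128.Load(data + i);
Vector128.Max(v, Vector128<int>.Zero).Store(data + i);
}
if (i != length && length >= Vector128<int>.Count)
{
// Idempotent operation: redo the last full vector, overlapping already-processed elements.
// (Inputs smaller than one vector would still need a scalar fallback.)
i = length - Vector128<int>.Count;
Vector128<int> v = Vector128.Load(data + i);
Vector128.Max(v, Vector128<int>.Zero).Store(data + i);
}
}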
I'd disagree on this. The general "how to write vectorized code" steps are:
Predication/masking is effectively a more advanced scenario that can be used to handle more complex cases such as branching or handling the tail (remaining) elements that don't fill the vector. It's not something that's actually needed super often and generally represents a minority of the overall SIMD code you write.
Right. We opted to drop that. It also means that we don't have to do more complex handling for downlevel hardware (which is currently the far more common scenario) and can instead restrict it only to the newer hardware. It means that we won't typically encounter unnecessary predication, which means we don't have to worry about optimizing the unnecessary bits away. It then also simplifies the general handling of loops, as we don't have to consider predication as part of loop cloning, loop hoisting, or other optimizations in general.
Users are already sometimes writing predicated tails today, so I don't see the need to block this. I think it would be better to just expose the couple additional APIs that would help with writing predicated tails instead. Exposing a nearly identical type that supports predication would just make adding predication support harder and would increase the amount of code the typical user needs to maintain.
In my opinion, this is largely a non-issue. Downlevel users already need to do their masking/predication using ConditionalSelect and all-bits-set comparison results. The JIT is then fully capable of knowing which methods return a "mask like value", which we already minimally track on downlevel hardware to allow some special optimizations. The JIT is also fully capable of correctly tracking the type internally as a mask where that is profitable. The only requirement for the JIT here is that if it has a mask where a vector is expected (or vice versa), it inserts the appropriate conversion.

A managed analyzer can then correctly surface to the end user when they are inefficiently handling masking (e.g. passing in a mask to something that doesn't expect one, or vice versa). This is very similar to the |
Great, I've reached out to some people internally and will get back to you.
I don't disagree, but I don't see why you'd not want to give people who want the full benefit of it the ability to avoid generic code? For that reason we don't even see SVE and NEON as orthogonal things, and we have code that dips between the two, sometimes on an instruction basis, depending on what you need https://arm-software.github.io/acle/main/acle.html#arm_neon_sve_bridgeh so I think you're designing yourself into a box by not exposing SVE directly as well.
Yes indeed, but this cost is amortized over the ability to vectorize code that you can't with Advanced SIMD.
I don't think we're disagreeing here. All we want to do with this issue, though, is highlight that we need to have a way to use
Well, for one, AVX-512 masking isn't as first class as SVE's; just compare the number of operations on masks between the two ISAs, for instance. Secondly, SVE simply does not have unpredicated equivalents of all instructions; while SVE2 adds some more, it's not a 1-to-1 thing. So even for your "unpredicated" loop you'll end up needing to predicate the instructions you use. So going from predicated to unpredicated makes much more sense.
Don't really see how this would extend to non-masked ISAs? There's nothing you can do for the predicated tail for Advanced SIMD here. You're going to have to generate scalar. At the very most you can generate a vector epilogue with a scalar tail. But you could have done all that from a single loop anyway. I think we've discussed this before, I still don't see why you can't generate the loop above from a more generic VLA friendly representation. It's just peeling the last vector iteration. I would argue that's easier to write for people as well, and easier to maintain since you don't need to have two instances of your loop body to maintain.
I'm having trouble grokking this part :) What would `if (a[i] > 0 && b[i] < 0)` look like when vectorized in this representation?
Right, so a halfway step between auto-vectorization and intrinsics. That's fine; in C code we treat predication as just normal masking anyway. Typically we don't carry additional IL for it, just normal boolean operations on vector booleans.
Sure, but my worry here is that if you get people to rewrite their loops once using the unpredicated-main-body-plus-predicated-tail approach, would they really rewrite them later again? I can't speak for whether it's more work or not obviously, but especially in the case of SVE1 you'll have to forcibly predicate many instructions anyway to generate the "unpredicated" loop.
:) |
Sure, but to again use a simple example, how would something like a non-if-convertible conditional look? I.e.
without exposing vector mask, how does one write this?
I'd disagree with this :) Yes this is true for VLS but not VLA. The entire point of VLA is that you don't have to think of scalar at all, and conversely you don't need to think of vector length. For VLA the expectation is that
VLA is supposed to map more closely to an intuitive loop design where you're supposed to be able to map directly from scalar to vector without worrying about buffer overruns, overreads etc.
This has been hurting my head tbh :) I've been having trouble understanding with
Sorry, I don't quite follow this one. So if a user does
what happens here? Yes, it's nonsensical, but the types returned by the comparison indicate it should do something sane. How about cases where it's ambiguous:
is this a predicate combination or a bitwise OR of values? |
I think there's maybe some misunderstanding/miscommunication going on, so I'd like to try and reiterate my side a bit. There are effectively 2 target audiences we want to capture here, noting that in both of these cases the users will likely need to be considering multiple platforms.
To achieve 1, we have a need to expose "platform specific APIs". This includes the per-ISA classes such as `Sve` and `AdvSimd`. To achieve 2, we have a need to expose "cross platform APIs". This is primarily `Vector<T>` and the fixed-size `Vector128<T>` family.

For the first audience, these developers are likely more familiar with SIMD and the specific underlying platform they're targeting. They are likely willing to write code for each target platform and even code per ISA to get every bit of perf out. Since doing this can require a lot of platform specific knowledge/expertise, can require significant testing and hardware to show the wins, and since it may not provide significant real world wins in all areas/domains, it is not something that a typical developer is likely going to be doing.

For the second audience, these developers may only be familiar with 1 target platform (i.e. only familiar with x64 or only with Arm64).

We want and need to target both of these audiences. However, it is very important to keep in mind that the second audience is a significantly larger target and we will see the most overall benefit across the ecosystem by ensuring new ISAs can be successfully used from here. If something like `Vector<T>` can implicitly light up on a new ISA, that benefit shows up across the ecosystem without code changes.

To that end, I want to ensure that we expose SVE in a way that works for both audiences.
This general need to be pay-for-play and to not significantly increase complexity sometimes necessitates thinking about alternative ways to expose functionality. We also needed to factor in that newer ISAs are typically only available in a minority of hardware, that it can take significant time for market saturation (such that you have a significantly higher chance of encountering the ISA), and that having a significantly different programming model to work with newer ISAs or functionality can hinder adoption and integration with general algorithms, particularly for audience 2.

For example, we've been working on exposing AVX-512. However, it was quickly found that exposing
Given this, we took a step back and considered if we could achieve this in an alternative fashion. We considered how users, of both audiences, have to achieve the equivalent of masking support today on downlevel ISAs, what was feasible for the JIT to recognize/integrate, the costs/impact of the different approaches for both audiences, etc. Ultimately, we came to the conclusion that we could get all the same functionality a different way and could ensure it was explicitly documented. One of the primary considerations was that a masked operation can already be expressed today as:

Vector128<float> result = Add(left, right);
return ConditionalSelect(mergeMask, result, mergeValues);

Likewise, all downlevel platforms without explicit masking/predication support currently have their "comparison" and other similar intrinsics return an all-bits-set or all-bits-clear vector per element.

So the general thought process was that if we simply preserve this model, then we can allow existing code patterns to trivially light up on hardware that supports masking/predication. We likewise can avoid exposing several thousand new API overloads that explicitly take masking parameters, in favor of pattern recognition over ConditionalSelect. The JIT would still have a notion of a mask internally.

This resulted in something that was incredibly pay for play, had immediate benefits showing up to existing workloads without needing to touch the managed SIMD algorithm, and which still generated the correct and expected code. We simply need to continue expanding the pattern recognition to the full set of mask patterns. My expectation is that a similar model would work very well for SVE. |
The intent was not to "not expose SVE". It was simply a question of whether
Sorry, rather I meant that for many instructions
Right. The same is true of
👍. What I'm trying to emphasize is that as part of the design we really need to consider how the functionality is available for the second audience. If we only consider one side or the other, then we are likely to end up with something that is suboptimal. There will be quite a lot of code and codebases that will want to take advantage of SVE without necessarily writing algorithms that explicitly use For such cases, we are even looking at ways to further improve the handling and collapse the algorithms further, such as via an There will still be plenty of cases where developers will want or need
There's always going to be pros and cons for each platform. There is functionality that Arm64 has that x64 does not, but also functionality that x64 has which Arm64 does not. There are places where Arm64 makes some operations significantly more trivial to do and there are places where x64 does the same over Arm64. It's a constant battle for both sides, but developers are going to want/need to support both. .NET therefore has a desire to allow developers to achieve the best on both as well, while also considering how developers can succeed while needing to support the ever-growing sets of ISAs, platforms, and targets.
Right, but the consideration is that the predicate will often be the all-true predicate. For these APIs, to be "strictly compatible" with what hardware defines, we'd define and expose the following (matching native):

VectorSve<float> Add(VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMerge(VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMergeUnsafe(VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMergeZero(VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AbsMerge(VectorSve<float> inactive, VectorSvePredicate pg, VectorSve<float> op1);
VectorSve<float> AbsMergeUnsafe(VectorSvePredicate pg, VectorSve<float> op1);
VectorSve<float> AbsMergeZero(VectorSvePredicate pg, VectorSve<float> op1);

Where Merge, MergeUnsafe, and MergeZero correspond to the ACLE _m, _x, and _z forms respectively. This pattern repeats for most instructions, meaning we have 3-4x new APIs per instruction, giving us the same API explosion we were going to have for AVX-512. We could collapse this quite a bit by instead exposing:

VectorSve<float> Add(VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> AddMerge(VectorSve<float> inactive, VectorSvePredicate pg, VectorSve<float> op1, VectorSve<float> op2);
VectorSve<float> Abs(VectorSve<float> op1);
VectorSve<float> AbsMerge(VectorSve<float> inactive, VectorSvePredicate pg, VectorSve<float> op1);

The JIT could then trivially recognize the following:

AddMerge(op1, pg, op1, op2) == AddMerge(pg, op1, op2)
AddMerge(VectorSve<float>.Zero, pg, op1, op2) == AddMergeZero(pg, op1, op2)
AbsMerge(op1) == AbsMerge(VectorSvePredicate.All, op1)

Much as with the AVX-512 handling, we could then consider if this can be collapsed a bit more. We already have a variable width vector (`Vector<T>`):

Vector<float> Add(Vector<float> op1, Vector<float> op2);
Vector<float> AddMerge(Vector<float> inactive, VectorSvePredicate pg, Vector<float> op1, Vector<float> op2);
Vector<float> Abs(Vector<float> op1);
Vector<float> AbsMerge(Vector<float> inactive, VectorSvePredicate pg, Vector<float> op1);

If we then take it a step further and do the same thing as what we've done with AVX-512 masking:

Vector<float> Add(Vector<float> op1, Vector<float> op2);
Vector<float> Abs(Vector<float> op1);
Vector<float> Select(VectorSvePredicate pg, Vector<float> op1, Vector<float> op2);

Developers would then access the predicated functionality as:

Select(pg, Add(op1, op2), inactive) == AddMerge(inactive, pg, op1, op2)

This means that we only have 1 new API to expose per SVE instruction, again while expanding the total support available, and centralizing the general pattern recognition required (and which we'll already need/want to support). The only step that could be done here is to remove the separate predicate type as well. |
The JIT is, in general, time constrained. It cannot trivially do all the same optimizations that a native compiler could do. However, there are other optimizations that it can do more trivially, particularly since it knows the exact machine it's targeting. Loop peeling isn't necessarily complex, but it produces quite a bit more IR and requires a lot of additional handling and optimizations to ensure everything works correctly. It's not something we have a lot of support for today, and adding it would likely not be very "pay for play", especially if it's required to get good performance/handling for code users are writing to be "perf critical".

Additionally, most developers are going to require the scalar loop to exist anyway so they have something to execute when SVE isn't supported. So it's mostly just a question of "do I fall through to the scalar path" -or- "do I efficiently handle the tail elements". For "efficient tail handling", you most typically have at least a full vector of elements, and so developers will just backtrack by one vector and do a final, overlapping iteration.

There are then some other more complex cases that aren't so trivially handled and which won't translate as cleanly, particularly if you cannot backtrack (e.g. your entire input is less than a full vector, or you don't know if the access will fault or not). These can really only be handled with masked loads/stores.
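A sketch of that masked-tail case, reusing the hypothetical Vector.CreateTrailingMask / Vector.MaskedLoad helpers from the snippet earlier in this thread (neither exists today):

// Sum fewer than one vector's worth of elements without reading past the end of the buffer.
static unsafe int SumSmall(int* src, int count) // assumes count < Vector<int>.Count
{
Vector<int> mask = Vector.CreateTrailingMask(count); // hypothetical API
Vector<int> vec = Vector.MaskedLoad(mask, src, Vector<int>.Zero); // hypothetical API
return Vector.Sum(vec); // inactive lanes were loaded as zero, so they don't affect the sum
}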
On AVX-512 capable hardware, this will end up actually being a mask register, such as For
This generates then (assuming ; Pre-AVX512 hardware
vxorps xmm0, xmm0, xmm0
vpcmpgtd xmm3, xmm1, xmm0
vpcmpltd xmm4, xmm2, xmm0
vpand xmm3, xmm3, xmm4
; AVX512 hardware
vxorps xmm0, xmm0, xmm0
vpcmpgtd k1, xmm1, xmm0
vpcmpltd k2, xmm2, xmm0
kandd k1, k1, k2
This isn't really auto-vectorization as the user is still explicitly writing vectorized code. We're simply recognizing the common SIMD patterns and emitting better codegen for them on hardware where that can be done.
The same way you'd have to write it for NEON today (assuming again that the snippet provided is the scalar algorithm): // va/vb/vc = current vector for a/b/c, would have been loaded using `Vector128.Load`
Vector128<int> mask = Vector128.GreaterThan(va, Vector128<int>.Zero);
vb = Vector128.ConditionalSelect(mask, vc * n, vb);
va += Vector128.ConditionalSelect(mask, vb - vc, Vector128<int>.Zero);

This simply uses SVE predicated instructions -or- AVX-512 masked instructions on capable hardware; otherwise it emits the equivalent unpredicated compare/select sequence.
Right, but it's not universal and therefore not a de-facto scenario for users. Users have to consider these scenarios for Arm64
It depends on the user of the value and what the underlying instructions allow. On AVX-512 capable hardware, we can either do a
Same scenario. It depends on how the value is being used. You'd either need to convert the mask ( The typical cases, particularly in the real world, will be simply combining masks together and then using them as masks, so no additional "stuff" is needed. You get the exact same codegen as you would if masking was exposed as a "proper" type. |
-- Should be done. Just wanted to reiterate the general point I'm coming from to try and clear up any confusion and then try to separately address your individual questions. |
I've been thinking (and discussing with others) how we could handle the type of a predicate variable. These are the options we came up with. Let's assume we are using `Vector<T>` for SVE vectors.
|
Based on the existing design we've found to work for AVX-512, which supports a similar predication/masking concept, the intended design here is to use `Vector<T>` together with pattern recognition over `Vector.ConditionalSelect`, rather than a separate predicate/mask type.
Such an API would not be exposed. Instead we would only expose the unpredicated form, with predication expressed as:

Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), inactive)
Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), Vector<int>.Undefined) // Vector<int>.Undefined would be new
Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), Vector<int>.Zero)

This allows all forms of predication (merging, "don't care", and zeroing) to be supported via the single ConditionalSelect pattern.
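For reference, these three forms correspond roughly to the ACLE _m (merge), _x (don't care) and _z (zeroing) variants shown in the intrinsic listings below (a sketch; Sve2.AbsoluteValue and Vector<int>.Undefined are hypothetical, as noted above):

Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), inactive); // _m: inactive lanes come from `inactive`
Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), Vector<int>.Undefined); // _x: inactive lanes are undefined
Vector.ConditionalSelect(pg, Sve2.AbsoluteValue(op), Vector<int>.Zero); // _z: inactive lanes are zero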
This is a non-issue as the JIT can emit an implicit conversion. For the most part, this is a non-issue given standard coding patterns around masks and when/where you would want to use them.
This should also be a non-issue. If the user wants to do arithmetic on the output of a It may not be the "best" pattern for an SVE specific code path, but the user can always opt to write an SVE specific path that does something more optimal. |
Ok, I'm mostly on board with this approach. I've been working my way through all the predicated instructions trying to see if they will work.
Looking also at
This can all be covered in C# via:

public static unsafe Vector<T> And(Vector<T> left, int right);
public static unsafe Vector<T> And(Vector<T> left, Vector<T> right);

Note there is a variant that works only on predicates. Allowing for:

Vector<int> mask_c = Sve.And(mask_a, mask_b);

But there is no variant of ADD that works only on predicates, so consider:

Vector<int> mask_c = Sve.Add(mask_a, mask_b);

Assuming this is allowed somehow, does there need to be something in the vector API to indicate to users that the operands are masks? What happens if the user tries it? Will this error? Or will it do something else?

There are also some instructions that work only on predicates. For example, NAND and NOR, which would generate into:

public static unsafe Vector<T> Nand(Vector<T> left, Vector<T> right);
public static unsafe Vector<T> Nor(Vector<T> left, Vector<T> right);

It's fairly easy to support these functions for standard vectors too, but it is worth noting. |
That being said, this is one of the more unique cases where preserving the general support is desirable and where it can't be trivially done via pattern recognition. Thus it's a case where we'd expose an API like
We can either keep the name simple or name them |
Explicitly naming the methods that work on masks would be great. It ensures the meaning is clear to the user. However, it will increase the number of methods:

public static unsafe Vector<sbyte> And(Vector<sbyte> left, Vector<sbyte> right);
public static unsafe Vector<short> And(Vector<short> left, Vector<short> right);
public static unsafe Vector<int> And(Vector<int> left, Vector<int> right);
public static unsafe Vector<long> And(Vector<long> left, Vector<long> right);
public static unsafe Vector<byte> And(Vector<byte> left, Vector<byte> right);
public static unsafe Vector<ushort> And(Vector<ushort> left, Vector<ushort> right);
public static unsafe Vector<uint> And(Vector<uint> left, Vector<uint> right);
public static unsafe Vector<ulong> And(Vector<ulong> left, Vector<ulong> right);
public static unsafe Vector<sbyte> AndMask(Vector<sbyte> left, Vector<sbyte> right);
public static unsafe Vector<short> AndMask(Vector<short> left, Vector<short> right);
public static unsafe Vector<int> AndMask(Vector<int> left, Vector<int> right);
public static unsafe Vector<long> AndMask(Vector<long> left, Vector<long> right);
public static unsafe Vector<byte> AndMask(Vector<byte> left, Vector<byte> right);
public static unsafe Vector<ushort> AndMask(Vector<ushort> left, Vector<ushort> right);
public static unsafe Vector<uint> AndMask(Vector<uint> left, Vector<uint> right);
public static unsafe Vector<ulong> AndMask(Vector<ulong> left, Vector<ulong> right);

Curiously, this is a case where C# has more methods than C, due to C having a single:

svbool_t svand[_b]_z (svbool_t pg, svbool_t op1, svbool_t op2) |
We don't really need both, since:

Sve.AndMask(vector1.AsMask(), vector2.AsMask()).AsVector() == Sve.And(vector1, vector2);
Sve.AndMask(mask1, mask2) == Sve.And(mask1.AsVector(), mask2.AsVector()).AsMask();

Thus, in the world where we only expose vector-typed overloads, we only need to expose additional overloads for cases like the predicate-only instructions. |
After parsing through the full list of SVE intrinsics:

That's 2307 C# methods across 27 groups. Here's a truncated version:

namespace System.Runtime.Intrinsics.Arm
/// Feature: FEAT_SVE Category: math
public abstract class Sve : AdvSimd
{
/// Abs : Absolute value
/// svfloat32_t svabs[_f32]_m(svfloat32_t inactive, svbool_t pg, svfloat32_t op) : "FABS Ztied.S, Pg/M, Zop.S" or "MOVPRFX Zresult, Zinactive; FABS Zresult.S, Pg/M, Zop.S"
/// svfloat32_t svabs[_f32]_x(svbool_t pg, svfloat32_t op) : "FABS Ztied.S, Pg/M, Ztied.S" or "MOVPRFX Zresult, Zop; FABS Zresult.S, Pg/M, Zop.S"
/// svfloat32_t svabs[_f32]_z(svbool_t pg, svfloat32_t op) : "MOVPRFX Zresult.S, Pg/Z, Zop.S; FABS Zresult.S, Pg/M, Zop.S"
public static unsafe Vector<float> Abs(Vector<float> value);
/// svfloat64_t svabs[_f64]_m(svfloat64_t inactive, svbool_t pg, svfloat64_t op) : "FABS Ztied.D, Pg/M, Zop.D" or "MOVPRFX Zresult, Zinactive; FABS Zresult.D, Pg/M, Zop.D"
/// svfloat64_t svabs[_f64]_x(svbool_t pg, svfloat64_t op) : "FABS Ztied.D, Pg/M, Ztied.D" or "MOVPRFX Zresult, Zop; FABS Zresult.D, Pg/M, Zop.D"
/// svfloat64_t svabs[_f64]_z(svbool_t pg, svfloat64_t op) : "MOVPRFX Zresult.D, Pg/Z, Zop.D; FABS Zresult.D, Pg/M, Zop.D"
public static unsafe Vector<double> Abs(Vector<double> value);
/// svint8_t svabs[_s8]_m(svint8_t inactive, svbool_t pg, svint8_t op) : "ABS Ztied.B, Pg/M, Zop.B" or "MOVPRFX Zresult, Zinactive; ABS Zresult.B, Pg/M, Zop.B"
/// svint8_t svabs[_s8]_x(svbool_t pg, svint8_t op) : "ABS Ztied.B, Pg/M, Ztied.B" or "MOVPRFX Zresult, Zop; ABS Zresult.B, Pg/M, Zop.B"
/// svint8_t svabs[_s8]_z(svbool_t pg, svint8_t op) : "MOVPRFX Zresult.B, Pg/Z, Zop.B; ABS Zresult.B, Pg/M, Zop.B"
public static unsafe Vector<sbyte> Abs(Vector<sbyte> value);
/// svint16_t svabs[_s16]_m(svint16_t inactive, svbool_t pg, svint16_t op) : "ABS Ztied.H, Pg/M, Zop.H" or "MOVPRFX Zresult, Zinactive; ABS Zresult.H, Pg/M, Zop.H"
/// svint16_t svabs[_s16]_x(svbool_t pg, svint16_t op) : "ABS Ztied.H, Pg/M, Ztied.H" or "MOVPRFX Zresult, Zop; ABS Zresult.H, Pg/M, Zop.H"
/// svint16_t svabs[_s16]_z(svbool_t pg, svint16_t op) : "MOVPRFX Zresult.H, Pg/Z, Zop.H; ABS Zresult.H, Pg/M, Zop.H"
public static unsafe Vector<short> Abs(Vector<short> value);
/// svint32_t svabs[_s32]_m(svint32_t inactive, svbool_t pg, svint32_t op) : "ABS Ztied.S, Pg/M, Zop.S" or "MOVPRFX Zresult, Zinactive; ABS Zresult.S, Pg/M, Zop.S"
/// svint32_t svabs[_s32]_x(svbool_t pg, svint32_t op) : "ABS Ztied.S, Pg/M, Ztied.S" or "MOVPRFX Zresult, Zop; ABS Zresult.S, Pg/M, Zop.S"
/// svint32_t svabs[_s32]_z(svbool_t pg, svint32_t op) : "MOVPRFX Zresult.S, Pg/Z, Zop.S; ABS Zresult.S, Pg/M, Zop.S"
public static unsafe Vector<int> Abs(Vector<int> value);
/// svint64_t svabs[_s64]_m(svint64_t inactive, svbool_t pg, svint64_t op) : "ABS Ztied.D, Pg/M, Zop.D" or "MOVPRFX Zresult, Zinactive; ABS Zresult.D, Pg/M, Zop.D"
/// svint64_t svabs[_s64]_x(svbool_t pg, svint64_t op) : "ABS Ztied.D, Pg/M, Ztied.D" or "MOVPRFX Zresult, Zop; ABS Zresult.D, Pg/M, Zop.D"
/// svint64_t svabs[_s64]_z(svbool_t pg, svint64_t op) : "MOVPRFX Zresult.D, Pg/Z, Zop.D; ABS Zresult.D, Pg/M, Zop.D"
public static unsafe Vector<long> Abs(Vector<long> value);
/// AbsoluteDifference : Absolute difference
/// svfloat32_t svabd[_f32]_m(svbool_t pg, svfloat32_t op1, svfloat32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svfloat32_t svabd[_f32]_x(svbool_t pg, svfloat32_t op1, svfloat32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "FABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svfloat32_t svabd[_f32]_z(svbool_t pg, svfloat32_t op1, svfloat32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; FABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
public static unsafe Vector<float> AbsoluteDifference(Vector<float> left, Vector<float> right);
/// svfloat64_t svabd[_f64]_m(svbool_t pg, svfloat64_t op1, svfloat64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svfloat64_t svabd[_f64]_x(svbool_t pg, svfloat64_t op1, svfloat64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "FABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svfloat64_t svabd[_f64]_z(svbool_t pg, svfloat64_t op1, svfloat64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; FABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
public static unsafe Vector<double> AbsoluteDifference(Vector<double> left, Vector<double> right);
/// svint8_t svabd[_s8]_m(svbool_t pg, svint8_t op1, svint8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svint8_t svabd[_s8]_x(svbool_t pg, svint8_t op1, svint8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "SABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svint8_t svabd[_s8]_z(svbool_t pg, svint8_t op1, svint8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; SABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
public static unsafe Vector<sbyte> AbsoluteDifference(Vector<sbyte> left, Vector<sbyte> right);
/// svint16_t svabd[_s16]_m(svbool_t pg, svint16_t op1, svint16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svint16_t svabd[_s16]_x(svbool_t pg, svint16_t op1, svint16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "SABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svint16_t svabd[_s16]_z(svbool_t pg, svint16_t op1, svint16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; SABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
public static unsafe Vector<short> AbsoluteDifference(Vector<short> left, Vector<short> right);
/// svint32_t svabd[_s32]_m(svbool_t pg, svint32_t op1, svint32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svint32_t svabd[_s32]_x(svbool_t pg, svint32_t op1, svint32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "SABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svint32_t svabd[_s32]_z(svbool_t pg, svint32_t op1, svint32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; SABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
public static unsafe Vector<int> AbsoluteDifference(Vector<int> left, Vector<int> right);
/// svint64_t svabd[_s64]_m(svbool_t pg, svint64_t op1, svint64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svint64_t svabd[_s64]_x(svbool_t pg, svint64_t op1, svint64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "SABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svint64_t svabd[_s64]_z(svbool_t pg, svint64_t op1, svint64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; SABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
public static unsafe Vector<long> AbsoluteDifference(Vector<long> left, Vector<long> right);
/// svuint8_t svabd[_u8]_m(svbool_t pg, svuint8_t op1, svuint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svuint8_t svabd[_u8]_x(svbool_t pg, svuint8_t op1, svuint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "UABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svuint8_t svabd[_u8]_z(svbool_t pg, svuint8_t op1, svuint8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; UABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
public static unsafe Vector<byte> AbsoluteDifference(Vector<byte> left, Vector<byte> right);
/// svuint16_t svabd[_u16]_m(svbool_t pg, svuint16_t op1, svuint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svuint16_t svabd[_u16]_x(svbool_t pg, svuint16_t op1, svuint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "UABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svuint16_t svabd[_u16]_z(svbool_t pg, svuint16_t op1, svuint16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; UABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
public static unsafe Vector<ushort> AbsoluteDifference(Vector<ushort> left, Vector<ushort> right);
/// svuint32_t svabd[_u32]_m(svbool_t pg, svuint32_t op1, svuint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svuint32_t svabd[_u32]_x(svbool_t pg, svuint32_t op1, svuint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "UABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svuint32_t svabd[_u32]_z(svbool_t pg, svuint32_t op1, svuint32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; UABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
public static unsafe Vector<uint> AbsoluteDifference(Vector<uint> left, Vector<uint> right);
/// svuint64_t svabd[_u64]_m(svbool_t pg, svuint64_t op1, svuint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svuint64_t svabd[_u64]_x(svbool_t pg, svuint64_t op1, svuint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "UABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svuint64_t svabd[_u64]_z(svbool_t pg, svuint64_t op1, svuint64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; UABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
public static unsafe Vector<ulong> AbsoluteDifference(Vector<ulong> left, Vector<ulong> right);
/// svfloat32_t svabd[_n_f32]_m(svbool_t pg, svfloat32_t op1, float32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svfloat32_t svabd[_n_f32]_x(svbool_t pg, svfloat32_t op1, float32_t op2) : "FABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "FABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svfloat32_t svabd[_n_f32]_z(svbool_t pg, svfloat32_t op1, float32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; FABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; FABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
public static unsafe Vector<float> AbsoluteDifference(Vector<float> left, float right);
/// svfloat64_t svabd[_n_f64]_m(svbool_t pg, svfloat64_t op1, float64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svfloat64_t svabd[_n_f64]_x(svbool_t pg, svfloat64_t op1, float64_t op2) : "FABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "FABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svfloat64_t svabd[_n_f64]_z(svbool_t pg, svfloat64_t op1, float64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; FABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; FABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
public static unsafe Vector<double> AbsoluteDifference(Vector<double> left, double right);
/// svint8_t svabd[_n_s8]_m(svbool_t pg, svint8_t op1, int8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svint8_t svabd[_n_s8]_x(svbool_t pg, svint8_t op1, int8_t op2) : "SABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "SABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svint8_t svabd[_n_s8]_z(svbool_t pg, svint8_t op1, int8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; SABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; SABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
public static unsafe Vector<sbyte> AbsoluteDifference(Vector<sbyte> left, sbyte right);
/// svint16_t svabd[_n_s16]_m(svbool_t pg, svint16_t op1, int16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svint16_t svabd[_n_s16]_x(svbool_t pg, svint16_t op1, int16_t op2) : "SABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "SABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svint16_t svabd[_n_s16]_z(svbool_t pg, svint16_t op1, int16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; SABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; SABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
public static unsafe Vector<short> AbsoluteDifference(Vector<short> left, short right);
/// svint32_t svabd[_n_s32]_m(svbool_t pg, svint32_t op1, int32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svint32_t svabd[_n_s32]_x(svbool_t pg, svint32_t op1, int32_t op2) : "SABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "SABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svint32_t svabd[_n_s32]_z(svbool_t pg, svint32_t op1, int32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; SABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; SABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
public static unsafe Vector<int> AbsoluteDifference(Vector<int> left, int right);
/// svint64_t svabd[_n_s64]_m(svbool_t pg, svint64_t op1, int64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svint64_t svabd[_n_s64]_x(svbool_t pg, svint64_t op1, int64_t op2) : "SABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "SABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svint64_t svabd[_n_s64]_z(svbool_t pg, svint64_t op1, int64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; SABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; SABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
public static unsafe Vector<long> AbsoluteDifference(Vector<long> left, long right);
/// svuint8_t svabd[_n_u8]_m(svbool_t pg, svuint8_t op1, uint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svuint8_t svabd[_n_u8]_x(svbool_t pg, svuint8_t op1, uint8_t op2) : "UABD Ztied1.B, Pg/M, Ztied1.B, Zop2.B" or "UABD Ztied2.B, Pg/M, Ztied2.B, Zop1.B" or "MOVPRFX Zresult, Zop1; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B"
/// svuint8_t svabd[_n_u8]_z(svbool_t pg, svuint8_t op1, uint8_t op2) : "MOVPRFX Zresult.B, Pg/Z, Zop1.B; UABD Zresult.B, Pg/M, Zresult.B, Zop2.B" or "MOVPRFX Zresult.B, Pg/Z, Zop2.B; UABD Zresult.B, Pg/M, Zresult.B, Zop1.B"
public static unsafe Vector<byte> AbsoluteDifference(Vector<byte> left, byte right);
/// svuint16_t svabd[_n_u16]_m(svbool_t pg, svuint16_t op1, uint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svuint16_t svabd[_n_u16]_x(svbool_t pg, svuint16_t op1, uint16_t op2) : "UABD Ztied1.H, Pg/M, Ztied1.H, Zop2.H" or "UABD Ztied2.H, Pg/M, Ztied2.H, Zop1.H" or "MOVPRFX Zresult, Zop1; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H"
/// svuint16_t svabd[_n_u16]_z(svbool_t pg, svuint16_t op1, uint16_t op2) : "MOVPRFX Zresult.H, Pg/Z, Zop1.H; UABD Zresult.H, Pg/M, Zresult.H, Zop2.H" or "MOVPRFX Zresult.H, Pg/Z, Zop2.H; UABD Zresult.H, Pg/M, Zresult.H, Zop1.H"
public static unsafe Vector<ushort> AbsoluteDifference(Vector<ushort> left, ushort right);
/// svuint32_t svabd[_n_u32]_m(svbool_t pg, svuint32_t op1, uint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svuint32_t svabd[_n_u32]_x(svbool_t pg, svuint32_t op1, uint32_t op2) : "UABD Ztied1.S, Pg/M, Ztied1.S, Zop2.S" or "UABD Ztied2.S, Pg/M, Ztied2.S, Zop1.S" or "MOVPRFX Zresult, Zop1; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S"
/// svuint32_t svabd[_n_u32]_z(svbool_t pg, svuint32_t op1, uint32_t op2) : "MOVPRFX Zresult.S, Pg/Z, Zop1.S; UABD Zresult.S, Pg/M, Zresult.S, Zop2.S" or "MOVPRFX Zresult.S, Pg/Z, Zop2.S; UABD Zresult.S, Pg/M, Zresult.S, Zop1.S"
public static unsafe Vector<uint> AbsoluteDifference(Vector<uint> left, uint right);
/// svuint64_t svabd[_n_u64]_m(svbool_t pg, svuint64_t op1, uint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svuint64_t svabd[_n_u64]_x(svbool_t pg, svuint64_t op1, uint64_t op2) : "UABD Ztied1.D, Pg/M, Ztied1.D, Zop2.D" or "UABD Ztied2.D, Pg/M, Ztied2.D, Zop1.D" or "MOVPRFX Zresult, Zop1; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D"
/// svuint64_t svabd[_n_u64]_z(svbool_t pg, svuint64_t op1, uint64_t op2) : "MOVPRFX Zresult.D, Pg/Z, Zop1.D; UABD Zresult.D, Pg/M, Zresult.D, Zop2.D" or "MOVPRFX Zresult.D, Pg/Z, Zop2.D; UABD Zresult.D, Pg/M, Zresult.D, Zop1.D"
public static unsafe Vector<ulong> AbsoluteDifference(Vector<ulong> left, ulong right);
....SNIP.....
/// SubtractSaturate : Saturating subtract
/// svint8_t svqsub[_s8](svint8_t op1, svint8_t op2) : "SQSUB Zresult.B, Zop1.B, Zop2.B"
public static unsafe Vector<sbyte> SubtractSaturate(Vector<sbyte> left, Vector<sbyte> right);
/// svint16_t svqsub[_s16](svint16_t op1, svint16_t op2) : "SQSUB Zresult.H, Zop1.H, Zop2.H"
public static unsafe Vector<short> SubtractSaturate(Vector<short> left, Vector<short> right);
/// svint32_t svqsub[_s32](svint32_t op1, svint32_t op2) : "SQSUB Zresult.S, Zop1.S, Zop2.S"
public static unsafe Vector<int> SubtractSaturate(Vector<int> left, Vector<int> right);
/// svint64_t svqsub[_s64](svint64_t op1, svint64_t op2) : "SQSUB Zresult.D, Zop1.D, Zop2.D"
public static unsafe Vector<long> SubtractSaturate(Vector<long> left, Vector<long> right);
/// svuint8_t svqsub[_u8](svuint8_t op1, svuint8_t op2) : "UQSUB Zresult.B, Zop1.B, Zop2.B"
public static unsafe Vector<byte> SubtractSaturate(Vector<byte> left, Vector<byte> right);
/// svuint16_t svqsub[_u16](svuint16_t op1, svuint16_t op2) : "UQSUB Zresult.H, Zop1.H, Zop2.H"
public static unsafe Vector<ushort> SubtractSaturate(Vector<ushort> left, Vector<ushort> right);
/// svuint32_t svqsub[_u32](svuint32_t op1, svuint32_t op2) : "UQSUB Zresult.S, Zop1.S, Zop2.S"
public static unsafe Vector<uint> SubtractSaturate(Vector<uint> left, Vector<uint> right);
/// svuint64_t svqsub[_u64](svuint64_t op1, svuint64_t op2) : "UQSUB Zresult.D, Zop1.D, Zop2.D"
public static unsafe Vector<ulong> SubtractSaturate(Vector<ulong> left, Vector<ulong> right);
/// svint8_t svqsub[_n_s8](svint8_t op1, int8_t op2) : "SQSUB Ztied1.B, Ztied1.B, #op2" or "SQADD Ztied1.B, Ztied1.B, #-op2" or "SQSUB Zresult.B, Zop1.B, Zop2.B"
public static unsafe Vector<sbyte> SubtractSaturate(Vector<sbyte> left, sbyte right);
/// svint16_t svqsub[_n_s16](svint16_t op1, int16_t op2) : "SQSUB Ztied1.H, Ztied1.H, #op2" or "SQADD Ztied1.H, Ztied1.H, #-op2" or "SQSUB Zresult.H, Zop1.H, Zop2.H"
public static unsafe Vector<short> SubtractSaturate(Vector<short> left, short right);
/// svint32_t svqsub[_n_s32](svint32_t op1, int32_t op2) : "SQSUB Ztied1.S, Ztied1.S, #op2" or "SQADD Ztied1.S, Ztied1.S, #-op2" or "SQSUB Zresult.S, Zop1.S, Zop2.S"
public static unsafe Vector<int> SubtractSaturate(Vector<int> left, int right);
/// svint64_t svqsub[_n_s64](svint64_t op1, int64_t op2) : "SQSUB Ztied1.D, Ztied1.D, #op2" or "SQADD Ztied1.D, Ztied1.D, #-op2" or "SQSUB Zresult.D, Zop1.D, Zop2.D"
public static unsafe Vector<long> SubtractSaturate(Vector<long> left, long right);
/// svuint8_t svqsub[_n_u8](svuint8_t op1, uint8_t op2) : "UQSUB Ztied1.B, Ztied1.B, #op2" or "UQSUB Zresult.B, Zop1.B, Zop2.B"
public static unsafe Vector<byte> SubtractSaturate(Vector<byte> left, byte right);
/// svuint16_t svqsub[_n_u16](svuint16_t op1, uint16_t op2) : "UQSUB Ztied1.H, Ztied1.H, #op2" or "UQSUB Zresult.H, Zop1.H, Zop2.H"
public static unsafe Vector<ushort> SubtractSaturate(Vector<ushort> left, ushort right);
/// svuint32_t svqsub[_n_u32](svuint32_t op1, uint32_t op2) : "UQSUB Ztied1.S, Ztied1.S, #op2" or "UQSUB Zresult.S, Zop1.S, Zop2.S"
public static unsafe Vector<uint> SubtractSaturate(Vector<uint> left, uint right);
/// svuint64_t svqsub[_n_u64](svuint64_t op1, uint64_t op2) : "UQSUB Ztied1.D, Ztied1.D, #op2" or "UQSUB Zresult.D, Zop1.D, Zop2.D"
public static unsafe Vector<ulong> SubtractSaturate(Vector<ulong> left, ulong right);
// total ACLE covered: 390
// total method signatures: 158
// total method names: 11
}
Adding these gives the listing above. There's lots of parsing needed to fix up namings and types. Plus, predicates have been stripped out where necessary (e.g. the `svbool_t pg` argument in the `svabd` forms above has no counterpart in the `AbsoluteDifference` signatures). If there's nothing obvious that needs fixing in the above, then I can start posting a few of them as API requests next week.
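As a quick usage illustration (not part of the generated output), here is a minimal sketch combining two of the overloads listed above. It assumes the class is exposed as `Sve` as in the earlier examples; the helper name `SaturatingDelta` is purely hypothetical:

public static Vector<uint> SaturatingDelta(Vector<uint> left, Vector<uint> right, uint bias)
{
    // Per-lane |left - right|, using the AbsoluteDifference(Vector<uint>, Vector<uint>) overload above (UABD).
    Vector<uint> delta = Sve.AbsoluteDifference(left, right);

    // Subtract a scalar bias from every lane, saturating at zero instead of wrapping (UQSUB).
    return Sve.SubtractSaturate(delta, bias);
}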
At a glance those look correct. Noting that for the purposes of API review, having bigger issues is fine and a lot of the information can be compressed down. Consider for example how we did AVX512F here: #73604. We have all the necessary APIs, but we don't provide information unnecessary to API review, such as what instruction or native API it maps to. Given how large […]
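As an illustration only (not the actual review shape), compressing the `SubtractSaturate` group above in that style would reduce it to just the signatures; the class header line here is a placeholder:

public abstract partial class Sve // placeholder declaration for illustration
{
    public static unsafe Vector<sbyte>  SubtractSaturate(Vector<sbyte> left, Vector<sbyte> right);
    public static unsafe Vector<short>  SubtractSaturate(Vector<short> left, Vector<short> right);
    public static unsafe Vector<int>    SubtractSaturate(Vector<int> left, Vector<int> right);
    public static unsafe Vector<long>   SubtractSaturate(Vector<long> left, Vector<long> right);
    public static unsafe Vector<byte>   SubtractSaturate(Vector<byte> left, Vector<byte> right);
    public static unsafe Vector<ushort> SubtractSaturate(Vector<ushort> left, Vector<ushort> right);
    public static unsafe Vector<uint>   SubtractSaturate(Vector<uint> left, Vector<uint> right);
    public static unsafe Vector<ulong>  SubtractSaturate(Vector<ulong> left, Vector<ulong> right);
    // ...scalar-right overloads and the remaining method groups elided...
}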
We aren't going to reject an API for a particular […]. For […], this is going to be very hard to represent to users "correctly", as it requires […].