Loading wider type than the type of the Buffer #6756
I wish I had an answer to this, but it's also something that I've struggled with. I was battling it in the context of generating low-bit-width noise (e.g. 8-bit noise) for dithering. We can generate 32-bit noise, and I'd like to generate e.g. 4 x 32-bit noise and reinterpret it as 16 x 8-bit noise. One way to get it would be to do something like your code above, but vectorize r instead of unrolling it, and then have a VectorReduce operator which is a bit-concatenation reduction, which would compile to a no-op. That seems very complicated and fragile, though.
The general case of accessing data with unusual layouts comes up a fair bit. The simplest one is interleaved vs. planar, which we support with strides and constraints on inputs/outputs, but this causes consistent difficulty for every new Halide user. Sometimes one wants to access data that is in arrays of structs, and in an extreme case those can have bitfields. Non-uniform sampling in YUV is another case. The similarity to the issue here is that there is a lot of need for encapsulation of these patterns, including accessing integer data at different bit widths than it is declared with. It seems introducing a set of encapsulations along the lines of BoundaryConditions would be useful. Ideally these encapsulations would simply provide a new Func that is naturally indexable but that can also be scheduled in a meaningful way. The scheduling might be specified as an input to the encapsulation that makes the Func, or it could be that the encapsulation returns all the handles necessary to schedule the internal logic. (Or we could use Generators, maybe...) In this particular case, an API sketch looks like:
Regarding scheduling for something like this, we'd want to make sure it vectorizes as the enclosing context does, though that's tricky: while the number of bits of input and output is the same, the number of items is not. One mostly wants to make sure the vector loads are coherent, not repeated.
Thanks for taking a look! FWIW, endianness is indeed the next problem here.
Yes, it is absolutely important to be able to access data in whatever layout it is in. I wrote the comment on "Native" to make sure folks knew, e.g., that "Native" probably would never test the big-endian path and the like.
A few more points, just to spell them out explicitly:
Ok, so I'm looking at implementing this, and I'm not sure I follow. I need some more hints.
It is obvious how to provide such wrappers for extracting sub-integers out of an integer. The main problem that motivated me to file this issue is the fact that
So the fact that the encapsulation doesn't nail the performance yet is not the main concern. The question is: can an encapsulation, plus some work on Halide, get the performance for some cases?
I'm still very much in the dark regarding Halide internals. In principle, so far, the missing optimization here is that given
As discussed, this implements a utility to make it easier to write code that wants to operate either on smaller chunks than the whole type, or on several consecutive elements as one large element. The bigger picture is that it is desirable to perform load widening in such situations; while this doesn't do that, at least having a common interface should be a step in that direction. I took the liberty of adding/exposing some QoL variable-bit-length operations in IROperator while there. I believe this has sufficient test coverage now. While implementing, I stumbled upon and fixed halide#6782. Refs. halide#6756
Ok, I think #6775 is roughly ready.
Trying to think of the actual load widening now.
I'm thinking the first approach is more general and cheaper/simpler, so that is what I will look into. Now that I think about it, the big problem is that the
…terpret Second (of three) pieces of the load-widening puzzle. Here, the codegen is taught to directly emit scalar loads, instead of doing a vector load and `bitcast`ing it. Refs. halide#6801 Refs. halide#6756 Refs. halide#6775
Problem: given a buffer of `uint8_t`s, how does one perform a load of a `uint64_t` from it?
It's possible I'm completely missing it, but so far I have not found a reasonable solution.
The obvious way is (for little-endian loads, that is):
(full code and dumps: https://godbolt.org/z/fhs9ajqhK)
But that doesn't really work: Halide performs u8 loads, and LLVM happily misses the point — it has no middle-end load widening, pulls the sequence apart during middle-end optimizations, and by back-end ISel there is no real way to recover.
What does work is:
(https://godbolt.org/z/hM844f37E)
... but something tells me I'm not supposed to use stuff from `Internal` directly... Thoughts?