Loading wider type than the type of the Buffer #6756

Closed

LebedevRI opened this issue May 9, 2022 · 10 comments

@LebedevRI
Contributor

Problem: given a buffer of uint8_t's, how does one perform a load of a uint64_t from it?
It's possible I'm completely missing it, but so far I have not found a reasonable solution.

The obvious way (for little-endian loads, that is) is:

  // CACHE_TYPE is the wide type being assembled and CACHE_BYTES its size in
  // bytes (both defined in the full example linked below).
  ImageParam input(type_of<uint8_t>(), 1, "input");

  Var elt("elt");

  Func wide_load("wide_load");
  wide_load(elt) = cast(CACHE_TYPE, input(elt));
  RDom r(0, CACHE_BYTES);
  wide_load(elt) = wide_load(elt) | (cast(CACHE_TYPE, input(elt + r)) << (8 * r));

  wide_load.update(0).unroll(r);

  Func cache("cache");
  cache(elt) = wide_load(elt);

(full code and dumps: https://godbolt.org/z/fhs9ajqhK)
But that doesn't really work: Halide performs u8 loads, and LLVM happily misses the point.
It has no middle-end load widening, pulls the sequence apart during middle-end optimizations,
and by the time the back-end ISel runs, there is no real way to recover.

What does work is:

  // Collect CACHE_BYTES consecutive bytes, bit-concatenate them, and
  // reinterpret the concatenation as the wide type.
  std::vector<Expr> input_bytes_le;
  for (int i = 0; i != CACHE_BYTES; ++i)
    input_bytes_le.push_back(input(input_offset + i));

  Func cache("cache");
  cache(elt) =
      reinterpret(CACHE_TYPE, Internal::Shuffle::make_concat(input_bytes_le));

(https://godbolt.org/z/hM844f37E)
... but something tells me I'm not supposed to use stuff from Internal directly...

Thoughts?

@abadams
Member

abadams commented May 10, 2022

I wish I had an answer to this, but it's also something that I've struggled with. I was battling it in the context of getting low-bit-width noise (e.g. 8-bit noise) for dithering. We can generate 32-bit noise and I'd like to generate e.g. 4 x 32-bit noise and reinterpret it as 16 x 8-bit noise.

One way to get it would be to do something like your code above, but vectorize r instead of unrolling it, and then have a VectorReduce operator which is a bit-concatenation reduction, which would compile to a no-op. That seems very complicated and fragile though.
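(Not from the thread, just for concreteness: the narrowing view can at least be expressed today with plain shifts and casts, assuming a hypothetical Func noise32 that yields one uint32_t per index; whether it lowers to the no-op reinterpret described above is exactly the open question.)

  // Sketch: each output byte x is byte (x % 4) of the 32-bit value at
  // index x / 4 (little-endian). "noise32" is a hypothetical Func.
  Var x("x");
  Func noise8("noise8");
  noise8(x) = cast<uint8_t>(noise32(x / 4) >> (8 * (x % 4)));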

@zvookin
Member

zvookin commented May 10, 2022

The general case of accessing data with unusual layouts comes up a fair bit. The simplest one is just interleaved vs. planar, which we support with strides and constraints on inputs/outputs, but this causes consistent difficulty for every new Halide user. Sometimes one wants to access data that is in arrays of structs, and in an extreme case those can have bitfields. Non-uniform sampling in YUV is another case. The similarity to the issue here is that there is a lot of need for encapsulation of these patterns, including accessing integer data at different bit widths than it is declared with.

It seems introducing a set of encapsulations along the lines of BoundaryConditions would be useful.

Ideally these encapsulations would simply provide a new Func that is naturally indexable but that can also be scheduled in a meaningful way. The scheduling might be specified as an input to the encapsulation that makes the Func, or it could be that the encapsulation returns all the handles necessary to schedule the internal logic. (Or we could use Generators, maybe...)

In this particular case, an API sketch looks like:

enum class ByteOrder {
    Native,  // Basically the same as LE now as Halide doesn't support big endian architectures...
    LittleEndian,
    BigEndian,
};

template <typename result_type>
Func change_type(const Func &input, const VarOrRVar &dim, ByteOrder byte_order = ByteOrder::Native) {
}

The indexing of dim changes scale depending on whether the new type is larger or smaller than the existing one. An option to add or remove a dimension could also be provided. (E.g. if there is an index going from 0 to 3 to enumerate the bytes in a 32-bit integer, that dim would go away in a conversion from 8-bit to 32-bit. Ditto introducing such a dim going the other way. This can also be done with fuse and split scheduling, but that only affects scheduling, and one would need to introduce a wrapper Func to change how the indexing is done while writing Halide using the returned Func.)

Per scheduling for something like this, we'd want to make sure it vectorizes with the enclosing context, though that's tricky: while the number of bits of input and output is the same, the number of items is not. One mostly wants to make sure the vector loads are coherent, not repeated.
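(For concreteness, a hypothetical use of the change_type sketch above; change_type and ByteOrder are the proposal, not existing Halide API:)

  // View a byte stream as little-endian 32-bit words via the proposed helper.
  ImageParam bytes(UInt(8), 1, "bytes");
  Var x("x");
  Func bytes_f("bytes_f");
  bytes_f(x) = bytes(x);
  Func words = change_type<uint32_t>(bytes_f, x, ByteOrder::LittleEndian);
  // words(i) would cover bytes(4 * i) .. bytes(4 * i + 3).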

@LebedevRI
Contributor Author

Thanks for taking a look!

FWIW, endianness is indeed the next problem here.
It is not relevant that Halide, at present, only supports little-endian machines (both host and target).
The incoming data may have been produced on a big-endian machine and stored in its native endianness,
so there really needs to be a way to byteswap. This is not a theoretical concern.
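(As an illustration only, not proposed API: reading big-endian 16-bit samples out of a byte buffer today means spelling the swap out by hand, e.g.:)

  // Illustrative only: assemble big-endian uint16_t values from a
  // hypothetical 1-D uint8_t input by hand.
  ImageParam bytes(UInt(8), 1, "bytes");
  Var x("x");
  Func be16("be16");
  be16(x) = (cast<uint16_t>(bytes(2 * x)) << 8) |
            cast<uint16_t>(bytes(2 * x + 1));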

@zvookin
Member

zvookin commented May 10, 2022

Yes, it is absolutely important to be able to access data in whatever layout it is in. I wrote the comment on "Native" mainly to make sure folks knew that "Native" would probably never exercise the BE path, and such.

@LebedevRI
Contributor Author

A few more points, just to spell them out explicitly:

  • Widening would effectively have to deal with the same legality reasoning that loop unrolling/vectorization does,
    because the input element count may not be a multiple of the rescaling factor (3 x u8 is 1.5 x u16),
    yet rounding down to a multiple is not always the right solution, since it's lossy.
    Perhaps most often, one would want those missing bytes to be either zeros or maybe indeterminate garbage
    (see the sketch after this list).
  • (obvious) The step may also not be a multiple of the wider type, but I suppose that will just work.
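(A minimal sketch of the zero-fill option, using the existing BoundaryConditions machinery; this is one possible policy, not a decided design:)

  // Zero-fill reads past the end of the byte buffer, so a widened load of
  // the last, partial element sees zeros in its missing high bytes.
  // "input" mirrors the 1-D uint8_t ImageParam from the example above.
  ImageParam input(UInt(8), 1, "input");
  Func padded = BoundaryConditions::constant_exterior(input, cast<uint8_t>(0));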

@LebedevRI
Contributor Author

> It seems introducing a set of encapsulations along the lines of BoundaryConditions would be useful.
> In this particular case, an API sketch looks like:

OK, so I'm looking at implementing this, and I'm not sure I follow. I need some more hints.

BoundaryConditions are modelled as thin wrappers and don't require modifications
to Halide itself, in the sense that they could be implemented in user code too.

It is obvious how to provide such wrappers for extracting sub-integers out of an integer,
or producing a wider integer by concatenating smaller integers.
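(For instance, a purely user-level sketch of the widening wrapper, with illustrative names, little-endian byte order, and no tail handling; note this is exactly the naive form discussed below:)

  // Build a uint32_t view of a uint8_t Func by shifting and OR-ing four
  // consecutive bytes (illustrative sketch, not an existing Halide helper).
  Func widen_u8_to_u32(Func bytes, Var x) {
      Func out(bytes.name() + "_as_u32");
      Expr w = cast<uint32_t>(bytes(4 * x));
      for (int i = 1; i < 4; i++) {
          w = w | (cast<uint32_t>(bytes(4 * x + i)) << (8 * i));
      }
      out(x) = w;
      return out;
  }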

The main problem that motivated me to file this issue is the fact that
such a naive/native solution leads to poor results, at least, and especially, for the widening case.
How would providing those helpers help solve that problem?

@zvookin
Member

zvookin commented May 19, 2022

BoundaryConditions required adding likely and has seen multiple revisions to the compilation strategy for loop splitting. It's probably still not ideal. We'll need to change the implementation inside BoundaryConditions to utilize texture-unit support for boundaries (e.g. adding an intrinsic to carry the boundary mode through). The compilation approach is also likely different for vector architectures with pervasive predication or length-bounded vectors. So far we still believe it's a good abstraction, but I had hoped it would have been more reliably excellent in performance by now.
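(Roughly, and as a simplified sketch rather than the actual library code, a 1-D repeat_edge-style wrapper looks like this; likely() marks the in-bounds fast path so loop partitioning can peel the boundary iterations:)

  // Simplified sketch of a repeat_edge-style boundary condition in 1-D.
  Func repeat_edge_1d(ImageParam in) {
      Var x("x");
      Func out("repeat_edge_1d");
      Expr lo = in.dim(0).min();
      Expr hi = in.dim(0).min() + in.dim(0).extent() - 1;
      out(x) = in(clamp(likely(x), lo, hi));
      return out;
  }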

So the fact that the encapsulation doesn't nail the performance yet is not the main concern. The question is: can an encapsulation, plus some work on Halide, get the performance, at least for some cases?

@LebedevRI
Contributor Author

> The question is: can an encapsulation, plus some work on Halide, get the performance, at least for some cases?

I'm still very much in the dark regarding Halide internals,
and since the answer to this question depends on them,
I can only guess :)

In principle, so far, the missing optimization here is that given a Buffer<uint8_t>,
loading two consecutive elements and concatenating them should be lowered as a single load.
Can that be done?

LebedevRI added a commit to LebedevRI/Halide that referenced this issue May 29, 2022
As discussed, this implements a utility to make it easier
to write code that wants to essentially operate either on
smaller chunks than the whole type, or on several
consecutive elements as one large element.

The bigger picture is that it is desirable to perform load
widening in such situations, and while this doesn't do that,
at least having a common interface should be a step in that direction.

I took the liberty of adding/exposing some QoL variable-bit-length
operations in IROperator while there.

I believe this has sufficient test coverage now.
While implementing, I stumbled upon & fixed
halide#6782

Refs. halide#6756
@LebedevRI
Contributor Author

OK, I think #6775 is roughly ready.

@LebedevRI
Contributor Author

Trying to think about the actual load widening now.
As I see it, there are two general strategies:

  • Finding all loads, recognizing "chains" (where one load is at index and another at index + 1), creating a single wide load, and replacing the narrow loads with extracts out of the wide load
  • Recursively analyzing each Expr to see whether we can recognize & replace the whole Expr with a complete wide load

I'm thinking the first approach is more general and cheaper/simpler, so that is what I will look into.

Now that I think about it, the big problem is that the Load expr takes an element index,
so given a load of i8 at an 'opaque' index i, we can't actually represent that
as the low 8 bits of a load of i16 at index i/2, because i might not be a multiple of 2...
This calls for a new MisalignedLoad expr?

LebedevRI added a commit to LebedevRI/Halide that referenced this issue Jun 14, 2022
LebedevRI added a commit to LebedevRI/Halide that referenced this issue Jun 14, 2022
…terpret

Second (of three) pieces of the load-widening puzzle.
Here, the codegen is taught to directly emit scalar loads,
instead of doing a vector load and `bitcast`ing it.

Refs. halide#6801
Refs. halide#6756
Refs. halide#6775
LebedevRI added a commit to LebedevRI/Halide that referenced this issue Aug 8, 2022
@LebedevRI closed this as not planned Aug 9, 2022