Loading wider type than the type of the Buffer #6756

Closed

LebedevRI opened this issue May 9, 2022 · 10 comments

@LebedevRI
Contributor

Problem: given a buffer of uint8_t's, how does one perform a load of a uint64_t from it?
It's possible I'm completely missing it, but so far I have not found a reasonable solution.

The obvious way (for little-endian loads, that is) is:

  // CACHE_TYPE is the wide type being assembled and CACHE_BYTES its size in
  // bytes (both defined in the full example linked below).
  ImageParam input(type_of<uint8_t>(), 1, "input");

  Var elt("elt");

  Func wide_load("wide_load");
  wide_load(elt) = cast(CACHE_TYPE, input(elt));
  RDom r(0, CACHE_BYTES);
  wide_load(elt) = wide_load(elt) | (cast(CACHE_TYPE, input(elt + r)) << (8 * r));

  wide_load.update(0).unroll(r);

  Func cache("cache");
  cache(elt) = wide_load(elt);

(full code and dumps: https://godbolt.org/z/fhs9ajqhK)
But that doesn't really work: Halide performs u8 loads, and LLVM happily misses the point.
It has no middle-end load widening, pulls the sequence apart during middle-end optimizations,
and by the time the back-end ISel runs, there is no real way to recover.

What does work is:

  // Collect CACHE_BYTES consecutive bytes, bit-concatenate them, and
  // reinterpret the concatenation as the wide type.
  std::vector<Expr> input_bytes_le;
  for (int i = 0; i != CACHE_BYTES; ++i)
    input_bytes_le.push_back(input(input_offset + i));

  Func cache("cache");
  cache(elt) =
      reinterpret(CACHE_TYPE, Internal::Shuffle::make_concat(input_bytes_le));

(https://godbolt.org/z/hM844f37E)
... but something tells me I'm not supposed to use stuff from Internal directly...

Thoughts?

@abadams
Member

abadams commented May 10, 2022

I wish I had an answer to this, but it's also something that I've struggled with. I was battling it in the context of getting low-bit-width noise (e.g. 8-bit noise) for dithering. We can generate 32-bit noise and I'd like to generate e.g. 4 x 32-bit noise and reinterpret it as 16 x 8-bit noise.

One way to get it would be to do something like your code above, but vectorize r instead of unrolling it, and then have a VectorReduce operator which is a bit-concatenation reduction, which would compile to a no-op. That seems very complicated and fragile though.
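(Not from the thread, just for concreteness: the narrowing view can at least be expressed today with plain shifts and casts, assuming a hypothetical Func noise32 that yields one uint32_t per index; whether it lowers to the no-op reinterpret described above is exactly the open question.)

  // Sketch: each output byte x is byte (x % 4) of the 32-bit value at
  // index x / 4 (little-endian). "noise32" is a hypothetical Func.
  Var x("x");
  Func noise8("noise8");
  noise8(x) = cast<uint8_t>(noise32(x / 4) >> (8 * (x % 4)));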

@zvookin
Member

zvookin commented May 10, 2022

The general case of accessing data with unusual layouts comes up a fair bit. The simplest one is just interleaved vs. planar, which we support with strides and constraints on inputs/outputs, but this causes consistent difficulty for every new Halide user. Sometimes one wants to access data that is in arrays of structs, and in an extreme case those can have bitfields. Non-uniform sampling in YUV is another case. The similarity to the issue here is that there is a lot of need for encapsulation of these patterns, including accessing integer data at different bit widths than it is declared with.

It seems introducing a set of encapsulations along the lines of BoundaryConditions would be useful.

Ideally these encapsulations would simply provide a new Func that is naturally indexable but that can also be scheduled in a meaningful way. The scheduling might be specified as an input to the encapsulation that makes the Func, or it could be that the encapsulation returns all the handles necessary to schedule the internal logic. (Or we could use Generators, maybe...)

In this particular case, an API sketch looks like:

enum class ByteOrder {
    Native,  // Basically the same as LE now as Halide doesn't support big endian architectures...
    LittleEndian,
    BigEndian,
};

template <typename result_type>
Func change_type(const Func &input, const VarOrRVar &dim, ByteOrder byte_order = ByteOrder::Native) {
}

The indexing of dim changes scale depending on whether the new type is larger or smaller than the existing one. An option to add or remove a dimension could also be provided. (E.g. if there is an index going from 0 to 3 to enumerate the bytes in a 32-bit integer, that dim would go away in a conversion from 8-bit to 32-bit. Ditto introducing such a dim going the other way. This can also be done with fuse and split scheduling, but that only affects scheduling, and one would need to introduce a wrapper Func to change how the indexing is done while writing Halide using the returned Func.)

Per scheduling for something like this, we'd want to make sure it vectorizes with the enclosing context, though that's tricky: while the number of bits of input and output is the same, the number of items is not. One mostly wants to make sure the vector loads are coherent, not repeated.
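(For concreteness, a hypothetical use of the change_type sketch above; change_type and ByteOrder are the proposal, not existing Halide API:)

  // View a byte stream as little-endian 32-bit words via the proposed helper.
  ImageParam bytes(UInt(8), 1, "bytes");
  Var x("x");
  Func bytes_f("bytes_f");
  bytes_f(x) = bytes(x);
  Func words = change_type<uint32_t>(bytes_f, x, ByteOrder::LittleEndian);
  // words(i) would cover bytes(4 * i) .. bytes(4 * i + 3).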

@LebedevRI
Contributor Author

Thanks for taking a look!

FWIW, endianness is indeed the next problem here.
It is not relevant that Halide, at present, only supports little-endian machines (both host and target).
The incoming data may have been produced on a big-endian machine and stored in its native endianness,
so there really needs to be a way to byteswap. This is not a theoretical concern.
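(As an illustration only, not proposed API: reading big-endian 16-bit samples out of a byte buffer today means spelling the swap out by hand, e.g.:)

  // Illustrative only: assemble big-endian uint16_t values from a
  // hypothetical 1-D uint8_t input by hand.
  ImageParam bytes(UInt(8), 1, "bytes");
  Var x("x");
  Func be16("be16");
  be16(x) = (cast<uint16_t>(bytes(2 * x)) << 8) |
            cast<uint16_t>(bytes(2 * x + 1));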

@zvookin
Member

zvookin commented May 10, 2022

Yes, it is absolutely important to be able to access data in whatever layout it is in. I wrote the comment on "Native" mainly to make sure folks knew that "Native" would probably never exercise the BE path, and such.

@LebedevRI
Contributor Author

A few more points, just to spell them out explicitly:

  • Widening would effectively have to deal with the same legality reasoning that loop unrolling/vectorization does,
    because the input element count may not be a multiple of the rescaling factor (3 x u8 is 1.5 x u16),
    yet rounding down to a multiple is not always the right solution, since it's lossy.
    Perhaps most often, one would want those missing bytes to be either zeros or maybe indeterminate garbage
    (see the sketch after this list).
  • (obvious) The step may also not be a multiple of the wider type, but I suppose that will just work.
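(A minimal sketch of the zero-fill option, using the existing BoundaryConditions machinery; this is one possible policy, not a decided design:)

  // Zero-fill reads past the end of the byte buffer, so a widened load of
  // the last, partial element sees zeros in its missing high bytes.
  // "input" mirrors the 1-D uint8_t ImageParam from the example above.
  ImageParam input(UInt(8), 1, "input");
  Func padded = BoundaryConditions::constant_exterior(input, cast<uint8_t>(0));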

@LebedevRI
Contributor Author

> It seems introducing a set of encapsulations along the lines of BoundaryConditions would be useful.
> In this particular case, an API sketch looks like:

OK, so I'm looking at implementing this, and I'm not sure I follow. I need some more hints.

BoundaryConditions are modelled as thin wrappers and don't require modifications
to Halide itself, in the sense that they could be implemented in user code too.

It is obvious how to provide such wrappers for extracting sub-integers out of an integer,
or producing a wider integer by concatenating smaller integers.
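(For instance, a purely user-level sketch of the widening wrapper, with illustrative names, little-endian byte order, and no tail handling; note this is exactly the naive form discussed below:)

  // Build a uint32_t view of a uint8_t Func by shifting and OR-ing four
  // consecutive bytes (illustrative sketch, not an existing Halide helper).
  Func widen_u8_to_u32(Func bytes, Var x) {
      Func out(bytes.name() + "_as_u32");
      Expr w = cast<uint32_t>(bytes(4 * x));
      for (int i = 1; i < 4; i++) {
          w = w | (cast<uint32_t>(bytes(4 * x + i)) << (8 * i));
      }
      out(x) = w;
      return out;
  }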

The main problem that motivated me to file this issue is the fact that
such a naive/native solution leads to poor results, at least, and especially, for the widening case.
How would providing those helpers help solve that problem?

@zvookin
Member

zvookin commented May 19, 2022

BoundaryConditions required adding likely and has seen multiple revisions to the compilation strategy for loop splitting. It's probably still not ideal. We'll need to change the implementation inside BoundaryConditions to utilize texture-unit support for boundaries (e.g. adding an intrinsic to carry the boundary mode through). The compilation approach is also likely different for vector architectures with pervasive predication or length-bounded vectors. So far we still believe it's a good abstraction, but I had hoped it would have been more reliably excellent in performance by now.
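(Roughly, and as a simplified sketch rather than the actual library code, a 1-D repeat_edge-style wrapper looks like this; likely() marks the in-bounds fast path so loop partitioning can peel the boundary iterations:)

  // Simplified sketch of a repeat_edge-style boundary condition in 1-D.
  Func repeat_edge_1d(ImageParam in) {
      Var x("x");
      Func out("repeat_edge_1d");
      Expr lo = in.dim(0).min();
      Expr hi = in.dim(0).min() + in.dim(0).extent() - 1;
      out(x) = in(clamp(likely(x), lo, hi));
      return out;
  }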

So the fact that the encapsulation doesn't nail the performance yet is not the main concern. The question is: can an encapsulation, plus some work on Halide, get the performance, at least for some cases?

@LebedevRI
Contributor Author

> The question is: can an encapsulation, plus some work on Halide, get the performance, at least for some cases?

I'm still very much in the dark regarding Halide internals,
and since the answer to this question depends on them,
I can only guess :)

In principle, so far, the missing optimization here is that given a Buffer<uint8_t>,
loading two consecutive elements and concatenating them should be lowered as a single load.
Can that be done?

LebedevRI added a commit to LebedevRI/Halide that referenced this issue May 29, 2022
As discussed, this implements a utility to make it easier
to write code that wants to essentially operate either on
smaller chunks than the whole type, or on several
consecutive elements as one large element.

The bigger picture is that it is desirable to perform load
widening in such situations, and while this doesn't do that,
at least having a common interface should be a step in that direction.

I took the liberty of adding/exposing some QoL variable-bit-length
operations in IROperator while there.

I believe this has sufficient test coverage now.
While implementing, I stumbled upon & fixed
halide#6782

Refs. halide#6756
@LebedevRI
Contributor Author

OK, I think #6775 is roughly ready.

@LebedevRI
Contributor Author

Trying to think about the actual load widening now.
As I see it, there are two general strategies:

  • Finding all loads, recognizing "chains" (where one load is at index and another at index + 1), creating a single wide load, and replacing the narrow loads with extracts out of the wide load
  • Recursively analyzing each Expr to see whether we can recognize & replace the whole Expr with a complete wide load

I'm thinking the first approach is more general and cheaper/simpler, so that is what I will look into.

Now that I think about it, the big problem is that the Load expr takes an element index,
so given a load of i8 at an 'opaque' index i, we can't actually represent that
as the low 8 bits of a load of i16 at index i/2, because i might not be a multiple of 2...
This calls for a new MisalignedLoad expr?

LebedevRI added a commit to LebedevRI/Halide that referenced this issue Jun 14, 2022
LebedevRI added a commit to LebedevRI/Halide that referenced this issue Jun 14, 2022
…terpret

Second (of three) pieces of the load-widening puzzle.
Here, the codegen is taught to directly emit scalar loads,
instead of doing a vector load and `bitcast`ing it.

Refs. halide#6801
Refs. halide#6756
Refs. halide#6775
LebedevRI added a commit to LebedevRI/Halide that referenced this issue Aug 8, 2022
@LebedevRI closed this as not planned Aug 9, 2022