improve performance/ergonomics of reading compound datatypes #592

kleinhenz · 2019-11-01T15:50:48Z

When reading compound datatypes this replaces HDF5Compound with a named tuple with the correct data layout. The data is directly read into the output buffer instead of going through an IOBuffer. This results in very large performance gains.

I wrote a small benchmark to test the performance of reading a one million element dataset of a small compound datatype with three float fields.

Before

BenchmarkTools.Trial:
  memory estimate:  329.44 MiB
  allocs estimate:  9000084
  --------------
  minimum time:     1.689 s (19.72% GC)
  median time:      1.878 s (24.72% GC)
  mean time:        1.860 s (25.70% GC)
  maximum time:     2.013 s (31.63% GC)
  --------------
  samples:          3
  evals/sample:     1⏎

After:

BenchmarkTools.Trial:
  memory estimate:  22.89 MiB
  allocs estimate:  64
  --------------
  minimum time:     4.305 ms (20.43% GC)
  median time:      4.584 ms (20.69% GC)
  mean time:        4.717 ms (23.21% GC)
  maximum time:     53.451 ms (91.03% GC)
  --------------
  samples:          1059
  evals/sample:     1⏎

Additionally I fixed support for variable length strings as fields of compound datatypes which seemed to be broken before.

In the case where the compound type contains strings/arrays/vlen types a copy is made to convert from the hdf5 compatible type (e.g. Cstring, FixedArray, etc.) to native julia types. This has some performance cost but I think is almost always what you want to do.

Besides being much much faster I think that getting an array of NamedTuples is much easier to work with than the HDF5Compound struct there was before.

This should close #408. I prefer this solution to #559 since it doesn't change the interface and doesn't require manual specification of the struct.

tknopp · 2019-11-02T10:10:59Z

So, from my perspective this looks very good. What I request is:

a major version bumb, when this lands so that I can pin my package to the previous version.

@kleinhenz: While I think you approach is better than #559, it would help a lot when we there would be possibility to convert the named tuple into the equivalent struct representation. One issue I am currently seeing is that you represent FixedSize Vectors as Vectors, while the "in-memory" representation is a NTuple.

musm · 2019-11-03T19:59:02Z

We can certainly make a major version bump.

One issue I am currently seeing is that you represent FixedSize Vectors as Vectors, while the "in-memory" representation is a NTuple.

@tknopp Can you double check here your statement. The code here is using a fixed size NTuple not Vectors for FixedArray?

kleinhenz · 2019-11-03T20:16:17Z

In the normalize_types function I translate from HDF5.FixedArray which is just a wrapper around an NTuple with the shape information saved in the type into an actual Array. This is definitely a design decision that I could see going either way. My thought was that in most cases you will want to work with the data as an array so it makes sense to just translate. If we depended on StaticArrays we could just use that, but that is probably overkill for this package.

The reason I decided to do the type translation here is because it is basically necessary for string types. You don't want to return named tuples with Cstring members which point to memory allocated by hdf5. So if in order to get sensible julia types we have to do some type translations for strings I thought we might as well also do it for arrays. This could easily be changed if having compatibility with existing structs is more important.

In the case where there are no string or array members then I think using reinterpret in order to get to an equivalent struct representation should just work.

tknopp · 2019-11-04T09:24:41Z

Can we make the type normalization optional? Maybe with the current behavior being the default.

kleinhenz · 2019-11-04T15:54:03Z

So thinking about this a little more I think that in the case where you have a memory compatible struct already defined it makes a lot more sense to use it directly like in #559 than to use this and then reinterpret. The nice thing about this approach is that it gives you an ergonomic representation without you having to specify a struct manually. But if the struct is already defined then I don't think there's really any advantage to reading as named tuples first. The main function in #559 is <10 lines of code so I don't think using an approach like that is too onerous. I think that makes more sense than complicating the default interface by making the type normalization configurable.

tknopp · 2019-11-04T17:50:23Z

Is it really complicating the interface? I mainly asking for a backdoor, where the raw things are returned. I am actually wondering how you plan to realize the write support. In that case you will need to denormalize the types.

kleinhenz · 2019-11-04T19:00:31Z

In order to make it optional it seems like you need to either add a new global flag which is a bit unappealing or else add a new mechanism for passing keyword arguments through all of the machinery which seems like sort of a big change just for this. I can't just add it to the relevant method because usually this will be invoked through h5read or read(dset).

What would be the advantage of using this and then reinterpreting the data into your custom struct over just reading directly into the custom struct?

IMHO write support is separate and maybe out of scope. I guess JLD already solves the problem of how to map julia types to compound hdf5 types. It's probably best to leave general serialization logic there.

tknopp · 2019-11-04T20:57:20Z

IMHO write support is separate and maybe out of scope. I guess JLD already solves the problem of how to map julia types to compound hdf5 types. It's probably best to leave general serialization logic there.

I disagree with that. You seem to think about compounds as some serialization format where the concrete way things are stored don't matter too much. But there are file formats (e.g. https://ismrmrd.github.io) that use a fixed compound type definition. I therefore consider this to be a contract where the user needs full control to fulfill the contract.

But anyway, I can live with the normalization and will work around it in my code.

kleinhenz · 2019-11-05T17:38:39Z

That's a good point. Still I think that is an issue that should to be addressed separately. As noted in #341 making the set of natively supported types dynamically modifiable is currently pretty non-trivial because HDF5Scalar is used in dispatch everywhere. It would be nice to be able to define datatype(::Type{mytype}) and have all of the high level machinery just work. It would also be nice to provide an extension point to allow user specification of the type to be read into to override the default namedtuple scheme proposed here. However I think both of these things would require fairly big changes to the internal library code.

The change proposed here is much more minimal since it doesn't do anything about user defined types but I think it's worth making since it greatly improves performance and named tuples are much more convenient than the HDF5Compound type.

musm · 2019-11-06T16:17:25Z

src/HDF5.jl

          end
-          iscomplex = (membernames == COMPLEX_FIELD_NAMES[]) && (membertypes[1] == membertypes[2]) && (membertypes[1] <: HDF5.HDF5Scalar)


unrelated changes.

this part was refactored because now membernames/membertypes are used outside of of the COMPLEX_SUPPORT path.

kleinhenz · 2019-11-19T19:48:29Z

@musm what do you think about this approach?

tknopp · 2019-11-28T07:35:00Z

@musm, @kleinhenz so I support merging this although it will break my package hardly. Maybe #559 can be then worked on in a separate step. As @kleinhenz outlined they are orthogonal. In this PR one has no knowledge about the compound. In #559 one has the compound at hand. So it may even dispatch nicely.

musm · 2019-12-04T20:03:03Z

Yeah I think the approach here is a good one. I need to play around with it a bit more.

What about wrapping the NamedTuple in an HDF5Compound container? This might help users more easily upgrade or transition.

kleinhenz · 2019-12-06T19:15:12Z

How would you want to wrap it? To maintain the old interface or to just provide an outer type?

My instinct would be to just return the named tuple directly since

the old interface was inconvenient enough to justify a breaking change
people already know how named tuples work so adapting code to use the named tuples will have less friction than trying to figure out the semantics of a new HDF5Compound container.

musm · 2019-12-12T02:13:57Z

Fair enough. I think we will need an upcoming breaking release containing this PR, changes to the array interface to align them with Base julia, and the update to the JLL artifacts.

codecov · 2020-02-29T22:02:59Z

Codecov Report

Merging #592 into master will decrease coverage by 0.46%.
The diff coverage is 96%.

@@            Coverage Diff            @@
##           master    #592      +/-   ##
=========================================
- Coverage   78.57%   78.1%   -0.47%     
=========================================
  Files           3       3              
  Lines        1134    1128       -6     
=========================================
- Hits          891     881      -10     
- Misses        243     247       +4

Impacted Files	Coverage Δ
src/HDF5.jl	`77.84% <96%> (-0.5%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9163402...6c8bc40. Read the comment docs.

kleinhenz · 2020-02-29T22:13:36Z

@musm does anything else have to happen with this before it can be merged?

test/compound.jl

remove unnecessary convert calls Co-Authored-By: Mustafa M. <mus-m@outlook.com>

src/HDF5.jl

whitespace Co-Authored-By: Mustafa M. <mus-m@outlook.com>

musm · 2020-03-02T17:41:00Z

src/HDF5.jl

-        return HDF5Compound(data[1], membername, membertype)
+# get a vector of all the leaf types in a (possibly nested) named tuple
+function get_all_types(::Type{NamedTuple{T, U}}) where T where U
+  types = []


should probably be

Suggested change

types = []

types = DataType[]

musm · 2020-03-02T17:41:26Z

src/HDF5.jl

-    if T === HDF5Compound{N}
-        return HDF5Compound(data[1], membername, membertype)
+# get a vector of all the leaf types in a (possibly nested) named tuple
+function get_all_types(::Type{NamedTuple{T, U}}) where T where U


I'm guessing there's a better way to do this that I can't currently think of, but seems fine for now.

This seems faster / better

function get_all_types1(::Type{NamedTuple}) types = DataType[] for Ui in U.types if Ui <: NamedTuple append!(types, get_all_types(Ui)) else push!(types, Ui) end end return types end

maybe this would be better/more julian?

do_normalize(::Type{T}) where T = false do_normalize(::Type{NamedTuple{T, U}}) where T where U = any(i -> do_normalize(fieldtype(U,i)), 1:fieldcount(U)) do_normalize(::Type{T}) where T <: Union{Cstring, FixedString, FixedArray, VariableArray} = true do_reclaim(::Type{T}) where T = false do_reclaim(::Type{NamedTuple{T, U}}) where T where U = any(i -> do_reclaim(fieldtype(U,i)), 1:fieldcount(U)) do_reclaim(::Type{T}) where T <: Union{Cstring, VariableArray} = true

the only reason I collect all the types is to see if I need to call reclaim or do an extra normalization step to turn the data into a native julia type.

yeah that might be simpler. It's certainly in our best favor to make the code run as fast as possible and be as simple as possible to read for maintenance. You can also benchmark performance.

musm · 2020-03-02T18:04:20Z

src/HDF5.jl

-size(::Type{FixedArray{T,D}}) where {T,D} = D
-eltype(::Type{FixedArray{T,D}}) where {T,D} = T
+struct FixedArray{T,D,L}
+  data::NTuple{L, T}


how / where's this field being used ?

This is used so that the julia type you construct for an H5T_ARRAY is memory compatible and can be directly written into which is necessary for compound datatypes with H5T_ARRAY members.

musm · 2020-03-02T19:17:40Z

src/HDF5.jl

-            push!(row, read!(io, Vector{UInt8}(undef,dsize)))
-        end
+# convert Cstring/FixedString to String
+function normalize_types(x::NamedTuple{T}) where T


where was this normalization obtained from?

this was taken from the current handling of strings/vlen types.

except that for vlen arrays I use a non-owning unsafe wrap and then make a copy. I think this is the right thing to do since hdf5 wants to manage its own memory.

src/HDF5.jl

musm · 2020-03-02T19:22:44Z

src/HDF5.jl

+              return Cstring
+            else
+              n = h5t_get_size(dtype.id)
+              return FixedString{Int(n)}


Again I'm guessing it's fine not do Int here as well. Unless you foresee negative consequences.

This one is actually necessary since NTuple requires the first type parameter to be Int64.

musm · 2020-03-02T19:23:06Z

Thanks, I think we are almost there, just a few more tweaks and we can merge and tag a new major.

musm · 2020-03-02T19:27:52Z

Any chance we could also get this to work with a variant of #341 in the future?

* use dispatch instead of manually collecting all leaf types for type normalization * remove unnecessary Int conversion

kleinhenz · 2020-03-02T19:53:55Z

yeah I think that should work fine. I already checked that this integrates with the complex support so float16 should work the same way. One thing that doesn't work currently is hyperslab support since we have an annoying check for T <: HDF5Scalar in the getindex code. I think a lot of that code should probably be reworked to be more generic.

kleinhenz mentioned this pull request Nov 1, 2019

Add h5read_compound to read compound structures efficiently #559

Closed

tknopp requested a review from musm November 2, 2019 10:02

musm reviewed Nov 6, 2019

View reviewed changes

timholy mentioned this pull request Nov 27, 2019

support MATLAB tables JuliaIO/MAT.jl#127

Open

kleinhenz force-pushed the compound branch from a199931 to ca281f3 Compare November 27, 2019 23:01

kleinhenz added 6 commits February 29, 2020 16:42

read compound datasets as named tuples

82a6ffc

properly support variable/fixed length strings in compound dataset

349eaef

add more compound dataset tests

31ab8d2

add note

4e3f6c9

support vlen arrays in compound datasets

dfe1667

don't use fieldtypes for julia 1.0 compatibility

d52ea72

kleinhenz force-pushed the compound branch from ca281f3 to d52ea72 Compare February 29, 2020 21:49