improve performance/ergonomics of reading compound datatypes #592
So, from my perspective this looks very good. What I request is:
@kleinhenz: While I think your approach is better than #559, it would help a lot if there were a way to convert the named tuple into the equivalent struct representation. One issue I am currently seeing is that you represent FixedSize Vectors as Vectors, while the "in-memory" representation is a |
We can certainly make a major version bump.
@tknopp Can you double check here your statement. The code here is using a fixed size NTuple not Vectors for FixedArray? |
In the The reason I decided to do the type translation here is because it is basically necessary for string types. You don't want to return named tuples with In the case where there are no string or array members then I think using |
Can we make the type normalization optional? Maybe with the current behavior being the default. |
So thinking about this a little more I think that in the case where you have a memory compatible struct already defined it makes a lot more sense to use it directly like in #559 than to use this and then reinterpret. The nice thing about this approach is that it gives you an ergonomic representation without you having to specify a struct manually. But if the struct is already defined then I don't think there's really any advantage to reading as named tuples first. The main function in #559 is <10 lines of code so I don't think using an approach like that is too onerous. I think that makes more sense than complicating the default interface by making the type normalization configurable. |
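For reference, the "read as named tuples, then reinterpret" route discussed above can be sketched roughly as follows. This is only an illustration: `Particle` and its field names are made up, the `rows` vector stands in for what a read would return, and it assumes the compound has only plain numeric members so that the named tuple and the struct share the same memory layout.

```julia
# Hypothetical user-defined struct matching the compound layout
struct Particle
    x::Float64
    y::Float64
    z::Float64
end

# Stand-in for the result of a read: a vector of isbits named tuples
rows = [(x = 1.0, y = 2.0, z = 3.0), (x = 4.0, y = 5.0, z = 6.0)]

# Both element types are isbits with the same size and no padding,
# so the same memory can be viewed as Particle without copying
particles = reinterpret(Particle, rows)
```

With string or array members the element types would no longer be layout compatible after normalization, which is the case where reading directly into the struct (as in #559) looks more attractive.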
Is it really complicating the interface? I am mainly asking for a backdoor where the raw things are returned. I am actually wondering how you plan to realize the write support. In that case you will need to denormalize the types. |
In order to make it optional it seems like you need to either add a new global flag which is a bit unappealing or else add a new mechanism for passing keyword arguments through all of the machinery which seems like sort of a big change just for this. I can't just add it to the relevant method because usually this will be invoked through What would be the advantage of using this and then reinterpreting the data into your custom struct over just reading directly into the custom struct? IMHO write support is separate and maybe out of scope. I guess |
I disagree with that. You seem to think about compounds as some serialization format where the concrete way things are stored doesn't matter too much. But there are file formats (e.g. https://ismrmrd.github.io) that use a fixed compound type definition. I therefore consider this to be a contract where the user needs full control to fulfill the contract. But anyway, I can live with the normalization and will work around it in my code. |
That's a good point. Still, I think that is an issue that should be addressed separately. As noted in #341 making the set of natively supported types dynamically modifiable is currently pretty non-trivial because The change proposed here is much more minimal since it doesn't do anything about user defined types but I think it's worth making since it greatly improves performance and named tuples are much more convenient than the |
end
iscomplex = (membernames == COMPLEX_FIELD_NAMES[]) && (membertypes[1] == membertypes[2]) && (membertypes[1] <: HDF5.HDF5Scalar)
unrelated changes.
This part was refactored because membernames/membertypes are now used outside of the COMPLEX_SUPPORT path.
@musm what do you think about this approach? |
@musm, @kleinhenz so I support merging this although it will break my package badly. Maybe #559 can then be worked on in a separate step. As @kleinhenz outlined, they are orthogonal. In this PR one has no knowledge about the compound. In #559 one has the compound at hand. So it may even dispatch nicely. |
Yeah I think the approach here is a good one. I need to play around with it a bit more. What about wrapping the |
How would you want to wrap it? To maintain the old interface or to just provide an outer type? My instinct would be to just return the named tuple directly since
|
Fair enough. I think we will need an upcoming breaking release containing this PR, changes to the array interface to align them with Base julia, and the update to the JLL artifacts. |
Codecov Report
@@ Coverage Diff @@
## master #592 +/- ##
=========================================
- Coverage 78.57% 78.1% -0.47%
=========================================
Files 3 3
Lines 1134 1128 -6
=========================================
- Hits 891 881 -10
- Misses 243 247 +4
@musm does anything else have to happen with this before it can be merged? |
remove unnecessary convert calls Co-Authored-By: Mustafa M. <mus-m@outlook.com>
whitespace Co-Authored-By: Mustafa M. <mus-m@outlook.com>
src/HDF5.jl
Outdated
return HDF5Compound(data[1], membername, membertype)
# get a vector of all the leaf types in a (possibly nested) named tuple
function get_all_types(::Type{NamedTuple{T, U}}) where T where U
    types = []
should probably be types = DataType[] instead of types = []
src/HDF5.jl
Outdated
if T === HDF5Compound{N}
    return HDF5Compound(data[1], membername, membertype)
# get a vector of all the leaf types in a (possibly nested) named tuple
function get_all_types(::Type{NamedTuple{T, U}}) where T where U
I'm guessing there's a better way to do this that I can't currently think of, but seems fine for now.
This seems faster / better
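# U is the tuple type holding the field types of the named tuple
function get_all_types1(::Type{NamedTuple{T, U}}) where {T, U}
    types = DataType[]
    for Ui in U.types
        if Ui <: NamedTuple
            # recurse into nested named tuples
            append!(types, get_all_types1(Ui))
        else
            push!(types, Ui)
        end
    end
    return types
end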
maybe this would be better/more julian?
do_normalize(::Type{T}) where T = false
do_normalize(::Type{NamedTuple{T, U}}) where T where U = any(i -> do_normalize(fieldtype(U,i)), 1:fieldcount(U))
do_normalize(::Type{T}) where T <: Union{Cstring, FixedString, FixedArray, VariableArray} = true
do_reclaim(::Type{T}) where T = false
do_reclaim(::Type{NamedTuple{T, U}}) where T where U = any(i -> do_reclaim(fieldtype(U,i)), 1:fieldcount(U))
do_reclaim(::Type{T}) where T <: Union{Cstring, VariableArray} = true
the only reason I collect all the types is to see if I need to call reclaim or do an extra normalization step to turn the data into a native julia type.
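A self-contained sketch of the dispatch-based check proposed above, in case anyone wants to try it outside the package. The `FixedString` and `VariableArray` definitions here are simplified stand-ins and not the actual HDF5.jl types, and `fieldtypes` is used in place of the `fieldtype`/`fieldcount` loop.

```julia
# Simplified stand-ins for the HDF5-compatible member types (assumptions)
struct FixedString{N}
    data::NTuple{N,UInt8}
end
struct VariableArray{T}
    len::Csize_t
    p::Ptr{Cvoid}
end

# Fallback: plain types need no conversion
do_normalize(::Type{T}) where {T} = false
# Leaf types that have to be converted to native Julia types
do_normalize(::Type{T}) where {T<:Union{Cstring,FixedString,VariableArray}} = true
# Named tuples need conversion if any (possibly nested) member does
do_normalize(::Type{NamedTuple{N,U}}) where {N,U} = any(do_normalize, fieldtypes(U))

# Example row types
const PlainRow = NamedTuple{(:a, :b),Tuple{Float64,Float64}}
const StringRow = NamedTuple{(:id, :name),Tuple{Int64,FixedString{8}}}
const NestedRow = NamedTuple{(:inner,),Tuple{StringRow}}
```

Since the answer depends only on the type, the check never has to allocate a vector of leaf types.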
yeah that might be simpler. It's certainly in our best interest to make the code run as fast as possible and be as simple as possible to read for maintenance. You can also benchmark performance.
size(::Type{FixedArray{T,D}}) where {T,D} = D
eltype(::Type{FixedArray{T,D}}) where {T,D} = T
struct FixedArray{T,D,L}
    data::NTuple{L, T}
how / where's this field being used ?
This is used so that the Julia type you construct for an H5T_ARRAY is memory compatible and can be directly written into, which is necessary for compound datatypes with H5T_ARRAY members.
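To illustrate the memory-compatibility point, here is a small sketch using the struct layout from the diff above. The `to_array` helper is hypothetical and only shows one possible way to get a native array back; it ignores any dimension reordering between HDF5 and Julia.

```julia
# Layout taken from the diff: L elements of type T stored inline
struct FixedArray{T,D,L}
    data::NTuple{L,T}
end

# Hypothetical helper: copy the inline data into a native array of shape D
to_array(x::FixedArray{T,D,L}) where {T,D,L} = reshape(collect(x.data), D)

fa = FixedArray{Float32,(2, 3),6}(ntuple(i -> Float32(i), 6))
arr = to_array(fa)
```

Because the only field is an `NTuple`, the struct is an isbits type whose size is exactly `L * sizeof(T)`, so it can sit inline inside a compound element.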
src/HDF5.jl
Outdated
push!(row, read!(io, Vector{UInt8}(undef,dsize)))
end
# convert Cstring/FixedString to String
function normalize_types(x::NamedTuple{T}) where T
where was this normalization obtained from?
this was taken from the current handling of strings/vlen types.
except that for vlen arrays I use a non-owning unsafe wrap and then make a copy. I think this is the right thing to do since hdf5 wants to manage its own memory.
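The wrap-then-copy pattern mentioned above looks roughly like this. In this sketch a Julia vector plays the role of the memory owned by the C library, so the pointer and length are stand-ins for what a vlen member would carry.

```julia
# Stand-in for a buffer owned by the C library
buf = Float64[1.0, 2.0, 3.0]

out = GC.@preserve buf begin
    p = pointer(buf)
    # own = false: Julia will not try to free memory it does not own
    wrapped = unsafe_wrap(Array, p, length(buf); own = false)
    # copy so the result stays valid after the library reclaims its memory
    copy(wrapped)
end
```

After the copy, `out` is an ordinary Julia array that no longer aliases the original buffer.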
    return Cstring
else
    n = h5t_get_size(dtype.id)
    return FixedString{Int(n)}
Again I'm guessing it's fine not to do Int here as well. Unless you foresee negative consequences.
This one is actually necessary since NTuple requires the first type parameter to be Int64.
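A tiny illustration of the conversion in question. The value `n` is a stand-in for the unsigned size returned by the C library, not an actual call into HDF5.

```julia
# Stand-in for the size reported by the C library (an unsigned integer)
n = Csize_t(8)

# Convert to Int before using the value as a tuple length
Bytes = NTuple{Int(n),UInt8}
```

The resulting type is the same one you get by writing the length as a literal, which keeps the type parameters consistent across code paths.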
Thanks, I think we are almost there, just a few more tweaks and we can merge and tag a new major. |
Any chance we could also get this to work with a variant of #341 in the future? |
yeah I think that should work fine. I already checked that this integrates with the complex support so float16 should work the same way. One thing that doesn't work currently is hyperslab support since we have an annoying check for |
When reading compound datatypes, this replaces HDF5Compound with a named tuple with the correct data layout. The data is read directly into the output buffer instead of going through an IOBuffer. This results in very large performance gains.
I wrote a small benchmark to test the performance of reading a one-million-element dataset of a small compound datatype with three float fields.
Before
After:
Additionally I fixed support for variable length strings as fields of compound datatypes which seemed to be broken before.
In the case where the compound type contains strings/arrays/vlen types, a copy is made to convert from the HDF5-compatible type (e.g. Cstring, FixedArray, etc.) to native Julia types. This has some performance cost, but I think it is almost always what you want.
Besides being much faster, I think that an array of NamedTuples is much easier to work with than the HDF5Compound struct there was before.
This should close #408. I prefer this solution to #559 since it doesn't change the interface and doesn't require manual specification of the struct.
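To make the "read directly into the output buffer" idea concrete, here is a rough sketch that does not touch HDF5 at all. The `bytes` vector stands in for the packed records a C library would hand over, and the raw memory copy stands in for the library writing into the buffer; the `Row` alias and field names are made up.

```julia
# Element type for a compound with three Float64 members (names are made up)
const Row = NamedTuple{(:x, :y, :z),Tuple{Float64,Float64,Float64}}

# Stand-in for the raw bytes of two packed records
bytes = collect(reinterpret(UInt8, Float64[1, 2, 3, 4, 5, 6]))

# Typed output buffer; isbits named tuples are stored inline
rows = Vector{Row}(undef, 2)

# Stand-in for the C library filling the buffer through a raw pointer
GC.@preserve bytes rows begin
    unsafe_copyto!(Ptr{UInt8}(pointer(rows)), pointer(bytes), sizeof(rows))
end

# Fields are then available by name
xs = [r.x for r in rows]
```

No per-element parsing happens here, which is where the speedup over building each record from an intermediate byte stream would come from.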