Very slow reading of long strings in v0.14 #742

chrisbrahms · 2020-11-18T14:54:27Z

I have some HDF5 files which contain very long strings (>300k characters). With HDF5 v0.13.6, I can read these no problem:

import Random
import HDF5

file = HDF5.h5open("string.h5")
file["string"] = Random.randstring(300000)
close(file)

import HDF5
using BenchmarkTools
@btime HDF5.h5open("string.h5") do file
    read(file["string"]);
end
593.301 μs (12 allocations: 586.38 KiB)

Updating to v0.14.1, however, even reading this same dataset just once takes so long that I haven't been able to measure it. For shorter strings (10k characters), I see a ~3-fold slowdown (700 μs vs 220 μs), but nothing so egregious.

All of this is on Windows 10, Julia 1.5.1

Any ideas?

The text was updated successfully, but these errors were encountered:

jmert · 2020-11-18T16:26:37Z

I can confirm that on Linux with Julia master, I've had to kill the process after a bare read(file["string"]).

jmert · 2020-11-18T17:36:26Z

I think this boils down to a (probably bad) use of huge tuples — read on a fixed string essentially boils down to trying to allocate the following buffer:

julia> struct FixedArray{T,L}
           data::NTuple{L,T}
       end

julia> Array{FixedArray{Float64,30000}}(undef, 1)
^C^C^C^C^C^CWARNING: Force throwing a SIGINT

Edit: If you leave the following running long enough, it'll eventually abort. (An error message with a stack trace scrolls by for a long time, so I'm not sure what the actual error was, but I'd guess stack overflow?)

julia> FixedArray{Float64, 30000}(ntuple(zero, 30000))

jmert · 2020-11-18T17:58:07Z

@kleinhenz It looks like the old read implementation never tried to realize the FixedArray type, just using it for it's type information.

HDF5.jl/src/HDF5.jl

Lines 1341 to 1364 in f4faf71

    
           function read(obj::DatasetOrAttribute, ::Type{Array{A}}) where {A<:FixedArray} 
        
               T = eltype(A) 
        
               if !(T <: HDF5Scalar) 
        
                   error("Sorry, not yet supported") 
        
               end 
        
               sz = size(A) 
        
               dims = size(obj) 
        
               data = Array{T}(undef,sz..., dims...) 
        
               nd = length(sz) 
        
               hsz = Hsize[sz[nd-i+1] for i = 1:nd] 
        
               memtype_id = h5t_array_create(hdf5_type_id(T), length(sz), hsz) 
        
               try 
        
                   h5d_read(obj.id, memtype_id, H5S_ALL, H5S_ALL, obj.xfer.id, data) 
        
               finally 
        
                   h5t_close(memtype_id) 
        
               end 
        
               ret = Array{Array{T}}(undef,dims) 
        
               # Because of garbage-collection concerns, it's best to copy the data 
        
               L = prod(sz) 
        
               for i = 1:prod(dims) 
        
                   ret[i] = reshape(data[(i-1)*L+1:i*L], sz) 
        
               end 
        
               ret 
        
           end

It looks like that kind of behavior probably needs to be restored. Would just a special-case within the generic read implementation be a way to do that?

kleinhenz · 2020-11-18T18:18:39Z

Yeah the problem is definitely the giant tuples. We can probably just add a special case to read like we have for opaque datatypes. This would fix it for the current case where you are just reading a fixed string dataset. You basically need the tuple approach to support fixed strings/arrays in compound datatypes though which is where this originally came from. It would be nice to have something like NTuple which didn't destroy the compiler but as far as I know there isn't really another solution.

kleinhenz · 2020-11-18T18:37:20Z

Really what we want is a mechanism to be able to sometimes allocate a Vector{UInt8} to read into and defer constructing the final type.

jmert · 2020-11-19T02:54:16Z

Allocating the working buffer can be successfully done by reinterpreting a UInt8 array. Using Julia v1.6's reinterpret(reshape, ...) feature:

julia> struct FS{S}
           data::NTuple{S, UInt8}
       end

julia> a = reinterpret(reshape, FS{30000}, Array{UInt8}(undef, 30000));

julia> eltype(a)
FS{30000}

julia> length(a)
1

There are still major issues, though — trying to print a hangs the REPL.

In general, the large tuple appears to be a problem for inference/type dispatch, so you have to try really hard to hide the type. I made a bit of progress in getting the string to read and normalize by rewriting normalize_types to use Refs instead of bare types, but that bit of progress ends up breaking the generic compound data reading. (I either end up with broken dispatch or junk values, presumably because I'm missing GC.@preserve and/or copys in necessary places...)

kleinhenz · 2020-11-20T18:06:34Z

See JuliaLang/julia#35619. I'm not sure if there is any prospect of this just getting fixed on the julia side.

chrisbrahms mentioned this issue Nov 18, 2020

HDF5 compat LupoLab/Luna.jl#190

Merged

jmert added a commit to jmert/HDF5.jl that referenced this issue Nov 19, 2020

fix JuliaIO#742, avoid instantiating large tuples

0b6e115

jmert added a commit to jmert/HDF5.jl that referenced this issue Nov 19, 2020

fix JuliaIO#742, avoid instantiating large tuples

e1be2c0

jmert mentioned this issue Nov 19, 2020

Fix #742, avoid instantiating large tuples #745

Merged

musm mentioned this issue Nov 30, 2020

Performance regression in h5read #752

Closed

jmert added a commit to jmert/HDF5.jl that referenced this issue Dec 4, 2020

fix JuliaIO#742, avoid instantiating large tuples

94873a5

jmert added a commit to jmert/HDF5.jl that referenced this issue Dec 5, 2020

fix JuliaIO#742, avoid instantiating large tuples

33386eb

jmert added a commit to jmert/HDF5.jl that referenced this issue Dec 5, 2020

fix JuliaIO#742, avoid instantiating large tuples

dbdd0b9

jmert added a commit to jmert/HDF5.jl that referenced this issue Dec 7, 2020

fix JuliaIO#742, avoid instantiating large tuples

02b1a3f

jmert closed this as completed in 862d4f7 Dec 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very slow reading of long strings in v0.14 #742

Very slow reading of long strings in v0.14 #742

chrisbrahms commented Nov 18, 2020 •

edited

Loading

jmert commented Nov 18, 2020

jmert commented Nov 18, 2020 •

edited

Loading

jmert commented Nov 18, 2020

kleinhenz commented Nov 18, 2020

kleinhenz commented Nov 18, 2020

jmert commented Nov 19, 2020

kleinhenz commented Nov 20, 2020

Very slow reading of long strings in v0.14 #742

Very slow reading of long strings in v0.14 #742

Comments

chrisbrahms commented Nov 18, 2020 • edited Loading

jmert commented Nov 18, 2020

jmert commented Nov 18, 2020 • edited Loading

jmert commented Nov 18, 2020

kleinhenz commented Nov 18, 2020

kleinhenz commented Nov 18, 2020

jmert commented Nov 19, 2020

kleinhenz commented Nov 20, 2020

chrisbrahms commented Nov 18, 2020 •

edited

Loading

jmert commented Nov 18, 2020 •

edited

Loading