
(de)serialization of Vector{Union{T,Missing}} very slow #30148

Open
nalimilan opened this issue Nov 25, 2018 · 17 comments
Labels
missing data (Base.missing and related functionality), performance (Must go faster)

Comments

@nalimilan
Member

Serializing and deserializing arrays that allow for missing values, such as Vector{Union{Int,Missing}}, is much slower than the same operation on homogeneous isbits arrays. An optimization is probably missing: the data and the type-tag vectors could be written in bulk, instead of saving each entry via a generic pointer-based path.

Illustration (thanks @mauro3):

julia> using Serialization

julia> x = rand(Int, 100_000_000);

julia> xm = convert(Vector{Union{Int,Missing}}, x); xm[1:1000:end] .= missing;

julia> @time serialize(open("/tmp/test.jls", "w"), x)
  0.376715 seconds (119.99 k allocations: 5.847 MiB)
800000000

julia> @time y = deserialize(open("/tmp/test.jls"));
  0.570197 seconds (7.28 k allocations: 763.273 MiB, 40.02% gc time)

julia> @time serialize(open("/tmp/test.jls", "w"), xm)
  3.338387 seconds (100.00 M allocations: 1.990 GiB, 9.15% gc time)

julia> @time y = deserialize(open("/tmp/test.jls"));
 56.912022 seconds (399.90 M allocations: 8.787 GiB, 0.97% gc time)

Cc: @quinnj
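The data-plus-tags idea can be sketched in a few lines. This is a hypothetical illustration, not the stdlib implementation: `write_union_vector` and `read_union_vector` are made-up names, and it hard-codes `Int` as the element type. A `Vector{Union{Int,Missing}}` is backed by a flat data buffer plus one type-tag byte per element, so both parts could in principle be written in bulk:

```julia
# Hypothetical sketch: write the tag bytes and the data buffer in two bulk
# writes, instead of dispatching on every element.
function write_union_vector(io::IO, v::Vector{Union{Int,Missing}})
    tags = UInt8[ismissing(x) for x in v]   # one byte per element: 1 = missing
    data = Int[coalesce(x, 0) for x in v]   # zero stands in for missing slots
    write(io, length(v))
    write(io, tags)
    write(io, data)
end

function read_union_vector(io::IO)
    n    = read(io, Int)
    tags = read!(io, Vector{UInt8}(undef, n))
    data = read!(io, Vector{Int}(undef, n))
    out  = Vector{Union{Int,Missing}}(undef, n)
    @inbounds for i in 1:n
        out[i] = tags[i] == 0x01 ? missing : data[i]
    end
    return out
end
```

A round trip through an `IOBuffer` reproduces the input, including the `missing` slots.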

@nalimilan nalimilan added performance Must go faster missing data Base.missing and related functionality labels Nov 25, 2018
@iwelch

iwelch commented Nov 26, 2018

Let me add that this can have drastic consequences when working with large datasets (in data frames), so I hope this is considered important.

@JeffBezanson
Member

deserialize looks excessively slow here; I think it will be possible to speed that up. Further improvements will require changing the data format. A possible way forward there is to introduce a new protocol version in v1.x, but keep the old format as the default, and add an interface for selecting which version to save data in.

@iwelch

iwelch commented Nov 26, 2018

thanks, jeff. the performance here is really very important for working with (social science) data.

may I ask why serialization compatibility is not seamless? the deserializer reads, as its first byte, the version in which the data was serialized.

as to the serializer, why would you want a default writing format in which vectors with missing values cannot be accommodated efficiently? I would put the option flag on writing older versions instead.

@nalimilan
Member Author

I thought the serialization format was advertised as being at risk of changing at any time, and that people should use JLD(2) for long-term storage?

@JeffBezanson
Member

Of course we can continue to read older files based on the version indicated in the header.

We could change the format we write in by default, but then files written by v1.2 won't be readable by v1.1 and v1.0 (yes there can be an option for it, but it's still breaking). Perhaps that's considered acceptable though.

@iwelch

iwelch commented Nov 26, 2018

can I voice my preference again? please make the most recent serialization version the default. if someone needs to write data for an older julia version, they can be asked to download the recent version (free!), deserialize the data, and then serialize it with a switch that targets the older julia, which they probably should not be running anyway.

@JeffBezanson
Member

download the recent version (free!)

:)

Yes I think this is probably fine. Will flag for triage just to make sure there's consensus on that.

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Nov 26, 2018
@iwelch

iwelch commented Nov 27, 2018

jeff, would it make sense to add lz4 compression to the format at this step, too? it would likely speed up loads from disk and save disk space.

@JeffBezanson
Member

Triage is ok with upgrading the default protocol version used for writing in 1.x.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Dec 6, 2018
@iwelch

iwelch commented Dec 6, 2018

that's great. allow me to mention lz4 again, at least as a built-in option if not as the default for the serialized format. in my stock-return data, which is numeric but has somewhat repetitive prices and volumes, I get better than a 3-to-1 compression ratio: 2.5GB instead of 9GB. if hard-disk I/O is slow, this could even win on speed.

KristofferC pushed a commit that referenced this issue Dec 6, 2018
@StefanKarpinski
Member

Adding lz4 support would be a welcome addition, but I don't think anyone else has the motivation or bandwidth to implement it for you.

@iwelch

iwelch commented Dec 6, 2018

couldn't it simply be piped through https://github.com/lz4/lz4 ?

@StefanKarpinski
Member

Can't you just do that externally?

@chethega
Contributor

chethega commented Dec 6, 2018

Is this really needed?

Apart from the non-answer (use zfs or btrfs, generic compression is the kernel's job), you can use a TranscodingStream for LZ4 or zstandard. Pass the file into the transcoding stream, pass the transcoding stream into serialize / deserialize, done (the transcoding stream handles buffering).

edit: Sorry for double posting. It appears that all three of us had the same idea at the same time.
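A minimal sketch of what @chethega describes, assuming the third-party CodecZstd.jl package, which builds on TranscodingStreams.jl (an LZ4 codec from CodecLz4.jl would work the same way); the file path and the small `xm` here are just placeholders:

```julia
using Serialization
using CodecZstd  # third-party package providing Zstd codec streams

xm = convert(Vector{Union{Int,Missing}}, rand(Int, 1000))
xm[1:10:end] .= missing

# Serialize through a compressing stream; the stream handles buffering.
open("/tmp/test.jls.zst", "w") do io
    stream = ZstdCompressorStream(io)
    serialize(stream, xm)
    close(stream)  # flushes the codec's final frame
end

# Deserialize through the matching decompressing stream.
y = open("/tmp/test.jls.zst") do io
    deserialize(ZstdDecompressorStream(io))
end
```

The same pattern works with any `TranscodingStream` codec, since `serialize` and `deserialize` only require a readable/writable `IO`.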

@StefanKarpinski
Member

I might also point out that this is a completely off-topic discussion on this issue.

@iwelch

iwelch commented Dec 6, 2018

the lame answer is that R did it for their binary format, too. the less lame answer is that, yes, I can do it. I know how to do it. I can stick a function into my startup.jl and be done for myself. this feature would really be for the benefit of earlier, naive, and occasional users, and to reduce duplicate effort.

there is one usage aspect that would be more convenient and that is not easy to replicate: transparent decompression. if the format stored a hint about whether its stream should be run through a decompressor, it could work transparently with either format.

and then there is every user struggling to make various packages work together:

julia> dfreadback= CSV.read( GzipDecompressorStream( open("sample2.csv.gz", "r") ) )
ERROR: MethodError: no method matching position(::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream})
Closest candidates are:
  position(!Matched::Base.SecretBuffer) at secretbuffer.jl:154
  position(!Matched::Base.Filesystem.File) at filesystem.jl:225
  position(!Matched::Base.Libc.FILE) at libc.jl:101

we agree that this is all on a spectrum. all higher-level formats are "convenience only"; a user could read and write the data themselves. compression is just another convenience: nice, but not essential. if it's easy, IMHO, it's worth it; if not, it isn't. if you considered it and rejected it, so be it.
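The "transparent decompression" hint mentioned above can be sketched with stdlib-only code by sniffing the stream's leading magic bytes (gzip files start with 0x1f 0x8b); `maybe_decompress` and its `make_decompressor` argument are hypothetical names:

```julia
# Hypothetical sketch: peek at the first two bytes of a seekable stream and only
# wrap it in a decompressor when they match gzip's magic number.
function maybe_decompress(io::IO, make_decompressor)
    magic = read(io, 2)
    seekstart(io)                       # rewind; requires a seekable stream
    if magic == UInt8[0x1f, 0x8b]
        return make_decompressor(io)    # e.g. GzipDecompressorStream
    else
        return io                       # plain stream: pass through unchanged
    end
end
```

The returned stream can then be handed to `deserialize` (or `CSV.read`) regardless of whether the file on disk was compressed.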

@timholy
Member

timholy commented Dec 7, 2018

@iwelch, I think what folks are saying is that no one doubts this might sometimes be useful, so there is no need to try to convince anyone of anything. But you're also not going to convince others to do this unpaid and low-priority work for you. This is open source, so just roll up your sleeves and implement it yourself; I'm sure all those users who won't experience the duplicated effort will thank you.

You could presumably do this as an external package, defining an LZ4-compressing subtype of AbstractSerializer and specializing the methods that would benefit from using it. Just copy/paste the relevant methods from https://github.com/JuliaLang/julia/blob/master/stdlib/Serialization/src/Serialization.jl and start editing.
