
(de)serialization of Vector{Union{T,Missing}} very slow #30148

Open
nalimilan opened this issue Nov 25, 2018 · 17 comments
Labels
missing data (Base.missing and related functionality), performance (Must go faster)

Comments

@nalimilan
Member

Serializing and deserializing arrays that allow for missing values, such as Vector{Union{Int,Missing}}, is much slower than the same operation on homogeneous isbits arrays. An optimization is probably missing: the data and the type-tag vectors could be written in bulk, instead of saving each entry via a generic pointer-based path.

Illustration (thanks @mauro3):

julia> using Serialization

julia> x = rand(Int, 100_000_000);

julia> xm = convert(Vector{Union{Int,Missing}}, x); xm[1:1000:end] .= missing;

julia> @time serialize(open("/tmp/test.jls", "w"), x)
  0.376715 seconds (119.99 k allocations: 5.847 MiB)
800000000

julia> @time y = deserialize(open("/tmp/test.jls"));
  0.570197 seconds (7.28 k allocations: 763.273 MiB, 40.02% gc time)

julia> @time serialize(open("/tmp/test.jls", "w"), xm)
  3.338387 seconds (100.00 M allocations: 1.990 GiB, 9.15% gc time)

julia> @time y = deserialize(open("/tmp/test.jls"));
 56.912022 seconds (399.90 M allocations: 8.787 GiB, 0.97% gc time)

Cc: @quinnj
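The data-plus-tags idea can be sketched in a few lines. This is a hypothetical illustration, not the stdlib implementation: `write_union_vector` and `read_union_vector` are made-up names, and it hard-codes `Int` as the element type. A `Vector{Union{Int,Missing}}` is backed by a flat data buffer plus one type-tag byte per element, so both parts could in principle be written in bulk:

```julia
# Hypothetical sketch: write the tag bytes and the data buffer in two bulk
# writes, instead of dispatching on every element.
function write_union_vector(io::IO, v::Vector{Union{Int,Missing}})
    tags = UInt8[ismissing(x) for x in v]   # one byte per element: 1 = missing
    data = Int[coalesce(x, 0) for x in v]   # zero stands in for missing slots
    write(io, length(v))
    write(io, tags)
    write(io, data)
end

function read_union_vector(io::IO)
    n    = read(io, Int)
    tags = read!(io, Vector{UInt8}(undef, n))
    data = read!(io, Vector{Int}(undef, n))
    out  = Vector{Union{Int,Missing}}(undef, n)
    @inbounds for i in 1:n
        out[i] = tags[i] == 0x01 ? missing : data[i]
    end
    return out
end
```

A round trip through an `IOBuffer` reproduces the input, including the `missing` slots.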

@nalimilan nalimilan added performance Must go faster missing data Base.missing and related functionality labels Nov 25, 2018
@iwelch

iwelch commented Nov 26, 2018

Let me add that this can have drastic consequences when working with large datasets (in data frames), so I hope this is considered important.

@JeffBezanson
Member

deserialize looks excessively slow here; I think it will be possible to speed that up. Further improvements will require changing the data format. A possible way forward there is to introduce a new protocol version in v1.x, but keep the old format as the default, and add an interface for selecting which version to save data in.

@iwelch

iwelch commented Nov 26, 2018

thanks, jeff. the performance here is really very important for working with (social science) data.

may I ask why serialization compatibility is not seamless? the deserializer reads, as its first byte, the version in which the data was serialized.

as to the serializer, why would you want a default writing format in which vectors with missing values cannot be accommodated efficiently? I would put the option flag on writing older versions instead.

@nalimilan
Member Author

I thought the serialization format was advertised as being at risk of changing at any time, and that people should use JLD(2) for long-term storage?

@JeffBezanson
Member

Of course we can continue to read older files based on the version indicated in the header.

We could change the format we write in by default, but then files written by v1.2 won't be readable by v1.1 and v1.0 (yes there can be an option for it, but it's still breaking). Perhaps that's considered acceptable though.

@iwelch

iwelch commented Nov 26, 2018

can I voice my preference again? please make the most recent serialization version the default. if someone needs to write data for an older julia version, they can be asked to download the recent version (free!), deserialize the data, and then serialize it with a switch that targets the older julia, which they probably should not be running anyway.

@JeffBezanson
Member

download the recent version (free!)

:)

Yes I think this is probably fine. Will flag for triage just to make sure there's consensus on that.

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Nov 26, 2018
@iwelch

iwelch commented Nov 27, 2018

jeff, would it make sense to add lz4 compression to the format at this step, too? it would likely speed up loads from disk and save disk space.

@JeffBezanson
Member

Triage is ok with upgrading the default protocol version used for writing in 1.x.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Dec 6, 2018
@iwelch

iwelch commented Dec 6, 2018

that's great. allow me to mention lz4 again, at least as a built-in option if not as the default for the serialized format. in my stock-return data, which is numeric but has somewhat repetitive prices and volumes, I get better than a 3-to-1 compression ratio: 2.5GB instead of 9GB. if hard-disk I/O is slow, this could even win on speed.

KristofferC pushed a commit that referenced this issue Dec 6, 2018
@StefanKarpinski
Member

Adding lz4 support would be a welcome addition, but I don't think anyone else has the motivation or bandwidth to implement it for you.

@iwelch

iwelch commented Dec 6, 2018

couldn't it simply be piped through https://github.com/lz4/lz4 ?

@StefanKarpinski
Member

Can't you just do that externally?

@chethega
Contributor

chethega commented Dec 6, 2018

Is this really needed?

Apart from the non-answer (use zfs or btrfs, generic compression is the kernel's job), you can use a TranscodingStream for LZ4 or zstandard. Pass the file into the transcoding stream, pass the transcoding stream into serialize / deserialize, done (the transcoding stream handles buffering).

edit: Sorry for double posting. It appears that all three of us had the same idea at the same time.
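A minimal sketch of what @chethega describes, assuming the third-party CodecZstd.jl package, which builds on TranscodingStreams.jl (an LZ4 codec from CodecLz4.jl would work the same way); the file path and the small `xm` here are just placeholders:

```julia
using Serialization
using CodecZstd  # third-party package providing Zstd codec streams

xm = convert(Vector{Union{Int,Missing}}, rand(Int, 1000))
xm[1:10:end] .= missing

# Serialize through a compressing stream; the stream handles buffering.
open("/tmp/test.jls.zst", "w") do io
    stream = ZstdCompressorStream(io)
    serialize(stream, xm)
    close(stream)  # flushes the codec's final frame
end

# Deserialize through the matching decompressing stream.
y = open("/tmp/test.jls.zst") do io
    deserialize(ZstdDecompressorStream(io))
end
```

The same pattern works with any `TranscodingStream` codec, since `serialize` and `deserialize` only require a readable/writable `IO`.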

@StefanKarpinski
Member

I might also point out that this is a completely off-topic discussion on this issue.

@iwelch

iwelch commented Dec 6, 2018

the lame answer is that R did it for their binary format, too. the less lame answer is that, yes, I can do it. I know how to do it. I can stick a function into my startup.jl and be done for myself. this feature would really be for the benefit of earlier, naive, and occasional users, and to reduce duplicate effort.

there is one usage aspect that would be more convenient and that is not easy to replicate: transparent decompression. if the format stored a hint about whether its stream should be run through a decompressor, it could work transparently with either format.

and then there is every user struggling to make various packages work together:

julia> dfreadback= CSV.read( GzipDecompressorStream( open("sample2.csv.gz", "r") ) )
ERROR: MethodError: no method matching position(::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream})
Closest candidates are:
  position(!Matched::Base.SecretBuffer) at secretbuffer.jl:154
  position(!Matched::Base.Filesystem.File) at filesystem.jl:225
  position(!Matched::Base.Libc.FILE) at libc.jl:101

we agree that this is all on a spectrum. all higher-level formats are "convenience only"; a user could read and write the data themselves. compression is just another convenience: nice, but not essential. if it's easy, IMHO, it's worth it; if not, it isn't. if you considered it and rejected it, so be it.
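The "transparent decompression" hint mentioned above can be sketched with stdlib-only code by sniffing the stream's leading magic bytes (gzip files start with 0x1f 0x8b); `maybe_decompress` and its `make_decompressor` argument are hypothetical names:

```julia
# Hypothetical sketch: peek at the first two bytes of a seekable stream and only
# wrap it in a decompressor when they match gzip's magic number.
function maybe_decompress(io::IO, make_decompressor)
    magic = read(io, 2)
    seekstart(io)                       # rewind; requires a seekable stream
    if magic == UInt8[0x1f, 0x8b]
        return make_decompressor(io)    # e.g. GzipDecompressorStream
    else
        return io                       # plain stream: pass through unchanged
    end
end
```

The returned stream can then be handed to `deserialize` (or `CSV.read`) regardless of whether the file on disk was compressed.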

@timholy
Member

timholy commented Dec 7, 2018

@iwelch, I think what folks are saying is that no one doubts this might sometimes be useful, so there is no need to try to convince anyone of anything. But you're also not going to convince others to do this unpaid and low-priority work for you. This is open source, so just roll up your sleeves and implement it yourself; I'm sure all those users who won't experience the duplicated effort will thank you.

You could presumably do this as an external package, defining an LZ4-compressing subtype of AbstractSerializer and specializing the methods that would benefit from using it. Just copy/paste the relevant methods from https://github.com/JuliaLang/julia/blob/master/stdlib/Serialization/src/Serialization.jl and start editing.
