(de)serialization of Vector{Union{T,Missing}} very slow #30148
Let me add that this can have drastic consequences when working with large datasets (in DataFrames), so I hope this is considered important.
Thanks, Jeff. The performance here is really very important for working with (social science) data. May I ask why serialization compatibility is not seamless? The deserializer reads, as its first byte, the version in which the data was serialized. As to the serializer, why would you want a default writing format in which vectors with missing cannot be accommodated? I would put the option flag on writing older versions.
I thought the serialization format was advertised as being at risk of changing at any time, and that people should use JLD(2) for long-term storage?
Of course we can continue to read older files based on the version indicated in the header. We could change the format we write in by default, but then files written by v1.2 won't be readable by v1.1 and v1.0 (yes, there can be an option for it, but it's still breaking). Perhaps that's considered acceptable though.
Can I voice my preference again? Please make the most recent serializing version always the default. If someone needs to write data for an older Julia version, they can be asked to download the recent version (free!), deserialize the new-version data, and then serialize-write with a switch to transfer the data to an older Julia... which they probably should not be running anyway.
:) Yes, I think this is probably fine. Will flag for triage just to make sure there's consensus on that.
Jeff, would it make sense to add lz4 compression to the format at this step, too? It would likely speed up loads from disk, and save disk space.
Triage is OK with upgrading the default protocol version used for writing in 1.x.
That's great. Allow me to mention lz4 again, at least as a built-in option if not as the default for the serialized format. On my stock return data, which is numeric but has some modestly constant prices and volumes, I get a better than 3-to-1 compression ratio: 2.5GB instead of 9GB. If hard disk I/O is slow, this could even win on speed.
Adding lz4 support would be a welcome addition, but I don't think anyone else has the motivation or bandwidth to implement it for you.
Couldn't it simply be piped through https://github.com/lz4/lz4 ?
Can't you just do that externally?
Is this really needed? Apart from the non-answer (use zfs or btrfs; generic compression is the kernel's job), you can use a [...] Edit: Sorry for double posting. It appears that all three of us had the same idea at the same time.
I might also point out that this is a completely off-topic discussion on this issue.
The lame answer is that R did it for their binary format, too. The less lame answer is that, yes, I can do it. I know how to do it. I can stick a function into my startup.jl, and I am done for myself. This feature would really be for the benefit of earlier, naive, and occasional users, and to reduce duplicate effort. There is one usage aspect that would be more convenient and that is not easy to replicate: transparent decompression. If the format stored a hint about whether its stream should or should not be run through a decompressor, it could work transparently with either format. And then there is every user struggling with making various packages work together:
We agree that this is all on a spectrum. All higher-level formats are "convenience only" in the sense that a user could read and write them themselves; compression is just another one of them. Nice, but not essential. If easy, IMHO, then worth it. If not easy, not worth it. If you considered it and rejected it, so be it.
@iwelch, I think what folks are saying is that no one doubts this might sometimes be useful, so no need to try to convince anyone of anything. But you're also not going to convince others to do this unpaid and low-priority work for you. This is open-source, so just roll up your sleeves and implement it yourself; I'm sure all those users who won't experience the duplicated effort will thank you. You could presumably do this as an external package, defining an LZ4-compressing subtype of
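The external-package approach suggested above can be sketched without defining a serializer subtype at all, simply by wrapping the file handle in a compressing stream. This is a minimal sketch, assuming the third-party CodecLz4.jl package (its `LZ4FrameCompressorStream` / `LZ4FrameDecompressorStream` are TranscodingStreams wrappers); it is not part of the stdlib and error handling is omitted.

```julia
using Serialization
using CodecLz4   # assumption: third-party package providing LZ4 stream codecs

# Serialize `x` to `path`, compressing the byte stream with LZ4 on the fly.
function serialize_lz4(path::AbstractString, x)
    stream = LZ4FrameCompressorStream(open(path, "w"))
    serialize(stream, x)
    close(stream)   # flushes the LZ4 frame and closes the underlying file
end

# Read back a value written by serialize_lz4.
function deserialize_lz4(path::AbstractString)
    stream = LZ4FrameDecompressorStream(open(path))
    x = deserialize(stream)
    close(stream)
    return x
end
```

A user could put these two functions in startup.jl, as mentioned earlier in the thread; what this sketch cannot provide is the transparent "is this stream compressed?" detection that a built-in format hint would allow.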
Serializing and deserializing arrays that allow for missing values, such as Vector{Union{Int,Missing}}, is much slower than doing the same operation for homogeneous isbits arrays. An optimization is probably missing to serialize the data and the type-tag vectors directly, instead of saving each entry using a pointer? Illustration (thanks @mauro3):
Cc: @quinnj