de-duplicate strings in serialization #35056
Do you mean deduplicate precisely when
Isn't there a natural cutoff where it's more expensive to store a backref than just store the string? Is that how you chose 7? If so, then 👍
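To make the cutoff question concrete, here is a minimal sketch of a backref-based encoder. It is a toy model, not Julia's actual serialization format: the tag values, the 4-byte index, and the `min_shared_len` parameter are all assumptions made up for illustration.

```python
import struct

# Hypothetical tags for a toy format (not Julia's real tags).
INLINE, SHARED, BACKREF = 0x01, 0x02, 0x03

def encode_strings(strings, min_shared_len=8):
    """Toy encoder: a string of at least min_shared_len bytes is written once
    and later occurrences become a 5-byte backref (tag + 4-byte index).
    Shorter strings are always written inline."""
    out = bytearray()
    seen = {}  # payload bytes -> index in the shared-string table
    for s in strings:
        data = s.encode("utf-8")
        if len(data) > 255:
            raise ValueError("toy format: strings are limited to 255 bytes")
        if len(data) < min_shared_len:
            # tag + 1-byte length + payload
            out += bytes([INLINE, len(data)]) + data
        elif data in seen:
            # tag + 4-byte little-endian table index
            out += bytes([BACKREF]) + struct.pack("<I", seen[data])
        else:
            seen[data] = len(seen)
            out += bytes([SHARED, len(data)]) + data
    return bytes(out)

# Long repeated string: backrefs pay off.
long_dedup = len(encode_strings(["x" * 20] * 100))
long_plain = len(encode_strings(["x" * 20] * 100, min_shared_len=256))

# 2-byte repeated string: a 5-byte backref is larger than the 4-byte inline form.
tiny_dedup = len(encode_strings(["ab"] * 100, min_shared_len=1))
tiny_plain = len(encode_strings(["ab"] * 100))
```

In this toy layout a backref costs 5 bytes and an inline string costs 2 bytes plus its payload, so sharing only wins for payloads longer than 3 bytes. The real threshold depends on the actual encoding of tags and references.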
Wow, this was so straightforward. Thanks so much, Jeff! I'm going to try running a test now on our end.
I suggested the same thing in my original comment on #35030, but after thinking about it more, the problem isn't only about serialization but also about deserialization: if a program has millions of identical tiny strings that are shared in memory, then once the struct is serialized and deserialized they will have become duplicated and might take up significant resources. So I find myself increasingly leaning towards always interning strings, even if it means the serialized object is a few bytes larger per string than it might otherwise be.
Would a good approach be to deduplicate small strings when you deserialize? I don't know how expensive that would be, but it would be best for size.
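As a rough illustration of what deduplicating on deserialization could look like, the sketch below interns each decoded string through a dictionary, so equal strings end up as one shared object. It is a generic Python model, not Julia's deserializer; `decode_fields` and its `intern_strings` flag are hypothetical names.

```python
def decode_fields(raw_fields, intern_strings=True):
    """Decode UTF-8 byte fields into str objects.

    With intern_strings=True, every decoded string is looked up in a pool,
    so equal strings are returned as the same object."""
    pool = {}
    result = []
    for raw in raw_fields:
        s = raw.decode("utf-8")
        if intern_strings:
            # Returns the pooled copy if one exists, else stores and returns s.
            s = pool.setdefault(s, s)
        result.append(s)
    return result

fields = [b"ok"] * 5 + [b"err"] * 3
shared = decode_fields(fields)

# Count distinct objects by identity, not by value.
distinct_objects = len({id(s) for s in shared})
```

The added cost in this model is one hash lookup per decoded string, in exchange for no growth in the serialized size.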
Reporting back: Thank you, yes this definitely fixed things on our end! :) 🚀
The point about wanting strings to be deduplicated on deserialization as well is a good one. The downside is bounded, since strings can only get so small.
Fortunately, we can add deduplication on deserialization any time. This PR also gives us the flexibility to change which strings are deduplicated on serialization, but at the cost of an extra byte. We could eliminate the extra byte if we don't think we'll want this feature for other types, i.e. making the new tag
In theory, don't we want it for anything large and immutable?
There don't tend to be a lot of large, immutable things. Strings are the main ones I can think of. Then again, I'm fine with the extra byte to make this a bit more general and future proof.
Oh, yeah, I was going to ask about that too, actually. Do we not already do deduplication for other large immutable things? I do think that we're going to see more and more of those in our serialization as well, since we're starting to use (and write) more immutable collections / functional data structures. (But we are also probably planning to move away from Julia native serialization to something more upgrade-invariant in the long term, like protobuf or something similar, so this probably isn't as pressing for us.) So Julia doesn't currently deduplicate immutables? What are the things that get interned? Symbols, Strings (now), and mutables?
No, we don't generally deduplicate immutables; it's too expensive and the average immutable object is likely to be fairly unique. Anyway I'll merge this now since the other variations discussed here can be added in a backwards-compatible way.
```bash
$ git apply <(curl https://patch-diff.githubusercontent.com/raw/julialang/julia/pull/35056.patch)
```
fixes #35030
There are two design choices: