de-duplicate strings in serialization #35056

JeffBezanson · 2020-03-09T21:14:24Z

Follows the usual backwards-compatibility rule (new versions can read old files, but old versions can't necessarily read new files).
I opted to do this by adding a tag that can be used in general to add optional backreferences for any type that one might want to de-duplicate.

There are two design choices:

- Do we try to de-duplicate all strings (which this PR does), or save the exact reference topology we have in memory?
- Do we use a size cutoff (this PR uses 7), or apply this to all strings? Some datasets do have large numbers of non-unique small strings.
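
To make the two choices concrete, here is a minimal Python sketch of the backreference idea. It is not the code in this PR: the names `CUTOFF`, `encode_strings`, and `decode_strings`, the tuple-based token format, and the exact comparison against the cutoff are all assumptions made for illustration. Strings above the cutoff are stored once, and later equal strings are stored as a reference to the first occurrence.

```python
CUTOFF = 7  # size cutoff mentioned above; treating it as "strictly longer than" is an assumption


def encode_strings(strings, cutoff=CUTOFF):
    """Encode strings as tokens, replacing repeated long strings with backreferences."""
    seen = {}    # string value -> index of the token that first stored it
    tokens = []
    for s in strings:
        size = len(s.encode("utf-8"))
        if size > cutoff and s in seen:
            # Already stored once: emit a reference instead of the payload.
            tokens.append(("ref", seen[s]))
            continue
        if size > cutoff:
            seen[s] = len(tokens)
        tokens.append(("str", s))
    return tokens


def decode_strings(tokens):
    """Rebuild the list; a backreference reuses the already-decoded object."""
    out = []
    for kind, payload in tokens:
        if kind == "ref":
            out.append(out[payload])
        else:
            out.append(payload)
    return out
```

In this sketch, short strings are always written in full, so equal short strings come back as separate entries, which is the trade-off the second design choice is about.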

StefanKarpinski · 2020-03-10T12:30:10Z

> Do we try to de-duplicate all strings (which this PR does), or save the exact reference topology we have in memory?

Do you mean deduplicate precisely when `pointer(s1) == pointer(s2)` versus deduplicating whenever `s1 === s2`? Somewhat unintuitively, the former is a finer-grained equivalence relation than the latter because `===` treats `String`s somewhat specially and considers them indistinguishable even if they are stored at different memory locations (we sweep this under the rug by saying that `pointer` is an impure function and it will give you some pointer to a string, not necessarily a specific one).
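
The difference between the two equivalence relations can be illustrated with a small Python sketch. This is only an analogy: Python's `id` stands in for pointer comparison and `==` for value-based equality, and the helper `shared_groups` is a hypothetical name. It relies on CPython building two separate objects for the two runtime `join` calls.

```python
def shared_groups(strings, key):
    """Group the indices of `strings` that a deduplicator keyed by `key` would merge."""
    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(key(s), []).append(i)
    return sorted(groups.values())


# Two equal strings built at runtime, so they are distinct objects in CPython.
a = "".join(["hello, ", "world"])
b = "".join(["hello, ", "world"])
items = [a, a, b]

# Keyed by identity: only entries that are the same object are merged.
by_identity = shared_groups(items, key=id)

# Keyed by value: every equal string is merged, wherever it is stored.
by_value = shared_groups(items, key=lambda s: s)
```

Keying by identity preserves the in-memory reference topology, while keying by value merges more aggressively, which is why the first relation is the finer one.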

> Do we use a size cutoff (this PR uses 7), or apply this to all strings? Some datasets do have large numbers of non-unique small strings.

Isn't there a natural cutoff where it's more expensive to store a backref than just store the string? Is that how you chose 7? If so, then 👍
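
The break-even point asked about here can be modeled with a short Python sketch. The byte counts passed in are hypothetical parameters, not the sizes used by Julia's serialization format, and the function names are made up for illustration.

```python
def backref_saves_bytes(payload_len, header_bytes, backref_bytes):
    """True if writing a backreference is smaller than re-emitting the string."""
    return backref_bytes < header_bytes + payload_len


def natural_cutoff(header_bytes, backref_bytes):
    """Largest payload length for which a backreference does not save space."""
    return max(backref_bytes - header_bytes, 0)
```

For example, with an assumed 2-byte string header and an assumed 5-byte backreference, `natural_cutoff(2, 5)` is 3: strings of 4 bytes or more would be cheaper to reference than to repeat.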

NHDaly · 2020-03-10T13:37:42Z

Wow, this was so straightforward. Thanks so much, Jeff! I'm going to try running a test now on our end.

NHDaly · 2020-03-10T13:44:54Z

> Isn't there a natural cutoff where it's more expensive to store a backref than just store the string? Is that how you chose 7? If so, then 👍

I suggested the same thing in my original comment on #35030, but after thinking more about it, the problem isn't only about serialization but also deserialization. If a program has millions of identical tiny strings that are shared in memory, then once the struct is serialized and deserialized they'll have become duplicated and might take up significant resources. So I find myself increasingly leaning towards just always interning strings, even if it means the serialized object is a few bytes larger per string than it might otherwise be.

oscardssmith · 2020-03-10T15:05:39Z

Would a good approach be to deduplicate small strings when you deserialize? I don't know how expensive that would be, but it would be best for size.
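
A rough Python sketch of what deduplicating on the read side could look like, assuming a reader that keeps a table of the strings it has already produced. `InterningReader` and `read_string` are hypothetical names, not part of any existing API.

```python
class InterningReader:
    """Hands back one shared object per distinct string value read so far."""

    def __init__(self):
        self._table = {}

    def read_string(self, raw: bytes) -> str:
        s = raw.decode("utf-8")
        # setdefault returns the first equal string stored, or stores s itself.
        return self._table.setdefault(s, s)
```

The file stays as small as possible because nothing extra is written; the cost moves to the reader, which does one hash lookup per string and keeps the table alive for the duration of the read.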

NHDaly · 2020-03-12T14:37:23Z

> I'm going to try running a test now on our end.

Reporting back: Thank you, yes this definitely fixed things on our end! :) 🚀
It took a while to test because I was mistaken: we actually were not using source builds of Julia in production (we were using the binary releases), so it took a bit of time to set things up to support applying patches. But that's all set now, and we have verified that this drastically improved things for us. Thanks again!

StefanKarpinski · 2020-03-13T13:53:11Z

The point about wanting strings to be deduplicated on deserialization as well is a good one. The downside is bounded, since strings can only get so small.

JeffBezanson · 2020-03-13T16:00:49Z

Fortunately, we can add deduplication on deserialization any time. This PR also gives us the flexibility to change which strings are deduplicated on serialization, but at the cost of an extra byte. We could eliminate the extra byte if we don't think we'll want this feature for other types, i.e. making the new tag SHARED_STRING_TAG instead of SHARED_REF_TAG.

oscardssmith · 2020-03-13T16:16:17Z

In theory, don't we want it for anything large and immutable?

StefanKarpinski · 2020-03-13T17:04:50Z

There don't tend to be a lot of large, immutable things. Strings are the main ones I can think of. Then again, I'm fine with the extra byte to make this a bit more general and future proof.

NHDaly · 2020-03-15T02:21:32Z

Oh, yeah, I was going to ask about that too, actually. Do we not already do deduplication for other large immutable things? I do think that we're going to see more and more of those in our serialization as well, since we're starting to use (and write) more immutable collections / functional data structures.

(But we are also probably planning to move away from Julia native serialization to something more upgrade-invariant in the long term, like protobuf or something similar, so this isn't as pressing for us, probably.)

But so, Julia doesn't currently deduplicate immutables? What are the things that get interned? Symbols, Strings (now), and mutables?

JeffBezanson · 2020-03-19T20:37:34Z

No, we don't generally deduplicate immutables; it's too expensive and the average immutable object is likely to be fairly unique.

Anyway I'll merge this now since the other variations discussed here can be added in a backwards-compatible way.

fixes #35030

de-duplicate strings in serialization

a6e9de1

fixes #35030

JeffBezanson added the performance (Must go faster) and stdlib (Julia's standard library) labels Mar 9, 2020

JeffBezanson merged commit d33c5a5 into master Mar 19, 2020

JeffBezanson deleted the jb/serializestrings branch March 19, 2020 20:38

oxinabox pushed a commit to oxinabox/julia that referenced this pull request Apr 8, 2020

de-duplicate strings in serialization (JuliaLang#35056)

1f068b8

fixes JuliaLang#35030

ravibitsgoa pushed a commit to ravibitsgoa/julia that referenced this pull request Apr 9, 2020

de-duplicate strings in serialization (JuliaLang#35056)

ea5a7c9

fixes JuliaLang#35030

KristofferC pushed a commit that referenced this pull request Apr 11, 2020

de-duplicate strings in serialization (#35056)

1963138

fixes #35030

NHDaly added a commit to NHDaly/julia that referenced this pull request May 15, 2020

Apply changes from JuliaLang#35056

efc7879

```bash
$ git apply <(curl https://patch-diff.githubusercontent.com/raw/julialang/julia/pull/35056.patch)
```
