Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

de-duplicate strings in serialization #35056

Merged
merged 1 commit into from
Mar 19, 2020
Merged

Conversation

JeffBezanson
Copy link
Member

fixes #35030

  • Follows the usual backwards compatibility rule (new versions can read old files, but old versions can't necessarily read new files)
  • I opted to do this by adding a tag that can be used in general to add optional backreferences for any type that one might want to de-duplicate

There are two design choices:

  • Do we try to de-duplicate all strings (which this PR does), or save the exact reference topology we have in memory?
  • Do we use a size cutoff (this PR uses 7), or apply this to all strings? Some datasets do have large numbers of non-unique small strings.

@JeffBezanson JeffBezanson added performance Must go faster stdlib Julia's standard library labels Mar 9, 2020
@StefanKarpinski
Copy link
Member

Do we try to de-duplicate all strings (which this PR does), or save the exact reference topology we have in memory?

Do you mean deduplicate precisely when pointer(s1) == pointer(s2) versus deduplicating whenever s1 === s2? Somewhat unintuitively, the former is a finer grained equivalence relation than the latter because === treats Strings somewhat specially and considers them indistinguishable even if they are stored at different memory locations (we sweep this under the rug by saying that pointer is and impure function and it will give you some pointer to a string, not necessarily a specific one).

Do we use a size cutoff (this PR uses 7), or apply this to all strings? Some datasets do have large numbers of non-unique small strings.

Isn't there a natural cutoff where it's more expensive to store a backref than just store the string? Is that how you chose 7? If so, then 👍

@NHDaly
Copy link
Member

NHDaly commented Mar 10, 2020

wow, this was so straightforward. Thanks so much, Jeff! I'm going to try running a test now on our end.

@NHDaly
Copy link
Member

NHDaly commented Mar 10, 2020

Isn't there a natural cutoff where it's more expensive to store a backref than just store the string? Is that how you chose 7? If so, then 👍

I suggested the same thing in my original comment on #35030, but after thinking more about it, the problem isn't only about serialization, but also deserialization, and if there was a program with millions of identical tiny strings that are shared in memory, once the struct is serialized and deserialized, they'll have become duplicated and might take up significant resources. So i find myself increasingly leaning towards just always interning strings all the time, even if it means the serialized object is a few bytes larger per string than it might otherwise be.

@oscardssmith
Copy link
Member

Would a good approach to deduplicate small strings when you deserialize? I don't know how expensive that would be, but it would be best for size.

@NHDaly
Copy link
Member

NHDaly commented Mar 12, 2020

I'm going to try running a test now on our end.

Reporting back: Thank you, yes this definitely fixed things on our end! :) 🚀
It took a while to test because I was mistaken, and we actually were not using source builds of julia in production -- we were using the binary releases -- so it took a bit of time to set things up to support applying patches. But that's all set now, and we have verified that this drastically improved things for us. Thanks again!

@StefanKarpinski
Copy link
Member

The point about wanting strings to be deduplicated on deserialization as well is a good one. The downside is fixed since strings can only get so small.

@JeffBezanson
Copy link
Member Author

Fortunately, we can add deduplication on deserialization any time. This PR also gives us the flexibility to change which strings are deduplicated on serialization, but at the cost of an extra byte. We could eliminate the extra byte if we don't think we'll want this feature for other types, i.e. making the new tag SHARED_STRING_TAG instead of SHARED_REF_TAG.

@oscardssmith
Copy link
Member

In theory, don't we want it for anything large and immutable?

@StefanKarpinski
Copy link
Member

There don't tend to be a lot of large, immutable things. Strings are the main ones I can think of. Then again, I'm fine with the extra byte to make this a bit more general and future proof.

@NHDaly
Copy link
Member

NHDaly commented Mar 15, 2020

Oh, yeah, i was going to ask about that too, actually. Do we not already do deduplication for other large immutable things? I do think that we're going to see more and more of those in our serialization as well, since we're starting to use (and write) more immutable collections / functional data structures.

(But we are also probably planning to move away from julia native serialization to something more upgrade-invariant in the long-term, like protobuf or something similar, so this isn't as pressing for us, probably.)

But so, julia doesn't currently deduplicate immutables? What are the things that get interned? Symbols, Strings (now), and mutables?

@JeffBezanson
Copy link
Member Author

No, we don't generally deduplicate immutables; it's too expensive and the average immutable object is likely to be fairly unique.

Anyway I'll merge this now since the other variations discussed here can be added in a backwards-compatible way.

@JeffBezanson JeffBezanson merged commit d33c5a5 into master Mar 19, 2020
@JeffBezanson JeffBezanson deleted the jb/serializestrings branch March 19, 2020 20:38
oxinabox pushed a commit to oxinabox/julia that referenced this pull request Apr 8, 2020
ravibitsgoa pushed a commit to ravibitsgoa/julia that referenced this pull request Apr 9, 2020
KristofferC pushed a commit that referenced this pull request Apr 11, 2020
NHDaly added a commit to NHDaly/julia that referenced this pull request May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster stdlib Julia's standard library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Serialization.serialize() strings are not interned, causing duplication
4 participants