Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally intern strings during unmarshaling #191

Closed
CAFxX opened this issue Sep 28, 2018 · 6 comments · Fixed by #279
Closed

Optionally intern strings during unmarshaling #191

CAFxX opened this issue Sep 28, 2018 · 6 comments · Fixed by #279

Comments

@CAFxX
Copy link
Contributor

CAFxX commented Sep 28, 2018

To reduce allocations when unmarshalling it would be useful to allow users to specify that when decoding certain string fields the field values are likely to be repeated and therefore should be interned to avoid having duplicate copies of the same string in memory.

A string/[]byte interning package is available at https://github.com/josharian/intern.

This mechanism should be optional (not all fields contain duplicated values); ideally via a field tag such as intern, example:

type Object struct {
  Name string `json:"name"`
  State string `json:"state,intern"`
}

The example above would instruct the unmarshaling code to perform interning on the value of the State field when that field is decoded.

See golang/go#5160 (comment) for an experience report about a json-unmarshaling-heavy application where interning certain values significantly reduced memory usage.

@darkdefender27
Copy link

@CAFxX I'm willing to pick this up. Let me know if I can proceed.

@CAFxX
Copy link
Contributor Author

CAFxX commented Sep 18, 2019

Yup, I'm not working on this.

@bgadrian
Copy link

bgadrian commented Apr 5, 2020

Hello, @darkdefender27 any news on this?

CAFxX added a commit to CAFxX/easyjson that referenced this issue Apr 7, 2020
CAFxX added a commit to CAFxX/easyjson that referenced this issue Apr 7, 2020
@CAFxX
Copy link
Contributor Author

CAFxX commented Apr 7, 2020

I actually went ahead and did this.

@kirillx
Copy link
Contributor

kirillx commented Apr 11, 2020

I think as most of unmarshalled data is short living it would be better to add "noclone" attribute to make all the strings refer to original buffer.
e.g. some APIs return large binary objects in base64 string, so one need to decode it anyway and no good reason to allocate 2nd memory copy.
@GoWebProd what do you think?

@bgadrian
Copy link

There will be no case of 2nd memory allocation in case of interning the strings, only at the first occurrence, and that will be long lived in memory.

I think the "noclone" optimisation should be done as a separate flag, using the already builtin unsafeString from the lexer. But I see a different case than the "string interning". One is for low cardinality fields and the other for short-lived long bytes/strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants