Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/json: memoize strings during decode #32779

Open
rsc opened this issue Jun 25, 2019 · 53 comments
Open

encoding/json: memoize strings during decode #32779

rsc opened this issue Jun 25, 2019 · 53 comments

Comments

@rsc
Copy link
Contributor

rsc commented Jun 25, 2019

Part of the motivation for #32593 is the observation in json-iterator/go#376 that when you json decode, many of the same strings appear over and over in the JSON and get copied and reconstructed in the result over and over as well.

We do not want to make the JSON coalesce all the allocated strings into one giant buffer, because then holding on to just one of them holds onto the entire thing.

But it might make sense to add to the decoder state a simple map[[]byte]string and remember which specific byte slices we've already converted to string and reuse those allocated strings instead of doing the same conversions again in a particular Decode or Unmarshal operation.

It's unclear to me whether the map in a Decoder should persist between Decode operations. Probably? That will affect users who put Decoders in a pool or something like that, though. (Solution: don't put decoders in pools.)

@rsc rsc added this to the Go1.14 milestone Jun 25, 2019
@rsc rsc added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jun 25, 2019
@rsc rsc removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jun 25, 2019
@bserdar
Copy link

bserdar commented Jun 25, 2019

A while ago I wrote some code to test this exact same thing with the json decoder, but could not detect any significant difference between the memory usage of the decoder with memoized strings and the regular json decoder, which made me think maybe this is already done somewhere? Apparently not. I can try to find out my test code...

@bserdar
Copy link

bserdar commented Jun 25, 2019

Here's a link to my test code that does this for decoding json into interface{}:

https://github.com/bserdar/jsdecodertest

My initial comment was wrong, it does make a difference to store strings in a map. My test code stores only field keys in a map because those are the strings that are likely to repeat.

The json/ package is a copy of the stdlib encoding/json package modified to include a map[string]string in the decoder state.

@martisch
Copy link
Contributor

Nit: I assume the proposal meant to use a map[string]string with map[string([]byte)] lookup as map[[]byte]string is not a supported map type.

@mvdan
Copy link
Member

mvdan commented Jun 26, 2019

Also cc @josharian, who had a very similar idea a while ago. That was briefly discussed in #5160.

As a thought, we probably don't want to reuse all strings. For example, in a map the keys probably tend to repeat much more often than the values.

Also, do we want to make this configurable? Someone who writes a program likely knows whether they'll read somewhat static/repeated data or not.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/188500 mentions this issue: encoding/json: memoize keys during decode

@rsc
Copy link
Contributor Author

rsc commented Aug 6, 2019

@mvdan, are you concerned about the overhead of the map lookup in your Jun 26 comment? From the outside there can be no possible downside to reusing all strings that are constructed.

@bserdar
Copy link

bserdar commented Aug 27, 2019

The CL mentioned above has an implementation of this for JSON decoder. It introduces some overhead when decoding to an interface{}, because that's the only case where the keys read during decoding will be kept after decoding is done.

However, I think there is a better way to do this. What I propose is to write an encoding/LiteralStore type (a map[string]string) containing a Get(string) string method to retrieve the shared instance of the string. Then this can be used from the decoders in encoding/xml and encoding/json. To use the LiteralStore, the caller has to assign an instance to the JSON/XML decoder. This way, the caller has the option to use a literal store for large input where it makes a difference, use the store for multiple docs, or avoid it completely.

What do you think?

@bcmills
Copy link
Contributor

bcmills commented Aug 28, 2019

This is one of the conventional use-cases for weak references in GC'd languages.

Ideally, we should reuse strings that remain reachable from the last Decode invocation, and let the collector clean up any strings that have since become completely unreachable. That way, the frequently-used strings will remain in the map, but the infrequent strings can still be collected (so that the map doesn't grow without bound if the same Decoder is used on many different inputs over time).

@bcmills
Copy link
Contributor

bcmills commented Aug 28, 2019

Alternatives that do not require weak references include:

  • Make the interning behavior an explicit opt-in using a Decoder method (perhaps func (*Decoder) InternStrings(func([]byte)bool)).
  • Make the interning behavior an explicit annotation on the struct field tag.
    • But this gets awkward to describe the behavior for map keys vs. map values.

@bserdar
Copy link

bserdar commented Aug 28, 2019

@bcmills, I propose to make the interning behavior explicit opt-in using a Decoder field instead of a method. After reading #32593, I believe one way to make this acceptable in terms of performance and usage is this:

  • Add a strings/LiteralStore type (a map[string]string), with an Add(string) string method that will return a shared copy for string
  • Add Decoder.KeyLiterals field of type LiteralStore to both json and xml decoders.
  • Use the KeyLiterals if it is non-nil during decoding to store keys

This implementation does not change the existing behavior and makes the use of a literal store explicit opt-in.

A literal store like this is only useful for larger json/xml input, and only if it is manually decoded, or unmarshaled into an interface{}. If the input is unmarshaled/decoded into a struct, then there is no need to keep the keys as they will be lost.

Re: weak references, afaik there is no way to implement this using finalizers, am I wrong?

@ianlancetaylor
Copy link
Member

@bserdar I don't think there is any way to do this with finalizers because existing users will be using string values, not pointers, so there is nothing to hang a finalizer on. For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.

In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.

@bserdar
Copy link

bserdar commented Aug 28, 2019

@ianlancetaylor thanks for the clarification.

If there is interest in this, I can modify https://golang.org/cl/188500 to implement this. I already have some wrapper code around json/xml decoder to do this because reusing just the key strings makes a lot of difference for large docs.

cc @mvdan

@bcmills
Copy link
Contributor

bcmills commented Aug 29, 2019

For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.

No? We would need some sort of language-supported “weak string” type, but then only the json.Decoder would hold a weak reference — the decoded messages would use a regular string so that the data doesn't disappear out from beneath them.

(But that assumes a language change that seems pretty unlikely to ever happen.)

@bcmills
Copy link
Contributor

bcmills commented Aug 29, 2019

In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.

Ah, I think I see what you're saying. If we could control the types in use, then we could add a pointer indirection, and require users to keep the pointer alive in order to retain the string value in the decoder's cache.

But then you can't use those pointers in the same way as ordinary strings: they can't be map keys, they won't work with the normal built-in string operations, and they can't be passed to the strings package without falling back to manual memory management (via runtime.KeepAlive).

It may be true that they are technically equivalent in power, but they are not equivalently usable.

@bcmills
Copy link
Contributor

bcmills commented Aug 29, 2019

@bserdar, strings.LiteralStore seems unfortunate given that we're also discussing adding generics to the language, because there is a much more general “interning table” API that can be used with any type that supports hashing and equality.

The general form could have an API like:

package intern

type Table(type T HashEqualer) […]

func (t Table(T)) Get(x T) (canonicalX T) {
	[…]
}

Such an API is also useful for building things like abstract syntax trees, where it can be useful to compress trees by coalescing occurrences of a common subexpression to the same canonical instance.

(A long time ago I maintained a server that evaluated large numbers of boolean expressions for ad targeting, and it used a similar approach — to great effect — to compress the targeting expressions, including in our wire protocol.)

@bserdar
Copy link

bserdar commented Aug 29, 2019

@bcmills, interning table would be nice, but it won't be available any time soon. Until that becomes available, we can keep interning internal to the Decoder, but expose an InternKeys(bool) method for explicit opt-in. Or, implement the strings.LiteralStore, and change it to use the intern package when generics are done.

@bserdar
Copy link

bserdar commented Oct 2, 2019

xml.Decoder can benefit from interning strings as well. What is the opinion about that? Decoder.InternKeys() would be the explicit opt-in for interning keys during decode. Something similar to this: https://go-review.googlesource.com/c/go/+/188500

@rsc
Copy link
Contributor Author

rsc commented Nov 27, 2019

There's been a lot of discussion here but I haven't seen any objections to simply putting the map I suggested into Decoder and not trying to reuse it any more than that.

I've seen no rationale for adding new API. Just do it all the time. Am I missing something?

@rsc rsc removed this from the Backlog milestone Nov 27, 2019
@bserdar
Copy link

bserdar commented Apr 8, 2020

I can rebase this old CL unless there are other plans: https://go-review.googlesource.com/c/go/+/188500

@rsc rsc modified the milestones: Proposal, Backlog Apr 8, 2020
@rsc rsc changed the title proposal: encoding/json: memoize strings during decode? encoding/json: memoize strings during decode Apr 8, 2020
@mvdan
Copy link
Member

mvdan commented Apr 9, 2020

@rsc I think we should answer the two questions I raised again in #32779 (comment) before getting into a CL. To repeat my current understanding:

  1. Would this only affect map keys, or all strings? This is unclear, and IMO should be decided before we jump to an implementation.

  2. The change, as proposed right now, would silently introduce memory leaks in programs that currently keep a Decoder alive for a long time. You say:

If you want to avoid the memoization, you could use multiple json.Decoders.

In my opinion, this is a subtle and dangerous breaking change. Even if we strongly marked this change in behavior in the release notes, discouraging the use of global or long-lived decoders, it still seems like a change that could bring pain to our users since it's easy not to notice a leak until a service runs out of memory a week later.

@frioux
Copy link

frioux commented Apr 10, 2020

Maybe this is a dumb question but, would the strings being memoized be from the types, or strings that just flew by? If it's the types themselves I would be comfortable running this in prod, but if it's strings that happened to be passed from a user (like the values in a map or in a list or "unknown" keys to a struct) this sounds like a possible DoS.

I don't know how you could keep an encoder around for a long time in a web context other than maybe web sockets???

@bcmills
Copy link
Contributor

bcmills commented Apr 10, 2020

@mvdan, what if we added some sort of decay behavior or LRU caching, so that infrequently used-strings would expire instead of being memoized forever?

That would at least prevent unbounded memory leaks for infrequent keys, while still providing most of the memory reduction of memoization.

@CAFxX
Copy link
Contributor

CAFxX commented Apr 12, 2020

@mvdan, what if we added some sort of decay behavior or LRU caching, so that infrequently used-strings would expire instead of being memoized forever?

That would at least prevent unbounded memory leaks for infrequent keys, while still providing most of the memory reduction of memoization.

Agreed, this is similar in spirit to what you get with josharian/intern, and it worked well in practice for us. Worth noting that we were interning only the field values, not the field keys (we didn't have maps in the decoded struct, but we had quite some string fields whose values were likely to repeat).

Would this only affect map keys, or all strings?

Given what I wrote above, I think we should do the latter. If the worry is keeping too many strings alive it should be easy enough to cap the size of the interning map (even just random replacement would likely be enough, and trivial to implement; as the current proposal is to have an interning map per Decoder, it will likely not make much of a difference to implement a better eviction policy since multiple Decoders will allocate the same strings anyway).

@mvdan
Copy link
Member

mvdan commented Apr 20, 2020

Some sort of LRU, or mechanism to limit the amount of memory we hold onto at once, seems like an OK tradeoff to me. I guess we can work this out on the CL itself, but it seemed clearly backwards incompatible to me to make this entirely unbounded by design.

@dsnet
Copy link
Member

dsnet commented Nov 21, 2020

I've been experimenting with several possible implementations of this.

Here are some of my thoughts:

  • As a string gets longer, the probability of a cache hit drops and the cache lookup cost grows. This makes sense since longer strings have an exponentially larger number of possible representations and also take longer to compare or hash. Thus, we should only do caching for small strings (e.g., <= 16B).
  • The Go memory allocator is already fairly fast in allocating small strings. For example, a 16B string takes 35ns on my machine. Granted, this is the up-front CPU cost, and doesn't include the asynchronous costs associated with GC heap scanning and memory freeing. Any cache lookup cost should be comparable to or faster than 35ns.
  • For most realistic JSON data, strings in JSON object names have a high degree of locality. That is, seeing a specific object name has a high probability of seeing it again within the next few dozen object names. This is especially true for a JSON array containing homogenous JSON objects, but is also for tree-like representations of JSON objects where each node has a similar set of fields. Thus, a small cache can be highly beneficial for JSON object names.
  • For most realistic JSON data, identical strings in general JSON values exhibit poor locality. Repeated strings in data tend to occur either for enum-like values (e.g., "SUCCESS") or what are functionally "pointers" to create a graph out of JSON data (e.g., a ID reference to some other part of the JSON data). In order to detect repeated usages, we would need a relatively large cache. Also since most of strings will be a cache miss, there is the wasted cost of looking up a string in the common case where there is no benefit.
  • A naive map[string]string is not a good data structure for a string cache since its 1) hard to prevent unbounded growth 2) has double memory use (one for the key and one for the value; even though they hold the same value), and 3) holds a reference to a string value that may no longer be used by the application.

Here's a simple data structure that I tried that does seem sufficiently effective to use by default:

type stringCache [64]string

func (c *stringCache) make(b []byte) string {
	if len(b) < 2 || len(b) > 16 {
		return string(b)
	}
	h := hashBytes(b) % len(*c)
	if s := (*c)[h]; s == string(b) {
		return s
	}
	s := string(b)
	(*c)[h] = s
	return s
}

It's a small, fixed-size cache. Thus it avoids any concerns about unbounded memory growth or overtly wasting too much space holding onto old, unused strings. At worst, the memory usage for this is len(stringCache)*(unsafe.Sizeof(string) + maxStringSize) where maxStringSize is the maximum string length we try caching (16B in the above implementation). The effectiveness of this is highly dependent on the speed of hashBytes. The size of stringCache should be determined based on how often we expect to see the same strings repeat themselves and according to the birthday problem.

Using runtime.memhash as the implementation (see #42710) of hashBytes, I get the following numbers:

BenchmarkMalloc/FrequentNames    	48031	 25150 ns/op	 9600 B/op	1000 allocs/op
BenchmarkCache/FrequentNames     	66622	 18019 ns/op	   96 B/op	  10 allocs/op // 0.7x slower than Malloc
BenchmarkMap/FrequentNames       	40252	 31818 ns/op	  678 B/op	  11 allocs/op // 1.3x slower than Malloc

BenchmarkMalloc/PathalogicalNames	76096	 15348 ns/op	 8000 B/op	 500 allocs/op
BenchmarkCache/PathalogicalNames 	48894	 24289 ns/op	 8000 B/op	 500 allocs/op // 1.6x slower than Malloc
BenchmarkMap/PathalogicalNames   	10000	123864 ns/op	91250 B/op	 524 allocs/op // 8.1x slower than Malloc

This tests three different approaches:

  • Malloc is using a naive []byte to string conversion.
  • Cache is using stringCache.
  • Map is using a map[string]string as the cache.

This tests two extreme sets of data:

  • FrequentNames represents the best-case scenario where the same 10 strings occur over repeatedly.
  • Pathological represents the worst-case scenario where every string is exactly 16B long, but all different, resulting in a 100% cache miss. This is pathological for several reasons: 1) every entry in the cache will be populated, 2) the string in every entry is also 16B, so we must do a byte-for-byte comparison, 3) only the last few bytes of the string differ, 4) every entry results in a miss. Thus, for every string we do O(n) work to hash the string, and O(n) work to compare the string (where n is the string length).

Some notes:

  • A map[string]string rarely out-performs malloc and seems to be a net loss.
  • Assuming that most data look more like FrequentNames than PathologicalNames, the use of stringCache is probably a net win.
  • At least the pathological case for stringCache is only 1.3x slower (surprisingly 8x slower for map[string]string).

@josharian
Copy link
Contributor

@dsnet nice analysis. One more thing to try: use a fixed size array of strings, iterate instead of hashing for lookups, and do random replacement. Something like josharian@ab074c9. (There are many variations on this theme available.)

@randall77
Copy link
Contributor

Move-to-front is also a good caching scheme when there are a few very common keys.

@bserdar
Copy link

bserdar commented Nov 23, 2020 via email

@dsnet
Copy link
Member

dsnet commented Nov 23, 2020

@josharian and @randall77. I tried a simple linear search in some of my earlier approaches, and it does decent where there is a hit. However, I felt like it performed poorly in the cases where there were never any matches since the miss cost was proportional to the number of maximum elements that we linear search over.

One advantage of a linear search (with move-to-front) is that it provides a heuristic for which entry to evict. This avoids the collision problem that my naive hashing approach exhibits (due to the birthday problem). In order to have a decent hit rate on a list of strings with only N unique entries, the cache size needs to be approximately 2*N.

@randall77
Copy link
Contributor

We could consider any the frequent item algorithms described here: http://dimacs.rutgers.edu/~graham/pubs/papers/freq.pdf

It might be worth dividing the set of keys up in to buckets, hashing to the bucket, and performing one of the above algorithms on a single bucket.

@josharian
Copy link
Contributor

...with the cheapest possible “hash” being string length, since we are only considering lengths 2..16.

@dsnet
Copy link
Member

dsnet commented Nov 30, 2020

@randall77 and @josharian. Thanks for the idea. I haven't yet applied any of the specific algorithms in the paper you referenced, but I suspect they won't perform well in this specific application since allocation of small strings is already so dang fast that any book-keeping is often just as expensive as malloc itself (defeating the point of string interning).

I incorporated some of the ideas above and some new ideas:

  • Using the data provided by @bserdar, I computed a histogram of the string lengths of all unique object names:
 1 |> 0
 2 |=> 1
 3 |======> 6
 4 |==========> 10
 5 |===========> 11
 6 |====================> 20
 7 |==============> 14
 8 |================> 16
 9 |===============> 15
10 |=========> 9
11 |========> 8
12 |========> 8
13 |=====> 5
14 |=======> 7
15 |=======> 7
16 |===> 3
  • Combining the above data with the idea to bucketize based on the length, it seems that 8 buckets using the lowest 3 bits of the length might provide an okay uniform distribution across all buckets (i.e., we bucketize 8B and 16B together, 1B and 9B together, 2B and 10B together, etc). This produces a histogram as follows:
 1, 9 |===============> 15
 2,10 |==========> 10
 3,11 |==============> 14
 4,12 |==================> 18
 5,13 |================> 16
 6,14 |===========================> 27
 7,15 |=====================> 21
 8,16 |===================> 19

Bucketizing based on length has the advantage of isolating the effects of a given string length from affecting another bucket. However, it has the disadvantage that certain lengths may be under-utilized (i.e, the 2B,10B bucket for the specific data I'm playing with), leading to wasted space.

  • I also observed that most general-purpose hashing algorithms perform slowly for small strings since 1) they are often not inlineable, 2) may not benefit from any wide-integer instructions (at least from my observation). Thus, I tried performing a hash of a short string by relying on wide-integer loads and hashing the resulting integers.

The current algorithm I came up with is:

type stringCache [256]string

func (c *stringCache) make(b []byte) string {
	if len(b) < 2 || len(b) > 16 {
		return string(b)
	}

	var h uint64
	{
		// Read the string in two parts using wide-integer loads.
		// The prefix and suffix may overlap, which is fine.
		switch {
		case len(b) >= 8:
			h ^= uint64(binary.LittleEndian.Uint64(b[:8]))
			h *= 0x00000100000001B3 // inspired by FNV-64
			h ^= uint64(binary.LittleEndian.Uint64(b[len(b)-8:]))
		case len(b) >= 4:
			h ^= uint64(binary.LittleEndian.Uint32(b[:4]))
			h *= 0x01000193 // inspired by FNV-32
			h ^= uint64(binary.LittleEndian.Uint32(b[len(b)-4:]))
		default:
			h ^= uint64(binary.LittleEndian.Uint16(b[:2]))
			h *= 0x010f // inspired by hypothetical FNV-16
			h ^= uint64(binary.LittleEndian.Uint16(b[len(b)-2:]))
		}

		// Collapse a 64-bit, 32-bit, or 16-bit hash into an 8-bit hash.
		h ^= h >> 32
		h ^= h >> 16
		h ^= h >> 8

		// The cache has 8 buckets based on the lower 3-bits of the length.
		h = ((h << 3) | uint64(len(b)&0b111)) % uint64(len(*c))
	}

	if s := (*c)[h]; s == string(b) {
		return s
	}
	s := string(b)
	(*c)[h] = s
	return s
}

Relative to my earlier algorithm, this is decently faster and results is fewer misses:

BenchmarkMalloc/Best	41395	26997 ns/op	9600 B/op	1000 allocs/op
BenchmarkCache1/Best	66477	18359 ns/op	  96 B/op	  10 allocs/op // 1.47x faster
BenchmarkCache2/Best	88276	13699 ns/op	  96 B/op	  10 allocs/op // 1.97x faster
BenchmarkMap/Best   	39099	30543 ns/op	 678 B/op	  11 allocs/op // 0.88x faster

BenchmarkMalloc/Real	2503	505818 ns/op	161456 B/op	18529 allocs/op
BenchmarkCache1/Real	2769	440636 ns/op	 28960 B/op	 2475 allocs/op	// 1.15x faster
BenchmarkCache2/Real	4138	302386 ns/op	 17712 B/op	 1139 allocs/op	// 1.67x faster
BenchmarkMap/Real   	1668	741258 ns/op	 31706 B/op	  467 allocs/op	// 0.68x faster

BenchmarkMalloc/Worst	72957	 16689 ns/op	 8000 B/op	500 allocs/op
BenchmarkCache1/Worst	46144	 26646 ns/op	 8000 B/op	500 allocs/op // 1.60x slower
BenchmarkCache2/Worst	49250	 23897 ns/op	 8000 B/op	500 allocs/op // 1.43x slower
BenchmarkMap/Worst   	10000	135081 ns/op	91233 B/op	524 allocs/op // 8.09x slower

where Cache2 is the algorithm in this post, and Cache1 is the earlier algorithm. The Real benchmark was switched to using object names from the dataset provided by @bserdar.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/277376 mentions this issue: WIP: runtime: add string interning profile

@lkarlslund
Copy link

I did a string deduplicator (interner) a while ago with some dirty tricks to make it a weak map, perhaps some of that can give inspiration: https://github.com/lkarlslund/stringdedup

@dsnet
Copy link
Member

dsnet commented Mar 12, 2021

Hi @lkarlslund, thanks for the suggestion, but we won't be adding a dependency on unsafe from encoding/json. As such, any optimizations need to be implementable in pure Go and still be performant.

@bserdar
Copy link

bserdar commented Mar 12, 2021

@dsnet, one thing to consider re: json key string lengths: interning starts to make a difference with long keys. Performance is not the only consideration here. If you process expanded large JSON-LD documents, without interning, the memory footprint is huge. For example, the expanded JSON-LD version of the document from the dataset I mentioned earlier would have the key http://hl7.org/fhir/Patient.id instead of just id. Add that prefix to all the json keys, and without interning, the keys occupy more space than the actual data.
I am starting to think maybe a generic solution to string interning is not possible. Maybe it could be done with an optional interner with a default implementation that can be replaced at runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

Successfully merging a pull request may close this issue.