-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
encoding/json: memoize strings during decode #32779
Comments
A while ago I wrote some code to test this exact same thing with the json decoder, but could not detect any significant difference between the memory usage of the decoder with memoized strings and the regular json decoder, which made me think maybe this is already done somewhere? Apparently not. I can try to find out my test code... |
Here's a link to my test code that does this for decoding json into interface{}: https://github.com/bserdar/jsdecodertest My initial comment was wrong, it does make a difference to store strings in a map. My test code stores only field keys in a map because those are the strings that are likely to repeat. The json/ package is a copy of the stdlib encoding/json package modified to include a map[string]string in the decoder state. |
Nit: I assume the proposal meant to use a |
Also cc @josharian, who had a very similar idea a while ago. That was briefly discussed in #5160. As a thought, we probably don't want to reuse all strings. For example, in a map the keys probably tend to repeat much more often than the values. Also, do we want to make this configurable? Someone who writes a program likely knows whether they'll read somewhat static/repeated data or not. |
Change https://golang.org/cl/188500 mentions this issue: |
@mvdan, are you concerned about the overhead of the map lookup in your Jun 26 comment? From the outside there can be no possible downside to reusing all strings that are constructed. |
The CL mentioned above has an implementation of this for JSON decoder. It introduces some overhead when decoding to an interface{}, because that's the only case where the keys read during decoding will be kept after decoding is done. However, I think there is a better way to do this. What I propose is to write an encoding/LiteralStore type (a map[string]string) containing a Get(string) string method to retrieve the shared instance of the string. Then this can be used from the decoders in encoding/xml and encoding/json. To use the LiteralStore, the caller has to assign an instance to the JSON/XML decoder. This way, the caller has the option to use a literal store for large input where it makes a difference, use the store for multiple docs, or avoid it completely. What do you think? |
This is one of the conventional use-cases for weak references in GC'd languages. Ideally, we should reuse strings that remain reachable from the last |
Alternatives that do not require weak references include:
|
@bcmills, I propose to make the interning behavior explicit opt-in using a
This implementation does not change the existing behavior and makes the use of a literal store explicit opt-in. A literal store like this is only useful for larger json/xml input, and only if it is manually decoded, or unmarshaled into an interface{}. If the input is unmarshaled/decoded into a struct, then there is no need to keep the keys as they will be lost. Re: weak references, afaik there is no way to implement this using finalizers, am I wrong? |
@bserdar I don't think there is any way to do this with finalizers because existing users will be using In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use. |
@ianlancetaylor thanks for the clarification. If there is interest in this, I can modify https://golang.org/cl/188500 to implement this. I already have some wrapper code around json/xml decoder to do this because reusing just the key strings makes a lot of difference for large docs. cc @mvdan |
No? We would need some sort of language-supported “weak string” type, but then only the (But that assumes a language change that seems pretty unlikely to ever happen.) |
Ah, I think I see what you're saying. If we could control the types in use, then we could add a pointer indirection, and require users to keep the pointer alive in order to retain the But then you can't use those pointers in the same way as ordinary strings: they can't be map keys, they won't work with the normal built-in It may be true that they are technically equivalent in power, but they are not equivalently usable. |
@bserdar, The general form could have an API like: package intern
type Table(type T HashEqualer) […]
func (t Table(T)) Get(x T) (canonicalX T) {
[…]
} Such an API is also useful for building things like abstract syntax trees, where it can be useful to compress trees by coalescing occurrences of a common subexpression to the same canonical instance. (A long time ago I maintained a server that evaluated large numbers of boolean expressions for ad targeting, and it used a similar approach — to great effect — to compress the targeting expressions, including in our wire protocol.) |
@bcmills, interning table would be nice, but it won't be available any time soon. Until that becomes available, we can keep interning internal to the Decoder, but expose an InternKeys(bool) method for explicit opt-in. Or, implement the strings.LiteralStore, and change it to use the intern package when generics are done. |
xml.Decoder can benefit from interning strings as well. What is the opinion about that? Decoder.InternKeys() would be the explicit opt-in for interning keys during decode. Something similar to this: https://go-review.googlesource.com/c/go/+/188500 |
There's been a lot of discussion here but I haven't seen any objections to simply putting the map I suggested into Decoder and not trying to reuse it any more than that. I've seen no rationale for adding new API. Just do it all the time. Am I missing something? |
I can rebase this old CL unless there are other plans: https://go-review.googlesource.com/c/go/+/188500 |
@rsc I think we should answer the two questions I raised again in #32779 (comment) before getting into a CL. To repeat my current understanding:
In my opinion, this is a subtle and dangerous breaking change. Even if we strongly marked this change in behavior in the release notes, discouraging the use of global or long-lived decoders, it still seems like a change that could bring pain to our users since it's easy not to notice a leak until a service runs out of memory a week later. |
Maybe this is a dumb question but, would the strings being memoized be from the types, or strings that just flew by? If it's the types themselves I would be comfortable running this in prod, but if it's strings that happened to be passed from a user (like the values in a map or in a list or "unknown" keys to a struct) this sounds like a possible DoS. I don't know how you could keep an encoder around for a long time in a web context other than maybe web sockets??? |
@mvdan, what if we added some sort of decay behavior or LRU caching, so that infrequently used-strings would expire instead of being memoized forever? That would at least prevent unbounded memory leaks for infrequent keys, while still providing most of the memory reduction of memoization. |
Agreed, this is similar in spirit to what you get with josharian/intern, and it worked well in practice for us. Worth noting that we were interning only the field values, not the field keys (we didn't have maps in the decoded struct, but we had quite some string fields whose values were likely to repeat).
Given what I wrote above, I think we should do the latter. If the worry is keeping too many strings alive it should be easy enough to cap the size of the interning map (even just random replacement would likely be enough, and trivial to implement; as the current proposal is to have an interning map per Decoder, it will likely not make much of a difference to implement a better eviction policy since multiple Decoders will allocate the same strings anyway). |
Some sort of LRU, or mechanism to limit the amount of memory we hold onto at once, seems like an OK tradeoff to me. I guess we can work this out on the CL itself, but it seemed clearly backwards incompatible to me to make this entirely unbounded by design. |
I've been experimenting with several possible implementations of this. Here are some of my thoughts:
Here's a simple data structure that I tried that does seem sufficiently effective to use by default: type stringCache [64]string
func (c *stringCache) make(b []byte) string {
if len(b) < 2 || len(b) > 16 {
return string(b)
}
h := hashBytes(b) % len(*c)
if s := (*c)[h]; s == string(b) {
return s
}
s := string(b)
(*c)[h] = s
return s
} It's a small, fixed-size cache. Thus it avoids any concerns about unbounded memory growth or overtly wasting too much space holding onto old, unused strings. At worst, the memory usage for this is Using
This tests three different approaches:
This tests two extreme sets of data:
Some notes:
|
@dsnet nice analysis. One more thing to try: use a fixed size array of strings, iterate instead of hashing for lookups, and do random replacement. Something like josharian@ab074c9. (There are many variations on this theme available.) |
Move-to-front is also a good caching scheme when there are a few very common keys. |
On Sat, Nov 21, 2020 at 12:18 PM Joe Tsai ***@***.***> wrote:
I've been experimenting with several possible implementations of this.
Here are some of my thoughts:
- As a string gets longer, the probability of a cache hit drops and
the cache lookup cost grows. This makes sense since longer strings have an
exponentially larger number of possible representations and also take
longer to compare or hash. Thus, we should only do caching for small
strings (e.g., <= 16B).
- The Go memory allocator is already fairly fast in allocating small
strings. For example, a 16B string takes 35ns on my machine. Granted, this
is the up-front CPU cost, and doesn't include the asynchronous costs
associated with GC heap scanning and memory freeing. Any cache lookup cost
should be comparable to or faster than 35ns.
- For most realistic JSON data, strings in JSON object names have a
high degree of locality. That is, seeing a specific object name has a high
probability of seeing it again within the next few dozen object names. This
is especially true for a JSON array containing homogenous JSON objects, but
is also for tree-like representations of JSON objects where each node has a
similar set of fields. Thus, a small cache can be highly beneficial for
JSON object names.
Here's a realistic JSON data set:
https://storage.googleapis.com/synthea-public/synthea_sample_data_fhir_r4_sep2019.zip
This data set contains many large JSON documents, generated health data.
These are the kind of documents that caching keys can make a real
difference.
Several observations:
* There is little point in caching values of key:value pairs in these
docs. However, caching keys make a significant difference.
* There are many unique keys
* Note that a map[string]string still has a single copy of the string:
map[s]=s
* The problem of unbounded growth can be dealt with using a separate
string cache for each JSON document. Once the document ends, the cache can
be forgotten.
… - For most realistic JSON data, identical strings in general JSON
values exhibit poor locality. Repeated strings in data tend to occur either
for enum-like values (e.g., "SUCCESS") or what are functionally "pointers"
to create a graph out of JSON data (e.g., a ID reference to some other part
of the JSON data). In order to detect repeated usages, we would need a
relatively large cache. Also since most of strings will be a cache miss,
there is the wasted cost of looking up a string in the common case where
there is no benefit.
- A naive map[string]string is not a good data structure for a string
cache since its 1) hard to prevent unbounded growth 2) has double memory
use (one for the key and one for the value; even though they hold the same
value), and 3) holds a reference to a string value that may no longer be
used by the application.
Here's a simple data structure that I tried that does seem sufficiently
effective to use by default:
type stringCache [64]string
func (c *stringCache) make(b []byte) string {
if len(b) < 2 || len(b) > 16 {
return string(b)
}
h := hashBytes(b) % len(*c)
if s := (*c)[h]; s == string(b) {
return s
}
s := string(b)
(*c)[h] = s
return s
}
It's a small, fixed-size cache. Thus it avoids any concerns about
unbounded memory growth or overtly wasting too much space holding onto old,
unused strings. At worst, the memory usage for this is len(stringCache)*(unsafe.Sizeof(string)
+ maxStringSize) where maxStringSize is the maximum string length we try
caching (16B in the above implementation). The effectiveness of this is
highly dependent on the speed of hashBytes. The size of stringCache
should be determined based on how often we expect to see the same strings
repeat themselves and according to the birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>.
Using runtime.memhash as the implementation (see #42710
<#42710>) of hashBytes, I get the
following numbers <https://play.golang.org/p/7e6w9w1V8dt>:
BenchmarkMalloc/FrequentNames 48031 25150 ns/op 9600 B/op 1000 allocs/op
BenchmarkCache/FrequentNames 66622 18019 ns/op 96 B/op 10 allocs/op // 0.7x slower than Malloc
BenchmarkMap/FrequentNames 40252 31818 ns/op 678 B/op 11 allocs/op // 1.3x slower than Malloc
BenchmarkMalloc/PathalogicalNames 76096 15348 ns/op 8000 B/op 500 allocs/op
BenchmarkCache/PathalogicalNames 48894 24289 ns/op 8000 B/op 500 allocs/op // 1.6x slower than Malloc
BenchmarkMap/PathalogicalNames 10000 123864 ns/op 91250 B/op 524 allocs/op // 8.1x slower than Malloc
This tests three different approaches:
- Malloc is using a naive []byte to string conversion.
- Cache is using stringCache.
- Map is using a map[string]string as the cache.
This tests two extreme sets of data:
- FrequentNames represents the best-case scenario where the same 10
strings occur over repeatedly.
- Pathological represents the worst-case scenario where every string
is exactly 16B long, but all different, resulting in a 100% cache miss.
This is pathological for several reasons: 1) every entry in the cache will
be populated, 2) the string in every entry is also 16B, so we must do a
byte-for-byte comparison, 3) only the last few bytes of the string differ,
4) every entry results in a miss. Thus, for every string we do O(n) work to
hash the string, and O(n) work to compare the string (where n is the string
length).
Some notes:
- A map[string]string rarely out-performs malloc and seems to be a net
loss.
- Assuming that most data look more like FrequentNames than
PathologicalNames, the use of stringCache is probably a net win.
- At least the pathological case for stringCache is only 1.3x slower
(surprisingly 8x slower for map[string]string).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#32779 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA4AGDKAYEOPE45EW7GYE6TSRAG65ANCNFSM4H3MJFLQ>
.
|
@josharian and @randall77. I tried a simple linear search in some of my earlier approaches, and it does decent where there is a hit. However, I felt like it performed poorly in the cases where there were never any matches since the miss cost was proportional to the number of maximum elements that we linear search over. One advantage of a linear search (with move-to-front) is that it provides a heuristic for which entry to evict. This avoids the collision problem that my naive hashing approach exhibits (due to the birthday problem). In order to have a decent hit rate on a list of strings with only N unique entries, the cache size needs to be approximately 2*N. |
We could consider any the frequent item algorithms described here: http://dimacs.rutgers.edu/~graham/pubs/papers/freq.pdf It might be worth dividing the set of keys up in to buckets, hashing to the bucket, and performing one of the above algorithms on a single bucket. |
...with the cheapest possible “hash” being string length, since we are only considering lengths 2..16. |
@randall77 and @josharian. Thanks for the idea. I haven't yet applied any of the specific algorithms in the paper you referenced, but I suspect they won't perform well in this specific application since allocation of small strings is already so dang fast that any book-keeping is often just as expensive as malloc itself (defeating the point of string interning). I incorporated some of the ideas above and some new ideas:
Bucketizing based on length has the advantage of isolating the effects of a given string length from affecting another bucket. However, it has the disadvantage that certain lengths may be under-utilized (i.e, the 2B,10B bucket for the specific data I'm playing with), leading to wasted space.
The current algorithm I came up with is: type stringCache [256]string
func (c *stringCache) make(b []byte) string {
if len(b) < 2 || len(b) > 16 {
return string(b)
}
var h uint64
{
// Read the string in two parts using wide-integer loads.
// The prefix and suffix may overlap, which is fine.
switch {
case len(b) >= 8:
h ^= uint64(binary.LittleEndian.Uint64(b[:8]))
h *= 0x00000100000001B3 // inspired by FNV-64
h ^= uint64(binary.LittleEndian.Uint64(b[len(b)-8:]))
case len(b) >= 4:
h ^= uint64(binary.LittleEndian.Uint32(b[:4]))
h *= 0x01000193 // inspired by FNV-32
h ^= uint64(binary.LittleEndian.Uint32(b[len(b)-4:]))
default:
h ^= uint64(binary.LittleEndian.Uint16(b[:2]))
h *= 0x010f // inspired by hypothetical FNV-16
h ^= uint64(binary.LittleEndian.Uint16(b[len(b)-2:]))
}
// Collapse a 64-bit, 32-bit, or 16-bit hash into an 8-bit hash.
h ^= h >> 32
h ^= h >> 16
h ^= h >> 8
// The cache has 8 buckets based on the lower 3-bits of the length.
h = ((h << 3) | uint64(len(b)&0b111)) % uint64(len(*c))
}
if s := (*c)[h]; s == string(b) {
return s
}
s := string(b)
(*c)[h] = s
return s
} Relative to my earlier algorithm, this is decently faster and results is fewer misses: BenchmarkMalloc/Best 41395 26997 ns/op 9600 B/op 1000 allocs/op
BenchmarkCache1/Best 66477 18359 ns/op 96 B/op 10 allocs/op // 1.47x faster
BenchmarkCache2/Best 88276 13699 ns/op 96 B/op 10 allocs/op // 1.97x faster
BenchmarkMap/Best 39099 30543 ns/op 678 B/op 11 allocs/op // 0.88x faster
BenchmarkMalloc/Real 2503 505818 ns/op 161456 B/op 18529 allocs/op
BenchmarkCache1/Real 2769 440636 ns/op 28960 B/op 2475 allocs/op // 1.15x faster
BenchmarkCache2/Real 4138 302386 ns/op 17712 B/op 1139 allocs/op // 1.67x faster
BenchmarkMap/Real 1668 741258 ns/op 31706 B/op 467 allocs/op // 0.68x faster
BenchmarkMalloc/Worst 72957 16689 ns/op 8000 B/op 500 allocs/op
BenchmarkCache1/Worst 46144 26646 ns/op 8000 B/op 500 allocs/op // 1.60x slower
BenchmarkCache2/Worst 49250 23897 ns/op 8000 B/op 500 allocs/op // 1.43x slower
BenchmarkMap/Worst 10000 135081 ns/op 91233 B/op 524 allocs/op // 8.09x slower where |
Change https://golang.org/cl/277376 mentions this issue: |
I did a string deduplicator (interner) a while ago with some dirty tricks to make it a weak map, perhaps some of that can give inspiration: https://github.com/lkarlslund/stringdedup |
Hi @lkarlslund, thanks for the suggestion, but we won't be adding a dependency on |
@dsnet, one thing to consider re: json key string lengths: interning starts to make a difference with long keys. Performance is not the only consideration here. If you process expanded large JSON-LD documents, without interning, the memory footprint is huge. For example, the expanded JSON-LD version of the document from the dataset I mentioned earlier would have the key |
Part of the motivation for #32593 is the observation in json-iterator/go#376 that when you json decode, many of the same strings appear over and over in the JSON and get copied and reconstructed in the result over and over as well.
We do not want to make the JSON coalesce all the allocated strings into one giant buffer, because then holding on to just one of them holds onto the entire thing.
But it might make sense to add to the decoder state a simple map[[]byte]string and remember which specific byte slices we've already converted to string and reuse those allocated strings instead of doing the same conversions again in a particular Decode or Unmarshal operation.
It's unclear to me whether the map in a Decoder should persist between Decode operations. Probably? That will affect users who put Decoders in a pool or something like that, though. (Solution: don't put decoders in pools.)
The text was updated successfully, but these errors were encountered: