-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: hash: export a built-in hash function for comparable values #21195
Comments
Also this would allow us to use a very efficient version of sync.Map that allows sharding by a key's hash. |
It does feel weird to me though that this would go in the |
I'm aware of where the function is implemented, but that doesn't always correlate to where it should be categorically located. Arguably a lot of logic in reflect is in the "runtime", but we don't move it all to the |
@OneOfOne, |
@dsnet I don't feel strongly about which package it should go in, but |
Previously proposed, more or less, in this golang-dev thread: https://groups.google.com/d/msg/golang-dev/vd5Nnt3Wjj0/a5XFu6d9wsIJ . Any hash function can only be discussed in conjunction with an equality function, so the docs for your proposed function should make clear that values that are You suggest that this should only be used for comparable values, but of course that means that it can not be used for slices, which would seem to leave out many real world applications. (The current runtime hash will panic if used on a slice, as can be seen using a If we do this I definitely don't think it should be in the runtime package. That would suggest that it is specifically the hash function used by things like map, and even if that is in fact what is implemented, it should not be implied. |
Thanks for the link; an interesting read. It looks like @randall77's objection there was mainly that we would have to specify and freeze the algorithm, but here I'm suggesting the opposite: if we expose a hash function, we should randomize it so that folks won't accidentally rely on the particulars.
That's true, but having the built-in hash would make it much easier to define application hash functions for slices: it is likely that one could use the built-in hash function for the individual elements and write only a very small function to combine them. func hashSlice(s []T) uintptr {
h := uintptr(0)
for _, e := range s {
h = Hash([2]uintptr{h, Hash(e)})
}
return h
}
I think that's a good starting point since, as you note, a hash function must be paired with an equality function. The language spec already defines an equality function, so piggybacking on that minimizes the amount of new information one would need to learn in order to use the function.
That's fair. Any suggestions for alternatives? It could plausibly live in |
Updated, with an explicit caveat for NaN values (to avoid the confusion of #20660). |
Given that the standard library already has a package named "hash", I would vote for that. |
Comparability is well defined in Go. If we expand this, would it still perform a shallow hash of pointers or would it hash the sub-value pointed to by the pointer? |
If we expand this Currently, the |
The runtime hash function is free to use different algorithms on a per architecture and even feature set basis (AES vs non AES). It think it would be nice to take this into account for the proposed std lib hash function too such that we can optimize the hash function used for each architecture and feature set independently if the proposal is accepted. Therefore, apart from only having reproducible results within a process we could clarify that there should not be the expectation that the algorithm and "quality" of the hash is the same between different releases/archs/cpus. Another discussion could be if we want to extend the compiler to optimize this specific std lib hash function e.g. to a specialized call if the type is known during compile time. |
Agreed. The A |
Updated the proposal to name the function |
The first (CS related) search result for local hash is https://en.wikipedia.org/wiki/Locality-sensitive_hashing so that may not be the most appropriate name. |
Some observations. I'm still not sure where I stand on this, on balance.
Without this, the calls will allocate+copy, and as a result probably be prohibitively expensive in the contexts it is most likely to be used. With compiler optimization, this ends up seeming a bit like a language change. Another alternative is to expose hash functions for basic types (ints of various widths, strings). These are the core building blocks of the rest, which can then be composed as desired, particularly if the hash function accepts a seed (more on that below). An alternative is to expose these via the reflect package. For example, hash could be a method on reflect.Value. Or reflect.Type could return a hash function that can be invoked with values of its type. A good concrete test case might be how e.g. [1000]uint64 gets hashed. Passed as an interface{}, this will allocate and copy 64k. If we expose a suite of hash functions, the caller either needs to loop 1000 times (slow) or we need to provide some variable length memhash, which is inherently unsafe. If we expose only via reflect, then we add the readability costs of reflection, we add the high general costs of generating a reflect.Value or the cost of an indirect function call (reflect.Type), and we tie the tool chain's hands a bit (since IIRC it can do more magic when reflect is not in use). The only reasonable choice appears to be magic compiler support, since the compiler can ~safely do unsafe things. As an aside, an interesting test of a generics proposal is whether and how it would let you write a generic hash function (attn: @ianlancetaylor). One other observation: The runtime hash functions accept a seed. They do this, to quote @randall77, because in the case of maps:
I assume that a prime use case for exposing hash functions is allow people to write their own maps. We should at least afford them the opportunity to write secure implementations. That is, any exposed hash function should accept a seed, just like the runtime's do. And as observed above, accepting a seed allows the programmer to compose hash calls as necessary. |
Why would the calls allocate? With Mid-stack inlining (#19348) should omit that allocation entirely if the value is not already an interface type, since the top-level hash function will dispatch based on a constant type tag.
Ahh, now I see what you're getting at. But that should still be a general optimization ("don't copy interface payloads when they are treated as const and do not escape"), not something specific to the hash function. (It's the same optimization as #18822, but made much easier because the call does not occur through an interface.) |
I agree that the copy is a problem. The function could instead be defined to take a pointer to the value to be hashed so that it doesn't have to be copied. If we do that, I don't think allocation is a problem either, since escape analysis should already realize that we're not escaping the pointer in the interface. Arguably, the compiler could avoid the copy even if we didn't pass a pointer if it could prove that neither the hash function nor the caller mutates the value, but I'd be fine with passing a pointer. :) |
That's part of the proposal already, I think? I proposed that each invocation of the program should compute a process-local seed that the hash function uses implicitly; that way users won't have to figure out how to pick an to pick an appropriate seeding algorithm, and programs will have at least a basic measure of collision-resistance by default. |
I think @josharian is talking about the per-invocation seed. For example, every map in a process has a different seed, so even if someone manages to create a collision in one map, it won't carry over to another. This is in addition to the per-process seed. |
I'm super-skeptical about exposing any hash function. In the case of the current map implementation, if we did ever want to do things like moving pointers (which would influence the hashes) we know where the maps are and where the hash values are written, so we could rehash. Exposing a new hash function limits our ability to make such changes, with no clear benefit. Existing packages are using other hash functions just fine without our help. |
The runtime hash functions currently make indirect function calls, including to routines written in assembly. I'm very skeptical that escape analysis is going to plumb those depths without explicitly teaching the compiler about these special cases. Mid-stack inlining doesn't help, because the value eventually escapes.
This seems more promising, although it's a bit of a footgun: People have to remember to pass a pointer to the value to be hashed, even when that value is itself already a pointer (that is, a **T instead of a *T, or a ***T instead of a **T). I see this going wrong on a regular basis. And there's no clear way that we can help catch that mistake at compile time or at run time, at least not without a language change.
I was indeed, although it now occurs to me that that can be simulated by just hashing a struct containing the value you care about and a seed. So then the question is whether it is important/useful enough to impose it in the API signature.
Agreed. For what it's worth, the thing I have wanted over and over is access to runtime.strhash. (I tend to modify my tool chain to export it, but that's obviously not something we want people doing in general.) Exposing it appropriately would not inhibit moving pointers, nor would exporting the runtime's int- or float-hashing routines which I have also wanted, although less frequently. I think this wouldn't satisfy @bcmills's original goals, though. |
Doesn't reflect.Value.Pointer already limit our ability to make changes that affect object identity?
Well, sort of. I think this proposal is a lot like #15292 (generic programming), in that the primary impact of the status quo is to reduce the diversity of packages that people are willing to provide, support, and/or experiment with. The difficulty of writing hash-based data structures in user code pushes people toward less efficient workarounds (such as serializing keys to strings using |
I haven't seen anyone mention this so I'll throw in this use case as well: Disk-backed data structures. From what I've seen here, if you have a memory mapped data structure relying on hashes you'd be unable to take advantage of a hash function that doesn't allow for a seed to be passed in. Even then, if the implementation is opaque it'd be unwise as an implementer to rely on it not changing between versions unless it's explicitly called out as static. I saw @bcmills mention that @randall77 had rejected the idea in the golang-dev thread because of the specification needed, but that would be required in this case. To make things more concrete: suppose you're building a database system that is writing a log to a mmap'd file and generating hash values to verify data. If the hash function changes on different architectures and releases, it would not work at all. |
Please consider an API that lets callers rerandomize that. Something like E.g. a good DoS-safe If you let a long-running process run with a singular randomization, that lets attackers probe it and discover the logic behind the randomization. |
@ScottMansfield Persistent data structures sound like a use case that should specify their own hashing, e.g. say " |
It seems like you would need to separate "create a hash function (pick a random seed)" from "start hashing a particular value with that function". If you do that, and if you are sensitive to allocation (not allocating), then you might have something like:
That seems fine, but it doesn't seem like it needs to be in the standard library. Why not experiment outside the standard library with something like this first? |
I think the path forward here would be to experiment outside the standard library, as mentioned in the previous comment. That would help us gather some evidence about how well it works, how much it's needed and so on. Closing until we have that evidence. |
You can add to the list of "typical workarounds" requirement to calculate the hash, like in https://github.com/larytet/mcachego/blob/master/hashtable/hashtable.go#L141 |
Background
The Go runtime contains an efficient hash function that it uses to implement built-in maps.
The runtime hash function is capable of hashing arbitrary comparable values, and usually cannot be pruned away during linking: for example, it would be very difficult for the linker to determine which hash functions are needed in a program that uses
reflect.MapOf
to construct maps at run-time, or a program that storesinterface{}
values in amap[interface{}]Anything
.Occasionally, we need an efficient hash function outside the standard library, too: for example, we might want to implement other hash-based data structures, such as hash array mapped tries, concurrent tries, or prototypes for potential additions to the Go standard library (e.g. sync.Map), or write application code to compute hashes for normally-incomparable types (such as slices or maps) to be able to store them in built-in maps. (For the latter case, a built-in hash function would be a building block for a larger application-level hash function.)
Typical workarounds include:
http://godoc.org/github.com/apmckinlay/gsuneido/util/hmap#Key
http://godoc.org/github.com/hit9/htree#Item
http://godoc.org/github.com/lleo/go-hamt-key#Key
http://godoc.org/github.com/Workiva/go-datastructures/trie/dtrie#New
http://godoc.org/github.com/Workiva/go-datastructures/trie/ctrie#Ctrie.Insert
http://godoc.org/github.com/lalonde/hamt#Key
https://github.com/crawshaw/readmap/blob/master/readmap.go
http://godoc.org/github.com/orcaman/concurrent-map#ConcurrentMap.Set
http://godoc.org/github.com/OneOfOne/cmap#CMap.Set
reflect
orunsafe
to reimplement arbitrary hashing in application code:http://godoc.org/github.com/dlsniper/anyhash
http://godoc.org/github.com/mitchellh/hashstructure
http://godoc.org/github.com/cnf/structhash
I believe that the toil involved in implementing these workarounds discourages experimentation with more efficient data structures in Go, and needlessly complicates application code.
It would be reckless for us to expose a stable hash function in the standard library. However, it seems like it would be useful to expose a per-process hash function, randomly salted at each invocation of the process, in order to facilitate the use of custom data structures in user code.
Proposal
I propose that we add the following function to the
hash
package:(@orcaman @OneOfOne @aclements @dsnet)
The text was updated successfully, but these errors were encountered: