For the future, I'd like to rewrite the hash set/map/table code once again. Some of the things I'd like to do in the new new code would include:
implement it so that it can be used as a true hashset, i.e. with no unnecessary memory or performance overhead (with the current code, you can implement a hashset as a hashmap that maps its elements to an arbitrary value, e.g. true, but this wastes memory and CPU cycles)
Also, I'd like to implement it as an ordered hash set, which is the perfect data structure for orbit enumerations, and in practice can even be faster than a classic hash set (as was implemented here). For an explanation of this, see e.g. here: https://morepypy.blogspot.de/2015/01/faster-more-memory-efficient-and-more.html
Another thing I'd like to experiment with is Robin Hood hashing, see e.g.
collect (more) statistics e.g. about collisions -- this is very helpful for debugging, and helps identify bad hash functions and also bugs in the hash table code
experiment with the load factor at which we resize the table (currently hard coded to 70 percent in the C implementation)
our open addressing currently uses this probe function (with PERTURB_SHIFT equal to 5):
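As a rough sketch of how three of the items above (Robin Hood hashing, collision statistics, and the resize threshold) might fit together, here is a minimal C example. It is not the package's code and is unrelated to the probe function mentioned in the last item: the names `RHSet`, `rh_insert`, `rh_contains` and `rh_needs_resize` are hypothetical, the keys are plain `uint32_t`, the multiplicative hash is an arbitrary choice, and the table has a fixed size, so hitting the 70 percent threshold is only reported rather than handled.

```c
#include <assert.h>
#include <stdint.h>

#define RH_CAP 16  /* slot count, a power of two; fixed because this sketch never resizes */

typedef struct {
    uint32_t key[RH_CAP];
    uint8_t  dist[RH_CAP];  /* 0 = empty, otherwise 1 + distance from the key's home slot */
    uint32_t count;         /* number of stored keys */
    uint32_t collisions;    /* statistic: occupied slots stepped over during inserts */
    uint32_t max_probe;     /* statistic: largest dist value ever written */
} RHSet;

/* Arbitrary multiplicative hash; the top 4 bits select one of the 16 slots. */
static uint32_t rh_home(uint32_t k) { return (k * 2654435761u) >> 28; }

/* Integer form of "load factor >= 70 percent". */
int rh_needs_resize(const RHSet *t) { return t->count * 10 >= RH_CAP * 7; }

/* The search can stop at an empty slot or at an entry that sits closer to
   its home than we are to ours: our key would have displaced it. */
int rh_contains(const RHSet *t, uint32_t k)
{
    uint32_t i = rh_home(k);
    for (uint8_t d = 1; t->dist[i] >= d; d++) {
        if (t->key[i] == k) return 1;
        i = (i + 1) & (RH_CAP - 1);
    }
    return 0;
}

/* Returns 1 if k is in the set afterwards, 0 if the table must grow first. */
int rh_insert(RHSet *t, uint32_t k)
{
    if (rh_contains(t, k)) return 1;
    if (rh_needs_resize(t)) return 0;  /* a real table would resize here */
    uint32_t i = rh_home(k);
    uint8_t d = 1;
    while (t->dist[i] != 0) {
        t->collisions++;
        if (t->dist[i] < d) {  /* resident is "richer": swap and carry it onward */
            uint32_t tk = t->key[i];
            uint8_t td = t->dist[i];
            t->key[i] = k;
            t->dist[i] = d;
            if (d > t->max_probe) t->max_probe = d;
            k = tk;
            d = td;
        }
        i = (i + 1) & (RH_CAP - 1);
        d++;
    }
    t->key[i] = k;
    t->dist[i] = d;
    if (d > t->max_probe) t->max_probe = d;
    t->count++;
    return 1;
}
```

Starting from a zero-initialised `RHSet`, the counters `collisions` and `max_probe` can be inspected after a batch of `rh_insert` calls; unusually large values would hint at a poor hash function or a bug in the probing logic.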
@fingolfin mentioned yesterday the idea of a hashmap which also preserves the order in which elements were added. As I understood it, this is essentially a PLIST to which new elements are added at the end, combined with a hashtable storing indices into the PLIST.
A few thoughts about this:
The way to present this at GAP level is probably as a List with Add but no assignment and a super-fast Position method (and consequently also an \in method). That very neatly meets the needs of orbit algorithms.
If you don't plan to delete much then this approach may always be correct. In the hash table you store the index of the key-value pair (if you have values) and some bits of the hash value of the key. Depending on the size of the hash table, you could get all of that into 32 or 64 bits per entry for all but the most enormous tables. With linear probing and Robin Hood hashing, for instance, you will basically need just one cache line from the hash table plus one from the PLIST plus whatever you have to do to compare entries for almost all lookups.
Making one of these would replace the common idiom of sorting a PLIST when you have finished making it and before you start doing a lot of Position or \in tests on it. Provided you can find a hash function, this is strictly better.
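The layout described in this comment (a dense list in insertion order plus a hash table of indices into it) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `OrderedSet`, `oset_add` and `oset_position` are hypothetical names, keys are `uint32_t` instead of GAP objects, the hash function is an arbitrary multiplicative one, the table has a fixed size, and the slots store only the index, not the extra hash bits suggested above.

```c
#include <assert.h>
#include <stdint.h>

#define OSET_BITS 4
#define OSET_CAP  (1u << OSET_BITS)  /* fixed index table size for brevity */

typedef struct {
    uint32_t keys[OSET_CAP];   /* plays the role of the PLIST: insertion order */
    uint32_t slots[OSET_CAP];  /* 0 = empty, otherwise a 1-based position in keys[] */
    uint32_t len;              /* number of elements added so far */
} OrderedSet;

/* Arbitrary multiplicative hash; the top OSET_BITS bits pick the slot. */
static uint32_t oset_home(uint32_t k)
{
    return (k * 2654435761u) >> (32 - OSET_BITS);
}

/* Like Position: returns the 1-based position of key, or 0 if it is absent. */
uint32_t oset_position(const OrderedSet *s, uint32_t key)
{
    uint32_t i = oset_home(key);
    while (s->slots[i] != 0) {
        if (s->keys[s->slots[i] - 1] == key)
            return s->slots[i];
        i = (i + 1) & (OSET_CAP - 1);  /* linear probing */
    }
    return 0;
}

/* Like Add, but a no-op for known keys; returns the key's position. */
uint32_t oset_add(OrderedSet *s, uint32_t key)
{
    uint32_t i = oset_home(key);
    while (s->slots[i] != 0) {
        if (s->keys[s->slots[i] - 1] == key)
            return s->slots[i];
        i = (i + 1) & (OSET_CAP - 1);
    }
    /* Keep one slot empty so that probing always terminates; a real
       implementation would resize instead of asserting. */
    assert(s->len < OSET_CAP - 1);
    s->keys[s->len] = key;
    s->slots[i] = ++s->len;
    return s->len;
}
```

For an orbit enumeration, one would call `oset_add` for every newly computed point and then walk `keys[0 .. len-1]` in order; since the elements never move, the position returned for a point stays valid for the lifetime of the set.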
Experiment with other probe functions (besides the current one with PERTURB_SHIFT equal to 5), e.g. linear or quadratic probing
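To make the comparison concrete, here is a small C sketch of three probe steps for a table whose size is a power of two (`mask` is the size minus one). The perturbed variant follows the scheme used by CPython's dict; whether it matches the package's current probe function exactly is an assumption. The function names are hypothetical.

```c
#include <stddef.h>

#define PERTURB_SHIFT 5

/* Linear probing: simply move to the next slot. */
size_t probe_linear(size_t i, size_t mask)
{
    return (i + 1) & mask;
}

/* Quadratic probing in its "triangular" form: the step grows by one per
   probe (*step starts at 0). For power-of-two sizes this visits every
   slot exactly once before repeating. */
size_t probe_quadratic(size_t i, size_t *step, size_t mask)
{
    *step += 1;
    return (i + *step) & mask;
}

/* CPython-style perturbed probing: *perturb starts as the full hash value,
   so its higher bits influence the first few probes. Once *perturb reaches
   zero, i -> 5*i + 1 cycles through all slots. */
size_t probe_perturb(size_t i, size_t *perturb, size_t mask)
{
    *perturb >>= PERTURB_SHIFT;
    return (5 * i + 1 + *perturb) & mask;
}
```

Linear probing has the best cache behaviour but suffers most from clustering; the other two trade locality for a better spread, which is why measuring collisions for each variant is worthwhile.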
More ideas and TODOs:
KeysIterator, ValuesIterator, KeyValuesIterator (the last one would return pairs [key, value])
for x in hashmap do ... -- I think this should iterate over keys
HashSet -- this could use almost the exact same code, it would simply omit the values plist
UPDATE: more stuff:
add a LookupWithDefault(hashmap, key, default), which works like the usual lookup, but if key is not present in the hashmap, it returns default.
provide a high-level interface for DS_Hash_AccumulateValue
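The intended behaviour of LookupWithDefault can be illustrated with a toy map in C. Everything here is hypothetical (`ToyMap`, `toy_set`, `toy_lookup_with_default`, integer keys and values, a fixed-size table with linear probing); the point is only that the lookup returns the caller's default instead of signalling a missing key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_CAP 8  /* fixed size, a power of two; the sketch never resizes */

typedef struct {
    int      used[MAP_CAP];
    uint32_t key[MAP_CAP];
    int      val[MAP_CAP];
} ToyMap;

/* Arbitrary multiplicative hash; the top 3 bits select one of 8 slots. */
static uint32_t toy_home(uint32_t k) { return (k * 2654435761u) >> 29; }

/* Returns a pointer to the value stored under k, or NULL if k is absent. */
static int *toy_find(ToyMap *m, uint32_t k)
{
    uint32_t i = toy_home(k);
    for (int n = 0; n < MAP_CAP && m->used[i]; n++) {
        if (m->key[i] == k) return &m->val[i];
        i = (i + 1) & (MAP_CAP - 1);
    }
    return NULL;
}

/* Inserts or overwrites the value for k. */
void toy_set(ToyMap *m, uint32_t k, int v)
{
    uint32_t i = toy_home(k);
    for (int n = 0; m->used[i] && m->key[i] != k; n++) {
        assert(n < MAP_CAP - 1);  /* table full: no resizing in this sketch */
        i = (i + 1) & (MAP_CAP - 1);
    }
    m->used[i] = 1;
    m->key[i] = k;
    m->val[i] = v;
}

/* Works like the usual lookup, but returns dflt when k is not present. */
int toy_lookup_with_default(ToyMap *m, uint32_t k, int dflt)
{
    int *p = toy_find(m, k);
    return p ? *p : dflt;
}
```

With a zero-initialised `ToyMap`, a call such as `toy_lookup_with_default(&m, 4, -1)` yields `-1` until some value has been stored under key 4, which saves the caller a separate membership test.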