Restore "Tweak our hash bucket splitting rules"
This reverts commit e4343ef,
which was a revert of 05f97de.

Prior to this patch we resized hashes when, after inserting a key,
the load factor of the hash reached 1 (load factor = keys / buckets).

This patch makes two subtle changes to this logic:

1. We split only after inserting a key into an already-utilized bucket,
2. and then only if the load factor exceeds 0.667 (see the sketch below).
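
In hv.c terms, rule 1 is the new in_collision flag and rule 2 is the
reworked DO_HSPLIT macro, both visible in the diff below. A minimal
standalone sketch of the combined condition (the should_split helper
and its signature are illustrative, not actual hv.c code):

    #include <stddef.h>

    /* Split only when the insert collided AND the load factor exceeds
     * roughly 2/3.  Here "max" is buckets - 1, mirroring xhv_max, so
     * keys + keys/2 > max  is equivalent to  keys / buckets > ~0.667. */
    static int should_split(size_t keys, size_t max, int collided)
    {
        return collided && (keys + (keys >> 1)) > max;
    }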

The intent and effect of this change are to increase the efficiency of
our hash tables. Reducing the maximum load factor to 0.667 means that
we should have far fewer keys in collision overall, at the cost of some
unutilized space (2/3rds was chosen as it is easier to calculate than
0.7). On the other hand, splitting only after a collision means in
theory that we execute the "final split" less often. Additionally,
inserting a key into an unused bucket increases the efficiency of the
hash without changing the worst case.[1] In other words, without
increasing collisions we use the space in our hashes more efficiently.
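
For example (my arithmetic, not part of the patch): with 16 buckets,
xhv_max is 15, and keys + (keys >> 1) > 15 first holds at 11 keys
(11 + 5 = 16), i.e. a load factor of 11/16 ≈ 0.69; even then the split
is deferred until an insert actually lands in an occupied bucket.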

A side effect of this change is that the size of a hash is more
sensitive to key insert order. A set of keys with some collisions might
produce one size if those collisions were encountered early, or another
if they were encountered later. Assuming a random distribution of hash
values, about 50% of hashes should be smaller than they would be
without this rule.

The two changes complement each other: lowering the maximum load factor
decreases the chance of a collision, while splitting only after a
collision means that we won't waste as much of that extra space as we
otherwise might.

[1] Since I personally didn't find this obvious at first, here is my
explanation:

The old behavior was that we doubled the number of buckets when the
number of keys in the hash matched the number of buckets. So on
inserting the Kth key into a K-bucket hash, we would double the number
of buckets. Thus the worst case prior to this patch was a hash
containing K-1 keys which all hash into a single bucket, and the
post-split worst case behavior would be having K items in a single
bucket of a hash with 2*K buckets total.

The new behavior says that we double the size of the hash when we
insert an item into an occupied bucket and, after doing so, exceed the
maximum load factor (leaving aside the change in maximum load factor in
this patch). If we insert into an occupied bucket (including the
worst-case bucket) then we trigger a split, and we have exactly the
same cases as before. If we insert into an empty bucket then we now
have a worst case of K-1 items in one bucket and 1 item in another, in
a hash with K buckets; thus the worst case has not changed.
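
As a concrete instance (my numbers, not from the patch): take K = 8,
i.e. 8 buckets holding 7 keys that all hash into one bucket. Inserting
an 8th key into that bucket collides, and 8 + 4 > 7, so we split to 16
buckets with at worst all 8 keys still in one bucket, exactly the old
post-split worst case. Inserting it into an empty bucket instead leaves
7 keys in one bucket and 1 in another with no split, which is no worse.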
demerphq committed Jun 1, 2017
1 parent 4d8c782 commit 6f019ba
Showing 6 changed files with 57 additions and 25 deletions.
4 changes: 2 additions & 2 deletions ext/Hash-Util/t/Util.t
@@ -606,9 +606,9 @@ ok(defined($hash_seed) && $hash_seed ne '', "hash_seed $hash_seed");
 my $array1= bucket_array({});
 my $array2= bucket_array({1..10});
 is("@info1","0 8 0");
-is("@info2[0,1]","5 8");
+like("@info2[0,1]",qr/5 (?:8|16)/);
 is("@stats1","0 8 0");
-is("@stats2[0,1]","5 8");
+like("@stats2[0,1]",qr/5 (?:8|16)/);
 my @keys1= sort map { ref $_ ? @$_ : () } @$array1;
 my @keys2= sort map { ref $_ ? @$_ : () } @$array2;
 is("@keys1","");
10 changes: 6 additions & 4 deletions ext/Hash-Util/t/builtin.t
@@ -26,13 +26,15 @@ is(used_buckets(%hash), 1, "hash should have one used buckets");
 
 $hash{$_}= $_ for 2..7;
 
-like(bucket_ratio(%hash), qr!/8!, "hash has expected number of buckets in bucket_ratio");
-is(num_buckets(%hash), 8, "hash should have eight buckets");
+like(bucket_ratio(%hash), qr!/(?:8|16)!, "hash has expected number of buckets in bucket_ratio");
+my $num= num_buckets(%hash);
+ok(($num == 8 || $num == 16), "hash should have 8 or 16 buckets");
 cmp_ok(used_buckets(%hash), "<", 8, "hash should have one used buckets");
 
 $hash{8}= 8;
-like(bucket_ratio(%hash), qr!/16!, "hash has expected number of buckets in bucket_ratio");
-is(num_buckets(%hash), 16, "hash should have sixteen buckets");
+like(bucket_ratio(%hash), qr!/(?:8|16)!, "hash has expected number of buckets in bucket_ratio");
+$num= num_buckets(%hash);
+ok(($num == 8 || $num == 16), "hash should have 8 or 16 buckets");
 cmp_ok(used_buckets(%hash), "<=", 8, "hash should have at most 8 used buckets");
 
 
43 changes: 31 additions & 12 deletions hv.c
@@ -34,7 +34,11 @@ holds the key and hash value.
 #define PERL_HASH_INTERNAL_ACCESS
 #include "perl.h"
 
-#define DO_HSPLIT(xhv) ((xhv)->xhv_keys > (xhv)->xhv_max) /* HvTOTALKEYS(hv) > HvMAX(hv) */
+/* we split when we collide and we have a load factor over 0.667.
+ * NOTE if you change this formula so we split earlier than previously
+ * you MUST change the logic in hv_ksplit()
+ */
+#define DO_HSPLIT(xhv) ( ((xhv)->xhv_keys + ((xhv)->xhv_keys >> 1)) > (xhv)->xhv_max )
 #define HV_FILL_THRESHOLD 31
 
 static const char S_strtab_error[]
@@ -343,6 +347,7 @@ Perl_hv_common(pTHX_ HV *hv, SV *keysv, const char *key, STRLEN klen,
     HE **oentry;
     SV *sv;
     bool is_utf8;
+    bool in_collision;
     int masked_flags;
     const int return_svp = action & HV_FETCH_JUST_SV;
     HEK *keysv_hek = NULL;
@@ -835,6 +840,7 @@ Perl_hv_common(pTHX_ HV *hv, SV *keysv, const char *key, STRLEN klen,
      * making it harder to see if there is a collision. We also
      * reset the iterator randomizer if there is one.
      */
+    in_collision = *oentry != NULL;
     if ( *oentry && PL_HASH_RAND_BITS_ENABLED) {
         PL_hash_rand_bits++;
         PL_hash_rand_bits= ROTL_UV(PL_hash_rand_bits,1);
@@ -877,7 +883,7 @@ Perl_hv_common(pTHX_ HV *hv, SV *keysv, const char *key, STRLEN klen,
         HvHASKFLAGS_on(hv);
 
     xhv->xhv_keys++; /* HvTOTALKEYS(hv)++ */
-    if ( DO_HSPLIT(xhv) ) {
+    if ( in_collision && DO_HSPLIT(xhv) ) {
         const STRLEN oldsize = xhv->xhv_max + 1;
         const U32 items = (U32)HvPLACEHOLDERS_get(hv);
 
@@ -1450,29 +1456,42 @@ void
 Perl_hv_ksplit(pTHX_ HV *hv, IV newmax)
 {
     XPVHV* xhv = (XPVHV*)SvANY(hv);
-    const I32 oldsize = (I32) xhv->xhv_max+1; /* HvMAX(hv)+1 (sick) */
+    const I32 oldsize = (I32) xhv->xhv_max+1; /* HvMAX(hv)+1 */
     I32 newsize;
+    I32 wantsize;
+    I32 trysize;
     char *a;
 
     PERL_ARGS_ASSERT_HV_KSPLIT;
 
-    newsize = (I32) newmax; /* possible truncation here */
-    if (newsize != newmax || newmax <= oldsize)
+    wantsize = (I32) newmax; /* possible truncation here */
+    if (wantsize != newmax)
         return;
-    while ((newsize & (1 + ~newsize)) != newsize) {
-        newsize &= ~(newsize & (1 + ~newsize)); /* get proper power of 2 */
+
+    wantsize= wantsize + (wantsize >> 1); /* wantsize *= 1.5 */
+    if (wantsize < newmax) /* overflow detection */
+        return;
+
+    newsize = oldsize;
+    while (wantsize > newsize) {
+        trysize = newsize << 1;
+        if (trysize > newsize) {
+            newsize = trysize;
+        } else {
+            /* we overflowed */
+            return;
+        }
     }
-    if (newsize < newmax)
-        newsize *= 2;
-    if (newsize < newmax)
-        return; /* overflow detection */
+
+    if (newsize <= oldsize)
+        return; /* overflow detection */
 
     a = (char *) HvARRAY(hv);
     if (a) {
         hsplit(hv, oldsize, newsize);
     } else {
         Newxz(a, PERL_HV_ARRAY_ALLOC_BYTES(newsize), char);
-        xhv->xhv_max = --newsize;
+        xhv->xhv_max = newsize - 1;
         HvARRAY(hv) = (HE **) a;
     }
 }
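
The reworked sizing above grows the bucket count by repeated doubling
from the current size until it reaches at least newmax * 1.5, i.e.
enough buckets to hold newmax keys under the new 2/3 load factor,
bailing out if any step overflows. A self-contained sketch of that
calculation (grow_size is an illustrative helper, not part of the
patch):

    #include <stdint.h>

    /* Sketch of the new hv_ksplit sizing: double from the current
     * power-of-two size until it can hold wanted_keys at a load factor
     * of 2/3, i.e. until size >= wanted_keys * 1.5.
     * Returns 0 if either step would overflow. */
    static int32_t grow_size(int32_t oldsize, int32_t wanted_keys)
    {
        int32_t want = wanted_keys + (wanted_keys >> 1); /* keys * 1.5 */
        int32_t size = oldsize;

        if (want < wanted_keys)
            return 0;                 /* keys * 1.5 overflowed */
        while (want > size) {
            int32_t next = size << 1; /* double, as hsplit() will */
            if (next <= size)
                return 0;             /* doubling overflowed */
            size = next;
        }
        return size;
    }

For example, grow_size(8, 100) computes want = 150 and doubles
8 -> 16 -> 32 -> 64 -> 128 -> 256, so a hash presized for 100 keys gets
256 buckets.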
2 changes: 1 addition & 1 deletion t/op/coreamp.t
@@ -639,7 +639,7 @@ SKIP: {
 
 my %h = 1..2;
 &mykeys(\%h) = 1024;
-like Hash::Util::bucket_ratio(%h), qr|/1024\z|, '&mykeys = changed number of buckets allocated';
+like Hash::Util::bucket_ratio(%h), qr!/(?:1024|2048)\z!, '&mykeys = changed number of buckets allocated';
 eval { (&mykeys(\%h)) = 1025; };
 like $@, qr/^Can't modify keys in list assignment at /;
 }
21 changes: 16 additions & 5 deletions t/op/hash.t
@@ -163,7 +163,8 @@ sub torture_hash {
 my ($h2, $h3, $h4);
 while (keys %$h > 2) {
 my $take = (keys %$h) / 2 - 1;
-my @keys = (keys %$h)[0 .. $take];
+my @keys = (sort keys %$h)[0..$take];
+
 my $scalar = %$h;
 delete @$h{@keys};
 push @groups, $scalar, \@keys;
@@ -178,9 +179,19 @@
 
 # Each time this will get emptied then repopulated. If the fill isn't reset
 # when the hash is emptied, the used count will likely exceed the array
+use Devel::Peek;
 %$h3 = %$h2;
 is(join(",", sort keys %$h3),join(",",sort keys %$h2),"$desc (+$count copy) has same keys");
 my (undef, $total3) = validate_hash("$desc (+$count copy)", $h3);
-is($total3, $total2, "$desc (+$count copy) has same array size");
+# We now only split when we collide on insert AND exceed the load factor
+# when we did so. Building a hash via %x=%y means a pseudo-random key
+# order inserting into %x, and we may end up encountering a collision
+# at a different point in the load order, resulting in a possible power of
+# two difference under the current load factor expectations. If this test
+# fails then it is probably because DO_HSPLIT was changed, and this test
+# needs to be adjusted accordingly.
+ok( $total2 == $total3 || $total2*2==$total3 || $total2==$total3*2,
+    "$desc (+$count copy) array size within a power of 2 of each other");
+
 # This might use fewer buckets than the original
 %$h4 = %$h;
@@ -189,7 +200,7 @@
 }
 
 my $scalar = %$h;
-my @keys = keys %$h;
+my @keys = sort keys %$h;
 delete @$h{@keys};
 is(scalar %$h, 0, "scalar keys for empty $desc");
 
@@ -205,11 +216,11 @@
 while (@groups) {
 my $keys = pop @groups;
 ++$h->{$_} foreach @$keys;
-my (undef, $total) = validate_hash("$desc " . keys %$h, $h);
+my (undef, $total) = validate_hash($desc, $h);
 is($total, $total0, "bucket count is constant when rebuilding");
 is(scalar %$h, pop @groups, "scalar keys is identical when rebuilding");
 ++$h1->{$_} foreach @$keys;
-validate_hash("$desc copy " . keys %$h1, $h1);
+validate_hash("$desc copy", $h1);
 }
 # This will fail if the fill count isn't handled correctly on hash split
 is(scalar %$h1, scalar %$h, "scalar keys is identical on copy and original");
2 changes: 1 addition & 1 deletion t/op/sub_lval.t
@@ -557,7 +557,7 @@ SKIP: {
 sub keeze : lvalue { keys %__ }
 %__ = ("a","b");
 keeze = 64;
-is Hash::Util::bucket_ratio(%__), '1/64', 'keys assignment through lvalue sub';
+like Hash::Util::bucket_ratio(%__), qr!1/(?:64|128)!, 'keys assignment through lvalue sub';
 eval { (keeze) = 64 };
 like $@, qr/^Can't modify keys in list assignment at /,
 'list assignment to keys through lv sub is forbidden';
