Skip to content

Commit dc787d2

Browse files
committed
Issue #8188: Introduce a new scheme for computing hashes of numbers
(instances of int, float, complex, decimal.Decimal and fractions.Fraction) that makes it easy to maintain the invariant that hash(x) == hash(y) whenever x and y have equal value.
1 parent 0372113 commit dc787d2

14 files changed

+569
-140
lines changed

Doc/library/stdtypes.rst

+103
Original file line numberDiff line numberDiff line change
@@ -595,6 +595,109 @@ hexadecimal string representing the same number::
595595
'0x1.d380000000000p+11'
596596

597597

598+
.. _numeric-hash:
599+
600+
Hashing of numeric types
601+
------------------------
602+
603+
For numbers ``x`` and ``y``, possibly of different types, it's a requirement
604+
that ``hash(x) == hash(y)`` whenever ``x == y`` (see the :meth:`__hash__`
605+
method documentation for more details). For ease of implementation and
606+
efficiency across a variety of numeric types (including :class:`int`,
607+
:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`)
608+
Python's hash for numeric types is based on a single mathematical function
609+
that's defined for any rational number, and hence applies to all instances of
610+
:class:`int` and :class:`fraction.Fraction`, and all finite instances of
611+
:class:`float` and :class:`decimal.Decimal`. Essentially, this function is
612+
given by reduction modulo ``P`` for a fixed prime ``P``. The value of ``P`` is
613+
made available to Python as the :attr:`modulus` attribute of
614+
:data:`sys.hash_info`.
615+
616+
.. impl-detail::
617+
618+
Currently, the prime used is ``P = 2**31 - 1`` on machines with 32-bit C
619+
longs and ``P = 2**61 - 1`` on machines with 64-bit C longs.
620+
621+
Here are the rules in detail:
622+
623+
- If ``x = m / n`` is a nonnegative rational number and ``n`` is not divisible
624+
by ``P``, define ``hash(x)`` as ``m * invmod(n, P) % P``, where ``invmod(n,
625+
P)`` gives the inverse of ``n`` modulo ``P``.
626+
627+
- If ``x = m / n`` is a nonnegative rational number and ``n`` is
628+
divisible by ``P`` (but ``m`` is not) then ``n`` has no inverse
629+
modulo ``P`` and the rule above doesn't apply; in this case define
630+
``hash(x)`` to be the constant value ``sys.hash_info.inf``.
631+
632+
- If ``x = m / n`` is a negative rational number define ``hash(x)``
633+
as ``-hash(-x)``. If the resulting hash is ``-1``, replace it with
634+
``-2``.
635+
636+
- The particular values ``sys.hash_info.inf``, ``-sys.hash_info.inf``
637+
and ``sys.hash_info.nan`` are used as hash values for positive
638+
infinity, negative infinity, or nans (respectively). (All hashable
639+
nans have the same hash value.)
640+
641+
- For a :class:`complex` number ``z``, the hash values of the real
642+
and imaginary parts are combined by computing ``hash(z.real) +
643+
sys.hash_info.imag * hash(z.imag)``, reduced modulo
644+
``2**sys.hash_info.width`` so that it lies in
645+
``range(-2**(sys.hash_info.width - 1), 2**(sys.hash_info.width -
646+
1))``. Again, if the result is ``-1``, it's replaced with ``-2``.
647+
648+
649+
To clarify the above rules, here's some example Python code,
650+
equivalent to the builtin hash, for computing the hash of a rational
651+
number, :class:`float`, or :class:`complex`::
652+
653+
654+
import sys, math
655+
656+
def hash_fraction(m, n):
657+
"""Compute the hash of a rational number m / n.
658+
659+
Assumes m and n are integers, with n positive.
660+
Equivalent to hash(fractions.Fraction(m, n)).
661+
662+
"""
663+
P = sys.hash_info.modulus
664+
# Remove common factors of P. (Unnecessary if m and n already coprime.)
665+
while m % P == n % P == 0:
666+
m, n = m // P, n // P
667+
668+
if n % P == 0:
669+
hash_ = sys.hash_info.inf
670+
else:
671+
# Fermat's Little Theorem: pow(n, P-1, P) is 1, so
672+
# pow(n, P-2, P) gives the inverse of n modulo P.
673+
hash_ = (abs(m) % P) * pow(n, P - 2, P) % P
674+
if m < 0:
675+
hash_ = -hash_
676+
if hash_ == -1:
677+
hash_ = -2
678+
return hash_
679+
680+
def hash_float(x):
681+
"""Compute the hash of a float x."""
682+
683+
if math.isnan(x):
684+
return sys.hash_info.nan
685+
elif math.isinf(x):
686+
return sys.hash_info.inf if x > 0 else -sys.hash_info.inf
687+
else:
688+
return hash_fraction(*x.as_integer_ratio())
689+
690+
def hash_complex(z):
691+
"""Compute the hash of a complex number z."""
692+
693+
hash_ = hash_float(z.real) + sys.hash_info.imag * hash_float(z.imag)
694+
# do a signed reduction modulo 2**sys.hash_info.width
695+
M = 2**(sys.hash_info.width - 1)
696+
hash_ = (hash_ & (M - 1)) - (hash & M)
697+
if hash_ == -1:
698+
hash_ == -2
699+
return hash_
700+
598701
.. _typeiter:
599702

600703
Iterator Types

Doc/library/sys.rst

+24
Original file line numberDiff line numberDiff line change
@@ -446,6 +446,30 @@ always available.
446446
Changed to a named tuple and added *service_pack_minor*,
447447
*service_pack_major*, *suite_mask*, and *product_type*.
448448

449+
450+
.. data:: hash_info
451+
452+
A structseq giving parameters of the numeric hash implementation. For
453+
more details about hashing of numeric types, see :ref:`numeric-hash`.
454+
455+
+---------------------+--------------------------------------------------+
456+
| attribute | explanation |
457+
+=====================+==================================================+
458+
| :const:`width` | width in bits used for hash values |
459+
+---------------------+--------------------------------------------------+
460+
| :const:`modulus` | prime modulus P used for numeric hash scheme |
461+
+---------------------+--------------------------------------------------+
462+
| :const:`inf` | hash value returned for a positive infinity |
463+
+---------------------+--------------------------------------------------+
464+
| :const:`nan` | hash value returned for a nan |
465+
+---------------------+--------------------------------------------------+
466+
| :const:`imag` | multiplier used for the imaginary part of a |
467+
| | complex number |
468+
+---------------------+--------------------------------------------------+
469+
470+
.. versionadded:: 3.2
471+
472+
449473
.. data:: hexversion
450474

451475
The version number encoded as a single integer. This is guaranteed to increase

Include/pyport.h

+14
Original file line numberDiff line numberDiff line change
@@ -126,6 +126,20 @@ Used in: PY_LONG_LONG
126126
#endif
127127
#endif
128128

129+
/* Parameters used for the numeric hash implementation. See notes for
130+
_PyHash_Double in Objects/object.c. Numeric hashes are based on
131+
reduction modulo the prime 2**_PyHASH_BITS - 1. */
132+
133+
#if SIZEOF_LONG >= 8
134+
#define _PyHASH_BITS 61
135+
#else
136+
#define _PyHASH_BITS 31
137+
#endif
138+
#define _PyHASH_MODULUS ((1UL << _PyHASH_BITS) - 1)
139+
#define _PyHASH_INF 314159
140+
#define _PyHASH_NAN 0
141+
#define _PyHASH_IMAG 1000003UL
142+
129143
/* uintptr_t is the C9X name for an unsigned integral type such that a
130144
* legitimate void* can be cast to uintptr_t and then back to void* again
131145
* without loss of information. Similarly for intptr_t, wrt a signed

Lib/decimal.py

+32-48
Original file line numberDiff line numberDiff line change
@@ -862,15 +862,15 @@ def _cmp(self, other):
862862
# that specified by IEEE 754.
863863

864864
def __eq__(self, other, context=None):
865-
other = _convert_other(other, allow_float=True)
865+
other = _convert_other(other, allow_float = True)
866866
if other is NotImplemented:
867867
return other
868868
if self._check_nans(other, context):
869869
return False
870870
return self._cmp(other) == 0
871871

872872
def __ne__(self, other, context=None):
873-
other = _convert_other(other, allow_float=True)
873+
other = _convert_other(other, allow_float = True)
874874
if other is NotImplemented:
875875
return other
876876
if self._check_nans(other, context):
@@ -879,7 +879,7 @@ def __ne__(self, other, context=None):
879879

880880

881881
def __lt__(self, other, context=None):
882-
other = _convert_other(other, allow_float=True)
882+
other = _convert_other(other, allow_float = True)
883883
if other is NotImplemented:
884884
return other
885885
ans = self._compare_check_nans(other, context)
@@ -888,7 +888,7 @@ def __lt__(self, other, context=None):
888888
return self._cmp(other) < 0
889889

890890
def __le__(self, other, context=None):
891-
other = _convert_other(other, allow_float=True)
891+
other = _convert_other(other, allow_float = True)
892892
if other is NotImplemented:
893893
return other
894894
ans = self._compare_check_nans(other, context)
@@ -897,7 +897,7 @@ def __le__(self, other, context=None):
897897
return self._cmp(other) <= 0
898898

899899
def __gt__(self, other, context=None):
900-
other = _convert_other(other, allow_float=True)
900+
other = _convert_other(other, allow_float = True)
901901
if other is NotImplemented:
902902
return other
903903
ans = self._compare_check_nans(other, context)
@@ -906,7 +906,7 @@ def __gt__(self, other, context=None):
906906
return self._cmp(other) > 0
907907

908908
def __ge__(self, other, context=None):
909-
other = _convert_other(other, allow_float=True)
909+
other = _convert_other(other, allow_float = True)
910910
if other is NotImplemented:
911911
return other
912912
ans = self._compare_check_nans(other, context)
@@ -935,55 +935,28 @@ def compare(self, other, context=None):
935935

936936
def __hash__(self):
937937
"""x.__hash__() <==> hash(x)"""
938-
# Decimal integers must hash the same as the ints
939-
#
940-
# The hash of a nonspecial noninteger Decimal must depend only
941-
# on the value of that Decimal, and not on its representation.
942-
# For example: hash(Decimal('100E-1')) == hash(Decimal('10')).
943-
944-
# Equality comparisons involving signaling nans can raise an
945-
# exception; since equality checks are implicitly and
946-
# unpredictably used when checking set and dict membership, we
947-
# prevent signaling nans from being used as set elements or
948-
# dict keys by making __hash__ raise an exception.
938+
939+
# In order to make sure that the hash of a Decimal instance
940+
# agrees with the hash of a numerically equal integer, float
941+
# or Fraction, we follow the rules for numeric hashes outlined
942+
# in the documentation. (See library docs, 'Built-in Types').
949943
if self._is_special:
950944
if self.is_snan():
951945
raise TypeError('Cannot hash a signaling NaN value.')
952946
elif self.is_nan():
953-
# 0 to match hash(float('nan'))
954-
return 0
947+
return _PyHASH_NAN
955948
else:
956-
# values chosen to match hash(float('inf')) and
957-
# hash(float('-inf')).
958949
if self._sign:
959-
return -271828
950+
return -_PyHASH_INF
960951
else:
961-
return 314159
962-
963-
# In Python 2.7, we're allowing comparisons (but not
964-
# arithmetic operations) between floats and Decimals; so if
965-
# a Decimal instance is exactly representable as a float then
966-
# its hash should match that of the float.
967-
self_as_float = float(self)
968-
if Decimal.from_float(self_as_float) == self:
969-
return hash(self_as_float)
970-
971-
if self._isinteger():
972-
op = _WorkRep(self.to_integral_value())
973-
# to make computation feasible for Decimals with large
974-
# exponent, we use the fact that hash(n) == hash(m) for
975-
# any two nonzero integers n and m such that (i) n and m
976-
# have the same sign, and (ii) n is congruent to m modulo
977-
# 2**64-1. So we can replace hash((-1)**s*c*10**e) with
978-
# hash((-1)**s*c*pow(10, e, 2**64-1).
979-
return hash((-1)**op.sign*op.int*pow(10, op.exp, 2**64-1))
980-
# The value of a nonzero nonspecial Decimal instance is
981-
# faithfully represented by the triple consisting of its sign,
982-
# its adjusted exponent, and its coefficient with trailing
983-
# zeros removed.
984-
return hash((self._sign,
985-
self._exp+len(self._int),
986-
self._int.rstrip('0')))
952+
return _PyHASH_INF
953+
954+
if self._exp >= 0:
955+
exp_hash = pow(10, self._exp, _PyHASH_MODULUS)
956+
else:
957+
exp_hash = pow(_PyHASH_10INV, -self._exp, _PyHASH_MODULUS)
958+
hash_ = int(self._int) * exp_hash % _PyHASH_MODULUS
959+
return hash_ if self >= 0 else -hash_
987960

988961
def as_tuple(self):
989962
"""Represents the number as a triple tuple.
@@ -6218,6 +6191,17 @@ def _format_number(is_negative, intpart, fracpart, exp, spec):
62186191
# _SignedInfinity[sign] is infinity w/ that sign
62196192
_SignedInfinity = (_Infinity, _NegativeInfinity)
62206193

6194+
# Constants related to the hash implementation; hash(x) is based
6195+
# on the reduction of x modulo _PyHASH_MODULUS
6196+
import sys
6197+
_PyHASH_MODULUS = sys.hash_info.modulus
6198+
# hash values to use for positive and negative infinities, and nans
6199+
_PyHASH_INF = sys.hash_info.inf
6200+
_PyHASH_NAN = sys.hash_info.nan
6201+
del sys
6202+
6203+
# _PyHASH_10INV is the inverse of 10 modulo the prime _PyHASH_MODULUS
6204+
_PyHASH_10INV = pow(10, _PyHASH_MODULUS - 2, _PyHASH_MODULUS)
62216205

62226206

62236207
if __name__ == '__main__':

Lib/fractions.py

+22-9
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
import numbers
99
import operator
1010
import re
11+
import sys
1112

1213
__all__ = ['Fraction', 'gcd']
1314

@@ -23,6 +24,12 @@ def gcd(a, b):
2324
a, b = b, a%b
2425
return a
2526

27+
# Constants related to the hash implementation; hash(x) is based
28+
# on the reduction of x modulo the prime _PyHASH_MODULUS.
29+
_PyHASH_MODULUS = sys.hash_info.modulus
30+
# Value to be used for rationals that reduce to infinity modulo
31+
# _PyHASH_MODULUS.
32+
_PyHASH_INF = sys.hash_info.inf
2633

2734
_RATIONAL_FORMAT = re.compile(r"""
2835
\A\s* # optional whitespace at the start, then
@@ -528,16 +535,22 @@ def __hash__(self):
528535
529536
"""
530537
# XXX since this method is expensive, consider caching the result
531-
if self._denominator == 1:
532-
# Get integers right.
533-
return hash(self._numerator)
534-
# Expensive check, but definitely correct.
535-
if self == float(self):
536-
return hash(float(self))
538+
539+
# In order to make sure that the hash of a Fraction agrees
540+
# with the hash of a numerically equal integer, float or
541+
# Decimal instance, we follow the rules for numeric hashes
542+
# outlined in the documentation. (See library docs, 'Built-in
543+
# Types').
544+
545+
# dinv is the inverse of self._denominator modulo the prime
546+
# _PyHASH_MODULUS, or 0 if self._denominator is divisible by
547+
# _PyHASH_MODULUS.
548+
dinv = pow(self._denominator, _PyHASH_MODULUS - 2, _PyHASH_MODULUS)
549+
if not dinv:
550+
hash_ = _PyHASH_INF
537551
else:
538-
# Use tuple's hash to avoid a high collision rate on
539-
# simple fractions.
540-
return hash((self._numerator, self._denominator))
552+
hash_ = abs(self._numerator) * dinv % _PyHASH_MODULUS
553+
return hash_ if self >= 0 else -hash_
541554

542555
def __eq__(a, b):
543556
"""a == b"""

Lib/test/test_float.py

-9
Original file line numberDiff line numberDiff line change
@@ -914,15 +914,6 @@ def notest_float_inf(self):
914914
self.assertFalse(NAN.is_inf())
915915
self.assertFalse((0.).is_inf())
916916

917-
def test_hash_inf(self):
918-
# the actual values here should be regarded as an
919-
# implementation detail, but they need to be
920-
# identical to those used in the Decimal module.
921-
self.assertEqual(hash(float('inf')), 314159)
922-
self.assertEqual(hash(float('-inf')), -271828)
923-
self.assertEqual(hash(float('nan')), 0)
924-
925-
926917
fromHex = float.fromhex
927918
toHex = float.hex
928919
class HexFloatTestCase(unittest.TestCase):

0 commit comments

Comments
 (0)