More efficient implementation of integers #548

markshannon opened this issue Jan 24, 2023 · 14 comments

markshannon commented Jan 24, 2023

Moving this from the old discussion to here, as we aren't using discussions any more.

@markshannon

As mentioned in python/cpython#101265, we need to prevent deallocation of small ints.

To do that, we need to mark small ints. We could also use this mark to avoid DECREFs in code using integers: checking the mark is slightly faster than checking the refcount, as the value will already be in a register.

@markshannon

Here's a possible layout with the mark bit:

typedef struct {
    OBJECT_HEADER;
    intptr_t tagged_bits;  /* value or digit count, plus tag bits */
} PyIntObject;

typedef struct {
    PyIntObject header;
    digit digits[1];       /* variable-length magnitude, "longs" only */
} PyLongObject;

Tagged bits

| Bits | 63/31 -- 2           | 1            | 0 |
|------|----------------------|--------------|---|
| Int  | Signed value         | Immortal bit | 0 |
| Long | Unsigned digit count | Sign         | 1 |

This allows us to perform the common operations efficiently:

- C value of Int: `tagged_bits >> 2` (signed shift)
- Sign of Long: `(tagged_bits & 2) - 1`
- Digit count of Long: `tagged_bits >> 2`
- Is small Int? `(tagged_bits & 1) == 0`
- Are both Ints (for binary ops)? `((a->tagged_bits | b->tagged_bits) & 1) == 0`
- Is immortal? `(tagged_bits & 3) == 2`
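
A minimal sketch of these operations as C helpers, under the layout above (the helper names are illustrative, not CPython API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Tag layout assumed from the table above:
   Int:  bits 63..2 = signed value,         bit 1 = immortal, bit 0 = 0
   Long: bits 63..2 = unsigned digit count, bit 1 = sign,     bit 0 = 1 */

static inline bool is_small_int(intptr_t t)          { return (t & 1) == 0; }
static inline intptr_t int_value(intptr_t t)         { return t >> 2; }      /* signed (arithmetic) shift */
static inline intptr_t long_sign(intptr_t t)         { return (t & 2) - 1; } /* -1 or +1 */
static inline uintptr_t long_digit_count(intptr_t t) { return (uintptr_t)t >> 2; }
static inline bool both_small_ints(intptr_t a, intptr_t b) { return ((a | b) & 1) == 0; }
static inline bool is_immortal_int(intptr_t t)       { return (t & 3) == 2; }
```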

@markshannon

One other operation that is a bit less common, but needs consideration, is normalization of "longs" to "ints".
This differs depending on the number of bits in the word.

For efficiency, it is probably OK to allow some values to be representable in both formats, and to use a simple approximation based on the number of digits. However, not all single-digit longs fit in the tagged bits.

| Digits | 32-bit machine                       | 64-bit machine |
|--------|--------------------------------------|----------------|
| 0      | Zero                                 | Zero           |
| 1      | -2**29 to 2**29-1 fit in tagged bits | All fit        |
| 2      | Do not fit                           | All fit        |
| 3      | Do not fit                           | A few fit, but it is more efficient to reject these during normalization |
| 4+     | Do not fit                           | Do not fit     |

In other words, it is probably best not to have a single normal form for values near the threshold, to allow an efficient implementation.
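
A sketch of the conservative digit-count check this implies, assuming 30-bit digits (`fits_in_tagged_int` and the `digit` typedef here are illustrative, not CPython's):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t digit;  /* 30-bit digits stored in 32-bit words */

/* Convert a long to the tagged-int form only when the digit count
   (plus, on 32-bit, a cheap range check) guarantees that it fits. */
static inline bool
fits_in_tagged_int(intptr_t ndigits, const digit *digits)
{
#if INTPTR_MAX == INT64_MAX
    /* 64-bit: any 1- or 2-digit magnitude fits in 62 signed bits;
       reject 3-digit values even though a few would fit. */
    (void)digits;
    return ndigits <= 2;
#else
    /* 32-bit: only zero and single-digit magnitudes below 2**29 fit. */
    return ndigits == 0 || (ndigits == 1 && digits[0] < ((digit)1 << 29));
#endif
}
```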

@markshannon

Given that we don't need ridiculously large ints, we could drop a further bit to allow immortal "longs".

Tagged bits

| Bits | 63/31 -- 3           | 2                      | 1            | 0 |
|------|----------------------|------------------------|--------------|---|
| Int  | Signed value         | Signed value (low bit) | Immortal bit | 0 |
| Long | Unsigned digit count | Immortal bit           | Sign         | 1 |
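
Under that layout, a combined immortality check might look like this (a sketch; the helper name is hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Int:  bit 1 is the immortal bit (bit 0 == 0).
   Long: bit 2 is the immortal bit (bit 0 == 1). */
static inline bool
is_immortal(intptr_t tagged_bits)
{
    if (tagged_bits & 1) {
        return (tagged_bits & 4) != 0;  /* long */
    }
    return (tagged_bits & 2) != 0;      /* int */
}
```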

jneb commented Jan 28, 2023

Just a thought: would it help to tweak the bits so that you can find the sign and/or immortality without checking whether it is short or long? If you put the sign in the high bit for longs, you would get the sign check for free.
A comparison could check the sign bits first and, if they're different, be done quickly.
I think the best way to make this decision is to see what operations on int are performed in real life. Most will be addition of 1 to a small int, I'm pretty sure :-)

lpereira commented Feb 7, 2023

Maybe it's also time to reconsider dropping 15- and 30-bit digits and going with 63-bit digits only instead?

jneb commented Feb 7, 2023

(This remark was the result of me not realizing that it was about "digits" and not "numbers". Feel free to ignore.)
You're joking, right? I am certainly not the only one who uses long integers more often than not.
(Edited) I get it now: the size of the digits becomes 63 bits, not the numbers. Good idea, assuming that multiplication is doable on all platforms.

lpereira commented Feb 8, 2023

I'm not joking! I spent quite a bit of time last year looking over longobject.c for ways to make it more efficient and/or use less memory.

For instance, I don't think that 15-bit digits are that useful anymore, with 32-bit platforms becoming less relevant with every passing CPU generation; even with only a 30-bit digit implementation, the performance loss on those platforms would be mostly negligible compared to the cleanup of using a single digit width.

With time, moving from a 30-bit digit implementation to a 63-bit digit implementation should even improve performance a tad on 64-bit platforms -- by performing fewer iterations in most long operations and whatnot. It shouldn't increase memory usage significantly either; in most cases it should remain the same (e.g., 32 30-bit digits and 16 63-bit digits both occupy 128 bytes and hold a roughly 1000-bit value).

If moving to a 63-bit digit, we would not have something akin to twodigits or stwodigits, though, as there's no standard 128-bit integer type in C11; this would change how carries and borrows are calculated, but for the better: functions like __builtin_add_overflow() (and similar) that leverage the CPU's overflow/carry flags could be used instead, reducing some masking/shifting in the fast path of most operations.

Even if we decide that it's not yet time to get rid of 16- and 32-bit math in long integers, it might be an interesting exercise to at least remove the 16-bit math and keep only the 32-bit math. This would allow, for instance, using intrinsics such as _addcarry_u32() and _subborrow_u32() on x86 platforms, and avoid shifting/masking. The throughput/latency of a single ADC/SBB is better than that of an AND and a SHR combined, especially since there wouldn't be a dependency to stall the pipeline. Since we currently avoid using the full width of a digit in order to detect overflows easily (and portably), this could also mean that full 32-bit digits become possible, potentially slightly reducing memory usage as well.

Of course, this goes a bit into non-portable territory, but implementations of these functions can be written either in assembly for some major CPU arches (e.g. aarch64), or in portable C as a fallback.
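
To illustrate the carry handling described above, here is a minimal sketch (an assumption, not CPython code) of adding two n-digit magnitudes with hypothetical full-width 64-bit digits, using the GCC/Clang builtin __builtin_add_overflow instead of mask-and-shift carry extraction:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t bigdigit;  /* hypothetical full-width digit */

/* out = a + b over n digits; out may alias a or b.
   Returns the final carry (0 or 1). */
static bigdigit
bigdigit_add_n(bigdigit *out, const bigdigit *a, const bigdigit *b, size_t n)
{
    bigdigit carry = 0;
    for (size_t i = 0; i < n; i++) {
        bigdigit sum;
        bigdigit c1 = __builtin_add_overflow(a[i], carry, &sum);  /* carry in */
        bigdigit c2 = __builtin_add_overflow(sum, b[i], &sum);    /* digit add */
        out[i] = sum;
        carry = c1 | c2;  /* at most one of c1/c2 can be set */
    }
    return carry;
}
```

Compilers can often lower this loop to an ADC chain, which is the throughput/latency win described above.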

@gvanrossum

(Maybe @jneb thought you were proposing to limit int size to 63 bits? If someone proposed this I'd also assume they were joking. :-)

lpereira commented Feb 8, 2023

> (Maybe @jneb thought you were proposing to limit int size to 63 bits? If someone proposed this I'd also assume they were joking. :-)

That makes a whole lot more sense, and I agree :D

@markshannon

Speeding up arithmetic

Python int arithmetic is very slow. We have to do a lot of work checking the size and shape of ints before adding them, and then yet more work boxing the result.

We want to minimize the amount of work we do. Many arithmetic expressions consume temporary values, which we can reuse if we discard "small ints" and use information that the compiler can gather to determine whether an operation has one of the forms:

- `temporary_value + ...`
- `var += ...` or `var = var + ...`

In the first case, the refcount of `temporary_value` is likely to be 1, and we can reuse the object.
In the second case, the refcount of `var` is likely to be 2 (the variable plus the operand on the stack), and we can reuse the object.

The above is impossible with "small ints", because if the result is a small int we must use the immortal small int object and free the temporary.

We can still have small ints. They are useful in lots of places, but we need to get rid of the requirement that we must use them, i.e. that `(10 + 10) is 20`.

What the specialized form would look like

We want the specialization of `+` for ints, BINARY_OP_ADD_INT_REUSE_LEFT, to be fully inlined and to do no allocation when the lhs has the expected refcount. Something like:

        inst(BINARY_OP_ADD_INT_REUSE_LEFT, (unused/1, left, right -- sum)) {
            assert(cframe.use_tracing == 0);
            DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
            DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
            DEOPT_IF(Py_REFCNT(left) != 1, BINARY_OP); /* Or two for the x += ... case */
            DEOPT_IF((left->tagged_bits | right->tagged_bits) & 3, BINARY_OP); /* Both are mortal ints, not longs */
            DEOPT_IF(add_overflows(left->tagged_bits, right->tagged_bits), BINARY_OP);
            STAT_INC(BINARY_OP, hit);
            left->tagged_bits = left->tagged_bits + right->tagged_bits; /* low tag bits are 00 in both, so they stay 00 */
            sum = left;
            _Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
        }
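
The add_overflows check above is left undefined; a minimal sketch of how it might be written (an assumption, not existing CPython code), given that both operands are tagged ints with their low two bits clear:

```c
#include <stdbool.h>
#include <stdint.h>

/* The values are already shifted left by 2, so the sum overflows the
   tagged representation exactly when the intptr_t addition overflows. */
static inline bool
add_overflows(intptr_t a, intptr_t b)
{
    intptr_t result;
    return __builtin_add_overflow(a, b, &result);  /* GCC/Clang builtin */
}
```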

We attempted this before, but the code gets messy and very branchy, handling all the special cases arising from small ints and refcounts.

We will want a bit of help from the compiler to mark which binary ops have the form `x += ...` or `x = x + ...`; that way we will know whether to check for a refcount of 1 or 2, both at specialization time and at runtime.

Prototyping on floats

We can implement the reference-counting specializations first for floats, which don't have to worry about overflow or immortal values. This should tell us whether we can reuse objects often enough for this sort of refcounting optimization to be beneficial.
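
A sketch of what the float version might look like (hypothetical; it mirrors the int instruction above, minus the overflow and immortality checks):

```c
        inst(BINARY_OP_ADD_FLOAT_REUSE_LEFT, (unused/1, left, right -- sum)) {
            DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
            DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
            DEOPT_IF(Py_REFCNT(left) != 1, BINARY_OP); /* or 2 for the x += ... case */
            STAT_INC(BINARY_OP, hit);
            ((PyFloatObject *)left)->ob_fval += PyFloat_AS_DOUBLE(right);
            sum = left;
            _Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
        }
```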

markshannon self-assigned this on Mar 8, 2023
@markshannon

CPython issue: python/cpython#101291

@cdimauro

For a slightly better representation of longs, you might consider this: https://ep2011.europython.eu/conference/talks/hacking-pylongobject-on-python-32.html
The idea was/is to have the common "small" ints represented as usual, in two's complement (with further optimizations possible in the future, given this new representation).
The suggestion was to make 30 bits the standard digit size (which was done recently in CPython) and to consider 60 bits for 64-bit machines.

The above work can be combined with the introduction of "direct" ints, which I made in WPython 1.1: https://code.google.com/archive/p/wpython2/downloads
WPython 1.1 already addressed the removal of some DECREFs.

@igorgatis

For the sake of newcomers (like me), what is the context of immortal ints? Are they known in advance (e.g. at compile time)?

Meanwhile, I was wondering whether immortality (for any object) could be stored outside of the integer bits, keeping ints as close to the C implementation as possible.

I see two ways to store immortality status outside of the object:

  1. Using the "unused" lowest bits of the PyObject* pointer. Given the size of PyObject, perhaps we can guarantee that it will always be at least 2-byte aligned, sparing the lowest bit, where 0 means mortal and 1 means immortal. The major downside is that we would need to mask the pointer before dereferencing its fields.
  2. Using a dedicated memory region for immortal objects. This approach is safer than the one above, and the check is as simple as MIN_IMMORTAL_ADDR <= addr < MAX_IMMORTAL_ADDR (see the sketch below). But it really depends on the nature of those immortal objects.

Perhaps the pointer-parity trick could be used for the sign of long integers as well.

Does it make sense?
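
A sketch of the address-range check from option 2 (the arena bounds and helper name are hypothetical; nothing like this exists in CPython):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds of a region reserved for immortal objects,
   assuming all immortal objects are allocated from one arena. */
extern uintptr_t _Py_immortal_arena_start;
extern uintptr_t _Py_immortal_arena_end;

static inline bool
is_immortal_object(const void *op)
{
    uintptr_t addr = (uintptr_t)op;
    return _Py_immortal_arena_start <= addr && addr < _Py_immortal_arena_end;
}
```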
