Optimize serialization format for 2 bytes ints #20120

ilevkivskyi · 2025-10-26T00:46:24Z

This is important for serialized ASTs (think line numbers for every node). Also in this PR:

Re-use same integer logic for str/bytes length (it is slightly less optimal, but code re-use is good).
Remove unused field from Buffer type.
Make format the same on 32-bit and 64-bit platforms (we still assume little-endian platform).

JukkaL

Looks good overall, just some minor comments. This will help AST serialization a lot, in particular.

JukkaL · 2025-10-30T15:42:45Z

mypyc/lib-rt/librt_internal.c

+    if (likely(first != LONG_INT_TRAILER)) {
+        return _read_short_int(data, first);
    }
    // People who have literal ints not fitting in size_t should be punished :-)


This comment is out of date, since now anything not fitting in 29 bits triggers this path.

JukkaL · 2025-10-30T15:44:40Z

mypyc/lib-rt/librt_internal.c

+    one byte: last bit 0, 7 bits used
+    two bytes: last two bits 01, 14 bits used
+    four bytes: last three bits 011, 29 bits used
+    everything else: 00000111 followed by serialized string representation


What about using ...01111 (four 1 bits) for the final option, as this would leave ...0111 available for a possible 60-bit format in the future. (I don't have a strong opinion.)

OK. TBH I doubt we will ever need this, but it is very easy to make this change now.

JukkaL · 2025-10-30T15:47:08Z

mypyc/lib-rt/librt_internal.c

    return read_str_internal(data);
 }

+static inline char


Add comment explaining that this assumes that real_value is within allowed range (29 bits).

JukkaL · 2025-10-30T15:49:46Z

mypyc/lib-rt/librt_internal.c

+    _CHECK_READ(data, 3, CPY_INT_TAG)
+    // TODO: check if compilers emit optimal code for these two reads, and tweak if needed.
+    second = _READ(data, uint8_t)
+    two_more = _READ(data, uint16_t)


Since this code path will be quite rare, we could also read one byte at a time without any real performance impact. This would make this work the same on little and big endian systems.

This will not help much supporting big-endian platforms as we always write short integers (i.e. 1, 2, or 4 bytes) as a single value. And changing this for e.g. 2 bytes will probably have some performance impact. I think we can safely postpone this until people will actually ask for big-endian support. And even then I would prefer to use some #ifdef way of supporting big-endian platforms.

Btw this makes me think that right now this can produce some garbage on big-endian platforms without failing, which is not good. Also I found that there is #if PY_BIG_ENDIAN thing that we may use here.

After all I decided to add the big-endian support, it is not that hard. Essentially the idea is that I simply conditionally add byte reverse whenever I read/write anything longer than 1 byte. I know this is not optimal for big-endian but it is conceptually easy to reason about.

I will also make a separate PR to handle floats more robustly.

JukkaL · 2025-10-30T15:52:22Z

mypyc/lib-rt/librt_internal.c

+#define MAX_ONE_BYTE_INT 117  // 2 ** 7 - 1 - 10
+#define MIN_TWO_BYTES_INT -100
+#define MAX_TWO_BYTES_INT 16283  // 2 ** (8 + 6) - 1 - 100
+#define MIN_FOUR_BYTES_INT -10000


I think we could extend the negative range further here (e.g. make it symmetric with the positive range), since whether the upper bound is ~512M or ~256M probably won't make much difference in practice.

FWIW this will not really simplify code (and will actually make it less consistent). I would prefer to keep this as is, unless there is a reason to use 4 bytes storage for large negative integers.

JukkaL · 2025-10-30T15:54:34Z

mypyc/test-data/run-classes.test

        b = Buffer(b.getvalue())
        assert read_int(b) == i
-    for i in (-12345, -12344, -11, 118, 12344, 12345):
+    for i in (-100, -11, 118, 12344, 16283):


Test also each power of two up to, say, 200, just in case (positive and negative)? Maybe test both n**2 (lower bits all zero) and n**2 - 1 (lower bits all ones).

ilevkivskyi added 2 commits October 26, 2025 01:35

Optimize serialization format for 2 bytes ints

e0e945c

Remove extra whitespace in tests

d0d5291

ilevkivskyi requested a review from JukkaL October 26, 2025 00:46

ilevkivskyi mentioned this pull request Oct 30, 2025

Planned work related to fixed-format serialization #20072

Open

5 tasks

JukkaL reviewed Oct 30, 2025

View reviewed changes

ilevkivskyi added 3 commits October 30, 2025 23:12

Merge remote-tracking branch 'upstream/master' into 2-bytes-int

45cfe34

Address CR

e7fc76f

Try adding big endian support

d3cf915

Uh oh!

Optimize serialization format for 2 bytes ints #20120

Are you sure you want to change the base?

Optimize serialization format for 2 bytes ints #20120

Conversation

ilevkivskyi commented Oct 26, 2025

Uh oh!

JukkaL left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants