How are bytecode strings used? #538

We've been talking recently about how we might want to change the format of our instructions in different ways (variable-length instructions, wider opargs, compression of serialized forms, etc.). I think it's useful to consider all of the different forms that bytecode takes throughout a typical Python process when discussing these ideas.

The lifecycle of a string of bytecode (opcodes, opargs, and caches) currently looks something like this:

```mermaid
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
FROZEN[frozen bytes]
PYC[.pyc file]
DEEPFROZEN[deep-frozen code tail]
HEAP["executable code tail (heap)"]
STATIC["executable code tail (static)"]

style RAW fill:blue
style MARSHALLED fill:blue
style FROZEN fill:blue
style PYC fill:blue
style DEEPFROZEN fill:blue
style STATIC fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW -----> |disassemble| USER

MARSHALLED --> |cache| PYC
PYC --> |import| MARSHALLED

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |freeze| FROZEN
FROZEN -.-> |deepfreeze| DEEPFROZEN
DEEPFROZEN --> |quicken| STATIC
STATIC --> |unquicken| DEEPFROZEN
STATIC --> |copy + unquicken| RAW

FROZEN --> |import| MARSHALLED

RAW -----> |copy + quicken| HEAP
HEAP -----> |copy + unquicken| RAW
```

The boxes in red are quickened forms, while the boxes in blue are unquickened forms. Quickening (`_PyCode_Quicken`) currently initializes adaptive counters and inserts superinstructions. Unquickening (`deopt_code`) removes superinstructions, converts other instructions back to their adaptive form, and zeroes out all caches (including counters).
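
Several of the blue-side arrows can be observed from Python with only the standard library. The following is a minimal sketch rather than a description of the internals: `compile`, `marshal.dumps`/`marshal.loads`, and `dis.get_instructions` are real APIs, while the function `hot` and the loop count are arbitrary choices made for illustration. It relies on the behaviour described above, namely that `co_code` is an unquickened copy, so its value should not change when the executable form is specialized in place.

```python
import dis
import marshal

# compiler -> raw bytes: co_code exposes the bytecode as a bytes object.
source = "def hot(a, b):\n    return a + b\n"
module_code = compile(source, "<example>", "exec")
namespace = {}
exec(module_code, namespace)
hot = namespace["hot"]

raw_before = hot.__code__.co_code

# Run the function enough times for an adaptive interpreter to quicken
# and specialize the executable copy of the bytecode.
for _ in range(10_000):
    hot(1, 2)

# The raw bytes are a (copy + unquicken) of the executable form, so they
# should be unaffected by whatever happened during execution.
raw_after = hot.__code__.co_code

# raw bytes -> marshalled bytes -> raw bytes round trip.
marshalled = marshal.dumps(hot.__code__)
round_tripped = marshal.loads(marshalled)

# raw bytes -> user: disassemble the unquickened form.
opnames = [instruction.opname for instruction in dis.get_instructions(hot)]
```

Comparing `raw_before` with `raw_after` and with `round_tripped.co_code` is a quick way to check that nothing from the red boxes leaks into the blue ones.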

Let's remove frozen and cached modules, for simplicity (they're basically just marshalled bytes):

```mermaid
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
DEEPFROZEN[deep-frozen code tail]
HEAP["executable code tail (heap)"]
STATIC["executable code tail (static)"]

style RAW fill:blue
style MARSHALLED fill:blue
style DEEPFROZEN fill:blue
style STATIC fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ----> |disassemble| USER

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |deepfreeze| DEEPFROZEN
DEEPFROZEN --> |quicken| STATIC
STATIC --> |unquicken| DEEPFROZEN
STATIC --> |copy + unquicken| RAW

RAW ----> |copy + quicken| HEAP
HEAP ----> |copy + unquicken| RAW
```

Some observations:

1. It would simplify things a lot (especially deepfreeze) if we didn't have a concept of "quickening" or "unquickening". Perhaps a more useful model would be the ability to "reset" code to its initial quickened form, for consumers of `co_code` and finalization of deepfrozen code objects. This means that superinstructions and non-zero counters would be present in `co_code`, but no specialized instructions or other populated caches. If we do this, we only have one idempotent transformation that can be applied to the bytecode, and what we currently call "quickening" can be entirely encapsulated in the compiler, where it belongs (not even `marshal` or code objects need to understand it). If so, the new graph would be roughly:

```mermaid
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP["executable code tail (heap)"]
STATIC["executable code tail (static)"]

style RAW fill:red
style MARSHALLED fill:red
style STATIC fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ---> |disassemble| USER

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |deepfreeze| STATIC
STATIC --> |reset| STATIC
STATIC --> |copy + reset| RAW

RAW ---> |copy + reset| HEAP
HEAP ---> |copy + reset| RAW
```

At this point, there's not really any difference between static and heap code (we just need to reset static code at finalization):

```mermaid
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP[executable code tail]

style RAW fill:red
style MARSHALLED fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ---> |disassemble| USER

MARSHALLED --> |unmarshal| RAW
RAW --> |marshal| MARSHALLED

MARSHALLED -.-> |deepfreeze| HEAP
HEAP --> |reset| HEAP

RAW --> |copy + reset| HEAP
HEAP --> |copy + reset| RAW
```
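
The "reset" operation from observation 1 can be modelled in a few lines. This is a toy sketch under stated assumptions, not CPython's implementation: the `Instruction` class, the `ADAPTIVE_FORM` table, the opcode names, and the `INITIAL_COUNTER` value are all hypothetical and exist only to show the intended property, that reset keeps superinstructions and warm counters, discards specializations and other cache contents, and is idempotent.

```python
from dataclasses import dataclass, replace

# Hypothetical opcode names, chosen only for illustration.
# Maps each specialized instruction to the adaptive instruction it came from.
ADAPTIVE_FORM = {
    "BINARY_OP_ADD_INT": "BINARY_OP",
    "LOAD_ATTR_SLOT": "LOAD_ATTR",
}
INITIAL_COUNTER = 53  # arbitrary non-zero warm-up value for this toy model


@dataclass(frozen=True)
class Instruction:
    opname: str
    oparg: int = 0
    # caches[0] is the adaptive counter; later entries hold specialization data.
    caches: tuple = ()


def reset(code):
    """Return the bytecode in its initial, compiler-emitted form."""
    result = []
    for instruction in code:
        # Specialized instructions go back to their adaptive form;
        # everything else (including superinstructions) is left alone.
        opname = ADAPTIVE_FORM.get(instruction.opname, instruction.opname)
        caches = instruction.caches
        if caches:
            # Restore the counter and clear everything the specializer wrote.
            caches = (INITIAL_COUNTER,) + (0,) * (len(caches) - 1)
        result.append(replace(instruction, opname=opname, caches=caches))
    return tuple(result)


# What the compiler would emit: superinstruction and warm counters included.
compiled = (
    Instruction("LOAD_FAST__LOAD_FAST", 0),
    Instruction("LOAD_FAST", 1),
    Instruction("BINARY_OP", 0, caches=(INITIAL_COUNTER,)),
    Instruction("LOAD_ATTR", 2, caches=(INITIAL_COUNTER, 0, 0)),
)

# What the interpreter might have turned it into after running for a while.
specialized = (
    compiled[0],
    compiled[1],
    Instruction("BINARY_OP_ADD_INT", 0, caches=(0,)),
    Instruction("LOAD_ATTR_SLOT", 2, caches=(7, 1234, 16)),
)
```

In this model `reset(specialized)` equals `compiled`, and applying `reset` to its own output changes nothing, so consumers of `co_code`, `marshal`, and finalization of static code could all share the same single operation.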

2. While it's an open question whether `marshal` should have intimate knowledge of the bytecode format for compression purposes, it's certainly desirable to at least marshal the bytecode directly in and out of the code object's tail (and not through an intermediate `bytes` object):

```mermaid
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP[executable code tail]

style RAW fill:red
style MARSHALLED fill:red
style HEAP fill:red

COMPILER --> |compile| RAW
RAW ---> |disassemble| USER

MARSHALLED --> |unmarshal| HEAP
HEAP --> |marshal + reset| MARSHALLED

MARSHALLED -.-> |deepfreeze| HEAP
HEAP --> |reset| HEAP

RAW --> |copy + reset| HEAP
HEAP --> |copy + reset| RAW
```
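
As a rough Python analogy for reading and writing the tail directly, the sketch below fills a preallocated buffer in place instead of first materializing a `bytes` object and then copying it. Everything here is hypothetical: `ToyCode`, `toy_marshal`, `toy_unmarshal`, and the length-prefixed wire format are invented for illustration and do not correspond to the real `marshal` format or to the C-level code object layout. Only `io.BytesIO.readinto`, `struct.pack`, and `struct.unpack` are real APIs.

```python
import io
import struct


class ToyCode:
    """Hypothetical stand-in for a code object with a trailing bytecode array."""

    def __init__(self, size):
        # Allocate the "tail" once; the bytecode is written into it in place.
        self.tail = bytearray(size)


def toy_marshal(code, stream):
    # Hypothetical wire format: 4-byte little-endian length, then the bytecode.
    stream.write(struct.pack("<I", len(code.tail)))
    # Write straight from the tail, without building a bytes copy first.
    stream.write(code.tail)


def toy_unmarshal(stream):
    (size,) = struct.unpack("<I", stream.read(4))
    code = ToyCode(size)
    # readinto() fills the tail directly; no intermediate bytes object holding
    # the bytecode is created and then copied.
    filled = stream.readinto(memoryview(code.tail))
    if filled != size:
        raise EOFError("truncated bytecode")
    return code


original = ToyCode(4)
original.tail[:] = b"\x97\x00\x64\x00"  # arbitrary bytes, not real bytecode

buffer = io.BytesIO()
toy_marshal(original, buffer)
buffer.seek(0)
restored = toy_unmarshal(buffer)
```

The real change would live in C, but the shape is the same: the reader learns the size, allocates the code object, and then deserializes into its tail.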

If `marshal` has a way to build a code object without an intermediate `bytes` object, then the compiler can use it too:

```mermaid
graph TB

COMPILER((compiler))
USER((user))
RAW[raw bytes]
MARSHALLED[marshalled bytes]
HEAP[executable code tail]

style RAW fill:red
style MARSHALLED fill:red
style HEAP fill:red

COMPILER --> |compile| HEAP

MARSHALLED --> |unmarshal| HEAP
HEAP --> |marshal + reset| MARSHALLED

MARSHALLED -.-> |deepfreeze| HEAP
HEAP --> |reset| HEAP

HEAP --> |copy + reset| RAW
RAW --> |disassemble| USER
```

So, by changing these two relatively minor things, it seems that we can simplify our handling of the bytecode quite a bit.
