Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pickle INT/LONG base discrepancy #126992

Open
Legoclones opened this issue Nov 19, 2024 · 9 comments
Open

Pickle INT/LONG base discrepancy #126992

Legoclones opened this issue Nov 19, 2024 · 9 comments
Labels
extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@Legoclones
Copy link
Contributor

Legoclones commented Nov 19, 2024

Bug report

Bug description:

The INT opcode in pickle is the I character followed by an ASCII number and a newline. There are multiple comments asking if the base should be explicitly set to 10, or kept as 0. However, a discrepancy exists between pickle implementations:

  • _pickle.c uses strtol(s, &endptr, 0); with a base of 0, meaning 0xf would succeed
  • pickle.py uses int(data, 0) with a base of 0, meaning 0xf would succeed
  • pickletools.py uses read_decimalnl_short(), which calls int(s), meaning any non-decimal base would fail

This same inconsistency exists with the LONG opcode:

This means an attempt to disassemble a pickle bytestream using pickletools would fail here, while the actual unpickling process would proceed undisputed.

Personally, I don't really care whether all implementations are changed to base 10 or base 0 (save_long() only puts it in decimal form), but I think it should be consistent across all implementations. I'd submit a pull request for one way or the other, but I'm not sure which way you'd prefer it.

Also as a note, the pickle bytestream b'I0001\n.' (INT with the argument 0001) fails in pickle.py because having leading 0s in a number with base 0 causes an error. Note that no errors are thrown in _pickle.c because it uses strtol or pickletools.py because it doesn't have base 0 specified. If we keep the implementation as base 0, that discrepancy between pickle.py and other pickle implementations would stay, whereas if we change it to base 10 (aka remove base 0), that inconsistency would also go away. For LONG, both pickle.py and _pickle.c fail with b'L0001L\n.', but pickletools.py has no problem displaying that number (since it has no base specified).

CPython versions tested on:

3.11

Operating systems tested on:

Linux

Linked PRs

@Legoclones Legoclones added the type-bug An unexpected behavior, bug, or error label Nov 19, 2024
@Legoclones Legoclones changed the title Pickle INT base discrepancy Pickle INT/LONG base discrepancy Nov 19, 2024
@vstinner
Copy link
Member

I think that using base 10 would be reasonable here.

@Legoclones
Copy link
Contributor Author

I'll make a pull request then

@skirpichev
Copy link
Member

I think that using base 10 would be reasonable here.

I think it's an incompatible change. Why not adjust pickletools instead?

BTW, I see no failing tests in #127042. Probably we should add some regression tests.

@Legoclones
Copy link
Contributor Author

No pickles produced by pickle.dumps() would use any base other than 10, so only custom-made pickles would be broken by this change (which is also why it wouldn't break any tests).

@skirpichev
Copy link
Member

Yes, this compatibility break doesn't look too severe. After all, I'm not sure that many projects use the version 0 protocol.

On another hand, it's really easy to not introduce backward-incompatible changes.

@vstinner
Copy link
Member

I think it's an incompatible change.

Yes, and IMO it's a good change.

In practice, no one should be impacted since everybody should use the Python module to serialize to pickle, and in this case, you cannot get a base different than 10 in the serialized files.

@serhiy-storchaka
Copy link
Member

I thought about using hexadecimals for LONG. This would fix the issue of quadratic complexity for converting long to/from string. But it was not supported in all implementations, so it would be a breaking change, while the purpose of the LONG opcode is compatibility with old Python versions and non-Python implementations.

@picnixz picnixz added stdlib Python modules in the Lib dir extension-modules C modules in the Modules dir labels Nov 21, 2024
@serhiy-storchaka
Copy link
Member

The CPython implementation always accepted non-decimal input for INT and LONG (except short period from 12 April to 15 May in 1996). It can be considered a hidden CPython feature.

I do not think it is an issue if the CPython implementation is more lenient than the specification (which was written after the implementation and still not complete).

@serhiy-storchaka
Copy link
Member

@tim-one, this may be interesting to you, as it relates to both pickletools and the problems with decimal representation of long integers.

Using hexadecimal representation for long integers would be more efficient in terms of time and size. But it is less efficient than using pickle protocol 1 or higher. And not all third-party implementations (for different programming languages) support hexadecimals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
Status: No status
Development

No branches or pull requests

5 participants