Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distutils is not reproducible #78214

Closed
vstinner opened this issue Jul 3, 2018 · 15 comments
Closed

distutils is not reproducible #78214

vstinner opened this issue Jul 3, 2018 · 15 comments
Labels
3.11 only security fixes stdlib Python modules in the Lib dir

Comments

@vstinner
Copy link
Member

vstinner commented Jul 3, 2018

BPO 34033
Nosy @vstinner, @benjaminp, @encukou, @methane, @yan12125, @bmwiedemann, @jefferyto
PRs
  • bpo-34033: distutils: byte_compile() sort files #8057
  • gh-78214: marshal: Stabilize FLAG_REF usage #8226
  • Dependencies:

    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-07-03.15:46:25.794>
    labels = ['3.8', 'library']
    title = 'distutils is not reproducible'
    updated_at = <Date 2020-04-10.13:23:09.703>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-04-10.13:23:09.703>
    actor = 'yan12125'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2018-07-03.15:46:25.794>
    creator = 'vstinner'
    dependencies = ['31377', '34093']
    files = []
    hgrepos = []
    issue_num = 34033
    keywords = ['patch']
    message_count = 9.0
    messages = ['320988', '320990', '320991', '321383', '321408', '321432', '321434', '337975', '359595']
    nosy_count = 9.0
    nosy_names = ['vstinner', 'benjamin.peterson', 'petr.viktorin', 'sascha_silbe', 'zbysz', 'methane', 'yan12125', 'bmwiedemann', 'jefferyto']
    pr_nums = ['8057', '8226']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue34033'
    versions = ['Python 3.8']

    @vstinner
    Copy link
    Member Author

    vstinner commented Jul 3, 2018

    Follow up of bpo-29708: OpenSUSE uses a downstream patch for distutils to fix https://bugzilla.opensuse.org/show_bug.cgi?id=1049186: distutils-reproducible-compile.patch. I converted the patch as a PR: PR 8057.

    Naoki INADA wrote:
    """
    Currently, marshal uses refcnt to determine using w_ref or not. Some immutable objects (especially, long and str) can be cached and reused. It may affects refcnt when byte compiling.

    I think we should use more deterministic way instead of refcnt. Maybe, count all constants in the module before marshal, like we did in compiling function for co_consts and co_names.
    As a bonus, it may reduce resource usage too by merging constants over functions.
    (e.g. ('self',) co_varnames and (None,) co_consts)
    """
    #8057 (comment)

    Serhiy Storchaka added:
    """
    I think we need to understand the issue better before committing changes. When found the source of unstability of file names, we can find other similar sources and make them stable too. For example if the source is listdir() or glob(), we can consider sorting results of all listdir() or glob() in distutils and related methods.

    On other side, if the problem is with reference counters in marshal, we can change the marshal module instead.
    """
    #8057 (comment)

    @vstinner vstinner added 3.8 (EOL) end of life stdlib Python modules in the Lib dir labels Jul 3, 2018
    @vstinner
    Copy link
    Member Author

    vstinner commented Jul 3, 2018

    Copy of https://bugzilla.opensuse.org/show_bug.cgi?id=1049186 first message:
    """
    e.g. python-simplejson has one-bit diffs in .pyc files
    See
    http://rb.zq1.de/compare.factory-20170713/python-simplejson-compare.out

    in python3-simplejson.rpm we get
    -00004e50 68 6f 72 5f 5f da 07 64 65 63 69 6d 61 6c 72 0c |hor__..decimalr.|
    +00004e50 68 6f 72 5f 5f 5a 07 64 65 63 69 6d 61 6c 72 0c |hor__Z.decimalr.|

    in python3-simplejson-test.rpm we get the opposite change
    -00000580 72 13 00 00 00 5a 07 64 65 63 69 6d 61 6c 72 03 |r....Z.decimalr.|
    +00000580 72 13 00 00 00 da 07 64 65 63 69 6d 61 6c 72 03 |r......decimalr.|

    and it seems to be related to filesystem ordering, since it built reproducibly
    when using a filesystem with sorted readdir
    using disorderfs via reproducible-faketools-filesys from
    https://build.opensuse.org/package/show/home:bmwiedemann:reproducible/reproducible-faketools
    """
    https://bugzilla.opensuse.org/show_bug.cgi?id=1049186#c0

    @benjaminp
    Copy link
    Contributor

    I agree that we should fix the underlying issue (marshal) rather than papering over it by sorting. In fact, we should have a test that compiles a bunch of pycs in a random orders and sees if they're the same or not.

    @methane
    Copy link
    Member

    methane commented Jul 10, 2018

    Is this issue for only known marshal issue?
    Or is this issue for all issues in distutils including unknowns?

    @benjaminp
    Copy link
    Contributor

    We should probably discuss the marshal issue in the preëxisting bpo-31377.

    I'm not sure if "distutils is not reproducible" is a larger issue than "pyc compilation is not reproducible". This issue could be a meta issue for either.

    @vstinner
    Copy link
    Member Author

    Is this issue for only known marshal issue?

    IMHO the order in which .pyc files are created on disk also matters. It changes the result of "os.listdir()": some application can rely on unsorted os.listdir(). sorted() seems simple and hardless compared to the benefit.

    @methane
    Copy link
    Member

    methane commented Jul 11, 2018

    OK, I created sub issue for pyc.

    @bmwiedemann
    Copy link
    Mannequin

    bmwiedemann mannequin commented Mar 15, 2019

    unreproducible .pyc files are still one of the major headaches for my work on openSUSE reproducible builds.

    There is also one aspect where i586 builds end up with different .pyc files than x86_64 builds. And then we randomly chose one of them for our "noarch" python module packages and hope they work everywhere (including on arm and s390 architectures).

    So is someone working towards a concept that makes it is possible to create the same .pyc files anywhere?
    Can I help something there?
    Is there an ETA?

    @encukou
    Copy link
    Member

    encukou commented Jan 8, 2020

    There is also one aspect where i586 builds end up with different .pyc files than x86_64 builds. And then we randomly chose one of them for our "noarch" python module packages and hope they work everywhere (including on arm and s390 architectures).

    They are functionally identical, despite not being bit-by-bit identical.
    If they do not work everywhere, it's a very serious bug.

    So is someone working towards a concept that makes it is possible to create the same .pyc files anywhere?

    No, it's a known issue no one is working on.

    Can I help something there?

    Maybe?
    The two main culprits are in the marshal serialization algorithm: https://github.com/python/cpython/blob/master/Python/marshal.c
    Specifically:

    • a heuristic depends on refcount (i.e. state of objects in the entire interpreter, rather than just relationships between serialized objects):
      /* if it has only one reference, it definitely isn't shared */
    • (frozen)sets are serialized in iteration order, which is unpredictable (and determinig a predictable order is not trivial):
      else if (PyAnySet_CheckExact(v)) {

    A solution will probably come with an unacceptable performance hit -- it's good to keep generating the .pyc files fast. Two options to overcome that come to mind:

    • make reproducibility optional (which would make the testing more cumbersome)
    • make an add-on tool to re-serialize an existing .pyc.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    methane added a commit that referenced this issue May 4, 2022
    Use FLAG_REF always for interned strings.
    
    Refcounts of interned string is very unstable.
    When compiling same source, refcounts of interned string in the output may be 1 or >1.
    It makes FLAG_REF usage unstable.
    
    To help reproducible build, use FLAG_REF for interned string even if refcnt(obj)==1.
    @methane methane added 3.11 only security fixes and removed 3.8 (EOL) end of life labels May 4, 2022
    @vstinner
    Copy link
    Member Author

    @methane: Yeah! Thank you for the 6dcfd6c fix.

    @josch
    Copy link

    josch commented May 13, 2022

    Thank you @methane -- we are now carrying your patch in python 3.10 in Debian: https://sources.debian.org/src/python3.10/3.10.4-4/debian/patches/gh-78214.diff/

    @AA-Turner
    Copy link
    Member

    @vstinner should this be closed now or will there be any other patches to Distutils before its removal?

    A

    @AA-Turner AA-Turner added the pending The issue will be closed if no feedback is provided label Jun 7, 2022
    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2022

    I don't know the status of this issue, you should ask @methane who is more involved in this topic.

    @AA-Turner
    Copy link
    Member

    Ahh sorry, I will wait for @methane's opinion.

    A

    @methane
    Copy link
    Member

    methane commented Jun 8, 2022

    We have enough opening issues relating to reproducible pyc. So I agree to close this one.

    @methane methane closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2022
    @AA-Turner AA-Turner removed the pending The issue will be closed if no feedback is provided label Jun 9, 2022
    irl pushed a commit to defo-project/cpython that referenced this issue Nov 28, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants