Improve importlib backports-upstream integration #129307
FFY00 opened this issue Jan 26, 2025 · 3 comments

Labels: infra (CI, GitHub Actions, buildbots, Dependabot, etc.), stdlib (Python modules in the Lib dir), topic-importlib

FFY00 commented Jan 26, 2025

The current status quo for development integration and synchronization between the importlib backports and the CPython upstream isn't optimal.

Before anything else, I must properly acknowledge @jaraco's monumental and tireless effort in maintaining the importlib backports and handling the complex synchronization with the CPython upstream, not to mention the continued development of these modules. It has been instrumental in getting things to the state they are in today, and none of the issues discussed in this thread should reflect negatively on him, but rather on our failure to ensure these projects got the resources they need — a far too common tale in open source.

Here are some issues I think we should improve:

  • Synchronization process — even though @jaraco has left comments in some PRs describing his workflow, there is no properly documented process
  • Authorship stripping — the current way changes are synced into and out of the backports strips commit authorship
  • Documentation fragmentation — resulting in sub-optimal documentation
  • CLA enforcement — the backports do not enforce the CLA
  • Segmented development workflow — issues and changes happen in both places
  • Source history — the current way changes are synced into and out of the backports strips commit history

cc @python/importlib-team

FFY00 added the infra (CI, GitHub Actions, buildbots, Dependabot, etc.) and topic-importlib labels on Jan 26, 2025
FFY00 commented Jan 26, 2025

I would like to propose officially defining a development upstream, and enforcing it.

The solution that I think would most cleanly handle the fragmentation, history, authorship, and CLA issues is to select CPython as the upstream. One approach to implementing this would be to track the backport version here and, when it is updated, have CI automation update the backport repos, just like we do when backporting to older Python versions.
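The version-tracking idea above can be sketched with plain git. This is a self-contained demo in a throwaway repository; the `backport-version.txt` file name is an invented placeholder, not an existing CPython convention:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

# Two commits: one tracking version 8.5.0, one bumping it to 8.6.0.
echo "8.5.0" > backport-version.txt
git add backport-version.txt
git -c user.name=Dev -c user.email=dev@example.com commit -q -m "track 8.5.0"

echo "8.6.0" > backport-version.txt
git add backport-version.txt
git -c user.name=Dev -c user.email=dev@example.com commit -q -m "bump to 8.6.0"

# A CI step could diff the tracked version against the previous commit
# and trigger the backport-repo sync job only when it changes.
old=$(git show HEAD~1:backport-version.txt)
new=$(git show HEAD:backport-version.txt)
if [ "$old" != "$new" ]; then
    echo "backport version changed: $old -> $new"
fi
```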

While I think that's cleaner, it is a major change to how these modules are currently developed, and the implementation might be too complex, so I think it's more likely that we go with the backports as the upstream. If so, there are a couple of things I think we should do:

  • Add issue tags for each backport which, once assigned, would cause the issue to be moved to the correct repo
  • Prevent PRs from changing the code (with a manual override)
  • Implement some level of automation for the synchronization operation, preserving commit authorship (perhaps even history)
  • Document the synchronization process
  • Have the backports maintain a copy of the corresponding CPython documentation, instead of having their own
  • Add the CLA bot to the backports
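The authorship-preserving sync mentioned in the list above is achievable with stock git: `git format-patch` plus `git am` carries the original author, date, and message through, unlike re-committing a diff by hand. A self-contained sketch using throwaway demo repositories (all names illustrative):

```shell
set -e
work=$(mktemp -d)

# A commit authored by "Alice" in the upstream demo repo.
git init -q "$work/cpython-demo"
cd "$work/cpython-demo"
echo 'def find(): ...' > metadata.py
git add metadata.py
git -c user.name=Alice -c user.email=alice@example.com \
    commit -q -m "importlib.metadata: fix lookup"

# Export the commit as a mailbox patch.
git format-patch -1 HEAD -o "$work/patches" > /dev/null

# Re-apply it in the backport demo repo; "git am" preserves the
# original author, date, and message even though "Syncer" applies it.
git init -q "$work/backport-demo"
cd "$work/backport-demo"
git -c user.name=Syncer -c user.email=sync@example.com \
    commit -q --allow-empty -m "initial"
git -c user.name=Syncer -c user.email=sync@example.com \
    am -q "$work/patches"/*.patch

git log -1 --format='%an <%ae>'   # still Alice, not the person syncing
```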

AA-Turner commented

> select CPython as the upstream

I think that this makes sense, especially as both of the importlib modules are no longer "provisional". Useful parallels can be drawn with PEP 360, which used to record "externally maintained" packages, and was updated in 2006 to say:

> It has been deemed dangerous to codify external maintenance of any code checked into Python’s code repository. Code contributors should expect Python’s development methodology to be used for any and all code checked into Python’s code repository.

Another parallel is the changes to pathlib that Barney has recently been making: he published the pathlib-abc package as a backport/preview, rather than doing primary development in that package.

Whilst having a brief look at the history, I found that Jason noted in a comment from a few years ago that:

> The advantage of having the module in the standard library is that at some point, the pace of change should slow and the stdlib can become the primary/only use.


The other two recently externally-developed modules seem to be tomli and zoneinfo (please let me know if I'm missing any).

Three of the four PRs to tomllib since it was added as a package were to synchronise with tomli as an upstream (#128907, #126428, and #124587). These have each been quite minor, and each has been opened as an individual PR, rather than an omnibus "sync with version X" update.

zoneinfo used to have some sync PRs, but the last one seems to be four years ago (#20499), and the backport package has not been updated for two years (last release 2020-06).


There have also been some problems with synchronising the documentation, as the d.p.o documentation used to point to the backport (python/importlib_metadata#485), with one user going so far as to manipulate Sphinx internals (stefan6419846/license_tools#63) to solve this problem. Ultimately, documentation was removed from the backport package (python/importlib_metadata#466).


To Jason's quoted comment above about pace of development eventually slowing, I wonder if at some point we should seek to update the backport packages less frequently, and to mirror Python releases. There is prior art for this with zipfile3{x} packages on PyPI. This would ease the burden of the actual backporting, as it would be done less often. The backport package could also use a rebase or fast-forward merge, which would preserve authorship details.
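The fast-forward idea can be demonstrated in a scratch repository: fast-forwarding only moves the branch pointer, so the synced commits keep their hashes, authors, and dates exactly. A minimal sketch (branch and author names illustrative):

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
default=$(git symbolic-ref --short HEAD)   # "main" or "master"

git -c user.name=Carol -c user.email=carol@example.com \
    commit -q --allow-empty -m "release point"
git branch backport                        # the backport tracks this point

git -c user.name=Dave -c user.email=dave@example.com \
    commit -q --allow-empty -m "new feature"
upstream=$(git rev-parse HEAD)

# --ff-only refuses anything but a pointer move: no commit is rewritten,
# so hash, author, and date are identical on both branches afterwards.
git checkout -q backport
git merge -q --ff-only "$default"

git log -1 --format='%an'   # Dave's authorship is preserved verbatim
test "$(git rev-parse HEAD)" = "$upstream"
```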

As such, I would be in favour of this python/cpython repo being the one where new features are developed, discussed, and merged for importlib.resources and importlib.metadata (and also tomllib).

A

picnixz added the stdlib (Python modules in the Lib dir) label on Jan 31, 2025
jaraco commented Feb 8, 2025

Thanks @FFY00 for raising this issue. It's been a lingering concern of mine as well, and I've had a lot of thoughts on the matter.

My instinct is the same, that ideally the stdlib should be the canonical implementation and upstream. That's the case for several other backports I maintain (backports.tarfile, singledispatch, configparser, ...).

The main reason I haven't taken the packages in this direction is that the third-party libraries are more capable and thus drastically easier to develop. I have, in fact, documented the methodology. I should probably link that document from the READMEs of the third-party projects for increased visibility.

In general, the third-party packages get a much more modern, complete, and sophisticated treatment. It's the presence of these documented advantages that have kept me reluctant to move the upstream to the stdlib.

I've been thinking about ways to make the integration (and attribution) better. There are some factors that make the integration more difficult.

  • CPython uses squash merges, meaning that regardless of which commits are in the PR, the final commit is attributed to the person who merged the PR and possibly one other author, plus maybe some notes in the merge commit description.
  • Because the code is in separate repositories, it's difficult for Git to associate the changes in the different histories. Contrast that with CPython's internal backport branches, which have a traceable history across branches.
  • Both projects have substantial artifacts that are irrelevant to the project at hand (CPython has the rest of the stdlib plus CPython itself, the third-party packages have backward-compatibility shims, packaging infrastructure, other config, and separate documentation).
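The squash-merge point in the first bullet is easy to demonstrate in a scratch repository: the squashed commit is a brand-new object authored by the person merging, and contributor attribution survives only as an optional `Co-authored-by:` trailer added to the message. A self-contained sketch (all names and the issue number are placeholders):

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
default=$(git symbolic-ref --short HEAD)
git -c user.name=Core -c user.email=core@example.com \
    commit -q --allow-empty -m "initial"

# Two PR commits authored by a contributor.
git checkout -q -b pr-branch
for n in 1 2; do
    echo "$n" > "file$n.py"
    git add "file$n.py"
    git -c user.name=Contributor -c user.email=contrib@example.com \
        commit -q -m "change $n"
done

# A squash merge collapses both into one new commit authored by whoever
# runs the merge; the contributor survives only via a trailer.
git checkout -q "$default"
git merge -q --squash pr-branch
git -c user.name=CoreDev -c user.email=core@example.com \
    commit -q -m "gh-00000: squashed change (#1)" \
    -m "Co-authored-by: Contributor <contrib@example.com>"

git log -1 --format=%an   # CoreDev, not Contributor
```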

In an ideal world, the canonical source for something like "importlib metadata" would exist somewhat independently, be linked into the various target projects, and have customizations overlay and extend the canonical source. I can imagine a couple of ways to model these concerns using VCS tools.

A branch per project

Imagine having a separate branch in CPython for each project, with its history rooted independently of the CPython history. This branch would contain either the raw source or possibly the full third-party package, but when merged into CPython, would track the new location and CPython-specific requirements.

This approach doesn't work in the CPython pull request model due to the squashed merges (the tracking is lost).

That's why, instead, each of these projects carries its own cpython branch to track those concerns.

Submodules

Another way to model subsets of an implementation is through Git Submodules. Some companies and projects use submodules as a way to compose larger systems from smaller components.

You could imagine the importlib subprojects to each be a submodule attached at Lib/importlib/{submodule}, and have branches off of those submodule repos implement the third-party packages and merge/cherry-pick changes.
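A minimal demonstration of the attachment idea, using a local throwaway repo in place of the real importlib_metadata (note that modern git requires `protocol.file.allow=always` to add a file-path submodule):

```shell
set -e
work=$(mktemp -d)

# A standalone repo standing in for the canonical importlib_metadata source.
git init -q "$work/importlib_metadata"
cd "$work/importlib_metadata"
echo 'def version(name): ...' > __init__.py
git add __init__.py
git -c user.name=Dev -c user.email=dev@example.com commit -q -m "initial"

# Attach it inside a CPython-like tree at Lib/importlib/metadata.
git init -q "$work/cpython-demo"
cd "$work/cpython-demo"
git -c protocol.file.allow=always submodule --quiet add \
    "$work/importlib_metadata" Lib/importlib/metadata

# The superproject records only a pinned commit of the subproject,
# plus the mapping in .gitmodules.
git submodule status
```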

This approach is fraught with problems:

  • Submodules are themselves a second-class feature, are not enabled by default, and require user intervention from developers working with the repo
  • These projects have more than one attachment point. They attach not only to Lib/importlib/* but also Lib/test/test_importlib/*.
    • This concern could potentially be addressed by attaching to a separate path and symlinking the relevant resources, but symlinks also have limited portability and a second-class experience.
    • Another potential workaround could be to move the tests to Lib/importlib/metadata/tests (or similar), but that would violate established conventions. This approach could be applied across all of the stdlib, but that would be a highly disruptive migration. In my opinion, it would be a better outcome overall, but I'm not confident it can be executed safely.
  • Every change would need to be developed in the stdlib or the third-party package and then cherry-picked back into the canonical source to avoid contaminating the essential source.

Last year, I kicked off work on the essential layout, which aims to solve some of these problems and empower projects to be composable in this way, but it has already had to concede some of the purity of the design (pyproject.toml and .github) and still has some unsolved problems (it's incompatible with RTD).


Ultimately, I don't feel these options are very attractive, so I'm left limping along with the current methodology.

I quite like the suggestions Filipe has brought up. They all sound reasonable; let's revisit them in light of the documented methodology.

One last thing I wanted to mention: although I dislike it, I sometimes batch several changes from the third-party packages into CPython, mainly because it's a bit of work to get everything synchronized, and the amount of toil it would take to re-submit each contribution in multiple places would be impractical. If we had automation to mechanically apply changes to both projects together, that would be ideal.
