Develop a CRM script to handle import name mis matches #218

schuylermartin45 · 2024-10-25T21:01:58Z

Resolves #181

The original ticket concept has changed a few times as others have pointed me to the cf-graph-countyfair data project.

We now cache a modified version of Conda Forge's mapping that attempts to map irregular cases where the "import name" does not match the package name.

In theory, this will make our Python dependency calculation efforts more accurate.

…nformation.

…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches

…ept in favor of caching the import name data supplied by conda-forge

schuylermartin45 · 2024-10-25T21:12:41Z

I'm unclear on the licensing implications of this change. I wouldn't consider the data derived from Conda Forge here as "source code", but I'm not sure what the legal interpretation of "source code" is over my engineering interpretation.

That being said, I'm guessing we want to give some kind of attribution.

I imagine this goes one of two routes:

Copy what we did in conda-recipe-manager-test-data, and modify the top of the LICENSE file in this repo to include:

Copyright (c) 2016-2018, conda-forge
Copyright (c) 2016-2018, conda-forge contributors

I'm guessing we could end in 2024 over 2018, but 2018 is the most recent listed in their LICENSE file.

Copy the names actually found in cf-graph-countyfair:

Copyright (c) 2018, Board of Trustees of Columbia University in the city of New York.
Copyright (c) 2017, Peter M. Landwehr
Copyright (c) 2017, Anthony Scopatz
Copyright (c) 2016 Aaron Meurer, Gil Forsyth

I imagine this file is very out of date and there have been many contributors post 2018 on the project not captured here.

@jezdez @beeankha Thoughts?

beckermr · 2024-10-25T21:30:06Z

You should read through conda/grayskull#564 before you do much more on this PR. We are trying to cleanup the mapping situation and there are multiple existing solutions already. It'd be a real shame to introduce yet another variant into the mix instead of developing a single solution.

cc @maresb

beckermr

A more direct comment related to this PR.

You should depend directly on conda-forge-metadata to get these mappings.

See the APIs we supply here: https://github.com/conda-forge/conda-forge-metadata/blob/main/conda_forge_metadata/autotick_bot/pypi_to_conda.py#L33

conda-forge-metadata is available on pypi and OFC conda-forge.

If there is another API you want/need, we should add it in conda-forge-metadata.

schuylermartin45 · 2024-10-28T14:47:54Z

Hold on, before we continue, let us be 100% clear on what data I'm pulling, because I (and have seen others) have already gotten tripped up on this nuance.

I'm interested in import name to conda package name mapping, NOT the pypi to conda package name mapping.

They are not the same. In other words, I need to know that the conda package pillow gets imported with import PIL. Unless I'm missing something, a lot of the conversations I've seen on GitHub and the Element chat refer to the pypi mapping.

As far as I know, this section in cf-countyfair is the only datasource for the import mapping and this PR is just caching that data to save on network requests. In this first iteration, I don't need to constantly check for updates, it looks like this data is seldom updated.

If there is another way, let's talk. But I've already spent a few days trying to hunt down what data source to use. As far as I know, none of this is documented or explained anywhere. It is incredibly frustrating when everyone you talk to has a partial picture of the problem.

beckermr · 2024-10-28T15:04:23Z

Ah cool. That's all fine. In any case, the url you are using is not a supported api and could break. We should add an api to conda-forge-metadata to pull the file you need if it is not there.

beckermr · 2024-10-28T15:29:12Z

As far as I know, none of this is documented or explained anywhere.

Yep, this is a big hole. We've not been able to close it yet and that is definitely largely my fault. The hope is that as we add more to conda-forge-metadata, we will add documentation and this will be the place to look.

Thank you for your patience!

schuylermartin45 · 2024-10-28T19:46:12Z

Even if it is in an API, I'd still like to locally cache the results. Maybe it comes from my personal bias working in the IoT and firmware space, but network calls have costs. I'd rather not incur API hits on data that appears to change pretty infrequently.

So even if this ends up on conda-forge-metadata, I'll want some local form for redundancy and speed.

I'm not sure how long/if I have the time to add this to conda-forge-metadata but I'll try to at least look. This PR might be a stop gap for our more immediate needs.

At the very least, I'm going to make a note at the top of the script indicating that the conda-forge-metadata API exists and we should consider using it in the future. Glad we talked through this!

schuylermartin45 · 2024-10-28T20:00:02Z

Ok, so it is at least partially implemented: def map_import_to_package(import_name: str) -> str:

But obviously I'd have to query every single string. The table this PR generates for the cache is only ~300kb, but I'd imagine we'd want some pagination scheme to future-proof, if we allow for a fully query of the mapping.

schuylermartin45 · 2024-10-28T20:11:12Z

I collected my thoughts together and opened this against conda-forge-metadata: conda-forge/conda-forge-metadata#55

beckermr · 2024-10-28T20:11:49Z

We can add an API to the file import_name_priority_mapping.json you are using. That file is made from the underlying mapping results from each package with some graph heuristics applied to resolve conflicts.

…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches

jezdez · 2024-11-14T19:37:50Z

I'm unclear on the licensing implications of this change. I wouldn't consider the data derived from Conda Forge here as "source code", but I'm not sure what the legal interpretation of "source code" is over my engineering interpretation.

That being said, I'm guessing we want to give some kind of attribution.

I imagine this goes one of two routes:

Copy what we did in conda-recipe-manager-test-data, and modify the top of the LICENSE file in this repo to include:
Copyright (c) 2016-2018, conda-forge
Copyright (c) 2016-2018, conda-forge contributors
I'm guessing we could end in 2024 over 2018, but 2018 is the most recent listed in their LICENSE file.

Copy the names actually found in cf-graph-countyfair:
Copyright (c) 2018, Board of Trustees of Columbia University in the city of New York.
Copyright (c) 2017, Peter M. Landwehr
Copyright (c) 2017, Anthony Scopatz
Copyright (c) 2016 Aaron Meurer, Gil Forsyth
I imagine this file is very out of date and there have been many contributors post 2018 on the project not captured here.

@jezdez @beeankha Thoughts?

I concur what @beckermr said above, it would be easiest to follow what conda-forge/conda-forge-metadata#55 is proposing, not just for legal but also for API stability reasons.

Once it's exposed, spelling out licensing information for the data becomes moot, not needed anymore, since the license of conda-forge-metadata is encoded in the package metadata. And importing the data via Python is a legally safe thing to do.

Of course, if you add any file from any permissively licensed repo, it's easiest to copy the full LICENSE file (or equivalent like COPYING) of the 3rd party repo into the LICENSE file of this repo, with a headline noting where it came from, like https://github.com/regro/cf-graph-countyfair/blob/master/License actually does as well. That way it becomes clear that there are components included here that are derived from other permissively licensed software. I've also seen separate LICENSE with a suffix being used successfully to express more complex licensing setups and simply referencing them in a central LICENSE file, e.g. a LICENSE.cf-graph-countyfair that is referenced in LICENSE.

jezdez · 2024-11-14T19:39:00Z

Of course the usual thing applies, I'm not a lawyer or licensing expert, and I'd encourage to reach out to Anaconda legal or NumFOCUS legal to get confirmation.

schuylermartin45 · 2024-11-18T17:31:02Z

I appreciate the insight, @jezdez
I don't think I have time to pursue either path right now and the project that needed this data is blocked by other high priority issues.

I will mark the tracking ticket as Blocked for now and circle-back when I have more time. The ticket can be found here: #181

schuylermartin45 added 3 commits October 8, 2024 15:40

Back-up commit, unclear if work is worth continuing in light of new i…

4346ef3

…nformation.

Merge branch 'main' of github.com:conda-incubator/conda-recipe-manage…

4d8be6c

…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches

Deprecates original scanning recipes for mismatched import names conc…

84ad3a2

…ept in favor of caching the import name data supplied by conda-forge

schuylermartin45 requested a review from a team as a code owner October 25, 2024 21:01

schuylermartin45 linked an issue Oct 25, 2024 that may be closed by this pull request

Develop a CRM script to scan feedstocks for dependency/import name mis-matches #181

Open

conda-bot added the cla-signed label Oct 25, 2024

beckermr requested changes Oct 25, 2024

View reviewed changes

Adds notes from PR conversation

702a4c1

schuylermartin45 mentioned this pull request Oct 28, 2024

Request: Expose the import to conda package name mapping table in a more cost-effective way. conda-forge/conda-forge-metadata#55

Open

schuylermartin45 added 2 commits November 4, 2024 08:23

Merge branch 'main' of github.com:conda-incubator/conda-recipe-manage…

b4ab58b

…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches

Merge branch 'main' of github.com:conda-incubator/conda-recipe-manage…

4e47bd5

…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop a CRM script to handle import name mis matches #218

Develop a CRM script to handle import name mis matches #218

schuylermartin45 commented Oct 25, 2024

schuylermartin45 commented Oct 25, 2024

beckermr commented Oct 25, 2024

beckermr left a comment

schuylermartin45 commented Oct 28, 2024

beckermr commented Oct 28, 2024

beckermr commented Oct 28, 2024

schuylermartin45 commented Oct 28, 2024

schuylermartin45 commented Oct 28, 2024

schuylermartin45 commented Oct 28, 2024

beckermr commented Oct 28, 2024

jezdez commented Nov 14, 2024 •

edited

Loading

jezdez commented Nov 14, 2024

schuylermartin45 commented Nov 18, 2024

Develop a CRM script to handle import name mis matches #218

Are you sure you want to change the base?

Develop a CRM script to handle import name mis matches #218

Conversation

schuylermartin45 commented Oct 25, 2024

schuylermartin45 commented Oct 25, 2024

beckermr commented Oct 25, 2024

beckermr left a comment

Choose a reason for hiding this comment

schuylermartin45 commented Oct 28, 2024

beckermr commented Oct 28, 2024

beckermr commented Oct 28, 2024

schuylermartin45 commented Oct 28, 2024

schuylermartin45 commented Oct 28, 2024

schuylermartin45 commented Oct 28, 2024

beckermr commented Oct 28, 2024

jezdez commented Nov 14, 2024 • edited Loading

jezdez commented Nov 14, 2024

schuylermartin45 commented Nov 18, 2024

jezdez commented Nov 14, 2024 •

edited

Loading