Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop a CRM script to handle import name mis matches #218

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

schuylermartin45
Copy link
Collaborator

Resolves #181

The original ticket concept has changed a few times as others have pointed me to the cf-graph-countyfair data project.

We now cache a modified version of Conda Forge's mapping that attempts to map irregular cases where the "import name" does not match the package name.

In theory, this will make our Python dependency calculation efforts more accurate.

…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches
…ept in favor of caching the import name data supplied by conda-forge
@schuylermartin45
Copy link
Collaborator Author

I'm unclear on the licensing implications of this change. I wouldn't consider the data derived from Conda Forge here as "source code", but I'm not sure what the legal interpretation of "source code" is over my engineering interpretation.

That being said, I'm guessing we want to give some kind of attribution.

I imagine this goes one of two routes:

  1. Copy what we did in conda-recipe-manager-test-data, and modify the top of the LICENSE file in this repo to include:
Copyright (c) 2016-2018, conda-forge
Copyright (c) 2016-2018, conda-forge contributors

I'm guessing we could end in 2024 over 2018, but 2018 is the most recent listed in their LICENSE file.

  1. Copy the names actually found in cf-graph-countyfair:
Copyright (c) 2018, Board of Trustees of Columbia University in the city of New York.
Copyright (c) 2017, Peter M. Landwehr
Copyright (c) 2017, Anthony Scopatz
Copyright (c) 2016 Aaron Meurer, Gil Forsyth

I imagine this file is very out of date and there have been many contributors post 2018 on the project not captured here.

@jezdez @beeankha Thoughts?

@beckermr
Copy link
Member

You should read through conda/grayskull#564 before you do much more on this PR. We are trying to cleanup the mapping situation and there are multiple existing solutions already. It'd be a real shame to introduce yet another variant into the mix instead of developing a single solution.

cc @maresb

Copy link
Member

@beckermr beckermr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more direct comment related to this PR.

You should depend directly on conda-forge-metadata to get these mappings.

See the APIs we supply here: https://github.com/conda-forge/conda-forge-metadata/blob/main/conda_forge_metadata/autotick_bot/pypi_to_conda.py#L33

conda-forge-metadata is available on pypi and OFC conda-forge.

If there is another API you want/need, we should add it in conda-forge-metadata.

@schuylermartin45
Copy link
Collaborator Author

Hold on, before we continue, let us be 100% clear on what data I'm pulling, because I (and have seen others) have already gotten tripped up on this nuance.

I'm interested in import name to conda package name mapping, NOT the pypi to conda package name mapping.

They are not the same. In other words, I need to know that the conda package pillow gets imported with import PIL. Unless I'm missing something, a lot of the conversations I've seen on GitHub and the Element chat refer to the pypi mapping.

As far as I know, this section in cf-countyfair is the only datasource for the import mapping and this PR is just caching that data to save on network requests. In this first iteration, I don't need to constantly check for updates, it looks like this data is seldom updated.

If there is another way, let's talk. But I've already spent a few days trying to hunt down what data source to use. As far as I know, none of this is documented or explained anywhere. It is incredibly frustrating when everyone you talk to has a partial picture of the problem.

@beckermr
Copy link
Member

Ah cool. That's all fine. In any case, the url you are using is not a supported api and could break. We should add an api to conda-forge-metadata to pull the file you need if it is not there.

@beckermr
Copy link
Member

As far as I know, none of this is documented or explained anywhere.

Yep, this is a big hole. We've not been able to close it yet and that is definitely largely my fault. The hope is that as we add more to conda-forge-metadata, we will add documentation and this will be the place to look.

Thank you for your patience!

@schuylermartin45
Copy link
Collaborator Author

Even if it is in an API, I'd still like to locally cache the results. Maybe it comes from my personal bias working in the IoT and firmware space, but network calls have costs. I'd rather not incur API hits on data that appears to change pretty infrequently.

So even if this ends up on conda-forge-metadata, I'll want some local form for redundancy and speed.

I'm not sure how long/if I have the time to add this to conda-forge-metadata but I'll try to at least look. This PR might be a stop gap for our more immediate needs.

At the very least, I'm going to make a note at the top of the script indicating that the conda-forge-metadata API exists and we should consider using it in the future. Glad we talked through this!

@schuylermartin45
Copy link
Collaborator Author

Ok, so it is at least partially implemented: def map_import_to_package(import_name: str) -> str:

But obviously I'd have to query every single string. The table this PR generates for the cache is only ~300kb, but I'd imagine we'd want some pagination scheme to future-proof, if we allow for a fully query of the mapping.

@schuylermartin45
Copy link
Collaborator Author

I collected my thoughts together and opened this against conda-forge-metadata: conda-forge/conda-forge-metadata#55

@beckermr
Copy link
Member

We can add an API to the file import_name_priority_mapping.json you are using. That file is made from the underlying mapping results from each package with some graph heuristics applied to resolve conflicts.

…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches
…r into 181-develop-a-crm-script-to-scan-feedstocks-for-dependencyimport-name-mis-matches
@jezdez
Copy link
Member

jezdez commented Nov 14, 2024

I'm unclear on the licensing implications of this change. I wouldn't consider the data derived from Conda Forge here as "source code", but I'm not sure what the legal interpretation of "source code" is over my engineering interpretation.

That being said, I'm guessing we want to give some kind of attribution.

I imagine this goes one of two routes:

  1. Copy what we did in conda-recipe-manager-test-data, and modify the top of the LICENSE file in this repo to include:
Copyright (c) 2016-2018, conda-forge
Copyright (c) 2016-2018, conda-forge contributors

I'm guessing we could end in 2024 over 2018, but 2018 is the most recent listed in their LICENSE file.

  1. Copy the names actually found in cf-graph-countyfair:
Copyright (c) 2018, Board of Trustees of Columbia University in the city of New York.
Copyright (c) 2017, Peter M. Landwehr
Copyright (c) 2017, Anthony Scopatz
Copyright (c) 2016 Aaron Meurer, Gil Forsyth

I imagine this file is very out of date and there have been many contributors post 2018 on the project not captured here.

@jezdez @beeankha Thoughts?

I concur what @beckermr said above, it would be easiest to follow what conda-forge/conda-forge-metadata#55 is proposing, not just for legal but also for API stability reasons.

Once it's exposed, spelling out licensing information for the data becomes moot, not needed anymore, since the license of conda-forge-metadata is encoded in the package metadata. And importing the data via Python is a legally safe thing to do.

Of course, if you add any file from any permissively licensed repo, it's easiest to copy the full LICENSE file (or equivalent like COPYING) of the 3rd party repo into the LICENSE file of this repo, with a headline noting where it came from, like https://github.com/regro/cf-graph-countyfair/blob/master/License actually does as well. That way it becomes clear that there are components included here that are derived from other permissively licensed software. I've also seen separate LICENSE with a suffix being used successfully to express more complex licensing setups and simply referencing them in a central LICENSE file, e.g. a LICENSE.cf-graph-countyfair that is referenced in LICENSE.

@jezdez
Copy link
Member

jezdez commented Nov 14, 2024

Of course the usual thing applies, I'm not a lawyer or licensing expert, and I'd encourage to reach out to Anaconda legal or NumFOCUS legal to get confirmation.

@schuylermartin45
Copy link
Collaborator Author

I appreciate the insight, @jezdez
I don't think I have time to pursue either path right now and the project that needed this data is blocked by other high priority issues.

I will mark the tracking ticket as Blocked for now and circle-back when I have more time. The ticket can be found here: #181

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Develop a CRM script to scan feedstocks for dependency/import name mis-matches
4 participants