Skip to content
This repository has been archived by the owner on May 14, 2024. It is now read-only.

FEEDBACK WANTED: which directories / files / file patterns should be excluded from the Ansible installation? #65

Closed
felixfontein opened this issue Feb 1, 2022 · 31 comments

Comments

@felixfontein
Copy link
Contributor

Summary

Since we want to strip down the installed Ansible package for Ansible 6 (see WIP PR: ansible-community/antsibull-build#342), we should discuss which files, directories and file patterns to leave out.

I would really like to avoid having a list of special cases for special directories, so let's start with something as generic as possible. Some things that really have to stay are MANIFEST.json (needed to retrieve the collections' version) and meta/runtime.yml (needed for plugin routing). README, LICENSE and COPYING files should better be kept as well. Directories like plugins/ and roles/ should only be touched with care.

As a first approximation, I've added the following to the ignore list. Files are just named, directories have a trailing slash. All patterns are relative to the collection root.

If you think some of the above should not be excluded, or think some more should be excluded, feel free to mention that here! The idea is to gather a lot of feedback so we can come up with a definite list of things that should be excluded (and things that should not be excluded).

@tadeboro
Copy link

tadeboro commented Feb 1, 2022

Maybe we should decide what to include instead? Since we have no control over what collection maintainers package into artifacts, maintaining a comprehensive exclusion list seems like something that could turn into a full-time job.

My include list would contain:

  1. README* (case-insensitive match)
  2. LICENSE*, COPYING* (case-insensitive match)
  3. MANIFEST.json
  4. meta folder
  5. plugins folder
  6. roles folder (thanks, @felixfontein)

I must admit that it has been a while since I looked at the collection artifact, so do not be afraid to point out any omissions in my list because I am already assuming I forgot something important ;)

Edit: Just to make it clear: I do not propose to filter the contents of sdist tarballs (because legal). My inclusion list applies to things like wheels that target end-users.

@dmsimard
Copy link

dmsimard commented Feb 1, 2022

Maybe we should decide what to include instead? Since we have no control over what collection maintainers package into artifacts, maintaining a comprehensive exclusion list seems like something that could turn into a full-time job.

I like this take from @tadeboro because ansible-galaxy collection build takes EVERYTHING that is not specifically excluded which is why we end up with such a variety of unnecessary files and directories. I fear that ship has sailed but it would be nice if we had the ability to switch that around and have an "opt-in" instead of an "opt-out".

Otherwise I think it would be fair to link this other discussion topic about reducing the size of the package and improving the installation performance: #29

For the record, here's the unncessary/hidden files and directories that are currently deleted from Fedora packaging for Ansible 5.3.0: https://gist.github.com/dmsimard/edbbf4206b12ac77a694f96eefd60e41 (368 hits)

Unless mistaken, there is absolutely no use to keeping them in the package and it would be much cleaner to not include them at all -- even if they are not problematic in amount or in size.

In fact, the bulk of the package comes from sanity, unit and integration tests which make up more than 20k files:

> export gist="https://gist.githubusercontent.com/dmsimard/8959f608b958949709156184ff403bc5/raw/0e87a354c7a7d33060383a5bc71d028777f9d8d1/removed_files.log"

> curl -s $gist | grep "/tests/unit" | wc -l
4659

> curl -s $gist | grep "/tests/integration" | wc -l
18384

> curl -s $gist | grep "/tests/sanity" | wc -l
385

I originally thought that we shouldn't ship tests at all but my opinion has matured a bit as I worked on downstream packaging and discussed with maintainers for other distributions:

  • There should be a distinction between what is shipped in the ansible release tarball and what is installed (this is already WIP via [Ansible 6] Improve setup.py to exclude some unnecessary files antsibull-build#342 and otherwise described here). I don't believe tests should be installed.
  • Unit and sanity tests are relatively small in size while also being fairly easy to run without many extra dependencies. As such, there is value in keeping them in the tarball so they can be run by downstream consumers. I'd like to eventually run them as part of our package build CI jobs while Debian are already running them and Fedora will soon as well. This is a fairly inexpensive way to ensure that the resulting package is not horribly broken.
  • Integration tests are harder to run, especially without containers. It is not something that we could run in the Fedora RPM build environment, for example. Given that they make up the bulk of the tests and the package itself, I could be convinced to not include them in the tarball at all if I am not presented with a legitimate use case for them.

With that in mind and including what @felixfontein originally suggested, I came up with the following table (not final, just my first opinion):

path tarball install
hidden files
changelogs/ ✔️
docs/ ✔️
tests/sanity/ ✔️
tests/unit/ ✔️
tests/integration/

@felixfontein
Copy link
Contributor Author

@tadeboro you definitely forgot roles/ :-) I do like the approach "exclude everything that's not explicitly approved" though.

@dmsimard I don't think we can exclude anything from the tarball, since it's our source distribution, until legal approves that.

If someone wants to take a look at the current tarball: https://files.pythonhosted.org/packages/d3/67/ceac5a18e6b675e7ea5330a0737625593ab8f9706c1ed31071c29b62322c/ansible-5.3.0.tar.gz

@dmsimard
Copy link

dmsimard commented Feb 2, 2022

@dmsimard I don't think we can exclude anything from the tarball, since it's our source distribution, until legal approves that.

Still haven't heard back on that topic...

@mariolenz
Copy link
Contributor

Tricky question. At first, I thought it would be a good idea. But now I think you shouldn't leave out anything.

You see, if the Ansible package contains collection xyz, it should contain it as it was released imho. Otherwise, you'll have a difference between installing Ansible and installing ansible-core plus the collection. (You'll have a difference in where it's installed, but not in what.)

I think the best way would be to urge collection authors to strip down their releases by making use of build_ignore in galaxy.yml.

If the collection authors think there's a good reason to not exclude a file or directory, the same reason might apply to not exclude it from Ansible.

@mariolenz
Copy link
Contributor

For the record:

root [ /playbooks ]# du -hsc /usr/lib/python3.10/site-packages/ansible_collections/ | tail -n 1
377M	total
root [ /playbooks ]# du -hcs /usr/lib/python3.10/site-packages/ansible_collections/*/*/plugins | tail -n 1
160M	total
root [ /playbooks ]# du -hsc /usr/lib/python3.10/site-packages/ansible_collections/*/*/tests | tail -n 1
126M	total
root [ /playbooks ]# du -hsc /usr/lib/python3.10/site-packages/ansible_collections/*/*/docs | tail -n 1
41M	total
root [ /playbooks ]# du -hsc /usr/lib/python3.10/site-packages/ansible_collections/*/*/docs /usr/lib/python3.10/site-packages/ansible_collections/*/*/tests | tail -n 1
166M	total
root [ /playbooks ]#

So there are more docs and tests than modules.

@felixfontein
Copy link
Contributor Author

I think the best way would be to urge collection authors to strip down their releases by making use of build_ignore in galaxy.yml.

That's basically impossible: the artefact uploaded to Galaxy serves as the source distribution of the collection (at least that's how it works for community collections), and thus it must contain everything needed to build and develop on the collection. If Galaxy would allow to host both a source distribution and a built version, that would be possible.

(At least that is how I understand the legal situation. I'm no lawyer.)

@mariolenz
Copy link
Contributor

the artefact uploaded to Galaxy serves as the source distribution of the collection (at least that's how it works for community collections), and thus it must contain everything needed to build and develop on the collection.

Sounds like we need a definition of what a release on galaxy should be. Is it a source distribution or is it something for end users?

If Galaxy would allow to host both a source distribution and a built version, that would be possible.

Linux distributions often have a something like package abc and also abc-devel or similar. Do you mean something like this?

Sounds like a good idea, but would take some time to implement imho.

(At least that is how I understand the legal situation. I'm no lawyer.)

Where do you see a legal problem? I see a technical and organizational problem here, but not a legal one.

@schurzi
Copy link

schurzi commented Feb 22, 2022

From an external standpoint I'm also in favor of an include list, that has all the basic files/directories listed (like @tadeboro proposed) and can be extended by a collection maintainer if needed.

My rationale for this is, that I expect that people forget to manage an explicit exclude list, when they add a scripts/ or .somecoolsaas.yml to their repo. Also people might decide to exclude things that are actually needed (ansible-collections/community.crypto#185 ;))

On the flip-side this approach could break collections, that need stuff outside the normal directories. Currently I cannot think of a case, where this would be an issue (for the collection I maintain). Maybe this is also a case that should break, since it is generally not expected to extend the collection folder structure by individual folders?

It should be easy to do the right thing. I think the right thing is to keep the packages small and to maintain a clean structure of folders/files for Ansible collections.

@mariolenz
Copy link
Contributor

I'd like to step back and think again. What's the goal? Well, it's

to strip down the installed Ansible package for Ansible 6 (see WIP PR: ansible-community/antsibull#342)

So are we looking for the best solution, that is should Ansible 6 contain only the minimum of required files? This would be nice, I like the purity of this. But maybe we should think: What's good enough to reach this goal? For example, while all those .gitignore files really don't make sense, would removing them make Ansible 6 significantly smaller? I don't think so.

As stated above, the bulk of non-code is in tests and docs, so maybe you should start by removing those.

As you say, some of this documentation is redundant. On the other hand, for someone working in an air-gaped environment it might be useful to have scenario guides, filter guides, etc. available offline. But I don't really think so. All in all, I tend to removing docs.

I think you should start with removing tests and docs. They're the bulk of stuff you can save. The rest of your list looks OK for me, but it's small fry.

@schurzi
Copy link

schurzi commented Feb 22, 2022

So are we looking for the best solution, that is should Ansible 6 contain only the minimum of required files? This would be nice, I like the purity of this. But maybe we should think: What's good enough to reach this goal? For example, while all those .gitignore files really don't make sense, would removing them make Ansible 6 significantly smaller? I don't think so.

Good point. In that case I'd like this to be a hard coded exclude list and not be configurable by a collection maintainer. If this is configurable it will take some time for people to pick this up and update their repos. Once they have done that, they discover Ansible switched to a "better" (inclusion list) approach. And everything starts anew. This would bring the disadvantages from both solutions to collection maintainers.

@felixfontein
Copy link
Contributor Author

the artefact uploaded to Galaxy serves as the source distribution of the collection (at least that's how it works for community collections), and thus it must contain everything needed to build and develop on the collection.

Sounds like we need a definition of what a release on galaxy should be. Is it a source distribution or is it something for end users?

I don't think we have a choice here. This is basically dictated by the GPL - or at least that's how I understood it. Again: I'm not a lawyer.

If Galaxy would allow to host both a source distribution and a built version, that would be possible.

Linux distributions often have a something like package abc and also abc-devel or similar. Do you mean something like this?

So probably mean abc-src. That's the relevant part here. abc-devel usually only includes headers.

(At least that is how I understand the legal situation. I'm no lawyer.)

Where do you see a legal problem? I see a technical and organizational problem here, but not a legal one.

I mentioned above what I think I understood what the problem is. I am not a lawyer.

@felixfontein
Copy link
Contributor Author

My rationale for this is, that I expect that people forget to manage an explicit exclude list, when they add a scripts/ or .somecoolsaas.yml to their repo. Also people might decide to exclude things that are actually needed (ansible-collections/community.crypto#185 ;))

The changelogs/changelog.yaml file is used by the build tool (antsibull). There is no need to actually install it for end-users. But it must be in the artefact uploaded to Galaxy.

@felixfontein
Copy link
Contributor Author

On the flip-side this approach could break collections, that need stuff outside the normal directories. Currently I cannot think of a case, where this would be an issue (for the collection I maintain). Maybe this is also a case that should break, since it is generally not expected to extend the collection folder structure by individual folders?

That is a valid point. I don't think collections should use files from outside the default directories (i.e. plugins/, roles/ and meta/), but right now they can. I guess we should include that into the requirements for collections included in Ansible (https://github.com/ansible-collections/overview/blob/main/collection_requirements.rst).

@felixfontein
Copy link
Contributor Author

As you say, some of this documentation is redundant. On the other hand, for someone working in an air-gaped environment it might be useful to have scenario guides, filter guides, etc. available offline. But I don't really think so. All in all, I tend to removing docs.

People interested in that can always download the ansible tarball from PyPi and extract it. It contains all the files that are part of the collections, including all docs/ folders. If you have an air-gapped environment, you have likely downloaded it anyway (if you didn't only downloaded the wheels).

I think you should start with removing tests and docs. They're the bulk of stuff you can save. The rest of your list looks OK for me, but it's small fry.

I'm fine with removing these (and a few more obvious ones, like .gitignore, .github/, .azure-pipelines/ and changelogs/). That was actually my initial suggestion anyway.

@mariolenz
Copy link
Contributor

I think you should start with removing tests and docs. They're the bulk of stuff you can save. The rest of your list looks OK for me, but it's small fry.

I'm fine with removing these (and a few more obvious ones, like .gitignore, .github/, .azure-pipelines/ and changelogs/). That was actually my initial suggestion anyway.

I just wanted to point out that removing .gitignore, .github/, .azure-pipelines/ and changelogs/ probably won't help you much to strip down Ansible. But it wouldn't hurt either, imho.

I don't think collections should use files from outside the default directories (i.e. plugins/, roles/ and meta/), but right now they can. I guess we should include that into the requirements for collections included in Ansible (https://github.com/ansible-collections/overview/blob/main/collection_requirements.rst).

I think this would be a very good idea 👍

Your original suggestion looks OK to me, and I can't think of something to add. However, I think it should be documented somewhere that those files and directories will be removed for collections included in Ansible.

@felixfontein
Copy link
Contributor Author

On the flip-side this approach could break collections, that need stuff outside the normal directories. Currently I cannot think of a case, where this would be an issue (for the collection I maintain). Maybe this is also a case that should break, since it is generally not expected to extend the collection folder structure by individual folders?

That is a valid point. I don't think collections should use files from outside the default directories (i.e. plugins/, roles/ and meta/), but right now they can. I guess we should include that into the requirements for collections included in Ansible (https://github.com/ansible-collections/overview/blob/main/collection_requirements.rst).

I created #70 for this.

@felixfontein
Copy link
Contributor Author

I just wanted to point out that removing .gitignore, .github/, .azure-pipelines/ and changelogs/ probably won't help you much to strip down Ansible. But it wouldn't hurt either, imho.

@mariolenz I mainly added these since for them I am very sure that they can be removed without impacting functionality. Assuming a collection does not do some very naughty things :)

@markuman
Copy link
Contributor

I'm fine with removing these (and a few more obvious ones, like .gitignore, .github/, .azure-pipelines/ and changelogs/). That was actually my initial suggestion anyway.

I just wanted to point out that removing .gitignore, .github/, .azure-pipelines/ and changelogs/ probably won't help you much to strip down Ansible. But it wouldn't hurt either, imho.

What about excluding dotfiles/folders in general. That will catch other CI systems and even .gitlab-ci.yml (never say no).

@felixfontein
Copy link
Contributor Author

@markuman Thinking about this a bit more, I think I would only do that in the collection root. I would avoid excluding files from other directories, especially plugins/, roles/, playbooks/, even if they are dotfiles, since who knows what a collection creator was thinking when adding these files :)

@felixfontein
Copy link
Contributor Author

(Also some collections included in Ansible are hosted on GitLab instances. Not sure whether they have a .gitlab-ci.yml though.)

@felixfontein
Copy link
Contributor Author

Right now we have two different strategies:

  1. (reaction: 🎉) Remove everything but some explicit exceptions (meta/, plugins/, roles/, playbooks/, and regular non-hidden files in the root);
  2. (reaction: 🚀) Remove only some small list, including tests/ and docs/, which cause the largest amount of 'unnecessary' content.

Would be great if you could quickly indicate which of these two choices you prefer - or if you prefer a third one, add a comment :) For a quick reaction, use 🎉 for 1 and 🚀 for 2.

@mariolenz
Copy link
Contributor

I suggest 2 for Ansible 6 and then 1 for Ansible 7. This would give collections some time to adapt (if needed).

@felixfontein
Copy link
Contributor Author

@mariolenz for Ansible 7 we'll have to reevaluate the approach anyway, depending on how smooth (or not) things go with Ansible 6.

@felixfontein
Copy link
Contributor Author

The feedback seems to be pretty clear in favor of 2. I'll start formulating something we can vote on (a classic yes/no vote :) ) in this comment; feel free to comment on it if you want to fix the formulation. If "public opinion" doesn't change until say this weekend, I'll create a voting issue on this next week so we'll have a decision in two weeks after that.

From Ansible 6 on, exclude the following directories and files from the installation (and thus from the wheels) for included every collection: tests/, docs/, and every file and directory starting with a period in the collection root ("dot files" like .gitignore, directories like .github/). This approach can be amended (after another vote) if it turns out to cause big problems during the alpha/beta phase of Ansible 6. (For Ansible 7 or a later release, we can of course decide to do something else.)

Does this sound good?

@dmsimard
Copy link

dmsimard commented Mar 4, 2022

From Ansible 6 on, exclude the following directories and files from the installation (and thus from the wheels) for included every collection: tests/, docs/, and every file and directory starting with a period in the collection root ("dot files" like .gitignore, directories like .github/). This approach can be amended (after another vote) if it turns out to cause big problems during the alpha/beta phase of Ansible 6. (For Ansible 7 or a later release, we can of course decide to do something else.)

My understanding is that these files would still be in the source tarball (just not installed) which I believe is worth mentioning for the sake of clarity.

Does this sound good?

I didn't think about this before but came to the realization while I was trying to figure out if this could have an impact on the new collection signature verification feature.

The one thing I ponder on are the implications that installing ansible from PyPI will result in something different than with ansible-galaxy collection install. There's files like MANIFEST.json and FILES.json which would no longer match and would probably fail an ansible-galaxy collection verify but I haven't tested it.

What would be the right approach ?

@felixfontein
Copy link
Contributor Author

New version:

From Ansible 6 on, exclude the following directories and files from the installation (and thus from the wheels) for included every collection: tests/, docs/, and every file and directory starting with a period in the collection root ("dot files" like .gitignore, directories like .github/). All files will still be present in the source tarball on PyPi, they will only not be installed and not be present in wheels.

This approach can be amended (after another vote) if it turns out to cause big problems during the alpha/beta phase of Ansible 6. (For Ansible 7 or a later release, we can of course decide to do something else.)

@felixfontein
Copy link
Contributor Author

I didn't think about this before but came to the realization while I was trying to figure out if this could have an impact on the new collection signature verification feature.

The one thing I ponder on are the implications that installing ansible from PyPI will result in something different than with ansible-galaxy collection install. There's files like MANIFEST.json and FILES.json which would no longer match and would probably fail an ansible-galaxy collection verify but I haven't tested it.

What would be the right approach ?

I already thought a bit about this some time ago. The installed collections will obviously no longer verify, since they are not complete. Which is unfortunate. But at the same time, these collections were not installed with the ansible-galaxy "package manager", but with pip, so if some validation is done it should be done with pip and not with ansible-galaxy.

(Some OS packages for Ansible will already now have a similar problem since they move README files around or not include every file.)

I would say:

  1. If you want validated installations of collections, you need to install them yourself with ansible-galaxy on top of ansible-core.
  2. If you want a "batteries included" package, you can install the Ansible community package from PyPi
  3. You cannot really have both (except by re-installing the collections you want to verify on top of the Python package - but then you are responsible on updating them).

@felixfontein
Copy link
Contributor Author

@ansible-community/steering-committee the above question by @dmsimard is a pretty important one, this is something you all should look at and think about.

@felixfontein
Copy link
Contributor Author

There's a new feature in development that will allow to exactly specify which files to include: ansible/ansible#78422

@felixfontein
Copy link
Contributor Author

So far nobody ever complained about missing files, so I think we are on a good track. I think having one issue for this is enough, so let's keep #126 which is newer.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants