Split packages proposal #1338

Closed
msarahan opened this issue Sep 8, 2016 · 13 comments · Fixed by #1576
Labels
locked [bot] locked due to inactivity

Comments

@msarahan (Contributor) commented Sep 8, 2016

@mingwandroid and I have been discussing split packages as an urgent prerequisite to enabling easier construction of build toolchains (perhaps with crosstools-ng, http://crosstool-ng.org/)

In coming up with ideas for implementation, we're looking at precedent set by Linux distributions.

What we have as initial design ideas for conda-build are:

  • Change the name field to allow lists. Each entry is a separate package. These entries would be used to select logic for installation in the shell scripts.
package:
  name:
    - somename_somefeature1
    - somename_somefeature2
  • Alternatively, keep the name field the same, but add an "outputs" field that lists other outputs. If not specified, this defaults to the normal conda package output.
package:
  name: somename
  outputs:
    - somefeature1
    - somefeature2
  • Prefer automatic collection of files, as is done now, over explicit lists of files. After each install+packaging step, the files would be removed from the build prefix.
  • Prefer functions defined in central shell/batch scripts (as opposed to several scripts), but have conda-build call the correct function/subroutine based on output name.
  • Alternatively, have outputs list as dictionary instead, with values being scripts to run for packaging.
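That last alternative - outputs as a dictionary, keyed by output name with packaging scripts as values - might look like the following. This is an illustrative sketch only; the field names and script filenames are made up, not settled syntax:

```yaml
package:
  name: somename
  outputs:
    # each value is the script that stages that output's files
    # into the build prefix for collection
    somefeature1: install-feature1.sh
    somefeature2: install-feature2.sh
```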

CC @pelson @ocefpaf @JanSchulz - we wanted to involve you, given your involvement with some of these systems. Do you have opinions or battle scars to share? Naming is all negotiable at this point - perhaps something like the "outputs" field is a good idea saddled with a bad name.

@jankatins commented Sep 8, 2016

My "ideas" are in conda/conda#793 (comment). I would prefer a file-based package split, mainly because I think (but without hard data...) that it's the only approach that would scale across different build systems and prevent filename collisions (e.g. if all matplotlib install variants install a base file, how do you add it to only one package?)

[The rest is from the linked comment:]

I would find it better if there is an additional way to add binary packages, which can take specific files and the rest is taken by the main package. Like:

package:
  name: mypackage
  version: 1.0.0

  requirements:
    build:
     # build requirements are for all packages...
      - python
      - .... all the rest, including the qt dependencies...

    run:
      - python
      - numpy

binary-package:
  name: mypackage-pyqt
  run-requirements:
       - pyqt
       - matplotlib {PACKAGE_VERSION} # replaced by the complete version of this package
  files:
        include: 
          - pyqt/*.*
        exclude:
          - pyqt/README.md

binary-package:
  name: mypackage-docs
  run-requirements:
       - matplotlib {PACKAGE_VERSION} # replaced by the complete version of this package
  files:
        include: 
          - docs/*.*
        exclude:
          - docs/README.md

binary-package:
  name: mypackage-tests
  run-requirements:
       - nose
       - mock
       - matplotlib {PACKAGE_VERSION} # replaced by the complete version of this package
  files:
        include: 
          - src/matplotlib/tests

This would build 4 packages: mypackage-tests, mypackage-docs, mypackage-pyqt and mypackage. Each package can be installed as a normal package... In this case, the three additional packages depend on the exact version of the main package, so that updates to e.g. mypackage-pyqt will also update the main package and keep them in sync.

See also the debian dir for the matplotlib debian package, which works similarly, except that the above info is split across multiple files: https://anonscm.debian.org/cgit/python-modules/packages/matplotlib.git/tree/debian

  • control defines the packages and their dependencies
  • *.install tells the build process which files belong to which package
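For illustration, in the debhelper convention each binary package gets a `<package>.install` file that just lists the path globs it claims; the patterns below are hypothetical, modeled on a docs split:

```
usr/share/doc/mypackage/html/*
usr/share/doc/mypackage/examples/*
```

A file like this named `debian/mypackage-docs.install` would route those paths into the `mypackage-docs` binary package, with everything unclaimed falling to the main package.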

@msarahan (Contributor, Author)

Thanks @JanSchulz

I have thought about your idea, and come up with this syntax that I hope captures most or all of what you want, along with some additions:

        outputs:
            - name: filename
              script: script file or list of commands to install files into the build prefix
              script_interpreter: program to use to run script (optional.  Limited autodetection
                                                                when not provided.)
              type: (optional, to support wheel/rpm/deb output someday, defaults to 'conda' for
                        conda package)
              noarch: (optional string identifier of noarch variety)
              requirements: (optional list of runtime requirements for installing the output file)
              test:
                  requirements: (same as conda-build recipe)

                  script: script file or list of commands to execute for testing
                  script_interpreter: program to use to run script (optional. autodetection when
                                                                    not provided)

            # It should be possible to use Jinja2 here to fill in things like the parent package
            #    name and build string, where desired:
            - name: {{ PKG_NAME }}-src_{{ CONDA_BUILD_STRING }}
            # Alternatively, build string could be computed just from runtime dependencies
  • I want all outputs or binary-packages under a single key for easier iteration over them.
  • I think each package should be independently testable (to make sure that you've included everything you need to, for example)
  • This syntax allows people to either just copy files, or to install files and let conda take care of it. The core idea is that however you get things into the build prefix for detection is up to you.
  • I think that for simplicity, these outputs should be in addition to the main output, which retains its default behavior and is built and packaged before any additional outputs, the thinking being that subsequent outputs may use the primary output.
  • The script and script_interpreter stuff is meant to facilitate running bash on windows, for example.
  • The type is something of a placeholder to help in future specification of building wheels, for example.
  • noarch spec is because some packages (like source packages or test data) will be noarch, while others might be specialized. noarch string stuff is in Add build/noarch to recipe metadata. #1285 and Insert value of noarch into index file #1334
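Putting the fields above together, a complete (hypothetical) outputs section for a package split into a docs output and a source output might read as follows. All names and script files here are invented for illustration; only the field layout follows the sketch above:

```yaml
outputs:
    - name: mypkg-docs
      script: install-docs.sh          # copies built docs into the build prefix
      noarch: generic                  # docs are the same on every platform
      requirements:
          - mypkg {{ PKG_VERSION }}    # pin subpackage to the parent version
      test:
          script: test-docs.sh
    - name: "{{ PKG_NAME }}-src_{{ CONDA_BUILD_STRING }}"
      script: install-src.sh
      noarch: generic
```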

CC @jjhelmus - for wheels.

@msarahan (Contributor, Author)

In discussion with @mingwandroid, he would like to be able to take something like thrift, which has several supported language backends, and break out the backends within one recipe. I think this will not work well, because each backend may then be split further (for example, Python 2.7/3.4/3.5). I think that a constraint here is a one-to-one mapping between output entries in the recipe and built packages. Additionally, although the script entry may be used to build and deploy arbitrarily, I would discourage it, as it skirts most of conda-build's logic for setup and such, and ultimately may not work well.

@jankatins commented Sep 14, 2016

Do I understand correctly that if I want to split a package, e.g. libpng, into header and library, I would have to create two install scripts and would need to compile+install it twice, plus do different "deletes" in $PREFIX to get the right filesets? If so: how would that handle long-running builds of big libs, which would now need to be done twice? Or is the build script kept, and only the install step done in the new scripts under outputs?

I still think using something like the following is easier:

build:
     [whatever was done before, e.g. use the old script to compile a python/... whatever package and install it into PREFIX]

outputs:
            - name: packagename
              filelist: [globs*, files] # instead of scripts
              type: [... rest of your proposal...]
            - name: packagename2
              filelist: [...]
              include-leftover-files: True # optional, only once, gets all the files which are not in other packages

[This is inspired by the debian system of splitting a package]

Why: as far as I know you can't really split a package in any language other than Python. At least makefile-based systems are either split at configure time or by selectively "including" files from the installed directory tree. So if the build script is kept (= only one build) and the scripts under outputs are used only for installing things, you are either running make install multiple times and removing all the files which do not belong to the package, or you duplicate the makefile to install only some of the files.

@msarahan (Contributor, Author)

No, not really. The way it would work would be:

  1. normal build, as it already is, happens.
  2. normal packaging, as it already is, happens. After installing files to build prefix, then packaging them, they will be removed.
  3. output package entries will be run. Any files copied to the build prefix will make up that output package. These files will be removed after each step.

The actual building is done once, but the packaging can be split across more steps. Also, you don't need to create separate files for simple commands. For example, if there were several different make install targets, you could just run each of them - one per output. This is effectively offloading the listing of files to the install step for a given package (or subpackage).
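The three steps above can be sketched as a shell loop. Everything here is invented for illustration - this is not conda-build's actual code, and a bare `touch` stands in for each install script:

```shell
#!/bin/sh
set -e
PREFIX=$(mktemp -d)              # stand-in for the build prefix

# Steps 1+2: the normal build installs files, and normal packaging
# archives whatever is in the prefix; the prefix is then emptied.
touch "$PREFIX/main.bin"
tar -cf main-pkg.tar -C "$PREFIX" .
rm -rf "$PREFIX"/*

# Step 3: each output entry repopulates the prefix; whatever appears
# becomes that output's package, and the prefix is emptied again.
for output in docs tests; do
  touch "$PREFIX/$output.txt"    # stand-in for running the output's script
  tar -cf "$output-pkg.tar" -C "$PREFIX" .
  rm -rf "$PREFIX"/*
done

ls ./*-pkg.tar
rm -rf "$PREFIX"
```

Because the prefix is cleared between steps, each archive contains only what its own step staged - the split happens at packaging time, with no rebuild.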

I'm not opposed to a filelist, but I do think it is more trouble. I absolutely will not remove scripts in favor of a filelist, but would be OK with supporting both.

@jankatins commented Sep 14, 2016

I still don't get it: let's assume a package where the upstream installer installs two files (LIB and HEADER) which should end up in different packages. The current "build and install" script would copy both of them to $PREFIX, like in the following build.sh file:

./configure --prefix=$PREFIX # probably not needed due to conda's rewriting?
make
make install DESTDIR=$PREFIX

After the "build" step, the $PREFIX dir will have both files installed. Under the "old" rules, both would end up in a package named after the source package; I'm not sure whether that is still the case when there is an outputs section. If not, would I need to remove the make install from my current build.sh?

What would the outputs scripts then do? Does each one do another install and then remove the files which should not belong to the current package?

@msarahan (Contributor, Author)

Here's a few examples we have come up with:

  1. "source packages" - in some cases, it is useful to archive the source with a built package. Keeping the original tarball around is certainly one way. Being able to conda install it might make life easier. Such a package might overlap with the main package installed by a make install step. We can avoid that by convention, or allow it, but mitigate potential confusion by pinning versions to the parent package.
  2. If you want to split lib and header, but your make install script installs both, then yes, you would need to not use it in build.sh, but rather copy or move the appropriate files in output scripts or filelist listings. If you're OK with file overlaps (plus any safety mitigation), then you could just run make install in the main package, and have that be a complete package, but then others be subsections. We have such a thing with postgresql - we have a complete package, and then a separate package for just libpq for clients.
  3. If you are building a toolchain with crosstool-ng, you'll build glibc, gcc, and several other things at once. Here the master package really doesn't make sense, and all we're doing is avoiding the complete build process for each step - do it once, and then break up the pieces.
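For the lib/header case in point 2, the per-output scripts would reduce to copying the right subset out of a staging tree that a single `make install DESTDIR=...` populated once. A minimal sketch, with touched files standing in for the real build artifacts and all paths invented:

```shell
#!/bin/sh
set -e
# One real build + install into a staging tree happens once in build.sh;
# two touched files stand in for the resulting install tree here.
STAGE=$(mktemp -d)
PREFIX=$(mktemp -d)
mkdir -p "$STAGE/include" "$STAGE/lib"
touch "$STAGE/include/png.h" "$STAGE/lib/libpng.so"

# Output script for a headers-only subpackage: copy just the include tree
# into the (empty) build prefix; conda-build would then package whatever
# it finds there. A sibling output script would copy lib/ instead.
mkdir -p "$PREFIX/include"
cp "$STAGE"/include/*.h "$PREFIX/include/"

ls -R "$PREFIX"
```

No second compile is needed; the expensive step runs once and each output script only moves files.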

If file overlaps in these packages are a problem, we can perhaps work around it by making the subpackages be the "owners" of files, and compose the top-level packages as metapackages of the subpackages. We can hybridize, too - partially metapackage, partially real files. In this case, it may be advantageous to create all of the subpackages first, and then have the main package pick up any unhandled files, as you mention.

@jankatins commented Sep 15, 2016

I think the most important use case is the "split header and lib" case, and if that ends up being the hardest case to achieve, that's not good.

E.g. for matplotlib and python packages in general, I'm not sure there is a way to do a python setup.py install which skips the build step (at least I never managed it on windows). So if you want to split matplotlib into one package with python code + "normal" backends and a package just for the pyqt backend, you end up doing the build twice: basically you write a script which removes the pyqt backend (in the default build.sh), then duplicate it, call it from the output key, and in there remove all the other stuff.

In that case, the existence of an output key would indicate that only the package names below the output key are built as binary packages, and you would explicitly need to add a package with the same name as the source package.

For 1), if you want to build a source package, you would need to do a python setup.py sdist in the normal build.sh, copy it somewhere (not sure if there is an environment variable which points to the source: if so, it would just be a copy) and then pick the file "as usual" in the file specifications.

  3) is also possible with such a file-based split: do the build in one gigantic build.sh file, install, and then pick the files. It's a bit harder, because you have to juggle all the files which would otherwise be nicely split by individual installs. But then what's preventing you from building a common cross-compiler setup script, putting it into a build dependency, and then changing the individual packages to use it, so that you end up with individual packages (+build.sh) per upstream package, like now?

Debian packages and RPM-based packages are both built from filespecs: why throw that "experience" away?

@mingwandroid (Contributor)

There are many use-cases of split packages, and no one type should be used to limit the design of this feature.

ArchLinux and MSYS2 go for the procedural route using bash as the language. This offers complete flexibility. If you just want to copy files, you can use cp and if you want to do make install-subtarget you can do that too.

My issue with filelists is that they are limiting and require constant gardening (e.g. updates to a package mean that a new file extension gets added, or some *.dat files now belong in one package and some in another). Being able to rely on the build system's sub-targets (where present) seems to me to be a good thing.

@pelson (Contributor) commented Sep 16, 2016

In https://groups.google.com/a/continuum.io/forum/#!topic/conda/qss8IlzxweI I proposed the idea of having multiple build environments for a single recipe - the build script is then responsible for putting appropriate files in the appropriate build environment (rather than listing files a-la RPM).

I think my suggestion from 18 months ago still has some legs, though perhaps some of the metadata could be refined somewhat.

@msarahan (Contributor, Author)

Thanks for pointing that out @pelson. I think that's very compatible with what I've proposed above. Sorry I wasn't around to see it the first time!

I'm going to start hacking a prototype out. My intent is to support both file lists and scripts, but both will be based on "clean build environments" which conda then treats as it does already - just two different ways to copy files there. Thanks everyone for your feedback.

@msarahan (Contributor, Author)

Right now I'm pondering how to handle file collisions and association of subpackages with parent packages. I think that we should disallow file collisions. What I've come up with as a way to avoid them is the following scheme:

  1. Compose parent packages from subpackages. If a recipe has an outputs section, the contents of that section replace the standard packaging behavior. This means people should cut any existing install script out of bld.bat or build.sh and break it up across the subpackage outputs. Parent packages should list any desired subpackages as runtime dependencies to compose a "complete" install, which may omit things like tests or source.
  2. Check file lists, and raise errors for any files that exist in more than one subpackage
  3. Tie subpackages to parent packages with both the parent package version, and a unique identifier. Right now, I'd like to do a hash, perhaps base64 encoded for brevity. Hash the recipe. If the recipe uses path or some other locally-alterable source, hash that also and lump it together with the recipe hash. This hash is necessary, I think, because managing build numbers for subpackages seems likely to get confusing.
  4. When a subpackage is listed as a runtime dependency for the parent package, pin the version exactly with the hash. People should always also be able to install subpackages separately with just the version (not the hash), so it is important that the hash not replace the version.
  5. Different package types (wheel vs conda package) would be allowed to contain the same files. Output types could have a flag whether to enforce the rule for a given type, with the default being to disallow (enforce).
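The hash in point 3 could be computed over the recipe contents; a minimal sketch, where the truncation length and the pin format on the last line are assumptions, not decided syntax:

```shell
#!/bin/sh
set -e
RECIPE=$(mktemp -d)
printf 'package:\n  name: mypkg\n  version: 1.0.0\n' > "$RECIPE/meta.yaml"

# Hash every file in the recipe directory in a stable (sorted) order,
# then truncate for brevity - 8 hex chars is an arbitrary choice here.
# On macOS, substitute 'shasum -a 256' for 'sha256sum'.
HASH=$(cat $(find "$RECIPE" -type f | sort) | sha256sum | cut -c1-8)

# Hypothetical pin a parent package would carry for its subpackage:
echo "pin: mypkg ==1.0.0 $HASH"
```

Because the hash is derived from the recipe (plus any local sources lumped in), two builds of the same recipe pin identically, while any recipe change breaks the pin without anyone having to manage build numbers by hand.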

@github-actions (bot)

Hi there, thank you for your contribution!

This issue has been automatically locked because it has not had recent activity after being closed.

Please open a new issue if needed.

Thanks!

@github-actions github-actions bot added the locked [bot] locked due to inactivity label Apr 18, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2022