V2 #15

Merged
merged 38 commits into from
Jun 4, 2015

Conversation

hexylena
Member

I recognise that this is a huge PR, so please take your time to review and comment. :)

  • chado build added (mostly for my purposes/GMOD's)
  • new folder layout, <package>/<version>, where the version is usually 'default' for the base recipe, and then subsequent/specific versions of recipes will be added in their own folders.
    • this is done in concert with the check-for-updates.py script which should automatically update certain pieces of software.
  • v2 features a .yaml format for image/build metadata (a sketch of the layout follows this list).
    • see chado's, as it uses 100% of the spec.
    • image metadata is stored in the 'meta' section. At runtime these values are available in the environment, augmented by any command line parameters that change the image/version, and with the package name derived from the folder
    • environment variables can further be specified in their own ENV section (probably overkill, but logically separate from meta)
    • prebuild covers things that happen in the Dockerfile, allowing you to pull all of the packages/dependencies into that layer. The packages key is a text field of packages, while the commands key lets you specify a list of commands to run. Useful for things like cpan-ing perl dependencies.
    • The build section concerns itself with things occurring during the build phase.
      • urls are a list of urls to download via wget (wget, ca-certificates (so --no-check-certificate isn't needed), and build-essential are provided by default).
      • commands are a list of commands to execute. They start execution from the /build/ directory. Downloaded archives are not extracted for you (leaving them unextracted is wanted in select cases, e.g. atlas).

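To make the layout concrete, here's a minimal, hypothetical build.yml sketch following the sections above. The package name, URL, and the exact keys under meta (image/version) are invented for illustration; see the chado recipe for a real, complete example.

```yaml
# Hypothetical sketch of the v2 build.yml layout; package name, URL, and the
# key placement of image/version under 'meta' are assumptions for illustration.
meta:
  image: debian-wheezy        # assumed: overridable from the command line
  version: "1.2"              # assumed: exposed as ${version} at build time

env:
  MAKE_JOBS: "4"              # extra environment variables, separate from meta

prebuild:
  # installed in the Dockerfile on top of wget, ca-certificates, build-essential
  packages: |
    zlib1g-dev libncurses5-dev
  commands:
    - cpan -i Bio::Perl       # e.g. cpan-ing perl dependencies

build:
  urls:
    # fetched with wget into /build/; archives are not auto-extracted
    - https://example.org/mytool/mytool-${version}.tar.gz
  commands:
    # build commands start execution from /build/
    - tar xzf mytool-${version}.tar.gz
    - cd mytool-${version} && make && make install
```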
This probably shouldn't be merged as-is, as there are some packages that fail to build/aren't functional. E.g. bcftools removed version 1.0 from their github, so I was forced to update to 1.2. Atlas has some build issues, but those are beyond my knowledge. All packages that people added should be manually built once just to verify :)

Eric Rasche added 14 commits April 21, 2015 10:14
@natefoo this doesn't build properly, and I'm not sure where the bug
is, just a reminder to check this before this branch gets merged. No
action needed right now.
- support changing version
- use 'default' recipes
- fixes for 'image' metadata
- fixes for ${version}
This is an EXTREMELY experimental tool. Checking for updates by hand is a waste
of time, but we should keep up with updates to the software we care about.
Thus, `check-for-updates` was born.

It inspects the 'default' entry of every package with a `build.yml`
file. If that's found, then it extracts the urls from the urls section,
and checks for those with a `${version}` in them. When those are found,
it uses some hardcoded logic to check for more recent releases. When a
more recent release is found, it will copy the build.yml file across,
and generate a new folder/version with the updated version number value.

This should let the docker image for the original version be run once and that
archive generated. Later, updates can be checked for, and any missing
packages can be generated.
- Apparently ca-certs is included when you install wget from the command
  line, but not in the automated way we do it.
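For reference, the check-for-updates flow described in the commit message above could look roughly like this. A sketch only: it assumes PyYAML, and find_newer_versions() is a hypothetical stub standing in for the hardcoded per-upstream release checks.

```python
# Sketch of the check-for-updates flow; find_newer_versions() is a placeholder
# for the hardcoded per-upstream logic mentioned in the commit message.
import os
import shutil
import yaml


def find_newer_versions(versioned_urls):
    # Placeholder: the real script uses hardcoded logic per upstream site.
    return []


def check_package(package_dir):
    default_yml = os.path.join(package_dir, 'default', 'build.yml')
    if not os.path.exists(default_yml):
        return
    with open(default_yml) as handle:
        recipe = yaml.safe_load(handle) or {}
    urls = (recipe.get('build') or {}).get('urls') or []
    versioned = [url for url in urls if '${version}' in url]
    for new_version in find_newer_versions(versioned):
        target = os.path.join(package_dir, new_version)
        if not os.path.exists(target):
            os.makedirs(target)
            # Copy the default recipe across; the real script also writes the
            # updated version number into the new copy.
            shutil.copy(default_yml, os.path.join(target, 'build.yml'))


if __name__ == '__main__':
    for package in sorted(os.listdir('.')):
        if os.path.isdir(package):
            check_package(package)
```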
@hexylena
Member Author

TODO (not necessarily as part of this PR):

  • stash downloaded archives in a separate folder during build phase. I don't like things like bcftools 1.0 archives disappearing. :(

Eric Rasche added 3 commits April 22, 2015 15:09
@bgruening
Member

I have a few problems, need to investigate this further ...

@hexylena
Member Author

@bgruening okay.

I don't want to force my view of "v2" on anyone, it didn't take that long to hack this out and anything can be changed if you have needs.


prebuild_packages = ['wget', 'openssl', 'ca-certificates', 'build-essential']
if 'prebuild' in image_data and 'packages' in image_data['prebuild']:
    prebuild_packages.extend(image_data['prebuild']['packages'].strip().split())
Member

I suppose it's easier for cuttin' and pastin' if the packages are a blob, but being part of structured data it feels somewhat icky that a list is a string.

Member Author

Agreed. I'll make that accept a list or a string, for those who want to put effort into refactoring their package lists.
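A sketch of what that could look like in build.py, accepting either form (assumes the image_data dict from the snippet above; not necessarily how it ended up being implemented):

```python
# Sketch: accept prebuild packages either as a whitespace-separated string
# blob or as a proper YAML list, normalising to a list either way.
prebuild_packages = ['wget', 'openssl', 'ca-certificates', 'build-essential']
if 'prebuild' in image_data and 'packages' in image_data['prebuild']:
    packages = image_data['prebuild']['packages']
    if isinstance(packages, str):
        packages = packages.strip().split()
    prebuild_packages.extend(packages)
```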

@bgruening
Member

Eric;
in general I like this improvement a lot. But I do think this overcomplicates it a little bit and does not solve #3 completely (still, it improves it!).

Dockerfiles are simple enough to write, at least for the target users we are aiming for (us? :)). So why do we need the yaml abstraction at all? If I understood your code correctly, you are creating a shell script out of the yaml definitions and using a reusable/cached Dockerfile with build-essentials. Wouldn't it be easier to turn build.sh into a build.Dockerfile and merge the Dockerfiles during the build process (nate's initial Dockerfile and the one specified by the user)? If we want to include shell scripts, simply put them into the directory next to the user's Dockerfile and we will COPY all of those files into the container so we can use them in build.Dockerfile.

Am I missing something obvious?
Thanks Eric!

@hexylena
Member Author

@bgruening

does not solve #3 completely

What am I missing? (Other than the --nocache flag)

If I understood your code correctly, you are creating a shell script out of the yaml definitions and using a reusable/cached Dockerfile with build-essentials.

Well, some of the build stuff from the yaml goes into the image. The image isn't just build-essential, it's that plus all listed packages (e.g. see chado image) and any commands you want in the image. That's why I calculate an image name in the code, as all of the images are different due to installing different packages.

So why do we need the yaml abstraction at all?

We probably don't; I like the cleanliness of writing a pure metadata file rather than tying us to docker, in case someone wants to reuse the metadata with some other build system. Do you prefer thinking about docker layers and stringing dozens of commands together with && to a list of commands which are strung together for you? I personally don't particularly enjoy writing dockerfiles as much as I do writing a generalised data structure and knowing that at any time we will be able to go back and recreate every single image with tweaks.

E.g. I'll be able to patch a command into build.py to store the downloaded URLs so we have the original .tar.gz (for reproducibility) in one place, and then rerun/regenerate all of the docker images. If we hardcode the images, we are either merging two dockerfiles together, which is non-trivial (as the urls would be in the 'custom' part of the image), or I'm patching that into N images and every new one we generate.
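For illustration, the image-name calculation could boil down to hashing the prebuild inputs, along these lines. A sketch only: the repository name and the exact inputs hashed are assumptions, not necessarily what build.py does.

```python
# Sketch: derive a deterministic image name from the prebuild inputs so that
# recipes installing the same packages/commands reuse one cached image.
import hashlib


def calculate_image_name(base_image, prebuild_packages, prebuild_commands):
    digest = hashlib.sha1()
    digest.update(base_image.encode('utf-8'))
    for item in sorted(prebuild_packages) + list(prebuild_commands):
        digest.update(item.encode('utf-8'))
    # 'docker-build-base' is an invented repository name for this example.
    return 'docker-build-base:%s' % digest.hexdigest()[:12]
```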

@bgruening
Member

does not solve #3 completely

What am I missing? (Other than the --nocache flag)

OpenMS for example downloads more than 200MB and the first commands take ages (compiling boost etc.). It would be nice to have this cached, or at least to have control over it.

If I understood your code correctly, you are creating a shell script out of the yaml definitions and using a reusable/cached Dockerfile with build-essentials.

Well, some of the build stuff from the yaml goes into the image. The image isn't just build-essential, it's that plus all listed packages (e.g. see chado image) and any commands you want in the image. That's why I calculate an image name in the code, as all of the images are different due to installing different packages.

Indeed, and this is a very nice improvement.

So why do we need the yaml abstraction at all?

We probably don't; I like the cleanliness of writing a pure metadata file rather than tying us to docker, in case someone wants to reuse the metadata with some other build system. Do you prefer thinking about docker layers and stringing dozens of commands together with && to a list of commands which are strung together for you?

The && gives you control over the caching. You don't need to do that; it's not needed at all.

I personally don't particularly enjoy writing dockerfiles as much as I do writing a generalised data structure and knowing that at any time we will be able to go back and recreate every single image with tweaks.

I don't see where yaml has an advantage over a plain Dockerfile here.
I'm pretty sure you have more freedom in a Dockerfile than in a yaml file, at least without adding logic to build.py.

E.g. I'll be able to patch a command into build.py to store the downloaded URLs so we have the original .tar.gz (for reproducibility) in one place, and then rerun/regenerate all of the docker images. If we hardcode the images, we are either merging two dockerfiles together, which is non-trivial (as the urls would be in the 'custom' part of the image), or I'm patching that into N images and every new one we generate.

Yes, I understand this, but the question is what we want to archive here?
For me this is a hack (sorry @natefoo ) to produce binaries, because we are simply not able to do this on every machine. We don't want to create a new package format, like brew, easyinstall, conda. We could think about using these projects to create our binaries, but creating a new layer and a package definition feels over-engineered / replicated. Still cool :) Don't get me wrong, this is a huge improvement; I'm just playing Devil's advocate and asking about the scope of the project.

@hexylena
Member Author

OpenMS for example downloads more than 200MB and the first commands take ages (compiling boost etc.). It would be nice to have this cached, or at least to have control over it.

hmm. Agreed. For this, you could place the wgets in the prebuild/command section, and then those would be cached in the Dockerfile rather than in the build script: https://github.com/galaxyproject/docker-build/blob/v2/openms/default/build.yml Admittedly, that's not as "pretty" as url fetching being part of the dockerfile. I could definitely move the url building to the dockerfile.
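For example, a recipe could hoist the big download into prebuild so the Docker layer cache keeps it. A hypothetical snippet: the URL is invented, and it assumes prebuild commands can write to a location that is still visible at build time.

```yaml
# Hypothetical: fetch the large archive during the prebuild (Dockerfile)
# phase so Docker's layer cache retains it across builds.
prebuild:
  packages: |
    cmake
  commands:
    - mkdir -p /build
    - wget -q -O /build/OpenMS.tar.gz https://example.org/OpenMS/OpenMS-2.0.tar.gz

build:
  commands:
    # build commands run from /build/, where prebuild stashed the archive
    - tar xzf OpenMS.tar.gz
    - cd OpenMS-2.0 && cmake . && make && make install
```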

You don't need to do that it's not needed at all.

fair point.

I'm pretty sure you have more freedom in a Dockerfile than in a yaml file, at least without adding logic to build.py.

you do! Definitely. I had trouble converting your packages because you use the squeeze backports, and there wasn't an easy way to stick that in before the package fetching.

For me this is a hack

it always was, still is.

We don't want to create a new package format, like brew, easyinstall, conda.

definitely not! @jmchilton teased me about this in IRC, I agree. I don't want another packaging format. Since this can ONLY be used to build docker images of packages right now...it feels like we're not trying to do that.

creating a new layer and a package definition feels over engineered / replicated.

I agree. It's definitely over-engineered compared to writing Dockerfiles, but I'd argue that this is the cleaner, easier-to-refactor approach should we decide to add features in the future. Given how the feature scope has crept in the past, I expect that this will be an easier format to work with going forward, given that we can regenerate ALL of the dockerfiles at once, from metadata. That's the crux of my argument: if we decide to do different things with the dockerfiles (e.g. my storing of packages), all of our code stays DRY.

but the question is what we want to archive here?

(We=me) want to archive all the artifacts of the build process. Inputs and outputs. This happens on a VM somewhere, so I don't care about space/runtime/etc. I just don't like having bedtools fail with the version specified in build.sh because they removed the file, so now we can never redo that build. I really understand that we probably won't ever want to redo that build, but I don't see that as a reason not to store artifacts.

@bgruening
Member

@erasche can you come up with a different use case for which we need metadata? With simple rules like: "Put all downloaded packages in /downloads" we can get the same result and archive all tarballs.

I just don't like having bedtools fail with the version specified in build.sh because they removed the file, so now we can never redo that build.

Yes, an old problem, but it's not a problem we can fix here. I'm more worried about the debian team disabling the archive :)

My main argument is: This is a hack, so let's use hacky scripts to get our binaries and not split the logic into (wonderful) yaml + a backend that needs to be constantly updated whenever we need a new feature.

@hexylena
Member Author

@bgruening I can't come up with another use case for the metadata. All of my use cases are theoretical "someone might want to do this" and that it's prettier/cleaner.

I understand what you're saying, and I don't have anything to refute it with. It's a hack, it doesn't need to be over-engineered. If you'd prefer, I'll back out all of my changes to other packages, and will settle for just adding this in as another, optional build system.

@hexylena
Member Author

@natefoo any comments?

@bgruening
Member

Feel free to merge as it is, this was not a -1. Just wanted to discuss the direction.

@hexylena
Member Author

That said, I don't think we need to make this easily portable to anything other than docker.

okay! Less work! Yay! :P

The scope is creeping and if it limits us being able to do the core task of what we want to accomplish,

yeah. Well, let's keep all of the build scripts (for the time being) and call it good?

@hexylena
Member Author

FYI, I've been building against this branch in jenkins and we had our first completely successful build!!

https://gx.hx42.org/job/Docker-Build/18/

Clicking "Expand all" will show all of the build files and build logs. Build logs/outputs will need to be checked manually at some point, but most things seem to be building at least semi-correctly.

I would not automatically pull yet, lest any 100% real/known correct artifacts get overwritten.

NB: It's surprising that it's completely successful because ugly hacks were needed to get there. With only 20 GB of disk space, jenkins crashes and burns after about 4-5 image builds due to the disk space required for layers. Thus, I had to insert some steps in between the build.py steps to regularly wipe out all unused images and remove them from disk.

@hexylena
Member Author

Any -1s or should this be merged now that it's ~3 weeks old?

@natefoo
Member

natefoo commented May 13, 2015

@erasche you mentioned keeping the build scripts, should I add them back?

@hexylena
Member Author

@natefoo you can add them back if you wish, but most everything converted builds properly. You're probably right, I'll re-add the original build.sh files and they can all co-exist until $later_time. I'll ping you when that happens.

@hexylena
Member Author

hexylena commented Jun 4, 2015

@natefoo okay, this is ready to merge.

  • the old build system and its files are 100% retained
  • the new build system is added as an alternative
  • build-all.sh lets you build everything using a preferred build system. Only the stuff I personally know works has migrated to that.

@natefoo
Member

natefoo commented Jun 4, 2015

👍 from me. @bgruening, any comments?

@bgruening
Member

Let's merge this! Awesome work Eric.
