V2 #15

Merged
merged 38 commits into from
Jun 4, 2015

Conversation

hexylena
Member

I recognise that this is a huge PR, so please take your time to review and comment. :)

  • chado build added (mostly for my purposes/GMOD's)
  • new folder layout, <package>/<version>, where the version is usually 'default' for the base recipe, and then subsequent/specific versions of recipes will be added in their own folders.
    • this is done in concert with the check-for-updates.py script which should automatically update certain pieces of software.
  • v2 features a .yaml format for image/build metadata (a sketch of the layout follows this list).
    • see chado's, as it uses 100% of the spec.
    • image metadata is stored in the 'meta' section. At runtime these values are available in the environment, augmented by any command line parameters that change the image/version, and with the package name derived from the folder
    • environment variables can further be specified in their own ENV section (probably overkill, but logically separate from meta)
    • prebuild covers things that happen in the Dockerfile, allowing you to pull all of the packages/dependencies into that layer. The packages key is a text field of packages, while the commands key lets you specify a list of commands to run. Useful for things like cpan-ing perl dependencies.
    • The build section concerns itself with things occurring during the build phase.
      • urls are a list of urls to download via wget (wget, ca-certificates (so --no-check-certificate isn't needed), and build-essential are provided by default).
      • commands are a list of commands to execute. They start execution from the /build/ directory. Downloaded archives are not extracted for you (leaving them unextracted is wanted in select cases, e.g. atlas).

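To make the layout concrete, here's a minimal, hypothetical build.yml sketch following the sections above. The package name, URL, and the exact keys under meta (image/version) are invented for illustration; see the chado recipe for a real, complete example.

```yaml
# Hypothetical sketch of the v2 build.yml layout; package name, URL, and the
# key placement of image/version under 'meta' are assumptions for illustration.
meta:
  image: debian-wheezy        # assumed: overridable from the command line
  version: "1.2"              # assumed: exposed as ${version} at build time

env:
  MAKE_JOBS: "4"              # extra environment variables, separate from meta

prebuild:
  # installed in the Dockerfile on top of wget, ca-certificates, build-essential
  packages: |
    zlib1g-dev libncurses5-dev
  commands:
    - cpan -i Bio::Perl       # e.g. cpan-ing perl dependencies

build:
  urls:
    # fetched with wget into /build/; archives are not auto-extracted
    - https://example.org/mytool/mytool-${version}.tar.gz
  commands:
    # build commands start execution from /build/
    - tar xzf mytool-${version}.tar.gz
    - cd mytool-${version} && make && make install
```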
This probably shouldn't be merged as-is, as there are some packages that fail to build/aren't functional. E.g. bcftools removed version 1.0 from their github, so I was forced to update to 1.2. Atlas has some build issues, but those are beyond my knowledge. All packages that people added should be manually built once just to verify :)

Eric Rasche added 14 commits April 21, 2015 10:14
@natefoo this doesn't build properly, and I'm not sure where the bug
is, just a reminder to check this before this branch gets merged. No
action needed right now.
- support changing version
- use 'default' recipes
- fixes for 'image' metadata
- fixes for ${version}
This is an EXTREMELY experimental tool. Checking for updates by hand is a waste
of time, but we should keep up with updates to the software we care about.
Thus, `check-for-updates` was born.

It inspects the 'default' entry of every package with a `build.yml`
file. If that's found, then it extracts the urls from the urls section,
and checks for those with a `${version}` in them. When those are found,
it uses some hardcoded logic to check for more recent releases. When a
more recent release is found, it will copy the build.yml file across,
and generate a new folder/version with the updated version number value.

This should let the docker image for the original version be run once and that
archive generated. Later, updates can be checked for, and any missing
packages can be generated.
- Apparently ca-certs is included when you install wget from the command
  line, but not in the automated way we do it.
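For reference, the check-for-updates flow described in the commit message above could look roughly like this. A sketch only: it assumes PyYAML, and find_newer_versions() is a hypothetical stub standing in for the hardcoded per-upstream release checks.

```python
# Sketch of the check-for-updates flow; find_newer_versions() is a placeholder
# for the hardcoded per-upstream logic mentioned in the commit message.
import os
import shutil
import yaml


def find_newer_versions(versioned_urls):
    # Placeholder: the real script uses hardcoded logic per upstream site.
    return []


def check_package(package_dir):
    default_yml = os.path.join(package_dir, 'default', 'build.yml')
    if not os.path.exists(default_yml):
        return
    with open(default_yml) as handle:
        recipe = yaml.safe_load(handle) or {}
    urls = (recipe.get('build') or {}).get('urls') or []
    versioned = [url for url in urls if '${version}' in url]
    for new_version in find_newer_versions(versioned):
        target = os.path.join(package_dir, new_version)
        if not os.path.exists(target):
            os.makedirs(target)
            # Copy the default recipe across; the real script also writes the
            # updated version number into the new copy.
            shutil.copy(default_yml, os.path.join(target, 'build.yml'))


if __name__ == '__main__':
    for package in sorted(os.listdir('.')):
        if os.path.isdir(package):
            check_package(package)
```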
@hexylena
Member Author

TODO (not necessarily as part of this PR):

  • stash downloaded archives in a separate folder during build phase. I don't like things like bcftools 1.0 archives disappearing. :(

Eric Rasche added 3 commits April 22, 2015 15:09
@bgruening
Member

I have a few problems, need to investigate this further ...

@hexylena
Member Author

@bgruening okay.

I don't want to force my view of "v2" on anyone, it didn't take that long to hack this out and anything can be changed if you have needs.


prebuild_packages = ['wget', 'openssl', 'ca-certificates', 'build-essential']
if 'prebuild' in image_data and 'packages' in image_data['prebuild']:
    prebuild_packages.extend(image_data['prebuild']['packages'].strip().split())
Member

I suppose it's easier for cuttin' and pastin' if the packages are a blob, but being part of structured data it feels somewhat icky that a list is a string.

Member Author

Agreed. I'll make that accept a list or a string, for those who want to put effort into refactoring their package lists.
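A sketch of what that could look like in build.py, accepting either form (assumes the image_data dict from the snippet above; not necessarily how it ended up being implemented):

```python
# Sketch: accept prebuild packages either as a whitespace-separated string
# blob or as a proper YAML list, normalising to a list either way.
prebuild_packages = ['wget', 'openssl', 'ca-certificates', 'build-essential']
if 'prebuild' in image_data and 'packages' in image_data['prebuild']:
    packages = image_data['prebuild']['packages']
    if isinstance(packages, str):
        packages = packages.strip().split()
    prebuild_packages.extend(packages)
```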

@bgruening
Member

Eric;
in general I like this improvement a lot. But I do think this overcomplicates it a little bit and does not solve #3 completely (still, it improves it!).

Dockerfiles are simple enough to write, at least for the target users we are aiming for (us? :)). So why do we need the yaml abstraction at all? If I understood your code correctly, you are creating a shell script out of the yaml definitions and using a reusable/cached Dockerfile with build-essentials. Wouldn't it be easier to turn build.sh into a build.Dockerfile and merge the Dockerfiles during the build process (nate's initial Dockerfile and the one specified by the user)? If we want to include shell scripts, simply put them into the directory next to the user's Dockerfile and we will COPY all of those files into the container so we can use them in build.Dockerfile.

Am I missing something obvious?
Thanks Eric!

@hexylena
Member Author

@bgruening

does not solve #3 completely

What am I missing? (Other than the --nocache flag)

If I understood your code correctly, you are creating a shell script out of the yaml definitions and using a reusable/cached Dockerfile with build-essentials.

Well, some of the build stuff from the yaml goes into the image. The image isn't just build-essential, it's that plus all listed packages (e.g. see chado image) and any commands you want in the image. That's why I calculate an image name in the code, as all of the images are different due to installing different packages.

So why do we need the yaml abstraction at all?

We probably don't; I like the cleanliness of writing a pure metadata file rather than tying us to docker, in case someone wants to reuse the metadata with some other build system. Do you prefer thinking about docker layers and stringing dozens of commands together with && to a list of commands which are strung together for you? I personally don't particularly enjoy writing dockerfiles as much as I do writing a generalised data structure and knowing that at any time we will be able to go back and recreate every single image with tweaks.

E.g. I'll be able to patch a command into build.py to store the downloaded URLs so we have the original .tar.gz (for reproducibility) in one place, and then rerun/regenerate all of the docker images. If we hardcode the images, we are either merging two dockerfiles together, which is non-trivial (as the urls would be in the 'custom' part of the image), or I'm patching that into N images and every new one we generate.
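For illustration, the image-name calculation could boil down to hashing the prebuild inputs, along these lines. A sketch only: the repository name and the exact inputs hashed are assumptions, not necessarily what build.py does.

```python
# Sketch: derive a deterministic image name from the prebuild inputs so that
# recipes installing the same packages/commands reuse one cached image.
import hashlib


def calculate_image_name(base_image, prebuild_packages, prebuild_commands):
    digest = hashlib.sha1()
    digest.update(base_image.encode('utf-8'))
    for item in sorted(prebuild_packages) + list(prebuild_commands):
        digest.update(item.encode('utf-8'))
    # 'docker-build-base' is an invented repository name for this example.
    return 'docker-build-base:%s' % digest.hexdigest()[:12]
```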

@bgruening
Member

does not solve #3 completely

What am I missing? (Other than the --nocache flag)

OpenMS for example downloads more than 200MB and the first commands take ages (compiling boost etc.). It would be nice to have this cached, or at least to have control over it.

If I understood your code correctly, you are creating a shell script out of the yaml definitions and using a reusable/cached Dockerfile with build-essentials.

Well, some of the build stuff from the yaml goes into the image. The image isn't just build-essential, it's that plus all listed packages (e.g. see chado image) and any commands you want in the image. That's why I calculate an image name in the code, as all of the images are different due to installing different packages.

Indeed, and this is a very nice improvement.

So why do we need the yaml abstraction at all?

We probably don't; I like the cleanliness of writing a pure metadata file rather than tying us to docker, in case someone wants to reuse the metadata with some other build system. Do you prefer thinking about docker layers and stringing dozens of commands together with && to a list of commands which are strung together for you?

The && gives you control over the caching. You don't need to do that; it's not needed at all.

I personally don't particularly enjoy writing dockerfiles as much as I do writing a generalised data structure and knowing that at any time we will be able to go back and recreate every single image with tweaks.

I don't see where yaml has an advantage over a plain Dockerfile here.
I'm pretty sure you have more freedom in a Dockerfile than in a yaml file, at least without adding logic to build.py.

E.g. I'll be able to patch a command into build.py to store the downloaded URLs so we have the original .tar.gz (for reproducibility) in one place, and then rerun/regenerate all of the docker images. If we hardcode the images, we are either merging two dockerfiles together, which is non-trivial (as the urls would be in the 'custom' part of the image), or I'm patching that into N images and every new one we generate.

Yes, I understand this, but the question is what we want to archive here?
For me this is a hack (sorry @natefoo ) to produce binaries, because we are simply not able to do this on every machine. We don't want to create a new package format, like brew, easyinstall, conda. We could think about using these projects to create our binaries, but creating a new layer and a package definition feels over-engineered / replicated. Still cool :) Don't get me wrong, this is a huge improvement; I'm just playing Devil's advocate and asking about the scope of the project.

@hexylena
Member Author

OpenMS for example downloads more than 200MB and the first commands take ages (compiling boost etc.). It would be nice to have this cached, or at least to have control over it.

hmm. Agreed. For this, you could place the wgets in the prebuild/command section, and then those would be cached in the Dockerfile rather than in the build script: https://github.com/galaxyproject/docker-build/blob/v2/openms/default/build.yml Admittedly, that's not as "pretty" as url fetching being part of the dockerfile. I could definitely move the url building to the dockerfile.
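For example, a recipe could hoist the big download into prebuild so the Docker layer cache keeps it. A hypothetical snippet: the URL is invented, and it assumes prebuild commands can write to a location that is still visible at build time.

```yaml
# Hypothetical: fetch the large archive during the prebuild (Dockerfile)
# phase so Docker's layer cache retains it across builds.
prebuild:
  packages: |
    cmake
  commands:
    - mkdir -p /build
    - wget -q -O /build/OpenMS.tar.gz https://example.org/OpenMS/OpenMS-2.0.tar.gz

build:
  commands:
    # build commands run from /build/, where prebuild stashed the archive
    - tar xzf OpenMS.tar.gz
    - cd OpenMS-2.0 && cmake . && make && make install
```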

You don't need to do that it's not needed at all.

fair point.

I'm pretty sure you have more freedom in a Dockerfile than in a yaml file, at least without adding logic to build.py.

you do! Definitely. I had trouble converting your packages because you use the squeeze backports, and there wasn't an easy way to stick that in before the package fetching.

For me this is a hack

it always was, still is.

We don't want to create a new package format, like brew, easyinstall, conda.

definitely not! @jmchilton teased me about this in IRC, I agree. I don't want another packaging format. Since this can ONLY be used to build docker images of packages right now...it feels like we're not trying to do that.

creating a new layer and a package definition feels over engineered / replicated.

I agree. It's definitely over-engineered compared to writing Dockerfiles, but I'd argue that this is the cleaner, easier-to-refactor approach should we decide to add features in the future. Given how the feature scope has crept in the past, I expect that this will be an easier format to work with going forward, given that we can regenerate ALL of the dockerfiles at once, from metadata. That's the crux of my argument: if we decide to do different things with the dockerfiles (e.g. my storing of packages), all of our code stays DRY.

but the question is what we want to archive here?

(We=me) want to archive all the artifacts of the build process. Inputs and outputs. This happens on a VM somewhere, so I don't care about space/runtime/etc. I just don't like having bedtools fail with the version specified in build.sh because they removed the file, so now we can never redo that build. I really understand that we probably won't ever want to redo that build, but I don't see that as a reason not to store artifacts.

@bgruening
Member

@erasche can you come up with a different use case for which we need metadata? With simple rules like: "Put all downloaded packages in /downloads" we can get the same result and archive all tarballs.

I just don't like having bedtools fail with the version specified in build.sh because they removed the file, so now we can never redo that build.

Yes, an old problem, but it's not a problem we can fix here. I'm more worried about the debian team disabling the archive :)

My main argument is: This is a hack, so let's use hacky scripts to get our binaries and not split the logic into (wonderful) yaml + a backend that needs to be constantly updated whenever we need a new feature.

@hexylena
Member Author

@bgruening I can't come up with another use case for the metadata. All of my use cases are theoretical "someone might want to do this" and that it's prettier/cleaner.

I understand what you're saying, and I don't have anything to refute it with. It's a hack, it doesn't need to be over-engineered. If you'd prefer, I'll back out all of my changes to other packages, and will settle for just adding this in as another, optional build system.

@hexylena
Member Author

@natefoo any comments?

@bgruening
Member

Feel free to merge as it is, this was not a -1. Just wanted to discuss the direction.

@hexylena
Member Author

That said, I don't think we need to make this easily portable to anything other than docker.

okay! Less work! Yay! :P

The scope is creeping and if it limits us being able to do the core task of what we want to accomplish,

yeah. Well, let's keep all of the build scripts (for the time being) and call it good?

@hexylena
Member Author

FYI, I've been building against this branch in jenkins and we had our first completely successful build!!

https://gx.hx42.org/job/Docker-Build/18/

Clicking "Expand all" will show all of the build files and build logs. Build logs/outputs will need to be checked manually at some point, but most things seem to be building at least semi-correctly.

I would not automatically pull yet, lest any 100% real/known correct artifacts get overwritten.

NB: It's surprising that it's completely successful because ugly hacks were needed to get there. With only 20 GB of disk space, jenkins crashes and burns after about 4-5 image builds due to the disk space required for layers. Thus, I had to insert some steps in between the build.py steps to regularly wipe out all unused images and remove them from disk.

@hexylena
Member Author

Any -1s or should this be merged now that it's ~3 weeks old?

@natefoo
Member

natefoo commented May 13, 2015

@erasche you mentioned keeping the build scripts, should I add them back?

@hexylena
Member Author

@natefoo you can add them back if you wish, but most everything converted builds properly. You're probably right, I'll re-add the original build.sh files and they can all co-exist until $later_time. I'll ping you when that happens.

@hexylena
Member Author

hexylena commented Jun 4, 2015

@natefoo okay, this is ready to merge.

  • the old build system and its files are 100% retained
  • the new build system is added as an alternative
  • build-all.sh lets you build everything using a preferred build system. Only the stuff I personally know works has migrated to that.

@natefoo
Member

natefoo commented Jun 4, 2015

👍 from me. @bgruening, any comments?

@bgruening
Member

Let's merge this! Awesome work Eric.
