files on the docker image #55

Closed
benmarwick opened this issue Jan 27, 2015 · 9 comments
@benmarwick (Contributor)

Hi Carl,

I'm curious about how you got just the contents of nonparametric-bayes/manuscripts to show up in the docker image at /home/rstudio/pdg-control; can you help me make sense of that?

I gather that these lines in your Dockerfile install the packages needed for the project, including the nonparametric-bayes/manuscripts directory:

  && installGithub.r --deps TRUE \
    cboettig/cboettigR \
    cboettig/pdg_control \
    cboettig/nonparametric-bayes

But I'm not quite sure how the COPY instruction is working:

COPY . /home/rstudio/pdg-control

In this context, what is `.` referring to? How is it specifically getting the nonparametric-bayes/manuscripts directory?

Or are there some other lines you run (that I can't see here) when configuring the image initially to get that directory on the image?

thanks!

@cboettig (Owner)

Hi Ben,

No magic here, since everything has to run on the Docker Hub. `docker build` takes all its paths relative to the directory the Dockerfile is in (recall that when running `docker build` locally you give it a path to that directory, rather than to the Dockerfile itself). Since the Dockerfile is in `manuscripts`, this corresponds to the manuscripts directory of the Github repository.

Recall that you can only refer to things in the directory containing the Dockerfile or its subdirectories; you cannot go back up (no `../`), since that would refer to files that are not available to `docker build`. `build` temporarily loads everything in that working directory into a scratch space, so if you have lots of stuff in that directory (or its subdirectories), `build` will be very slow unless you use a `.dockerignore` file. For instance, if the Dockerfile is at the top level of a git repo, you probably want to ignore the `.git` dir in your `.dockerignore` file.
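For concreteness, a minimal sketch of the build step (paths follow the repo layout discussed above):

    cd nonparametric-bayes/manuscripts              # the directory containing the Dockerfile
    docker build -t cboettig/nonparametric-bayes .  # "." is the build context

And, for the top-of-repo case, a `.dockerignore` in the build context (the `cache/` entry is just a hypothetical example of a large directory the build doesn't need):

    .git
    cache/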

Note that /home/rstudio/pdg-control is a bit of a typo -- pdg-control is the repo name for a different project with a similar Dockerfile. I probably should have called this /home/rstudio/manuscripts or something like that.

Okay, questions for you:

I keep going back and forth on whether it is actually a good idea to copy the data/manuscript files etc. onto the Docker container like this, or whether to just omit that part. It sounds great to have everything together, but it creates a potential for confusion when linking volumes.

In practice, I tend instead to clone the repo from Github and then run the analysis in Docker, e.g.:

git clone https://github.com/cboettig/nonparametric-bayes.git
cd nonparametric-bayes/manuscripts
docker run -v $(pwd):/data --user rstudio -ti --rm cboettig/nonparametric-bayes make

This ensures that the pdf ends up on the host machine and not stuck inside the container. However, if you run the above with an interactive RStudio instance, you will see two different copies of the manuscripts dir: the linked one in /data (or wherever you put it) and the pre-packaged one in /home/rstudio/pdg-control, unless you've overwritten the latter by doing `-v $(pwd):/home/rstudio/pdg-control`. In any event, you have to work in the linked one unless you want to manually download the pdf to your machine (e.g. using the RStudio-server download dialog or just saving the pdf) to get it off the container. Maybe that makes more sense/is more robust than linking volumes, though; I don't know.
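Spelled out, that overwrite-mount variant would look something like this (a sketch: the `-p 8787:8787` RStudio port mapping follows the usual rocker convention and is an assumption here):

    cd nonparametric-bayes/manuscripts
    # mount the host copy over the baked-in one, so only one manuscripts dir is visible
    docker run -v $(pwd):/home/rstudio/pdg-control -p 8787:8787 cboettig/nonparametric-bayes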

@benmarwick (Contributor, Author)

Thanks very much for your quick and detailed reply. If I understand correctly, you might have done `docker build Carls_computer/nonparametric-bayes/manuscripts`, which refers to a directory on your computer containing the Dockerfile that I see here in this repo. That command generated the docker image that I pulled from the hub to reproduce your manuscript. And so the `.` in your Dockerfile's COPY refers only to the contents of that manuscripts directory.

Building from the local folder is a little awkward, since I run boot2docker and it's not trivial (for me, at least) to have `docker build` run from one of my host Windows folders. So I wonder if the solution is to include `git clone` in the Dockerfile to get the package from Github into the image? I think I recall you discussing that on your blog; anyway, it works for me: I clone the whole package repo, then keep only the vignettes directory (equivalent to your manuscripts dir with Rmd and data; see the sketch below). The functions from my package are present in the installed package, so I don't need another copy of those. That all seems to be working for me with https://github.com/benmarwick/Steele_et_al_VR003_MSA_Pigments
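A rough Dockerfile sketch of that clone-and-prune step (the rocker/hadleyverse base image and the target path are assumptions for illustration, not the exact contents of the repo above):

    FROM rocker/hadleyverse
    # clone the compendium, keep only the vignettes (Rmd + data), drop the rest
    RUN git clone https://github.com/benmarwick/Steele_et_al_VR003_MSA_Pigments.git /tmp/repo \
      && mv /tmp/repo/vignettes /home/rstudio/vignettes \
      && rm -rf /tmp/repo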

Looking into it a bit, I think I now see a flaw in using the vignettes directory: I cannot store the HTML output from the rendered Rmd in there... is that why you use a non-standard package directory like manuscripts? The trade-off is that the manuscript is not included in the built package, but the vignette is, and it's handy to use the package build functions to generate the vignette. But I'm not sure how important the contents of the built package are for archival purposes, since we typically put the package source on zenodo, etc. I look forward to your thoughts on having a manuscripts directory in the package, versus vignettes.

My first thought on your specific question is to make isolation the priority, since that seems to be what Docker does so well. So my preference for the final compendium is to include the data, manuscript source and built package all on the Docker image, and avoid linking or any other kind of connection with the host. If the data grew too big, perhaps then a data-only image linked to a manuscript-source-and-package image. But when developing the analysis I find it much more convenient to have access to my host file system (relevant files are usually scattered widely until the last moment, when I organise the compendium), so I do something similar to you, linking the docker container to a host directory.

By the way, running `devtools::install_github("cboettig/nonparametric-bayes")` halts with an error that `pdgControl` is not installed. After running `devtools::install_github("cboettig/pdg_control")` I can install `nonparametric-bayes` fine. What's the general solution to installing packages from github when they are a dependency of another package?

I've had a bit of a look into how to do travis-style continuous integration for the docker image; circleci.com seems closest. After a bit of poking around I've got a circle.yml file, a shield on my readme, and my image passes their test, so that seems like a reasonable option. I can't see how to securely send my docker hub credentials to circle to push the image after a successful build, though.

I'm looking forward to talking more about this with you at the rOpenSci event, perhaps we should propose a project during that event to document more of this process of using R packages and Docker containers to make research compendia.

@cboettig (Owner)

Inline replies below. Good fodder for a wee blog post here.

On Wed Jan 28 2015 at 11:19:19 AM Ben Marwick notifications@github.com wrote:

> Thanks very much for your quick and detailed reply. If I understand correctly, you might have done `docker build Carls_computer/nonparametric-bayes/manuscripts`, which refers to a directory on your computer containing the Dockerfile that I see here in this repo. That command generated the docker image that I pulled from the hub to reproduce your manuscript. And so the `.` in your Dockerfile's COPY refers only to the contents of that manuscripts directory.

Basically yes. Recall, though, that the Hub copy you pull is built by the Docker Hub automatically, with nothing local from me. On setting up automated builds, you define both the Github repo and the path to where the Dockerfile you want built lives inside that repo. That path defines where `.` is referencing. So there's no need to ever build locally; this whole thing could be set up using only the Docker Hub web interface and Github web interface, without even having a local copy of docker.

> Building from the local folder is a little awkward, since I run boot2docker and it's not trivial (for me, at least) to have `docker build` run from one of my host Windows folders. So I wonder if the solution is to include `git clone` in the Dockerfile to get the package from Github into the image? I think I recall you discussing that on your blog; anyway, it works for me: I clone the whole package repo, then keep only the vignettes directory (equivalent to your manuscripts dir with Rmd and data). The functions from my package are present in the installed package, so I don't need another copy of those. That all seems to be working for me with https://github.com/benmarwick/Steele_et_al_VR003_MSA_Pigments

Yup, this is functionally equivalent to the `.` from the Hub's perspective, other than assuming the image has git installed. Maybe it is more transparent than COPY. Do you then run the Docker image with a linked volume?

> Looking into it a bit, I think I now see a flaw in using the vignettes directory: I cannot store the HTML output from the rendered Rmd in there... is that why you use a non-standard package directory like manuscripts? The trade-off is that the manuscript is not included in the built package, but the vignette is, and it's handy to use the package build functions to generate the vignette. But I'm not sure how important the contents of the built package are for archival purposes, since we typically put the package source on zenodo, etc. I look forward to your thoughts on having a manuscripts directory in the package, versus vignettes.

Bingo, I go back and forth on this one too and would really like your thoughts on it. Several aspects here, some more conceptual than technical.

1. `install_github()` used to build vignettes by default, meaning it had to install the SUGGESTS list of dependencies unique to the manuscript (knitr stuff, plotting stuff, etc.) and compile the manuscript from scratch, which takes hours. This was fixed in devtools 1.6 (Sept 2014), but I'm not sure how many older devtools are floating around out there; see the sketch just below.
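For reference, making the post-1.6 behaviour explicit in R (a sketch; check your devtools version supports the argument):

    # skip vignette building, so the manuscript-only SUGGESTS dependencies
    # (knitr, plotting packages, etc.) are not installed or compiled
    devtools::install_github("cboettig/nonparametric-bayes", build_vignettes = FALSE)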

> My first thought on your specific question is to make isolation the priority, since that seems to be what Docker does so well. So my preference for the final compendium is to include the data, manuscript source and built package all on the Docker image, and avoid linking or any other kind of connection with the host. If the data grew too big, perhaps then a data-only image linked to a manuscript-source-and-package image. But when developing the analysis I find it much more convenient to have access to my host file system (relevant files are usually scattered widely until the last moment, when I organise the compendium), so I do something similar to you, linking the docker container to a host directory.

> By the way, running `devtools::install_github("cboettig/nonparametric-bayes")` halts with an error that `pdgControl` is not installed. After running `devtools::install_github("cboettig/pdg_control")` I can install `nonparametric-bayes` fine. What's the general solution to installing packages from github when they are a dependency of another package?

The cboettig/nonparametric-bayes Dockerfile should have installed `pdgControl` already. Are you running this on a different Docker container? In general, just put the install calls for Github packages in the Dockerfile before the call to install the focal package; see: https://github.com/cboettig/nonparametric-bayes/blob/22ef8818a62bf0f0be57874282661899c37a1137/manuscripts/Dockerfile#L7-8
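In Dockerfile terms, this is the pattern from the snippet you quoted at the top of the thread: list the dependency before the package that needs it.

    # GitHub-only dependency (pdg_control) first, then the focal package
    RUN installGithub.r --deps TRUE \
        cboettig/pdg_control \
        cboettig/nonparametric-bayes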

> I've had a bit of a look into how to do travis-style continuous integration for the docker image; circleci.com seems closest. After a bit of poking around I've got a circle.yml file, a shield on my readme, and my image passes their test, so that seems like a reasonable option. I can't see how to securely send my docker hub credentials to circle to push the image after a successful build, though.

WOW! Looks awesome, I've been wishing for something like that. Looks like you can add secure credentials as environment variables in the user interface (Project Settings -> Environment variables, see https://circleci.com/docs/environment-variables). Very cool that you can build and run arbitrary docker images; most CI things I've checked won't let you do that.
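A circle.yml sketch of that setup (the image name and the DOCKER_USER/DOCKER_PASS variable names are hypothetical; the values would be set in the Circle UI as above):

    machine:
      services:
        - docker
    test:
      override:
        - docker build -t benmarwick/steele-pigments .   # hypothetical image name
    deployment:
      hub:
        branch: master
        commands:
          - docker login -u $DOCKER_USER -p $DOCKER_PASS  # values come from the env vars set in the UI
          - docker push benmarwick/steele-pigments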

At the moment, I'm using Drone CI on a private digital-ocean server: http://server.carlboettiger.info:88/. The benefit is that I can use custom docker images (like Circle), but the image can be kept on the server, so I do not need to build it or pull it down from the hub each time to run the tests. Also, I can optionally make the knitr cache persistent. This is handy for things like my lab-notebook, where each post doesn't need to be recompiled by every commit if it hasn't changed.

> I'm looking forward to talking more about this with you at the rOpenSci event, perhaps we should propose a project during that event to document more of this process of using R packages and Docker containers to make research compendia.

Likewise! Great idea.



@benmarwick (Contributor, Author)

Thanks again, your comments are very instructive, as usual! I'm not doing any linking (that I know of) for the compendium image, only for the development image (which isn't part of the repository, because of the relative paths and general disorder).

Sorry to be unclear, I meant that I tried `devtools::install_github("cboettig/nonparametric-bayes")` on my local installation of R, not in the docker container. I see how you've used the Dockerfile to take care of github-only package dependencies, which is neat. Is this a unique feature of docker, or can the regular R build process fetch github-only package dependencies for building and testing?

I've posted a comment over on the devtools issues tracker (r-lib/devtools#710) to get some feedback on integrating docker into devtools.

That's a neat private digital-ocean server you've set up, I can see how that would save time for testing. Anyway, thanks again for your help with making sense of all this, really looking forward to talking more at the rOpenSci event.

@cboettig (Owner)

Ah, good question about the github-only dependencies thing. The regular build process doesn't, though perhaps one could inject a script on package load, or simply include a call to `install_github` in the manuscript Rmd itself. Packrat provides an alternative solution, since it can automatically install github dependencies. This could be used in conjunction with a more generic Docker image (like hadleyverse) rather than providing a Dockerfile for every project: just start, say, a hadleyverse container running RStudio, clone the project (along with its packrat files), and it will have/be able to install all the project-specific dependencies (at least the R dependencies, with the container providing consistent versions of R and the low-level libraries).
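A sketch of that workflow (the port mapping follows the usual rocker convention; the project URL is a placeholder):

    # start a generic RStudio container from a stock image
    docker run -d -p 8787:8787 rocker/hadleyverse
    # clone the project, which carries its own packrat/ files
    git clone https://github.com/some-user/some-project.git
    # then, from R inside the container:
    #   packrat::restore()   # reinstalls the project-specific packages, GitHub ones included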

Maybe that's a good way to go, but I got really annoyed at packrat always bugging me about cross-grading and installing and reinstalling stuff when I was using it; I would prefer something lighter-weight and less strict than packrat's approach of locking all versions of all packages. I think @gmbecker's recent packages switchr and GRANbase might offer a better intermediate solution, but I'm still learning my way around them.


@cboettig (Owner)

@benmarwick Thanks again for the pointer to Circle-CI. Looks like they have a max build time of 1.5 hrs, so I'm not sure if nonparametric-bayes or some of my other research papers could build in that time (mostly because I can't be bothered to make the code run faster, rather than anything more intrinsic), but I did set up a quick example for the RNeXML paper: https://circleci.com/gh/ropensci/RNeXML

This is really nice -- building all the dependencies on rocker/ropensci from scratch (compiling R packages from source, installing LaTeX) on travis is prohibitively long; travis has only ever tested the R package, not the manuscript compile. Super simple on Circle CI.

@benmarwick (Contributor, Author)

Yes, the circleci service looks pretty handy. Your description of the generic docker image and a package that contains its own dependencies does sound like a good option too. An easy method of having a package contain its own dependencies doesn't seem to be available yet, though.

I really like the idea of packrat, but it's never worked properly for me (`packrat::restore()` usually ends with me getting frustrated and deleting every packrat-related file in my project...). Thanks for the pointer to GRANbase, though a quick test of the vignette code gives errors for me on windows and rocker/hadleyverse (with root permission). Seems like the breakthrough on this problem is yet to come (maybe at the rOpenSci meet Hadley can whip something up? :) In the meantime, I'll just keep all of CRAN on my laptop, or try checkpoint.

@gmbecker

Hey all,

The installation mechanism in switchr (http://github.com/gmbecker/switchr) actually does support github-only dependencies during installation, so long as you have a manifest which contains all the necessary github packages (it need not be exclusive, i.e. you could have a very large manifest of which you are installing only a small part at any given time).

See the vignette for a baby example, and please feel free to ping me with any issues or questions.
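Roughly along these lines (a sketch, not the vignette's exact code; the package names are illustrative and the calls follow switchr's documented manifest interface):

    library(switchr)
    # a manifest of GitHub packages; it can list far more packages
    # than you install at any one time
    man <- GithubManifest("cboettig/pdg_control",
                          "cboettig/nonparametric-bayes")
    # install one package; its github-only dependencies are found via the manifest
    install_packages("pdgControl", man)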

~G


@benmarwick (Contributor, Author)

Thanks @gmbecker, I've posted a question over there. Have either of you looked at rbundler as one solution to this problem, specifically focused on package development?
