Use wheels directly in the build process #1555
Conversation
Approved - some thoughts/questions, but nothing that makes me say we have to rewrite. Read before you merge; it might jog an idea or two.
Okay, maybe fixing the versioning of the apt-get installs would be nice.
RUN set -eux; \
    apt-get -qq update; \
    apt-get -qq -y --no-install-recommends install liblzo2-2 curl; \
consider pinning versions here, so the Dockerfile command changes when packages update, rather than just taking it for granted that the Docker build cache is empty.
Versioning the package installs? i.e., liblzo2-2=x.y.z curl=...?
Yes, otherwise when apt-get installs something new, we won't detect it and update local cached images. It isn't likely to break things... except when it does, and then we won't have visibility; we'll have to either push a garbage step above (or in) the RUN, or tell downstream users to clear the Docker caches.
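For illustration, the pinned variant might look like this (the version strings below are hypothetical placeholders, not the actual Debian package versions):

# pinning makes the instruction text change on upgrades, busting the build cache
RUN set -eux; \
    apt-get -qq update; \
    apt-get -qq -y --no-install-recommends install \
        liblzo2-2=2.10-0.1 \
        curl=7.64.0-4; \
    rm -rf /var/lib/apt/lists/*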
There's a scale between being very specific to maximize "reproducibility" vs having generically valid code that should be expected to work when upstream packages change. I'm not too keen on trying to be shepherds of package manager versions. It's somewhat nice to have "loose" dependencies that automatically update to their latest versions without needing a lot of engineering management around them.
It's true that you'll need to explicitly bust the cache to pick up changes, or you'll implicitly pick up changes when something earlier causes a cache bust; and when things do break, it will take some image inspection to list out all of the differences in system packages.
We could potentially think about doing a single apt-get update very high up in the stack and not deleting it - with the expectation that we explicitly bust that part of the cache after a new major/minor release - such that any downstream installs should always install the same versions.
That said, I don't think I've seen either of these patterns in use elsewhere (neither being very specific with 3rd-party package versions, nor the persisted apt-get update). That's not to say I'm unwilling to try, if we really think it's valuable.
Related: apt-get upgrade and apt-get dist-upgrade have larger implications wrt reproducibility as well.
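A sketch of that persisted-update idea, assuming a Debian-based image (the base image and stage name here are illustrative, not from this PR):

FROM debian:buster AS apt-updated
# run once, high in the stack, and keep /var/lib/apt/lists so every
# downstream install resolves against the same cached package index;
# bust this layer explicitly after a new major/minor release
RUN apt-get -qq update

FROM apt-updated
RUN apt-get -qq -y --no-install-recommends install liblzo2-2 curl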
apt-get -qq -y --no-install-recommends install liblzo2-2 curl; \
    rm -rf /var/lib/apt/lists/*

COPY deephaven-jpy-wheel/ /deephaven-jpy-wheel
why not do just one big copy here, then one big run, and install all three wheels in one step? I expect that jpy itself changes less often than the other two wheels, but do you think that's enough to warrant tripling the layers?
The concern I have is that since the old deephaven-wheel is based in part on our javadoc, it will see a change any time the JSON is correctly updated, so we will have to copy it, install it, copy the dh2-wheel, install it (even if it didn't change) - might as well just do them in one step?
("yes, I meant it this way" is a totally valid answer, just trying to make sure I get it all)
So, we are "tripling" the layers, but it only affects the build stage, and not the final produced image. In essence, I think we get the "best of both worlds" - we don't have to re-install deephaven-jpy.whl when deephaven.whl or deephaven2.whl changes.
it also affects the local build cache - 2x the image churn - and deephaven.whl is the "big" one, relative to the other two (1MB, 50kB, 217MB), due to downloading other packages from the internet. On the one hand it should be first so we don't churn it so much; on the other hand, afaict it is the only one that changes when javadoc changes, so it should be last to avoid rebuilding the others.
And the final layer gets a complete fresh copy made when anything else changes.
I'm happy to change the order of the COPY/RUN steps, or move into a single COPY/RUN.
But this also brings up another "reproducibility" concern, very similar to your concern about apt-get installing specific versions: pip will likely choose the most recent version for all of the dependencies the wheel specifies. Our requirements aren't specific:
$ unzip -q -c docker/runtime-base/build/context/deephaven-wheel/deephaven-0.7.0-py2.py3-none-any.whl | grep Requires
Requires-Dist: deephaven-jpy (==0.7.0)
Requires-Dist: numpy
Requires-Dist: dill (>=0.2.8)
Requires-Dist: wrapt
Requires-Dist: pandas
Requires-Dist: enum34 ; python_version < "3.4"
Requires-Dist: numba ; python_version > "3.0"
If we want to maximize cache-ability, we would likely want to have a freeze list that we install before any of our wheels; then we can install just our wheels without any additional downloads.
I've added a requirements file into the runtime-base PR - which, in addition to being better for reproducibility, aids cache-ability - but it comes at the cost of additional maintenance.
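For reference, a minimal sketch of that shape (the file name and pip flags here are illustrative; the actual runtime-base PR may differ):

# install the pinned third-party dependencies first, so this layer
# caches independently of our wheels
COPY requirements.txt /requirements.txt
RUN set -eux; \
    python3 -m pip install -q --no-cache-dir -r /requirements.txt; \
    rm /requirements.txt

# the wheel's dependencies are already satisfied, so pip has nothing
# extra to download; --no-deps makes that explicit
COPY deephaven-wheel/ /deephaven-wheel
RUN set -eux; \
    python3 -m pip install -q --no-cache-dir --no-deps /deephaven-wheel/*.whl; \
    rm -r /deephaven-wheel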
into 'deephaven2-wheel'
}

from 'src/main/docker'
why not embed the Dockerfile as a task, rather than an "external" build file, to show how to use the wheels mentioned above?
Also, along with the "one big copy" comment, consider making the three from blocks use one common into?
I don't think of it so much as an "external" build file. When we don't need build-level parameterization of a Dockerfile, I think a textual Dockerfile is easier to grok and modify.
Perhaps - I find it easier to have it all in one place, rather than needing to know whether to look at .py, .groovy, Dockerfile, .gradle, .sh, etc. to find configuration: consolidate the sh into the Dockerfile, and the Dockerfile into Gradle.
Plus, if we have a centralized tagging system (so we can produce arm images correctly), we are going to have to modify all these images to have the right tag for in-build - for example, FROM dh/java-and-python:local-build-arm64 - or else we have to do a docker+gradle clean on CI to get it to start the next platform, etc. (or pick another option which keeps gradle in the dark about what kinds of images it is producing).
In a follow-up PR, I'm adding a "requirements.txt" file - now, we could consolidate the requirements into a RUN/sh install command instead (no file), but I think a text file is easier to maintain, and it lives right next to the Dockerfile.
COPY deephaven-wheel/ /deephaven-wheel
RUN set -eux; \
    python3 -m pip install -q --no-cache-dir /deephaven-wheel/*.whl; \
(cc @jmao-denver) I just noticed this and am wondering if you have any insights - if I'm reading this right, the wheel itself is 137kB, but the installed size is 217MB?
3c7d9ad99d54 8 minutes ago /bin/sh -c set -eux; python3 -m pip inst… 217MB
ec679d5275dd 8 minutes ago /bin/sh -c #(nop) COPY dir:7dc2bc98d33cb7f8d… 137kB
Step 5/10 : COPY deephaven-wheel/ /deephaven-wheel
---> ec679d5275dd
Step 6/10 : RUN set -eux; python3 -m pip install -q --no-cache-dir /deephaven-wheel/*.whl; rm -r /deephaven-wheel
---> Running in 046d3e02500b
+ python3 -m pip install -q --no-cache-dir /deephaven-wheel/deephaven-0.7.0-py2.py3-none-any.whl
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
+ rm -r /deephaven-wheel
Removing intermediate container 046d3e02500b
---> 3c7d9ad99d54
Here are the sizes I get locally:
$ ls -la Integrations/build/context/*
-rw-r--r--. 1 devin devin 761 Nov 12 12:03 Integrations/build/context/Dockerfile
Integrations/build/context/deephaven2-wheel:
total 16
drwxr-xr-x. 1 devin devin 66 Nov 12 12:03 .
drwxr-xr-x. 1 devin devin 120 Nov 12 12:03 ..
-rw-r--r--. 1 devin devin 12942 Nov 12 12:03 deephaven2-0.7.0-py3-none-any.whl
Integrations/build/context/deephaven-jpy-wheel:
total 352
drwxr-xr-x. 1 devin devin 94 Nov 12 12:03 .
drwxr-xr-x. 1 devin devin 120 Nov 12 12:03 ..
-rw-r--r--. 1 devin devin 356445 Nov 12 12:03 deephaven_jpy-0.7.0-cp37-cp37m-linux_x86_64.whl
Integrations/build/context/deephaven-wheel:
total 136
drwxr-xr-x. 1 devin devin 72 Nov 12 12:03 .
drwxr-xr-x. 1 devin devin 120 Nov 12 12:03 ..
-rw-r--r--. 1 devin devin 137119 Nov 12 12:03 deephaven-0.7.0-py2.py3-none-any.whl
The large install size is due to the dependency on numba and other packages.
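One generic way to see where the bulk lands inside the image (the dist-packages path depends on the base image's Python layout, so treat it as an assumption):

$ du -sh /usr/local/lib/python3.7/dist-packages/* | sort -h | tail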
rm -r /deephaven2-wheel

FROM deephaven/java-and-python:local-build
COPY --from=install-wheels / /
Do I understand properly that these last two lines "start over" with the fresh image, and just put one big layer on top, with the contents of the above steps?
Assuming so, would it be possible to just mash in the created dist-packages from the three above RUNs, and let each RUN happen in its own distinct FROM, so that they don't affect each other? This is probably overkill, but since we're making an entirely fresh 200MB build each time the above 200MB build changes, it seems like we have some room to improve a bit, or we could drop this "flattening" step and cut our churn in half?
Yes - we are taking the changes from the "install-wheels" stage, and just adding the bits that have changed back on top of java-and-python:local-build.
The reason I don't want to run the 3 pip installs in parallel (instead of serially) is that I can't guarantee they won't have overlapping bits; I'd prefer to let the pip package manager deal with any potential version conflicts rather than either breaking or silently choosing one version during a copy stage.
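Condensed, the pattern under discussion looks roughly like this (stage layout inferred from the diff hunks above; the real Dockerfile has one COPY/RUN pair per wheel):

FROM deephaven/java-and-python:local-build AS install-wheels
COPY deephaven-jpy-wheel/ /deephaven-jpy-wheel
RUN set -eux; \
    python3 -m pip install -q --no-cache-dir /deephaven-jpy-wheel/*.whl; \
    rm -r /deephaven-jpy-wheel
# ...same COPY/RUN shape for deephaven-wheel and deephaven2-wheel...

# final stage: re-base onto the same image and squash every wheel
# install into a single COPY layer
FROM deephaven/java-and-python:local-build
COPY --from=install-wheels / /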