-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use staged builds to minimize final image sizes #1031
Conversation
None of the test failures are due to my changes. CodeGen Gaudi test TGI fail is due to it trying to load HuggingFace model it has no rights for:
CodeGen Xeon test TGI seems to fail due to: VisualQnA Gaudi & Xeon tests fail is due to NPM dependency conflict for it's Node.js Svelte UI container build (which spec is not touched by this PR). |
3e49050
to
07051a7
Compare
Rebased this example to latest Note: I did not update the
|
07051a7
to
f43ab84
Compare
Updated also the new Rebased to latest |
No idea why guardrails times out:
And translation fails:
As CI does not provide enough information. |
f43ab84
to
95e0c76
Compare
Rebased to EdgeCraftRAG was not updated because it's using |
@lvliang-intel CI seems to be in rather bad state, as CMake is segfaulting on image builds:
|
95e0c76
to
95276fc
Compare
No changes, just rebase to latest Sadly it did not, "ChatQnA, gaudi" test still fails, to:
vLLM seems to return different results than TGI, so I wonder whether CI test is just out of date? |
Dependency Review✅ No vulnerabilities or license issues found.Scanned Files |
Not actively, but I'm occasionally updating it.
Based on my earlier testing, it reduces size of each app container by ~350MB. For details, see #225. But it's just the first step, demostrating that the only unique part in all these images is the app Python file. The end goal of switching to shared base image, will drop size of all these images from hundreds of MBs to just tens of KBs. |
@eero-t Thanks! |
I think it's better if somebody (else) creates first a GenAIComps repo base image [1], and makes sure that nightly I can then just replace all the preliminary stages in Dockerfiles in this PR with that base image. [1] Comps repo base image Dockerfile = basically the first 38 lines from any Dockerfile included to this PR, but with no need to install Git or pull the Comps repo with it. Just COPYing the relevant dirs is enough. |
Looking at the recent changes in GenAIExamples repo, FFmpeg needs to be added the DocSum image now. |
@eero-t are bugs filed for these .. at the very least there should be documentation that lists need to access specific model/kernel etc. |
@eero-t please submit yourself this base image and we can have @chensuyue help with publishing it. Completing this slimming of all containers for V 1.2 would be wonderful. |
b1a16ee
to
414d1c6
Compare
Rebased to "EdgeCraftRAG" |
"apt-get update" in previous stage was not enough, apparently it needs to be done for every "apt-get install" command. (Fixed that also for EdgeCraftRAG/Dockerfile.server that I did not otherwise touch.) |
Currently 8 of the 86 tests fail. All the failures are in services / containers coming from Comps project, not something touched by this PR. Gaudi & Xeon "AvatarChatbot" tests animation service fails type validation:
Most of rocm "run-test" cases fail, except for AudioQnA + FaqGen tests. "ChatQnA, rocm" test fail is a bit of a mystery:
As is "MultimodalQnA, rocm" test failure:
"CodeGen, rocm" test service exits with failure:
As does "CodeTrans, rocm" test service:
And "Translation, rocm" test service:
Queried "VisualQnA, rocm" container does not exist:
|
@ashahba's ChatQnA example PR (#1363) has optimization for speeding up the intermediate stage. It While the repo content fetching itself takes about same time in both (6s in my setup), installing Because that does not affect final image sizes, only build speed, and #1369 will replace these changes as soon as the new base image is available, I'll change it only if I need to otherwise update all the Dockerfiles in this PR. |
we can still stick with |
@chensuyue and I have reached out to AMD for help with the ROCm failures. |
Now the failures are down to:
I suspect that this PR really has nothing to do with the failures and most likely they are side effects of refactoring being discovered by this PR since they are touching some containers that were not testing as part of refactoring. |
AvatarChatbot will pass, after this PR merged. |
All those issues have resolved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a second thought, I'm going to approve this PR and we can always add:
apt-get clean autoclean && \
apt-get autoremove -y && \
in the future or if base container is merged, we'll just add it to that.
Out of 86 CI tests, 4 Gaudi failed due to container creation/startup issues. (Which are unrelated to changes done in this PR.) "ChatQnA, gaudi" and "Translation, gaudi" test errors:
"CodeTrans, gaudi" and "SearchQnA, gaudi" test errors:
Maybe different Gaudi test runs are not isolated from each other well enough, when this large number of them is started in parallel? Could they e.g. be purging created containers from each others' run, and try to re-create containers with the same name in same node? |
We only have 1 gaudi machine for several projects CI, so sometimes there are conflict between the test from different Repo. We can't do force image clean up since the test may run in parallel. |
Needing to rerun 86 tests, which execution takes ~3 hours, in hopes that one of those runs would not hit this race-condition, is IMHO not really acceptable. Depending on how likely this CI failure is, it may never pass...
Could CI use, or force use of unique names / suffixes for the image / container names, and remove them at end of the test? Purging could be done when there are no tests running (I would imagine there's some time during each day when there are no CI jobs running). |
These CI improvement seem unlikely to come in time for 1.2 release. Alternatives for that are:
|
Would it be better to create a CI specific registry, no need to then change image names, just where to retrieve the images from. Once CI run complete, remove that version of the registry. |
After merge commit there were many additional CI failures. First 2 errors below look like real issues, in code merged from Dockerfile check:
"AvatarChatbot, gaudi":
"ChatQnA, gaudi":
"CodeGen, gaudi":
"CodeGen, rocm":
"CodeTrans, rocm":
"FaqGen, gaudi":
"FaqGen, rocm":
"MultiModalQnA, gaudi":`
"MultiModalQnA, xeon":`
"Translation, rocm":
"VisualQnA, gaudi":
"VisualQnA, rocm":
"VisualQnA, xeon":
|
So that redundant things do not end in final image: - Git repo history - Test directories - Git tool and its deps And drop explicit installation of: - jemalloc & GLX: nothing uses them (in ChatQnA at least), and for testing it's trivial to create image adding those on top: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator - langchain_core: GenAIComps install langchain which already depends on that This demonstrates that only 2-3 lines in the Dockerfiles are unique, and everything before those can be removed with a common base image. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Rebased to |
Rebase fixed CI issues, but It's with
And seems to be caused by this PR: https://github.com/opea-project/GenAIComps/pull/1132/files As it dropped It has nothing to do with this PR, so it's not a blocker for merging. |
@yao531441 please check this issue. |
@yao531441 is checking the issue, we can merge this PR with this issue left if this is target v1.2. |
Fix PR opea-project/GenAIComps#1160 |
Description
Staged image builds so that final images do not have redundant things like:
And drop explicit installation of:
langchain_core
: GenAIComps installslangchain
which already depends on thatjemalloc
& GLX: nothing uses them (in any of the ChatQnA services), and for testing[1] it's trivial to create separate image adding those on top~/.bashrc
(as these images run Python programs directly, not through Bash scripts)=> This demonstrates that only 2-3 lines in the Dockerfiles are unique, and everything preceding those could be removed with a common base image.
[1] I assume those files were there to test this: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator
Issues
Fixes: #225
Type of change
Dependencies
n/a
(this removes redundant Git, Perl, jemalloc, GLX dependencies from final images)Tests
This is draft / example for fixing #225
I have not tested it apart from verifying that images still build.
Notes
In a proper fix, non-unique part of the Dockerfiles would be a separate base image, generated with GenAIComps repo Dockerfile, and Dockerfiles in this repository would depend on that image instead of
python-slim
.However, that requires co-operation between these two repositories (unless components base image Dockerfile is also in this repo) and:
(I.e. it needs to be done by a member of this project, I cannot do it.)