
Use staged builds to minimize final image sizes #1031

Merged
merged 1 commit on Jan 16, 2025

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Oct 25, 2024

Description

Staged image builds so that final images do not have redundant things like:

  • Git tool and its deps (e.g. Perl)
  • Git repo history
  • Test directories

And drop explicit installation of:

  • langchain_core: GenAIComps installs langchain, which already depends on it
  • jemalloc & GLX: nothing uses them (in any of the ChatQnA services), and for testing[1] it's trivial to create separate image adding those on top
  • File descriptor limit increase in ~/.bashrc (as these images run Python programs directly, not through Bash scripts)

=> This demonstrates that only 2-3 lines in each Dockerfile are unique, and everything preceding those could be replaced with a common base image.

[1] I assume those files were there to test this: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator
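The staged approach described above can be sketched roughly as follows (a minimal illustration, not the actual PR diff; the image tags, paths, and app filename are hypothetical):

```dockerfile
# Stage 1: clone the repo and install deps; Git and its deps (e.g. Perl)
# stay in this stage only and never reach the final image
FROM python:3.11-slim AS builder
RUN apt-get update && \
    apt-get install -y --no-install-recommends git
RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git /home/user/GenAIComps
RUN pip install --no-cache-dir --user -r /home/user/GenAIComps/requirements.txt

# Stage 2: copy only what the service needs; Git, the repo history
# and the test directories are left behind in the builder stage
FROM python:3.11-slim
RUN useradd -m user
COPY --from=builder --chown=user /root/.local /home/user/.local
COPY --from=builder --chown=user /home/user/GenAIComps/comps /home/user/comps
COPY chatqna.py /home/user/
USER user
WORKDIR /home/user
ENTRYPOINT ["python", "chatqna.py"]
```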

Issues

Fixes: #225

Type of change

  • Others (image size improvement / Dockerfile cleanup)

Dependencies

n/a (this removes redundant Git, Perl, jemalloc, GLX dependencies from final images)

Tests

This is a draft / example for fixing #225.

I have not tested it apart from verifying that images still build.

Notes

In a proper fix, the non-unique part of the Dockerfiles would be a separate base image, generated with a GenAIComps repo Dockerfile, and the Dockerfiles in this repository would depend on that image instead of python-slim.

However, that requires co-operation between these two repositories (unless the components base image Dockerfile is also in this repo), and:

  • CI handling this dependency, i.e. building the base image first when relevant
  • That base image being in a repository accessible for building the application images
    • E.g. in OPEA Docker hub project

(I.e. it needs to be done by a member of this project, I cannot do it.)

@eero-t
Copy link
Contributor Author

eero-t commented Oct 25, 2024

None of the test failures are due to my changes.

The CodeGen Gaudi TGI test fails because it tries to load a Hugging Face model it has no access rights for:

Access to model meta-llama/CodeLlama-7b-hf is restricted and you are not in the authorized list.
Visit https://huggingface.co/meta-llama/CodeLlama-7b-hf to ask for access.

The CodeGen Xeon TGI test seems to fail due to "Could not import SGMV kernel from Punica", which may be a similar issue.

The VisualQnA Gaudi & Xeon tests fail due to an NPM dependency conflict in its Node.js Svelte UI container build (whose spec is not touched by this PR).

@eero-t
Copy link
Contributor Author

eero-t commented Nov 14, 2024

Rebased this example onto latest main, on the assumption that the CI issues have been fixed in the meantime.

Note: I did not update the Dockerfiles for applications that were added after this PR was created:

  • GraphRAG
  • EdgeCraftRAG

@eero-t
Copy link
Contributor Author

eero-t commented Nov 18, 2024

Also updated the new ChatQnA/Dockerfile.wrapper to a staged build.

Rebased to latest main, as the previously used main failed in CI.

@eero-t
Copy link
Contributor Author

eero-t commented Nov 22, 2024

No idea why guardrails times out:

Waiting for deployment "chatqna-tgi" rollout to finish: 0 of 1 updated replicas are available...
deployment "chatqna-tgi-guardrails" successfully rolled out
error: deployment "chatqna-tgi" exceeded its progress deadline
+ echo 'Timeout waiting for chatqna_guardrail pod ready!'
+ exit 1
Timeout waiting for chatqna_guardrail pod ready!

And translation fails:

curl: (18) transfer closed with outstanding read data remaining
Validate Translation failure!!!

I cannot tell more, as CI does not provide enough information.

@eero-t
Copy link
Contributor Author

eero-t commented Nov 22, 2024

Rebased to main, and also updated the GraphRAG Dockerfile.

EdgeCraftRAG was not updated, because it uses the comps-base package from pip instead of cloning the Comps repo.

@eero-t
Copy link
Contributor Author

eero-t commented Nov 22, 2024

@lvliang-intel CI seems to be in a rather bad state, as CMake is segfaulting on image builds:

 [vllm build 5/7] RUN --mount=type=cache,target=/root/.cache/pip     --mount=type=cache,target=/root/.cache/ccache     --mount=type=bind,source=.git,target=.git     VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel &&     pip install dist/*.whl &&     rm -rf dist:
...
4.853 subprocess.CalledProcessError: Command '['cmake', '/workspace/vllm', '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=cpu', '-DCMAKE_C_COMPILER_LAUNCHER=ccache', '-DCMAKE_CXX_COMPILER_LAUNCHER=ccache', '-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache', '-DCMAKE_HIP_COMPILER_LAUNCHER=ccache', '-DVLLM_PYTHON_EXECUTABLE=/usr/bin/python3', '-DVLLM_PYTHON_PATH=/workspace/vllm:/usr/lib/python310.zip:/usr/lib/python3.10:/usr/lib/python3.10/lib-dynload:/usr/local/lib/python3.10/dist-packages:/usr/lib/python3/dist-packages:/usr/local/lib/python3.10/dist-packages/setuptools/_vendor', '-DFETCHCONTENT_BASE_DIR=/workspace/vllm/.deps', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile', '-DCMAKE_JOB_POOLS:STRING=compile=152']' returned non-zero exit status 1.
5.371 Segmentation fault (core dumped)

@ashahba ashahba self-assigned this Nov 22, 2024
@eero-t
Copy link
Contributor Author

eero-t commented Dec 16, 2024

No changes, just a rebase to latest main in the hope that the CI issues have been fixed in the meantime.

Sadly they have not; the "ChatQnA, gaudi" test still fails, with:

Response check failed, please check the logs in artifacts!
Validate test_manifest_vllm_on_gaudi.sh failure!!!

vLLM seems to return different results than TGI, so I wonder whether the CI test is just out of date?

Copy link

github-actions bot commented Dec 23, 2024

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

@xiguiw
Copy link
Collaborator

xiguiw commented Dec 23, 2024

@eero-t @ashahba

Are you still working on this?

What is the expected image size with and without this PR?

Thanks!

@eero-t
Copy link
Contributor Author

eero-t commented Dec 23, 2024

@eero-t @ashahba

Are you still working on this?

Not actively, but I'm occasionally updating it.

What is the expected image size with and without this PR?

Based on my earlier testing, it reduces the size of each app container by ~350MB. For details, see #225.

But it's just the first step, demonstrating that the only unique part in all these images is the app's Python file.

The end goal of switching to a shared base image will drop the size of all these images from hundreds of MBs to just tens of KBs.

@xiguiw
Copy link
Collaborator

xiguiw commented Dec 26, 2024

Not actively, but I'm occasionally updating it.


@eero-t Great!

Shall we merge this PR and you create a new PR for your next work, or do you want to keep working on this PR?

Thanks!

@eero-t
Copy link
Contributor Author

eero-t commented Dec 30, 2024

Shall we merge this PR and you create a new PR for your next work, or do you want to keep working on this PR?

I think it's better if somebody (else) first creates a GenAIComps repo base image [1], and makes sure that nightly "latest" builds of it end up in Docker Hub. Please add me as a reviewer for such a PR.

I can then just replace all the preliminary stages in Dockerfiles in this PR with that base image.

[1] Comps repo base image Dockerfile = basically the first 38 lines of any Dockerfile included in this PR, but with no need to install Git or pull the Comps repo with it. Just COPYing the relevant dirs is enough.
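Such a Comps base image Dockerfile could look roughly like this (a sketch under the assumption that the image is built from a checkout of the GenAIComps repo; the paths and base tag are illustrative):

```dockerfile
# Built from a GenAIComps repo checkout, so the build context already
# contains the sources and no Git install or clone is needed
FROM python:3.11-slim
RUN useradd -m -s /bin/bash user
# COPY only the relevant dirs instead of pulling the whole repo with Git
COPY --chown=user comps /home/user/comps
COPY --chown=user requirements.txt /home/user/
RUN pip install --no-cache-dir -r /home/user/requirements.txt
USER user
WORKDIR /home/user
```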

@eero-t
Copy link
Contributor Author

eero-t commented Dec 30, 2024

Looking at the recent changes in the GenAIExamples repo, FFmpeg needs to be added to the DocSum image now.

@mkbhanda
Copy link
Collaborator

mkbhanda commented Jan 7, 2025

None of the test failures are due to my changes.

CodeGen Gaudi test TGI fail is due to it trying to load HuggingFace model it has no rights for:

Access to model meta-llama/CodeLlama-7b-hf is restricted and you are not in the authorized list.
Visit https://huggingface.co/meta-llama/CodeLlama-7b-hf to ask for access.

CodeGen Xeon test TGI seems to fail due to: Could not import SGMV kernel from Punica, which may be similar issue.

VisualQnA Gaudi & Xeon tests fail is due to NPM dependency conflict for it's Node.js Svelte UI container build (which spec is not touched by this PR).

@eero-t, are bugs filed for these? At the very least, there should be documentation listing the need to request access for specific models/kernels etc.

@mkbhanda
Copy link
Collaborator

mkbhanda commented Jan 7, 2025

I think it's better if somebody (else) first creates a GenAIComps repo base image, and makes sure that nightly latest builds of it end up in DockerHub.

@eero-t, please submit that base image yourself, and we can have @chensuyue help with publishing it. Completing this slimming of all containers for v1.2 would be wonderful.

@eero-t eero-t force-pushed the staged-images branch 2 times, most recently from b1a16ee to 414d1c6 Compare January 8, 2025 15:43
@eero-t
Copy link
Contributor Author

eero-t commented Jan 8, 2025

Rebased to main, updated the "DocSum" Dockerfile to install FFmpeg, and added the same changes to the "EdgeCraftRAG" Dockerfile.

The "EdgeCraftRAG" Dockerfile.server file is the only one that is not modified. That's because it imports the opea-base module instead of fetching the code from Git.

@eero-t eero-t marked this pull request as ready for review January 8, 2025 15:46
@eero-t eero-t requested a review from Spycsh as a code owner January 8, 2025 15:46
@eero-t
Copy link
Contributor Author

eero-t commented Jan 8, 2025

"apt-get update" in a previous stage was not enough; apparently it needs to be run before every "apt-get install" command.

(Fixed that also for EdgeCraftRAG/Dockerfile.server that I did not otherwise touch.)
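The reason is that each build stage starts from its own base image filesystem, so the apt package lists fetched by "apt-get update" in one stage are not present in another. The common pattern (a generic sketch, not this PR's exact diff; the ffmpeg package is just an example) is to combine update and install in a single RUN:

```dockerfile
# "apt-get update" must run in the same stage as "apt-get install";
# doing both in one RUN, and removing the lists afterwards, also
# keeps the package index out of the resulting image layer
RUN apt-get update && \
    apt-get install -y --no-install-recommends ffmpeg && \
    rm -rf /var/lib/apt/lists/*
```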

@eero-t
Copy link
Contributor Author

eero-t commented Jan 9, 2025

Currently 8 of the 86 tests fail.

All the failures are in services / containers coming from the Comps project, not something touched by this PR.

In the Gaudi & Xeon "AvatarChatbot" tests, the animation service fails type validation:

   File "/home/user/comps/animation/src/opea_animation_microservice.py", line 54, in animate
    return VideoPath(video_path=outfile)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for VideoPath
video_path
  Input should be a valid string [type=string_type, input_value=<coroutine object OpeaCom...nvoke at 0x7f5ea7ea2110>, input_type=coroutine]

Most of the rocm "run-test" cases fail, except for the AudioQnA + FaqGen tests.

The "ChatQnA, rocm" test failure is a bit of a mystery:

 [2025-01-08 19:04:18,838] [    INFO] - prepare_doc_redis - [ upload ] File dataprep_file.txt does not exist.
/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py:75: DeprecationWarning: Call to deprecated add_document. (deprecated since redisearch 2.0, call hset instead) -- Deprecated since version 2.0.0.
  client.add_document(doc_id="file:" + key, file_name=key, key_ids=value)
[2025-01-08 19:04:20,009] [    INFO] - prepare_doc_redis - [ upload ] Link https://www.ces.tech/ does not exist. Keep storing.
...
 [ tei-rerank ] Content is as expected.
+ validate_service 10.53.22.29:9009/generate generated_text tgi-llm chatqna-tgi-server '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}'
...
++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' -H 'Content-Type: application/json' 10.53.22.29:9009/generate
+ HTTP_RESPONSE=HTTPSTATUS:000
Error: Process completed with exit code 7.

As is "MultimodalQnA, rocm" test failure:

2025-01-08T19:29:01.8774001Z [ retriever-redis ] Content is as expected.
2025-01-08T19:31:46.0053127Z + echo 'Evaluating lvm-llava'
2025-01-08T19:31:46.0054102Z + validate_service http://10.53.22.29:8399/generate '"generated_text":' tgi-llava-rocm-server tgi-llava-rocm-server '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}'
...
2025-01-08T19:31:46.0069882Z ++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' -H 'Content-Type: application/json' http://10.53.22.29:8399/generate
2025-01-08T19:31:46.0157166Z + HTTP_RESPONSE=HTTPSTATUS:000
2025-01-08T19:31:46.0231025Z ##[error]Process completed with exit code 7.

"CodeGen, rocm" test service exits with failure:

Container codegen-tgi-service  Error
dependency failed to start: container codegen-tgi-service exited (1)
Error: Process completed with exit code 1.

As does "CodeTrans, rocm" test service:

  Container codetrans-tgi-service  Error
dependency failed to start: container codetrans-tgi-service exited (1)
Error: Process completed with exit code 1.

And "Translation, rocm" test service:

 Container translation-tgi-service  Error
dependency failed to start: container translation-tgi-service exited (1)
Error: Process completed with exit code 1.

The queried "VisualQnA, rocm" container does not exist:

+ echo '[ lvm-tgi ] HTTP status is not 200. Received status was 000'
+ docker logs visualqna-tgi-service
Error response from daemon: No such container: visualqna-tgi-service
+ exit 1
Error: Process completed with exit code 1.

@eero-t
Copy link
Contributor Author

eero-t commented Jan 9, 2025

@ashahba's ChatQnA example PR (#1363) has an optimization for speeding up the intermediate stage: it curls a tarball of the repo contents instead of git cloning it, as is done here.

While fetching the repo content itself takes about the same time in both cases (6s in my setup), installing curl + deps is 2x faster (9s) than installing git + deps (18s), as the latter pulls in more.

Because that affects only build speed, not final image sizes, and #1369 will replace these changes as soon as the new base image is available, I'll change it only if I need to otherwise update all the Dockerfiles in this PR.
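For reference, the tarball approach could look roughly like this in a builder stage (a hedged sketch; the exact URL, branch, and target path used in #1363 may differ):

```dockerfile
# Fetch a repo snapshot as a tarball; avoids installing git + its
# deps (e.g. perl), and leaves no .git history in the stage
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    mkdir -p /home/user/GenAIComps && \
    curl -fsSL https://github.com/opea-project/GenAIComps/archive/refs/heads/main.tar.gz | \
    tar xz --strip-components=1 -C /home/user/GenAIComps
```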

AudioQnA/Dockerfile Show resolved Hide resolved
@ashahba
Copy link
Collaborator

ashahba commented Jan 9, 2025

@ashahba's ChatQnA example PR (#1363) has an optimization for speeding up the intermediate stage: it curls a tarball of the repo contents instead of git cloning it.

We can still stick with git, and always bring curl back into the equation later.
For now let's focus on getting tests to pass 😃

@mkbhanda
Copy link
Collaborator

mkbhanda commented Jan 9, 2025

@chensuyue and I have reached out to AMD for help with the ROCm failures.

@ashahba
Copy link
Collaborator

ashahba commented Jan 9, 2025

Currently 8 of the 86 tests fail.

All the failures are in services / containers coming from Comps project, not something touched by this PR.

Gaudi & Xeon "AvatarChatbot" tests animation service fails type validation:

   File "/home/user/comps/animation/src/opea_animation_microservice.py", line 54, in animate
    return VideoPath(video_path=outfile)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for VideoPath
video_path
  Input should be a valid string [type=string_type, input_value=<coroutine object OpeaCom...nvoke at 0x7f5ea7ea2110>, input_type=coroutine]

Most of rocm "run-test" cases fail, except for AudioQnA + FaqGen tests.

"ChatQnA, rocm" test fail is a bit of a mystery:

 [2025-01-08 19:04:18,838] [    INFO] - prepare_doc_redis - [ upload ] File dataprep_file.txt does not exist.
/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py:75: DeprecationWarning: Call to deprecated add_document. (deprecated since redisearch 2.0, call hset instead) -- Deprecated since version 2.0.0.
  client.add_document(doc_id="file:" + key, file_name=key, key_ids=value)
[2025-01-08 19:04:20,009] [    INFO] - prepare_doc_redis - [ upload ] Link https://www.ces.tech/ does not exist. Keep storing.
...
 [ tei-rerank ] Content is as expected.
+ validate_service 10.53.22.29:9009/generate generated_text tgi-llm chatqna-tgi-server '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}'
...
++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' -H 'Content-Type: application/json' 10.53.22.29:9009/generate
+ HTTP_RESPONSE=HTTPSTATUS:000
Error: Process completed with exit code 7.

As is "MultimodalQnA, rocm" test failure:

2025-01-08T19:29:01.8774001Z [ retriever-redis ] Content is as expected.
2025-01-08T19:31:46.0053127Z + echo 'Evaluating lvm-llava'
2025-01-08T19:31:46.0054102Z + validate_service http://10.53.22.29:8399/generate '"generated_text":' tgi-llava-rocm-server tgi-llava-rocm-server '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}'
...
2025-01-08T19:31:46.0069882Z ++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' -H 'Content-Type: application/json' http://10.53.22.29:8399/generate
2025-01-08T19:31:46.0157166Z + HTTP_RESPONSE=HTTPSTATUS:000
2025-01-08T19:31:46.0231025Z ##[error]Process completed with exit code 7.

"CodeGen, rocm" test service exits with failure:

Container codegen-tgi-service  Error
dependency failed to start: container codegen-tgi-service exited (1)
Error: Process completed with exit code 1.

As does "CodeTrans, rocm" test service:

  Container codetrans-tgi-service  Error
dependency failed to start: container codetrans-tgi-service exited (1)
Error: Process completed with exit code 1.

And "Translation, rocm" test service:

 Container translation-tgi-service  Error
dependency failed to start: container translation-tgi-service exited (1)
Error: Process completed with exit code 1.

Queried "VisualQnA, rocm" container does not exist:

+ echo '[ lvm-tgi ] HTTP status is not 200. Received status was 000'
+ docker logs visualqna-tgi-service
Error response from daemon: No such container: visualqna-tgi-service
+ exit 1
Error: Process completed with exit code 1.

Now the failures are down to:

  • AvatarChatbot, (Gaudi and Xeon)
  • MultimodalQnA, Rocm
  • Translation, Rocm
  • VisualQnA, Rocm

I suspect this PR really has nothing to do with the failures; most likely they are side effects of refactoring being discovered by this PR, since it touches some containers that were not tested as part of the refactoring.

@chensuyue
Copy link
Collaborator

Currently 8 of the 86 tests fail.

Now the failures are down to:

  • AvatarChatbot, (Gaudi and Xeon)
  • MultimodalQnA, Rocm
  • Translation, Rocm
  • VisualQnA, Rocm

AvatarChatbot will pass after this PR is merged: #1371

The ROCm issue failed with OOB; other PRs have also had this issue since yesterday, and I have asked AMD to handle it.

@chensuyue
Copy link
Collaborator

AvatarChatbot will pass after this PR is merged. #1371 ROCm issue failed with OOB; other PRs also have this issue since yesterday, and I have asked AMD to handle it.

All those issues have been resolved.

Copy link
Collaborator

@ashahba ashahba left a comment


On second thought, I'm going to approve this PR, and we can always add:

apt-get clean autoclean && \
apt-get autoremove -y && \

in the future, or if the base container is merged, we'll just add it to that.
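In Dockerfile terms, that cleanup would be appended to the install RUN, within the same layer (a generic sketch; the git package is just an example, not the PR's actual package list):

```dockerfile
# Cleanup must happen in the same RUN as the install, otherwise the
# removed files still occupy space in the earlier layer
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    apt-get clean autoclean && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
```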

@eero-t
Copy link
Contributor Author

eero-t commented Jan 10, 2025

Out of 86 CI tests, 4 Gaudi tests failed due to container creation/startup issues.

(Which are unrelated to changes done in this PR.)

"ChatQnA, gaudi" and "Translation, gaudi" test errors:

 Container tei-reranking-gaudi-server  Creating
Error response from daemon: Conflict. The container name "/tgi-gaudi-server" is already in use by container "7b44a11ea90647bc6eb285040061ac26c6af2d4db682acd9664e183ae85486de". You have to remove (or rename) that container to be able to reuse that name.
Error: Process completed with exit code 1.

"CodeTrans, gaudi" and "SearchQnA, gaudi" test errors:

  Container llm-textgen-gaudi-server  Starting
Error response from daemon: No such container: 89825e83e8edd701e190aec0d2e5be798c4e55a2d583975991ff8ac41ca2778f
Error: Process completed with exit code 1.

Maybe the different Gaudi test runs are not isolated from each other well enough when this large a number of them is started in parallel?

Could they e.g. be purging containers created by each other's runs, and trying to re-create containers with the same name on the same node?

@chensuyue
Copy link
Collaborator

chensuyue commented Jan 10, 2025

Maybe different test runs are not isolated from each other well enough, when this large number of them is started in parallel. Could they be purging created containers from each others' run, and try to re-create containers with the same name in same node?

We only have one Gaudi machine for several projects' CI, so sometimes there are conflicts between tests from different repos. We can't do a forced image cleanup since tests may run in parallel.

@eero-t
Contributor Author

eero-t commented Jan 10, 2025

We only have 1 gaudi machine for several projects CI, so sometimes there are conflict between the test from different Repo. We can't do force image clean up since the test may run in parallel.

Needing to rerun 86 tests, whose execution takes ~3 hours, in hopes that one of those runs would not hit this race condition, is IMHO not really acceptable. Depending on how likely this CI failure is, it may never pass...

Could CI use, or force the use of, unique names/suffixes for image and container names, and remove them at the end of each test?

Purging could be done when no tests are running (I would imagine there's some time during each day when no CI jobs are running).
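A sketch of the unique-naming idea (the names and environment variables here are hypothetical, not what OPEA CI actually uses):

```shell
# Hypothetical sketch: give each CI run its own docker compose project name,
# so parallel runs sharing one Gaudi node cannot collide on container names.
RUN_ID="${GITHUB_RUN_ID:-local-$$}"   # CI run id, or PID when run by hand
PROJECT="chatqna-ci-${RUN_ID}"

# Compose prefixes container names with the project name, so each run gets
# distinct containers instead of fighting over "/tgi-gaudi-server":
echo "docker compose -p ${PROJECT} up -d"
echo "docker compose -p ${PROJECT} down --remove-orphans"
```

Tearing down with `down --remove-orphans` at the end of each test then only touches that run's containers, instead of relying on a global prune that can race with other jobs.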

@eero-t
Contributor Author

eero-t commented Jan 10, 2025

Could CI use, or force use of unique names / suffixes for the image / container names, and remove them at end of the test?

Purging could be done when there are no tests running (I would imagine there's some time during each day when there are no CI jobs running).

These CI improvements seem unlikely to land in time for the 1.2 release. Alternatives for that are:

  • Merge this PR despite these (unrelated) CI failures, or
  • Split this PR into e.g. 4 separate PRs, in hopes that those are less likely to trigger the CI race condition

@mkbhanda
Collaborator

Could CI use, or force use of unique names / suffixes for the image / container names, and remove them at end of the test?
Purging could be done when there are no tests running (I would imagine there's some time during each day when there are no CI jobs running).

These CI improvement seem unlikely to come in time for 1.2 release. Alternatives for that are:

  • Merge this PR despite these (unrelated) CI failures, or
  • Split this PR e.g. to 4 separate PRs, in hopes that those are less likely to trigger the CI race condition

Would it be better to create a CI-specific registry? There would then be no need to change image names, just where the images are retrieved from. Once the CI run completes, remove that version of the registry.

@eero-t
Contributor Author

eero-t commented Jan 13, 2025

After the merge commit there were many additional CI failures.

The first 2 errors below look like real issues in code merged from main. Potentially also the MultiModalQnA and rocm FaqGen ones; the rest look like the already discussed CI problems...

Dockerfile check:

Missing Dockerfile: GenAIComps/comps/llms/faq-generation/tgi/langchain/Dockerfile (Referenced in GenAIExamples/./FaqGen/docker_compose/intel/cpu/xeon/README.md:22)
Missing Dockerfile: GenAIComps/comps/llms/faq-generation/tgi/langchain/Dockerfile (Referenced in GenAIExamples/./FaqGen/docker_compose/intel/hpu/gaudi/README.md:101)
Error: Process completed with exit code 1.

"AvatarChatbot, gaudi":

 [2025-01-13 04:24:17,864] [    INFO] - speecht5 - SpeechT5 generation begin.
...
 /usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
...
   File "/usr/local/lib/python3.10/dist-packages/basicsr/utils/diffjpeg.py", line 19, in <module>
    y_table = nn.Parameter(torch.from_numpy(y_table))
RuntimeError: Numpy is not available
...
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Result wrong.
Error: Process completed with exit code 1.

"ChatQnA, gaudi":

 Error response from daemon: a prune operation is already running
Error: Process completed with exit code 1.

"CodeGen, gaudi":

 Container llm-textgen-gaudi-server  Starting
Error response from daemon: No such container: 4febf8796f00c1a69e70d4773e0cf9f8296b397805dbcb77aa3ccb2df18d84a6
Error: Process completed with exit code 1.

"CodeGen, rocm":

 Container llm-textgen-gaudi-server  Starting
Error response from daemon: No such container: 4febf8796f00c1a69e70d4773e0cf9f8296b397805dbcb77aa3ccb2df18d84a6
Error: Process completed with exit code 1.

"CodeTrans, rocm":

 Container codetrans-tgi-service  Error
dependency failed to start: container codetrans-tgi-service exited (1)
Error: Process completed with exit code 1.

"FaqGen, gaudi":

  Container llm-faqgen-server  Starting
Error response from daemon: No such container: 699e35b5f55c628f63ab5d40d084926fc5c4de17fefef42e55085c67c6edec35
Error: Process completed with exit code 1.

"FaqGen, rocm":

 resolve : lstat /home/huggingface/OPEA-CICD/actions-runner/_work/GenAIExamples/GenAIExamples/FaqGen/docker_image_build/GenAIComps/comps/llms/faq-generation: no such file or directory
Error: Process completed with exit code 17.

"MultiModalQnA, gaudi":

resolve : lstat /home/sdp/actions-runner-examples/_work/GenAIExamples/GenAIExamples/MultimodalQnA/docker_image_build/GenAIComps/comps/lvms/tgi-llava: no such file or directory
Error: Process completed with exit code 17.

"MultiModalQnA, xeon":

 could not find /home/sdp/GenAIExamples-actions-runner/_work/GenAIExamples/GenAIExamples/MultimodalQnA/docker_image_build/GenAIComps/comps/lvms/llava/dependency: stat /home/sdp/GenAIExamples-actions-runner/_work/GenAIExamples/GenAIExamples/MultimodalQnA/docker_image_build/GenAIComps/comps/lvms/llava/dependency: no such file or directory
Error: Process completed with exit code 17.

"Translation, rocm":

 Error response from daemon: driver failed programming external connectivity on endpoint translation-llm-textgen-server (10a6a8ee815160a5bfd4867ca3e224c276ba0b00090f422b2bd7102974b63df1): Bind for 0.0.0.0:9000 failed: port is already allocated
Error: Process completed with exit code 1.

"VisualQnA, gaudi":

 + echo '[ lvm-tgi ] HTTP status is not 200. Received status was 000'
+ docker logs lvm-tgi-gaudi-server
[ lvm-tgi ] HTTP status is not 200. Received status was 000
Error response from daemon: No such container: lvm-tgi-gaudi-server
+ exit 1
Error: Process completed with exit code 1.

"VisualQnA, rocm":

+ docker logs visualqna-tgi-service
[ lvm-tgi ] HTTP status is not 200. Received status was 500
Error: ShardCannotStart
+ exit 1
Error: Process completed with exit code 1.

"VisualQnA, xeon":

+ docker logs lvm-tgi-xeon-server
Error response from daemon: No such container: lvm-tgi-xeon-server
+ exit 1
Error: Process completed with exit code 1.

So that redundant things do not end up in the final image:
- Git repo history
- Test directories
- Git tool and its deps

And drop explicit installation of:
- jemalloc & GLX: nothing uses them (in ChatQnA at least), and
  for testing it's trivial to create image adding those on top:
  https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator
- langchain_core: GenAIComps installs langchain which already depends on that

This demonstrates that only 2-3 lines in the Dockerfiles are unique,
and everything before those can be removed with a common base image.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
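The staged pattern the commit message describes can be sketched like this (stage names, repo URL, and paths are illustrative, not the PR's exact Dockerfiles):

```dockerfile
# Stage 1: fetch sources with git; this stage never reaches the final image.
FROM python:3.11-slim AS git-repo
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*
RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git \
    /home/user/GenAIComps

# Stage 2: the final image copies only the code it needs -- no git binary,
# no Perl, no .git history, no test directories.
FROM python:3.11-slim
COPY --from=git-repo /home/user/GenAIComps/comps /home/user/comps
```

Only the output of the `COPY --from` lines lands in the final layers, which is why git and its dependencies stop showing up in `docker history` of the final image.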
@eero-t
Contributor Author

eero-t commented Jan 13, 2025

Rebased onto main, as it had some newer commits after the above merge from main was done, in hopes that those fix at least some of the new CI issues. Also squashed the git fix commit into the previous one. No changes in PR content.

@eero-t
Contributor Author

eero-t commented Jan 13, 2025

Rebasing fixed the CI issues, but the numpy regression remains.

It's with wav2lip:

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

(Triggered internally at /npu-stack/pytorch-fork/torch/csrc/utils/tensor_numpy.cpp:84.)
 cpu = _conversion_method_template(device=torch.device("cpu"))
Calling add_step_closure function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling iter_mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
 return isinstance(object, types.FunctionType)
Traceback (most recent call last):
 File "/home/user/comps/third_parties/wav2lip/src/wav2lip_server.py", line 20, in <module>
   from utils import *
...
 File "/usr/local/lib/python3.10/dist-packages/basicsr/utils/diffjpeg.py", line 19, in <module>
   y_table = nn.Parameter(torch.from_numpy(y_table))
RuntimeError: Numpy is not available

And it seems to be caused by this PR: https://github.com/opea-project/GenAIComps/pull/1132/files

As it dropped the numpy v1 pin from the requirements.txt file used by wav2lip: https://github.com/opea-project/GenAIComps/blob/main/comps/third_parties/wav2lip/src/Dockerfile#L56

It has nothing to do with this PR, so it's not a blocker for merging.
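As a sketch of the kind of guard that would have caught this at start-up rather than at first inference (the function name is mine, not from GenAIComps):

```python
def breaks_numpy1_abi(version: str) -> bool:
    """Return True if this NumPy version (2.x or newer) would break
    extensions such as basicsr that were compiled against the 1.x ABI."""
    major = int(version.split(".")[0])
    return major >= 2

# A service could then fail fast with a clear message, e.g.:
# import numpy
# if breaks_numpy1_abi(numpy.__version__):
#     raise RuntimeError("pin 'numpy<2' until compiled deps are rebuilt")
```

The runtime error in the log above ("Numpy is not available") is exactly this ABI mismatch surfacing deep inside torch/basicsr, where it is much harder to diagnose.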

@chensuyue
Collaborator

Rebasing fixed the CI issues, but the numpy regression remains.

It's with wav2lip:

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

(Triggered internally at /npu-stack/pytorch-fork/torch/csrc/utils/tensor_numpy.cpp:84.)
 cpu = _conversion_method_template(device=torch.device("cpu"))
Calling add_step_closure function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling iter_mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
 return isinstance(object, types.FunctionType)
Traceback (most recent call last):
 File "/home/user/comps/third_parties/wav2lip/src/wav2lip_server.py", line 20, in <module>
   from utils import *
...
 File "/usr/local/lib/python3.10/dist-packages/basicsr/utils/diffjpeg.py", line 19, in <module>
   y_table = nn.Parameter(torch.from_numpy(y_table))
RuntimeError: Numpy is not available

And it seems to be caused by this PR: https://github.com/opea-project/GenAIComps/pull/1132/files

As it dropped the numpy v1 pin from the requirements.txt file used by wav2lip: https://github.com/opea-project/GenAIComps/blob/main/comps/third_parties/wav2lip/src/Dockerfile#L56

It has nothing to do with this PR, so it's not a blocker for merging.

@yao531441 please check this issue.

@chensuyue
Collaborator

@yao531441 is checking the issue; we can merge this PR with this issue left open, since it targets v1.2.

@chensuyue chensuyue merged commit 0eae391 into opea-project:main Jan 16, 2025
85 of 86 checks passed
@chensuyue
Collaborator

Fix PR opea-project/GenAIComps#1160

@eero-t eero-t deleted the staged-images branch January 20, 2025 17:41
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Why containers use hundreds of MBs for Vim/Perl/OpenGL?
6 participants