
Use staged builds to minimize final image sizes #1031

Merged
merged 1 commit on Jan 16, 2025

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Oct 25, 2024

Description

Staged image builds so that final images do not have redundant things like:

  • Git tool and its deps (e.g. Perl)
  • Git repo history
  • Test directories

And drop explicit installation of:

  • langchain_core: GenAIComps installs langchain, which already depends on it
  • jemalloc & GLX: nothing uses them (in any of the ChatQnA services), and for testing[1] it's trivial to create separate image adding those on top
  • File descriptor limit increase in ~/.bashrc (as these images run Python programs directly, not through Bash scripts)

=> This demonstrates that only 2-3 lines in each Dockerfile are unique, and everything preceding those could be replaced with a common base image.

[1] I assume those files were there to test this: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator
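The staged approach described above can be sketched roughly as follows (a minimal illustration, not the actual PR diff; the image tags, paths, and app filename are hypothetical):

```dockerfile
# Stage 1: clone the repo and install deps; Git and its deps (e.g. Perl)
# stay in this stage only and never reach the final image
FROM python:3.11-slim AS builder
RUN apt-get update && \
    apt-get install -y --no-install-recommends git
RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git /home/user/GenAIComps
RUN pip install --no-cache-dir --user -r /home/user/GenAIComps/requirements.txt

# Stage 2: copy only what the service needs; Git, the repo history
# and the test directories are left behind in the builder stage
FROM python:3.11-slim
RUN useradd -m user
COPY --from=builder --chown=user /root/.local /home/user/.local
COPY --from=builder --chown=user /home/user/GenAIComps/comps /home/user/comps
COPY chatqna.py /home/user/
USER user
WORKDIR /home/user
ENTRYPOINT ["python", "chatqna.py"]
```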

Issues

Fixes: #225

Type of change

  • Others (image size improvement / Dockerfile cleanup)

Dependencies

n/a (this removes redundant Git, Perl, jemalloc, GLX dependencies from final images)

Tests

This is a draft / example for fixing #225.

I have not tested it apart from verifying that images still build.

Notes

In a proper fix, the non-unique part of the Dockerfiles would be a separate base image, generated with a GenAIComps repo Dockerfile, and the Dockerfiles in this repository would depend on that image instead of python-slim.

However, that requires co-operation between these two repositories (unless the components base image Dockerfile is also in this repo), and:

  • CI handling this dependency, i.e. building the base image first when relevant
  • That base image being in a repository accessible for building the application images
    • E.g. in OPEA Docker hub project

(I.e. it needs to be done by a member of this project, I cannot do it.)

@eero-t
Copy link
Contributor Author

eero-t commented Oct 25, 2024

None of the test failures are due to my changes.

The CodeGen Gaudi TGI test fails because it tries to load a Hugging Face model it has no access rights for:

Access to model meta-llama/CodeLlama-7b-hf is restricted and you are not in the authorized list.
Visit https://huggingface.co/meta-llama/CodeLlama-7b-hf to ask for access.

The CodeGen Xeon TGI test seems to fail due to "Could not import SGMV kernel from Punica", which may be a similar issue.

The VisualQnA Gaudi & Xeon tests fail due to an NPM dependency conflict in its Node.js Svelte UI container build (whose spec is not touched by this PR).

@eero-t
Copy link
Contributor Author

eero-t commented Nov 14, 2024

Rebased this example onto latest main, on the assumption that the CI issues have been fixed in the meantime.

Note: I did not update the Dockerfiles for applications that were added after this PR was created:

  • GraphRAG
  • EdgeCraftRAG

@eero-t
Copy link
Contributor Author

eero-t commented Nov 18, 2024

Also updated the new ChatQnA/Dockerfile.wrapper to a staged build.

Rebased to latest main, as the previously used main failed in CI.

@eero-t
Copy link
Contributor Author

eero-t commented Nov 22, 2024

No idea why guardrails times out:

Waiting for deployment "chatqna-tgi" rollout to finish: 0 of 1 updated replicas are available...
deployment "chatqna-tgi-guardrails" successfully rolled out
error: deployment "chatqna-tgi" exceeded its progress deadline
+ echo 'Timeout waiting for chatqna_guardrail pod ready!'
+ exit 1
Timeout waiting for chatqna_guardrail pod ready!

And translation fails:

curl: (18) transfer closed with outstanding read data remaining
Validate Translation failure!!!

I cannot tell more, as CI does not provide enough information.

@eero-t
Copy link
Contributor Author

eero-t commented Nov 22, 2024

Rebased to main, and also updated the GraphRAG Dockerfile.

EdgeCraftRAG was not updated, because it uses the comps-base package from pip instead of cloning the Comps repo.

@eero-t
Copy link
Contributor Author

eero-t commented Nov 22, 2024

@lvliang-intel CI seems to be in a rather bad state, as CMake is segfaulting on image builds:

 [vllm build 5/7] RUN --mount=type=cache,target=/root/.cache/pip     --mount=type=cache,target=/root/.cache/ccache     --mount=type=bind,source=.git,target=.git     VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel &&     pip install dist/*.whl &&     rm -rf dist:
...
4.853 subprocess.CalledProcessError: Command '['cmake', '/workspace/vllm', '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=cpu', '-DCMAKE_C_COMPILER_LAUNCHER=ccache', '-DCMAKE_CXX_COMPILER_LAUNCHER=ccache', '-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache', '-DCMAKE_HIP_COMPILER_LAUNCHER=ccache', '-DVLLM_PYTHON_EXECUTABLE=/usr/bin/python3', '-DVLLM_PYTHON_PATH=/workspace/vllm:/usr/lib/python310.zip:/usr/lib/python3.10:/usr/lib/python3.10/lib-dynload:/usr/local/lib/python3.10/dist-packages:/usr/lib/python3/dist-packages:/usr/local/lib/python3.10/dist-packages/setuptools/_vendor', '-DFETCHCONTENT_BASE_DIR=/workspace/vllm/.deps', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile', '-DCMAKE_JOB_POOLS:STRING=compile=152']' returned non-zero exit status 1.
5.371 Segmentation fault (core dumped)

@ashahba ashahba self-assigned this Nov 22, 2024
@eero-t
Copy link
Contributor Author

eero-t commented Dec 16, 2024

No changes, just a rebase to latest main in the hope that the CI issues have been fixed in the meantime.

Sadly they have not; the "ChatQnA, gaudi" test still fails, with:

Response check failed, please check the logs in artifacts!
Validate test_manifest_vllm_on_gaudi.sh failure!!!

vLLM seems to return different results than TGI, so I wonder whether the CI test is just out of date?

Copy link

github-actions bot commented Dec 23, 2024

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

@xiguiw
Copy link
Collaborator

xiguiw commented Dec 23, 2024

@eero-t @ashahba

Are you still working on this?

What is the expected image size with and without this PR?

Thanks!

@eero-t
Copy link
Contributor Author

eero-t commented Dec 23, 2024

@eero-t @ashahba

Are you still working on this?

Not actively, but I'm occasionally updating it.

What is the expected image size with and without this PR?

Based on my earlier testing, it reduces the size of each app container by ~350MB. For details, see #225.

But it's just the first step, demonstrating that the only unique part in all these images is the app's Python file.

The end goal of switching to a shared base image will drop the size of all these images from hundreds of MBs to just tens of KBs.

@xiguiw
Copy link
Collaborator

xiguiw commented Dec 26, 2024

Not actively, but I'm occasionally updating it.


@eero-t Great!

Shall we merge this PR and you create a new PR for your next work, or do you want to keep working on this PR?

Thanks!

@eero-t
Copy link
Contributor Author

eero-t commented Dec 30, 2024

Shall we merge this PR and you create a new PR for your next work, or do you want to keep working on this PR?

I think it's better if somebody (else) first creates a GenAIComps repo base image [1], and makes sure that nightly "latest" builds of it end up in Docker Hub. Please add me as a reviewer for such a PR.

I can then just replace all the preliminary stages in Dockerfiles in this PR with that base image.

[1] Comps repo base image Dockerfile = basically the first 38 lines of any Dockerfile included in this PR, but with no need to install Git or pull the Comps repo with it. Just COPYing the relevant dirs is enough.
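Such a Comps base image Dockerfile could look roughly like this (a sketch under the assumption that the image is built from a checkout of the GenAIComps repo; the paths and base tag are illustrative):

```dockerfile
# Built from a GenAIComps repo checkout, so the build context already
# contains the sources and no Git install or clone is needed
FROM python:3.11-slim
RUN useradd -m -s /bin/bash user
# COPY only the relevant dirs instead of pulling the whole repo with Git
COPY --chown=user comps /home/user/comps
COPY --chown=user requirements.txt /home/user/
RUN pip install --no-cache-dir -r /home/user/requirements.txt
USER user
WORKDIR /home/user
```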

@eero-t
Copy link
Contributor Author

eero-t commented Dec 30, 2024

Looking at the recent changes in the GenAIExamples repo, FFmpeg needs to be added to the DocSum image now.

@mkbhanda
Copy link
Collaborator

mkbhanda commented Jan 7, 2025

None of the test failures are due to my changes.

CodeGen Gaudi test TGI fail is due to it trying to load HuggingFace model it has no rights for:

Access to model meta-llama/CodeLlama-7b-hf is restricted and you are not in the authorized list.
Visit https://huggingface.co/meta-llama/CodeLlama-7b-hf to ask for access.

CodeGen Xeon test TGI seems to fail due to: Could not import SGMV kernel from Punica, which may be similar issue.

VisualQnA Gaudi & Xeon tests fail is due to NPM dependency conflict for it's Node.js Svelte UI container build (which spec is not touched by this PR).

@eero-t, are bugs filed for these? At the very least, there should be documentation listing the need to request access for specific models/kernels etc.

@mkbhanda
Copy link
Collaborator

mkbhanda commented Jan 7, 2025

I think it's better if somebody (else) first creates a GenAIComps repo base image, and makes sure that nightly latest builds of it end up in DockerHub.

@eero-t, please submit that base image yourself, and we can have @chensuyue help with publishing it. Completing this slimming of all containers for v1.2 would be wonderful.

@eero-t eero-t force-pushed the staged-images branch 2 times, most recently from b1a16ee to 414d1c6 Compare January 8, 2025 15:43
@eero-t
Copy link
Contributor Author

eero-t commented Jan 8, 2025

Rebased to main, updated the "DocSum" Dockerfile to install FFmpeg, and added the same changes to the "EdgeCraftRAG" Dockerfile.

The "EdgeCraftRAG" Dockerfile.server file is the only one that is not modified. That's because it imports the opea-base module instead of fetching the code from Git.

@eero-t eero-t marked this pull request as ready for review January 8, 2025 15:46
@eero-t eero-t requested a review from Spycsh as a code owner January 8, 2025 15:46
@eero-t
Copy link
Contributor Author

eero-t commented Jan 8, 2025

"apt-get update" in a previous stage was not enough; apparently it needs to be run before every "apt-get install" command.

(Fixed that also for EdgeCraftRAG/Dockerfile.server that I did not otherwise touch.)
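The reason is that each build stage starts from its own base image filesystem, so the apt package lists fetched by "apt-get update" in one stage are not present in another. The common pattern (a generic sketch, not this PR's exact diff; the ffmpeg package is just an example) is to combine update and install in a single RUN:

```dockerfile
# "apt-get update" must run in the same stage as "apt-get install";
# doing both in one RUN, and removing the lists afterwards, also
# keeps the package index out of the resulting image layer
RUN apt-get update && \
    apt-get install -y --no-install-recommends ffmpeg && \
    rm -rf /var/lib/apt/lists/*
```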

@eero-t
Copy link
Contributor Author

eero-t commented Jan 9, 2025

Currently 8 of the 86 tests fail.

All the failures are in services / containers coming from the Comps project, not something touched by this PR.

In the Gaudi & Xeon "AvatarChatbot" tests, the animation service fails type validation:

   File "/home/user/comps/animation/src/opea_animation_microservice.py", line 54, in animate
    return VideoPath(video_path=outfile)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for VideoPath
video_path
  Input should be a valid string [type=string_type, input_value=<coroutine object OpeaCom...nvoke at 0x7f5ea7ea2110>, input_type=coroutine]

Most of the rocm "run-test" cases fail, except for the AudioQnA + FaqGen tests.

The "ChatQnA, rocm" test failure is a bit of a mystery:

 [2025-01-08 19:04:18,838] [    INFO] - prepare_doc_redis - [ upload ] File dataprep_file.txt does not exist.
/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py:75: DeprecationWarning: Call to deprecated add_document. (deprecated since redisearch 2.0, call hset instead) -- Deprecated since version 2.0.0.
  client.add_document(doc_id="file:" + key, file_name=key, key_ids=value)
[2025-01-08 19:04:20,009] [    INFO] - prepare_doc_redis - [ upload ] Link https://www.ces.tech/ does not exist. Keep storing.
...
 [ tei-rerank ] Content is as expected.
+ validate_service 10.53.22.29:9009/generate generated_text tgi-llm chatqna-tgi-server '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}'
...
++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' -H 'Content-Type: application/json' 10.53.22.29:9009/generate
+ HTTP_RESPONSE=HTTPSTATUS:000
Error: Process completed with exit code 7.

As is "MultimodalQnA, rocm" test failure:

2025-01-08T19:29:01.8774001Z [ retriever-redis ] Content is as expected.
2025-01-08T19:31:46.0053127Z + echo 'Evaluating lvm-llava'
2025-01-08T19:31:46.0054102Z + validate_service http://10.53.22.29:8399/generate '"generated_text":' tgi-llava-rocm-server tgi-llava-rocm-server '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}'
...
2025-01-08T19:31:46.0069882Z ++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' -H 'Content-Type: application/json' http://10.53.22.29:8399/generate
2025-01-08T19:31:46.0157166Z + HTTP_RESPONSE=HTTPSTATUS:000
2025-01-08T19:31:46.0231025Z ##[error]Process completed with exit code 7.

"CodeGen, rocm" test service exits with failure:

Container codegen-tgi-service  Error
dependency failed to start: container codegen-tgi-service exited (1)
Error: Process completed with exit code 1.

As does "CodeTrans, rocm" test service:

  Container codetrans-tgi-service  Error
dependency failed to start: container codetrans-tgi-service exited (1)
Error: Process completed with exit code 1.

And "Translation, rocm" test service:

 Container translation-tgi-service  Error
dependency failed to start: container translation-tgi-service exited (1)
Error: Process completed with exit code 1.

The queried "VisualQnA, rocm" container does not exist:

+ echo '[ lvm-tgi ] HTTP status is not 200. Received status was 000'
+ docker logs visualqna-tgi-service
Error response from daemon: No such container: visualqna-tgi-service
+ exit 1
Error: Process completed with exit code 1.

@eero-t
Copy link
Contributor Author

eero-t commented Jan 9, 2025

@ashahba's ChatQnA example PR (#1363) has an optimization for speeding up the intermediate stage: it curls a tarball of the repo contents instead of git cloning it, as is done here.

While fetching the repo content itself takes about the same time in both cases (6s in my setup), installing curl + deps is 2x faster (9s) than installing git + deps (18s), as the latter pulls in more.

Because that affects only build speed, not final image sizes, and #1369 will replace these changes as soon as the new base image is available, I'll change it only if I need to otherwise update all the Dockerfiles in this PR.
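For reference, the tarball approach could look roughly like this in a builder stage (a hedged sketch; the exact URL, branch, and target path used in #1363 may differ):

```dockerfile
# Fetch a repo snapshot as a tarball; avoids installing git + its
# deps (e.g. perl), and leaves no .git history in the stage
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    mkdir -p /home/user/GenAIComps && \
    curl -fsSL https://github.com/opea-project/GenAIComps/archive/refs/heads/main.tar.gz | \
    tar xz --strip-components=1 -C /home/user/GenAIComps
```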

AudioQnA/Dockerfile Show resolved Hide resolved
@ashahba
Copy link
Collaborator

ashahba commented Jan 9, 2025

@ashahba's ChatQnA example PR (#1363) has an optimization for speeding up the intermediate stage: it curls a tarball of the repo contents instead of git cloning it.

We can still stick with git, and always bring curl back into the equation later.
For now let's focus on getting tests to pass 😃

@mkbhanda
Copy link
Collaborator

mkbhanda commented Jan 9, 2025

@chensuyue and I have reached out to AMD for help with the ROCm failures.

@ashahba
Copy link
Collaborator

ashahba commented Jan 9, 2025

Currently 8 of the 86 tests fail.

All the failures are in services / containers coming from Comps project, not something touched by this PR.

Gaudi & Xeon "AvatarChatbot" tests animation service fails type validation:

   File "/home/user/comps/animation/src/opea_animation_microservice.py", line 54, in animate
    return VideoPath(video_path=outfile)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for VideoPath
video_path
  Input should be a valid string [type=string_type, input_value=<coroutine object OpeaCom...nvoke at 0x7f5ea7ea2110>, input_type=coroutine]

Most of rocm "run-test" cases fail, except for AudioQnA + FaqGen tests.

"ChatQnA, rocm" test fail is a bit of a mystery:

 [2025-01-08 19:04:18,838] [    INFO] - prepare_doc_redis - [ upload ] File dataprep_file.txt does not exist.
/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py:75: DeprecationWarning: Call to deprecated add_document. (deprecated since redisearch 2.0, call hset instead) -- Deprecated since version 2.0.0.
  client.add_document(doc_id="file:" + key, file_name=key, key_ids=value)
[2025-01-08 19:04:20,009] [    INFO] - prepare_doc_redis - [ upload ] Link https://www.ces.tech/ does not exist. Keep storing.
...
 [ tei-rerank ] Content is as expected.
+ validate_service 10.53.22.29:9009/generate generated_text tgi-llm chatqna-tgi-server '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}'
...
++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' -H 'Content-Type: application/json' 10.53.22.29:9009/generate
+ HTTP_RESPONSE=HTTPSTATUS:000
Error: Process completed with exit code 7.

As is "MultimodalQnA, rocm" test failure:

2025-01-08T19:29:01.8774001Z [ retriever-redis ] Content is as expected.
2025-01-08T19:31:46.0053127Z + echo 'Evaluating lvm-llava'
2025-01-08T19:31:46.0054102Z + validate_service http://10.53.22.29:8399/generate '"generated_text":' tgi-llava-rocm-server tgi-llava-rocm-server '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}'
...
2025-01-08T19:31:46.0069882Z ++ curl --silent --write-out 'HTTPSTATUS:%{http_code}' -X POST -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' -H 'Content-Type: application/json' http://10.53.22.29:8399/generate
2025-01-08T19:31:46.0157166Z + HTTP_RESPONSE=HTTPSTATUS:000
2025-01-08T19:31:46.0231025Z ##[error]Process completed with exit code 7.

"CodeGen, rocm" test service exits with failure:

Container codegen-tgi-service  Error
dependency failed to start: container codegen-tgi-service exited (1)
Error: Process completed with exit code 1.

As does "CodeTrans, rocm" test service:

  Container codetrans-tgi-service  Error
dependency failed to start: container codetrans-tgi-service exited (1)
Error: Process completed with exit code 1.

And "Translation, rocm" test service:

 Container translation-tgi-service  Error
dependency failed to start: container translation-tgi-service exited (1)
Error: Process completed with exit code 1.

Queried "VisualQnA, rocm" container does not exist:

+ echo '[ lvm-tgi ] HTTP status is not 200. Received status was 000'
+ docker logs visualqna-tgi-service
Error response from daemon: No such container: visualqna-tgi-service
+ exit 1
Error: Process completed with exit code 1.

Now the failures are down to:

  • AvatarChatbot, (Gaudi and Xeon)
  • MultimodalQnA, Rocm
  • Translation, Rocm
  • VisualQnA, Rocm

I suspect this PR really has nothing to do with the failures; most likely they are side effects of refactoring being discovered by this PR, since it touches some containers that were not tested as part of the refactoring.

@chensuyue
Copy link
Collaborator

Currently 8 of the 86 tests fail.

Now the failures are down to:

  • AvatarChatbot, (Gaudi and Xeon)
  • MultimodalQnA, Rocm
  • Translation, Rocm
  • VisualQnA, Rocm

AvatarChatbot will pass after this PR is merged: #1371

The ROCm issue failed with OOB; other PRs have also had this issue since yesterday, and I have asked AMD to handle it.

@chensuyue
Copy link
Collaborator

AvatarChatbot will pass after this PR is merged. #1371 ROCm issue failed with OOB; other PRs also have this issue since yesterday, and I have asked AMD to handle it.

All those issues have been resolved.

Copy link
Collaborator

@ashahba ashahba left a comment


On second thought, I'm going to approve this PR, and we can always add:

apt-get clean autoclean && \
apt-get autoremove -y && \

in the future, or if the base container is merged, we'll just add it to that.
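In Dockerfile terms, that cleanup would be appended to the install RUN, within the same layer (a generic sketch; the git package is just an example, not the PR's actual package list):

```dockerfile
# Cleanup must happen in the same RUN as the install, otherwise the
# removed files still occupy space in the earlier layer
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    apt-get clean autoclean && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
```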

@eero-t
Copy link
Contributor Author

eero-t commented Jan 10, 2025

Out of 86 CI tests, 4 Gaudi tests failed due to container creation/startup issues.

(Which are unrelated to changes done in this PR.)

"ChatQnA, gaudi" and "Translation, gaudi" test errors:

 Container tei-reranking-gaudi-server  Creating
Error response from daemon: Conflict. The container name "/tgi-gaudi-server" is already in use by container "7b44a11ea90647bc6eb285040061ac26c6af2d4db682acd9664e183ae85486de". You have to remove (or rename) that container to be able to reuse that name.
Error: Process completed with exit code 1.

"CodeTrans, gaudi" and "SearchQnA, gaudi" test errors:

  Container llm-textgen-gaudi-server  Starting
Error response from daemon: No such container: 89825e83e8edd701e190aec0d2e5be798c4e55a2d583975991ff8ac41ca2778f
Error: Process completed with exit code 1.

Maybe the different Gaudi test runs are not isolated from each other well enough when this large a number of them is started in parallel?

Could they e.g. be purging containers created by each other's runs, and trying to re-create containers with the same name on the same node?

@chensuyue
Copy link
Collaborator

chensuyue commented Jan 10, 2025

Maybe different test runs are not isolated from each other well enough, when this large number of them is started in parallel. Could they be purging created containers from each others' run, and try to re-create containers with the same name in same node?

We only have one Gaudi machine for several projects' CI, so sometimes there are conflicts between tests from different repos. We can't do a forced image cleanup since tests may run in parallel.

@eero-t
Contributor Author

eero-t commented Jan 10, 2025

We only have 1 gaudi machine for several projects CI, so sometimes there are conflict between the test from different Repo. We can't do force image clean up since the test may run in parallel.

Needing to rerun 86 tests, whose execution takes ~3 hours, in hopes that one of those runs would not hit this race condition, is IMHO not really acceptable. Depending on how likely this CI failure is, it may never pass...

Could CI use, or force the use of, unique names/suffixes for image and container names, and remove them at the end of each test?

Purging could be done when no tests are running (I would imagine there's some time during each day when no CI jobs are running).
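A sketch of the unique-naming idea (the names and environment variables here are hypothetical, not what OPEA CI actually uses):

```shell
# Hypothetical sketch: give each CI run its own docker compose project name,
# so parallel runs sharing one Gaudi node cannot collide on container names.
RUN_ID="${GITHUB_RUN_ID:-local-$$}"   # CI run id, or PID when run by hand
PROJECT="chatqna-ci-${RUN_ID}"

# Compose prefixes container names with the project name, so each run gets
# distinct containers instead of fighting over "/tgi-gaudi-server":
echo "docker compose -p ${PROJECT} up -d"
echo "docker compose -p ${PROJECT} down --remove-orphans"
```

Tearing down with `down --remove-orphans` at the end of each test then only touches that run's containers, instead of relying on a global prune that can race with other jobs.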

@eero-t
Contributor Author

eero-t commented Jan 10, 2025

Could CI use, or force use of unique names / suffixes for the image / container names, and remove them at end of the test?

Purging could be done when there are no tests running (I would imagine there's some time during each day when there are no CI jobs running).

These CI improvements seem unlikely to land in time for the 1.2 release. Alternatives for that are:

  • Merge this PR despite these (unrelated) CI failures, or
  • Split this PR into e.g. 4 separate PRs, in hopes that those are less likely to trigger the CI race condition

@mkbhanda
Collaborator

Could CI use, or force use of unique names / suffixes for the image / container names, and remove them at end of the test?
Purging could be done when there are no tests running (I would imagine there's some time during each day when there are no CI jobs running).

These CI improvement seem unlikely to come in time for 1.2 release. Alternatives for that are:

  • Merge this PR despite these (unrelated) CI failures, or
  • Split this PR e.g. to 4 separate PRs, in hopes that those are less likely to trigger the CI race condition

Would it be better to create a CI-specific registry? There would then be no need to change image names, just where the images are retrieved from. Once the CI run completes, remove that version of the registry.

@eero-t
Contributor Author

eero-t commented Jan 13, 2025

After the merge commit there were many additional CI failures.

The first 2 errors below look like real issues in code merged from main. Potentially also the MultiModalQnA and rocm FaqGen ones; the rest look like the already discussed CI problems...

Dockerfile check:

Missing Dockerfile: GenAIComps/comps/llms/faq-generation/tgi/langchain/Dockerfile (Referenced in GenAIExamples/./FaqGen/docker_compose/intel/cpu/xeon/README.md:22)
Missing Dockerfile: GenAIComps/comps/llms/faq-generation/tgi/langchain/Dockerfile (Referenced in GenAIExamples/./FaqGen/docker_compose/intel/hpu/gaudi/README.md:101)
Error: Process completed with exit code 1.

"AvatarChatbot, gaudi":

 [2025-01-13 04:24:17,864] [    INFO] - speecht5 - SpeechT5 generation begin.
...
 /usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
...
   File "/usr/local/lib/python3.10/dist-packages/basicsr/utils/diffjpeg.py", line 19, in <module>
    y_table = nn.Parameter(torch.from_numpy(y_table))
RuntimeError: Numpy is not available
...
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Result wrong.
Error: Process completed with exit code 1.

"ChatQnA, gaudi":

 Error response from daemon: a prune operation is already running
Error: Process completed with exit code 1.

"CodeGen, gaudi":

 Container llm-textgen-gaudi-server  Starting
Error response from daemon: No such container: 4febf8796f00c1a69e70d4773e0cf9f8296b397805dbcb77aa3ccb2df18d84a6
Error: Process completed with exit code 1.

"CodeGen, rocm":

 Container llm-textgen-gaudi-server  Starting
Error response from daemon: No such container: 4febf8796f00c1a69e70d4773e0cf9f8296b397805dbcb77aa3ccb2df18d84a6
Error: Process completed with exit code 1.

"CodeTrans, rocm":

 Container codetrans-tgi-service  Error
dependency failed to start: container codetrans-tgi-service exited (1)
Error: Process completed with exit code 1.

"FaqGen, gaudi":

  Container llm-faqgen-server  Starting
Error response from daemon: No such container: 699e35b5f55c628f63ab5d40d084926fc5c4de17fefef42e55085c67c6edec35
Error: Process completed with exit code 1.

"FaqGen, rocm":

 resolve : lstat /home/huggingface/OPEA-CICD/actions-runner/_work/GenAIExamples/GenAIExamples/FaqGen/docker_image_build/GenAIComps/comps/llms/faq-generation: no such file or directory
Error: Process completed with exit code 17.

"MultiModalQnA, gaudi":

resolve : lstat /home/sdp/actions-runner-examples/_work/GenAIExamples/GenAIExamples/MultimodalQnA/docker_image_build/GenAIComps/comps/lvms/tgi-llava: no such file or directory
Error: Process completed with exit code 17.

"MultiModalQnA, xeon":

 could not find /home/sdp/GenAIExamples-actions-runner/_work/GenAIExamples/GenAIExamples/MultimodalQnA/docker_image_build/GenAIComps/comps/lvms/llava/dependency: stat /home/sdp/GenAIExamples-actions-runner/_work/GenAIExamples/GenAIExamples/MultimodalQnA/docker_image_build/GenAIComps/comps/lvms/llava/dependency: no such file or directory
Error: Process completed with exit code 17.

"Translation, rocm":

 Error response from daemon: driver failed programming external connectivity on endpoint translation-llm-textgen-server (10a6a8ee815160a5bfd4867ca3e224c276ba0b00090f422b2bd7102974b63df1): Bind for 0.0.0.0:9000 failed: port is already allocated
Error: Process completed with exit code 1.

"VisualQnA, gaudi":

 + echo '[ lvm-tgi ] HTTP status is not 200. Received status was 000'
+ docker logs lvm-tgi-gaudi-server
[ lvm-tgi ] HTTP status is not 200. Received status was 000
Error response from daemon: No such container: lvm-tgi-gaudi-server
+ exit 1
Error: Process completed with exit code 1.

"VisualQnA, rocm":

+ docker logs visualqna-tgi-service
[ lvm-tgi ] HTTP status is not 200. Received status was 500
Error: ShardCannotStart
+ exit 1
Error: Process completed with exit code 1.

"VisualQnA, xeon":

+ docker logs lvm-tgi-xeon-server
Error response from daemon: No such container: lvm-tgi-xeon-server
+ exit 1
Error: Process completed with exit code 1.

So that redundant things do not end up in the final image:
- Git repo history
- Test directories
- Git tool and its deps

And drop explicit installation of:
- jemalloc & GLX: nothing uses them (in ChatQnA at least), and
  for testing it's trivial to create image adding those on top:
  https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator
- langchain_core: GenAIComps installs langchain which already depends on that

This demonstrates that only 2-3 lines in the Dockerfiles are unique,
and everything before those can be removed with a common base image.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
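The staged pattern the commit message describes can be sketched like this (stage names, repo URL, and paths are illustrative, not the PR's exact Dockerfiles):

```dockerfile
# Stage 1: fetch sources with git; this stage never reaches the final image.
FROM python:3.11-slim AS git-repo
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*
RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git \
    /home/user/GenAIComps

# Stage 2: the final image copies only the code it needs -- no git binary,
# no Perl, no .git history, no test directories.
FROM python:3.11-slim
COPY --from=git-repo /home/user/GenAIComps/comps /home/user/comps
```

Only the output of the `COPY --from` lines lands in the final layers, which is why git and its dependencies stop showing up in `docker history` of the final image.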
@eero-t
Contributor Author

eero-t commented Jan 13, 2025

Rebased onto main, as it had some newer commits after the above merge from main was done, in hopes that those fix at least some of the new CI issues. Also squashed the git fix commit into the previous one. No changes in PR content.

@eero-t
Contributor Author

eero-t commented Jan 13, 2025

Rebasing fixed the CI issues, but the numpy regression remains.

It's with wav2lip:

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

(Triggered internally at /npu-stack/pytorch-fork/torch/csrc/utils/tensor_numpy.cpp:84.)
 cpu = _conversion_method_template(device=torch.device("cpu"))
Calling add_step_closure function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling iter_mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
 return isinstance(object, types.FunctionType)
Traceback (most recent call last):
 File "/home/user/comps/third_parties/wav2lip/src/wav2lip_server.py", line 20, in <module>
   from utils import *
...
 File "/usr/local/lib/python3.10/dist-packages/basicsr/utils/diffjpeg.py", line 19, in <module>
   y_table = nn.Parameter(torch.from_numpy(y_table))
RuntimeError: Numpy is not available

And it seems to be caused by this PR: https://github.com/opea-project/GenAIComps/pull/1132/files

As it dropped the numpy v1 pin from the requirements.txt file used by wav2lip: https://github.com/opea-project/GenAIComps/blob/main/comps/third_parties/wav2lip/src/Dockerfile#L56

It has nothing to do with this PR, so it's not a blocker for merging.
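As a sketch of the kind of guard that would have caught this at start-up rather than at first inference (the function name is mine, not from GenAIComps):

```python
def breaks_numpy1_abi(version: str) -> bool:
    """Return True if this NumPy version (2.x or newer) would break
    extensions such as basicsr that were compiled against the 1.x ABI."""
    major = int(version.split(".")[0])
    return major >= 2

# A service could then fail fast with a clear message, e.g.:
# import numpy
# if breaks_numpy1_abi(numpy.__version__):
#     raise RuntimeError("pin 'numpy<2' until compiled deps are rebuilt")
```

The runtime error in the log above ("Numpy is not available") is exactly this ABI mismatch surfacing deep inside torch/basicsr, where it is much harder to diagnose.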

@chensuyue
Collaborator

Rebasing fixed the CI issues, but the numpy regression remains.

It's with wav2lip:

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

(Triggered internally at /npu-stack/pytorch-fork/torch/csrc/utils/tensor_numpy.cpp:84.)
 cpu = _conversion_method_template(device=torch.device("cpu"))
Calling add_step_closure function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
Calling iter_mark_step function does not have any effect. It's lazy mode only functionality. (warning logged once)
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
 return isinstance(object, types.FunctionType)
Traceback (most recent call last):
 File "/home/user/comps/third_parties/wav2lip/src/wav2lip_server.py", line 20, in <module>
   from utils import *
...
 File "/usr/local/lib/python3.10/dist-packages/basicsr/utils/diffjpeg.py", line 19, in <module>
   y_table = nn.Parameter(torch.from_numpy(y_table))
RuntimeError: Numpy is not available

And it seems to be caused by this PR: https://github.com/opea-project/GenAIComps/pull/1132/files

As it dropped the numpy v1 pin from the requirements.txt file used by wav2lip: https://github.com/opea-project/GenAIComps/blob/main/comps/third_parties/wav2lip/src/Dockerfile#L56

It has nothing to do with this PR, so it's not a blocker for merging.

@yao531441 please check this issue.

@chensuyue
Collaborator

@yao531441 is checking the issue; we can merge this PR with this issue left open, since it targets v1.2.

@chensuyue chensuyue merged commit 0eae391 into opea-project:main Jan 16, 2025
85 of 86 checks passed
@chensuyue
Collaborator

Fix PR opea-project/GenAIComps#1160

@eero-t eero-t deleted the staged-images branch January 20, 2025 17:41
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Why containers use hundreds of MBs for Vim/Perl/OpenGL?
6 participants