
Update CodeGen Xeon and Gaudi Kubernetes codegen.yaml and docs #595

Closed
wants to merge 61 commits into from


dmsuehir
Contributor

Description

This PR has a few updates based on issues that I ran into when deploying the CodeGen example on a cluster with Xeon and Gaudi nodes. The PR addresses the following issues:

  • I added a note about using a persistent volume claim as an alternative to creating the /mnt/opea-models directory on each node.
  • Deploying the codegen.yaml files gave an error like:
    error: error validating "codegen.yaml": error validating data: [unknown object type "nil" in ConfigMap.data.http_proxy, unknown object type "nil" in ConfigMap.data.https_proxy, unknown object type "nil" in ConfigMap.data.no_proxy]; if you choose to ignore these errors, turn validation off with --validate=false
    
    This error occurs because the ConfigMap in the YAML has a few env vars with no value (nil). Setting them to empty quotes "" fixes the issue.
  • I added a note that the service takes a couple of minutes to start, along with how to check the logs. I ran into a case where the curl command failed with "curl: (18) transfer closed with outstanding read data remaining", and the only cause was that the service wasn't ready yet. Knowing how to check the logs is also useful for watching the startup status and for telling whether a failing curl command is caused by a real error.
  • Running on Gaudi wasn't working for me ("RuntimeError: synStatus=26 [Generic failure] Device acquire failed.") until I added hugepages-2Mi and memory to the resource limits. The Habana documentation for Kubernetes shows hugepages-2Mi and memory in the resources, so that seems to be the recommended config.
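
For reference, the two manifest fixes described above might look roughly like the sketch below. This is an illustration, not the contents of codegen.yaml: the ConfigMap name, Pod name, container name, image reference, and the hugepages-2Mi and memory sizes are all placeholders I made up, and the right sizes depend on the node and the model. The parts taken from the description are the empty-string proxy values and the presence of hugepages-2Mi and memory under the resource limits next to the Gaudi device.

```yaml
# Sketch only: names, image, and sizes are placeholders, not values from codegen.yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: codegen-example-config        # hypothetical name
data:
  # A bare `http_proxy:` parses as null and fails validation with
  # 'unknown object type "nil"'. An explicit empty string is a valid value.
  http_proxy: ""
  https_proxy: ""
  no_proxy: ""
---
apiVersion: v1
kind: Pod
metadata:
  name: codegen-gaudi-example         # hypothetical name
spec:
  containers:
    - name: tgi                       # hypothetical container name
      image: example.com/tgi-gaudi:placeholder   # placeholder image reference
      envFrom:
        - configMapRef:
            name: codegen-example-config
      resources:
        limits:
          habana.ai/gaudi: 1          # one Gaudi device from the Habana device plugin
          hugepages-2Mi: 4Gi          # placeholder size; tune for the node
          memory: 64Gi                # placeholder size; tune for the model
```

Because only limits are given, Kubernetes sets the requests to the same values, which satisfies the rule that huge page requests must equal their limits.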

Issues

N/A

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

N/A

Tests

Manually tested the changes on a Kubernetes cluster with Xeon (GNR) and Gaudi 2 nodes.

Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Collaborator

@ashahba ashahba left a comment


LGTM!

Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
dmsuehir and others added 18 commits August 15, 2024 11:33
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
* add AudioQnA example via GMC.
Signed-off-by: zhlsunshine <huailong.zhang@intel.com>

* add more information for e2e test scripts.
Signed-off-by: zhlsunshine <huailong.zhang@intel.com>

* fix bug in e2e test scripts.
Signed-off-by: zhlsunshine <huailong.zhang@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
…ed (opea-project#608)

Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: letonghan <letong.han@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
* Update Dockerfile to use LANGCHAIN_VERSION argument

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>

* Revert "Update Dockerfile to use LANGCHAIN_VERSION argument"

This reverts commit 1bff239.

* chore: Add manual freeze images workflow

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* split jobs

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>

---------

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: Yingchun Guo <yingchun.guo@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
zhlsunshine and others added 28 commits August 16, 2024 09:34
…roject#589)

Signed-off-by: zhlsunshine <huailong.zhang@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
* fixed ISSUE-528

Signed-off-by: jaswanth8888 <karani.jaswanth@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jaswanth8888 <karani.jaswanth@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: Yingchun Guo <yingchun.guo@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: Xinyao Wang <xinyao.wang@intel.com>
Co-authored-by: chen, suyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: letonghan <letong.han@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: Yingchun Guo <yingchun.guo@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
…xamples into dina/codegen_proxy_empty

Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
@dmsuehir
Contributor Author

Reopened as PR #613 because the rebase seems to have gone wrong when I tried to fix the DCO check.

@dmsuehir dmsuehir closed this Aug 16, 2024