[ChatQnA] Support the replica tuning for ChatQnA (#116)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent cf8bd83, commit 484b69a
Showing 22 changed files with 3,117 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
# Auto-Tuning for ChatQnA: Optimizing Resource Allocation in Kubernetes | ||
|
||
This document describes the Auto-Tuning framework, a tool that streamlines deployment planning for resource-intensive services, particularly ChatQnA. It uses Kubernetes for container orchestration and combines experimental data with our prior knowledge to fine-tune deployments for performance.
|
||
## Key Features | ||
* Hardware Efficiency: Focuses on adjusting replica counts and maximizing the utilization of CPU and HPU (Habana Processing Unit) resources. | ||
|
||
* Theoretical and Experimental Optimization: Integrates theoretical best practices with our prior knowledge to ensure optimal resource allocation for services. | ||
|
||
# Usage | ||
|
||
To generate the strategy.json configuration file for deployment, use the following command: | ||
|
||
|
||
```bash | ||
# Kubernetes Deployment | ||
python3 tuning.py --tuning_config replica_tuning_config.json --hardware_info hardware_info_gaudi.json --service_info chatqna_neuralchat_rerank_latest.yaml | ||
|
||
# Note: Add --config_only to output deployment configs only. | ||
``` | ||
|
||
## Configuration Files | ||
1. hardware_info_gaudi.json: Specifies the hardware details (CPU, HPU, etc.). | ||
|
||
2. chatqna_neuralchat_rerank_latest.yaml: Contains service deployment information. | ||
|
||
3. replica_tuning_config.json: Customizes tuning parameters for replica counts and granularity.
|
||
### hardware_info_gaudi.json
This file lists only the hardware devices to be used in deployment. | ||
|
||
```json | ||
{ | ||
"device_0": { | ||
"ip": ["10.239.1.5", "10.239.10.6"], | ||
"type": "hpu", | ||
"sockets": 2, | ||
"cores_per_socket": 64, | ||
"num_cards": 8 | ||
} | ||
} | ||
``` | ||
Please refer to `hardware_info_gaudi.json` for more details. | ||
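
As a rough illustration of how the hardware description above could be read, the following sketch totals nodes, CPU cores, and accelerator cards per device type. It is not taken from `tuning.py`. The function name `summarize_hardware` is hypothetical, and the sketch assumes that each entry in `ip` is one node and that `sockets`, `cores_per_socket`, and `num_cards` are per-node values.

```python
import json

# The example hardware description from this section, embedded as a string
# so that the sketch is self-contained.
HARDWARE_INFO = """
{
  "device_0": {
    "ip": ["10.239.1.5", "10.239.10.6"],
    "type": "hpu",
    "sockets": 2,
    "cores_per_socket": 64,
    "num_cards": 8
  }
}
"""


def summarize_hardware(info: dict) -> dict:
    """Aggregate node, core, and card counts per device type.

    Assumption (not confirmed by the source): every IP is one node, and
    sockets, cores_per_socket, and num_cards describe a single node.
    """
    summary = {}
    for device in info.values():
        nodes = len(device["ip"])
        totals = summary.setdefault(
            device["type"], {"nodes": 0, "cores": 0, "cards": 0}
        )
        totals["nodes"] += nodes
        totals["cores"] += nodes * device["sockets"] * device["cores_per_socket"]
        # CPU-only devices may omit num_cards, so default to zero.
        totals["cards"] += nodes * device.get("num_cards", 0)
    return summary


summary = summarize_hardware(json.loads(HARDWARE_INFO))
print(summary)
```

Under these assumptions the example describes two nodes with 256 CPU cores and 16 HPU cards in total. If the fields turn out to be cluster-wide totals instead, drop the multiplication by `nodes`.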
|
||
### chatqna_neuralchat_rerank_latest.yaml | ||
This file includes all services that will be deployed. | ||
```yaml | ||
opea_micro_services: | ||
data_prep: | ||
... ... | ||
embedding: | ||
... ... | ||
|
||
reranking: | ||
... ... | ||
|
||
llm: | ||
opea/llm-tgi: | ||
tag: latest | ||
type: cpu | ||
dependency: | ||
ghcr.io/huggingface/tgi-gaudi: | ||
tag: 2.0.4 | ||
type: hpu | ||
requirements: | ||
model_id: "Intel/neural-chat-7b-v3-3" | ||
|
||
opea_mega_service: | ||
opea/chatqna: | ||
tag: latest | ||
type: cpu | ||
``` | ||
Please refer to `chatqna_neuralchat_rerank_latest.yaml` for more details. | ||
|
||
### Tuning Config Parameters | ||
|
||
`embedding_replicas_granularity = 1`: This defines the step size for scaling the number of replicas for the embedding server. | ||
* Value (1): Each scaling operation increases or decreases the number of replicas by 1 at a time. | ||
|
||
`embedding_replicas_min = 1`: This sets the minimum number of replicas allowed for the embedding server. | ||
* Value (1): The service will always have at least 1 replica running, ensuring that it is available for deployment. | ||
|
||
`embedding_replicas_max = 4`: This defines the maximum number of replicas allowed for the embedding server. | ||
* Value (4): The service can be scaled up to a maximum of 4 replicas, limiting resource consumption and avoiding over-provisioning. | ||
|
||
`microservice_replicas_granularity = 1`: This specifies the scaling step size for the other microservices (such as retrieval and dataprep).
* Value (1): As with `embedding_replicas_granularity`, the number of replicas for these microservices changes by 1 at a time.
|
||
`microservice_replicas_min = 1`: This parameter sets the minimum number of replicas for these microservices. | ||
* Value (1): Ensures that each microservice always has at least 1 replica running. | ||
|
||
`microservice_replicas_max = 4`: This defines the upper limit for scaling replicas for these microservices. | ||
* Value (4): The maximum number of replicas allowed for the microservices is 4. | ||
|
||
|
||
To override the default tuning parameters, create a `replica_tuning_config.json` file and pass it with `--tuning_config`. For example:
|
||
```json | ||
{ | ||
"embedding_replicas_granularity": 1, | ||
"embedding_replicas_min": 1, | ||
"embedding_replicas_max": 4, | ||
"microservice_replicas_granularity": 1, | ||
"microservice_replicas_min": 1, | ||
"microservice_replicas_max": 4 | ||
} | ||
``` | ||
Please refer to `replica_tuning_config.json` for more details. | ||
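
To make the interplay of the minimum, maximum, and granularity parameters concrete, here is a minimal sketch that enumerates the replica counts a search could try. It only illustrates the parameter semantics described above and is not the actual logic of `tuning.py`. The helper names `merge_tuning_config` and `candidate_replicas` are hypothetical, and the sketch assumes that the values listed above are the defaults.

```python
# Assumed defaults, copied from the parameter list in this section.
DEFAULT_TUNING_CONFIG = {
    "embedding_replicas_granularity": 1,
    "embedding_replicas_min": 1,
    "embedding_replicas_max": 4,
    "microservice_replicas_granularity": 1,
    "microservice_replicas_min": 1,
    "microservice_replicas_max": 4,
}


def merge_tuning_config(overrides: dict) -> dict:
    """Overlay user-supplied values on the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULT_TUNING_CONFIG)
    if unknown:
        raise KeyError(f"unknown tuning parameters: {sorted(unknown)}")
    return {**DEFAULT_TUNING_CONFIG, **overrides}


def candidate_replicas(config: dict, prefix: str) -> list:
    """List replica counts from min to max in steps of granularity.

    prefix is "embedding" or "microservice". The maximum is only included
    when it is reachable from the minimum in whole steps.
    """
    step = config[f"{prefix}_replicas_granularity"]
    low = config[f"{prefix}_replicas_min"]
    high = config[f"{prefix}_replicas_max"]
    if step < 1 or low < 1 or low > high:
        raise ValueError(f"invalid replica range for {prefix}")
    return list(range(low, high + 1, step))


config = merge_tuning_config({"microservice_replicas_granularity": 2})
embedding_candidates = candidate_replicas(config, "embedding")
microservice_candidates = candidate_replicas(config, "microservice")
print(embedding_candidates, microservice_candidates)
```

With the defaults, the embedding server has four candidate replica counts (1 through 4). Raising the microservice granularity to 2 halves the search space to 1 and 3, at the cost of never trying 2 or 4.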
|
||
## Output | ||
|
||
The output of the auto-tuning process includes two key components: | ||
1. strategy_files: Contains optimized configurations for deploying services, such as replica counts and hardware resource allocations. | ||
|
||
2. K8S manifests: Provides the Kubernetes deployment specifications, including pod definitions and resource limits, ready for deployment. | ||
|
||
Example of a strategy file: | ||
```json | ||
{ | ||
"embedding-dependency": { | ||
"type": "cpu", | ||
"image": "ghcr.io/huggingface/text-embeddings-inference:cpu-1.5", | ||
"model_id": "BAAI/bge-base-en-v1.5", | ||
"replica": 1 | ||
}, | ||
"llm-microservice": { | ||
"type": "cpu", | ||
"image": "opea/llm-tgi:latest", | ||
"replica": 4 | ||
}, | ||
... ... | ||
"reranking-dependency": { | ||
"type": "hpu", | ||
"image": "opea/tei-gaudi:latest", | ||
"model_id": "BAAI/bge-reranker-base", | ||
"replica": 1, | ||
"cards": 1 | ||
}, | ||
"chatqna_mega_service": { | ||
"image": "opea/chatqna:latest", | ||
"type": "cpu", | ||
"replica": 4 | ||
} | ||
} | ||
``` | ||
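
A strategy file like the one above can be checked before deployment. The sketch below totals replicas and accelerator cards per device type, using only the entries shown in the example (the elided entries are left out). The function name `summarize_strategy` is hypothetical, and the sketch assumes that `cards` is a per-replica count, which the source does not state.

```python
# Excerpt of the example strategy file; the elided entries are not reproduced.
STRATEGY = {
    "embedding-dependency": {
        "type": "cpu",
        "image": "ghcr.io/huggingface/text-embeddings-inference:cpu-1.5",
        "model_id": "BAAI/bge-base-en-v1.5",
        "replica": 1,
    },
    "llm-microservice": {
        "type": "cpu",
        "image": "opea/llm-tgi:latest",
        "replica": 4,
    },
    "reranking-dependency": {
        "type": "hpu",
        "image": "opea/tei-gaudi:latest",
        "model_id": "BAAI/bge-reranker-base",
        "replica": 1,
        "cards": 1,
    },
    "chatqna_mega_service": {
        "image": "opea/chatqna:latest",
        "type": "cpu",
        "replica": 4,
    },
}


def summarize_strategy(strategy: dict) -> dict:
    """Total replicas and cards per device type.

    Assumption: "cards" is the number of cards used by each replica, so the
    total for a service is replica * cards. Services without "cards" use none.
    """
    totals = {}
    for spec in strategy.values():
        entry = totals.setdefault(spec["type"], {"replicas": 0, "cards": 0})
        entry["replicas"] += spec["replica"]
        entry["cards"] += spec["replica"] * spec.get("cards", 0)
    return totals


totals = summarize_strategy(STRATEGY)
print(totals)
```

Comparing the card total against the number of cards listed in the hardware description is a quick sanity check that a strategy fits the cluster.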
|
||
Both the K8S manifests and strategy files are generated in the current directory, providing everything needed for deployment. | ||
|
||
To deploy, run `kubectl apply -f` on each newly generated `*_run.yaml` file and on the `chatqna_config_map` file.
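
The deployment step can be scripted. The following sketch only builds the `kubectl apply -f` command lines and leaves running them to the caller. The helper `build_apply_commands` and all file names in the example call are hypothetical. Placing the ConfigMap first is a choice made here, not something the source prescribes: the generated Deployments read their environment from the ConfigMap, so creating it first avoids pods waiting on a missing reference.

```python
def build_apply_commands(filenames, config_map_file):
    """Return kubectl command lines: the ConfigMap first, then every
    *_run.yaml manifest in alphabetical order.

    filenames is a plain list of names, for example from os.listdir(".").
    config_map_file is passed in by the caller because its exact file name
    depends on what the tool generated.
    """
    manifests = sorted(name for name in filenames if name.endswith("_run.yaml"))
    ordered = [config_map_file] + manifests
    return [["kubectl", "apply", "-f", name] for name in ordered]


# Hypothetical file names, used only to show the resulting order.
commands = build_apply_commands(
    ["llm_run.yaml", "embedding_run.yaml", "strategy.json"],
    "config_map.yaml",
)
for command in commands:
    print(" ".join(command))
```

Each command is a list of arguments, so it can be passed directly to `subprocess.run(command, check=True)` on a machine where `kubectl` is configured for the target cluster.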
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: qna-config | ||
namespace: default | ||
data: | ||
EMBEDDING_MODEL_ID: BAAI/bge-base-en-v1.5 | ||
RERANK_MODEL_ID: BAAI/bge-reranker-base | ||
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3 | ||
TEI_EMBEDDING_ENDPOINT: http://embedding-dependency-svc.default.svc.cluster.local:6006 | ||
TEI_RERANKING_ENDPOINT: http://reranking-dependency-svc.default.svc.cluster.local:8808 | ||
TGI_LLM_ENDPOINT: http://llm-dependency-svc.default.svc.cluster.local:9009 | ||
REDIS_URL: redis://vector-db.default.svc.cluster.local:6379 | ||
INDEX_NAME: rag-redis | ||
HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN} | ||
EMBEDDING_SERVICE_HOST_IP: embedding-svc | ||
RETRIEVER_SERVICE_HOST_IP: retriever-svc | ||
RERANK_SERVICE_HOST_IP: reranking-svc | ||
NODE_SELECTOR: chatqna-opea | ||
LLM_SERVICE_HOST_IP: llm-svc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
apiVersion: apps/v1 | ||
kind: Deployment | ||
metadata: | ||
name: chatqna-backend-server-deploy | ||
namespace: default | ||
spec: | ||
replicas: 1 | ||
selector: | ||
matchLabels: | ||
app: chatqna-backend-server-deploy | ||
template: | ||
metadata: | ||
annotations: | ||
sidecar.istio.io/rewriteAppHTTPProbers: 'true' | ||
labels: | ||
app: chatqna-backend-server-deploy | ||
spec: | ||
nodeSelector: | ||
node-type: chatqna-opea | ||
topologySpreadConstraints: | ||
- maxSkew: 1 | ||
topologyKey: kubernetes.io/hostname | ||
whenUnsatisfiable: ScheduleAnyway | ||
labelSelector: | ||
matchLabels: | ||
app: chatqna-backend-server-deploy | ||
hostIPC: true | ||
containers: | ||
- envFrom: | ||
- configMapRef: | ||
name: qna-config | ||
image: opea/chatqna:latest | ||
imagePullPolicy: IfNotPresent | ||
name: chatqna-backend-server-deploy | ||
args: null | ||
ports: | ||
- containerPort: 8888 | ||
serviceAccountName: default | ||
--- | ||
kind: Service | ||
apiVersion: v1 | ||
metadata: | ||
name: chatqna-backend-server-svc | ||
spec: | ||
type: NodePort | ||
selector: | ||
app: chatqna-backend-server-deploy | ||
ports: | ||
- name: service | ||
port: 8888 | ||
targetPort: 8888 | ||
nodePort: 30888 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
--- | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
metadata: | ||
name: dataprep-deploy | ||
namespace: default | ||
spec: | ||
replicas: 1 | ||
selector: | ||
matchLabels: | ||
app: dataprep-deploy | ||
template: | ||
metadata: | ||
annotations: | ||
sidecar.istio.io/rewriteAppHTTPProbers: 'true' | ||
labels: | ||
app: dataprep-deploy | ||
spec: | ||
nodeSelector: | ||
node-type: chatqna-opea | ||
topologySpreadConstraints: | ||
- maxSkew: 1 | ||
topologyKey: kubernetes.io/hostname | ||
whenUnsatisfiable: ScheduleAnyway | ||
labelSelector: | ||
matchLabels: | ||
app: dataprep-deploy | ||
hostIPC: true | ||
containers: | ||
- env: | ||
- name: REDIS_URL | ||
valueFrom: | ||
configMapKeyRef: | ||
name: qna-config | ||
key: REDIS_URL | ||
- name: TEI_ENDPOINT | ||
valueFrom: | ||
configMapKeyRef: | ||
name: qna-config | ||
key: TEI_EMBEDDING_ENDPOINT | ||
- name: INDEX_NAME | ||
valueFrom: | ||
configMapKeyRef: | ||
name: qna-config | ||
key: INDEX_NAME | ||
image: opea/dataprep-redis:latest | ||
imagePullPolicy: IfNotPresent | ||
name: dataprep-deploy | ||
args: null | ||
ports: | ||
- containerPort: 6007 | ||
- containerPort: 6008 | ||
- containerPort: 6009 | ||
serviceAccountName: default | ||
--- | ||
kind: Service | ||
apiVersion: v1 | ||
metadata: | ||
name: dataprep-svc | ||
spec: | ||
type: ClusterIP | ||
selector: | ||
app: dataprep-deploy | ||
ports: | ||
- name: port1 | ||
port: 6007 | ||
targetPort: 6007 | ||
- name: port2 | ||
port: 6008 | ||
targetPort: 6008 | ||
- name: port3 | ||
port: 6009 | ||
targetPort: 6009 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
--- | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
metadata: | ||
name: embedding-dependency-deploy | ||
namespace: default | ||
spec: | ||
replicas: 1 | ||
selector: | ||
matchLabels: | ||
app: embedding-dependency-deploy | ||
template: | ||
metadata: | ||
annotations: | ||
sidecar.istio.io/rewriteAppHTTPProbers: 'true' | ||
labels: | ||
app: embedding-dependency-deploy | ||
spec: | ||
nodeSelector: | ||
node-type: chatqna-opea | ||
containers: | ||
- envFrom: | ||
- configMapRef: | ||
name: qna-config | ||
image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 | ||
name: embedding-dependency-deploy | ||
args: | ||
- --model-id | ||
- $(EMBEDDING_MODEL_ID) | ||
- --auto-truncate | ||
volumeMounts: | ||
- mountPath: /data | ||
name: model-volume | ||
- mountPath: /dev/shm | ||
name: shm | ||
ports: | ||
- containerPort: 80 | ||
serviceAccountName: default | ||
volumes: | ||
- name: model-volume | ||
hostPath: | ||
path: /mnt/models | ||
type: Directory | ||
- name: shm | ||
emptyDir: | ||
medium: Memory | ||
sizeLimit: 1Gi | ||
--- | ||
kind: Service | ||
apiVersion: v1 | ||
metadata: | ||
name: embedding-dependency-svc | ||
spec: | ||
type: ClusterIP | ||
selector: | ||
app: embedding-dependency-deploy | ||
ports: | ||
- name: service | ||
port: 6006 | ||
targetPort: 80 |