As the adoption of AI and ML continues to surge, there's an increasing demand to deploy these workloads in edge environments such as factories, farms, and remote locations where traditional computing infrastructure is limited, connectivity is unreliable or constrained, and reliability is paramount.
This repository contains tools and artifacts comprising a proof of concept to bring the power of Open Data Hub and OpenShift AI to such edge environments.
Warning
This is a proof of concept repository and is not intended for production use. The edge features are experimental and the repository is under active development.
The term edge is a highly overloaded term in the industry. For the purpose of this repository, the following diagram illustrates the concept of core and near edge environments.
Refer to AI Edge Terminology for terminologies used in this repo.
- Core
- The central OpenShift cluster that is used to manage the edge clusters.
- Can be an on-premise or cloud-based cluster.
- There are no resources or network constraints expected in the core cluster.
- Near Edge
- This is an edge environment to run and serve AI/ML inference workloads.
- For the purpose of this repository, the near edge environment is represented by separate OpenShift cluster(s) managed from the core OpenShift cluster.
- The near edge environment is expected to have moderate yet constrained compute resources and network.
The main objective currently is to showcase that a user can:
- Package an inference service by integrating a trained model with all required dependencies into a container image, ensuring it's ready for utilization at the near edge location(s) in a streamlined manner.
- Centrally control and manage the set of target near edge environments from the core OpenShift cluster, and configure the initialization of a GitOps workflow on each of those targets.
- Utilize GitOps tooling at each near edge environment to reconcile the manifests specified in the GitOps repository (managed centrally), and deploy the inference service.
- Monitor and observe the deployed AI/ML inference workloads in the near edge environments from the core OpenShift cluster.
Note
Developing and training a model is currently out of scope of this repository as this repository already contains example models in the examples/models/ directory. You can read more about developing and training a model in the Developing and training a model section.
The diagram above illustrates the architecture of the proof of concept. The two main categories of clusters are the core and near edge clusters.
Note
The rest of this documentation assumes usage of ACM1 for cluster and application management, and we highly recommend this approach. If you choose to use a different approach, you'll find some information on how to deploy the inference container application using standalone Argo CD instances on the edge clusters, in the dedicated Bring Your Own Cluster Management document.
- Packages different types of AI/ML models and serving runtimes into container images using
Red Hat OpenShift Pipelines
2. - Manages the near edge clusters and the AI/ML inference deployed on them in a GitOps manner using
Red Hat Advanced Cluster Management for Kubernetes (ACM)
1 andRed Hat OpenShift GitOps
. - Monitors and observes the deployed AI/ML inference workloads in the near edge environments using
ACM
andRed Hat OpenShift Observability
3. - Hosts the Open Data Hub and OpenShift AI components.
- Run the AI/ML workloads as defined by the core cluster in a GitOps manner using
ACM
andRed Hat OpenShift GitOps
. - Continuously reconcile the state of the AI/ML workloads with the core cluster using ACM's pull model with Argo CD.
- Collect metrics of the AI/ML workloads and send them to the core cluster using
Red Hat OpenShift Observability
andACM
.
Before you can start using the proof of concept, you need to prepare the infrastructure. This includes setting up the OpenShift clusters and installing the required components.
Follow the steps in Preparing the Infrastructure to set up the infrastructure.
The proof of concept is built to provide two main workflows as shown in the diagram above
In manifests/README.md we show how to take the trained models,
stored in an S3 bucket or Git repository,
build container image(s) with the model and serving runtime using OpenShift Pipelines,
push the container image(s) to an image registry accessible by the near edge cluster(s),
and update a clone of this repository with a pull request,
configuring acm/odh-edge/apps/*/kustomization.yaml
with location and digest (SHA-256) of the built images.
You can skip this step if you do not wish to rebuild the container images. If you use the default configuration as shown in this git repository, you will use already built container images from the https://quay.io/organization/rhoai-edge repositories.
With the remote near edge clusters now available in the cluster set and the Argo CD, it is now possible to deploy the applications to those clusters using the Argo CD Pull model integration in ACM GitOps.
We will create the objects in the respective namespaces from the acm/registration/
directory structure.
However, it is important to note that the actual application configuration, stored in acm/odh-edge/
,
does not come from our local git repository checkout.
Instead, acm/registration/near-edge/overlays/tensorflow-housing-app/kustomization.yaml
and acm/registration/near-edge/overlays/bike-rental-app/kustomization.yaml
contain source URLs of the git repositories that will become configurations of the ACM Channels,
and the applications in those remote git repositories will be deployed.
By default, github.com/opendatahub-io/ai-edge is configured;
edit the URLs to match your repositories.
Then, as a user with permissions to create/update the following resource types in the openshift-gitops
namespace on
the ACM hub cluster:
ApplicationSets
GitOpsClusters
ManagedClusterSetBindings
PlacementBindings
Placements
Policies
run the following command:
oc apply -k acm/registration/
If the GitOps repository that Argo CD on the edge clusters will deploy resources from requires some form of credentials, then these credentials will need to be provided to each cluster via a Secret. This can be done by specifying the Secret inside an ACM Policy resource on the ACM hub cluster in such a way that ACM will instruct the edge clusters to create it. The following example illustrates how this may be done:
cat << EOF | oc apply -f -
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
name: namespace-policy
namespace: openshift-gitops
spec:
disabled: false
policy-templates:
- objectDefinition:
apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
name: ensure-namespace-exists
spec:
object-templates:
- complianceType: musthave
objectDefinition:
apiVersion: v1
kind: Secret
metadata:
name: first-repo
namespace: openshift-gitops
labels:
argocd.argoproj.io/secret-type: repository
stringData:
type: git
url: https://github.com/argoproj/private-repo
pruneObjectBehavior: DeleteIfCreated
severity: low
remediationAction: enforce
EOF
In the OpenShift Console of the ACM hub cluster in All Clusters > Applications, search for the application name.
You should see a few results results: <app name>-appset
, <cluster name>-<app name>
, and <app name>-1
.
Clicking on the <cluster name>-<app name>
entry, and then navigating to the Topology tab, you can see what objects
were created on the remote cluster.
Everything should be shown green. If it is not, click the icon of the faulty object and check the displayed information for debugging clues.
The Argo CD provided by Openshift GitOps has a console on each near edge cluster, showing detailed information on each
application that it manages on that cluster. On the particular edge cluster, open the console by navigating to the
domain specified by the openshift-gitops-cluster
Route in the openshift-gitops
namespace on that cluster. Once
logged-in to the Argo console, search for the application name, and then select the entry named in the format <cluster name>-<application name>
to see more information.
- Login to the core cluster using an account that has cluster-admin privileges
- Edit contents of thanos-secret file.
- Install the ACM observability stack by running
make setup-observability
- Create the pull secret the MultiClusterObservability object expects by reading from the cluster secret
DOCKER_CONFIG_JSON=`oc extract secret/pull-secret -n openshift-config --to=-`
oc create secret generic multiclusterhub-operator-pull-secret -n open-cluster-management-observability --from-literal=.dockerconfigjson="${DOCKER_CONFIG_JSON}" --type=kubernetes.io/dockerconfigjson
- Login to the edge cluster using an account with cluster-admin privileges
- Enable monitoring for user-defined projects in OpenShift clusters
oc -n openshift-monitoring edit configmap cluster-monitoring-config
- Set variable
enableUserWorkload
totrue
- Install the
Red Hat Build of OpenTelemetry
from the OperatorHub
Note
If you are not deploying our example models then you must edit the static list of targets to contain the service pointing to your custom model.
This step is out of scope of this Proof of Concept repository, as this repository already contains trained models in the examples/models/ directory.
If you wish to develop and train different models, Jupyter notebooks provided by Open Data Hub (ODH) or Red Hat OpenShift AI can be used. To install ODH or OpenShift AI operators, admin privileges in the OpenShift cluster are needed.
Working and deploying your own models might require bigger changes to the definition and configuration of the pipelines and ACM setup below, so you might want to start with the pre-built models first.
In examples/model-upload/
you can upload a local
model file to be used in our pipelines. This is done by uploading a model to a PVC
and copying that model to our pipeline's workspace for use while it is running.
Upload model to PVC:
make MODEL_PATH="PATH_TO_A_FILE" NAME=my-model create
You should get a final output showing details of the upload
PVC name: model-upload-pvc
Size: 1G
Model path in pod: /workspace/model-upload-pvc/model_dir/model.model
You can set the SIZE
and PVC
values aswell
make MODEL_PATH="PATH_TO_A_FILE" NAME=my-model SIZE=1G PVC=my-new-PVC create
You can then use the copy-model-from-pvc task to copy the
model from the model-workspace
, which can be set to the PVC created in the last step, to the build-workspace-pv
workspace, which is then used by the buildah
task in the pipeline.
copy-model-from-pvc task is not included in the pipeline
by default, you have to add it yourself.
See CONTRIBUTING.md.
See DEVELOPMENT.md.