Snorkel Flow is the data-centric AI platform for automated data labeling, integrated model training and analysis, and enhanced domain expert collaboration. For more information, visit our [official website](https://snorkel.ai/).
Get up and running with a few clicks! Install Snorkel Flow on your Google Kubernetes Engine cluster using Google Cloud Marketplace. Follow the [on-screen instructions](TODO:Add Listing).
4 nodes, minimum 32 CPU / 128GB RAM per node.
You'll need the following tools in your development environment. If you are using Cloud Shell, `gcloud`, `kubectl`, Docker, and Git are installed in your environment by default.
Configure `gcloud` as a Docker credential helper:

```shell
gcloud auth configure-docker
```
Create a new cluster from the command line:

```shell
export CLUSTER=snorkelflow
export ZONE=us-west1-a

gcloud container clusters create "$CLUSTER" --zone "$ZONE"
```
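The default node configuration may not satisfy the hardware requirements listed above. A minimal sketch of a create command that does, assuming the `n2-standard-32` machine type (32 vCPUs, 128 GB RAM) is available in your zone:

```shell
# Example sizing only; adjust the machine type and node count to your needs.
gcloud container clusters create "$CLUSTER" \
  --zone "$ZONE" \
  --num-nodes 4 \
  --machine-type n2-standard-32
```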
Configure `kubectl` to connect to the new cluster:

```shell
gcloud container clusters get-credentials "$CLUSTER" --zone "$ZONE"
```
Snorkel Flow requires the Filestore CSI driver to be installed on your cluster. To enable this driver on an existing cluster:

```shell
gcloud container clusters update "$CLUSTER" \
  --update-addons=GcpFilestoreCsiDriver=ENABLED \
  --zone "$ZONE"
```
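If you are creating a new cluster instead, you can enable the driver at creation time; a sketch:

```shell
# Enables the Filestore CSI driver when the cluster is first created.
gcloud container clusters create "$CLUSTER" \
  --zone "$ZONE" \
  --addons=GcpFilestoreCsiDriver
```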
An Application resource is a collection of individual Kubernetes components, such as Services, Deployments, and so on, that you can manage as a group.

To set up your cluster to understand Application resources, run the following command:

```shell
kubectl apply -f "https://raw.githubusercontent.com/GoogleCloudPlatform/marketplace-k8s-app-tools/master/crd/app-crd.yaml"
```

You need to run this command only once.

The Application resource is defined by the Kubernetes SIG-apps community. The source code can be found at [github.com/kubernetes-sigs/application](https://github.com/kubernetes-sigs/application).
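To verify that the Application CRD is now available, a quick check (assuming the CRD name `applications.app.k8s.io` defined by that manifest):

```shell
kubectl get crd applications.app.k8s.io
```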
Clone this repo and the associated tools repo:

```shell
git clone --recursive https://github.com/snorkel-ai/snorkel-flow-gcp-marketplace.git
```

Navigate to the `chart/snorkelflow` directory:

```shell
cd chart/snorkelflow
```
Choose an instance name and namespace for the app. In most cases, you can use the `snorkelflow` namespace.

```shell
export APP_INSTANCE_NAME=snorkel-flow
export NAMESPACE=snorkelflow
```

If you use a different namespace than `snorkelflow`, run the command below to create a new namespace:

```shell
kubectl create namespace "$NAMESPACE"
```
Use `helm install` to install Snorkel Flow from the Helm charts. Note that you will need to insert the correct image paths in the `values.yaml` file, or set them on the command line as shown here. We also toggle on RBAC creation for the secrets manager, since all RBAC creation is disabled by default in our charts due to a requirement in the deployer image. Finally, we override the default StorageClass to use the one provided by the Filestore CSI driver.
```shell
cd chart/snorkelflow

helm install --generate-name \
  --set image.imageNames.postgres="gcr.io/snorkelai-public/snorkelflow/postgres:0.76.12" \
  --set image.imageNames.engine="gcr.io/snorkelai-public/snorkelflow/engine:0.76.12" \
  --set image.imageNames.envoy="gcr.io/snorkelai-public/snorkelflow/envoy:0.76.12" \
  --set image.imageNames.flowUi="gcr.io/snorkelai-public/snorkelflow/flow-ui:0.76.12" \
  --set image.imageNames.grafana="gcr.io/snorkelai-public/snorkelflow/grafana:0.76.12" \
  --set image.imageNames.influxdb="gcr.io/snorkelai-public/snorkelflow/influxdb:0.76.12" \
  --set image.imageNames.minio="gcr.io/snorkelai-public/snorkelflow/minio:0.76.12" \
  --set image.imageNames.modelRegistry="gcr.io/snorkelai-public/snorkelflow/model-registry:0.76.12" \
  --set image.imageNames.notebook="gcr.io/snorkelai-public/snorkelflow/notebook:0.76.12" \
  --set image.imageNames.redis="gcr.io/snorkelai-public/snorkelflow/redis:0.76.12" \
  --set image.imageNames.secretsGenerator="gcr.io/snorkelai-public/snorkelflow/secrets-generator:0.76.12" \
  --set image.imageNames.singleuserNotebook="gcr.io/snorkelai-public/snorkelflow/singleuser-notebook:0.76.12" \
  --set image.imageNames.studio="gcr.io/snorkelai-public/snorkelflow/studio-api:0.76.12" \
  --set image.imageNames.telegraf="gcr.io/snorkelai-public/snorkelflow/telegraf:0.76.12" \
  --set image.imageNames.tdm="gcr.io/snorkelai-public/snorkelflow/tdm-api:0.76.12" \
  --set services.secretsGenerator.createRole=true \
  --set namespace="$NAMESPACE" \
  --set volumes.snorkelflowData.storageClass="standard-rwx" \
  --set volumes.snorkelflowData.storageRequest="1200Gi" \
  ./
```
The image versions here are provided only as a sample.
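As an alternative to the long list of `--set` flags, you can collect the same overrides in a values file and pass it with `-f`. A minimal sketch (the keys mirror the flags above; the file name `overrides.yaml` is just an example):

```yaml
# overrides.yaml -- same overrides as the --set flags above
namespace: snorkelflow
services:
  secretsGenerator:
    createRole: true
volumes:
  snorkelflowData:
    storageClass: standard-rwx
    storageRequest: 1200Gi
image:
  imageNames:
    postgres: gcr.io/snorkelai-public/snorkelflow/postgres:0.76.12
    # ...remaining images as above...
```

```shell
helm install --generate-name -f overrides.yaml ./
```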
To get the Console URL for your app, run the following command:

```shell
echo "https://console.cloud.google.com/kubernetes/application/${ZONE}/${CLUSTER}/${NAMESPACE}/${APP_INSTANCE_NAME}"
```

To view your app, open the URL in your browser.
To access the Snorkel Flow UI, you can either expose a public service endpoint or keep it private and connect from your local environment with `kubectl port-forward`.

Snorkel Flow will automatically create Ingress objects of type `gce`, which by default create an external load balancer. We also offer the option to specify a custom domain within the deployer schema.
To find the public IP, run the following command:

```shell
kubectl get ingresses --namespace "$NAMESPACE"
```

The external address will be the address for `snorkelflow-ingress`. Keep in mind that the provisioning process for the load balancer can take a while.
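To wait for the address to appear, you can watch the Ingress until the load balancer is provisioned; a sketch:

```shell
# Polls until the snorkelflow-ingress object reports an external address.
kubectl get ingress snorkelflow-ingress --namespace "$NAMESPACE" --watch
```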
You can use the port-forwarding feature of `kubectl` to forward Snorkel Flow's port to your local machine. First, list the pods:

```shell
kubectl get pods --namespace "$NAMESPACE"
```

Find the `flow-ui` pod, and then run the following command in the background:

```shell
kubectl port-forward --namespace "$NAMESPACE" flow-ui-{pod name} 5000
```

Replace `flow-ui-{pod name}` with the actual name of the pod listed.

Now you can access Snorkel Flow at http://localhost:5000/.
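If you prefer not to copy the pod name by hand, a small sketch that looks it up and starts the forward in the background (it assumes the pod name contains `flow-ui`, as shown above):

```shell
# Grab the first pod whose name contains "flow-ui" and forward port 5000.
FLOW_UI_POD=$(kubectl get pods --namespace "$NAMESPACE" -o name | grep flow-ui | head -n 1)
kubectl port-forward --namespace "$NAMESPACE" "$FLOW_UI_POD" 5000 &
```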
Upload the license provided to you and follow the on-screen instructions to create user accounts.

This installation of Snorkel Flow is not meant to be scaled up. To upgrade to a new version, redeploy from Marketplace. For help, contact google@snorkel.ai.
The Snorkel Flow data volumes are stored on `PersistentVolume` objects. To view them, run the following command (`PersistentVolume` objects are cluster-scoped, so no namespace flag is needed):

```shell
kubectl get pv
```

The `snorkelflow-data` volume is created by the Google Filestore CSI driver, and it points to an automatically provisioned NFS share. The other data volumes are created by the Compute Engine persistent disk CSI driver.
Backups of the Filestore volume can be created through either the Google Cloud Platform Console or the CLI.

To back up through the CLI, first find the instance name of the volume. This will be in the form `pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`. The file share name will be the string at the end of the `VolumeHandle` shown when executing:

```shell
kubectl describe pv pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```

For example, the `VolumeHandle` could be `modeInstance/us-central1-c/pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/vol1`, where the file share name is `vol1`.

Next, execute the following command, substituting in your specific file share and instance ID:

```shell
gcloud filestore backups create my-backup \
  --file-share=vol1 \
  --instance=pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  --instance-zone=us-central1-c \
  --region=us-central1 \
  --description="Backup"
```

Now, in the Backups console, we can restore this backup to the same Filestore instance.
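The restore can also be done from the CLI; a hedged sketch using `gcloud filestore instances restore` (check `gcloud filestore instances restore --help` for the exact flags available in your gcloud version):

```shell
# Restores the backup onto the same Filestore instance and file share.
gcloud filestore instances restore pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  --file-share=vol1 \
  --zone=us-central1-c \
  --source-backup=my-backup \
  --source-backup-region=us-central1
```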
Since the other volumes are provisioned by the Compute Engine persistent disk CSI driver, follow these instructions to use Volume Snapshots to back up and restore the other persistent volumes.
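A minimal sketch of what such a VolumeSnapshot manifest looks like, using the standard `snapshot.storage.k8s.io/v1` API (the class and claim names here are hypothetical placeholders):

```yaml
# snapshot.yaml -- snapshot of one PD-backed PersistentVolumeClaim
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snorkelflow-pd-snapshot
  namespace: snorkelflow
spec:
  volumeSnapshotClassName: my-snapshot-class   # hypothetical class name
  source:
    persistentVolumeClaimName: my-pd-claim     # hypothetical claim name
```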
In the GCP Console, open Kubernetes Applications. From the list of applications, click Snorkel Flow. On the Application Details page, click Delete.
NOTE: We recommend that you use a `kubectl` version that is the same as the version of your cluster. Using the same versions of `kubectl` and the cluster helps avoid unforeseen issues.
To delete the application's resources, run:

```shell
kubectl delete daemonset,service,configmap,application \
  --namespace ${NAMESPACE} \
  --selector app.kubernetes.io/name=${APP_INSTANCE_NAME}
```
Alternatively, if the namespace is no longer required, you can delete the namespace itself:

```shell
kubectl delete namespace ${NAMESPACE}
```
Optionally, if you don't need the deployed application or the GKE cluster, delete the cluster using this command:

```shell
gcloud container clusters delete "${CLUSTER}" --zone "${ZONE}"
```