GKE terraform #585

Merged · 11 commits · Jun 18, 2019
2 changes: 2 additions & 0 deletions deploy/gcp/.gitignore
@@ -2,3 +2,5 @@
*.tfstate*
credentials
rendered
terraform-key.json
credentials.auto.tfvars
70 changes: 41 additions & 29 deletions deploy/gcp/README.md
@@ -34,30 +34,44 @@ gcloud services enable container.googleapis.com

### Configure Terraform

The terraform script expects three environment variables. You can let Terraform prompt you for them, or `export` them in the `~/.bash_profile` file ahead of time. The required environment variables are:
The terraform script expects three variables to be set.

* `TF_VAR_GCP_CREDENTIALS_PATH`: Path to a valid GCP credentials file.
- It is recommended to create a new service account to be used by Terraform. See [this page](https://cloud.google.com/iam/docs/creating-managing-service-accounts) to create a service account and grant `Project Editor` role to it.
- See [this page](https://cloud.google.com/iam/docs/creating-managing-service-account-keys) to create service account keys, and choose `JSON` key type during creation. The downloaded `JSON` file that contains the private key is the credentials file you need.
* `TF_VAR_GCP_REGION`: The region to create the resources in, for example: `us-west1`.
* `TF_VAR_GCP_PROJECT`: The name of the GCP project.
* `TF_VAR_GCP_CREDENTIALS_PATH`: Path to a valid GCP credentials file.
- It is recommended to create a new service account to be used by Terraform, as shown in the example below.

Below we will set these variables. The region and project go into `terraform.tfvars`, and `create-service-account.sh` writes the credentials path into `credentials.auto.tfvars`:

```bash
# Replace us-west1 below with your GCP region.
echo 'GCP_REGION = "us-west1"' >> terraform.tfvars
# First make sure you are connected to the correct project: gcloud config set project $PROJECT
echo "GCP_PROJECT = \"$(gcloud config get-value project)\"" >> terraform.tfvars
# Create a service account for Terraform with restricted permissions and set the credentials path
./create-service-account.sh
```
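If the script succeeds, the variable files should look roughly like this (a sketch; the project name and absolute path below are placeholders for your own values):

```bash
cat terraform.tfvars
# GCP_REGION = "us-west1"
# GCP_PROJECT = "my-project"
cat credentials.auto.tfvars
# GCP_CREDENTIALS_PATH = "/path/to/tidb-operator/deploy/gcp/credentials/terraform-key.json"
```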

## Deploy

> *Note*: The service account must have sufficient permissions to create resources in the project. The `Project Editor` primitive role will accomplish this.
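If you grant roles manually instead of using `create-service-account.sh`, something like the following should work (the project ID and service account email are placeholders):

```bash
# Grant the Editor primitive role to the Terraform service account.
gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:terraform@my-project.iam.gserviceaccount.com" \
  --role roles/editor
```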

To set the three environment variables, for example, you can enter in your terminal:
Now that you have configured gcloud access, make sure you have a copy of the repo:

```bash
# Replace the values with the path to the JSON file you have downloaded, the GCP region and your GCP project name.
export TF_VAR_GCP_CREDENTIALS_PATH="/Path/to/my-project.json"
export TF_VAR_GCP_REGION="us-west1"
export TF_VAR_GCP_PROJECT="my-project"
git clone --depth=1 https://github.com/pingcap/tidb-operator
cd tidb-operator/deploy/gcp
```

You can also append them in your `~/.bash_profile` so they will be exported automatically next time.
You need to decide on instance types. If you just want to get a feel for a TiDB deployment and lower your cost, you can use the small settings:

## Deploy
```bash
cat small.tfvars >> terraform.tfvars
```

If you want to benchmark a production deployment, run:

The default setup creates a new VPC, two subnetworks, and an f1-micro instance as a bastion machine. The GKE cluster is created with the following instance types as worker nodes:
```bash
cat prod.tfvars >> terraform.tfvars
```

Terraform creates a new VPC, two subnetworks, and an f1-micro instance as a bastion machine.
The production setup uses the following instance types:

* 3 n1-standard-4 instances for PD
* 3 n1-highmem-8 instances for TiKV
@@ -66,13 +80,11 @@ The default setup creates a new VPC, two subnetworks, and an f1-micro instance a

> *Note*: The number of nodes created depends on how many availability zones there are in the chosen region. Most have 3 zones, but us-central1 has 4. See [Regions and Zones](https://cloud.google.com/compute/docs/regions-zones/) for more information and see the [Customize](#customize) section on how to customize node pools in a regional cluster.
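To see how many zones your chosen region actually has before deploying, one option is (the region name is just an example):

```bash
# Print the zones of the region; the node count per pool scales with the number of zones.
gcloud compute regions describe us-west1 --format="value(zones)"
```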

The default setup, as listed above, requires at least 91 CPUs which exceed the default CPU quota of a GCP project. To increase your project's quota, follow the instructions [here](https://cloud.google.com/compute/quotas). You need more CPUs if you need to scale out.
The production setup, as listed above, requires at least 91 CPUs, which exceeds the default CPU quota of a GCP project. To increase your project's quota, follow the instructions [here](https://cloud.google.com/compute/quotas). You will need more CPUs if you later scale out.
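You can check the current CPU quota and usage for your region with `gcloud`; the grep pattern below assumes the default YAML-style output:

```bash
# Show the CPUS quota entry (limit, metric, usage) for the region.
gcloud compute regions describe us-west1 | grep -B 1 -A 1 "metric: CPUS"
```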

Now that you have configured everything needed, you can launch the script to deploy the TiDB cluster:
Once you have chosen your instance types, you can install your TiDB cluster with:

```bash
git clone --depth=1 https://github.com/pingcap/tidb-operator
cd tidb-operator/deploy/gcp
terraform init
terraform apply
```
@@ -86,11 +98,11 @@ Apply complete! Resources: 17 added, 0 changed, 0 destroyed.

Outputs:

cluster_id = my-cluster
cluster_name = my-cluster
cluster_id = tidb
cluster_name = tidb
how_to_connect_to_mysql_from_bastion = mysql -h 172.31.252.20 -P 4000 -u root
how_to_ssh_to_bastion = gcloud compute ssh bastion --zone us-west1-b
kubeconfig_file = ./credentials/kubeconfig_my-cluster
kubeconfig_file = ./credentials/kubeconfig_tidb
monitor_ilb_ip = 35.227.134.146
monitor_port = 3000
region = us-west1
@@ -113,7 +125,7 @@ mysql -h <tidb_ilb_ip> -P 4000 -u root

## Interact with the cluster

You can interact with the cluster using `kubectl` and `helm` with the kubeconfig file `credentials/kubeconfig_<cluster_name>` as follows. The default `cluster_name` is `my-cluster`, which can be changed in `variables.tf`.
You can interact with the cluster using `kubectl` and `helm` with the kubeconfig file `credentials/kubeconfig_<cluster_name>` as follows. The default `cluster_name` is `tidb`, which can be changed in `variables.tf`.

```bash
# By specifying --kubeconfig argument.
@@ -178,7 +190,7 @@ You can change default values in `variables.tf` (such as the cluster name and th

### Customize GCP resources

GCP allows attaching a local SSD to any instance type that is `n1-standard-1` or greater. This allows for good customizability.
GCP allows attaching a local SSD to any instance type that is `n1-standard-1` or greater.
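If you want different machine types from those in `small.tfvars`/`prod.tfvars`, you can override the corresponding variables yourself; for example (the machine type below is only an illustration):

```bash
# Use a larger machine type for TiKV nodes, then re-apply.
echo 'tikv_instance_type = "n1-highmem-16"' >> terraform.tfvars
terraform apply
```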

### Customize TiDB parameters

@@ -199,9 +211,9 @@ gcloud compute instance-groups managed list | grep monitor
And the result will be something like this:

```bash
gke-my-cluster-monitor-pool-08578e18-grp us-west1-b zone gke-my-cluster-monitor-pool-08578e18 0 0 gke-my-cluster-monitor-pool-08578e18 no
gke-my-cluster-monitor-pool-7e31100f-grp us-west1-c zone gke-my-cluster-monitor-pool-7e31100f 1 1 gke-my-cluster-monitor-pool-7e31100f no
gke-my-cluster-monitor-pool-78a961e5-grp us-west1-a zone gke-my-cluster-monitor-pool-78a961e5 1 1 gke-my-cluster-monitor-pool-78a961e5 no
gke-tidb-monitor-pool-08578e18-grp us-west1-b zone gke-tidb-monitor-pool-08578e18 0 0 gke-tidb-monitor-pool-08578e18 no
gke-tidb-monitor-pool-7e31100f-grp us-west1-c zone gke-tidb-monitor-pool-7e31100f 1 1 gke-tidb-monitor-pool-7e31100f no
gke-tidb-monitor-pool-78a961e5-grp us-west1-a zone gke-tidb-monitor-pool-78a961e5 1 1 gke-tidb-monitor-pool-78a961e5 no
```

The first column is the name of the managed instance group, and the second column is the zone in which it was created. You also need the name of the instance in that group, and you can get it by running:
@@ -213,16 +225,16 @@ gcloud compute instance-groups managed list-instances <the-name-of-the-managed-i
For example:

```bash
$ gcloud compute instance-groups managed list-instances gke-my-cluster-monitor-pool-08578e18-grp --zone us-west1-b
$ gcloud compute instance-groups managed list-instances gke-tidb-monitor-pool-08578e18-grp --zone us-west1-b

NAME ZONE STATUS ACTION INSTANCE_TEMPLATE VERSION_NAME LAST_ERROR
gke-my-cluster-monitor-pool-08578e18-c7vd us-west1-b RUNNING NONE gke-my-cluster-monitor-pool-08578e18
gke-tidb-monitor-pool-08578e18-c7vd us-west1-b RUNNING NONE gke-tidb-monitor-pool-08578e18
```

Now you can delete the instance by specifying the name of the managed instance group and the name of the instance, for example:

```bash
gcloud compute instance-groups managed delete-instances gke-my-cluster-monitor-pool-08578e18-grp --instances=gke-my-cluster-monitor-pool-08578e18-c7vd --zone us-west1-b
gcloud compute instance-groups managed delete-instances gke-tidb-monitor-pool-08578e18-grp --instances=gke-tidb-monitor-pool-08578e18-c7vd --zone us-west1-b
```

## Destroy
@@ -235,7 +247,7 @@ terraform destroy

You have to manually delete disks in the Google Cloud Console, or with `gcloud` after running `terraform destroy` if you do not need the data anymore.
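For example (the disk name and zone below are placeholders):

```bash
# List remaining persistent disks, then delete the ones you no longer need.
gcloud compute disks list
gcloud compute disks delete <disk-name> --zone <zone>
```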

> *Note*: When `terraform destroy` is running, an error with the following message might occur: `Error reading Container Cluster "my-cluster": Cluster "my-cluster" has status "RECONCILING" with message""`. This happens when GCP is upgrading the kubernetes master node, which it does automatically at times. While this is happening, it is not possible to delete the cluster. When it is done, run `terraform destroy` again.
> *Note*: When `terraform destroy` is running, an error with the following message might occur: `Error reading Container Cluster "tidb": Cluster "tidb" has status "RECONCILING" with message""`. This happens when GCP is upgrading the Kubernetes master node, which it does automatically at times. While this is happening, it is not possible to delete the cluster. When it is done, run `terraform destroy` again.


## More information
27 changes: 27 additions & 0 deletions deploy/gcp/create-service-account.sh
@@ -0,0 +1,27 @@
#!/usr/bin/env bash

set -euo pipefail
cd "$(dirname "$0")"
PROJECT="${TF_VAR_GCP_PROJECT:-$(cat terraform.tfvars | awk -F '=' '/GCP_PROJECT/ {print $2}' | cut -d '"' -f 2)}"
echo "$PROJECT"

cred_file=credentials.auto.tfvars
if test -f "$cred_file" ; then
if cat "$cred_file" | awk -F'=' '/GCP_CREDENTIALS/ {print $2}' >/dev/null ; then
echo "GCP_CREDENTAILS_PATH already set in $cred_file"
exit 1
fi
fi

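# Create a dedicated service account for Terraform and grant it only the roles needed to manage the GKE cluster, networking, and compute instances.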
gcloud iam service-accounts create --display-name terraform terraform
email="terraform@${PROJECT}.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding "$PROJECT" --member "serviceAccount:$email" --role roles/container.clusterAdmin
gcloud projects add-iam-policy-binding "$PROJECT" --member "serviceAccount:$email" --role roles/compute.networkAdmin
gcloud projects add-iam-policy-binding "$PROJECT" --member "serviceAccount:$email" --role roles/compute.viewer
gcloud projects add-iam-policy-binding "$PROJECT" --member "serviceAccount:$email" --role roles/compute.securityAdmin
gcloud projects add-iam-policy-binding "$PROJECT" --member "serviceAccount:$email" --role roles/iam.serviceAccountUser
gcloud projects add-iam-policy-binding "$PROJECT" --member "serviceAccount:$email" --role roles/compute.instanceAdmin.v1

mkdir -p credentials
gcloud iam service-accounts keys create credentials/terraform-key.json --iam-account "$email"
echo "GCP_CREDENTIALS_PATH = \"$(pwd)/credentials/terraform-key.json\"" > "$cred_file"
6 changes: 2 additions & 4 deletions deploy/gcp/data.tf
@@ -7,13 +7,11 @@ data "template_file" "tidb_cluster_values" {
tikv_replicas = var.tikv_replica_count
tidb_replicas = var.tidb_replica_count
operator_version = var.tidb_operator_version
tidb_operator_registry = var.tidb_operator_registry
}
}

data external "available_zones_in_region" {
depends_on = [null_resource.prepare-dir]
program = ["bash", "-c", "gcloud compute regions describe ${var.GCP_REGION} --format=json | jq '{zone: .zones|.[0]|match(\"[^/]*$\"; \"g\")|.string}'"]
}
data "google_compute_zones" "available" { }

data "external" "tidb_ilb_ip" {
depends_on = [null_resource.deploy-tidb-cluster]
85 changes: 60 additions & 25 deletions deploy/gcp/main.tf
@@ -118,6 +118,11 @@ resource "google_container_node_pool" "pd_pool" {
name = "pd-pool"
initial_node_count = var.pd_count

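# Auto repair and auto upgrade are disabled so that GKE does not recreate or drain these nodes outside of Terraform's control.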
management {
auto_repair = false
auto_upgrade = false
}

node_config {
machine_type = var.pd_instance_type
local_ssd_count = 0
@@ -146,6 +151,11 @@ resource "google_container_node_pool" "tikv_pool" {
name = "tikv-pool"
initial_node_count = var.tikv_count

management {
auto_repair = false
auto_upgrade = false
}

node_config {
machine_type = var.tikv_instance_type
image_type = "UBUNTU"
@@ -177,6 +187,11 @@ resource "google_container_node_pool" "tidb_pool" {
name = "tidb-pool"
initial_node_count = var.tidb_count

management {
auto_repair = false
auto_upgrade = false
}

node_config {
machine_type = var.tidb_instance_type

@@ -203,6 +218,11 @@ resource "google_container_node_pool" "monitor_pool" {
name = "monitor-pool"
initial_node_count = var.monitor_count

management {
auto_repair = false
auto_upgrade = false
}

node_config {
machine_type = var.monitor_instance_type
tags = ["monitor"]
@@ -254,7 +274,7 @@ resource "google_compute_firewall" "allow_ssh_from_bastion" {

resource "google_compute_instance" "bastion" {
project = var.GCP_PROJECT
zone = data.external.available_zones_in_region.result["zone"]
zone = data.google_compute_zones.available.names[0]
machine_type = var.bastion_instance_type
name = "bastion"

@@ -308,62 +328,77 @@ resource "null_resource" "setup-env" {
depends_on = [
google_container_cluster.cluster,
null_resource.get-credentials,
var.tidb_operator_registry,
var.tidb_operator_version,
]

provisioner "local-exec" {
working_dir = path.module
interpreter = ["bash", "-c"]

command = <<EOS
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user $(gcloud config get-value account)
kubectl create serviceaccount --namespace kube-system tiller
set -euo pipefail

if ! kubectl get clusterrolebinding cluster-admin-binding 2>/dev/null; then
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user $(gcloud config get-value account)
fi

if ! kubectl get serviceaccount -n kube-system tiller 2>/dev/null ; then
kubectl create serviceaccount --namespace kube-system tiller
fi

kubectl apply -f manifests/crd.yaml
kubectl apply -k manifests/local-ssd
kubectl apply -f manifests/gke/persistent-disk.yaml
kubectl apply -f manifests/tiller-rbac.yaml

helm init --service-account tiller --upgrade --wait
until helm ls; do
echo "Wait until tiller is ready"
done
helm install --namespace tidb-admin --name tidb-operator ${path.module}/charts/tidb-operator
helm upgrade --install tidb-operator --namespace tidb-admin ${path.module}/charts/tidb-operator --set operatorImage=${var.tidb_operator_registry}/tidb-operator:${var.tidb_operator_version}
EOS


environment = {
KUBECONFIG = local.kubeconfig
}
}
environment = {
KUBECONFIG = local.kubeconfig
}
}
}

resource "null_resource" "deploy-tidb-cluster" {
depends_on = [
null_resource.setup-env,
local_file.tidb-cluster-values,
google_container_node_pool.pd_pool,
google_container_node_pool.tikv_pool,
google_container_node_pool.tidb_pool,
]

triggers = {
values = data.template_file.tidb_cluster_values.rendered
}
depends_on = [
null_resource.setup-env,
local_file.tidb-cluster-values,
google_container_node_pool.pd_pool,
google_container_node_pool.tikv_pool,
google_container_node_pool.tidb_pool,
]

provisioner "local-exec" {
triggers = {
values = data.template_file.tidb_cluster_values.rendered
}

provisioner "local-exec" {
interpreter = ["bash", "-c"]
command = <<EOS
# Review note: a maintainer suggested using EOT instead of EOS so that Terraform 0.12 formats this heredoc correctly; the author replied that any word will work here.
set -euo pipefail

helm upgrade --install tidb-cluster ${path.module}/charts/tidb-cluster --namespace=tidb -f ${local.tidb_cluster_values_path}
until kubectl get po -n tidb -lapp.kubernetes.io/component=tidb | grep Running; do
echo "Wait for TiDB pod running"
sleep 5
done

until kubectl get svc -n tidb tidb-cluster-tidb -o json | jq '.status.loadBalancer.ingress[0]' | grep ip; do
echo "Wait for TiDB internal loadbalancer IP"
sleep 5
done
EOS


environment = {
KUBECONFIG = local.kubeconfig
}
}
environment = {
KUBECONFIG = local.kubeconfig
}
}
}

3 changes: 3 additions & 0 deletions deploy/gcp/prod.tfvars
@@ -0,0 +1,3 @@
pd_instance_type = "n1-standard-4"
tikv_instance_type = "n1-highmem-8"
tidb_instance_type = "n1-standard-16"
3 changes: 3 additions & 0 deletions deploy/gcp/small.tfvars
@@ -0,0 +1,3 @@
pd_instance_type = "n1-standard-2"
tikv_instance_type = "n1-highmem-4"
tidb_instance_type = "n1-standard-8"
2 changes: 1 addition & 1 deletion deploy/gcp/templates/tidb-cluster-values.yaml.tpl
@@ -30,7 +30,7 @@ services:
type: ClusterIP

discovery:
image: pingcap/tidb-operator:${operator_version}
image: ${tidb_operator_registry}/tidb-operator:${operator_version}
imagePullPolicy: IfNotPresent
resources:
limits: