# Deployment Overview

Tanagra has two sets of GCP requirements: one for the indexer environment and one for the service deployment. Indexer environments and service deployments are not one-to-one. You can have multiple indexer environments for a single service deployment and vice versa.

Once you have an environment or deployment set up, you need to set the relevant properties in the config files (e.g. GCP project id, BigQuery dataset id, GCS bucket name). Pointers to the relevant config properties are included in each section below.

## Indexer Environment

An indexer environment is a GCP project configured with the items below.

*(Indexer Environment Diagram)*

- These APIs enabled. Some of these may be enabled by default in your project.
- GCS bucket in the same location as the source and index BigQuery datasets.
- "VM" service account, attached to the Dataflow worker VMs, with the permissions below.
    - Read the source BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset level (on the source dataset) includes the required permissions.
    - Create BigQuery jobs. `roles/bigquery.jobUser` granted at the project level (on the indexer GCP project) includes the required permissions.
    - Write to the index BigQuery dataset. `roles/bigquery.dataOwner` granted at the dataset level (on the index dataset) includes the required permissions.
    - Execute Dataflow work units. `roles/dataflow.worker` granted at the project level (on the indexer GCP project) includes the required permissions.
- "Runner" end-user or service account, used to run indexing, with the permissions below.
    - Read the source BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset level (on the source dataset) includes the required permissions.
    - Create BigQuery jobs. `roles/bigquery.jobUser` granted at the project level (on the indexer GCP project) includes the required permissions.
    - Create/delete and write to the index BigQuery dataset. `roles/bigquery.dataOwner` granted at the project level (on the indexer GCP project) includes the required permissions.
    - Kick off Dataflow jobs. `roles/dataflow.admin` granted at the project level (on the indexer GCP project) includes the required permissions.
    - Attach the "VM" service account credentials to the Dataflow worker VMs. `roles/iam.serviceAccountUser` granted at the service-account level (on the "VM" service account) includes the required permissions.
- (Optional) VPC subnetwork configured with Private Google Access to help speed up the Dataflow jobs.

You can use a single service account for both the "VM" and "runner" use cases, as long as it has the union of the permissions above.
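As a concrete starting point, here is a minimal sketch of setting up an indexer environment with `gcloud`. The project id, service account names, bucket name, and subnetwork name are all hypothetical placeholders, not Tanagra conventions. The dataset-level grants (e.g. `roles/bigquery.dataViewer` on the source dataset) are omitted because they are easiest to apply in the BigQuery console; only the project- and service-account-level grants are shown.

```sh
# Hypothetical project and account names -- substitute your own.
PROJECT=my-indexer-project
VM_SA=indexer-vm@${PROJECT}.iam.gserviceaccount.com
RUNNER_SA=indexer-runner@${PROJECT}.iam.gserviceaccount.com

# Enable the APIs the indexer needs (some may already be enabled).
gcloud services enable bigquery.googleapis.com dataflow.googleapis.com \
  compute.googleapis.com storage.googleapis.com --project=${PROJECT}

# Create the service accounts and the GCS bucket (same location as the datasets).
gcloud iam service-accounts create indexer-vm --project=${PROJECT}
gcloud iam service-accounts create indexer-runner --project=${PROJECT}
gsutil mb -p ${PROJECT} -l US gs://my-indexer-bucket

# Project-level grants for the "VM" service account.
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member="serviceAccount:${VM_SA}" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member="serviceAccount:${VM_SA}" --role="roles/dataflow.worker"

# Project-level grants for the "runner" service account.
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member="serviceAccount:${RUNNER_SA}" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member="serviceAccount:${RUNNER_SA}" --role="roles/bigquery.dataOwner"
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member="serviceAccount:${RUNNER_SA}" --role="roles/dataflow.admin"

# Let the runner attach the "VM" service account to Dataflow workers.
gcloud iam service-accounts add-iam-policy-binding ${VM_SA} \
  --member="serviceAccount:${RUNNER_SA}" --role="roles/iam.serviceAccountUser"

# (Optional) Turn on Private Google Access for the Dataflow subnetwork.
gcloud compute networks subnets update my-subnet --region=us-central1 \
  --enable-private-ip-google-access --project=${PROJECT}
```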

### Config properties

All indexer configuration lives in the indexer config file, whose properties are all documented here.

## Service Deployment

A service deployment lives in a GCP project configured with the items below.

*(Service Deployment Diagram)*

- Java service packaged as a runnable JAR file, deployed in either GKE or App Engine.
- Cloud SQL database, either Postgres (recommended) or MySQL.
- BigQuery dataset for temporary tables, one per index dataset location.
- GCS bucket for export files, one per index dataset location. (Example setup commands follow this list.)
    - Update the CORS configuration with any URLs that will need to read exported files. e.g., if there is an export model that writes a file and redirects to another URL that will read the file, you will likely need to grant that URL permission to make GET requests for objects in the bucket. Example CORS configuration file:

      ```json
      [
        {
          "origin": ["https://workbench.verily.com"],
          "method": ["GET"],
          "responseHeader": ["Content-Type"],
          "maxAgeSeconds": 3600
        }
      ]
      ```

    - The files stored in this bucket are available for either download to the user's computer or export to another configurable URL. It is recommended to configure the bucket to automatically delete objects after some expiration time. See lifecycle configuration.
- "Application" service account with the permissions below. (The second sketch after this list shows one way to apply these grants.)
    - Read the source BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset level (on the source dataset) includes the required permissions.
    - Read the index BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset level (on the index dataset) includes the required permissions.
    - Create BigQuery jobs. `roles/bigquery.jobUser` granted at the project level (on the service GCP project) includes the required permissions.
    - Create tables in the temporary tables BigQuery dataset. `roles/bigquery.dataEditor` granted at the dataset level (on the temporary tables dataset) includes the required permissions.
    - Read and write files in the export bucket(s). `roles/storage.objectAdmin` granted at the bucket level includes the required permissions.
    - Generate signed URLs for export files. `roles/iam.serviceAccountTokenCreator` granted at the service-account level (on itself) includes the required permissions.
    - Connect to the Cloud SQL database. `roles/cloudsql.client` granted at the project level includes the required permissions.
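To make the bucket and temporary-dataset items above concrete, here is a hedged sketch using `gsutil` and `bq`. All names are placeholders; `cors.json` is the example CORS file shown above, and the 7-day expiration is an arbitrary choice.

```sh
# Hypothetical names -- substitute your own.
PROJECT=my-service-project
EXPORT_BUCKET=gs://my-export-bucket

# Export bucket, in the same location as the index dataset.
gsutil mb -p ${PROJECT} -l US ${EXPORT_BUCKET}

# Apply the CORS configuration (the example file shown above).
gsutil cors set cors.json ${EXPORT_BUCKET}

# Auto-delete export files after 7 days (arbitrary expiration).
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 7}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json ${EXPORT_BUCKET}

# Temporary-tables dataset; a default table expiration (in seconds) keeps it tidy.
bq mk --dataset --location=US --default_table_expiration=86400 ${PROJECT}:tanagra_temp
```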
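And a similar sketch for the "Application" service account grants, continuing the placeholder variables from the previous block. The dataset-level `roles/bigquery.dataViewer` and `roles/bigquery.dataEditor` grants are omitted because they are easiest to apply in the BigQuery console.

```sh
APP_SA=tanagra-app@${PROJECT}.iam.gserviceaccount.com

gcloud iam service-accounts create tanagra-app --project=${PROJECT}

# Project-level grants: BigQuery jobs and Cloud SQL connections.
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member="serviceAccount:${APP_SA}" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member="serviceAccount:${APP_SA}" --role="roles/cloudsql.client"

# Bucket-level grant: read/write export files.
gsutil iam ch serviceAccount:${APP_SA}:roles/storage.objectAdmin ${EXPORT_BUCKET}

# Self-grant so the service can sign URLs with its own credentials.
gcloud iam service-accounts add-iam-policy-binding ${APP_SA} \
  --member="serviceAccount:${APP_SA}" --role="roles/iam.serviceAccountTokenCreator"
```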

### Config properties

Service configuration lives in two places, depending on whether a property applies to a single underlay or to the entire deployment.

#### Single underlay

Each underlay hosted by a service deployment has its own service config file. All service config file properties are documented here.

#### Entire deployment

Each service deployment is a single Java application. You can configure this Java application with a custom application.yaml file, or (more commonly) override the default application properties with environment variables. All application properties are documented here.
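As an illustration of the environment-variable approach, here is a hedged sketch. The variable names below are hypothetical placeholders, not Tanagra's actual property names; consult the application properties documentation for the real ones.

```sh
# Hypothetical property overrides -- see the application properties docs
# for the real property names and their environment-variable equivalents.
export TANAGRA_DB_URI="jdbc:postgresql://localhost:5432/tanagra"
export TANAGRA_UNDERLAY_FILES="underlay1.yaml,underlay2.yaml"

# Run the packaged service with the overrides in effect.
java -jar tanagra-service.jar
```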

## Deployment Patterns

### Single project for indexer and service

This is the simplest way to get up and running with Tanagra: use the same GCP project for both the indexer environment and the service deployment. It can also be useful for automated testing or dev environments. For production services, though, we recommend separating the indexer and service projects.

### Indexer environment supports multiple service deployments

A single indexer environment can support multiple service deployments. One GCP project is configured for indexing (i.e. Dataflow is enabled, and there are one or more service accounts with permissions to write to BigQuery and kick off Dataflow jobs). Multiple underlays, or multiple versions of a single underlay, are indexed in this project, each into its own index BigQuery dataset. You could use different service accounts for each source dataset.

Service deployments may then read/query these index datasets directly from the indexer environment project. Or you can "publish" (i.e. copy) the index datasets to another project (e.g. the service deployment project). The Verily test and dev service deployments both read directly from the indexer environment project. The AoU production service deployment will add the "publish" step.
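A "publish" step can be as simple as copying each index table into a dataset in the service deployment project. Below is a minimal per-table sketch with `bq` (it assumes `jq` is installed, and the project and dataset names are placeholders); BigQuery's dataset-copy transfer service is an alternative for copying a whole dataset at once.

```sh
# Hypothetical source (indexer) and target (service) locations.
SRC=my-indexer-project:index_dataset
DST=my-service-project:index_dataset

# Create the target dataset, then copy each index table into it.
bq mk --dataset --location=US ${DST}
for table in $(bq ls --format=json ${SRC} | jq -r '.[].tableReference.tableId'); do
  bq cp --force ${SRC}.${table} ${DST}.${table}
done
```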

Separating the indexer environment from the service deployment means you can avoid granting extra permissions in the service deployment project (e.g. you don't need to enable Dataflow in your audited production project).

### Service deployment hosts multiple underlays

A single service deployment can host one or more underlays. Each underlay should have its own service configuration file. The service deployment configuration allows specifying multiple service configuration files.

Keep in mind that the access control implementation is per deployment, not per underlay. If you want underlay-specific access control, your access control implementation should behave differently depending on which underlay is being used.

The Verily test and dev service deployments both host multiple underlays.