
Commit 7164f4f

lawrenae and aelawrence authored Jan 28, 2022
Feature/custom worker service account (#711)
* docs for using a custom compute service account

Co-authored-by: Lawrence, Andrew <aelawrence@google.com>
1 parent 37709f8 commit 7164f4f

3 files changed: +79 −62 lines changed


README.md (+7 −8)
@@ -18,8 +18,13 @@ Please see the following links for more information:
 * Blog post: [How Color uses the new Variant Transforms tool for breakthrough clinical data science with BigQuery](https://cloud.google.com/blog/big-data/2018/03/how-color-uses-the-new-variant-transforms-tool-for-breakthrough-clinical-data-science-with-bigquery).
 * Blog post: [Accelerating Mayo Clinic’s data platform with BigQuery and Variant Transforms](https://cloud.google.com/blog/products/data-analytics/genome-data-analytics-with-google-cloud).
 * Jupyter notebook: [Sample queries to explore variant data in BigQuery](docs/sample_queries)
+* The underlying pipeline uses [Cloud Dataflow](https://cloud.google.com/dataflow/).
+  You can navigate to the [Dataflow Console](https://console.cloud.google.com/dataflow)
+  to see a more detailed view of the pipeline (e.g. number of records being
+  processed, number of workers, more detailed error logs).
+
-### Prerequisites
+## Prerequisites
 
 1. Follow the [getting started](https://cloud.google.com/genomics/docs/how-tos/getting-started)
    instructions on the Google Cloud page.
@@ -100,14 +105,8 @@ gcloud config set project GOOGLE_CLOUD_PROJECT
 gcloud config set compute/region REGION
 ```
 
-If you would like to run Variant Transforms in a custom subnetwork, see the
-[Advanced Flags](docs/setting_region.md#advanced-flags) documentation.
+Options for controlling the service account, subnetwork, and similar settings
+are described in the [Advanced Flags](docs/advanced_flags.md) documentation.
 
-The underlying pipeline uses
-[Cloud Dataflow](https://cloud.google.com/dataflow/). You can navigate to the
-[Dataflow Console](https://console.cloud.google.com/dataflow), to see more
-detailed view of the pipeline (e.g. number of records being processed, number of
-workers, more detailed error logs).
 
 ### Running from github

docs/advanced_flags.md (+72)
@@ -0,0 +1,72 @@
# Advanced Flags

## Custom Networks

Variant Transforms supports custom networks. This can be used to start the
processing VMs in a specific subnetwork of your Google Cloud project, as
opposed to the default network.

Specify a subnetwork by using the `--subnetwork` flag and provide the name of
the subnetwork, as follows: `--subnetwork my-subnet`. Use just the name of the
subnet, not the full path.

Example:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --subnetwork my-subnet \
  ...
  "${COMMAND}"
```
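
If you are unsure of the subnet name, you can list the subnets visible to your
project with `gcloud`. A quick sketch; the `us-central1` region filter is just
an example, adjust it to the region you run in:

```bash
# List subnets in a region; pass the NAME column value to --subnetwork,
# not the full resource path.
gcloud compute networks subnets list --filter="region:us-central1"
```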

## Removing External IPs

Variant Transforms allows disabling the use of external IP addresses with the
`--use_public_ips` flag. If not specified, this defaults to true, so to restrict
the use of external IP addresses, use `--use_public_ips false`. Note that without
external IP addresses, VMs can only send packets to other internal IP addresses.
To allow these VMs to connect to the external IP addresses used by Google APIs
and services, you can
[enable Private Google Access](https://cloud.google.com/vpc/docs/configure-private-google-access)
on the subnet.

Example:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --use_public_ips false \
  ...
  "${COMMAND}"
```
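
For reference, Private Google Access can be enabled on an existing subnet with
`gcloud`; a minimal sketch, assuming a subnet named `my-subnet` in `us-central1`:

```bash
# Enable Private Google Access so VMs without external IPs can still
# reach Google APIs and services.
gcloud compute networks subnets update my-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access
```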

## Custom Dataflow Runner Image

By default, Variant Transforms uses a custom docker image to run the pipeline in:
`gcr.io/cloud-lifesciences/variant-transforms-custom-runner:latest`.
This image contains all the necessary python/linux dependencies needed to run
Variant Transforms, so that they are not downloaded from the internet when the
pipeline starts.

You can override which container is used by passing `--sdk_container_image`, as
in the following example:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --sdk_container_image gcr.io/path/to/my/container \
  ...
  "${COMMAND}"
```
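
If you build your own runner image, it must be pushed somewhere the Dataflow
workers can pull it from, such as your project's Container Registry. A sketch,
assuming a Dockerfile in the current directory and a hypothetical image path:

```bash
# Build a custom runner image and push it to Container Registry.
# gcr.io/my-project/my-variant-runner is a hypothetical path; substitute
# your own project and image name.
docker build -t gcr.io/my-project/my-variant-runner:latest .
docker push gcr.io/my-project/my-variant-runner:latest
```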

## Custom Service Accounts

By default, the Dataflow workers will use the
[default compute service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account).
You can override which service account to use with the `--service_account` flag,
as in the following example:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --service_account my-cool-dataflow-worker@<PROJECT_ID>.iam.gserviceaccount.com \
  ...
  "${COMMAND}"
```
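
If the worker service account does not exist yet, it can be created and granted
the Dataflow worker role with `gcloud`. A minimal sketch; the account name is
hypothetical, and your pipeline may need additional roles to read its inputs
and write its outputs:

```bash
# Create a dedicated worker service account (name is hypothetical).
gcloud iam service-accounts create my-cool-dataflow-worker

# Grant it the Dataflow worker role so it can run pipeline work items.
gcloud projects add-iam-policy-binding "${GOOGLE_CLOUD_PROJECT}" \
  --member="serviceAccount:my-cool-dataflow-worker@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com" \
  --role="roles/dataflow.worker"
```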

**Other Service Account Notes:**

- The [Life Sciences Service Account is not changeable](https://cloud.google.com/life-sciences/docs/troubleshooting#missing_service_account)
- The [Dataflow Admin Service Account is not changeable](https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account)

docs/setting_region.md (−54)
@@ -92,57 +92,3 @@ You can choose the region for the BigQuery dataset at dataset creation time.
 
 ![BigQuery dataset region](images/bigquery_dataset_region.png)
 
-## Advanced Flags
-
-Variant Transforms supports custom networks. This can be used to start the processing
-VMs in a specific subnetwork of your Google Cloud project as opposed to the default
-network.
-
-Specify a subnetwork by using the `--subnetwork` flag and provide the name of
-the subnetwork as follows:
-`--subnetwork my-subnet`. Just use the name of the subnet, not the full path.
-
-Variant Transforms allows disabling the use of external IP addresses with the
-`--use_public_ips` flag. If not specified, this defaults to true, so to restrict the
-use of external IP addresses, use `--use_public_ips false`. Note that without external
-IP addresses, VMs can only send packets to other internal IP addresses. To allow these
-VMs to connect to the external IP addresses used by Google APIs and services, you can
-[enable Private Google Access](https://cloud.google.com/vpc/docs/configure-private-google-access)
-on the subnet.
-
-For example, to run Variant Transforms in a subnetwork you already created called
-`my-subnet` with no public IP addresses you can add these flags to the
-example above as follows:
-
-```bash
-COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
-
-docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
---project "${GOOGLE_CLOUD_PROJECT}" \
---region us-central1 \
---location us-central1 \
---temp_location "${TEMP_LOCATION}" \
---subnetwork my-subnet \
---use_public_ips false \
-"${COMMAND}"
-```
-
-## Custom Dataflow Runner Image
-By default Variant Transforms uses a custom docker image to run the pipeline in: `gcr.io/cloud-lifesciences/variant-transforms-custom-runner:latest`.
-This image contains all the necessary python/linux dependencies needed to run variant transforms so that they are not downloaded from the internet when the pipeline starts.
-
-You can override which container is used by passing a `--sdk_container_image` as in the following example:
-
-```bash
-COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
-
-docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
---project "${GOOGLE_CLOUD_PROJECT}" \
---region us-central1 \
---location us-central1 \
---temp_location "${TEMP_LOCATION}" \
---subnetwork my-subnet \
---use_public_ips false \
---sdk_container_image gcr.io/path/to/my/container\
-"${COMMAND}"
-```
