Bq to vcf sample ids #557
Conversation
Force-pushed from 19bf9bd to 39d9ef6.
Added a second commit which applies the following changes:
Sorry for the slow review, I will send more comments tomorrow morning.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
_FULL_INPUT_TABLE = '{TABLE}___{SUFFIX}'
We need to use TABLE_SUFFIX_SEPARATOR here.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info`')
Instead of {PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info we had better use {FULL_TABLE_ID} and construct FULL_TABLE_ID outside of the query, similar to what we did in the AVRO PR: https://github.com/googlegenomics/gcp-variant-transforms/pull/558/files#diff-7b9491fbb5998f2c837ddabb9582a5ba
Done, somewhat. I'd rather import base_table_id and construct the rest here. PTAL.
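For illustration only, a minimal sketch of the approach under discussion (building the sample-info table name outside the query string and formatting it in); the helper name and the plain '__sample_info' concatenation are assumptions, not the PR's exact code, which uses the shared separator/suffix constants:

_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{FULL_TABLE_ID}`')


def build_sample_info_query(project_id, dataset_id, base_table_id):
  # Hypothetical helper: compose the sample-info table name from the base
  # table id outside of the query template, instead of hard-coding the
  # suffix inside the template itself.
  full_table_id = base_table_id + '__sample_info'
  return _SAMPLE_INFO_QUERY_TEMPLATE.format(
      PROJECT_ID=project_id, DATASET_ID=dataset_id, FULL_TABLE_ID=full_table_id)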
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -117,7 +122,8 @@ def run(argv=None):
      '{}_meta_info.vcf'.format(unique_temp_id))
  _write_vcf_meta_info(known_args.input_table,
                       known_args.representative_header_file,
                       known_args.allow_incompatible_schema)
                       known_args.allow_incompatible_schema,
                       known_args.genomic_regions)
We are passing known_args.genomic_regions all the way down to _get_schema() only to use the suffix to identify the source table. Instead of known_args.input_table and known_args.genomic_regions we can "assemble" full_table_id and pass it down. This will also match my earlier comment about using FULL_TABLE_ID in _SAMPLE_INFO_QUERY_TEMPLATE.
As per offline discussions, removed the dependency on genomic_regions.
TABLE_SUFFIX_SEPARATOR = '___'
SAMPLE_TABLE_SUFFIX_SEPARATOR = '__'
Can we switch these two? I know we discussed this offline, but since we are releasing gnomAD with __chr* suffixes it makes sense to follow that standard here. Alternatively, we should rename those tables before we publish them. Let's discuss this offline...
As per offline discussions, I am going to remove SAMPLE_TABLE_SUFFIX_SEPARATOR in a follow-up PR.
help=('File containing list of shards and output table names. You '
      'can use provided default sharding_config file to split output '
      'by chromosome (one table per chromosome) which is located at: '
      'gcp_variant_transforms/data/sharding_configs/'
      'homo_sapiens_default.yaml'))
Did you update this comment, or is this a sync issue?
help=('A genomic regions (separated by a space) to load from BigQuery. '
      'The format of the genomic region should be '
      'REFERENCE_NAME:START_POSITION-END_POSITION or REFERENCE_NAME if '
      'the full chromosome is requested. Only variants matching at '
      'this region will be loaded. The chromosome identifier should be '
      'identical to the one provided in config file when the tables '
      'were being created. For example, '
      '`--genomic_regions chr2:1000-2000` will load all variants '
      '`chr2` with `start_position` in `[1000,2000)` from BigQuery. '
      'If the table with suffix `my_chrom3` was imported, '
      '`--genomic_regions my_chrom3` would return all the variants in '
      'that shard. This flag must be specified to indicate the table '
      'shard that needs to be exported to VCF file. NOTE:At the moment '
      'one and only one genomic region must be supplied.'))
Note that in sharding we might have an output table that contains variants of multiple chromosomes; for example, look at this (line 20 in 8daf1a7):
table_name_suffix: "chr04_05"
I think we should keep these two matters independent from one another:
- input_table must point to an actual table, meaning that the suffix must be included.
- genomic_regions only relates to the filter we apply to the reference_name column; this has nothing to do with the table suffix.
Referring back to the linked config above, we can have --input_table {BASE_TABLE_NAME}__chr04_05 and --genomic_regions chr4 chr3:1:1000. As you can see, that table wouldn't have any variants with reference_name equal to 'chr3' (because we know that config file), but there is no way we could infer this when we run bq_to_vcf. The user is responsible for providing correct inputs.
Done.
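To make the filter discussion above concrete, here is a minimal, hypothetical sketch (not the PR's actual parsing code) of how a --genomic_regions value could be formatted into a WHERE-clause fragment using the _GENOMIC_REGION_TEMPLATE shown in the earlier diff:

_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')


def region_to_filter(region, max_position=2 ** 31 - 1):
  # Hypothetical helper: 'chr2:1000-2000' -> a reference_name/position filter.
  reference_name, _, span = region.partition(':')
  start, _, end = span.partition('-')
  return _GENOMIC_REGION_TEMPLATE.format(
      REFERENCE_NAME_ID='reference_name',
      REFERENCE_NAME_VALUE=reference_name,
      START_POSITION_ID='start_position',
      START_POSITION_VALUE=start or 0,
      END_POSITION_ID='end_position',
      END_POSITION_VALUE=end or max_position)

# region_to_filter('chr2:1000-2000') ->
#   '(reference_name="chr2" AND start_position>=1000 AND end_position<=2000)'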
@@ -347,7 +347,7 @@ def create_output_table(full_table_id, total_base_pairs, schema_file_path):
    the worker that monitors the Dataflow job.

  Args:
    full_table_id: for example: projet:dataset.table_base_name__chr1
    full_table_id: for example: projet:dataset.table_base_name___chr1
If we agree to switch the suffixes as I requested, this change wouldn't be needed.
A few more comments. I also submitted #565; I think we should definitely fix that issue.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_mapping_table.GetSampleNames(
    beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()
I don't see why this is needed?
sample_ids in DensifyVariants needs to be run with the ToList combiner (i.e. to make it a single value of a List). But before I get there, when creating sample_names, I need to have it as a pure PCollection.
So I removed the logic that makes it a list when creating sample_ids; we then create sample_names, and then convert sample_ids to a list as it was before this PR.
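A minimal, runnable sketch of the ordering described in this reply, with made-up data and illustrative step names (this is not the PR's code): derive sample_names from the plain sample_ids PCollection first, and only then combine sample_ids into a single-element list.

import apache_beam as beam

with beam.Pipeline() as p:
  id_to_name = p | 'CreateDict' >> beam.Create([{1: 'NA001', 2: 'NA002'}])
  sample_ids = p | 'CreateIds' >> beam.Create([1, 2])
  # sample_ids is still a plain PCollection here, so it can be mapped to names.
  sample_names = (
      sample_ids
      | 'GetSampleNames' >> beam.Map(
          lambda sid, mapping: mapping[sid],
          beam.pvalue.AsSingleton(id_to_name))
      | 'CombineSampleNames' >> beam.combiners.ToList())
  # Only now is sample_ids combined into a single-element list, the shape the
  # downstream consumer (DensifyVariants in the PR) is said to expect.
  sample_ids_list = sample_ids | 'CombineSampleIds' >> beam.combiners.ToList()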
gcp_variant_transforms/bq_to_vcf.py
Outdated
# TODO(tneymanov): Add logic to extract sample names from sample IDs by
# joining with sample id-name mapping table, once that code is implemented.
sample_names = sample_ids
hash_table = (
Perhaps rename it to id_to_name_hash_table to make its distinction from the previous hash table, name_to_id_hash_table, clear. Also, consider dropping the hash_table suffix and making both parts plural: ids_to_names and names_to_ids.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
hash_table = (
    sample_table_rows
    | 'SampleIdToNameDict' >> sample_mapping_table.SampleIdToNameDict())
sample_names = (sample_ids
                | 'GetSampleNames' >>
                sample_mapping_table.GetSampleNames(
                    beam.pvalue.AsSingleton(hash_table))
                | 'CombineSampleNames' >> beam.combiners.ToList())
This should be moved to the else statement.
No, unfortunately not. This is because we have to accommodate the case where the user gives sample names but the table was created with the WITH_FILE_PATH option. We need to extract sample IDs, which may or may not have the same count as the initial sample_name values (if we had more than one of each sample name, but from different files). Then, no matter how we got the sample_ids (i.e. directly from the sample table when --sample_names was not invoked, or by mapping from that flag if it was), we convert them into sample names.
Unfortunately, this adds an additional run through the sample IDs when the --sample_names flag was invoked, but we kind of have to do it if we want the functionality that Aaron requested (i.e., for WITH_FILE_PATH, export N001_1, N001_2, N001_3...).
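As an illustration only (this is not the PR's implementation), the export naming described above for WITH_FILE_PATH tables could look like this: when the same sample name appears in multiple files, append a running index so each exported column stays unique.

from collections import Counter


def disambiguate_sample_names(sample_names):
  # Hypothetical helper: ['N001', 'N001', 'N001'] -> ['N001_1', 'N001_2', 'N001_3']
  counts = Counter()
  result = []
  for name in sample_names:
    counts[name] += 1
    result.append('{}_{}'.format(name, counts[name]))
  return result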
Force-pushed from fdcb611 to 16122fd.
Redoing integration testing, but ready for review.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
_FULL_INPUT_TABLE = '{TABLE}___{SUFFIX}'
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info`')
Done, somewhat. I'd rather import base_table_id and construct the rest here. PTAL.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -117,7 +122,8 @@ def run(argv=None):
      '{}_meta_info.vcf'.format(unique_temp_id))
  _write_vcf_meta_info(known_args.input_table,
                       known_args.representative_header_file,
                       known_args.allow_incompatible_schema)
                       known_args.allow_incompatible_schema,
                       known_args.genomic_regions)
As per offline discussions, removed the dependency on genomic_regions.
gcp_variant_transforms/bq_to_vcf.py
Outdated
# TODO(tneymanov): Add logic to extract sample names from sample IDs by
# joining with sample id-name mapping table, once that code is implemented.
sample_names = sample_ids
hash_table = (
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
hash_table = (
    sample_table_rows
    | 'SampleIdToNameDict' >> sample_mapping_table.SampleIdToNameDict())
sample_names = (sample_ids
                | 'GetSampleNames' >>
                sample_mapping_table.GetSampleNames(
                    beam.pvalue.AsSingleton(hash_table))
                | 'CombineSampleNames' >> beam.combiners.ToList())
No, unfortunately not. This is because we have to accommodate the case where the user gives sample names but the table was created with the WITH_FILE_PATH option. We need to extract sample IDs, which may or may not have the same count as the initial sample_name values (if we had more than one of each sample name, but from different files). Then, no matter how we got the sample_ids (i.e. directly from the sample table when --sample_names was not invoked, or by mapping from that flag if it was), we convert them into sample names.
Unfortunately, this adds an additional run through the sample IDs when the --sample_names flag was invoked, but we kind of have to do it if we want the functionality that Aaron requested (i.e., for WITH_FILE_PATH, export N001_1, N001_2, N001_3...).
TABLE_SUFFIX_SEPARATOR = '___'
SAMPLE_TABLE_SUFFIX_SEPARATOR = '__'
As per offline discussions, I am going to remove SAMPLE_TABLE_SUFFIX_SEPARATOR in a follow-up PR.
help=('A genomic regions (separated by a space) to load from BigQuery. '
      'The format of the genomic region should be '
      'REFERENCE_NAME:START_POSITION-END_POSITION or REFERENCE_NAME if '
      'the full chromosome is requested. Only variants matching at '
      'this region will be loaded. The chromosome identifier should be '
      'identical to the one provided in config file when the tables '
      'were being created. For example, '
      '`--genomic_regions chr2:1000-2000` will load all variants '
      '`chr2` with `start_position` in `[1000,2000)` from BigQuery. '
      'If the table with suffix `my_chrom3` was imported, '
      '`--genomic_regions my_chrom3` would return all the variants in '
      'that shard. This flag must be specified to indicate the table '
      'shard that needs to be exported to VCF file. NOTE:At the moment '
      'one and only one genomic region must be supplied.'))
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_mapping_table.GetSampleNames(
    beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()
sample_ids in DensifyVariants needs to be run with the ToList combiner (i.e. to make it a single value of a List). But before I get there, when creating sample_names, I need to have it as a pure PCollection.
So I removed the logic that makes it a list when creating sample_ids; we then create sample_names, and then convert sample_ids to a list as it was before this PR.
Force-pushed from 16122fd to db93ef1.
Please take a look at my reply to your comments in bq_to_vcf.py. More comments coming soon...
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
Let's rename this to something more meaningful (for example, variant_query?) to highlight the difference between it and sample_query. Similarly, let's rename bq_source to bq_variant_source.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
Isn't this TODO already fulfilled?
Hmm I guess. Removed.
bigquery_util.raise_error_if_dataset_not_exists(client, project_id,
                                                dataset_id)
I don't think we need this check; checking the existence of the tables covers it as well.
Done.
if not bigquery_util.table_exist(client, project_id, dataset_id, table_id):
  raise ValueError('Table {}:{}.{} does not exist.'.format(
      project_id, dataset_id, table_id))
nit: I'd move this higher; basically this order is more natural to me:
- input table exists
- input table follows base_name___suffix
- input sample table exists
Done.
Force-pushed from 5963315 to 79bca93.
Thanks Saman, synced, addressed the comments and adjusted integration tests. Will launch them now.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
Hmm I guess. Removed.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
Done.
bigquery_util.raise_error_if_dataset_not_exists(client, project_id,
                                                dataset_id)
Done.
if not bigquery_util.table_exist(client, project_id, dataset_id, table_id):
  raise ValueError('Table {}:{}.{} does not exist.'.format(
      project_id, dataset_id, table_id))
Done.
Force-pushed from 79bca93 to bb8baef.
Force-pushed from bb8baef to d33ae65.
This review only covers the bq_to_vcf.py module. I will review the rest of the PR later this morning.
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
Instead of defining a new const, can't we reuse vcf_parser.LAST_HEADER_LINE_PREFIX? I see here we have 'FORMAT' while it's missing from the other constant; I am not entirely sure why.
Umm, this change is outside the scope of this PR. However, great catch - this is a bug of sorts. FORMAT may or may not be supplied, depending on whether samples are present in the VCF file (no samples, no FORMAT). This one specifically needs a bit of thought so that FORMAT is added to the resulting VCF file iff samples are present in the BQ table. I'll add an issue to follow up on this.
Please update this comment with the issue so we can refer back to this PR later.
Done #592
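A small sketch of the follow-up idea described above; the constant and helper names here are assumptions for illustration, not the fix that was actually filed: emit FORMAT only when the table has samples.

_VCF_FIXED_COLUMNS_NO_FORMAT = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL',
                                'FILTER', 'INFO']


def vcf_header_columns(sample_names):
  # FORMAT (and the per-sample columns) only make sense if samples exist.
  if sample_names:
    return _VCF_FIXED_COLUMNS_NO_FORMAT + ['FORMAT'] + list(sample_names)
  return list(_VCF_FIXED_COLUMNS_NO_FORMAT)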
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
_VCF_VERSION_LINE = '##fileformat=VCFv4.3\n'
Similarly, can't we instead use vcf_parser.FILE_FORMAT_HEADER_TEMPLATE?
Also outside of the scope, but done. Got it from vcf_header_io since it's already imported. Also added '\n'.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -137,7 +146,7 @@ def run(argv=None):
def _write_vcf_meta_info(input_table,
                         representative_header_file,
                         allow_incompatible_schema):
  # type: (str, str, bool) -> None
  # type: (str, str, bool, str) -> None
Remove the extra ", str".
Nice catch, Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -164,30 +173,57 @@ def _bigquery_to_vcf_shards(
    `vcf_header_file_path`.
  """
  schema = _get_schema(known_args.input_table)
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
  query = _get_variant_query(known_args, schema)
Rename to variant_query?
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
This line duplicates the logic of compose_table_name(); I think we should use that function instead. Also, I don't see this const used anywhere in this module.
Oops, you are right, removed.
gcp_variant_transforms/bq_to_vcf.py
Outdated
annotation_names = _extract_annotation_names(schema)

base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
We now have a function for doing this: get_table_base_name().
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
| transforms.Create(known_args.sample_names,
                    reshuffle=False)
| beam.combiners.ToList())
hash_table = (
Instead of using hash_table in both the if and else statements, we could use a clearer name that shows the direction of the lookup map.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_names = (p
                | transforms.Create(known_args.sample_names,
                                    reshuffle=False))
sample_ids = (sample_names
To highlight the distinction between the sample_names PCollection and the list, we could perhaps rename this variable to sample_ids_list and the following variable to sample_names_list. Please feel free to come up with a better name than what I propose here; I don't like what I suggested myself :D
hmm I don't know either... thought of consolidated_sample_ids but ended up with just combined_sample_ids. Did the same for _names. Tell me if not up to par.
beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()

_ = (sample_names
Hmmm, here we need to pass a PCollection to the ParDo transform. However, at this point the sample_names object is a list. I am wondering how this code didn't fail... or am I missing something here?!
It's not a list; it's never a list until the pipeline is running - it's a PCollection of a single element, the list of sample names.
What happens here is that, for each element in the above PCollection, Beam runs the write_vcf_header_with_sample_names method, which writes the #CHROM... columns and then appends the values in that particular element. Since the PCollection only has one element, only one #CHROM... line will be written, containing all the sample names.
Thanks Saman.
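For readers unfamiliar with the pattern, a minimal runnable sketch of the behaviour described in the exchange above (names and data are illustrative, not the PR's code): the PCollection holds one element, the full list of sample names, so exactly one #CHROM line is produced.

import apache_beam as beam

_FIXED = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO',
          'FORMAT']


def build_header_line(sample_names):
  # In the real pipeline the equivalent step writes this line to a file.
  return '\t'.join(_FIXED + list(sample_names)) + '\n'


with beam.Pipeline() as p:
  sample_names = p | beam.Create([['NA001', 'NA002']])  # one element: a list
  _ = sample_names | 'WriteVcfDataHeader' >> beam.Map(build_header_line)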
Also increased the integration test timeout by 30 minutes, since the last run timed out. Now it's 4:30, which kind of gets out of hand - maybe something we should discuss...
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
Oops, you are right, removed.
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
Umm, this change is outside the scope of this PR. However, great catch - this is a bug of sorts. FORMAT may or may not be supplied, depending on whether samples are present in the VCF file (no samples, no FORMAT). This one specifically needs a bit of thought so that FORMAT is added to the resulting VCF file iff samples are present in the BQ table. I'll add an issue to follow up on this.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
_VCF_VERSION_LINE = '##fileformat=VCFv4.3\n'
Also outside of the scope, but done. Got it from vcf_header_io since it's already imported. Also added '\n'.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -137,7 +146,7 @@ def run(argv=None):
def _write_vcf_meta_info(input_table,
                         representative_header_file,
                         allow_incompatible_schema):
  # type: (str, str, bool) -> None
  # type: (str, str, bool, str) -> None
Nice catch, Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -164,30 +173,57 @@ def _bigquery_to_vcf_shards(
    `vcf_header_file_path`.
  """
  schema = _get_schema(known_args.input_table)
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
  query = _get_variant_query(known_args, schema)
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
annotation_names = _extract_annotation_names(schema)

base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
| transforms.Create(known_args.sample_names,
                    reshuffle=False)
| beam.combiners.ToList())
hash_table = (
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_names = (p
                | transforms.Create(known_args.sample_names,
                                    reshuffle=False))
sample_ids = (sample_names
hmm I don't know either... thought of consolidated_sample_ids but ended up with just combined_sample_ids. Did the same for _names. Tell me if not up to par.
beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()

_ = (sample_names
It's not a list; it's never a list until the pipeline is running - it's a PCollection of a single element, the list of sample names.
What happens here is that, for each element in the above PCollection, Beam runs the write_vcf_header_with_sample_names method, which writes the #CHROM... columns and then appends the values in that particular element. Since the PCollection only has one element, only one #CHROM... line will be written, containing all the sample names.
Force-pushed from 8066aa6 to 14c56c2.
A couple more comments, mostly just naming nits.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
These 3 const values should somehow be tied to the schema file for the sample_info table in #577.
Done, but removed FILE_PATH_COLUMN - not needed anymore.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""
This line needs to be updated.
Done.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""

  def _convert_bq_row(self, row):
Can we use a better name? For example, _extract_id_name?
Done.
def expand(self, pcoll):
  return (pcoll
          | 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Again, we might use a clearer name than BigQueryToMapping. How about ExtractIdNameTuples?
Done.
          | 'CombineToDict' >> beam.combiners.ToDict())


class GetSampleNames(beam.PTransform):
  """Transforms sample_ids to sample_names"""
I feel a bit uneasy about "Transforms"; how about "Looks up sample_names corresponding to the given sample_ids", or something along those lines?
Done.
"""Transforms sample_ids to sample_names""" | ||
|
||
def __init__(self, hash_table): | ||
# type: (Dict[int, Tuple(str, str)]) -> None |
The type of this Dict is int -> str (I think the Tuple is an old artifact)?
Right, nice catch. Done.
return pcoll | beam.Map(self._get_sample_id, self._hash_table)


class GetSampleIds(beam.PTransform):
  """Transform sample_names to sample_ids"""
Please make all the updates here similar to those in the previous class.
Done.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
WITH_FILE_SAMPLE_TEMPLATE = "{FILE_PATH}/{SAMPLE_NAME}"
This const is not used at all.
Done.
Force-pushed from acfd139 to 68f4109.
Apologies for so many silly type/naming mistakes - went through many iterations.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
WITH_FILE_SAMPLE_TEMPLATE = "{FILE_PATH}/{SAMPLE_NAME}"
Done.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""
Done.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""

  def _convert_bq_row(self, row):
Done.
def expand(self, pcoll):
  return (pcoll
          | 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Done.
class SampleNameToIdDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""

  def _convert_bq_row(self, row):
Done.
# type: (Dict[int, Tuple(str, str)]) -> None
self._hash_table = hash_table

def _get_sample_id(self, sample_id, hash_table):
Yeah, Done.
def __init__(self, hash_table):
  # type: (Dict[int, Tuple(str, str)]) -> None
  self._hash_table = hash_table
Done.
def expand(self, pcoll):
  return (pcoll
          | 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Done.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
Done, but removed FILE_PATH_COLUMN - not needed anymore.
Force-pushed from 68f4109 to 4b732c7.
Thanks Tural, things look much better now. Please address the latest comments before merging.
raise ValueError('Sample ID `{}` was not found.'.format(sample_id))

def expand(self, pcoll):
  return pcoll | beam.Map(self._get_sample_name, self._id_to_name_dict)
nit: This operation does not have a name. I am not sure what will be shown in the Dataflow diagram; if adding a name makes that diagram clearer, let's add one.
It will just show up as Map - custom names are sometimes necessary when two steps in a single flow have the same default names. But done nonetheless.
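A tiny illustration of that naming point (an arbitrary example, not from the PR): applying the same callable twice without explicit labels gives both steps the same default label, which Beam rejects or renames with a warning depending on the version, so each application gets its own '>>' name; the names also show up in the Dataflow UI.

import apache_beam as beam


def double(x):
  return x * 2


with beam.Pipeline() as p:
  ids = p | 'CreateIds' >> beam.Create([1, 2, 3])
  # Without 'DoubleOnce'/'DoubleAgain', both Map steps would default to the
  # label Map(double) and collide.
  first = ids | 'DoubleOnce' >> beam.Map(double)
  second = first | 'DoubleAgain' >> beam.Map(double)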
raise ValueError('Sample `{}` was not found.'.format(sample_name))

def expand(self, pcoll):
  return pcoll | beam.Map(self._get_sample_id, self._name_to_id_dict)
Similarly here.
Done.
@@ -56,4 +56,4 @@ steps:
# - '--gs_dir bashir-variant_integration_test_runs'
images:
- 'gcr.io/${PROJECT_ID}/gcp-variant-transforms:${COMMIT_SHA}'
timeout: 240m
timeout: 270m
I am just curious: what causes the longer test times?
Same... not really sure. It took 3h to finish on gcp-test, but timed out on my own project after 4.5h. It's unclear to me why, because my quota should be identical... Maybe I ran two integration tests simultaneously and there weren't enough workers? I don't know.
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
Please update this comment with the issue so we can refer back to this PR later.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_query = _SAMPLE_INFO_QUERY_TEMPLATE.format(PROJECT_ID=project_id,
                                                  DATASET_ID=dataset_id,
                                                  BASE_TABLE_ID=base_table_id)
Here we assume the sample info table will follow our expected naming convention: BASE_TABLE_ID + TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX.
What happens if this is not the case? My guess is this:
- We will make a dict (either id_to_name or name_to_id) which is empty.
- On the first lookup we will fail and raise an exception about a missing sample_name or sample_id.
Is this the case?
If yes, then it will be kind of confusing for the user to find out that the real reason is the missing sample info table. Is there a better way we could handle a missing (or even empty) sample info table?
No, we will fail at the variant_transform_options stage, as we require these tables to exist. If the table exists but is empty, there is nothing we can do about it, I think, as we cannot know what rows should be in those tables.
| 'CombineToList' >> beam.combiners.ToList()
| 'SortSampleNames' >> beam.ParDo(sorted))
Hmm, I just realized this: doesn't all this logic require that sample_names and sample_ids have the exact same order? Doesn't reordering one (here, sorting sample_names) without modifying the other make our output wrong?
I think I found the answer: sample_ids and sample_names are temporary, and what matters is the content of combined_sample_ids and combined_sample_names.
If that's the case then let's do this:
- Rename the variables to state this fact, for example sample_ids -> temp_sample_ids, and similarly for sample_names. And then combined_sample_ids -> sample_ids. This way we indicate the temporary state of those two variables.
- Line 220: that ToList() operation is not needed yet (I think)?
Another follow-up question: even if we sort sample_names in line 223, when we process it in the next stage doesn't it get accessed randomly due to Beam's processing paradigm?
- Done renaming.
- ToList() is required for the sort operation: values need to be combined into one list so we can run sorted() over it, the results of which are separated back into a PCollection (see the sketch below).
- No, I think randomization only happens when we create a PCollection without reshuffle=false.
Outdated
name_to_id_hash_table = (
    sample_table_rows
    | 'SampleNameToIdDict' >> sample_mapping_table.SampleNameToIdDict())
Let's move this to after the if statement, right before we use it.
Done.
TABLE_SUFFIX_SEPARATOR))
base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
sample_table_id = (
    base_table_id + TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX)
Replace with bigquery_util.compose_table_name(base_table_id, SAMPLE_INFO_TABLE_SUFFIX)
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{BASE_TABLE_ID}' +
    TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX + '`')
I don't like the fact that we are duplicating the logic of bigquery_util.compose_table_name(). Can we avoid this duplication?
Done.
def dict_values_equal(expected_dict):
  """Verifies that dictionary is the same as expected."""
  def _items_equal(actual_dict):
    actual = actual_dict[0]
I don't understand why this line is needed; let's discuss offline.
Once the pipeline finishes, in this case the result is combined into a single dict. However, that dict is still inside a PCollection, which consists of this single element. When transformed into a proper data structure, it becomes a list with a single item: the dict.
I'm down to discuss on VC.
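A runnable sketch of the test pattern this thread is about (made-up data, not the PR's test verbatim): the ToDict() output is a single-element PCollection, so the matcher receives a one-item list and compares its first element against the expected dict.

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that


def dict_values_equal(expected_dict):
  def _items_equal(actual):
    # 'actual' is the list of elements in the PCollection: [the_dict].
    if actual[0] != expected_dict:
      raise AssertionError(
          'Expected {} but got {}'.format(expected_dict, actual[0]))
  return _items_equal


with TestPipeline() as p:
  result = (p
            | beam.Create([(1, 'NA001'), (2, 'NA002')])
            | beam.combiners.ToDict())
  assert_that(result, dict_values_equal({1: 'NA001', 2: 'NA002'}))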
Force-pushed from 050e187 to 0267802.
Force-pushed from 0267802 to abe6d9b.
- Add BQ tests back with modified files.
- Adjust indentations between tables (__chr -> ___chr).
- Make sure to generate <SAMPLE_NAME>_<INDEX> for WITH_FILE_PATH encoding as per Sync meeting.
- Adjust --sample_names flag to handle WITH_FILE_PATH encoding as per Sync meeting.
- Rename --genomic_region back to --genomic_regions while still forcing 1 and only 1 value as per Sync meeting.
…nd IDs, and modify integration tests.
Force-pushed from abe6d9b to 62d6be0.
This PR:
- Renames genomic_regions to genomic_region, which becomes mandatory.
- If sample_names was not provided: builds an (id -> name+file) mapping from the BQ rows and emits {sample_name} or {file_path}/{sample_name} based on the encoding.
- If sample_names was provided: builds a (name -> id) mapping from the BQ rows.
Note: Had to disable BQ tests, because I will need to recreate all of the test table inputs and output files. That's gonna take a while, so you can start the review right away. Tested manually the following cases: with and without preserve_sample_order.