
Bq to vcf sample ids #557

Merged (8 commits) on May 8, 2020

Conversation

tneymanov (Collaborator):

This PR:

  • Changes the flag genomic_regions to genomic_region, which becomes mandatory.
  • Gets BQ rows from the sample_info table.
  • If sample_names was not provided:
    • Creates a hash table (id -> name+file) from the BQ rows.
    • Guesses the encoding.
    • Generates sample names as either {sample_name} or {file_path}/{sample_name}, depending on the encoding.
  • If sample_names was provided:
    • Creates a hash table (name -> id) from the BQ rows.
    • Generates sample IDs from the provided list.
    • Note: the file+sample case is not supported right now, as I'm not sure how we want to approach getting file names.
    • Note: I'm not sure whether rehashing sample names into IDs would have been better, but currently we get them from BQ. Open to changes.
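The lookup-and-generate steps above can be sketched in plain Python. This is a hypothetical sketch: the row dicts, the constant names, and the duplicate-name heuristic in guess_encoding are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch of the id -> name+file lookup and encoding-based
# name generation; field names mirror the sample_info columns, but the
# helpers and the encoding guess are made up for this example.
WITHOUT_FILE_PATH = 'WITHOUT_FILE_PATH'
WITH_FILE_PATH = 'WITH_FILE_PATH'

def guess_encoding(rows):
    # Assumption: if the same sample name appears under multiple files,
    # names alone are ambiguous, so the file path must be part of the name.
    names = [r['sample_name'] for r in rows]
    return WITH_FILE_PATH if len(set(names)) < len(names) else WITHOUT_FILE_PATH

def generate_sample_names(rows):
    id_to_row = {r['sample_id']: r for r in rows}  # hash table: id -> name+file
    encoding = guess_encoding(rows)
    names = {}
    for sample_id, row in id_to_row.items():
        if encoding == WITHOUT_FILE_PATH:
            names[sample_id] = row['sample_name']
        else:
            names[sample_id] = '{}/{}'.format(row['file_path'],
                                              row['sample_name'])
    return names

rows = [
    {'sample_id': 1, 'sample_name': 'N001', 'file_path': 'gs://bucket/a.vcf'},
    {'sample_id': 2, 'sample_name': 'N001', 'file_path': 'gs://bucket/b.vcf'},
]
print(generate_sample_names(rows))
```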

Note: I had to disable the BQ tests, because I will need to recreate all of the test table inputs and output files. That's going to take a while, so you can start the review right away. I manually tested the following cases:

  1. No sample names + without file path + no preserve_sample_order
  2. No sample names + without file path + preserve_sample_order
  3. No sample names + with file path
  4. Sample names + without file path

@tneymanov tneymanov requested a review from samanvp on February 24, 2020 11:37
@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch 2 times, most recently from 19bf9bd to 39d9ef6 on February 24, 2020 18:24
@tneymanov (Collaborator Author):

Added a second commit, which applies the following changes:

  • Add the BQ tests back with modified files.
  • Adjust the separators between table names (__chr -> ___chr).
  • Make sure to generate <SAMPLE_NAME>_ for the WITH_FILE_PATH encoding, as per the sync meeting.
  • Adjust the --sample_names flag flow to handle the WITH_FILE_PATH encoding, as per the sync meeting.
  • Rename --genomic_region back to --genomic_regions, while still requiring exactly one value, as per the sync meeting.

@samanvp (Contributor) left a comment:

Sorry for the slow review; I will send more comments tomorrow morning.

_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
_FULL_INPUT_TABLE = '{TABLE}___{SUFFIX}'
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '

Contributor:

We need to use TABLE_SUFFIX_SEPARATOR here.

Collaborator Author:

Done.

                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info`')

Contributor:

Instead of {PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info, we'd better use {FULL_TABLE_ID} and construct FULL_TABLE_ID outside of the query, similar to what we did in the AVRO PR:
https://github.com/googlegenomics/gcp-variant-transforms/pull/558/files#diff-7b9491fbb5998f2c837ddabb9582a5ba
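
For illustration, constructing the full table ID outside the query might look like this. This is a sketch under assumed separator and suffix values; sample_info_query is a made-up helper name, not the project's API.

```python
# Sketch: build the fully-qualified sample-info table name first, then
# substitute a single {FULL_TABLE_ID} placeholder into the query template.
# Separator and suffix values follow the PR discussion and are assumptions.
SAMPLE_TABLE_SUFFIX_SEPARATOR = '__'
SAMPLE_INFO_TABLE_SUFFIX = 'sample_info'
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path FROM `{FULL_TABLE_ID}`')

def sample_info_query(project_id, dataset_id, base_table_id):
    full_table_id = '{}.{}.{}{}{}'.format(
        project_id, dataset_id, base_table_id,
        SAMPLE_TABLE_SUFFIX_SEPARATOR, SAMPLE_INFO_TABLE_SUFFIX)
    return _SAMPLE_INFO_QUERY_TEMPLATE.format(FULL_TABLE_ID=full_table_id)

print(sample_info_query('my-project', 'my_dataset', 'my_table'))
```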

Collaborator Author:

Done, somewhat. I'd rather import base_table_id and construct the rest here. PTAL.

@@ -117,7 +122,8 @@ def run(argv=None):
       '{}_meta_info.vcf'.format(unique_temp_id))
   _write_vcf_meta_info(known_args.input_table,
                        known_args.representative_header_file,
-                       known_args.allow_incompatible_schema)
+                       known_args.allow_incompatible_schema,
+                       known_args.genomic_regions)

Contributor:

We are passing known_args.genomic_regions all the way down to _get_schema() only to use the suffix to identify the source table.
Instead of known_args.input_table and known_args.genomic_regions, we can assemble full_table_id and pass it down. This also matches my earlier comment about using FULL_TABLE_ID in _SAMPLE_INFO_QUERY_TEMPLATE.

Collaborator Author:

As per offline discussions, removed the dependency on genomic_regions.

Comment on lines 35 to 43
TABLE_SUFFIX_SEPARATOR = '___'
SAMPLE_TABLE_SUFFIX_SEPARATOR = '__'

Contributor:

Can we switch these two?
I know we discussed this offline, but since we are releasing gnomAD with __chr* suffixes, it makes sense to follow that standard here. Alternatively, we should rename those tables before we publish them. Let's discuss this offline...

Collaborator Author:

As per offline discussions, I am going to remove SAMPLE_TABLE_SUFFIX_SEPARATOR in a follow-up PR.

Comment on lines 144 to 148
    help=('File containing list of shards and output table names. You '
          'can use provided default sharding_config file to split output '
          'by chromosome (one table per chromosome) which is located at: '
          'gcp_variant_transforms/data/sharding_configs/'
          'homo_sapiens_default.yaml'))

Contributor:

Did you update this comment, or is this a sync issue?

Comment on lines 547 to 541
    help=('A genomic regions (separated by a space) to load from BigQuery. '
          'The format of the genomic region should be '
          'REFERENCE_NAME:START_POSITION-END_POSITION or REFERENCE_NAME if '
          'the full chromosome is requested. Only variants matching at '
          'this region will be loaded. The chromosome identifier should be '
          'identical to the one provided in config file when the tables '
          'were being created. For example, '
          '`--genomic_regions chr2:1000-2000` will load all variants '
          '`chr2` with `start_position` in `[1000,2000)` from BigQuery. '
          'If the table with suffix `my_chrom3` was imported, '
          '`--genomic_regions my_chrom3` would return all the variants in '
          'that shard. This flag must be specified to indicate the table '
          'shard that needs to be exported to VCF file. NOTE:At the moment '
          'one and only one genomic region must be supplied.'))

Contributor:

Note that with sharding we might have an output table that contains variants of multiple chromosomes; for example, look at this:

I think we should keep these two matters independent from one another:

  • input_table must point to an actual table, meaning that the suffix must be included.
  • genomic_regions is only related to the filter we apply to the reference_name column; it has nothing to do with the table suffix.

Referring back to the linked config above, we could have --input_table {BASE_TABLE_NAME}__chr04_05 and --genomic_regions chr4 chr3:1:1000. As you can see, that table wouldn't have any variants with reference_name equal to 'chr3' (because we know the config file), but there is no way we could infer this when we run bq_to_vcf. The user is responsible for providing correct inputs.
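
To make that separation concrete, here is a sketch of building the reference_name filter from a --genomic_regions value using the _GENOMIC_REGION_TEMPLATE quoted earlier in this review. The column names and the open-ended bounds for a bare chromosome are assumptions.

```python
# Sketch: genomic_regions only builds a filter on reference_name /
# start_position / end_position; it says nothing about which table
# shard (suffix) is queried. Template copied from the review context.
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')

def region_filter(region):
    # Accepts 'chr2:1000-2000' or a bare 'chr2' (whole chromosome).
    if ':' in region:
        ref, span = region.split(':')
        start, end = span.split('-')
    else:
        # Assumed open-ended bounds for a whole-chromosome request.
        ref, start, end = region, 0, 2 ** 63 - 1
    return _GENOMIC_REGION_TEMPLATE.format(
        REFERENCE_NAME_ID='reference_name', REFERENCE_NAME_VALUE=ref,
        START_POSITION_ID='start_position', START_POSITION_VALUE=start,
        END_POSITION_ID='end_position', END_POSITION_VALUE=end)

print(region_filter('chr2:1000-2000'))
```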

Collaborator Author:

Done.

@@ -347,7 +347,7 @@ def create_output_table(full_table_id, total_base_pairs, schema_file_path):
       the worker that monitors the Dataflow job.

   Args:
-    full_table_id: for example: projet:dataset.table_base_name__chr1
+    full_table_id: for example: projet:dataset.table_base_name___chr1

Contributor:

If we agree to switch the suffixes as I requested, this change wouldn't be needed.

@samanvp (Contributor) left a comment:

A few more comments.
I also submitted #565; I think we should definitely fix that issue.

sample_mapping_table.GetSampleNames(
beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()

Contributor:

I don't see why this is needed?

Collaborator Author:

sample_ids in DensifyVariants needs to be run through the ToList combiner (i.e. made into a single value containing a list). But before I get there, when creating sample_names, I need to have it as a plain PCollection.

So I removed the logic that makes it a list when creating sample_ids, then create sample_names, and then convert sample_ids to a list as it was before this PR.

# TODO(tneymanov): Add logic to extract sample names from sample IDs by
# joining with sample id-name mapping table, once that code is implemented.
sample_names = sample_ids
hash_table = (

Contributor:

Perhaps rename to id_to_name_hash_table to make its distinction from the previous hash table, name_to_id_hash_table, clear.
Also, consider dropping the hash_table suffix and making both parts plural: ids_to_names and names_to_ids.

Collaborator Author:

Done.

Comment on lines 213 to 223
    hash_table = (
        sample_table_rows
        | 'SampleIdToNameDict' >> sample_mapping_table.SampleIdToNameDict())
    sample_names = (sample_ids
                    | 'GetSampleNames' >>
                    sample_mapping_table.GetSampleNames(
                        beam.pvalue.AsSingleton(hash_table))
                    | 'CombineSampleNames' >> beam.combiners.ToList())

Contributor:

This should be moved to the else statement.

Collaborator Author:

No, unfortunately not. This is because we have to accommodate the case where the user provides sample names, but the table was created with the WITH_FILE_PATH option. We need to extract sample IDs, which may or may not have the same count as the initial sample_name values (if we had more than one of the same sample name, but from different files). Then, no matter how we got the sample_ids (i.e. directly from the sample table when --sample_names was not supplied, or by mapping from that flag when it was), we convert them back into sample names.

Unfortunately, this adds an additional pass through the sample IDs when the --sample_names flag is supplied, but we kind of have to do it if we want the functionality that Aaron requested (i.e., for WITH_FILE_PATH, export N001_1, N001_2, N001_3, ...).
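
The name -> id -> name round trip described here can be sketched as follows. This is a toy model: the helper names and row shape are made up, and the numeric suffixes follow the N001_1, N001_2 example above.

```python
# Sketch: the user supplies sample names; under WITH_FILE_PATH one name can
# map to several sample IDs (same sample ingested from different files), so
# we first expand names to IDs, then regenerate unique export names with
# numeric suffixes. All names here are illustrative.
def names_to_ids(requested_names, rows):
    mapping = {}
    for row in rows:  # hash table: name -> [ids]
        mapping.setdefault(row['sample_name'], []).append(row['sample_id'])
    return [i for name in requested_names for i in mapping.get(name, [])]

def ids_to_export_names(sample_ids, rows):
    id_to_name = {r['sample_id']: r['sample_name'] for r in rows}
    counts = {}
    result = []
    for sample_id in sample_ids:
        name = id_to_name[sample_id]
        counts[name] = counts.get(name, 0) + 1
        result.append('{}_{}'.format(name, counts[name]))
    return result

rows = [
    {'sample_id': 10, 'sample_name': 'N001', 'file_path': 'a.vcf'},
    {'sample_id': 11, 'sample_name': 'N001', 'file_path': 'b.vcf'},
]
ids = names_to_ids(['N001'], rows)      # one requested name expands to two IDs
print(ids_to_export_names(ids, rows))
```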

@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch 4 times, most recently from fdcb611 to 16122fd on March 27, 2020 14:04

@tneymanov (Collaborator Author) left a comment:

Redoing integration testing, but ready for review.

@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch from 16122fd to db93ef1 on March 27, 2020 14:33

@samanvp (Contributor) left a comment:

Please take a look at my reply to your comments in bq_to_vcf.py. More comments coming soon...

@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
# TODO(allieychen): Modify the SQL query with the specified sample_ids.
query = _get_bigquery_query(known_args, schema)

Contributor:

Let's rename this to something more meaningful (for example variant_query?) to highlight the difference between it and sample_query.
Similarly, let's rename bq_source to bq_variant_source.

Collaborator Author:

Done.

@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
# TODO(allieychen): Modify the SQL query with the specified sample_ids.

Contributor:

Isn't this TODO already fulfilled?

Collaborator Author:

Hmm I guess. Removed.

Comment on lines 598 to 609
    bigquery_util.raise_error_if_dataset_not_exists(client, project_id,
                                                    dataset_id)

Contributor:

I don't think we need this check; checking the existence of the tables covers it as well.

Collaborator Author:

Done.

Comment on lines 600 to 612
    if not bigquery_util.table_exist(client, project_id, dataset_id, table_id):
      raise ValueError('Table {}:{}.{} does not exist.'.format(
          project_id, dataset_id, table_id))

Contributor:

nit: I'd move this higher; basically this order is more natural to me:

  • input table exists
  • input table follows base_name___suffix
  • input sample table exists

Collaborator Author:

Done.

@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch 2 times, most recently from 5963315 to 79bca93 on April 9, 2020 08:04

@tneymanov (Collaborator Author) left a comment:

Thanks Saman; synced, addressed the comments, and adjusted the integration tests. Will launch them now.


@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch from 79bca93 to bb8baef on April 9, 2020 16:26
@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch from bb8baef to d33ae65 on April 17, 2020 01:38

@samanvp (Contributor) left a comment:

This review only covers the bq_to_vcf.py module. I will review the rest of the PR later this morning.

_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']

Contributor:

Instead of defining a new const, can't we reuse vcf_parser.LAST_HEADER_LINE_PREFIX?
I see here we have 'FORMAT' while it's missing from the other constant; I am not entirely sure why.

Collaborator Author:

Umm, this change is outside the scope of this PR. However, great catch: this is a bug of sorts. FORMAT may or may not be supplied, depending on whether samples are present in the VCF file (no samples, no FORMAT). This specifically needs some thought, so that FORMAT is added to the resulting VCF file iff samples are present in the BQ table. I'll add an issue to follow up on this.

Contributor:

Please update this comment with the issue so we can refer back to this PR later.

Collaborator Author:

Done #592



_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
_VCF_VERSION_LINE = '##fileformat=VCFv4.3\n'

Contributor:

Similarly, can't we instead use vcf_parser.FILE_FORMAT_HEADER_TEMPLATE?

Collaborator Author:

Also outside the scope, but done. Got it from vcf_header_io, since it's already imported. Also added '\n'.

@@ -137,7 +146,7 @@ def run(argv=None):
 def _write_vcf_meta_info(input_table,
                          representative_header_file,
                          allow_incompatible_schema):
-  # type: (str, str, bool) -> None
+  # type: (str, str, bool, str) -> None

Contributor:

Remove extra , str

Collaborator Author:

Nice catch, Done.

@@ -164,30 +173,57 @@ def _bigquery_to_vcf_shards(
       `vcf_header_file_path`.
   """
   schema = _get_schema(known_args.input_table)
-  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
-  query = _get_bigquery_query(known_args, schema)
+  query = _get_variant_query(known_args, schema)

Contributor:

rename to variant_query?

Collaborator Author:

Done.



_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'

Contributor:

This line duplicates the logic of compose_table_name(); I think we should use that function instead.

Also, I don't see this const used anywhere in this module.

Collaborator Author:

Oops, you are right, removed.

annotation_names = _extract_annotation_names(schema)

base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]

Contributor:

We now have a function for doing this: get_table_base_name().

Collaborator Author:

Done.

| transforms.Create(known_args.sample_names,
reshuffle=False)
| beam.combiners.ToList())
hash_table = (

Contributor:

Instead of using hash_table in both the if and else statements, we could use clearer names that show the direction of each lookup map.

Collaborator Author:

Done.

sample_names = (p
| transforms.Create(known_args.sample_names,
reshuffle=False))
sample_ids = (sample_names

Contributor:

To highlight the distinction between sample_names pcollection and list, we could perhaps rename this variable to sample_ids_list and the following variable also sample_names_list. Please feel free to come up with a better name than what I propose here, I myself don't like what I suggested :D

Collaborator Author:

Hmm, I don't know either... I thought of consolidated_sample_ids but ended up with just combined_sample_ids. Did the same for _names. Tell me if it's not up to par.

beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()

_ = (sample_names

Contributor:

Hmmm, here we need to pass a PCollection to the ParDo transform. However, at this point the sample_names object is a list. I am wondering how this code didn't fail... or am I missing something here?

Collaborator Author:

It's not a list; it's never a list until the pipeline is running. It's a PCollection with a single element: the list of sample names.

What happens here is that Beam runs the write_vcf_header_with_sample_names method over each element of the above PCollection, which writes the #CHROM... columns and then appends the values in that particular element. Since the PCollection has only one element, only one #CHROM... line will be written, containing all the sample names.
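
As a rough illustration of that last step, here is a pure-Python sketch of what write_vcf_header_with_sample_names might emit for one element. The function body is an assumption; only the _VCF_FIXED_COLUMNS constant is taken from the code under review.

```python
# Sketch: given the single list-of-names element, emit one #CHROM header
# line with the fixed VCF columns followed by all sample names.
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']

def write_vcf_header_with_sample_names(sample_names):
    # In the pipeline this runs once per element of a single-element
    # PCollection, so exactly one header line is produced.
    return '\t'.join(_VCF_FIXED_COLUMNS + list(sample_names)) + '\n'

print(write_vcf_header_with_sample_names(['N001', 'N002']))
```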

@tneymanov (Collaborator Author) left a comment:

Thanks Saman.

Also increased the integration test timeout by 30 minutes, since the last run timed out. It's now at 4:30, which is getting out of hand; maybe something we should discuss...

gcp_variant_transforms/bq_to_vcf.py Show resolved Hide resolved


_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, you are right, removed.

_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
'INFO', 'FORMAT']
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm this change is outside of the scope of this PR. However, great catch - this is a bug of sorts. FORMAT may or may not be supplied - depends on whether samples are present in the VCF file (no samples - no FORMAT). This one specifically needs to be thought about a bit to add FORMAT into the resulting VCF file iff samples are present in the BQ table. I'll add an issue to follow up on this.



_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
'{START_POSITION_ID}>={START_POSITION_VALUE} AND '
'{END_POSITION_ID}<={END_POSITION_VALUE})')
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
'INFO', 'FORMAT']
_VCF_VERSION_LINE = '##fileformat=VCFv4.3\n'
Collaborator Author

Also outside of the scope, but done. Got it from vcf_header_io since it's already imported. Also added '\n'.

@@ -137,7 +146,7 @@ def run(argv=None):
def _write_vcf_meta_info(input_table,
representative_header_file,
allow_incompatible_schema):
# type: (str, str, bool) -> None
# type: (str, str, bool, str) -> None
Collaborator Author

Nice catch, Done.

@@ -164,30 +173,57 @@ def _bigquery_to_vcf_shards(
`vcf_header_file_path`.
"""
schema = _get_schema(known_args.input_table)
# TODO(allieychen): Modify the SQL query with the specified sample_ids.
query = _get_bigquery_query(known_args, schema)
query = _get_variant_query(known_args, schema)
Collaborator Author

Done.

annotation_names = _extract_annotation_names(schema)

base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
Collaborator Author

Done.

| transforms.Create(known_args.sample_names,
reshuffle=False)
| beam.combiners.ToList())
hash_table = (
Collaborator Author

Done.

sample_names = (p
| transforms.Create(known_args.sample_names,
reshuffle=False))
sample_ids = (sample_names
Collaborator Author

hmm, I don't know either... I thought of consolidated_sample_ids but ended up with just combined_sample_ids. Did the same for _names. Tell me if it's not up to par.

beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()

_ = (sample_names
Collaborator Author

It's not a list - it never becomes a list until the pipeline is running. It's a PCollection with a single element: the list of sample names.

What happens here is that Beam runs the write_vcf_header_with_sample_names method over each element of the above PCollection; it writes the fixed '#CHROM...' columns and then appends the values in that element. Since the PCollection has only one element, only one '#CHROM...' line will be written, containing all the sample names.
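In plain Python the single-element behavior can be sketched like this (simulating the PCollection as a list of elements - this is an illustration of the concept, not actual Beam code, and the column layout is assumed from the VCF spec):

```python
# A "PCollection" simulated as a list. It holds ONE element: the
# full list of sample names.
pcoll = [['sample_a', 'sample_b', 'sample_c']]

def write_vcf_header_with_sample_names(sample_names):
    # Write the fixed columns, then append the sample names on the
    # same line - mirroring the behavior described above.
    fixed = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
             'INFO', 'FORMAT']
    return '\t'.join(fixed + sample_names)

# Map runs the function once per element; with a single element,
# exactly one '#CHROM...' line is produced.
header_lines = [write_vcf_header_with_sample_names(e) for e in pcoll]
```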

@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch 2 times, most recently from 8066aa6 to 14c56c2 Compare April 20, 2020 14:51
Contributor

@samanvp samanvp left a comment

A couple more comments, mostly just naming nits.

gcp_variant_transforms/bq_to_vcf.py
Comment on lines 21 to 23
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
Contributor

These 3 const values somehow should be tied to the schema file for the sample_info table in #577

Collaborator Author

Done, but removed FILE_PATH_COLUMN - not needed anymore.



class SampleIdToNameDict(beam.PTransform):
"""Transforms BigQuery table rows to PCollection of `Variant`."""
Contributor

This line needs to be updated.

Collaborator Author

Done.

class SampleIdToNameDict(beam.PTransform):
"""Transforms BigQuery table rows to PCollection of `Variant`."""

def _convert_bq_row(self, row):
Contributor

Can we use a better name? For example _extract_id_name?

Collaborator Author

Done.


def expand(self, pcoll):
return (pcoll
| 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Contributor

Again, we might use a clearer name than BigQueryToMapping. How about ExtractIdNameTuples?

Collaborator Author

Done.

| 'CombineToDict' >> beam.combiners.ToDict())

class GetSampleNames(beam.PTransform):
"""Transforms sample_ids to sample_names"""
Contributor

I feel a bit uneasy about "Transforms", how about "Looks up sample_names corresponding to the given sample_ids"? or something along those lines.

Collaborator Author

Done.

"""Transforms sample_ids to sample_names"""

def __init__(self, hash_table):
# type: (Dict[int, Tuple(str, str)]) -> None
Contributor

The type of this Dict is int -> str (I think the Tuple is an old artifact)?

Collaborator Author

Right, nice catch. Done.

return pcoll | beam.Map(self._get_sample_id, self._hash_table)

class GetSampleIds(beam.PTransform):
"""Transform sample_names to sample_ids"""
Contributor

Please make all the updates similar to the previous class.

Collaborator Author

Done.

SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
WITH_FILE_SAMPLE_TEMPLATE = "{FILE_PATH}/{SAMPLE_NAME}"
Contributor

This const is not used at all.

Collaborator Author

Done.

@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch from acfd139 to 68f4109 Compare April 30, 2020 09:25
Collaborator Author

@tneymanov tneymanov left a comment

Apologies for so many silly type/naming mistakes - went through many iterations.

SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
WITH_FILE_SAMPLE_TEMPLATE = "{FILE_PATH}/{SAMPLE_NAME}"
Collaborator Author

Done.



class SampleIdToNameDict(beam.PTransform):
"""Transforms BigQuery table rows to PCollection of `Variant`."""
Collaborator Author

Done.

class SampleIdToNameDict(beam.PTransform):
"""Transforms BigQuery table rows to PCollection of `Variant`."""

def _convert_bq_row(self, row):
Collaborator Author

Done.


def expand(self, pcoll):
return (pcoll
| 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Collaborator Author

Done.

class SampleNameToIdDict(beam.PTransform):
"""Transforms BigQuery table rows to PCollection of `Variant`."""

def _convert_bq_row(self, row):
Collaborator Author

Done.

# type: (Dict[int, Tuple(str, str)]) -> None
self._hash_table = hash_table

def _get_sample_id(self, sample_id, hash_table):
Collaborator Author

Yeah, Done.


def __init__(self, hash_table):
# type: (Dict[int, Tuple(str, str)]) -> None
self._hash_table = hash_table
Collaborator Author

Done.


def expand(self, pcoll):
return (pcoll
| 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Collaborator Author

Done.

Comment on lines 21 to 23
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
Collaborator Author

Done, but removed FILE_PATH_COLUMN - not needed anymore.

@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch from 68f4109 to 4b732c7 Compare April 30, 2020 10:17
Contributor

@samanvp samanvp left a comment

Thanks Tural, things look much better now. Please address the latest comments before merging.

raise ValueError('Sample ID `{}` was not found.'.format(sample_id))

def expand(self, pcoll):
return pcoll | beam.Map(self._get_sample_name, self._id_to_name_dict)
Contributor

nit: This operation does not have a name. I am not sure what will be shown in the Dataflow diagram; if adding a name makes that diagram clearer, let's add one.

Collaborator Author

It will just show up as Map - sometimes it's necessary to add custom names, e.g. if two steps in a single flow have the same default name.

But done nonetheless.

raise ValueError('Sample `{}` was not found.'.format(sample_name))

def expand(self, pcoll):
return pcoll | beam.Map(self._get_sample_id, self._name_to_id_dict)
Contributor

Similarly here.

Collaborator Author

Done.

@@ -56,4 +56,4 @@ steps:
# - '--gs_dir bashir-variant_integration_test_runs'
images:
- 'gcr.io/${PROJECT_ID}/gcp-variant-transforms:${COMMIT_SHA}'
timeout: 240m
timeout: 270m
Contributor

I am just curious: what causes the longer test times?

Collaborator Author

Same... not really sure. It took 3h to finish on gcp-test, but timed out on my own project at 4.5h. It's unclear to me why this is the case, because my quota should be identical... Maybe I ran two integration tests simultaneously and there weren't enough workers? Don't know.

_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
'INFO', 'FORMAT']
Contributor

Please update this comment with the issue so we can refer back to this PR later.

Comment on lines 186 to 188
sample_query = _SAMPLE_INFO_QUERY_TEMPLATE.format(PROJECT_ID=project_id,
DATASET_ID=dataset_id,
BASE_TABLE_ID=base_table_id)
Contributor

Here we assume the sample info table follows our expected naming convention: BASE_TABLE_ID + TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX.
What happens if this is not the case? My guess is:

  • We will build a dict (either id_to_name or name_to_id) which is empty.
  • On the first lookup we will fail and raise an exception about a missing sample_name or sample_id.

Is this the case? If yes, then it will be quite confusing for the user to figure out that the real cause is the missing sample info table. Is there a better way we could handle a missing table (or even an empty sample info table)?

Collaborator Author

No, we will fail at the variant_transform_options stage, as we require these tables to exist. If the table exists but is empty, there is nothing we can do about it, I think, since we cannot know what rows should have been in it.
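For reference, a minimal sketch of the naming-convention helper being discussed (the real `bigquery_util.compose_table_name` lives in the repo; the separator and suffix values below are assumptions for illustration only):

```python
# Assumed values; the real constants are defined in bigquery_util.
TABLE_SUFFIX_SEPARATOR = '__'
SAMPLE_INFO_TABLE_SUFFIX = 'sample_info'

def compose_table_name(base_table_id, suffix):
    # Mirrors the convention under discussion:
    # BASE_TABLE_ID + TABLE_SUFFIX_SEPARATOR + SUFFIX.
    return base_table_id + TABLE_SUFFIX_SEPARATOR + suffix

sample_table_id = compose_table_name('my_table', SAMPLE_INFO_TABLE_SUFFIX)
```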

Comment on lines +220 to +219
| 'CombineToList' >> beam.combiners.ToList()
| 'SortSampleNames' >> beam.ParDo(sorted))
Contributor

hmm, I just realized this: doesn't all this logic require that sample_names and sample_ids have the exact same order?
Doesn't reordering one (here, sorting sample_names) without modifying the other make our output wrong?

Contributor

I think I found the answer:
sample_ids and sample_names are temporary, and what matters is the content of combined_sample_ids and combined_sample_names.
If that's the case then let's do this:

  • Rename variables to state this fact, for example `sample_ids` -> `temp_sample_ids`, and similarly for `sample_names`. And then `combined_sample_ids` -> `sample_ids`. This way we indicate the temporary state of those two variables.
  • Line 220: that `ToList()` operation is not needed yet (I think)?

Contributor

Another follow-up question:
Even if we sort sample_names in line 223, when we process it in the next stage, doesn't it get accessed in random order due to Beam's processing paradigm?

Collaborator Author

  1. Done renaming.
  2. ToList() is required for the sort operation - the values need to be combined into one list so that sorted() can run over it; the results are then emitted back into a PCollection.
  3. No, I think randomization only happens when we create a PCollection without reshuffle=False.
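The ToList-then-sort pattern can be sketched in plain Python (a PCollection simulated as a list of elements - an illustration of the concept, not actual Beam code):

```python
# Element-wise stages see one element at a time, so a sort needs a
# global view: first combine everything into a single-element
# collection holding one list, then sort that list.
pcoll = ['NA3', 'NA1', 'NA2']                # one sample name per element

to_list = [list(pcoll)]                      # ToList() analogue: one element, a list
sorted_names = [sorted(e) for e in to_list]  # 'SortSampleNames' analogue
```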

Comment on lines 199 to 201
name_to_id_hash_table = (
sample_table_rows
| 'SampleNameToIdDict' >> sample_mapping_table.SampleNameToIdDict())
Contributor

Let's move this to after if statement, exactly, right before we use it.

Collaborator Author

Done.

TABLE_SUFFIX_SEPARATOR))
base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
sample_table_id = (
base_table_id + TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX)
Contributor

Replace with bigquery_util.compose_table_name(base_table_id, SAMPLE_INFO_TABLE_SUFFIX)

Collaborator Author

Done.

_SAMPLE_INFO_QUERY_TEMPLATE = (
'SELECT sample_id, sample_name, file_path '
'FROM `{PROJECT_ID}.{DATASET_ID}.{BASE_TABLE_ID}' +
TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX + '`')
Contributor

I don't like the fact that we are duplicating the logic of bigquery_util.compose_table_name(). Can we avoid this duplication?

Collaborator Author

Done.

def dict_values_equal(expected_dict):
"""Verifies that dictionary is the same as expected."""
def _items_equal(actual_dict):
actual = actual_dict[0]
Contributor

I don't understand why this line is needed, let's discuss offline.

Collaborator Author

Once the pipeline finishes, the result here is combined into a single dict. However, that dict is still inside a PCollection, which consists of this single element. When materialized into a regular data structure, it becomes a list with a single item - the dict.

I'm down to discuss on VC.
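In plain Python, the situation described above looks like this (a hedged illustration of the `dict_values_equal` helper's unwrapping step, not the exact test code from the repo):

```python
# The pipeline's final combiner yields a PCollection containing one
# element: the dict itself. Materialized for assertions, that is a
# list with a single item.
materialized = [{'NA1': 1, 'NA2': 2}]   # PCollection -> list of one dict

def dict_values_equal(expected_dict):
    """Returns a matcher that unwraps the single element and compares."""
    def _items_equal(actual_dicts):
        actual = actual_dicts[0]        # unwrap the single element
        assert actual == expected_dict, actual
    return _items_equal

dict_values_equal({'NA1': 1, 'NA2': 2})(materialized)  # passes silently
```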

@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch 3 times, most recently from 050e187 to 0267802 Compare May 4, 2020 23:19
@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch from 0267802 to abe6d9b Compare May 5, 2020 07:34
tneymanov added 8 commits May 7, 2020 20:44
  - Add BQ tests back with modified files.
  - Adjust indentations between tables (__chr -> ___chr).
  - Make sure to generate <SAMPLE_NAME>_<INDEX> for WITH_FILE_PATH encoding as per Sync meeting.
  - Adjust --sample_names flag to handle WITH_FILE_PATH encoding as per Sync meeting.
  - Rename --genomic_region back to --genomic_regions while still forcing 1 and only 1 value as per Sync meeting.
@tneymanov tneymanov force-pushed the bq_to_vcf_sample_ids branch from abe6d9b to 62d6be0 Compare May 8, 2020 00:44
@tneymanov tneymanov merged commit 7f9136b into googlegenomics:master May 8, 2020
tneymanov added a commit to tneymanov/gcp-variant-transforms that referenced this pull request Jun 1, 2020