Bq to vcf sample ids #557
Conversation
Force-pushed from 19bf9bd to 39d9ef6.
Added a second commit which applies the following changes:
Sorry for the slow review, I will send more comments tomorrow morning.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
_FULL_INPUT_TABLE = '{TABLE}___{SUFFIX}'
We need to use TABLE_SUFFIX_SEPARATOR here.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info`')
Instead of {PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info we had better use {FULL_TABLE_ID} and construct FULL_TABLE_ID outside of the query, similar to what we did in the AVRO PR: https://github.com/googlegenomics/gcp-variant-transforms/pull/558/files#diff-7b9491fbb5998f2c837ddabb9582a5ba
Done, somewhat. I'd rather import base_table_id and construct the rest here. PTAL.
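For illustration only, a minimal sketch of the approach under discussion (building the sample-info table name outside the query string and formatting it in); the helper name and the plain '__sample_info' concatenation are assumptions, not the PR's exact code, which uses the shared separator/suffix constants:

_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{FULL_TABLE_ID}`')


def build_sample_info_query(project_id, dataset_id, base_table_id):
  # Hypothetical helper: compose the sample-info table name from the base
  # table id outside of the query template, instead of hard-coding the
  # suffix inside the template itself.
  full_table_id = base_table_id + '__sample_info'
  return _SAMPLE_INFO_QUERY_TEMPLATE.format(
      PROJECT_ID=project_id, DATASET_ID=dataset_id, FULL_TABLE_ID=full_table_id)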
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -117,7 +122,8 @@ def run(argv=None):
      '{}_meta_info.vcf'.format(unique_temp_id))
  _write_vcf_meta_info(known_args.input_table,
                       known_args.representative_header_file,
                       known_args.allow_incompatible_schema)
                       known_args.allow_incompatible_schema,
                       known_args.genomic_regions)
We are passing known_args.genomic_regions all the way down to _get_schema() only to use the suffix to identify the source table. Instead of known_args.input_table and known_args.genomic_regions we can "assemble" full_table_id and pass it down. This will also match my earlier comment about using FULL_TABLE_ID in _SAMPLE_INFO_QUERY_TEMPLATE.
As per offline discussions, removed the dependency on genomic_regions.
TABLE_SUFFIX_SEPARATOR = '___'
SAMPLE_TABLE_SUFFIX_SEPARATOR = '__'
Can we switch these two? I know we discussed this offline, but since we are releasing gnomAD with __chr* suffixes it makes sense to follow that standard here. Alternatively, we should rename those tables before we publish them. Let's discuss this offline...
As per offline discussions, I am going to remove SAMPLE_TABLE_SUFFIX_SEPARATOR in a follow-up PR.
help=('File containing list of shards and output table names. You '
      'can use provided default sharding_config file to split output '
      'by chromosome (one table per chromosome) which is located at: '
      'gcp_variant_transforms/data/sharding_configs/'
      'homo_sapiens_default.yaml'))
Did you update this comment, or is this a sync issue?
help=('A genomic regions (separated by a space) to load from BigQuery. '
      'The format of the genomic region should be '
      'REFERENCE_NAME:START_POSITION-END_POSITION or REFERENCE_NAME if '
      'the full chromosome is requested. Only variants matching at '
      'this region will be loaded. The chromosome identifier should be '
      'identical to the one provided in config file when the tables '
      'were being created. For example, '
      '`--genomic_regions chr2:1000-2000` will load all variants '
      '`chr2` with `start_position` in `[1000,2000)` from BigQuery. '
      'If the table with suffix `my_chrom3` was imported, '
      '`--genomic_regions my_chrom3` would return all the variants in '
      'that shard. This flag must be specified to indicate the table '
      'shard that needs to be exported to VCF file. NOTE:At the moment '
      'one and only one genomic region must be supplied.'))
Note that in sharding we might have an output table that contains variants of multiple chromosomes; for example, look at this (line 20 in 8daf1a7):
table_name_suffix: "chr04_05"
I think we should keep these two matters independent from one another:
- input_table must point to an actual table, meaning that the suffix must be included.
- genomic_regions only relates to the filter we apply to the reference_name column; this has nothing to do with the table suffix.
Referring back to the linked config above, we can have --input_table {BASE_TABLE_NAME}__chr04_05 and --genomic_regions chr4 chr3:1:1000. As you can see, that table wouldn't have any variants with reference_name equal to 'chr3' (because we know that config file), but there is no way we could infer this when we run bq_to_vcf. The user is responsible for providing correct inputs.
Done.
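To make the filter discussion above concrete, here is a minimal, hypothetical sketch (not the PR's actual parsing code) of how a --genomic_regions value could be formatted into a WHERE-clause fragment using the _GENOMIC_REGION_TEMPLATE shown in the earlier diff:

_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')


def region_to_filter(region, max_position=2 ** 31 - 1):
  # Hypothetical helper: 'chr2:1000-2000' -> a reference_name/position filter.
  reference_name, _, span = region.partition(':')
  start, _, end = span.partition('-')
  return _GENOMIC_REGION_TEMPLATE.format(
      REFERENCE_NAME_ID='reference_name',
      REFERENCE_NAME_VALUE=reference_name,
      START_POSITION_ID='start_position',
      START_POSITION_VALUE=start or 0,
      END_POSITION_ID='end_position',
      END_POSITION_VALUE=end or max_position)

# region_to_filter('chr2:1000-2000') ->
#   '(reference_name="chr2" AND start_position>=1000 AND end_position<=2000)'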
@@ -347,7 +347,7 @@ def create_output_table(full_table_id, total_base_pairs, schema_file_path):
    the worker that monitors the Dataflow job.

  Args:
    full_table_id: for example: projet:dataset.table_base_name__chr1
    full_table_id: for example: projet:dataset.table_base_name___chr1
If we agree to switch the suffixes as I requested, this change wouldn't be needed.
A few more comments. I also submitted #565; I think we should definitely fix that issue.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_mapping_table.GetSampleNames(
    beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()
I don't see why this is needed?
sample_ids in DensifyVariants needs to be run with the ToList combiner (i.e. to make it a single value of a List). But before I get there, when creating sample_names, I need to have it as a pure PCollection.
So I removed the logic that makes it a list when creating sample_ids; we then create sample_names, and then convert sample_ids to a list as it was before this PR.
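A minimal, runnable sketch of the ordering described in this reply, with made-up data and illustrative step names (this is not the PR's code): derive sample_names from the plain sample_ids PCollection first, and only then combine sample_ids into a single-element list.

import apache_beam as beam

with beam.Pipeline() as p:
  id_to_name = p | 'CreateDict' >> beam.Create([{1: 'NA001', 2: 'NA002'}])
  sample_ids = p | 'CreateIds' >> beam.Create([1, 2])
  # sample_ids is still a plain PCollection here, so it can be mapped to names.
  sample_names = (
      sample_ids
      | 'GetSampleNames' >> beam.Map(
          lambda sid, mapping: mapping[sid],
          beam.pvalue.AsSingleton(id_to_name))
      | 'CombineSampleNames' >> beam.combiners.ToList())
  # Only now is sample_ids combined into a single-element list, the shape the
  # downstream consumer (DensifyVariants in the PR) is said to expect.
  sample_ids_list = sample_ids | 'CombineSampleIds' >> beam.combiners.ToList()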
gcp_variant_transforms/bq_to_vcf.py
Outdated
# TODO(tneymanov): Add logic to extract sample names from sample IDs by
# joining with sample id-name mapping table, once that code is implemented.
sample_names = sample_ids
hash_table = (
Perhaps rename it to id_to_name_hash_table to make its distinction from the previous hash table, name_to_id_hash_table, clear. Also, consider dropping the hash_table suffix and making both parts plural: ids_to_names and names_to_ids.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
hash_table = (
    sample_table_rows
    | 'SampleIdToNameDict' >> sample_mapping_table.SampleIdToNameDict())
sample_names = (sample_ids
                | 'GetSampleNames' >>
                sample_mapping_table.GetSampleNames(
                    beam.pvalue.AsSingleton(hash_table))
                | 'CombineSampleNames' >> beam.combiners.ToList())
This should be moved to the else statement.
No, unfortunately not. This is because we have to accommodate the case where the user gives sample names but the table was created with the WITH_FILE_PATH option. We need to extract sample IDs, which may or may not have the same count as the initial sample_name values (if we had more than one of each sample name, but from different files). Then, no matter how we got the sample_ids (i.e. directly from the sample table when --sample_names was not invoked, or by mapping from that flag if it was), we convert them into sample names.
Unfortunately, this adds an additional run through the sample IDs when the --sample_names flag was invoked, but we kind of have to do it if we want the functionality that Aaron requested (i.e., for WITH_FILE_PATH, export N001_1, N001_2, N001_3...).
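As an illustration only (this is not the PR's implementation), the export naming described above for WITH_FILE_PATH tables could look like this: when the same sample name appears in multiple files, append a running index so each exported column stays unique.

from collections import Counter


def disambiguate_sample_names(sample_names):
  # Hypothetical helper: ['N001', 'N001', 'N001'] -> ['N001_1', 'N001_2', 'N001_3']
  counts = Counter()
  result = []
  for name in sample_names:
    counts[name] += 1
    result.append('{}_{}'.format(name, counts[name]))
  return result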
Force-pushed from fdcb611 to 16122fd.
Redoing integration testing, but ready for review.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
_FULL_INPUT_TABLE = '{TABLE}___{SUFFIX}'
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}__sample_info`')
Done, somewhat. I'd rather import base_table_id and construct the rest here. PTAL.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -117,7 +122,8 @@ def run(argv=None):
      '{}_meta_info.vcf'.format(unique_temp_id))
  _write_vcf_meta_info(known_args.input_table,
                       known_args.representative_header_file,
                       known_args.allow_incompatible_schema)
                       known_args.allow_incompatible_schema,
                       known_args.genomic_regions)
As per offline discussions, removed the dependency on genomic_regions.
gcp_variant_transforms/bq_to_vcf.py
Outdated
# TODO(tneymanov): Add logic to extract sample names from sample IDs by
# joining with sample id-name mapping table, once that code is implemented.
sample_names = sample_ids
hash_table = (
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
hash_table = (
    sample_table_rows
    | 'SampleIdToNameDict' >> sample_mapping_table.SampleIdToNameDict())
sample_names = (sample_ids
                | 'GetSampleNames' >>
                sample_mapping_table.GetSampleNames(
                    beam.pvalue.AsSingleton(hash_table))
                | 'CombineSampleNames' >> beam.combiners.ToList())
No, unfortunately not. This is because we have to accommodate the case where the user gives sample names but the table was created with the WITH_FILE_PATH option. We need to extract sample IDs, which may or may not have the same count as the initial sample_name values (if we had more than one of each sample name, but from different files). Then, no matter how we got the sample_ids (i.e. directly from the sample table when --sample_names was not invoked, or by mapping from that flag if it was), we convert them into sample names.
Unfortunately, this adds an additional run through the sample IDs when the --sample_names flag was invoked, but we kind of have to do it if we want the functionality that Aaron requested (i.e., for WITH_FILE_PATH, export N001_1, N001_2, N001_3...).
TABLE_SUFFIX_SEPARATOR = '___'
SAMPLE_TABLE_SUFFIX_SEPARATOR = '__'
As per offline discussions, I am going to remove SAMPLE_TABLE_SUFFIX_SEPARATOR in a follow-up PR.
help=('A genomic regions (separated by a space) to load from BigQuery. '
      'The format of the genomic region should be '
      'REFERENCE_NAME:START_POSITION-END_POSITION or REFERENCE_NAME if '
      'the full chromosome is requested. Only variants matching at '
      'this region will be loaded. The chromosome identifier should be '
      'identical to the one provided in config file when the tables '
      'were being created. For example, '
      '`--genomic_regions chr2:1000-2000` will load all variants '
      '`chr2` with `start_position` in `[1000,2000)` from BigQuery. '
      'If the table with suffix `my_chrom3` was imported, '
      '`--genomic_regions my_chrom3` would return all the variants in '
      'that shard. This flag must be specified to indicate the table '
      'shard that needs to be exported to VCF file. NOTE:At the moment '
      'one and only one genomic region must be supplied.'))
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_mapping_table.GetSampleNames(
    beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()
sample_ids in DensifyVariants needs to be run with the ToList combiner (i.e. to make it a single value of a List). But before I get there, when creating sample_names, I need to have it as a pure PCollection.
So I removed the logic that makes it a list when creating sample_ids; we then create sample_names, and then convert sample_ids to a list as it was before this PR.
Force-pushed from 16122fd to db93ef1.
Please take a look at my reply to your comments in bq_to_vcf.py. More comments coming soon...
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
Let's rename this to something more meaningful (for example, variant_query?) to highlight the difference between it and sample_query. Similarly, let's rename bq_source to bq_variant_source.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
Isn't this TODO already fulfilled?
Hmm I guess. Removed.
bigquery_util.raise_error_if_dataset_not_exists(client, project_id,
                                                dataset_id)
I don't think we need this check; checking the existence of the tables covers it as well.
Done.
if not bigquery_util.table_exist(client, project_id, dataset_id, table_id):
  raise ValueError('Table {}:{}.{} does not exist.'.format(
      project_id, dataset_id, table_id))
nit: I'd move this higher; basically this order is more natural to me:
- input table exists
- input table follows base_name___suffix
- input sample table exists
Done.
Force-pushed from 5963315 to 79bca93.
Thanks Saman, synced, addressed the comments and adjusted integration tests. Will launch them now.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
Hmm I guess. Removed.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -167,27 +177,52 @@ def _bigquery_to_vcf_shards(
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
Done.
bigquery_util.raise_error_if_dataset_not_exists(client, project_id,
                                                dataset_id)
Done.
if not bigquery_util.table_exist(client, project_id, dataset_id, table_id):
  raise ValueError('Table {}:{}.{} does not exist.'.format(
      project_id, dataset_id, table_id))
Done.
Force-pushed from 79bca93 to bb8baef.
Force-pushed from bb8baef to d33ae65.
This review only covers the bq_to_vcf.py module. I will review the rest of the PR later this morning.
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
Instead of defining a new const, can't we reuse vcf_parser.LAST_HEADER_LINE_PREFIX? I see here we have 'FORMAT' while it's missing from the other constant; I am not entirely sure why.
Umm, this change is outside the scope of this PR. However, great catch - this is a bug of sorts. FORMAT may or may not be supplied, depending on whether samples are present in the VCF file (no samples, no FORMAT). This one specifically needs a bit of thought so that FORMAT is added to the resulting VCF file iff samples are present in the BQ table. I'll add an issue to follow up on this.
Please update this comment with the issue so we can refer back to this PR later.
Done #592
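A small sketch of the follow-up idea described above; the constant and helper names here are assumptions for illustration, not the fix that was actually filed: emit FORMAT only when the table has samples.

_VCF_FIXED_COLUMNS_NO_FORMAT = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL',
                                'FILTER', 'INFO']


def vcf_header_columns(sample_names):
  # FORMAT (and the per-sample columns) only make sense if samples exist.
  if sample_names:
    return _VCF_FIXED_COLUMNS_NO_FORMAT + ['FORMAT'] + list(sample_names)
  return list(_VCF_FIXED_COLUMNS_NO_FORMAT)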
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
_VCF_VERSION_LINE = '##fileformat=VCFv4.3\n'
Similarly, can't we instead use vcf_parser.FILE_FORMAT_HEADER_TEMPLATE?
Also outside of the scope, but done. Got it from vcf_header_io since it's already imported. Also added '\n'.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -137,7 +146,7 @@ def run(argv=None):
def _write_vcf_meta_info(input_table,
                         representative_header_file,
                         allow_incompatible_schema):
  # type: (str, str, bool) -> None
  # type: (str, str, bool, str) -> None
Remove the extra ", str".
Nice catch, Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -164,30 +173,57 @@ def _bigquery_to_vcf_shards(
    `vcf_header_file_path`.
  """
  schema = _get_schema(known_args.input_table)
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
  query = _get_variant_query(known_args, schema)
Rename to variant_query?
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
This line duplicates the logic of compose_table_name(); I think we should use that function instead. Also, I don't see this const used anywhere in this module.
Oops, you are right, removed.
gcp_variant_transforms/bq_to_vcf.py
Outdated
annotation_names = _extract_annotation_names(schema)

base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
We now have a function for doing this: get_table_base_name().
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
| transforms.Create(known_args.sample_names,
                    reshuffle=False)
| beam.combiners.ToList())
hash_table = (
Instead of using hash_table in both the if and else statements, we could use a clearer name that shows the direction of the lookup map.
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_names = (p
                | transforms.Create(known_args.sample_names,
                                    reshuffle=False))
sample_ids = (sample_names
To highlight the distinction between the sample_names PCollection and the list, we could perhaps rename this variable to sample_ids_list and the following variable to sample_names_list. Please feel free to come up with a better name than what I propose here; I don't like what I suggested myself :D
hmm I don't know either... thought of consolidated_sample_ids but ended up with just combined_sample_ids. Did the same for _names. Tell me if not up to par.
beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()

_ = (sample_names
Hmmm, here we need to pass a PCollection to the ParDo transform. However, at this point the sample_names object is a list. I am wondering how this code didn't fail... or am I missing something here?!
It's not a list; it's never a list until the pipeline is running - it's a PCollection of a single element, the list of sample names.
What happens here is that, for each element in the above PCollection, Beam runs the write_vcf_header_with_sample_names method, which writes the #CHROM... columns and then appends the values in that particular element. Since the PCollection only has one element, only one #CHROM... line will be written, containing all the sample names.
Thanks Saman.
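For readers unfamiliar with the pattern, a minimal runnable sketch of the behaviour described in the exchange above (names and data are illustrative, not the PR's code): the PCollection holds one element, the full list of sample names, so exactly one #CHROM line is produced.

import apache_beam as beam

_FIXED = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO',
          'FORMAT']


def build_header_line(sample_names):
  # In the real pipeline the equivalent step writes this line to a file.
  return '\t'.join(_FIXED + list(sample_names)) + '\n'


with beam.Pipeline() as p:
  sample_names = p | beam.Create([['NA001', 'NA002']])  # one element: a list
  _ = sample_names | 'WriteVcfDataHeader' >> beam.Map(build_header_line)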
Also increased the integration test timeout by 30 minutes, since the last run timed out. Now it's 4:30, which kind of gets out of hand - maybe something we should discuss...
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
Oops, you are right, removed.
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
Umm, this change is outside the scope of this PR. However, great catch - this is a bug of sorts. FORMAT may or may not be supplied, depending on whether samples are present in the VCF file (no samples, no FORMAT). This one specifically needs a bit of thought so that FORMAT is added to the resulting VCF file iff samples are present in the BQ table. I'll add an issue to follow up on this.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_BASE_QUERY_TEMPLATE = 'SELECT {COLUMNS} FROM `{INPUT_TABLE}`'
_BQ_TO_VCF_SHARDS_JOB_NAME = 'bq-to-vcf-shards'
_COMMAND_LINE_OPTIONS = [variant_transform_options.BigQueryToVcfOptions]
TABLE_SUFFIX_SEPARATOR = bigquery_util.TABLE_SUFFIX_SEPARATOR
SAMPLE_INFO_TABLE_SUFFIX = bigquery_util.SAMPLE_INFO_TABLE_SUFFIX
_FULL_INPUT_TABLE = '{TABLE}' + TABLE_SUFFIX_SEPARATOR + '{SUFFIX}'
_GENOMIC_REGION_TEMPLATE = ('({REFERENCE_NAME_ID}="{REFERENCE_NAME_VALUE}" AND '
                            '{START_POSITION_ID}>={START_POSITION_VALUE} AND '
                            '{END_POSITION_ID}<={END_POSITION_VALUE})')
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
_VCF_VERSION_LINE = '##fileformat=VCFv4.3\n'
Also outside of the scope, but done. Got it from vcf_header_io since it's already imported. Also added '\n'.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -137,7 +146,7 @@ def run(argv=None):
def _write_vcf_meta_info(input_table,
                         representative_header_file,
                         allow_incompatible_schema):
  # type: (str, str, bool) -> None
  # type: (str, str, bool, str) -> None
Nice catch, Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
@@ -164,30 +173,57 @@ def _bigquery_to_vcf_shards(
    `vcf_header_file_path`.
  """
  schema = _get_schema(known_args.input_table)
  # TODO(allieychen): Modify the SQL query with the specified sample_ids.
  query = _get_bigquery_query(known_args, schema)
  query = _get_variant_query(known_args, schema)
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
annotation_names = _extract_annotation_names(schema)

base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
| transforms.Create(known_args.sample_names,
                    reshuffle=False)
| beam.combiners.ToList())
hash_table = (
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_names = (p
                | transforms.Create(known_args.sample_names,
                                    reshuffle=False))
sample_ids = (sample_names
hmm I don't know either... thought of consolidated_sample_ids but ended up with just combined_sample_ids. Did the same for _names. Tell me if not up to par.
beam.pvalue.AsSingleton(hash_table))
| 'CombineSampleNames' >> beam.combiners.ToList())
sample_ids = sample_ids | beam.combiners.ToList()

_ = (sample_names
It's not a list; it's never a list until the pipeline is running - it's a PCollection of a single element, the list of sample names.
What happens here is that, for each element in the above PCollection, Beam runs the write_vcf_header_with_sample_names method, which writes the #CHROM... columns and then appends the values in that particular element. Since the PCollection only has one element, only one #CHROM... line will be written, containing all the sample names.
Force-pushed from 8066aa6 to 14c56c2.
A couple more comments, mostly just naming nits.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
These 3 const values should somehow be tied to the schema file for the sample_info table in #577.
Done, but removed FILE_PATH_COLUMN - not needed anymore.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""
This line needs to be updated.
Done.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""

  def _convert_bq_row(self, row):
Can we use a better name? For example, _extract_id_name?
Done.
def expand(self, pcoll):
  return (pcoll
          | 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Again, we might use a clearer name than BigQueryToMapping. How about ExtractIdNameTuples?
Done.
          | 'CombineToDict' >> beam.combiners.ToDict())


class GetSampleNames(beam.PTransform):
  """Transforms sample_ids to sample_names"""
I feel a bit uneasy about "Transforms"; how about "Looks up sample_names corresponding to the given sample_ids", or something along those lines?
Done.
"""Transforms sample_ids to sample_names""" | ||
|
||
def __init__(self, hash_table): | ||
# type: (Dict[int, Tuple(str, str)]) -> None |
The type of this Dict is int -> str (I think the Tuple is an old artifact)?
Right, nice catch. Done.
return pcoll | beam.Map(self._get_sample_id, self._hash_table)


class GetSampleIds(beam.PTransform):
  """Transform sample_names to sample_ids"""
Please make all the updates here similar to those in the previous class.
Done.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
WITH_FILE_SAMPLE_TEMPLATE = "{FILE_PATH}/{SAMPLE_NAME}"
This const is not used at all.
Done.
Force-pushed from acfd139 to 68f4109.
Apologies for so many silly type/naming mistakes - went through many iterations.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
WITH_FILE_SAMPLE_TEMPLATE = "{FILE_PATH}/{SAMPLE_NAME}"
Done.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""
Done.
class SampleIdToNameDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""

  def _convert_bq_row(self, row):
Done.
def expand(self, pcoll):
  return (pcoll
          | 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Done.
class SampleNameToIdDict(beam.PTransform):
  """Transforms BigQuery table rows to PCollection of `Variant`."""

  def _convert_bq_row(self, row):
Done.
# type: (Dict[int, Tuple(str, str)]) -> None
self._hash_table = hash_table

def _get_sample_id(self, sample_id, hash_table):
Yeah, Done.
def __init__(self, hash_table):
  # type: (Dict[int, Tuple(str, str)]) -> None
  self._hash_table = hash_table
Done.
def expand(self, pcoll):
  return (pcoll
          | 'BigQueryToMapping' >> beam.Map(self._convert_bq_row)
Done.
SAMPLE_ID_COLUMN = 'sample_id'
SAMPLE_NAME_COLUMN = 'sample_name'
FILE_PATH_COLUMN = 'file_path'
Done, but removed FILE_PATH_COLUMN - not needed anymore.
Force-pushed from 68f4109 to 4b732c7.
Thanks Tural, things look much better now. Please address the latest comments before merging.
raise ValueError('Sample ID `{}` was not found.'.format(sample_id))

def expand(self, pcoll):
  return pcoll | beam.Map(self._get_sample_name, self._id_to_name_dict)
nit: This operation does not have a name. I am not sure what will be shown in the Dataflow diagram; if adding a name makes that diagram clearer, let's add one.
It will just show up as Map - custom names are sometimes necessary when two steps in a single flow have the same default names. But done nonetheless.
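A tiny illustration of that naming point (an arbitrary example, not from the PR): applying the same callable twice without explicit labels gives both steps the same default label, which Beam rejects or renames with a warning depending on the version, so each application gets its own '>>' name; the names also show up in the Dataflow UI.

import apache_beam as beam


def double(x):
  return x * 2


with beam.Pipeline() as p:
  ids = p | 'CreateIds' >> beam.Create([1, 2, 3])
  # Without 'DoubleOnce'/'DoubleAgain', both Map steps would default to the
  # label Map(double) and collide.
  first = ids | 'DoubleOnce' >> beam.Map(double)
  second = first | 'DoubleAgain' >> beam.Map(double)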
raise ValueError('Sample `{}` was not found.'.format(sample_name))

def expand(self, pcoll):
  return pcoll | beam.Map(self._get_sample_id, self._name_to_id_dict)
Similarly here.
Done.
@@ -56,4 +56,4 @@ steps:
# - '--gs_dir bashir-variant_integration_test_runs'
images:
- 'gcr.io/${PROJECT_ID}/gcp-variant-transforms:${COMMIT_SHA}'
timeout: 240m
timeout: 270m
I am just curious: what causes the longer test times?
Same... not really sure. It took 3h to finish on gcp-test, but timed out on my own project after 4.5h. It's unclear to me why, because my quota should be identical... Maybe I ran two integration tests simultaneously and there weren't enough workers? I don't know.
_VCF_FIXED_COLUMNS = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER',
                      'INFO', 'FORMAT']
Please update this comment with the issue so we can refer back to this PR later.
gcp_variant_transforms/bq_to_vcf.py
Outdated
sample_query = _SAMPLE_INFO_QUERY_TEMPLATE.format(PROJECT_ID=project_id,
                                                  DATASET_ID=dataset_id,
                                                  BASE_TABLE_ID=base_table_id)
Here we assume the sample info table will follow our expected naming convention: BASE_TABLE_ID + TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX.
What happens if this is not the case? My guess is this:
- We will make a dict (either id_to_name or name_to_id) which is empty.
- On the first lookup we will fail and raise an exception about a missing sample_name or sample_id.
Is this the case?
If yes, then it will be kind of confusing for the user to find out that the real reason is the missing sample info table. Is there a better way we could handle a missing (or even empty) sample info table?
No, we will fail at the variant_transform_options stage, as we require these tables to exist. If the table exists but is empty, there is nothing we can do about it, I think, as we cannot know what rows should be in those tables.
| 'CombineToList' >> beam.combiners.ToList()
| 'SortSampleNames' >> beam.ParDo(sorted))
Hmm, I just realized this: doesn't all this logic require that sample_names and sample_ids have the exact same order? Doesn't reordering one (here, sorting sample_names) without modifying the other make our output wrong?
I think I found the answer: sample_ids and sample_names are temporary, and what matters is the content of combined_sample_ids and combined_sample_names.
If that's the case then let's do this:
- Rename the variables to state this fact, for example sample_ids -> temp_sample_ids, and similarly for sample_names. And then combined_sample_ids -> sample_ids. This way we indicate the temporary state of those two variables.
- Line 220: that ToList() operation is not needed yet (I think)?
Another follow-up question: even if we sort sample_names in line 223, when we process it in the next stage doesn't it get accessed randomly due to Beam's processing paradigm?
- Done renaming.
- ToList() is required for the sort operation: values need to be combined into one list so we can run sorted() over it, the results of which are separated back into a PCollection (see the sketch below).
- No, I think randomization only happens when we create a PCollection without reshuffle=false.
Outdated
name_to_id_hash_table = (
    sample_table_rows
    | 'SampleNameToIdDict' >> sample_mapping_table.SampleNameToIdDict())
Let's move this to after the if statement, right before we use it.
Done.
TABLE_SUFFIX_SEPARATOR))
base_table_id = table_id[:table_id.find(TABLE_SUFFIX_SEPARATOR)]
sample_table_id = (
    base_table_id + TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX)
Replace with bigquery_util.compose_table_name(base_table_id, SAMPLE_INFO_TABLE_SUFFIX)
Done.
gcp_variant_transforms/bq_to_vcf.py
Outdated
_SAMPLE_INFO_QUERY_TEMPLATE = (
    'SELECT sample_id, sample_name, file_path '
    'FROM `{PROJECT_ID}.{DATASET_ID}.{BASE_TABLE_ID}' +
    TABLE_SUFFIX_SEPARATOR + SAMPLE_INFO_TABLE_SUFFIX + '`')
I don't like the fact that we are duplicating the logic of bigquery_util.compose_table_name(). Can we avoid this duplication?
Done.
def dict_values_equal(expected_dict):
  """Verifies that dictionary is the same as expected."""
  def _items_equal(actual_dict):
    actual = actual_dict[0]
I don't understand why this line is needed; let's discuss offline.
Once the pipeline finishes, in this case the result is combined into a single dict. However, that dict is still inside a PCollection, which consists of this single element. When transformed into a proper data structure, it becomes a list with a single item: the dict.
I'm down to discuss on VC.
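A runnable sketch of the test pattern this thread is about (made-up data, not the PR's test verbatim): the ToDict() output is a single-element PCollection, so the matcher receives a one-item list and compares its first element against the expected dict.

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that


def dict_values_equal(expected_dict):
  def _items_equal(actual):
    # 'actual' is the list of elements in the PCollection: [the_dict].
    if actual[0] != expected_dict:
      raise AssertionError(
          'Expected {} but got {}'.format(expected_dict, actual[0]))
  return _items_equal


with TestPipeline() as p:
  result = (p
            | beam.Create([(1, 'NA001'), (2, 'NA002')])
            | beam.combiners.ToDict())
  assert_that(result, dict_values_equal({1: 'NA001', 2: 'NA002'}))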
Force-pushed from 050e187 to 0267802.
Force-pushed from 0267802 to abe6d9b.
- Add BQ tests back with modified files.
- Adjust indentations between tables (__chr -> ___chr).
- Make sure to generate <SAMPLE_NAME>_<INDEX> for WITH_FILE_PATH encoding as per Sync meeting.
- Adjust --sample_names flag to handle WITH_FILE_PATH encoding as per Sync meeting.
- Rename --genomic_region back to --genomic_regions while still forcing 1 and only 1 value as per Sync meeting.
…nd IDs, and modify integration tests.
Force-pushed from abe6d9b to 62d6be0.
This PR:
- Renames genomic_regions to genomic_region, which becomes mandatory.
- If sample_names was not provided: builds an (id -> name+file) mapping from the BQ rows and emits {sample_name} or {file_path}/{sample_name} based on the encoding.
- If sample_names was provided: builds a (name -> id) mapping from the BQ rows.
Note: Had to disable BQ tests, because I will need to recreate all of the test table inputs and output files. That's gonna take a while, so you can start the review right away. Tested manually the following cases: with and without preserve_sample_order.