Ah spike writeapi #7530
Resolved (outdated) review comments on:
- src/main/java/org/broadinstitute/hellbender/tools/gvs/ingest/PendingBQWriter.java
- src/main/java/org/broadinstitute/hellbender/tools/gvs/ingest/PetCreator.java
- src/main/java/org/broadinstitute/hellbender/tools/gvs/ingest/VetCreator.java
I love this!
Force-pushed from 4557014 to f93cc36
Force-pushed from c9e5191 to 44fa33c
First part of review -- everything BUT the java code :D
A sample set for the quickstart has already been created with 10 samples and paths to re-blocked gVCFs for each sample. Run the two import workflows against this sample set by selecting "sample_set" as the root entity type ("Step 1") and `gvs_demo-10` for the data ("Step 2"). If you are creating your own sample set, note that the sample table should have a column for the re-blocked gVCFs (`hg38_reblocked_gvcf` or `reblocked_gvcf_path`) and their index files need to be in the same location.
## 2. Reblock samples
Run the GvsAoUReblockGvcf workflow on the samples
Can this just be the WARP pipeline? Feels off to have people running an AoU-specific pipeline in our quickstart
Good point. I intended this workspace to be the template for the Gvs AoU-specific workspace. I'll change the name of the markdown file for this use.
To optimize the internal queries, each sample must have a unique and consecutive integer ID assigned. Run the `GvsAssignIds` workflow, which will create an appropriate ID for each sample in the sample set and update the BigQuery dataset with the sample name to ID mapping info.
## 3. Load data
### 3.1 Assign Gvs IDs (suggest renaming to GvsPrepareSamples)
FWIW - I prefer the Assign IDs name since that's what this step is doing.
It also takes care of creating the tables in BQ so I wanted that part to be more obvious. I'm fine leaving the name as is for now.
These are the required parameters which must be supplied to the workflow:
| Parameter | Description |
| ----------------- | ----------- |
| project_id | The name of the Google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| external\_sample_names | Sample ids from the data model that will be mapped to gvs ids |
| workspace_namespace | The namespace of the workspace |
This part feels funky -- maybe there's no better way to do it, but this means you must run this workflow inside Terra (thinking of other users), and also, IF you make a mistake and put the wrong workspace name here… it could mess up your data in the other workspace.
Is there not a way to map these results back onto the data model using Terra features? If not… maybe we should rethink how IDs are allocated.
ToL: why do we need these IDs in the data model anyway? Feels like an implementation detail we shouldn't expose our users to…
Again, I meant for this documentation to be specific to the workspace we use for AoU. I think it is useful to have this information in the workspace (just given how many times I've had to get the mapping between the two for tracking - but maybe we won't have the same issues that I had to deal with previously). It can be removed if it really isn't helpful.
| output_directory | A unique GCS path to be used for loading; it can be in the workspace bucket, e.g. `gs://fc-124-12-132-123-31/gvs/demo1` |
| external\_sample_name | The name of the sample from the data model |
| gvs\_sample_id | The gvs_id from the data model |
Can we just grab this from BigQuery? One less thing for the user to do/mess up :D
Since we are now running ingest per sample, I didn't want each VM to have to make a query to the sample table for the ID. Any ideas of other ways where we won't run into query limits?
**NOTE**: if your workflow fails, you will need to manually remove a lockfile from the output directory. It is called LOCKFILE, and can be removed with `gsutil rm`.
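For example, using the sample `output_directory` from the parameter table above (a hypothetical bucket path; substitute your own):

```bash
# remove the stale LOCKFILE left behind by a failed run
gsutil rm gs://fc-124-12-132-123-31/gvs/demo1/LOCKFILE
```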
soooo nice to not have to deal with this or the loading directory!!!
preemptible_tries = preemptible_tries
}

call ImportSample {
indenting seems off
# check the INFORMATION_SCHEMA.PARTITIONS table to see if any of input sample names/ids have data loaded into their partitions
# this returns the list of sample names that do already have data loaded
bq --location=US --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
"SELECT distinct(partition_id) FROM ${INFO_SCHEMA_TABLE} p WHERE (p.partition_id = CAST(~{gvs_sample_id} AS STRING)) AND p.total_logical_bytes > 0 AND (table_name like 'pet_%' OR table_name like 'vet_%')" | \
this should also check the ref_ranges_% tables for data
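A minimal sketch of what that extended check could look like, reusing the existing query and just adding the `ref_ranges_%` pattern (the `~{}` interpolations and `INFO_SCHEMA_TABLE` are assumed to be defined exactly as in the current task):

```bash
# sketch: also treat data in ref_ranges_% partitions as "already loaded"
bq --location=US --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
  "SELECT distinct(partition_id) FROM ${INFO_SCHEMA_TABLE} p \
   WHERE (p.partition_id = CAST(~{gvs_sample_id} AS STRING)) \
     AND p.total_logical_bytes > 0 \
     AND (table_name like 'pet_%' OR table_name like 'vet_%' OR table_name like 'ref_ranges_%')"
```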
String docker
}

Int disk_size = if defined(drop_state) then 50 else 75
Do we need this much disk now that we are streaming?
private PetTsvCreator petTsvCreator;
private VetTsvCreator vetTsvCreator;
private SampleInfoTsvCreator sampleInfoTsvCreator;
private PetCreator petCreator;
Love the refactoring! Maybe this could be "RefCreator" since it could be either Pet or ReferenceRanges?
@@ -41,11 +46,14 @@
  private final SAMSequenceDictionary seqDictionary;
  private final Set<GQStateEnum> gqStatesToIgnore = new HashSet<GQStateEnum>();
  private GenomeLocSortedSet coverageLocSortedSet;
- private final static String PET_FILETYPE_PREFIX = "pet_";
+ private static String PET_FILETYPE_PREFIX = "pet_";
no longer final?
String table_where_clause = vet_check + ref_check + pet_check
I'm not following the logic here… if I say I'm loading vet and ref_ranges… won't these "false" values get into the WHERE clause when you just concatenate them?
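For illustration, here is a hypothetical bash sketch of the kind of guard the comment is asking for (flag and variable names are made up; the real workflow builds this string in WDL): only the table types actually selected for loading contribute a pattern, so no literal "false" can end up in the WHERE clause.

```bash
# hypothetical flags standing in for the workflow's load_* inputs
load_vet="true"
load_ref_ranges="true"
load_pet="false"

# collect only the patterns for table types that are actually being loaded
where_parts=()
[[ "$load_vet" == "true" ]]        && where_parts+=("table_name LIKE 'vet_%'")
[[ "$load_ref_ranges" == "true" ]] && where_parts+=("table_name LIKE 'ref_ranges_%'")
[[ "$load_pet" == "true" ]]        && where_parts+=("table_name LIKE 'pet_%'")

# join the selected patterns with OR (real code should also handle the case where none are selected)
table_where_clause=""
for part in "${where_parts[@]}"; do
  if [[ -z "$table_where_clause" ]]; then
    table_where_clause="$part"
  else
    table_where_clause="$table_where_clause OR $part"
  fi
done

echo "WHERE (${table_where_clause}) AND total_logical_bytes > 0"
```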
Force-pushed from 5f2e1d3 to 205954e
}

task CheckForDuplicateData {
While running, I noticed this step call-cached… Should it? Should it be volatile instead?
@@ -315,7 +316,7 @@ task GetNumSamplesLoaded {
  echo "project_id = ~{project_id}" > ~/.bigqueryrc
  bq query --location=US --project_id=~{project_id} --format=csv --use_legacy_sql=false \
-   "SELECT COUNT(*) as num_rows FROM ~{fq_sample_table} WHERE is_loaded = true" > num_rows.csv
+   "SELECT COUNT(*) as num_rows FROM ~{fq_sample_table}" > num_rows.csv
I'm confused about why we don't need to check this anymore in the (default) case where the table is sample_info
This must have been a holdover from when I thought I was going to remove this field and forgot to put this back in. Adding it back.
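i.e. restoring the original filter shown in the diff above:

```bash
# count only the samples that have actually finished loading
bq query --location=US --project_id=~{project_id} --format=csv --use_legacy_sql=false \
  "SELECT COUNT(*) as num_rows FROM ~{fq_sample_table} WHERE is_loaded = true" > num_rows.csv
```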
- updated quickstart
- added sample_load_status table creation; log load status and verify sample is not loaded
- used _default stream instead; default to new jar
- safety checks for restarts via preemptibles; branch to dockstore; fixed NPE
Force-pushed from 6410940 to 76f78c6
This PR merges the BQ Write API with the ref ranges table and refactors the Import WDL to be single-sample.
I pulled out the table creation into its own WDL; it doesn't make sense to try this for each individual sample, so for now it needs to be run separately. We are also not setting the is_loaded field in the sample_info table - this needs to be resolved when we decide how we want to track that info.
See https://docs.google.com/document/d/1_ox38x7YjSeQx1I-6K_6kB4TTlonaEah2LRGevN9GmM/edit# for details