Make the way sample names are derived consistent across CNV tools #2910

Closed
droazen opened this issue Jun 5, 2017 · 1 comment

droazen commented Jun 5, 2017

@asmirnov239 commented on Wed Oct 19 2016

Things we discussed with @samuelklee that could help with this:

- I think that all files we generate for individual case samples ("ReadCountCollection" files for coverage profiles, "AllelicCountCollection" files for het pulldowns, and segment files) should contain the sample name as metadata in a header comment with a common tag (e.g., #sampleName = ...). Currently, these sample names are stored in column headers, in the fields of a SAMPLE column, or not at all, depending on the type of file. A common header tag would drastically simplify the SampleNameFinder class, which would then basically only need a single method to parse this header comment and return the name.

- CLIs that generate a file from an input BAM (CalculateTargetCoverage, GetHetCoverage, etc.) should take the sample name from that BAM by default. Since these are the first steps in our workflows, we could also optionally allow the user to specify a sample name different from that in the BAM.

- Subsequent CLIs should then take the sample name from the header comment.

- CLIs that take multiple non-BAM input files should check for consistency of the sample names as part of the argument validation step.

- CLIs that output the sample name in plots should derive these from the header comment.

- For files that contain data from multiple samples (e.g., the output of CombineReadCounts), we can probably leave the sample names in the column headers, but it would be nice to output the type of data stored in a header comment as well (e.g., PCOV or RAW). At some point I think we should restrict to RAW output only; see broadinstitute/gatk-protected#615.

- Entity names specified by the input file for the WDLs can be separate from the BAM sample names by default. However, if we do allow the user to optionally specify sample names as described in the second bullet point, we can set up the WDL to pass the entity names.
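
To make the header-comment idea concrete, here is a minimal sketch of what the single parsing method and the consistency check could look like. Everything in it is an assumption for illustration: the class name `SampleNameHeader`, the method names, and the exact `#sampleName = ...` syntax are hypothetical and are not the existing SampleNameFinder API. Files are represented simply as lists of lines.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper for illustration; not the actual GATK SampleNameFinder. */
final class SampleNameHeader {
    // Assumed tag, following the "#sampleName = ..." example in the proposal.
    static final String TAG = "#sampleName";

    private SampleNameHeader() {}

    /** Returns the sample name found in the leading '#' comment lines, if any. */
    static Optional<String> parse(final List<String> lines) {
        for (final String line : lines) {
            if (!line.startsWith("#")) {
                break; // header comments end at the first non-comment line
            }
            final int eq = line.indexOf('=');
            if (eq >= 0 && line.substring(0, eq).trim().equals(TAG)) {
                final String name = line.substring(eq + 1).trim();
                return name.isEmpty() ? Optional.empty() : Optional.of(name);
            }
        }
        return Optional.empty();
    }

    /** Argument-validation step: every input must carry the same sample name. */
    static String requireConsistent(final List<List<String>> files) {
        final Set<String> names = new LinkedHashSet<>();
        for (final List<String> lines : files) {
            names.add(parse(lines).orElseThrow(
                    () -> new IllegalArgumentException("Missing " + TAG + " header comment")));
        }
        if (names.size() != 1) {
            throw new IllegalArgumentException("Inconsistent sample names: " + names);
        }
        return names.iterator().next();
    }
}
```

A CLI that takes several non-BAM inputs could call something like `requireConsistent` once during argument validation and reuse the returned name for plots and output headers.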

@samuelklee

Closed in #3914.
