-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sf mito fix #4751
Sf mito fix #4751
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4751 +/- ##
==============================================
+ Coverage 80.07% 80.187% +0.117%
- Complexity 17420 18035 +615
==============================================
Files 1080 1083 +3
Lines 63131 66108 +2977
Branches 10200 10906 +706
==============================================
+ Hits 50549 53010 +2461
- Misses 8590 8963 +373
- Partials 3992 4135 +143
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question inline for Sam.
@@ -103,6 +103,8 @@ def reference_string_to_tensor(reference): | |||
dna_data[i, defines.DNA_SYMBOLS[b]] = 1.0 | |||
elif b in defines.AMBIGUITY_CODES: | |||
dna_data[i] = defines.AMBIGUITY_CODES[b] | |||
elif b == '\x00': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lucidtronix - question on this - it seems like this issue (zeros in the ref bases) can happen whenever the position of a variant is within windowsize/2 of the end of the reference contig, since thats what causes the tool to pad out the ref base string. This makes me wonder if the reference tensor is being constructed correctly when a SNP occurs near the edges of a contig - normally the reference base that corresponds to the variant base in question is in the middle of the read tensor, and padded out on each side. But not so at the edges, where it will be offset. Is that expected, or should we be padding out from the middle on both sides of the base ? I think the answer to that will determine whether its ok to just break out at the first 0.
Also, if this really isn't mito-specific, I don't think we need to add a mito reference for the test case - we might be able to get away with creating a test vcf that uses an existing reference, as long as it has a variant near the edge of a contig.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right variants at the very beginning (i.e. within 64bp of the start) of contigs are not correctly constructed. Since these variants are probably only going to show up in humans in the mitochondria, their CNN scores are pretty meaningless anyway. After chatting with @ldgauthier we think it's best to get this fix in now to prevent crashes, and then revisit if necessary when mitochondrial best practices are ready.
I removed the mitochondrial reference files and updated the test. Back to you @cmnbroad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor request then this looks good.
public void testOnMitochondria() throws IOException{ | ||
final String mitoVcf = largeFileTestDir + "VQSR/errorM.vcf"; | ||
public void testOnContigEdge() throws IOException{ | ||
final String edgeVcf = largeFileTestDir + "VQSR/errorM.vcf"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sounds good. I should have mentioned this last time - can you name this file something distinguishing, like "variantNearContigEdge.vcf" or some such thing, and out of the large folder and into the regular test files, since its small - maybe even into a CNN test folder. Otherwise this looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lucidtronix Also you can squash this down to one commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved and renamed test file, rebased and squashed if tests pass good to merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more change I didn't notice before. Then 👍to merge once tests pass.
final ArgumentsBuilder argsBuilder = new ArgumentsBuilder(); | ||
argsBuilder.addArgument(StandardArgumentDefinitions.VARIANT_LONG_NAME, edgeVcf) | ||
.addArgument(StandardArgumentDefinitions.REFERENCE_LONG_NAME, hg19MiniReference) | ||
.addArgument("architecture", architecture1D) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lucidtronix I just noticed this test is using 1d, which doesn't exercise the changed code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the users who encountered this error were running the 1D model, but either way both models will call the reference_string_to_tensor function in inference.py where the null character check was added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, I was thinking reads, but this was reference.
Small bugfix to prevent crashes on the ends of the mitochondrial contig.