This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

add test cases for maximum e-value filter on alignment results #7

Open
katrinakalantar opened this issue Jun 12, 2020 · 0 comments
Comments

@katrinakalantar
Contributor

Assertion: The maximum e-value for alignments in IDseq is 1.

Implementation Details:
The maximum e-value threshold filter is applied in two different locations within the code base:

  • For short read alignments, the filter is applied inside the iterate_m8() function in the .m8 utils.
  • For contig alignments, the filter is applied using filters in PipelineStepBlastContigs.
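As a rough illustration of the kind of filter described above, the sketch below drops any row whose e-value exceeds the threshold. It is not the actual `iterate_m8()` or `PipelineStepBlastContigs` code: the function name `filter_by_max_evalue`, the constant names, and the assumption that the e-value sits in the 11th tab-separated column (as in standard BLAST tabular output) are all hypothetical.

```python
from typing import Iterable, Iterator, List

MAX_EVALUE = 1.0    # threshold stated in this issue
EVALUE_COLUMN = 10  # assumption: 0-based index of the e-value column in BLAST-style tabular output


def filter_by_max_evalue(lines: Iterable[str],
                         max_evalue: float = MAX_EVALUE) -> Iterator[List[str]]:
    """Yield the split fields of each row whose e-value is <= max_evalue.

    Hypothetical helper for illustration only; blank lines and '#' comment
    lines are skipped.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split("\t")
        if float(fields[EVALUE_COLUMN]) <= max_evalue:
            yield fields
```

Under these assumptions a row with an e-value of exactly 1 is kept, since the issue states the maximum allowed value is 1.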

We expect that there may be alignments with e-values > 1 in the initial alignment files (gsnap.m8, rapsearch2.m8, gsnap.blast.m8, rapsearch2.blast.m8).
The filter is then applied to the raw .m8 results when parsing for the top hits. There should never be e-values > 1 in the following files:

  • gsnap.deduped.m8
  • rapsearch2.deduped.m8
  • gsnap.blast.top.m8
  • rapsearch2.blast.top.m8
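A test for this invariant could be sketched roughly as follows. The helper `find_evalue_violations` is hypothetical, and it assumes each file is tab-separated with the e-value in the 11th column (standard BLAST tabular layout); the real output files may carry extra columns or a different e-value encoding, in which case the column index or parsing would need adjusting.

```python
from typing import Iterable, List, Tuple


def find_evalue_violations(lines: Iterable[str],
                           max_evalue: float = 1.0,
                           evalue_column: int = 10) -> List[Tuple[int, float]]:
    """Return (1-based line number, e-value) for every row above max_evalue.

    Hypothetical helper: assumes tab-separated rows with the e-value at
    `evalue_column` (0-based). Blank and '#' comment lines are ignored.
    """
    violations = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        evalue = float(stripped.split("\t")[evalue_column])
        if evalue > max_evalue:
            violations.append((line_number, evalue))
    return violations
```

A test case could then open each of the four files listed above and assert that the returned list is empty, for example `with open("gsnap.deduped.m8") as f: assert find_evalue_violations(f) == []`.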

This was implemented as part of chanzuckerberg/czid-dag#309

Test Sample:
This was tested on staging using benchmark sample UnAmbiguouslyMapped_ds.gut. Staging sample ID 19379 was run before the fix; staging sample ID 19361 was run after the fix.

For example, in sample 19361:

  • gsnap.m8 has 32 rows with e-value > 1, but gsnap.deduped.m8 has zero.
  • rapsearch2.m8 has 45 rows with e-value > 1, but rapsearch2.deduped.m8 has zero.
  • rapsearch2.blast.m8 has 5172 rows with e-value > 1, but rapsearch2.blast.top.m8 has zero.
