Refactor consistency_report.py #127

Merged
merged 1 commit into from
Jun 10, 2022

pwwang
Contributor

@pwwang pwwang commented Jun 10, 2022

Ideas

For every call in the union of calls across all VCF files, generate an integer whose binary representation records which files contain that call.
For example, with 3 VCF files, a call that is present only in the first VCF file is represented by:

0b100

A 1 at a given bit indicates that the call is present in the corresponding VCF. If the call is in all 3 VCFs, the number is 0b111. These numbers also serve as the "group"s in the breakdown of the VCFs' consistency.

With this, it's very easy to get:

  1. Summary of consistency, that is, the number of calls shared by each number of VCF files (just count the 1 bits in each integer)
  2. Breakdown of the VCFs' consistency, which is just the count of each "group".
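
The idea above can be sketched in a few lines of Python. This is only an illustration, not the code in the PR: the function names `presence_flags` and `summarize` and the string call keys are hypothetical, and each VCF is stood in for by a plain set of keys.

```python
from collections import Counter


def presence_flags(call_sets):
    """Map each call key to an integer whose bits mark which inputs contain it.

    Hypothetical helper: `call_sets` is a list of sets of call keys, one per VCF.
    """
    n = len(call_sets)
    flags = {}
    for i, calls in enumerate(call_sets):
        # The first file takes the leftmost bit, e.g. 0b100 with 3 files.
        bit = 1 << (n - 1 - i)
        for key in calls:
            flags[key] = flags.get(key, 0) | bit
    return flags


def summarize(flags, n):
    """Return (calls shared by k files, count per presence group)."""
    # Number of 1 bits in a flag = number of files sharing that call.
    shared_by = Counter(bin(flag).count("1") for flag in flags.values())
    # The zero-padded binary string is the "group" label.
    groups = Counter(format(flag, f"0{n}b") for flag in flags.values())
    return shared_by, groups


# Made-up call keys, for illustration only.
vcfs = [
    {"chr1:100:DEL", "chr1:500:INS", "chr2:50:DEL"},
    {"chr1:500:INS", "chr2:50:DEL"},
    {"chr2:50:DEL", "chr3:10:INS"},
]
flags = presence_flags(vcfs)
shared_by, groups = summarize(flags, len(vcfs))
print(dict(shared_by))
print(dict(groups))
```

Both outputs come from a single pass over the flag values, so no pairwise comparison between files is needed.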

Test

Tested with some VCFs; all numbers are identical before and after the PR. Only the order of some "group"s in the breakdown differs slightly.

Resource usage

With ~200 calls in 18 VCFs:

With the version prior to this PR:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    83    117.3 MiB    117.3 MiB           1   @timing
    84                                         def main():
    85                                             """
    86                                             Main entrypoint for truvari tools
    87                                             """
    88    117.4 MiB      0.1 MiB           2       parser = ArgumentParser(prog="truvari", description=USAGE,
    89    117.3 MiB      0.0 MiB           1                               formatter_class=argparse.RawDescriptionHelpFormatter)
    90                                         
    91    117.4 MiB      0.0 MiB           2       parser.add_argument("cmd", metavar="CMD", choices=TOOLS.keys(), type=str, default=None,
    92    117.4 MiB      0.0 MiB           1                           help="Command to execute")
    93    117.4 MiB      0.0 MiB           2       parser.add_argument("options", metavar="OPTIONS", nargs=argparse.REMAINDER,
    94    117.4 MiB      0.0 MiB           1                           help="Options to pass to the command")
    95                                         
    96    117.4 MiB      0.0 MiB           1       if len(sys.argv) == 1:
    97                                                 parser.print_help(sys.stderr)
    98                                                 sys.exit()
    99    117.4 MiB      0.0 MiB           1       args = parser.parse_args()
   100                                         
   101    127.7 MiB     10.3 MiB           1       TOOLS[args.cmd](args.options)


func:'main' args:[(), {}] took: 4.7767 sec

With this PR:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    83    116.4 MiB    116.4 MiB           1   @timing
    84                                         def main():
    85                                             """
    86                                             Main entrypoint for truvari tools
    87                                             """
    88    116.5 MiB      0.1 MiB           2       parser = ArgumentParser(prog="truvari", description=USAGE,
    89    116.4 MiB      0.0 MiB           1                               formatter_class=argparse.RawDescriptionHelpFormatter)
    90                                         
    91    116.5 MiB      0.0 MiB           2       parser.add_argument("cmd", metavar="CMD", choices=TOOLS.keys(), type=str, default=None,
    92    116.5 MiB      0.0 MiB           1                           help="Command to execute")
    93    116.5 MiB      0.0 MiB           2       parser.add_argument("options", metavar="OPTIONS", nargs=argparse.REMAINDER,
    94    116.5 MiB      0.0 MiB           1                           help="Options to pass to the command")
    95                                         
    96    116.5 MiB      0.0 MiB           1       if len(sys.argv) == 1:
    97                                                 parser.print_help(sys.stderr)
    98                                                 sys.exit()
    99    116.5 MiB      0.0 MiB           1       args = parser.parse_args()
   100                                         
   101    116.5 MiB      0.0 MiB           1       TOOLS[args.cmd](args.options)


func:'main' args:[(), {}] took: 0.1703 sec

The PR reduces the runtime from 4.7767 sec to 0.1703 sec.
Also note the memory increment at line 101: it was 10.3 MiB previously, whereas with this PR it adds almost no additional memory (0.0 MiB).

@ACEnglish
Owner

This is a very elegant solution. Great work!

Also, I believe we can still get the memory usage down by requiring sorted VCFs. As is, we're keeping the "\t".join(line.split("\t")[:5]) key for every unique VCF entry for the full runtime. However, if we know the entries are sorted, as soon as we hit position N+1 we know we don't need to keep keys from any position <= N. But those changes would be additional to what you've already made so I'll accept this PR.
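
A rough sketch of that sorted-input idea, for illustration only. It is not truvari's code: `streaming_groups` and `_tag` are hypothetical names, each input is assumed to be an iterable of `(chrom, pos, key)` tuples already sorted by `(chrom, pos)`, and all inputs are assumed to use the same chromosome ordering (plain string comparison here).

```python
import heapq
from collections import Counter
from itertools import groupby


def _tag(entries, index):
    """Attach the input's index to every record so merged records stay traceable."""
    for chrom, pos, key in entries:
        yield (chrom, pos), key, index


def streaming_groups(sorted_inputs):
    """Count presence groups while holding keys for one position at a time."""
    n = len(sorted_inputs)
    streams = [_tag(entries, i) for i, entries in enumerate(sorted_inputs)]
    # Merge the sorted streams lazily, ordered by (chrom, pos).
    merged = heapq.merge(*streams, key=lambda rec: rec[0])
    groups = Counter()
    for _, records in groupby(merged, key=lambda rec: rec[0]):
        flags = {}  # keys for the current position only
        for _, key, index in records:
            flags[key] = flags.get(key, 0) | (1 << (n - 1 - index))
        # Once the position changes, these keys are no longer needed.
        groups.update(format(flag, f"0{n}b") for flag in flags.values())
    return groups


# Made-up records, for illustration only.
a = [("chr1", 100, "chr1:100:DEL"), ("chr1", 500, "chr1:500:INS"), ("chr2", 50, "chr2:50:DEL")]
b = [("chr1", 500, "chr1:500:INS"), ("chr2", 50, "chr2:50:DEL")]
c = [("chr2", 50, "chr2:50:DEL"), ("chr3", 10, "chr3:10:INS")]
result = streaming_groups([a, b, c])
print(dict(result))
```

Only the per-group counters persist across positions; the per-key dictionary is rebuilt for each position, which is where the memory saving would come from.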

@ACEnglish ACEnglish merged commit ab2b0c0 into ACEnglish:develop Jun 10, 2022
@ACEnglish
Owner

I was able to prototype the sorted VCF work. It did help with the memory usage, but it came at the cost of a ~3.5x runtime increase. file_zipper uses pysam to handle VCFs, which has a whole lot of overhead that consistency circumvents by using a minimal file reader. So we'll leave the tool as is.

Again, thank you for your contribution!

@pwwang
Contributor Author

pwwang commented Jun 16, 2022

Nice. Looking forward to the new release.
