Refactor consistency_report.py #127

pwwang · 2022-06-10T20:11:24Z

Ideas

For every call in the union of all VCF files, generate an integer whose binary digits record which files contain that call.
For example, if we have 3 VCF files, and a call is present only in the first VCF file, the integer to represent this call is:

0b100

A 1 at a given bit indicates that the call is present in the corresponding VCF. If the call is in all 3 VCFs, the number is 0b111. These numbers are also the "group"s in the breakdown of the VCFs' consistency.

With this, it's very easy to get:

Summary of consistency, that is, the number of calls shared by different numbers of VCF files (just count the 1 bits in each integer)
Breakdown of VCFs' consistency, that is just the count of the "group"s.
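
A minimal sketch of the idea above, not the code in this PR: the function names `presence_flags` and `summarize`, and the string keys standing in for calls, are hypothetical. It assumes each VCF has already been reduced to a set of call keys, and it maps the first file to the highest bit so that a call present only in the first of 3 files gets 0b100.

```python
from collections import Counter


def presence_flags(call_sets):
    """Map each call key to a bitmask; the first file gets the highest bit."""
    n = len(call_sets)
    flags = {}
    for i, calls in enumerate(call_sets):
        bit = 1 << (n - 1 - i)
        for key in calls:
            # OR in this file's bit; unseen keys start from 0
            flags[key] = flags.get(key, 0) | bit
    return flags


def summarize(flags):
    """Return (calls per number of files sharing them, calls per group)."""
    # Summary: the number of set bits is the number of files containing the call
    shared_by = Counter(bin(flag).count("1") for flag in flags.values())
    # Breakdown: each distinct bitmask is one "group"
    groups = Counter(flags.values())
    return shared_by, groups


# Toy input: three "VCFs", each already reduced to a set of call keys
call_sets = [
    {"chr1:100", "chr1:200", "chr2:50"},
    {"chr1:200", "chr2:50"},
    {"chr2:50", "chr3:7"},
]
flags = presence_flags(call_sets)
shared_by, groups = summarize(flags)
```

Both reports come from a single pass over the flag values, so no per-group set intersections are needed.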

Test

Tested with some VCFs: all numbers are the same before and after the PR; only the order of some "group"s in the breakdown is slightly different.

Resource usage

With ~200 calls in 18 VCFs:

With the version prior to this PR:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    83    117.3 MiB    117.3 MiB           1   @timing
    84                                         def main():
    85                                             """
    86                                             Main entrypoint for truvari tools
    87                                             """
    88    117.4 MiB      0.1 MiB           2       parser = ArgumentParser(prog="truvari", description=USAGE,
    89    117.3 MiB      0.0 MiB           1                               formatter_class=argparse.RawDescriptionHelpFormatter)
    90                                         
    91    117.4 MiB      0.0 MiB           2       parser.add_argument("cmd", metavar="CMD", choices=TOOLS.keys(), type=str, default=None,
    92    117.4 MiB      0.0 MiB           1                           help="Command to execute")
    93    117.4 MiB      0.0 MiB           2       parser.add_argument("options", metavar="OPTIONS", nargs=argparse.REMAINDER,
    94    117.4 MiB      0.0 MiB           1                           help="Options to pass to the command")
    95                                         
    96    117.4 MiB      0.0 MiB           1       if len(sys.argv) == 1:
    97                                                 parser.print_help(sys.stderr)
    98                                                 sys.exit()
    99    117.4 MiB      0.0 MiB           1       args = parser.parse_args()
   100                                         
   101    127.7 MiB     10.3 MiB           1       TOOLS[args.cmd](args.options)


func:'main' args:[(), {}] took: 4.7767 sec

With this PR:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    83    116.4 MiB    116.4 MiB           1   @timing
    84                                         def main():
    85                                             """
    86                                             Main entrypoint for truvari tools
    87                                             """
    88    116.5 MiB      0.1 MiB           2       parser = ArgumentParser(prog="truvari", description=USAGE,
    89    116.4 MiB      0.0 MiB           1                               formatter_class=argparse.RawDescriptionHelpFormatter)
    90                                         
    91    116.5 MiB      0.0 MiB           2       parser.add_argument("cmd", metavar="CMD", choices=TOOLS.keys(), type=str, default=None,
    92    116.5 MiB      0.0 MiB           1                           help="Command to execute")
    93    116.5 MiB      0.0 MiB           2       parser.add_argument("options", metavar="OPTIONS", nargs=argparse.REMAINDER,
    94    116.5 MiB      0.0 MiB           1                           help="Options to pass to the command")
    95                                         
    96    116.5 MiB      0.0 MiB           1       if len(sys.argv) == 1:
    97                                                 parser.print_help(sys.stderr)
    98                                                 sys.exit()
    99    116.5 MiB      0.0 MiB           1       args = parser.parse_args()
   100                                         
   101    116.5 MiB      0.0 MiB           1       TOOLS[args.cmd](args.options)


func:'main' args:[(), {}] took: 0.1703 sec

The PR reduces the runtime from 4.7767 sec to 0.1703 sec (roughly 28x faster).
Also note the memory increment at line 101: with this PR it is almost nothing (0.0 MiB), whereas it was 10.3 MiB previously.

ACEnglish · 2022-06-10T20:39:22Z

This is a very elegant solution. Great work!

Also, I believe we can still get the memory usage down by requiring sorted VCFs. As is, we're keeping the "\t".join(line.split("\t")[:5]) key for every unique VCF entry for the full runtime. However, if we know the entries are sorted, as soon as we hit position N+1 we know we don't need to keep keys from any position <= N. But those changes would be additional to what you've already made so I'll accept this PR.
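
A rough sketch of the sorted-input idea, under stated assumptions: this is not truvari's code or the prototype mentioned later in the thread, the names `_tag` and `count_groups_sorted` are hypothetical, and each input is assumed to be an iterable of `(position, key)` pairs on a single chromosome, already sorted by position. Keys are kept only for the current position and flushed into the group counts as soon as the position advances.

```python
import heapq
from collections import Counter


def _tag(stream, bit):
    """Attach a file's bit to every (position, key) record it yields."""
    for pos, key in stream:
        yield pos, key, bit


def count_groups_sorted(streams):
    """Count bitmask groups while holding keys for one position at a time."""
    n = len(streams)
    tagged = [_tag(s, 1 << (n - 1 - i)) for i, s in enumerate(streams)]
    groups = Counter()
    pending = {}    # key -> bitmask, for the current position only
    current = None
    # heapq.merge lazily interleaves the already-sorted inputs by position
    for pos, key, bit in heapq.merge(*tagged, key=lambda rec: rec[0]):
        if pos != current:
            # Position advanced: earlier keys cannot appear again
            groups.update(pending.values())
            pending.clear()
            current = pos
        pending[key] = pending.get(key, 0) | bit
    groups.update(pending.values())  # flush the final position
    return groups


streams = [
    [(100, "A"), (200, "B")],
    [(200, "B"), (300, "C")],
]
groups = count_groups_sorted(streams)
```

Peak memory here is bounded by the number of distinct keys at one position rather than by the whole union of calls; handling multiple chromosomes and real VCF parsing is left out.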

ACEnglish · 2022-06-11T17:17:53Z

I was able to prototype the sorted VCF work. It did help with the memory usage, but it came at the cost of a ~3.5x runtime increase. file_zipper uses pysam to handle VCFs, which adds a whole lot of overhead that consistency circumvents by using a minimal file reader. So we'll leave the tool as is.

Again, thank you for your contribution!

pwwang · 2022-06-16T17:50:32Z

Nice. Looking forward to the new release.

♻️ Refactor consistency_report.py

2c7617d

pwwang mentioned this pull request Jun 10, 2022

Use generator instead of list in create_file_intersections() to save memory #126

Merged

ACEnglish merged commit ab2b0c0 into ACEnglish:develop Jun 10, 2022
