[WIP] Stats split by filters #132

Phlya · 2022-05-04T14:55:38Z

Starting some modifications to calculate stats split by mapq level or any other filters.

Not even tried yet, just pushing the modifications, PR to see the changes easily.

review-notebook-app · 2022-05-20T00:51:02Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

agalitsyna · 2022-05-20T01:05:17Z

I've modified the structure a little bit and isolated selection more. Also added the filtering expressions to --stats. I think it's ready to merge, if you need further modifications, feel free to add them to pre0.4.0 directly!

Phlya · 2022-05-20T06:21:33Z

Really cool, thank you! Shall we add the filters to dedup too somehow? Since that's where stats are done in distiller...

agalitsyna · 2022-05-20T09:11:13Z

Yes, but I it will be difficult to incorporate into the current distiller version because selection of filtered pairs occurs after dedup...
(ii) How do we mark/output removed pairs by filters in dedup?
(iii) Because multiple filters are possible, we need multiple dedup outputs for each one...

The deduplication filtering stats operation actually combines stats for two operations: dedup and select, which complicates the procedure. The idea behind pairtools is that the operations are separate functions/commands, so mixing them appears to breach this paradigm. I understand that dedup+select+stats will perform additional calculations, but optimizing this may be a task for pairtools 2.0.0 with command chaining.

What are your thoughts?

Phlya · 2022-05-20T10:24:01Z

OK, I think we need to discuss it. Because the whole point of this exercise was to add filters to stats in distiller...

What is (i)? :)

(ii) I think we don't need to actually apply these filters to the output pairs in dedup, just use the same filters as used for creation of coolers in calculation of stats during dedup. Then we get stats for e.g. mapq30 filter and hence we know what actually ended up in the coolers.

(iii) Well, the yaml output saves all filters in one file... So multiple outputs shouldn't be a problem?

I agree about mixing of multiple operations! But at least until we have binary pairs or smart operation chaining within python, this approach will be faster than trying to pipe stuff around, or even worse writing and reading from disk multiple times. That's how stats appeared in dedup originally afaik, and otherwise it's slow...

agalitsyna · 2022-05-20T19:28:04Z

Passing the filters for stats is a good idea, I've drafted the changes and hope to push them soon.
Related question, if the output is csv, and multiple filters are requested, do you think it's okay to pass multiple files for output stats?

Phlya · 2022-05-20T19:32:46Z

I guess so. I just implemented saving the "no_filter" stats in case of non-yaml output, but we could just save multiple files instead.

agalitsyna · 2022-05-20T19:37:24Z

Sounds good, then we can also modify the stats version "first requested filter" for non-yaml output that I ended up with, and make it save as many files with stats as wanted

* Separate cli and lib * pairtools flip fix for unannotated chromosomes, resolving #91 * handle empty chromosomes, resolved #76 * fixed rfrags indexing and first rfrag omission, resolved #73 * resolved or deprecated suggestions in #16 * merge improvements, header merge fixed - resolved merge without arguments: #61 - option to add only the first header in merge, resolved #18 * in merge, added option to concatenate instead of merge sorted inputs, resolving: #23 * merge now checks that columns of inputs are the same * I/O improvements - auto_open defaults to stdin/stdout when path evaluates to False. resolved #48 - auto_open defaults to stdin/stdout when the path is "-" - if the stream is optional, it's controlled by the module itself * Parse2 update (#99) (#109) Improved version of parse2 with resolved comments from the previous PR: #96 - Separation of parse and parse2 modules. Parse has an option --walks-policy all, which parses long walks, but always reporting pair orientation and outer positions of 5'-ends, as if each pair was read in paired-end mode independently. Parse2 is specifically designed for long walks, and has options --report-position and --report-orientation, which might be used to report junctions, or reads, or walks. - Parse2 has an option to parse single-end reads, --single-end option, tested on minimap2 output for MC-3C. - Parse2 has the max_fragment_size instead instead of parse's max_molecule_size, which help to determine the overlapping ends of forward and reverse reads. - Recent update simplifies the code: single _parse library used by both parse and parse2, - a number of functions that reduce repetitive code, e.g. push_pair function, - dosctrings and documented structure of _parse library. - Both parse and parse2 have the options to report 5' or 3' ends; to flip alignments according to chromosome coordinate. - Both parse and parse2 have the pysam backend - Improvements of the tests for parse and parse2 - Documentation includes description of various --report-orientation and --report-position cases. * Merge pairlib into pairtools.lib. * CLI for scalings added. * stats output in yaml format * Header CLI (#121) - new module called by `pairtools header` - submodules: - generate : Generate the header - set-columns : Add the columns to the .pairs/pairsam file - transfer : Transfer the header from one pairs file to another - validate-columns : Validate the columns of the .pairs/pairsam file - resolves #119 - option remove-columns for `pairtools select`: Remove the columns from .pairs/pairsam file * pairtools phase critical update (#114) * imporant fixes: - cython dedup with no-parent id forgotten counter reset; - sphinx doc update (added pysam); - header warning if empty and error if try to add a field to empy one * Add summaries (#105) * Add functions for duplication tile and complexity * Make dedup stats! * Benchmarks finalization * [WIP] Stats split by filters (#132) * Markasdup lib removed; markasdup CLI explanation improved * dedup filter stats added and tested Co-authored-by: Aleksandra Galitsyna <agalitzina@gmail.com> Co-authored-by: Ilya Flyamer <flyamer@gmail.com>

Phlya and others added 22 commits May 4, 2022 16:53

Start filtering stats

6be5007

snakefile and restriction walkthrough update.

24f7a19

distance bins saved, saving and reading yaml works

59dc6e4

Benchmarks finalization

faf5878

benchmark fixes

3306195

Various fixes, now works within dedup

35a15a4

update test

f8d4c17

Make multiple filters actually work, and save the filtering expression

a32343c

Remove debugging print statement

926f34c

Start fixing old format...

dced08e

Fir dtype so it gets saved

e20267f

Start filtering stats

4d64cac

distance bins saved, saving and reading yaml works

516c0d9

Various fixes, now works within dedup

a283d1a

update test

62c45e1

Make multiple filters actually work, and save the filtering expression

934036f

Remove debugging print statement

1f757ec

Start fixing old format...

0a7ed2f

pre0.4.0 rebased; stats filters separated as a sparate lib.

2c2dc3a

Python engine for stats added.

33b2898

merge fixes

c34b4bc

black stats and select

361443e

travis fix yaml

c664244

agalitsyna marked this pull request as ready for review May 20, 2022 01:00

agalitsyna merged commit 9a21e15 into pre0.4.0 May 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Stats split by filters #132

[WIP] Stats split by filters #132

Phlya commented May 4, 2022 •

edited

Loading

review-notebook-app bot commented May 20, 2022

agalitsyna commented May 20, 2022

Phlya commented May 20, 2022

agalitsyna commented May 20, 2022

Phlya commented May 20, 2022

agalitsyna commented May 20, 2022

Phlya commented May 20, 2022

agalitsyna commented May 20, 2022 •

edited

Loading

[WIP] Stats split by filters #132

[WIP] Stats split by filters #132

Conversation

Phlya commented May 4, 2022 • edited Loading

review-notebook-app bot commented May 20, 2022

agalitsyna commented May 20, 2022

Phlya commented May 20, 2022

agalitsyna commented May 20, 2022

Phlya commented May 20, 2022

agalitsyna commented May 20, 2022

Phlya commented May 20, 2022

agalitsyna commented May 20, 2022 • edited Loading

Phlya commented May 4, 2022 •

edited

Loading

agalitsyna commented May 20, 2022 •

edited

Loading