
Chunked dedup #107

Merged · 38 commits merged into master · Apr 4, 2022

Conversation

@Phlya (Member) commented Sep 6, 2021

Rewrote the deduplicator based on chunking and cKDTree. This removes the requirement for Cython code there, simplifying maintenance.

  • Haven't removed the cython code yet
    (decided to keep it for now)

Added a feature to annotate duplicates with the read ID of the deduped "parent" read. This will make it possible to check which duplicates are optical/clustering and which are "real" PCR duplicates. This is critical for correctly estimating library complexity from test sequencing, in particular on the patterned flowcells of recent Illumina instruments.

An observation: with the current implementation, if reads A and B are duplicates, and B and C are duplicates, but A and C are not (which can happen with a non-zero --max-mismatch), then C is not reported as a duplicate. I am not sure how it behaved previously, or what the "right" behaviour is.
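The A/B/C situation is easy to reproduce with a toy example (illustrative coordinates, not from the PR's data): with --max-mismatch 2 and Chebyshev ("max") distance, scipy's cKDTree finds A~B and B~C as duplicate pairs, but not A~C.

```python
import numpy as np
from scipy.spatial import cKDTree

# Three reads A, B, C at positions 0, 2 and 4 along one coordinate.
pos = np.array([[0, 0], [2, 0], [4, 0]])

tree = cKDTree(pos)
# All pairs within Chebyshev distance 2 (Minkowski with p=inf):
pairs = tree.query_pairs(r=2, p=np.inf)
print(sorted(pairs))  # [(0, 1), (1, 2)] - A~B and B~C, but not A~C
```

Whether C should then be marked as a duplicate depends on whether duplicate groups are defined pairwise or transitively, which is exactly the behavioural difference discussed below.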

Also, this PR adds the .add_pairs_from_dataframe() method to PairCounter in stats, to save stats from each chunk during deduping.

@Phlya requested a review from @golobor on September 6, 2021 at 15:38
@Phlya (Member, Author) commented Sep 8, 2021

The code currently works as expected, but there are some outstanding considerations we need to discuss/agree on and implement before it can be up for merging. I wanted to share some explanation, learnings and observations I made along the way, and raise discussion points.

Deduplication requires pairs to be sorted, which ensures that potential duplicates are always close to each other in the file. The Cython implementation (I believe it was written by Max) reads pairs one by one and compares neighbours to each other, thus identifying duplicates. The code is fast (about 6 s per mln pairs), but because it is in Cython it is difficult to maintain or modify.

In addition to deduplication, dedup also aggregates pair statistics. This used to be done one by one: each pair was pushed into PairCounter, which analyzed it and added a count to the appropriate categories (dups/nodups/unmapped, cis/trans, etc.).

The aim of this PR is to rewrite the deduper using only Python code with minimal loss in performance. This significantly simplifies the code and makes it more maintainable and easier to modify. The idea is to read the pairs in chunks. Each chunk can be deduped efficiently with a nearest-neighbour search using a KD-tree (or ball tree). There is an important consideration for chunking: if the edge of a chunk splits a group of duplicates, deduplication will not work correctly. To avoid that, I save the bottom 1% of the deduped pairs from the previous chunk and prepend it to the new chunk. This way the new chunk is deduped against the previous deduped pairs, and the same "parent" is assigned to any new duplicates in the new chunk. I of course don't forget to discard the pairs carried over from the previous chunk before returning the new deduped pairs.
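The chunking-with-carryover scheme can be sketched roughly as follows. This is a simplified illustration using scipy's cKDTree, not the actual pairtools code: the function names are made up, pairs are reduced to bare (pos1, pos2) coordinates, and parent-ID bookkeeping is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def dedup_chunk(coords, max_mismatch, n_carried=0):
    """Mark duplicates within one chunk of (pos1, pos2) coordinates.
    The first n_carried rows are already-deduped pairs carried over
    from the previous chunk; their flags are dropped from the result."""
    dup = np.zeros(len(coords), dtype=bool)
    tree = cKDTree(coords)
    # Any read with an earlier neighbour within max_mismatch (Chebyshev
    # distance) is a duplicate; chains collapse transitively onto the
    # first read of the group.
    for i, j in tree.query_pairs(r=max_mismatch, p=np.inf):
        dup[j] = True
    return dup[n_carried:]

def streaming_dedup(chunks, max_mismatch, carryover_frac=0.01):
    """Dedup a stream of sorted chunks, prepending the bottom fraction
    of the previous chunk's deduped pairs so that duplicate groups
    split across a chunk boundary are still detected."""
    carried = np.empty((0, 2), dtype=int)
    for chunk in chunks:
        coords = np.vstack([carried, chunk])
        dup = dedup_chunk(coords, max_mismatch, n_carried=len(carried))
        kept = chunk[~dup]
        n_carry = max(1, int(len(kept) * carryover_frac))
        carried = kept[-n_carry:]  # bottom of the deduped pairs
        yield kept
```

With this structure, a duplicate sitting at the start of a new chunk is matched against the carried-over deduped pairs, just as described above.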

KD-tree based nearest-neighbour detection is implemented in both scipy and scikit-learn, with slightly different features and advantages. At the moment of writing this comment, this PR defaults to scipy (which is now a dependency of pairtools) and only tries to import sklearn when it is requested as the backend, but perhaps this option will be removed (see below). The cython code is also still available.

When using the scipy or sklearn backend, a new option becomes available: --save-parent-id, which writes the readID of the "parent" read (the one kept in the .dedup pairs) as a separate column in the .dups pairs file. This will enable smarter complexity estimation and classification of duplicates into PCR/optical/clustering etc., using the flowcell coordinates stored in read names. The --chunksize option is also available for these backends.

The sklearn implementation supports using multiple cores; the --n-proc argument picks the number of cores. So far I have not observed any benefit from using more than one core, but I have not investigated it carefully. Perhaps it would help with larger chunksizes (I have only tried 1 mln at most so far).

I have timed all three backends with a small file with just over 1 mln pairs.

  • cython takes about 6 seconds
  • scipy takes about 10 seconds
  • sklearn takes about 15 sec
    (for --method sum they are all a tiny bit slower than --method max)

I think these timings are acceptable for both new options, although scipy is clearly faster than sklearn.
Interestingly, in this small benchmark chunksize had little or no impact on the performance of either the scipy or the sklearn method for chunksizes between 10,000 and 10,000,000 (i.e. all data in one chunk); 1,000 caused a significant loss in performance.

I will repeat benchmarking with a larger file where a larger chunksize could benefit the multi-core feature of sklearn, and will see if there is any reason at all to keep it as an option. If it's always slower than scipy there is no point.

Finally, there is a tiny difference in the output between cython and the new implementations. Cython reports 5 more pairs than scipy or sklearn for my small 1 mln pairs test file with --max-mismatch 2. Upon closer inspection, this arises when there is a group of at least three reads, call them A, B and C, where B is considered a duplicate of A, and C a duplicate of B but not of A: cython reports C as a non-duplicate.

Here is an example from real data: a cluster of pairs that are all duplicates of each other (with max mismatch 2).

383227 	NB551016:788:HTKVCAFX2:4:21506:11077:7795 	chr16 	33961225 	chr21 	16977761 	- 	- 	UU 	27 	60
383228 	NB551016:788:HTKVCAFX2:1:11310:16231:17104 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	9 	60
383229 	NB551016:788:HTKVCAFX2:2:21103:16887:17797 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	12 	60
383230 	NB551016:788:HTKVCAFX2:3:11607:11978:17519 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	22 	60
383231 	NB551016:788:HTKVCAFX2:4:11411:23554:18663 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	14 	60
383232 	NB551016:788:HTKVCAFX2:3:11601:3973:6179 	chr16 	33961228 	chr21 	16977761 	- 	- 	UU 	11 	60

The cython code reports 2 pairs:

77742 	NB551016:788:HTKVCAFX2:4:21506:11077:7795 	chr16 	33961225 	chr21 	16977761 	- 	- 	UU 	27 	60
77743 	NB551016:788:HTKVCAFX2:3:11601:3973:6179 	chr16 	33961228 	chr21 	16977761 	- 	- 	UU 	11 	60

The second of them is certainly a duplicate of the other reads with coordinate 33961227, but not of the top "parent" read.
I think this is actually a bug in the cython code. The scipy and sklearn backends report only one read:

77742 	NB551016:788:HTKVCAFX2:4:21506:11077:7795 	chr16 	33961225 	chr21 	16977761 	- 	- 	UU 	27 	60

I am not volunteering to fix the cython code if people agree the latter is the preferred behaviour, and want to keep the cython backend as an option :)

But it does bring up another question: should we simply report the first of the duplicated reads, or should we pick a central read that is more representative of the whole cluster? Here it's quite clear, assuming they are all indeed true duplicates of each other, that most of them have first coordinate 33961227, which is probably the true position of the read. This is a rare case (it can only happen with max mismatch > 0), but it does shift some reads a few nt to the left relative to their likely true coordinate, and at higher and higher resolutions even single-nt shifts might matter.
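A "representative read" could, for instance, be the one at the most frequent coordinate in the cluster. This is purely an illustration of the idea, not anything implemented in this PR:

```python
import numpy as np

def cluster_representative(positions):
    """Pick the most frequent coordinate in a duplicate cluster
    (ties resolve to the smallest coordinate, i.e. the first read in
    a sorted file) - an illustrative alternative to keeping the
    first read as the parent."""
    values, counts = np.unique(positions, return_counts=True)
    return int(values[np.argmax(counts)])

# The cluster from the example above: one read at 33961225, four at
# 33961227, one at 33961228.
positions = [33961225, 33961227, 33961227, 33961227, 33961227, 33961228]
print(cluster_representative(positions))  # prints 33961227
```

On this cluster it picks 33961227 rather than the topmost read's 33961225, matching the intuition above.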

Questions

Can we rely on pairs files always having a header that contains column names? Anton says we probably can, and it makes reading the file a breeze. But if someone processes pairs with unix tools (grep, sed, awk, etc.) it's very easy to lose the header. At the moment my new backends completely ignore the provided -c1, -p1 etc. arguments and just use the column names from the header.

Should we keep the cython backend?

Finally, a couple of interesting things I learned along the way, which were important for implementing this PR but don't need any discussion - just sharing in case someone is curious.

The original reason I implemented the sklearn version was to support the --method max mode, where the relevant distance is the Chebyshev distance: the maximum distance along either dimension. The sklearn trees support a bunch of distance metrics, including "chebyshev", while scipy only supports the Minkowski distance. However, I found out that the Chebyshev distance is equal to the Minkowski distance with the p-norm at infinity, and since scipy supports setting the p-norm, it actually natively supports both "sum" (the default, same as Euclidean distance, with p=2) and "max". Hence the sklearn implementation is not required anymore!
EDIT: I just realized, the "sum" is actually not Euclidean, but Manhattan distance (p=1), isn't it???

An important step to making the tree-based versions work at all was to avoid trying to dedup the unmapped pairs: since they all have the same coordinate, they were breaking the trees (I think the trees kept creating smaller and smaller partitions to reach the target leaf size, but with identical values they ended up huge with no benefit). Therefore I split them off from each chunk and reintroduce them after deduping.
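The metric equivalences above are easy to check directly with scipy (a toy verification, not pairtools code): Minkowski with p=inf behaves as Chebyshev ("max"), and with p=1 as Manhattan ("sum").

```python
import numpy as np
from scipy.spatial import cKDTree

# Two points separated by 3 along one axis and 1 along the other.
tree = cKDTree(np.array([[0, 0], [3, 1]]))

# "max" method: Chebyshev = Minkowski with p=inf -> distance max(3, 1) = 3
assert tree.query_pairs(r=3, p=np.inf) == {(0, 1)}
assert tree.query_pairs(r=2.9, p=np.inf) == set()

# "sum" method: Manhattan = Minkowski with p=1 -> distance 3 + 1 = 4
assert tree.query_pairs(r=4, p=1) == {(0, 1)}
assert tree.query_pairs(r=3.9, p=1) == set()
```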

@Phlya (Member, Author) commented Feb 24, 2022

Note that sklearn recently refactored its nearest-neighbors implementation, so new timings are needed (although the changelog says it's much faster specifically for float64, while our data should be int...)
https://scikit-learn.org/dev/whats_new/v1.1.html#changelog

@agalitsyna (Member) left a comment

Awesome, just introduced some minor suggestions

mark_dups,
)
elif backend in ("scipy", "sklearn"):
streaming_dedup_by_chunk(
A Member commented:

Ilya, does it make sense to create a function streaming_dedup that does the same job as streaming_dedup_by_chunk but with no chunking? That would help us assess the typical number of errors per dataset due to chunking.
If not, then we can probably safely rename this function to streaming_dedup.

The Member Author (Phlya) replied:

Or one could just use a huge chunksize so that all data ends up in one chunk?
Re the name: I don't mind either way, if we also rename the cython one.

A Member replied:

> Does anyone have thoughts regarding relying on the file header containing information about column names? It would make life much easier, but that's significantly breaking away from providing the order of columns as CLI arguments... Are there any pair files out there that don't have a header?

Yes, it looks like HiC-Pro does not rely on the header. However, I'm not sure that the existence of output from other tools without a header is a big problem here. We typically generate the .pairs file with pairtools parse, and as far as I know we always keep the header. For non-pairtools files, we might provide a function, pairtools add_header, that adds a user-specified column order to the header. Since pairtools has a whole set of functions for manipulating headers, it should be easy to create such a function.
Any other opinions? @golobor

The Member Author (Phlya) replied:

I'd be in favour of generally relying on headers. Maybe we should decide and document a default order of columns for the case when the header is missing, and I also like the idea of add_header!

@Phlya (Member, Author) commented Mar 7, 2022

Great suggestions, thank you @agalitsyna! I'll see if I can address them today; if not, tomorrow.

@Phlya (Member, Author) commented Mar 8, 2022

Does anyone have thoughts regarding relying on the file header containing information about column names? It would make life much easier, but it breaks significantly from providing the order of columns as CLI arguments... Are there any pairs files out there that don't have a header?

@golobor (Member) commented Mar 11, 2022

I support relying on a header to infer column names.

  • .pairs is a standardized format, and the column names are a mandatory field. If someone doesn't comply with it, it's their fault, I'd say. (Of course, ideally it'd be nice to provide a utility that sanitizes/adds a header to an input stream.)
  • we already rely on column names in pairtools select.

@Phlya (Member, Author) commented Mar 11, 2022

Excellent, thank you @golobor! Then I suggest keeping the cython backend for now, for backwards compatibility and to allow passing the column order explicitly in case someone relies on that, but adding a deprecation warning? In the long term I think it's easier to just get rid of it, and it is already feature-incomplete compared to scipy or sklearn.

@agalitsyna (Member) commented Mar 16, 2022

A couple of comments:

  • Add the recommended number of reads per chunk to the documentation
  • Add the A-B-C transitivity change to the global updates: it's a functional change, dedup will work differently, and it's important to notify the users

Ready to merge!

@agalitsyna mentioned this pull request Mar 17, 2022
@Phlya (Member, Author) commented Mar 24, 2022

Thank you to @agalitsyna for ironing out some issues and for her help and discussions in the final stage!

As you can see below, the new deduper backends are somewhat slower than cython. I think the difference is not critical in the context of the whole Hi-C processing pipeline, but it shows there is room for improvement. For example, one could split the data into chunks intelligently instead of using a fixed size, so that there are certainly no duplicates between chunks; the chunks could then be processed in parallel. Alternatively, if one didn't care about recording parent read IDs, simply prepending the bottom portion of the previous chunk (instead of the bottom portion of the deduped pairs from the previous chunk) could allow parallel processing without the overhead of choosing just the right place to split off the chunks.
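The "intelligent splitting" idea relies on the pairs being sorted: wherever consecutive positions differ by more than max-mismatch, no duplicate group can span the cut, so chunks separated there are independent. A rough sketch (function name and the single-chromosome-pair simplification are mine, not from the PR):

```python
import numpy as np

def safe_split_points(pos1, max_mismatch, target_chunksize):
    """Indices at which a sorted pos1 array can be cut without ever
    splitting a duplicate group: consecutive positions there differ
    by more than max_mismatch. Takes the first safe gap after each
    target chunk boundary. (Assumes a single chromosome pair.)"""
    gaps = np.where(np.diff(pos1) > max_mismatch)[0] + 1  # cut before these
    splits, next_cut = [], target_chunksize
    for g in gaps:
        if g >= next_cut:
            splits.append(int(g))
            next_cut = g + target_chunksize
    return splits

pos1 = np.array([0, 1, 2, 50, 51, 120, 200, 201])
print(safe_split_points(pos1, max_mismatch=3, target_chunksize=3))
# prints [3, 6]: chunks [0:3], [3:6] and [6:] can be deduped independently
```

Chunks produced this way need no carryover at all, so they could be handed to a process pool with no cross-chunk bookkeeping.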

Sasha is also suggesting we should keep the cython backend since it's faster, and maybe add the feature (retaining parent read IDs) to it later (she might or might not have volunteered for that).

After thorough testing and benchmarking using snakemake (super easy, btw), the results are generally identical between all backends and all parameters I explored, as long as --max-mismatch is set to 0. If it's set to 3, the new backends differ from cython, as I discovered previously, due to different treatment of transitive duplicate clusters. Moreover, the new backends miss a handful of duplicates with very small chunks (1,000 in my case). The difference can be exacerbated by a very small --carryover (10 in this case), but even then the difference is just 8 pairs in a file with almost 8 mln mapped pairs and over 300k duplicates. Performance also suffers at such small chunk sizes, so there is no reason to use them anyway.

Here are some plots with total running times for different parameter combinations.

Chunksize has almost no effect on performance as long as it's at least 10_000, and performance is not affected by carryover or max mismatch. sklearn is slower than scipy.
[figure: total running time vs chunksize, carryover and max mismatch, per backend]

Number of cores has almost no effect on the performance of the sklearn backend, and it's always slower than scipy :(
[figure: total running time vs number of cores for the sklearn backend]

I would like to merge this now, unless there are any more issues that still need to be addressed!

@agalitsyna (Member) commented Mar 24, 2022

I've revived the Cython backend, as it's faster and complies perfectly with the deduplication procedure as defined in the previous version of pairtools.
Since optical duplicate filtering is the future of this module, I added an option to store the parent read ID with the Cython backend as well.
I tested it on a couple of examples with 10 mln reads, and the convergence with scipy/sklearn at 0 mismatches is perfect.

@Phlya , can you check the docstrings in the code? I guess this line requires your close attention:

help="What backend to use: scipy and sklearn are based on KD-trees,"

@Phlya (Member, Author) commented Mar 28, 2022

Now, including all features like saving stats and parent read IDs, scipy is faster than cython!
[figure: running time of scipy vs cython with stats and parent read IDs enabled]

Here is also the max resident memory used:
[figure: max resident memory vs chunksize per backend]

It grows very quickly with chunksize, as expected, and with a 1 mln chunksize it becomes higher than with cython (reaching ~1 GB).
So I propose keeping scipy as the default, with a default chunksize of 100_000 and carryover of 100 (shown here).

For future reference, here is the Snakefile for benchmarking: https://gist.github.com/Phlya/5aeb55ef3d1ecb8b025cad125eb37ce0

@agalitsyna (Member) commented:
I ran broader tests with scipy 1.8.0 and scikit-learn 1.0.2, with and without reporting the parent ID. Both scipy and scikit-learn are faster than our Cython version. As Ilya pointed out, it's probably because of the recent scipy 1.7.0 update that improved metric performance.
That scipy version is not available on pip everywhere, though, so I think we can retain the Cython backend as an alternative for old scipy versions.

@Phlya (Member, Author) commented Mar 29, 2022

Scipy 1.7 requires Python >= 3.7, so I've changed the requirements in the tests to make sure it works. We shouldn't forget to change the actual requirements for the package too!

@agalitsyna agalitsyna merged commit 7e4712f into master Apr 4, 2022
agalitsyna added a commit that referenced this pull request Apr 7, 2022
* Removed OrderedDict remnants: #113

* Pairtools dedup fix of cython forgotten line from recent PR:
#107

* header add_columns function

* pairtools restrict: header improvement

* pairtools stats: warning of dataframe slicing fix

* np.genfromtxt warning resolved:
#71