Chunked dedup #107
Conversation
The code currently works as expected, but there are some outstanding considerations we need to discuss, agree on and implement before it can be up for merging. I wanted to share some explanations, learnings and observations I made along the way, and raise discussion points.

Deduplication requires pairs to be sorted, which ensures that potential duplicates are always close to each other in the file. The cython implementation (I believe it was written by Max) reads pairs one by one and compares neighbours to each other, thus identifying duplicates. The code is fast (about 6 sec per mln pairs), but because it's in cython it is difficult to maintain or modify. In addition to deduplication, the dedup process actually aggregates pair statistics. It used to be done one by one: each pair was pushed into PairCounter, which analyzed it and added a count to the appropriate categories (dups/nodups/unmapped, cis/trans, etc.).

The aim of this PR is to rewrite the deduper using only Python code with minimal loss in performance. This significantly simplifies the code and makes it more maintainable and easier to modify. The idea is to read the pairs in chunks. Each chunk can be deduped efficiently using nearest neighbour search with a KD-tree (or Ball-tree). There is an important consideration for chunking: if the edge of a chunk splits a group of duplicates, deduplication will not work correctly. To avoid that, I save the bottom 1% of the deduped pairs from the previous chunk and prepend it to the new chunk. This way the new chunk is deduped against the previous deduped pairs, and the same "parent" is assigned to the new duplicates, if there are any in the new chunk. I of course don't forget to discard the top pairs that came from the previous chunk before returning the new deduped pairs. (A minimal sketch of this scheme is included at the end of this comment.)

KD-tree based nearest neighbour detection is implemented in scipy and scikit-learn. They offer slightly different features and advantages. At the moment of writing this comment, this PR defaults to using scipy (which is now a dependency of pairtools), and only tries to import sklearn when asked to use sklearn as the backend, though perhaps this option will be removed (see below). The cython code is also still available. When using the scipy or sklearn backend, a new option becomes available: the sklearn implementation supports using multiple cores.

I have timed all three backends with a small file with just over 1 mln pairs.
I think this timing is acceptable for both new options, but of course scipy appears significantly faster than sklearn. I will repeat the benchmarking with a larger file, where a larger chunksize could benefit the multi-core feature of sklearn, and will see if there is any reason at all to keep it as an option. If it's always slower than scipy, there is no point.

Finally, there is a tiny difference in the output between cython and the new implementations: cython reports 5 more pairs than scipy or sklearn for my small 1 mln pairs test file. Here is an example from real data. From this cluster of pairs, which are all duplicates of each other (with max mismatch 2),
cython code reports 2 pairs:
The second of them is certainly a duplicate of the other reads with the coordinate 33961227, however not of the top "parent" read.
I am not volunteering to fix the cython code if people agree the latter is the preferred behaviour and want to keep the cython backend as an option :) But it does bring up another question: should we simply report the first of the duplicated reads, or should we pick a central read that would be more representative of the whole cluster? Here it's quite clear, assuming they are all indeed true duplicates of each other, that most of them have the first coordinate 33961227, and that is probably the true position of the read. This is a rare case (it can only happen with max mismatch > 0), but it does shift some reads a few nt left relative to what is likely their true coordinate, and perhaps with higher and higher resolution even single-nt shifts might matter?

Questions:
- Can we rely on the pairs always having a header that contains column names? Anton says we probably can, and it makes reading in the file a breeze. But if someone processes some pairs with unix tools (grep, sed, awk, etc.), it's very easy to lose the header. At the moment my new backends completely ignore the provided -c1, -p1 etc. arguments and just use the column names from the header.
- Should we keep the cython backend?

Finally, a couple of interesting things I learned along the way, which were important for implementing this PR, but don't need any discussion - just sharing in case someone is curious.
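For illustration, here is a minimal, self-contained sketch of the chunk-plus-carryover scheme described above, using scipy's cKDTree. This is not the actual pairtools code: `dedup_chunk`, `streaming_dedup_sketch` and `carryover_fraction` are hypothetical names, and the real implementation also tracks chromosomes, strands, parent read IDs and stats.

```python
import numpy as np
from scipy.spatial import cKDTree

def dedup_chunk(coords, max_mismatch):
    """Mark duplicates within one chunk of (pos1, pos2) coordinates,
    assuming chrom1/chrom2/strands are identical (a simplification)."""
    tree = cKDTree(coords)
    # Chebyshev metric (p=inf): both coordinates must be within max_mismatch.
    neighbours = tree.query_pairs(r=max_mismatch, p=np.inf)
    is_dup = np.zeros(len(coords), dtype=bool)
    for i, j in neighbours:  # i < j, and pairs arrive position-sorted
        is_dup[j] = True     # keep the earlier read, mark the later one as a duplicate
    return is_dup

def streaming_dedup_sketch(chunks, max_mismatch, carryover_fraction=0.01):
    """Dedup chunks one by one, prepending the bottom fraction of the previously
    deduped pairs so duplicate groups split by a chunk boundary are still found."""
    carryover = np.empty((0, 2), dtype=int)
    for chunk in chunks:                     # each chunk: sorted (n, 2) int array
        merged = np.concatenate([carryover, chunk])
        is_dup = dedup_chunk(merged, max_mismatch)
        n_old = len(carryover)
        yield chunk[~is_dup[n_old:]]         # discard the prepended old pairs
        kept = merged[~is_dup]
        n_keep = max(1, int(len(kept) * carryover_fraction))
        carryover = kept[-n_keep:]           # bottom of the deduped chunk
```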
Note that sklearn recently refactored its implementation of nearest neighbors and new timings are needed (although it says it's much faster specifically for float64, while our data should be int...)
Awesome, just introduced some minor suggestions
pairtools/pairtools_dedup.py
        mark_dups,
    )
elif backend in ("scipy", "sklearn"):
    streaming_dedup_by_chunk(
Ilya, does it make sense to create a function "streaming_dedup" which will do the same job as streaming_dedup_by_chunk but with no chunking? That would help us assess the typical number of errors per dataset due to chunking.
If not, then we can probably safely rename this function to streaming_dedup.
Or one could just use a huge chunksize that would have all data in one chunk?
Re name: I don't mind either way, if we are renaming the cython one.
Does anyone have thoughts regarding relying on the file header containing information about column names? It would make life much easier, but that's significantly breaking away from providing the order of columns as CLI arguments... Are there any pair files out there that don't have a header?
Yes, it looks like HiC-Pro does not rely on the header. However, I'm not sure that the existence of output from other tools without a header is a big problem here. We typically generate the .pairs file with pairtools parse, and we always keep the header, as far as I know. For non-pairtools files, we might provide a function pairtools add_header that would add a user-specified order of columns to the header. Since pairtools has a whole set of functions for manipulating headers, it might be easy to create such a function.
Any other opinions? @golobor
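As a point of reference, relying on the header essentially means reading the #columns: line defined by the .pairs format. A minimal sketch of the idea (hypothetical helper, not the pairtools API), with a fallback to a user-specified order for headerless files:

```python
def read_column_names(path, default_columns=None):
    """Return column names from the #columns: line of a .pairs header,
    or default_columns if the file has no header (hypothetical helper)."""
    with open(path) as f:
        for line in f:
            if not line.startswith("#"):
                break  # data section reached without a #columns: line
            if line.startswith("#columns:"):
                return line.split(":", 1)[1].split()
    return default_columns
```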
I'd be in favour of generally relying on headers. Maybe we should decide and document the default order of columns in case the header is missing, and also I like the idea about add_header!
Great suggestions, thank you @agalitsyna! I'll see if I can address them today, if not - tomorrow.
> Does anyone have thoughts regarding relying on the file header containing information about column names? It would make life much easier, but that's significantly breaking away from providing the order of columns as CLI arguments... Are there any pair files out there that don't have a header?

I support relying on a header to infer column names.
Excellent, thank you @golobor! Then I suggest keeping the cython backend for now, for backwards compatibility and to allow passing the column order explicitly in case someone relies on it, but adding a deprecation warning? I think in the long term it's easier to just get rid of it, and it is already feature-incomplete compared to scipy or sklearn.
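The deprecation warning itself would be a one-liner; a sketch under the assumption that the backend name is available as a string (the helper name is made up, not existing pairtools code):

```python
import warnings

def warn_if_cython_backend(backend):
    """Hypothetical helper: nudge users of the legacy backend towards scipy."""
    if backend == "cython":
        warnings.warn(
            "The cython dedup backend is deprecated; please switch to the scipy backend.",
            DeprecationWarning,
            stacklevel=2,
        )
```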
A couple of comments:
Ready to merge!
Thank you to @agalitsyna for ironing out some issues and for her help and discussions in the final stage!

As you can see below, the new deduper backends are somewhat slower than cython. I think the difference is not critical in the context of the whole Hi-C processing pipeline, but it shows that there is room for improvement. For example, one could split the data into chunks intelligently instead of using a fixed size, so that there are certainly no duplicates between chunks; then chunks could be processed in parallel (see the sketch after this comment). Potentially, if one didn't care about recording parent read IDs, simply prepending the bottom portion of the previous chunk (instead of the bottom portion of the deduped pairs from the previous chunk) could allow parallel processing without the overhead of choosing just the right place to split off the chunks. Sasha is also suggesting we should keep the cython backend since it's faster, and maybe add the feature (retaining parent read IDs) to it later (she might or might not have volunteered for that).

After thorough testing and benchmarking using snakemake (super easy btw), the results are generally identical between all backends and all parameters I explored, as long as …

Here are some plots with total running times for different parameter combinations. Chunksize has almost no effect on performance, as long as it's at least 10_000, and performance is not affected by carryover or by max mismatch. sklearn is slower than scipy. Number of cores has almost no effect on performance of the sklearn backend, and it's always slower than scipy :(

I would like to merge this now, unless there are any more issues that still need to be addressed!
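A rough sketch of the intelligent-splitting idea mentioned above, under the assumption that pairs are already position-sorted and that a gap in pos1 larger than the max mismatch guarantees no duplicate group spans the split (function and argument names are made up):

```python
import numpy as np

def safe_split_points(pos1, max_mismatch, target_chunksize):
    """Pick chunk boundaries only where consecutive sorted pairs are more than
    max_mismatch apart, so no duplicate group can straddle two chunks."""
    safe = np.flatnonzero(np.diff(pos1) > max_mismatch) + 1  # candidate boundaries
    splits, last = [], 0
    for s in safe:
        if s - last >= target_chunksize:  # keep chunks roughly target-sized
            splits.append(s)
            last = s
    return splits  # chunks pos1[0:s1], pos1[s1:s2], ... could be deduped in parallel

# Example: with max_mismatch=3, a split is only allowed after the large gap at index 5.
print(safe_split_points(np.array([10, 11, 12, 13, 14, 100, 101, 102]), 3, 5))  # [5]
```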
I've revived the Cython backend, as it's faster and complies perfectly with the definition of the deduplication procedure from the previous version of pairtools. @Phlya, can you check the docstrings in the code? I guess this line requires your close attention: pairtools/pairtools/pairtools_dedup.py, line 95 (commit 37a4516).
Now, including all features like saving stats and parent read IDs, scipy is faster than cython! Here is also the max resident memory used: it grows very quickly with chunksize, as expected, and with 1 mln chunksize it becomes higher than with cython (and reaches ~1 Gb). For future reference, here is the Snakefile for benchmarking: https://gist.github.com/Phlya/5aeb55ef3d1ecb8b025cad125eb37ce0
I ran broader tests with scipy 1.8.0 and scikit-learn 1.0.2, with and without reporting the parent id. Both scipy and scikit-learn are faster than our Cython version. As Ilya pointed out, it's probably because of the recent scipy 1.7.0 update that improved the metrics performance.
Scipy 1.7 requires python >=3.7, so I changed the requirements in the tests for now to make sure it works. We shouldn't forget to change the actual requirements for the package too!
- Rewrote the deduplicator based on chunking and cKDTree. This removes the requirement for cython code there, to simplify maintenance (decided to keep it for now).
- Added a feature to annotate the duplicates with the read ID of the deduped "parent" read. This will allow checking which duplicates are optical/clustering, and which are "real" PCR duplicates. This is critical for correct estimation of complexity in test sequencing, in particular on patterned flowcells of recent Illumina instruments.
- An observation: with the current implementation, if read A and read B are duplicates, and B and C are duplicates, but A and C are not (which can happen with non-zero --max-mismatch), C is not reported as a duplicate. I am not sure how it was previously, and what the "right" way is (see the toy example below).
- This PR also adds the .add_pairs_from_dataframe() method to PairCounter in stats to save stats from each chunk during deduping.
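A toy illustration of the A/B/C observation above (scipy-based, not the actual dedup code): with max mismatch 2 and reads at positions 0, 2 and 4, A/B and B/C are within the mismatch radius but A/C is not, so the "duplicate" relation is not transitive and the outcome depends on how the neighbour graph is traversed.

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy positions of reads A, B, C along one coordinate (other coordinates equal).
positions = np.array([[0], [2], [4]])
tree = cKDTree(positions)

# With max_mismatch = 2 (Chebyshev metric), A-B and B-C are within the radius,
# but A-C (distance 4) is not; note that (0, 2) is absent from the result.
print(sorted(tree.query_pairs(r=2, p=np.inf)))  # [(0, 1), (1, 2)]
```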