
Chunked dedup #107

Merged · 38 commits merged into master · Apr 4, 2022

Conversation

@Phlya (Member) commented Sep 6, 2021

Rewrote the deduplicator based on chunking and cKDTree. This removes the requirement for Cython code there, simplifying maintenance.

  • Haven't removed the cython code yet
    (decided to keep it for now)

Added a feature to annotate duplicates with the read ID of the deduped "parent" read. This will make it possible to check which duplicates are optical/clustering and which are "real" PCR duplicates. This is critical for correctly estimating library complexity from test sequencing, in particular on the patterned flowcells of recent Illumina instruments.

An observation: with the current implementation, if reads A and B are duplicates, and B and C are duplicates, but A and C are not (which can happen with a non-zero --max-mismatch), then C is not reported as a duplicate. I am not sure how it behaved previously, or what the "right" behaviour is.
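The A/B/C situation is easy to reproduce with a toy example (illustrative coordinates, not from the PR's data): with --max-mismatch 2 and Chebyshev ("max") distance, scipy's cKDTree finds A~B and B~C as duplicate pairs, but not A~C.

```python
import numpy as np
from scipy.spatial import cKDTree

# Three reads A, B, C at positions 0, 2 and 4 along one coordinate.
pos = np.array([[0, 0], [2, 0], [4, 0]])

tree = cKDTree(pos)
# All pairs within Chebyshev distance 2 (Minkowski with p=inf):
pairs = tree.query_pairs(r=2, p=np.inf)
print(sorted(pairs))  # [(0, 1), (1, 2)] - A~B and B~C, but not A~C
```

Whether C should then be marked as a duplicate depends on whether duplicate groups are defined pairwise or transitively, which is exactly the behavioural difference discussed below.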

Also, this PR adds the .add_pairs_from_dataframe() method to PairCounter in stats, to save stats from each chunk during deduping.

@Phlya requested a review from @golobor on September 6, 2021 at 15:38
@Phlya (Member, Author) commented Sep 8, 2021

The code currently works as expected, but there are some outstanding considerations we need to discuss/agree on and implement before it can be up for merging. I wanted to share some explanation, learnings and observations I made along the way, and raise discussion points.

Deduplication requires pairs to be sorted, which ensures that potential duplicates are always close to each other in the file. The Cython implementation (I believe it was written by Max) reads pairs one by one and compares neighbours to each other, thus identifying duplicates. The code is fast (about 6 s per mln pairs), but because it is in Cython it is difficult to maintain or modify.

In addition to deduplication, dedup also aggregates pair statistics. This used to be done one by one: each pair was pushed into PairCounter, which analyzed it and added a count to the appropriate categories (dups/nodups/unmapped, cis/trans, etc.).

The aim of this PR is to rewrite the deduper using only Python code with minimal loss in performance. This significantly simplifies the code and makes it more maintainable and easier to modify. The idea is to read the pairs in chunks. Each chunk can be deduped efficiently with a nearest-neighbour search using a KD-tree (or ball tree). There is an important consideration for chunking: if the edge of a chunk splits a group of duplicates, deduplication will not work correctly. To avoid that, I save the bottom 1% of the deduped pairs from the previous chunk and prepend it to the new chunk. This way the new chunk is deduped against the previous deduped pairs, and the same "parent" is assigned to any new duplicates in the new chunk. I of course don't forget to discard the pairs carried over from the previous chunk before returning the new deduped pairs.
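The chunking-with-carryover scheme can be sketched roughly as follows. This is a simplified illustration using scipy's cKDTree, not the actual pairtools code: the function names are made up, pairs are reduced to bare (pos1, pos2) coordinates, and parent-ID bookkeeping is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def dedup_chunk(coords, max_mismatch, n_carried=0):
    """Mark duplicates within one chunk of (pos1, pos2) coordinates.
    The first n_carried rows are already-deduped pairs carried over
    from the previous chunk; their flags are dropped from the result."""
    dup = np.zeros(len(coords), dtype=bool)
    tree = cKDTree(coords)
    # Any read with an earlier neighbour within max_mismatch (Chebyshev
    # distance) is a duplicate; chains collapse transitively onto the
    # first read of the group.
    for i, j in tree.query_pairs(r=max_mismatch, p=np.inf):
        dup[j] = True
    return dup[n_carried:]

def streaming_dedup(chunks, max_mismatch, carryover_frac=0.01):
    """Dedup a stream of sorted chunks, prepending the bottom fraction
    of the previous chunk's deduped pairs so that duplicate groups
    split across a chunk boundary are still detected."""
    carried = np.empty((0, 2), dtype=int)
    for chunk in chunks:
        coords = np.vstack([carried, chunk])
        dup = dedup_chunk(coords, max_mismatch, n_carried=len(carried))
        kept = chunk[~dup]
        n_carry = max(1, int(len(kept) * carryover_frac))
        carried = kept[-n_carry:]  # bottom of the deduped pairs
        yield kept
```

With this structure, a duplicate sitting at the start of a new chunk is matched against the carried-over deduped pairs, just as described above.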

KD-tree based nearest-neighbour detection is implemented in both scipy and scikit-learn, with slightly different features and advantages. At the moment of writing this comment, this PR defaults to scipy (which is now a dependency of pairtools) and only tries to import sklearn when it is requested as the backend, but perhaps this option will be removed (see below). The cython code is also still available.

When using the scipy or sklearn backend, a new option becomes available: --save-parent-id, which writes the readID of the "parent" read (the one kept in the .dedup pairs) as a separate column in the .dups pairs file. This will enable smarter complexity estimation and classification of duplicates into PCR/optical/clustering etc., using the flowcell coordinates stored in read names. The --chunksize option is also available for these backends.

The sklearn implementation supports using multiple cores; the --n-proc argument picks the number of cores. So far I have not observed any benefit from using more than one core, but I have not investigated it carefully. Perhaps it would help with larger chunksizes (I have only tried 1 mln at most so far).

I have timed all three backends with a small file with just over 1 mln pairs.

  • cython takes about 6 seconds
  • scipy takes about 10 seconds
  • sklearn takes about 15 sec
    (for --method sum they are all a tiny bit slower than --method max)

I think these timings are acceptable for both new options, although scipy is clearly faster than sklearn.
Interestingly, in this small benchmark chunksize had little or no impact on the performance of either the scipy or the sklearn method for chunksizes between 10,000 and 10,000,000 (i.e. all data in one chunk); 1,000 caused a significant loss in performance.

I will repeat benchmarking with a larger file where a larger chunksize could benefit the multi-core feature of sklearn, and will see if there is any reason at all to keep it as an option. If it's always slower than scipy there is no point.

Finally, there is a tiny difference in the output between cython and the new implementations. Cython reports 5 more pairs than scipy or sklearn for my small 1 mln pairs test file with --max-mismatch 2. Upon closer inspection, this arises when there is a group of at least three reads, call them A, B and C, where B is considered a duplicate of A, and C a duplicate of B but not of A: cython reports C as a non-duplicate.

Here is an example from real data: a cluster of pairs that are all duplicates of each other (with max mismatch 2).

383227 	NB551016:788:HTKVCAFX2:4:21506:11077:7795 	chr16 	33961225 	chr21 	16977761 	- 	- 	UU 	27 	60
383228 	NB551016:788:HTKVCAFX2:1:11310:16231:17104 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	9 	60
383229 	NB551016:788:HTKVCAFX2:2:21103:16887:17797 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	12 	60
383230 	NB551016:788:HTKVCAFX2:3:11607:11978:17519 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	22 	60
383231 	NB551016:788:HTKVCAFX2:4:11411:23554:18663 	chr16 	33961227 	chr21 	16977761 	- 	- 	UU 	14 	60
383232 	NB551016:788:HTKVCAFX2:3:11601:3973:6179 	chr16 	33961228 	chr21 	16977761 	- 	- 	UU 	11 	60

The cython code reports 2 pairs:

77742 	NB551016:788:HTKVCAFX2:4:21506:11077:7795 	chr16 	33961225 	chr21 	16977761 	- 	- 	UU 	27 	60
77743 	NB551016:788:HTKVCAFX2:3:11601:3973:6179 	chr16 	33961228 	chr21 	16977761 	- 	- 	UU 	11 	60

The second of them is certainly a duplicate of the other reads with coordinate 33961227, but not of the top "parent" read.
I think this is actually a bug in the cython code. The scipy and sklearn backends report only one read:

77742 	NB551016:788:HTKVCAFX2:4:21506:11077:7795 	chr16 	33961225 	chr21 	16977761 	- 	- 	UU 	27 	60

I am not volunteering to fix the cython code if people agree the latter is the preferred behaviour, and want to keep the cython backend as an option :)

But it does bring up another question: should we simply report the first of the duplicated reads, or should we pick a central read that is more representative of the whole cluster? Here it's quite clear, assuming they are all indeed true duplicates of each other, that most of them have first coordinate 33961227, which is probably the true position of the read. This is a rare case (it can only happen with max mismatch > 0), but it does shift some reads a few nt to the left relative to their likely true coordinate, and at higher and higher resolutions even single-nt shifts might matter.
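A "representative read" could, for instance, be the one at the most frequent coordinate in the cluster. This is purely an illustration of the idea, not anything implemented in this PR:

```python
import numpy as np

def cluster_representative(positions):
    """Pick the most frequent coordinate in a duplicate cluster
    (ties resolve to the smallest coordinate, i.e. the first read in
    a sorted file) - an illustrative alternative to keeping the
    first read as the parent."""
    values, counts = np.unique(positions, return_counts=True)
    return int(values[np.argmax(counts)])

# The cluster from the example above: one read at 33961225, four at
# 33961227, one at 33961228.
positions = [33961225, 33961227, 33961227, 33961227, 33961227, 33961228]
print(cluster_representative(positions))  # prints 33961227
```

On this cluster it picks 33961227 rather than the topmost read's 33961225, matching the intuition above.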

Questions

Can we rely on pairs files always having a header that contains column names? Anton says we probably can, and it makes reading the file a breeze. But if someone processes pairs with unix tools (grep, sed, awk, etc.) it's very easy to lose the header. At the moment my new backends completely ignore the provided -c1, -p1 etc. arguments and just use the column names from the header.

Should we keep the cython backend?

Finally, a couple of interesting things I learned along the way, which were important for implementing this PR but don't need any discussion - just sharing in case someone is curious.

The original reason I implemented the sklearn version was to support the --method max mode, where the relevant distance is the Chebyshev distance: the maximum distance along either dimension. The sklearn trees support a bunch of distance metrics, including "chebyshev", while scipy only supports the Minkowski distance. However, I found out that the Chebyshev distance is equal to the Minkowski distance with the p-norm at infinity, and since scipy supports setting the p-norm, it actually natively supports both "sum" (the default, same as Euclidean distance, with p=2) and "max". Hence the sklearn implementation is not required anymore!
EDIT: I just realized, the "sum" is actually not Euclidean, but Manhattan distance (p=1), isn't it???

An important step to making the tree-based versions work at all was to avoid trying to dedup the unmapped pairs: since they all have the same coordinate, they were breaking the trees (I think the trees kept creating smaller and smaller partitions to reach the target leaf size, but with identical values they ended up huge with no benefit). Therefore I split them off from each chunk and reintroduce them after deduping.
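The metric equivalences above are easy to check directly with scipy (a toy verification, not pairtools code): Minkowski with p=inf behaves as Chebyshev ("max"), and with p=1 as Manhattan ("sum").

```python
import numpy as np
from scipy.spatial import cKDTree

# Two points separated by 3 along one axis and 1 along the other.
tree = cKDTree(np.array([[0, 0], [3, 1]]))

# "max" method: Chebyshev = Minkowski with p=inf -> distance max(3, 1) = 3
assert tree.query_pairs(r=3, p=np.inf) == {(0, 1)}
assert tree.query_pairs(r=2.9, p=np.inf) == set()

# "sum" method: Manhattan = Minkowski with p=1 -> distance 3 + 1 = 4
assert tree.query_pairs(r=4, p=1) == {(0, 1)}
assert tree.query_pairs(r=3.9, p=1) == set()
```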

@Phlya (Member, Author) commented Feb 24, 2022

Note that sklearn recently refactored its nearest-neighbors implementation, so new timings are needed (although the changelog says it's much faster specifically for float64, while our data should be int...)
https://scikit-learn.org/dev/whats_new/v1.1.html#changelog

@agalitsyna (Member) left a comment

Awesome, just introduced some minor suggestions

mark_dups,
)
elif backend in ("scipy", "sklearn"):
streaming_dedup_by_chunk(
A Member commented:

Ilya, does it make sense to create a function streaming_dedup that does the same job as streaming_dedup_by_chunk but with no chunking? That would help us assess the typical number of errors per dataset due to chunking.
If not, then we can probably safely rename this function to streaming_dedup.

The Member Author (Phlya) replied:

Or one could just use a huge chunksize so that all data ends up in one chunk?
Re the name: I don't mind either way, if we also rename the cython one.

A Member replied:

> Does anyone have thoughts regarding relying on the file header containing information about column names? It would make life much easier, but that's significantly breaking away from providing the order of columns as CLI arguments... Are there any pair files out there that don't have a header?

Yes, it looks like HiC-Pro does not rely on the header. However, I'm not sure that the existence of output from other tools without a header is a big problem here. We typically generate the .pairs file with pairtools parse, and as far as I know we always keep the header. For non-pairtools files, we might provide a function, pairtools add_header, that adds a user-specified column order to the header. Since pairtools has a whole set of functions for manipulating headers, it should be easy to create such a function.
Any other opinions? @golobor

The Member Author (Phlya) replied:

I'd be in favour of generally relying on headers. Maybe we should decide and document a default order of columns for the case when the header is missing, and I also like the idea of add_header!

@Phlya (Member, Author) commented Mar 7, 2022

Great suggestions, thank you @agalitsyna! I'll see if I can address them today; if not, tomorrow.

@Phlya (Member, Author) commented Mar 8, 2022

Does anyone have thoughts regarding relying on the file header containing information about column names? It would make life much easier, but it breaks significantly from providing the order of columns as CLI arguments... Are there any pairs files out there that don't have a header?

@golobor (Member) commented Mar 11, 2022

I support relying on a header to infer column names.

  • .pairs is a standardized format, and the column names are a mandatory field. If someone doesn't comply with it, it's their fault, I'd say. (Of course, ideally it'd be nice to provide a utility that sanitizes/adds a header to an input stream.)
  • we already rely on column names in pairtools select.

@Phlya (Member, Author) commented Mar 11, 2022

Excellent, thank you @golobor! Then I suggest keeping the cython backend for now, for backwards compatibility and to allow passing the column order explicitly in case someone relies on that, but adding a deprecation warning? In the long term I think it's easier to just get rid of it, and it is already feature-incomplete compared to scipy or sklearn.

@agalitsyna (Member) commented Mar 16, 2022

A couple of comments:

  • Add the recommended number of reads per chunk to the documentation
  • Add the A-B-C transitivity change to the global updates: it's a functional change, dedup will work differently, and it's important to notify the users

Ready to merge!

@agalitsyna mentioned this pull request Mar 17, 2022
@Phlya (Member, Author) commented Mar 24, 2022

Thank you to @agalitsyna for ironing out some issues and for her help and discussions in the final stage!

As you can see below, the new deduper backends are somewhat slower than cython. I think the difference is not critical in the context of the whole Hi-C processing pipeline, but it shows there is room for improvement. For example, one could split the data into chunks intelligently instead of using a fixed size, so that there are certainly no duplicates between chunks; the chunks could then be processed in parallel. Alternatively, if one didn't care about recording parent read IDs, simply prepending the bottom portion of the previous chunk (instead of the bottom portion of the deduped pairs from the previous chunk) could allow parallel processing without the overhead of choosing just the right place to split off the chunks.
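The "intelligent splitting" idea relies on the pairs being sorted: wherever consecutive positions differ by more than max-mismatch, no duplicate group can span the cut, so chunks separated there are independent. A rough sketch (function name and the single-chromosome-pair simplification are mine, not from the PR):

```python
import numpy as np

def safe_split_points(pos1, max_mismatch, target_chunksize):
    """Indices at which a sorted pos1 array can be cut without ever
    splitting a duplicate group: consecutive positions there differ
    by more than max_mismatch. Takes the first safe gap after each
    target chunk boundary. (Assumes a single chromosome pair.)"""
    gaps = np.where(np.diff(pos1) > max_mismatch)[0] + 1  # cut before these
    splits, next_cut = [], target_chunksize
    for g in gaps:
        if g >= next_cut:
            splits.append(int(g))
            next_cut = g + target_chunksize
    return splits

pos1 = np.array([0, 1, 2, 50, 51, 120, 200, 201])
print(safe_split_points(pos1, max_mismatch=3, target_chunksize=3))
# prints [3, 6]: chunks [0:3], [3:6] and [6:] can be deduped independently
```

Chunks produced this way need no carryover at all, so they could be handed to a process pool with no cross-chunk bookkeeping.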

Sasha is also suggesting we should keep the cython backend since it's faster, and maybe add the feature (retaining parent read IDs) to it later (she might or might not have volunteered for that).

After thorough testing and benchmarking using snakemake (super easy, btw), the results are generally identical between all backends and all parameters I explored, as long as --max-mismatch is set to 0. If it's set to 3, the new backends differ from cython, as I discovered previously, due to different treatment of transitive duplicate clusters. Moreover, the new backends miss a handful of duplicates with very small chunks (1,000 in my case). The difference can be exacerbated by a very small --carryover (10 in this case), but even then the difference is just 8 pairs in a file with almost 8 mln mapped pairs and over 300k duplicates. Performance also suffers at such small chunk sizes, so there is no reason to use them anyway.

Here are some plots with total running times for different parameter combinations.

Chunksize has almost no effect on performance as long as it's at least 10_000, and performance is not affected by carryover or max mismatch. sklearn is slower than scipy.
[figure: total running time vs chunksize, carryover and max mismatch, per backend]

Number of cores has almost no effect on the performance of the sklearn backend, and it's always slower than scipy :(
[figure: total running time vs number of cores for the sklearn backend]

I would like to merge this now, unless there are any more issues that still need to be addressed!

@agalitsyna (Member) commented Mar 24, 2022

I've revived the Cython backend, as it's faster and complies perfectly with the deduplication procedure as defined in the previous version of pairtools.
Since optical duplicate filtering is the future of this module, I added an option to store the parent read ID with the Cython backend as well.
I tested it on a couple of examples with 10 mln reads, and the convergence with scipy/sklearn at 0 mismatches is perfect.

@Phlya , can you check the docstrings in the code? I guess this line requires your close attention:

help="What backend to use: scipy and sklearn are based on KD-trees,"

@Phlya (Member, Author) commented Mar 28, 2022

Now, including all features like saving stats and parent read IDs, scipy is faster than cython!
[figure: running time of scipy vs cython with stats and parent read IDs enabled]

Here is also the max resident memory used:
[figure: max resident memory vs chunksize per backend]

It grows very quickly with chunksize, as expected, and with a 1 mln chunksize it becomes higher than with cython (reaching ~1 GB).
So I propose keeping scipy as the default, with a default chunksize of 100_000 and carryover of 100 (shown here).

For future reference, here is the Snakefile for benchmarking: https://gist.github.com/Phlya/5aeb55ef3d1ecb8b025cad125eb37ce0

@agalitsyna (Member) commented:
I ran broader tests with scipy 1.8.0 and scikit-learn 1.0.2, with and without reporting the parent ID. Both scipy and scikit-learn are faster than our Cython version. As Ilya pointed out, it's probably because of the recent scipy 1.7.0 update that improved metric performance.
That scipy version is not available on pip everywhere, though, so I think we can retain the Cython backend as an alternative for old scipy versions.

@Phlya (Member, Author) commented Mar 29, 2022

Scipy 1.7 requires Python >= 3.7, so I've changed the requirements in the tests to make sure it works. We shouldn't forget to change the actual requirements for the package too!

@agalitsyna agalitsyna merged commit 7e4712f into master Apr 4, 2022
agalitsyna added a commit that referenced this pull request Apr 7, 2022
* Removed OrderedDict remnants: #113

* Pairtools dedup fix of cython forgotten line from recent PR:
#107

* header add_columns function

* pairtools restrict: header improvement

* pairtools stats: warning of dataframe slicing fix

* np.genfromtxt warning resolved:
#71