This issue was moved to a discussion.

improve distance binning for FF,RR,FR,RF pairs "scalings" in stats output #81

Closed
sergpolly opened this issue Feb 21, 2020 · 10 comments

Comments

@sergpolly
Member

sergpolly commented Feb 21, 2020

https://github.com/mirnylab/pairtools/blob/d1ddf9c39a336662f7fc725fa5a70ec68df9ba95/pairtools/pairtools_stats.py#L147

consider replacing it with something more readable and usable, e.g. @mimakaev's robust bins:

# ~10 bins per order of magnitude
bins = np.logspace(0,9, num = 9*10+1,dtype=int)
bins = np.unique(bins)
bins = np.cumsum(np.sort(np.r_[1,np.diff(bins)]))

currently we have:

min_log10_dist = 0
max_log10_dist = 9
log10_dist_bin_step = 0.25
# plain int instead of np.int, which was removed from recent NumPy releases
bins = np.r_[0, np.round(10**np.arange(min_log10_dist, max_log10_dist+0.001, log10_dist_bin_step)).astype(int)]

which are also non-decreasing, but are too sparsely spaced (a step of 0.25 in log10 gives only 4 bins per order of magnitude), and the code is hard to read
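To make the comparison concrete, here is a minimal pure-Python sketch of both schemes. It is not the pairtools implementation: the function names `robust_bins`, `current_bins`, and `bin_widths` and the `per_decade` parameter are hypothetical, and the integer truncation only approximately mirrors what `np.logspace(..., dtype=int)` does.

```python
def robust_bins(max_log10=9, per_decade=10):
    """Integer log-spaced edges whose widths never decrease (sketch)."""
    n = max_log10 * per_decade
    # Truncate log-spaced values to integers and drop duplicates,
    # roughly like np.logspace(..., dtype=int) followed by np.unique.
    edges = sorted({int(10 ** (i / per_decade)) for i in range(n + 1)})
    # Sort the widths (with a leading 1) and accumulate them,
    # like np.cumsum(np.sort(np.r_[1, np.diff(bins)])).
    sorted_widths = sorted([1] + [b - a for a, b in zip(edges, edges[1:])])
    out, total = [], 0
    for w in sorted_widths:
        total += w
        out.append(total)
    return out


def current_bins(min_log10=0, max_log10=9, step=0.25):
    """Edges as in the current code: 0 followed by rounded powers of 10."""
    n = int(round((max_log10 - min_log10) / step))
    return [0] + [round(10 ** (min_log10 + k * step)) for k in range(n + 1)]


def bin_widths(edges):
    """Differences between consecutive edges."""
    return [b - a for a, b in zip(edges, edges[1:])]


robust = robust_bins()
current = current_bins()
print(len(current), len(robust))  # the robust scheme has more edges
print(current[:6])                # [0, 1, 2, 3, 6, 10]
# Both schemes have non-decreasing bin widths.
print(bin_widths(current) == sorted(bin_widths(current)))
print(bin_widths(robust) == sorted(bin_widths(robust)))
```

Sorting the widths before accumulating them is what guarantees monotonically growing bins even where integer truncation makes the raw log-spaced edges irregular.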

@sergpolly
Member Author

also, should we expose number_of_bins_per_order_of_magnitude as a parameter in pairtools/distiller, or just stick with a fixed value?

in the end, pairtools stats is more of a QC tool than an analysis tool

@mimakaev

mimakaev commented Feb 21, 2020

So the problem is matching bins between cooler and pairtools. It's a little tricky. There are a few options.

  • One is the current one: not to try to match them.
  • Another is to match them at 1kb.
  • A third is to match them at 100bp, 200bp, and 1kb, making 1kb a perfect resolution (and sacrificing performance at 100bp and 200bp). (100, 500, 1000) is also an option, but 200 and 500 cannot both be matched (500/200 is 2.5, not an integer).
  • A fourth is to do a partial match (so that the pairs bins are reducible to the cooler bins).

Current proposal for nice cooler bins: 1,2,3,4,5,6,8,10,13,16,20,25,32,40,50,63,79,100,

at 1kb those bins become: 1000,2000,3000,4000,5000,6000,8000,10000,13000,16000,20000,25000,32000,40000,50000,63000,79000...
at 100bp those bins become:
100,200,300,400,500,600,800,1000,1300,1600,2000,2500,3200,4000,5000,6300,7900,10000,12600,15800,20000,25100,31600,39800,50100,63100,79400,100000,125900,158500,199500,251200,316200,398100,501200,631000,794300,1000000,1258900,1584900,1995300,2511900,3162300,3981100,5011900,6309600,7943300

The two are clearly different
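A short sketch of why the two differ, assuming the cooler bin edges at a given resolution are simply the "nice" bin numbers multiplied by the resolution in bp (the helper `edges_in_bp` is a hypothetical name). It lists the edges in the overlapping range that exist at one resolution but not the other:

```python
# "Nice" cooler bins from the proposal above, in units of bins.
nice = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 32, 40, 50, 63, 79, 100]


def edges_in_bp(bins, resolution):
    """Convert bin numbers to genomic distances in bp."""
    return [b * resolution for b in bins]


at_1kb = edges_in_bp(nice, 1000)
at_100bp = edges_in_bp(nice, 100)

# Compare only the range covered by both lists: 1000 bp to 10000 bp.
lo, hi = at_1kb[0], at_100bp[-1]
only_1kb = [e for e in at_1kb if lo <= e <= hi and e not in at_100bp]
only_100bp = [e for e in at_100bp if lo <= e <= hi and e not in at_1kb]

print(only_1kb)    # [3000, 6000, 8000]
print(only_100bp)  # [1300, 1600, 2500, 3200, 6300, 7900]
```

Since 3000, 6000, and 8000 are not edges at 100bp, the 100bp bins cannot be merged into the 1kb bins without splitting some of them.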

pairtools bins not matched to cooler:
...126,158,200,251,316,398,501,631,794,1000,1259,1585,1995,2512,3162,3981,5012,6310,7943,10000,12589,15849,19953,25119,31623,

pairtools bins matched to cooler at 1kb:
1,2,3,4,5,6,8,10,13,16,20,25,32,40,50,63,79,100,126,158,200,251,316,398,501,631,794,1000,2000,3000,4000,5000,6000,8000,10000,13000,16000,20000,25000,32000,40000,50000,63000,

pairtools bins matched to cooler at 1kb and 100bp/200bp if 100bp/200bp cooler uses modified bins:
1,2,3,4,5,6,8,10,13,16,20,25,32,40,50,63,79,100,200,400,600,1000,2000,3000,4000,5000,6000,8000,10000,13000,16000,20000,25000,32000,40000,50000,63000

And bins matched to cooler at 100, 200, 1000bp resolutions with extra bins for pairs.
1,2,3,4,5,6,8,10,13,16,20,25,32,40,50,63,79,100,126,159,200,252,318,400,490,600,693,800,1000,1260,1590,2000,3000,4000,5000,6000,6930,8000,10000,13000,16000,20000,25000,32000,40000,50000,63000 (bins in italics are the ones that were added)

I couldn't think of a more general solution. Powers of two obviously have one, but not here...
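The "reducible" requirement from the fourth option can be sketched as follows: pairs bins reduce to cooler bins when every cooler edge is also a pairs edge, so counts can be summed without splitting a bin. The function `coarsen` and the counts are hypothetical; the edges are short fragments of the lists above.

```python
def coarsen(fine_edges, fine_counts, coarse_edges):
    """Sum counts from fine bins into coarse bins.

    Bin i spans [edges[i], edges[i+1]). Raises ValueError when a coarse
    edge is not also a fine edge, i.e. the fine bins are not reducible.
    """
    missing = [e for e in coarse_edges if e not in fine_edges]
    if missing:
        raise ValueError(f"coarse edges absent from fine edges: {missing}")
    idx = [fine_edges.index(e) for e in coarse_edges]
    return [sum(fine_counts[a:b]) for a, b in zip(idx, idx[1:])]


cooler_1kb = [1000, 2000, 3000, 4000]

# Fragment of the pairs bins matched at 100, 200, 1000bp (with extra bins).
matched = [1000, 1260, 1590, 2000, 3000, 4000]
counts = [5, 4, 3, 6, 2]  # made-up counts, one per fine bin
print(coarsen(matched, counts, cooler_1kb))  # [12, 6, 2]

# Fragment of the pairs bins not matched to cooler.
unmatched = [1000, 1259, 1585, 1995, 2512, 3162, 3981]
try:
    coarsen(unmatched, [1] * 6, cooler_1kb)
except ValueError as err:
    print(err)  # 2000, 3000 and 4000 are not edges of the unmatched bins
```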

@mimakaev

mimakaev commented Mar 4, 2020

@golobor @sergpolly @agalitsyna - what do you guys think? We should probably decide on this before we merge in cooltools logbin_expected.

-- Should we aim at matching at one resolution, or at two, or matching at all?
-- Would there ever be a need for sub-kb scalings? 100bp resolution?

@golobor
Member

golobor commented Mar 4, 2020 via email

@sergpolly
Member Author

IMHO - ~100 bp is needed, at least for pair-level stuff, because of DNase/MNase-based methods like microC, OmniC, and whateverC that might happen "tomorrow"

100bp coolers for microC isn't a crazy thing to do, so perhaps it makes sense to match it like @mimakaev suggested:

And bins matched to cooler at 100, 200, 1000bp resolutions with extra bins for pairs.
1,2,3,4,5,6,8,10,13,16,20,25,32,40,50,63,79,100,126,159,200,252,318,400,490,600,693,800,1000,1260,1590,2000,3000,4000,5000,6000,6930,8000,10000,13000,16000,20000,25000,32000,40000,50000,63000 (bins in italics are the ones that were added)

but this would only work for high-resolution coolers and wouldn't be applicable to sparse data (fewer than 50-100M usable pairs in a cooler). So, as @golobor is suggesting, this matching between bins for coolers and pairs could be optional

another IMHO - I don't think it is THAT crucial to match bins for pairtools stats with coolers, as it is just for QC, and it's not meant to be used for "real" analysis (by-arm, individual chromosomes, etc.)

@mimakaev

mimakaev commented Mar 5, 2020

Yeah, that would probably be ideal. I will engineer that set a little better and make sure it is actually matched.

@mimakaev

mimakaev commented Mar 6, 2020

[Image: ratios of neighboring pair bins]

These are ratios of neighboring pair bins in the current version of bins.

bins = [10,13,16,20,25,32,40,50,63,79,100,126,159,200,240,300,400,490,600,800,1000,1200,1600,2000,2400,3000,4000,5000,6000,8000,10000,13000,16000,20000,25000,32000,40000,50000,63000]

Bins for 100bp and 200bp (just without 100 and 300)
100,200,300,400,600,800,1000,1200,1600,2000,2400,3000,4000,5000,6000,8000,10000,13000,16000,20000,25000,32000,40000,50000,63000
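The ratios shown in the image can be recomputed from the list itself. This is only a sketch of that calculation, plus a check that the 100bp bins listed above are a subset of the pair bins; the variable names are my own.

```python
# Pair bins from the current version, as listed above.
bins = [10, 13, 16, 20, 25, 32, 40, 50, 63, 79, 100, 126, 159, 200, 240,
        300, 400, 490, 600, 800, 1000, 1200, 1600, 2000, 2400, 3000, 4000,
        5000, 6000, 8000, 10000, 13000, 16000, 20000, 25000, 32000, 40000,
        50000, 63000]

# Bins for 100bp coolers, as listed above.
cooler_100bp = [100, 200, 300, 400, 600, 800, 1000, 1200, 1600, 2000, 2400,
                3000, 4000, 5000, 6000, 8000, 10000, 13000, 16000, 20000,
                25000, 32000, 40000, 50000, 63000]

# Ratio of each bin edge to the previous one.
ratios = [b / a for a, b in zip(bins, bins[1:])]
print(round(min(ratios), 3), round(max(ratios), 3))  # 1.2 1.333

# Every cooler edge is also a pair-bin edge, so the pair bins are reducible.
print(set(cooler_100bp) <= set(bins))  # True
```

All neighboring ratios fall between 1.2 and 4/3, compared with the ideal 10**0.1 (about 1.26) for exactly 10 bins per order of magnitude.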

@golobor
Member

golobor commented Mar 6, 2020 via email

@mimakaev

mimakaev commented Mar 6, 2020

ok, now I get it.

A large negative consequence is a two-fold jump from 1 to 2. Could have used 1 2 5 10 instead - that's at least even.

A partial remedy is to use these bins, and drop #2,3,5 in the first order of magnitude
1000,1300,1600,2000,2400,3000,4000,5000,6000,8000 which has slightly better bin size ratios.

@agalitsyna
Member

I will convert this to a discussion for now, but feel free to comment or open an issue if binning improvements are needed!

@open2c open2c locked and limited conversation to collaborators Apr 8, 2022
@agalitsyna agalitsyna converted this issue into discussion #120 Apr 8, 2022