Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htk.fetch().to_numpy() sometimes gives incorrect result when two regions don't overlap #73

Closed
i-pletenev opened this issue Aug 26, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@i-pletenev
Copy link

Hi! In order to check consistency between cooler.Cooler().matrix().fetch() and htk.fetch().to_numpy() I sampled a few random regions from cooler_test_file.mcool and looked whether the result is identical. I found that in minority of cases the output differs even when two queried regions don't overlap. Here's an example:

In [1]: import cooler

In [2]: import hictkpy as htk

In [3]: htk.__version__ # same thing for version from github
Out[3]: '0.0.5'

In [4]: clr_path = "cooler_test_file.mcool"

In [5]: resolution = 100_000

In [6]: reg1, reg2 = 'chr2R:2200000-2700000', 'chr2R:23300000-23800000'

In [7]: clr_out = cooler.Cooler(clr_path + f"::resolutions/{resolution}").matrix(balance=False).fetch(reg1, reg2)

In [8]: clr_out
Out[8]: 
array([[ 8,  1,  5,  7,  3],
       [ 9,  6,  1,  5,  2],
       [ 1,  0,  2,  0,  0],
       [ 9,  4,  9, 10, 11],
       [ 5,  2,  9, 10,  8]], dtype=int32)

In [9]: htk_out = htk.File(clr_path, resolution).fetch(reg1, reg2, normalization='NONE').to_numpy()

In [10]: htk_out
Out[10]: 
array([[ 8,  1,  5,  7,  3],
       [ 9,  6,  1,  5,  2],
       [ 1,  1,  2,  0,  0],
       [ 9,  4,  9, 10, 11],
       [ 5,  2,  9, 10,  8]], dtype=int32)

In [11]: clr_out == htk_out
Out[11]: 
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True, False,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

I decided to create a second issue, because there could be a different bug here than in #72

Once again, I checked it in both 0.0.5 version and 0.0.6.dev50+g6eca918 version from github.

@robomics
Copy link
Contributor

Thanks for opening the issues!

I think the bug you are describing is the same I ran into yesterday while fuzzying hictk with the new fuzzer (see paulsengroup/hictk#226 paulsengroup/hictk#227 if you are curious).

If it is the same bug, then the bug should only occur for cis queries where the end coordinate of range2 is greater than the end coordinate of range1.

The bug is due to an error in the code that mirrors interactions from the upper-triangle into the lower triangle.

I will try to write a bugfix by the end of the week.
Will ping you here once it is ready :)

Thanks again for opening the issues!

@i-pletenev
Copy link
Author

Great, thank you for quick response!

@robomics
Copy link
Contributor

robomics commented Oct 23, 2024

Hi @i-pletenev, my apologies for not addressing this bug earlier and for not providing updates here.
Properly fixing this issue required some work on the hictk side (that was clear since the beginning).
However, at first I though it was possible to also quickly fix the issue on the hictkpy side (trading some performance and code quality for correctness).
However, that turned out to not be possible, which meant we had to wait for a new release of hictk (and that took longer than expected).

Anyways, as of hictkpy v1.0.0 the bug should be fixed (see paulsengroup/hictk#227, #64, and #65).
I just published the up-to-date wheels to PyPI, and I hope to be able to get the new version on bioconda ASAP (for the time being, things are on hold until I get some help at bioconda/bioconda-recipes#51561 (comment)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants
@i-pletenev @robomics and others