get_cis_expected API (smoothing, aggregation, etc) #280

Closed
3 tasks
Tracked by #312
gfudenberg opened this issue Oct 14, 2021 · 16 comments

@gfudenberg
Member

gfudenberg commented Oct 14, 2021

  • propagating NaNs through smoothing e.g. balanced.sum.smoothed should be NaN if balanced.sum is NaN.

  • consider renaming balanced & areas here to something more interpretable (smoothed_values, smoothed_counts?)

    balanced, areas = log_smooth(

  • it's not currently clear what the agg argument is supposed to do; it seems to delete the count column (which also has slightly unexpected values)

cc @golobor

@golobor
Member

golobor commented Oct 18, 2021

thank you for these thoughtful comments!

propagating NaNs through smoothing e.g. balanced.sum.smoothed should be NaN if balanced.sum is NaN.

not sure if I agree with this one! By definition, smoothing assumes that the "true" value of contact frequency for a given diagonal can be inferred from the values of adjacent diagonals. If some diagonal has no good pixels at all, I would say we should still be able to estimate its contact frequency, no?..
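
To make the argument above concrete, here is a minimal sketch of NaN-aware smoothing, in which a diagonal with no valid pixels still receives an estimate from its neighbours. It is an illustration only: the function name `nan_aware_smooth` is hypothetical, and it uses a plain Gaussian kernel over the diagonal index rather than the log-space smoothing that cooltools' `log_smooth` performs.

```python
import math


def nan_aware_smooth(values, sigma=1.0):
    """Gaussian-weighted average over neighbours, skipping NaN entries.

    Hypothetical helper for illustration; not the cooltools implementation.
    """
    smoothed = []
    for i in range(len(values)):
        weighted_sum = 0.0
        weight_total = 0.0
        for j, value in enumerate(values):
            if math.isnan(value):
                continue  # a missing diagonal contributes no weight
            weight = math.exp(-0.5 * ((i - j) / sigma) ** 2)
            weighted_sum += weight * value
            weight_total += weight
        # Only undefined when every input is NaN.
        smoothed.append(weighted_sum / weight_total if weight_total > 0 else math.nan)
    return smoothed


# Made-up per-diagonal averages; NaN marks diagonals without valid pixels.
raw = [math.nan, 0.08, 0.05, math.nan, 0.03]
smoothed = nan_aware_smooth(raw)
```

With this approach the smoothed curve has no gaps, which is the behaviour being defended here; whether the NaNs should instead be carried over to the smoothed column is the open question of this issue.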

consider renaming balanced & areas here to something more interpretable (smoothed_values, smoothed_counts?)

Done!

it's not currently clear what the agg argument is supposed to do; it seems to delete the count column (which also has slightly unexpected values)

Yes, that is a good point - I just pushed new API that does not have this argument and does not drop columns that were present in the cvd table:


https://github.com/open2c/cooltools/blob/master/cooltools/sandbox/expected_smoothing_example.ipynb

@gfudenberg
Member Author

going through the notebooks, noticed that the docstring for

def get_cis_expected(
    ...
    intra_only=True,
    ...
):
 says intra_only=False returns all combinations, but it only returns non-symmetric combinations. 

Had we discussed a desired behavior?


https://github.com/open2c/cooltools/blob/ad018d6005b5df852cf39f6ec5321bdfab95106f/cooltools/expected.py#L984

@gfudenberg changed the title from "suggestions for cvd smoothing" to "get_cis_expected API (smoothing, aggregation, etc)" on Nov 11, 2021
@sergpolly
Member

yes, - we've discussed ...
the conclusion was to make intra_only=False return EVERYTHING - intra and inter together
I'll check after everyone goes to sleep

@gfudenberg
Member Author

actually, there now seems to be an issue on the main branch that emerges with intra_only=False for get_cis_expected (or one of the checks that gets called)

@gfudenberg
Member Author

Update: looks like the issue is with the sorting check-- I think it actually doesn't make sense, b/c there could be more regions in the view than in the cooler, but it could still be compatible

@sergpolly
Member

@gfudenberg we need viewframe to be sorted (by coordinate and according to cooler's order of chromosomes) because we generate "inter"-regions (inter arms, inter - whatever) in a pairwise combination fashion - i.e. first with second, first with third etc - and if viewframe isn't sorted - some of these inter regions would end up in the lower left side of the heatmap

enforcing sorted viewframe is the easiest way to deal with it
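
A small sketch of why the ordering matters, assuming regions are paired with `itertools.combinations`. Everything here is hypothetical and for illustration only: the region coordinates are made up, and `is_sorted` / `inter_region_pairs` are stand-in helpers, not the cooltools or bioframe functions.

```python
from itertools import combinations

# Hypothetical chromosome order, standing in for the cooler's chromosome order.
chrom_order = ["chr2", "chr17"]

# Hypothetical view rows: (chrom, start, end, name). Coordinates are made up.
view = [
    ("chr2", 0, 93_000_000, "chr2_p"),
    ("chr2", 93_000_000, 242_000_000, "chr2_q"),
    ("chr17", 0, 25_000_000, "chr17_p"),
    ("chr17", 25_000_000, 83_000_000, "chr17_q"),
]


def sort_key(region):
    """Order by the cooler's chromosome order first, then by coordinate."""
    chrom, start, end, _name = region
    return (chrom_order.index(chrom), start, end)


def is_sorted(regions):
    keys = [sort_key(region) for region in regions]
    return keys == sorted(keys)


def inter_region_pairs(regions):
    """Pair first with second, first with third, and so on.

    With a sorted view, region1 always precedes region2, so every pair
    falls in the upper-right triangle of the heatmap.
    """
    if not is_sorted(regions):
        raise ValueError("view must be sorted by chromosome order and coordinate")
    return [(a[3], b[3]) for a, b in combinations(regions, 2)]


pairs = inter_region_pairs(view)
```

If the view were passed in reverse order, the first pair would be `("chr17_q", "chr17_p")`, i.e. a block below the diagonal, which is the case the sorting check is meant to rule out.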

@gfudenberg
Member Author

gfudenberg commented Nov 11, 2021 via email

@sergpolly
Member

is_compatible_viewframe does not check order by default - only when requested
we request it for pairwise expected calculations, because we need it there to ensure we're in the upper right triangle of the heatmap

I've just checked is_sorted on arms and on a subset of arms and everything seems to work ok - i.e. in the same open2c example notebook with the microC dataset reduced to 2 chroms. The only thing I had to do was to reduce hg38_arms down to the chromosomes available in the cooler itself - i.e. chr2 and chr17

@sergpolly
Member

the problem with intra_only=False was indeed real - it didn't do what we've agreed on - it was only returning intra (i.e. asymmetric expecteds)
I'm about to push some changes to correct for it

@gfudenberg
Member Author

gfudenberg commented Nov 11, 2021 via email

@gfudenberg
Member Author

gfudenberg commented Nov 12, 2021

discussion 11.12:

proposed changes for expected_cis:

some name changes for arguments

def expected_cis(
    clr,
    view_df=None,
    intra_only=True,
    smooth=True,
    aggregate_smoothed=True,
    smooth_sigma=0.1,
    clr_weight_name="weight",
    ignore_diags=2,  # should default to cooler info
    chunksize=10_000_000,
    nproc=1,
):

Some simplification for the returned dataframe:

  1. return balanced columns if clr_weight_name is not None
  2. return .smoothed column if smooth=True
  3. return .smoothed.agg column if aggregate_smoothed=True

Always return:
region1, region2,
dist, n_valid,
count.sum, count.avg,

Return if smooth=True and aggregate_smoothed=True
count.avg.smoothed, count.avg.smooth.agg

Return if smooth=True and aggregate_smoothed=True and clr_weight_name is not None:
balanced.sum, balanced.avg, (if cooler weight name not none)
balanced.avg.smoothed, balanced.avg.smooth.agg (if aggregate_smoothed=True)

This can be implemented by:

  1. moving the merge into expected_cis
  2. changing how the dictionary of DEFAULT_CVD_COLS is passed to the smoothing functions.

For release we will not return the count smoothing.
@golobor any suggestions for best way to pass the DEFAULT_CVD_COLS dictionary?
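
One way to read the proposed column rules is sketched below. This is an interpretation under assumptions, not the agreed implementation: the helper name `expected_cis_columns` is hypothetical, the `.smoothed` / `.smoothed.agg` suffixes follow the numbered list above (the thread also writes `.smooth.agg`), and smoothed count columns are left out, following the note that count smoothing will not be returned for the release.

```python
def expected_cis_columns(smooth=True, aggregate_smoothed=True, clr_weight_name="weight"):
    """Return the column names the proposal would produce for these flags.

    Hypothetical helper; suffix names are an assumption taken from the list above.
    """
    # Always returned.
    columns = ["region1", "region2", "dist", "n_valid", "count.sum", "count.avg"]

    # Balanced columns only exist when a weight column is used.
    if clr_weight_name is not None:
        columns += ["balanced.sum", "balanced.avg"]
        if smooth:
            columns.append("balanced.avg.smoothed")
            # Aggregation is applied to the smoothed values, so it requires smooth=True.
            if aggregate_smoothed:
                columns.append("balanced.avg.smoothed.agg")
    return columns


default_columns = expected_cis_columns()
raw_only_columns = expected_cis_columns(clr_weight_name=None)
```

Keeping the rules in one place like this would also give the smoothing functions a single source for column names, which is the role the DEFAULT_CVD_COLS dictionary plays in the question above.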

@gfudenberg
Member Author

bubbling up the idea of propagating NaNs--
the current smoothing does not interpolate to dist=1 correctly, that's all I was suggesting, not NaNs later in the curve. Does that make more sense @golobor ?

@gfudenberg mentioned this issue on Nov 17, 2021
49 tasks
@gfudenberg
Member Author

gfudenberg commented Sep 1, 2022

this PR adds np.nan propagation for smoothed outputs

# propagate nan

#380

is that o.k. ? @golobor @sergpolly
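
For readers following along, NaN propagation of this kind can be sketched as a mask applied after smoothing. This is not the code from the PR; `propagate_nan` is a hypothetical helper, and the sample values are the balanced.avg and balanced.avg.smooth entries for dist 0 to 3 from the sample output posted later in this thread.

```python
import math


def propagate_nan(raw, smoothed):
    """Set the smoothed value to NaN wherever the raw value is NaN.

    Hypothetical helper for illustration only.
    """
    return [math.nan if math.isnan(r) else s for r, s in zip(raw, smoothed)]


# dist = 0, 1, 2, 3 for chr2_p vs chr2_p
balanced_avg = [math.nan, math.nan, 0.074699, 0.047032]
balanced_avg_smooth = [math.nan, 0.000795, 0.068918, 0.045381]

masked = propagate_nan(balanced_avg, balanced_avg_smooth)
```

After masking, the dist=1 entry is NaN instead of the extrapolated 0.000795, while the entries with valid raw values are unchanged.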

@sergpolly
Member

the newest API proposal - expected_full that would combine cis and trans expected-s together:

expected_full(
        clr,
        view_df=None,  # same view e.g. arms for cis and trans
        smooth_cis=False,  # smooth cis True|False
        aggregate_cis=False,  # aggregate cis expected False | "chrom" | "genome"
        # we have to allow for aggregate_trans, e.g. when calculating trans expected by arm
        aggregate_trans=False,  # aggregate trans expected False | "chrom" | "genome"
        expected_column_name="expected",  # store final result in a single column
        drop_intermediate_columns=True,  # drop count.sum, balanced.sum, n_valid balanced.sum.smooth etc ...

        # usual options
        smooth_sigma=0.1,
        ignore_diags=2,
        clr_weight_name='weight',
        chunksize=10_000_000,
        nproc=4,
)

sample output without intermediate columns:

region1  region2  dist  expected
chr2_p   chr2_p   0     NaN
chr2_p   chr2_p   1     NaN
chr2_p   chr2_p   2     0.068918
chr2_p   chr2_p   3     0.045381
...      ...      ...   ...
chr2_p   chr2_q   2421  0.000050
chr2_p   chr17_p  -1    0.000022
chr2_p   chr17_q  -1    0.000022
chr2_q   chr17_p  -1    0.000022
chr2_q   chr17_q  -1    0.000022

note how trans-expected uses the special distance value -1, which allows for easy filtering and aggregation.
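
A minimal sketch of the filtering this sentinel enables, using plain Python and a few rows copied from the sample output above. The constant name `TRANS_DIST` is an assumption made for readability; with a pandas DataFrame the same split would be a boolean mask on the `dist` column.

```python
# A few rows copied from the sample output above.
rows = [
    {"region1": "chr2_p", "region2": "chr2_p", "dist": 2, "expected": 0.068918},
    {"region1": "chr2_p", "region2": "chr2_p", "dist": 3, "expected": 0.045381},
    {"region1": "chr2_p", "region2": "chr2_q", "dist": 2421, "expected": 0.000050},
    {"region1": "chr2_p", "region2": "chr17_p", "dist": -1, "expected": 0.000022},
    {"region1": "chr2_q", "region2": "chr17_q", "dist": -1, "expected": 0.000022},
]

TRANS_DIST = -1  # sentinel distance marking trans (inter-chromosomal) rows

# Split the combined table into its cis and trans parts.
cis_rows = [row for row in rows if row["dist"] != TRANS_DIST]
trans_rows = [row for row in rows if row["dist"] == TRANS_DIST]

# Example aggregation: mean trans expected across all region pairs.
trans_mean = sum(row["expected"] for row in trans_rows) / len(trans_rows)
```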

Here is the same output with the intermediate columns:

region1  region2  dist  n_valid   balanced.sum  balanced.avg  balanced.avg.smooth  expected
chr2_p   chr2_p   0     878.0     NaN           NaN           NaN                  NaN
chr2_p   chr2_p   1     876.0     NaN           NaN           0.000795             0.000795
chr2_p   chr2_p   2     874.0     65.287351     0.074699      0.068918             0.068918
chr2_p   chr2_p   3     872.0     41.011675     0.047032      0.045381             0.045381
...      ...      ...   ...       ...           ...           ...                  ...
chr2_p   chr2_q   2421  0.0       0.000000      NaN           0.000050             0.000050
chr2_p   chr17_p  -1    174722.0  3.940185      0.000023      NaN                  0.000022
chr2_p   chr17_q  -1    477632.0  10.879623     0.000023      NaN                  0.000022
chr2_q   chr17_p  -1    284769.0  5.839495      0.000021      NaN                  0.000022
chr2_q   chr17_q  -1    778464.0  16.283992     0.000021      NaN                  0.000022

@sergpolly
Member

in the name of biology we converged on the following API:

expected_full_fast(
    clr,
    view_df=hg38_arms,  # same view for cis and trans
    smooth_cis=True,  # applies to both inter and intra (-cis)
    smooth_log10=0.03,
    combine_cis=False,  # False | "intra_genomewide" | function
    combine_trans=False,  # False (None) | "chrom" | "genomewide" | function
    expected_column_name="expected",
    store_intermediate_columns=True,
)
# region1, region2, dist, expected_column - default

# if everything is False and we choose to store intermediates ...
# region1 region2 dist n_valid balanced.sum balanced.avg expected

@gfudenberg
Member Author

API discussion superseded by #501

the issue of propagating NaNs from count.avg & balanced.avg to the smoothed columns remains
