z-score counts discrepancy #187
Notes as I'm working on tracking this down:
That's all I've got time for today - will continue the investigation tomorrow. TODO:
Some things have picked up this week with another project - will spend more time on this later today / tomorrow.
@sminot I tried running both versions locally on my laptop using the example pan-cov-example data, and was unable to reproduce the problem being described. I first downloaded the pan-cov-example files, and then used Nextflow's git-aware functionality to run both versions on the same data. Using ...
This produces the following results (after ...)
And I see no difference between the two sets of z-score CSVs ...
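For concreteness, a check along these lines can be done with a short pandas script. This is a sketch only: the output directory names and the "*zscore*.csv" file pattern below are hypothetical, since the actual paths from the two runs are not shown here.

import glob
import os
import pandas as pd

# hypothetical output directories from the two pipeline runs
dir_a, dir_b = "results_v1.08", "results_v1.13"

for path_a in sorted(glob.glob(os.path.join(dir_a, "*zscore*.csv"))):
    # compare each z-score CSV against its counterpart from the other run
    path_b = os.path.join(dir_b, os.path.basename(path_a))
    df_a = pd.read_csv(path_a, index_col=0)
    df_b = pd.read_csv(path_b, index_col=0)
    print(os.path.basename(path_a), "identical:", df_a.equals(df_b))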
I'm not exactly sure how I might continue to diagnose the problem further without some minimal example that reproduces the differences described ... re-reading the comment above, you note:
Can I assume they are indeed seeing differences only with the z-score hits, but not the edgeR hits? I've been making this assumption, given you mentioned
but if both of these hits differ, then I would have to imagine the problem lies downstream in the aggregate workflow ... I'm assuming this is not the case ... Can you think of a locally reproducible example of this behavior that could be used for further diagnosis?
From email thread: Sam:
Jared:
Sam:
Picking this thread up here, @sminot. The only possible things I can think of might be the way that Nextflow itself handles files being passed through processes, or differences in the containers built from the static images. A few questions:
One other thing worth testing might be running the pipeline on the FH cluster using Apptainer to see if the problem pops up there for any reason. I can do that. TODO:
I've added an email to the thread, just because it has private links with test data to look at.
So I think I've identified the issue as being related to the order in which samples are fed to the z-score calculation. Getting @ksung25 on the job, and will provide a code example of this when / if necessary.
From @ksung25
Thanks for looking into this @ksung25! That's interesting about phippery.zscore.zscore_pids_binning. Hope it's not too much of a pain to identify/patch!
I checked, and the number of beads-only samples, and their respective count sums, are the same for both the V1.08 and V1.13 datasets, just in a different order with different sample IDs. Another piece of evidence that tells me the issue lies within ...
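For illustration, a check like that might look roughly as follows, reusing the phippery calls from the script further down. The file names and the choice of the 'cpm' table are assumptions, as is the peptide-by-sample layout of the table.

import phippery
from phippery.utils import ds_query

for path in ["v1.08.phip", "v1.13.phip"]:  # hypothetical file names
    ds = phippery.load(path)
    beads = ds_query(ds, "control_status == 'beads_only'")
    # per-sample sums over peptides, assuming a peptide x sample layout
    sums = beads['cpm'].to_pandas().sum(axis=0)
    print(path, "n beads-only samples:", len(sums))
    print(sorted(sums.values))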
Unfortunately, the example data that can be obtained via the phip-flow example -- even when shuffled to re-order the two beads-only samples -- does not exhibit this behavior, i.e. the reported z-scores match when checked with the code below. So the only way I'm able to reproduce this behavior is with the Boeckh lab data. That said, given permissions to the V1.13 .phip file (requested and shared securely via Cirro, @sminot), you can reproduce the problem simply by randomly shuffling the samples and checking them against the original z-scores:
"""
A quick script to test whether sample ordering affects z-score fits.
Usage: python zscore_shuffle.py example.phip
where example.phip is a dataset with 'beads_only' annotated samples,
and a pre-computed z-score enrichment table. The script simply shuffles
the sample order, refits z-scores and saves them to the "zscore_shuffle"
enrichment layer, then compares the old and new enrichment tables
(where order is preserved, thanks to `xarray`)
to see whether the values differ and, if so, the distribution of how much.
"""
import phippery
from phippery.utils import ds_query
from phippery.modeling import zscore
import numpy as np
import pandas as pd  # needed for the comparison at the end
import sys
# load the original data
ds = phippery.load(sys.argv[1])
# randomly shuffle the sample ordering
np.random.seed(3)
ds = ds.reindex(sample_id=np.random.permutation(ds.sample_id.values))
# query the beads only samples
beads_ds = ds_query(ds, "control_status == 'beads_only'")
# re-fit zscores and add them to the dataset under a new name
ds = zscore(
    ds,
    beads_ds,
    data_table='cpm',
    min_Npeptides_per_bin=300,
    lower_quantile_limit=0.05,
    upper_quantile_limit=0.95,
    inplace=False,
    new_table_name="zscore_shuffle"
)
# save the new data for later inspection
phippery.dump(ds, f"{sys.argv[1]}.reshuffled.phip")
# compare the original and re-fit z-score tables
orig_zscore_df = ds['zscore'].to_pandas()
shuf_zscore_df = ds['zscore_shuffle'].to_pandas()
if orig_zscore_df.equals(shuf_zscore_df):
    print("Z-scores match")
else:
    # summarize the magnitude of any discrepancies
    diffs = pd.Series(np.abs((orig_zscore_df.values - shuf_zscore_df.values).flatten()))
    print(f"distribution of differences:\n{diffs.describe()}\n")

Note that this code actually requires the patch I'm working on for #188, which I'll push tomorrow (this should have no impact on the bug we're discussing).
The issue comes from numerical precision in summing peptide CPMs over beads-only samples. Peptide IDs are sorted and binned based on these sums, and there are a lot of ties. Because of the rounding errors, which peptides are tied can differ depending on sample order, and subsequently the peptides are binned a little differently. My proposed solution is to use math.fsum -- this at least eliminates the order dependency in my local example. This change will probably give results that differ from both V1.08 and V1.13, though.
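To make the failure mode concrete, here is a minimal, self-contained sketch (synthetic data, not the phippery binning code) showing that plain floating-point summation can depend on the order of the addends, while math.fsum is exactly rounded and therefore order-independent:

import math
import random

# synthetic stand-in for per-peptide CPM values over beads-only samples
random.seed(42)
n_peptides, n_samples = 1000, 6
cpm = [[random.uniform(0, 1000) for _ in range(n_samples)]
       for _ in range(n_peptides)]

order_a = list(range(n_samples))
order_b = order_a[::-1]  # the same samples, summed in reverse order

naive_a = [sum(row[j] for j in order_a) for row in cpm]
naive_b = [sum(row[j] for j in order_b) for row in cpm]
exact_a = [math.fsum(row[j] for j in order_a) for row in cpm]
exact_b = [math.fsum(row[j] for j in order_b) for row in cpm]

# plain sum: some totals differ in the last bits, so sorting/binning
# peptides by these sums can break ties differently per sample order
print(sum(a != b for a, b in zip(naive_a, naive_b)), "naive sums differ")
# math.fsum returns the correctly rounded exact sum: order-independent
print(sum(a != b for a, b in zip(exact_a, exact_b)), "fsum sums differ")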
I'm glad to hear that the root cause can be described as a rounding error rather than a systematic bias. My recommendation would be to update the code moving forward so that the order dependency is not an issue in future versions of the pipeline. While this does not go back and change ("fix") the behavior of the earlier versions, that is why we pinned versions in the first place. I really appreciate your getting to the bottom of this, Kevin!
Really excited to see this, and it makes sense that it would lead to the minor differences we were seeing - thank you, everyone!
From @sminot