z-score counts discrepancy #187

Closed
jgallowa07 opened this issue Oct 8, 2024 · 11 comments · Fixed by #189


@jgallowa07
Member

jgallowa07 commented Oct 8, 2024

From @sminot

Hi Jared,

I wanted to pull you into a discussion that I’ve been having with Terry and Ryan about their use of phip-flow for VirScan.

The overall context is that phip-flow has been very useful for their project, and now that they have been running it at scale with a number of different datasets an odd edge case has emerged that would be good to resolve.

Here is the chain of observations that we have made:

Terry and Ryan observed that when the same dataset was analyzed with the V1.08 and V1.13 versions of the pipeline, the number of hits for a particular organism were not identical. Very similar, but not exactly identical.

Since this is all running on Cirro, it’s been very easy to make sure that all of the inputs to the runs were exactly the same, and that the only difference was the git commit being referenced with nextflow run.

I tried running both versions locally on my laptop, but I found that the results were identical between the two versions. In my case, both versions produced results that were identical to the V1.08 results from Cirro.

Ryan was able to replicate the discordant V1.13 results by running on Cirro, and so thankfully the phenomenon is reproducible (if mysterious).

By inspecting the working directories through the chain of analysis, I was able to identify the single point of divergence, which is the z-score calculation within the fit_predict_zscore step.

Within that step, I can confirm that the input counts and the sample metadata table are identical, but when V1.13 is run on Cirro the calculated Z-score values are slightly different than those when V1.08 is run on Cirro (and V1.08 on Cirro is identical to both versions when run locally).

Having found the single point of divergence, I also realize that I don’t have a good idea of how this could possibly be the case. Do you think this could be related to the difference in the version of the phippery container used?

Thanks for your thoughts and input,

Sam

@jgallowa07
Member Author

jgallowa07 commented Oct 10, 2024

Notes as I'm working on tracking this down:

  • It does indeed seem that there was an image version (tag) change in V1.10. The difference between containers appears only to be phippery==1.1.4 -> phippery==1.2.0
  • When looking at the differences between 1.1.4 and 1.2.0, I'm not seeing any changes to zscore.py, or to the modeling.py module that calls those functions, besides a docstring edit.
  • here's the phip-flow pipeline diff between versions, for easy reference
  • I do, however, notice lots of changes from pandas.DataFrame.iteritems() -> pandas.DataFrame.items(). It's worth noting that the iteritems() method was completely removed in pandas == 2.0.0 (see the sketch after this list)
  • Indeed, when inspecting these two respective image-derived containers: phippery==1.1.4 has pandas==1.5.1, while phippery==1.2.0 has pandas==2.0.2
(gcreplay) ubuntu gcreplay/analysis ‹ranking_coeff_gctree*› » docker image ls
REPOSITORY                       TAG       IMAGE ID       CREATED         SIZE
quay.io/hdc-workflows/phippery   1.2.0     e0aab267cf37   15 months ago   1.33GB
quay.io/hdc-workflows/phippery   1.1.4     32432a30bbf8   23 months ago   1.32GB
(gcreplay) ubuntu gcreplay/analysis ‹ranking_coeff_gctree*› » docker run -it e0aab267cf37 /bin/bash
root@ad8d4c95a68a:/# python
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'2.0.2'
root@ad8d4c95a68a:/# exit
(gcreplay) ubuntu gcreplay/analysis ‹ranking_coeff_gctree*› » docker run -it 32432a30bbf8 /bin/bash
root@8df75142a649:/# python
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.5.1'
  • There were certainly lots of changes between these pandas versions, so this might be a little tricky ... What pandas methods does the zscore code use? This is @ksung25's code, so he may have input.
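
For reference, a minimal sketch of the pandas API change in question (toy DataFrame, nothing from the pipeline): DataFrame.iteritems() was deprecated in pandas 1.5 and removed in 2.0, where DataFrame.items() is the replacement.

import pandas as pd

# toy counts table; columns are samples
df = pd.DataFrame({"sample_1": [1, 2], "sample_2": [3, 4]})

# pandas 1.5.1: df.iteritems() still works but emits a FutureWarning
# pandas 2.0.2: df.iteritems() raises AttributeError; only df.items() remains
for sample, counts in df.items():
    print(sample, counts.sum())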

I tried running both versions locally on my laptop, but I found that the results were identical between the two versions. In my case, both versions produced results that were identical to the V1.08 results from Cirro.
It's curious to me that this discordance wouldn't be replicated locally.

  • this is a little confounding to me; if containers / pandas were the problem, as I suspect, then this should be seen when run locally as well.

That's all I've got time for today - will continue the investigation tomorrow.

TODO:

  • Try to replicate the discrepancy myself, using the docker profile and the example input data that's included.
  • Confirm that the pandas version is causing the problem. I guess this could be done by simply running the zscore Python API on the example data with two different versions of pandas, but everything else the same (see the sketch below).
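
A rough, untested sketch of what that second check might look like, using the phippery Python API calls that appear in the shuffle script later in this thread; the input path, parameter values, and output filename are placeholders for illustration. Running the same script once in a pandas==1.5.1 environment and once in a pandas==2.0.2 environment, then diffing the two CSVs, would isolate the pandas dependency:

import pandas as pd

import phippery
from phippery.utils import ds_query
from phippery.modeling import zscore

# placeholder path: any pickled dataset produced by the pipeline with
# 'beads_only' annotated samples and a 'cpm' table
ds = phippery.load("data.phip")
beads_ds = ds_query(ds, "control_status == 'beads_only'")

ds = zscore(
    ds,
    beads_ds,
    data_table="cpm",
    min_Npeptides_per_bin=300,
    lower_quantile_limit=0.05,
    upper_quantile_limit=0.95,
    inplace=False,
    new_table_name="zscore_check",
)

# write the fitted z-scores tagged with the pandas version, then diff the files
ds["zscore_check"].to_pandas().to_csv(f"zscore_pandas_{pd.__version__}.csv")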

@jgallowa07
Member Author

Some things have picked up this week with another project - will spend more time on this later today / tomorrow

@jgallowa07
Member Author

@sminot I tried running both versions locally on my laptop using the example pan-cov-example data, and was unable to reproduce the problem being described.

I tried reproducing the problem by first downloading the pan-cov-example files, and then using nextflow's git-aware functionality to run both versions on the same data. Using nextflow version 24.04.4.5917:

nextflow run \
    matsengrp/phip-flow \
    -r V1.13 \
    -profile docker \
    --sample_table data/pan-cov-example/sample_table_with_beads_and_lib.csv \
    --peptide_table data/pan-cov-example/peptide_table.csv \
    --run_zscore_fit_predict true \
    --results results/V1.13

nextflow run \
    matsengrp/phip-flow \
    -r V1.08 \
    -profile docker \
    --sample_table data/pan-cov-example/sample_table_with_beads_and_lib.csv \
    --peptide_table data/pan-cov-example/peptide_table.csv \
    --run_zscore_fit_predict true \
    --results results/V1.08 \
    --output_wide_csv true  # <---- needed for V1.08

This produces the following results (after $ gunzip results/V1.13/wide_data/*)

results
├── V1.08
│   ├── pickle_data
│   │   └── data.phip
│   └── wide_data
│       ├── data_counts.csv
│       ├── data_cpm.csv
│       ├── data_peptide_annotation_table.csv
│       ├── data_sample_annotation_table.csv
│       ├── data_size_factors.csv
│       └── data_zscore.csv
└── V1.13
    ├── pickle_data
    │   └── data.phip
    ├── rds_data
    │   └── PhIPData.rds
    └── wide_data
        ├── data_counts.csv
        ├── data_cpm.csv
        ├── data_edgeR_hits.csv
        ├── data_edgeR_logfc.csv
        ├── data_edgeR_logpval.csv
        ├── data_peptide_annotation_table.csv
        ├── data_sample_annotation_table.csv
        ├── data_size_factors.csv
        └── data_zscore.csv

And I see no difference between the two sets of zscore CSVs ...

diff results/V1.08/wide_data/data_zscore.csv results/V1.13/wide_data/data_zscore.csv

I'm not exactly sure how I might continue to diagnose the problem further without some minimal example that reproduces the differences described ...

re-reading the comment above, you note:

Terry and Ryan observed that when the same dataset was analyzed with the V1.08 and V1.13 versions of the pipeline, the number of hits for a particular organism were not identical. Very similar, but not exactly identical.

Can I assume they are indeed seeing differences only with the zscore hits, but not the EdgeR hits? I've been making this assumption, given you mentioned

By inspecting the working directories through the chain of analysis, I was able to identify the single point of divergence, which is the z-score calculation within the fit_predict_zscore step.

but if both of these hits differ, then I would have to imagine the problem lies downstream in the aggregate workflow ... I'm assuming this is not the case ...

Can you think of a locally reproducible example for this behavior that can be used to further diagnose?

@jgallowa07
Member Author

jgallowa07 commented Oct 23, 2024

From email thread:

Sam:

Hi all,

Unfortunately, I found much the same thing as Jared when I tried to reproduce locally. To figure out why there was a difference when running in Cirro, I tracked it down to the step when the Z-score was being calculated. By pulling the input files to each version of that step, I found that the input values were identical, but that the z-score values that came out were different. That made me think that maybe a math library compiled into the different phippery images was different. That said, I don’t really know enough about phippery to take that any further. Is that a possibility, do you think, Jared?

Jared:

Hm, yeah, my first thought was that it had to do with the differences between pandas library versions; the phippery image used in V1.13 uses pandas 2.0, which introduces quite a few changes compared to earlier versions.

Regardless, don't you think any discordance between image libraries would be reproduced in local testing using docker (which is the profile I was using when trying to reproduce)?

Sam:

Absolutely, that’s what is breaking my brain about this. I cannot figure out what could possibly be different in AWS versus using Docker locally.

Picking this thread up here @sminot

The only possibilities I can think of are the way that nextflow itself handles files being passed through processes, or differences in the containers built from the static images.

A few questions:

  1. What version of Nextflow is being used on AWS?
  2. Does the AWS profile actually use docker, if so, what version? Or is it using Singularity/Apptainer?

One other thing worth testing might be trying to run the pipeline on the fh cluster using apptainer to see if the problem pops up there for any reason. I can do that.

TODO:

  • Test the pipeline run on the fh gizmo cluster using apptainer instead of docker
  • Test exact versions of nextflow / docker / apptainer being used with Cirro + AWS

@sminot
Collaborator

sminot commented Oct 23, 2024

  1. Nextflow in Cirro is 24.04.4
  2. The execution in AWS uses Docker images natively (via ECS, scheduled by Batch)

@sminot
Collaborator

sminot commented Oct 29, 2024

I've added an email to the thread, just because it has private links with test data to look at

@jgallowa07
Member Author

So I think I've identified the issue as being related to the order in which samples are fed to the zscore calculation. Getting @ksung25 on the job, and will provide a code example of this when / if necessary.

@jgallowa07
Member Author

jgallowa07 commented Nov 5, 2024

From @ksung25

Hi Jared, Sam,

Thanks for all the effort of testing and narrowing down the conditions for the bug to manifest.

The only place where multiple samples are combined for computation is aggregating beads-only samples to determine the binning of similar-abundance peptide IDs in zscore.zscore_pids_binning(). Once the binning is determined, the Z-scores are computed separately per-sample and therefore independent of input sample order. I have checked locally that the function zscore.zscore() reports identical z-scores for corresponding samples with four input samples in different order.

I'd like to understand how the order of samples is different between V1.08 and V1.13, and more importantly whether the sample type annotation somehow got scrambled, so that different sets of samples are considered "beads_only" between the two versions. If sample type annotation is correctly preserved, then I think I'd need some directions to reproduce the error and dig more.

Best,
Kevin

OK, right after hitting "Send" I realized I should check whether zscore.zscore_pids_binning() is independent of beads sample order, and it's not! I am confused, but I should be able to figure this out.

Thanks for looking into this @ksung25 !

That's interesting about phippery.zscore.zscore_pids_binning. Hope it's not too much of a pain to identify/patch!

I'd like to understand how the order of samples is different between V1.08 and V1.13, and more importantly whether the sample type annotation somehow got scrambled, so that different sets of samples are considered "beads_only" between the two versions.

I checked, and the number of beads-only samples and their respective count sums are the same for both the V1.08 and V1.13 datasets, just in a different order with different sample ids. Another piece of evidence that tells me the issue lies within zscore.zscore_pids_binning() is that when I sort the V1.13 dataset samples to match the order seen in V1.08 (but preserving the "sample_id") and then refit the zscores, we get exactly the same values as those seen in V1.08.

If sample type annotation is correctly preserved, then I think I'd need some directions to reproduce the error and dig more.

Unfortunately, the example data that can be obtained via the phip-flow example -- even when shuffled to re-order the two beads-only samples -- does not exhibit this behavior, i.e. the reported zscores seem to match when checked with the code below. So the only way I'm able to reproduce this behavior is using the Boeckh lab data. That said, given permissions to the V1.13 .phip file (requested and shared securely via Cirro, @sminot), you can reproduce the problem simply by randomly shuffling the samples and checking them against the original zscores:

"""
A quick script to test whether sample ordering effects zscore fits.

Usage: python zscore_shuffle.py example.phip

where example.phip is a dataset with 'beads_only' annotated samples,
and a pre-computed zscore enrichment table. The code the simply shuffles
sample order, refit's zscores and saves them to the "zscore_shuffle" enrichment
layer, then compares the old and new enrichment tables 
(where order is preserved, thanks to `xarray`)
to see whether the values differ, and the distribution of how much, if so.
"""

import phippery
from phippery.utils import *
from phippery.modeling import zscore
import numpy as np
import pandas as pd  # used below to summarize the differences

import sys


# load the original data
ds = phippery.load(sys.argv[1])

# randomly shuffle the sample ordering
np.random.seed(3)
ds = ds.reindex(sample_id=np.random.permutation(ds.sample_id.values))

# query the beads only samples
beads_ds = ds_query(ds, "control_status == 'beads_only'")

# re-fit zscores and add them to the dataset under a new name
ds = zscore(
    ds,
    beads_ds,
    data_table='cpm',
    min_Npeptides_per_bin=300,
    lower_quantile_limit=0.05,
    upper_quantile_limit=0.95,
    inplace=False,
    new_table_name="zscore_shuffle"
)

# save the new data for later inspection
phippery.dump(ds, f"{sys.argv[1]}.reshuffled.phip")

# compare original, and new zscores.
orig_zscore_df = ds['zscore'].to_pandas()
shuf_zscore_df = ds['zscore_shuffle'].to_pandas()

if orig_zscore_df.equals(shuf_zscore_df):
    print(f"Z-Scores match")
else:
    diffs = pd.Series(np.abs((orig_zscore_df.values - shuf_zscore_df.values).flatten()))
    print(f"distribution of differences: \n{diffs.describe()}\n")

Note that this code actually requires the patch I'm working on for #188 which I'll push tomorrow (this should have no impact on the bug we're discussing).

@ksung25
Collaborator

ksung25 commented Nov 5, 2024

The issue comes from numerical precision in summing peptide CPMs over beads samples
https://github.com/matsengrp/phippery/blob/main/phippery/zscore.py#L22
which can give different sums depending on the input order of the beads samples. (We can't see this in the example data because there are only 2 beads samples.)

Peptide IDs are sorted and binned based on these sums, and there are a lot of ties. Because of the rounding errors, there can be a difference in which peptides are tied depending on sample order, and subsequently peptides are binned a little differently. My proposed solution is to use math.fsum -- this at least eliminates the order dependency in my local example. This change will probably give different results from both V1.08 and V1.13, though.
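
To make the mechanism concrete, a tiny self-contained illustration (synthetic values, nothing to do with the real CPM data) of how plain floating-point summation depends on the order of the addends while math.fsum does not:

import math

a = [0.1, 0.2, 0.3]  # one summation order
b = [0.3, 0.2, 0.1]  # same values, reversed order

# built-in sum() accumulates left to right, so the rounding depends on order
print(sum(a) == sum(b))              # False: 0.6000000000000001 vs 0.6
# math.fsum tracks the exact intermediate sum, so order no longer matters
print(math.fsum(a) == math.fsum(b))  # True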

@sminot
Collaborator

sminot commented Nov 5, 2024

I'm glad to hear that the root cause can be described as a rounding error rather than a systemic bias. My recommendation would be to update the code moving forward so that the order dependency is not an issue in future versions of the pipeline. While this does not go back and change ("fix") the behavior of the earlier versions, that is why we pinned versions in the first place.

I really appreciate your getting to the bottom of this, Kevin!

@tsayers

tsayers commented Nov 5, 2024

Really excited to see this and it makes sense that it would lead to the minor differences we were seeing - thank you everyone!
