Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using a negative control sample to inform thresholds? #257

Open
hackkr opened this issue Sep 6, 2024 · 1 comment
Open

Using a negative control sample to inform thresholds? #257

hackkr opened this issue Sep 6, 2024 · 1 comment
Labels
question Looking for clarification on inputs and/or outputs

Comments

@hackkr
Copy link

hackkr commented Sep 6, 2024

Hello,

I ran a preliminary experiment to see if the all contexts methylation models work in an unusual system. Basically, I amplified the whole genome to remove any marks, then treated samples with either water, AluI methyltransferase (5mC), or EcoRI methyltransferase (6mA) and then rapid barcoded and sequenced. Samples were basecalled with supv5.0.0 models.

When I run modkit summary, all of my samples have similar levels of 6mA detected, and slightly different threshold values.

wga-AluI.bam
> sampling 10042 reads from BAM
> calculating threshold at 10% percentile
> calculated thresholds:  A: 0.8144531 C: 0.734375
 base  code  pass_count  pass_frac    all_count  all_frac 
 A     -     11871547    0.98895884   12893560   0.9683128 
 A     a     132539      0.011041157  421931     0.03168723 
 C     -     6307226     0.9691437    6870735    0.951314 
 C     h     9119        0.00140119   37663      0.0052147754 
 C     m     191695      0.029455105  313965     0.04347123

wga-Eco.bam
> sampling 10042 reads from BAM
> calculating threshold at 10% percentile
> calculated thresholds: A: 0.8183594 C: 0.734375
 base  code  pass_count  pass_frac     all_count  all_frac  
 A     -     15070288    0.98938465    16412277   0.97023183 
 A     a     161693      0.010615363   503553     0.029768152   
 C     -     8202646     0.99656403    8944574    0.9780853
 C     h     11379       0.0013824689  47668      0.005212475
 C     m     16902       0.0020534745  152742     0.01670227

wga-water.bam
> sampling 10042 reads from BAM
> calculating threshold at 10% percentile
> calculated thresholds: A: 0.8261719 C: 0.75390625
 base  code  pass_count  pass_frac     all_count  all_frac 
 A     -     14412553    0.99083996    15703601   0.9718222 
 A     a     133240      0.009160037   455323     0.028177805 
 C     -     7866807     0.9970703     8592602    0.980716 
 C     h     9278        0.0011759305  43996      0.0050214804 
 C     m     13837       0.0017537562  124962     0.014262528

I have two questions: for samples that were sequenced together, is it appropriate to use the same thresholds? Is there a way, similar to modkit validate to use another sample as a ground truth?

I plan to look at position-specific differential methylation next, but want to be sure I understand the rationale behind thresholds. Because there is a known motif for both samples relative to the negative control, would it be best to calculate position-specific thresholds?

@marcus1487
Copy link
Contributor

is it appropriate to use the same thresholds?

It is often a good idea to use the same thresholds for a larger experiment, especially when the estimated thresholds are so close like this. When applying the thresholds to new samples though it is important to note the fraction of calls filtered. When this value is large it can vastly skew downstream results and is often an indication of a poor quality run (due to contamination, very high mod content or other different run conditions).

Is there a way, similar to modkit validate to use another sample as a ground truth?

I'm not quite sure what you mean here. If you want to use the combination of these samples in order to estimate the thresholds you can do that by merging the bam files (potentially with some balancing). The modkit validate command is generally intended for application to strands where the modified status of each read in the sample will be absolutely known at a particular reference position (primarily from synthetically printed strands). The command will run on any file presented, but the results may not represent valid results.

would it be best to calculate position-specific thresholds?

We would generally advise against position specific thresholds. Users are free to explore this option (probably via the modkit extract output), but position specific thresholds are not implemented in any modkit commands, so application of these would have to be handrolled by the user.

I hope this helps and please let me know if further clarification would help with your analysis.

@marcus1487 marcus1487 added the question Looking for clarification on inputs and/or outputs label Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants