schmutzi: Bayesian maximum a posteriori contamination estimate for ancient samples

Upon sequencing ancient DNA , the DNA of the individuals involved in excavation, extraction of DNA and library preparation can become mixed with the actual sample being sequenced. We define as endogenous the DNA pertaining to the original sample and contaminant the DNA of the experimenters.

schmutzi is a set of programs aimed at ancient DNA data that can :

estimate contamination based on deamination patterns alone
call a mitonchondrial consensus for the endogenous genome. This consensus calls takes into account contamination and deamination.
estimate mitonchondrial contamination and identify the most likely contaminant from a set.

Questions :

contact: Gabriel Renaud   
email:	 gabriel [dot] reno [ at sign ] gmail.com

Downloading:

Do a :

git clone --recursive https://github.com/grenaud/schmutzi.git

or download a zipped file from https://bioinf.eva.mpg.de/schmutzi/schmutzi.tar.gz

Requirements:

zlib.h
cmake to build bamtools
git to download the submodules
C++ compiler
Perl interpreter
R (recent version)
- Rscript
- The fitdistrplus R package ( install.packages("fitdistrplus") or install.packages("fitdistrplus", repos="http://R-Forge.R-project.org") on more recent versions of R)
- The MASS package, normally installed automatically with fitdistrplus ( install.packages("MASS") )

On Ubuntu, these dependencies can be resolved using :

sudo apt-get install perl
sudo apt-get install git
sudo apt-get install cmake
sudo apt-get install g++
sudo apt-get install zlib
sudo apt-get install zlib1g-dev
sudo apt-get install r-base-core

Installation:

For Mac users, open a terminal please type "cmake", "git" and "R" to check if works. If it is not installed, please install it. If not, Mac Users reported that Homebrew can install those.

For Windows users, you would need cygwin to run schmutzi as it runs on the terminal (https://www.cygwin.com/).

type:

make

which should build pretty much everything.

We tested our program on a Linux system and MacOS, please email us if you have trouble building under cygwin.

A comment about cygwin: the high precision version of the logarithm does not seem available by default. Therefore, we are forced to use the low precision variant for cygwin. For higher accuracy, run our software under Linux. Also, some users have reported bugs with the locale used by C++

Running:

There are 2 main Perl scripts that drive the underlying programs written in C++.

script	function
contDeam.pl	Is a script that allows for the estimation of endogenous deamination and
	subsequent contamination estimate using those. Assuming that the contaminant
	is not deaminated, it will measure rates of deamination for the endogenous material
	and provide a rough contamination estimate. If say the endogenous material is
	30% deaminated at one base but the observed deamination rate for the entire set is 15%,
	the contamination rate is therefore 0.5 . There are two ways to estimate endogenous
	deamination:
	1) (default)
	Condition on one end of the read being deaminated (ex: 5') and measure
	deamination rates on the other (ex: 3'). This approach is useful for
	all types of ancient DNA (nuclear and mitonchondrial). Please note however
	that if the contaminant is deaminated as well, which can occur, this approach
	will return an underestimate whose error comensurates with rates of contaminant
	deamination.
	2) --split
	Take a series of mitonchondrial diagnostic positions (positions that indicate
	whether the read pertains to a particular type of homonin (ex: Neanderthal) or
	the putative contaminant homonin (ex: modern Humans) and separate the contaminant
	from the endogenous reads. This approach is useful when prior information on the
	sample is available and when enough diagnostic positions are available (case of
	archaic homonins).
schmutzi.pl	This script performs two tasks:
	1) Call an endogenous consensus
	2) Using the consensus called in 1) and a set of known contaminants, compute
	the contamination rate for entry in this set. Return the most likely contaminant
	with its corresponding rate.
	Both 1) and 2) are called iteratively until a stable contamination rate is reached. The
	script produces a fasta file of the endogenous mitonchondrial genome

Also, the various subparts of schmutzi can be called directly:

program	function
contDeam	Contamination estimate using deamination patterns
endoCaller	Consensus calling for mitochondrial data
mtCont	Contamination estimate using a set of known contaminants

approxDist.R	Do a maximum likelihood estimate of log-normal paramaters given molecule length data
bam2prof	Measure deamination rates and produce a matrix of deamination probabilities
insertSize	Produce all insert sizes for aligned data in BAM format
log2freq	Turn a endogenous or consensus log file into an allele frequency matrix to be used
	as contaminant source
logs2pos	Takes two log files and produces which positions are segregating
mitoConsPDF.R	Plot various information like coverage and quality for the consensus log file
mas2freq	Turn a multiple sequence alignment file in multifasta format into an allele frequency
	matrix to be used as contaminant source
mas2log	Transform a pairwise alignment of an endogenous sequence to the reference
	into a log file for contamination estimate using mtCont
posteriorDeam.R	Plot the posterior probability
contOut2ContEst.pl	Take the output of mtcont and get a point estimate

Quick start guide:

Before you start, make sure you have a BAM file that has been:

where ALL the fragments have been mapped to the mitochondrial genome only.
sorted
indexed
fillmd/calmd has been run to fix the MD field

First determine if you the library used was double-stranded or single-stranded. Then call:

contDeam.pl --lengthDeam [length] --library [library type] --out [output prefix] [mt reference] [input bam file]

where [length] is the # of bp considered and [library type] is the library type (either double or single). If the deamination is seen for a many bases far from the 5'/3' ends, use maybe --lengthDeam 40. If you have USER treatment or single-stranded libraries and low amounts of deamination further away from the ends, you can use --lengthDeam 2 or 5.

Then run the iterative procedure without the prediction of the contaminant:

schmutzi.pl --notusepredC --uselength --ref [mt reference] --out [out prefix]_npred [output prefix] [path to schmutzi]/eurasian/freqs/ [input bam file]

then run it again with the prediction of the contaminant:

schmutzi.pl               --uselength --ref [mt reference] --out [out prefix]_wpred [output prefix] [path to schmutzi]/eurasian/freqs/ [input bam file]

Those commands will create output files with the [out prefix]_npred and [out prefix]_wpred prefixes.

If the iterative procedure is successful, it will some files with the following suffixes:

_final.cont.est for the final estimate present-day human contamination (format: estimate estimate_low estimate_high)
_final.cont.pdf plot for the posterior probability
_final_mtcont.out log for mtCont (contamination estimate) allows the most likely haplogroup for the contaminant
_final_endo.fa for the unfiltered fasta prediction
_final_endo.log log file with the per-base likelihood for each position for each base of of being the endogenous base.

We highly recommend users to use log2fasta on _final_endo.log to obtain a consensus at various quality cutoffs.

Test Data:

To test schmutzi, we have made available emprical data with single-strand (ex: testdata/mezB9687.bam) and simulated data with double-stranded damage patterns (ex: testdata/simulation.bam).

To download it, either download it manually from :

    https://bioinf.eva.mpg.de/schmutzi/testData/mezB9687.bam
    https://bioinf.eva.mpg.de/schmutzi/testData/mezB9687.bam.bai
    https://bioinf.eva.mpg.de/schmutzi/testData/simulation.bam
    https://bioinf.eva.mpg.de/schmutzi/testData/simulation.bam.bai

Or, if you have "wget" installed, just type:

    make testdata

First you need to estimate endogenous deamination rates. First create an output directory:

    mkdir -p outputdir/

Then run contDeam to estimation endogenous deamination rates:

    ./contDeam.pl  --library single --out outputdir/mez testdata/mezB9687.bam

or for the simulated

    ./contDeam.pl  --library double --out outputdir/sim testdata/simulation.bam

This will produce the files:

    outputdir/[out].cont.pdf	Plot of the posterior probability for contamination based on deamination
    outputdir/[out].cont.est      Estimate for contamination based on deamination
    outputdir/[out].config	Configuration file describing the variables used

Then run the following to produce the endogenous consensus and the contamination estimate:

    ./schmutzi.pl       --uselength   --ref refs/human_MT_wrapped.fa         outputdir/mez   alleleFreqMT/197/freqs/  testdata/mezB9687.bam
    ./schmutzi.pl       --uselength   --ref refs/human_MT_wrapped.fa         outputdir/sim   alleleFreqMT/197/freqs/  testdata/simulation.bam


    --uselength tells the program to use the length of the molecules
    --ref is for the reference

   outputdir/mez is the output from contDeam
   alleleFreqMT/197/freqs/ is the database of putative contaminants 
   testdata/mezB9687.bam is the input bam file

The first dataset is an empirical dataset with about 40-45% contamination and the second is a simulated dataset with 20% contamination

It will run for a few minutes and produce the following files:

For contamination:

file	content
[out]_final_mtcont.out	Contamination estimates for all samples in the database
[out]_final.cont	Contamination estimates for the sample in the database with the highest likelihood
[out]_final.cont.est	Contamination estimates for the most likely sample with confidence intervals
[out]_final.cont.pdf	Posterior probability on the contamination for the most likely sample

For the respective genomes of the endogenous and contaminant:

file	content
[out]_final_endo.fa	Endogenous mitochondrial genome as fasta
[out]_final_endo.log	Endogenous mitochondrial genome as a log file with likelihoods on a PHRED scale
[out]_final_cont.fa	Contaminant mitochondrial genome as fasta
[out]_final_cont.log	Contaminant mitochondrial genome as a log file with likelihoods on a PHRED scale

Recommended workflow for ancient samples:

Nuclear I recommend DICE (http://grenaud.github.io/dice) for nuclear samples. The only thing schmutzi can do for nuclear data is to estimate contamination rates using deamination patterns.

However, use this at your own risk, we know three factors that lead to wrong estimates:
1. Insuffient # of molecules (we need at least 500M)
2. Insuffient rates of endogenous deamination (we need upwards of 5%)
3. No or very little deamination of the contaminant fragments
4. Since we need to condition on one end to measure deamination on the other, we need independence between 5' and 3' deamination rates
This method is implemented in "contDeam.pl". For mt, we can at least cross validate the contamination estimate using "mtCont" but not for nuclear data. If you have a contaminant that is deaminated this will lead to an underestimate. Lack of independence between the 5' and 3' deamination rates can lead to overestimated deamination rates for the endogenous portion and an overestimate. To test this, two programs were added as part of the package: jointFreqDeaminated and jointFreqDeaminatedDouble (double stranded). For samples with low amounts of contamination, they compute the joint frequency of deamination:
```
  ./jointFreqDeaminated  in.bam > in.freq
```
Then use
```
  ./jointFreqDeaminated.R in.freq 
```
If you get low p-values, the method should be safe to use, you only have to worry about a deaminated contaminant.
MT
1. Have your data aligned to a mitochondrial reference (see "refs/human_MT_wrapped.fa" for a wrapped reference) using a sensitive mapper that produces BAM. A wrapper script is available with schmutzi (see Frequently Asked Questions)
2. run samtools sort on your aligned bam file
3. run samtools calmd/fillmd on your aligned bam file
4. run samtools index on your sorted and aligned bam file
5. If you used the wrapped reference, re-wrap your alignments exceeding the junction using for example https://github.com/udo-stenzel/biohazard this step is not necessary but produces equal coverage and resolution at the boundaries
6. Estimate your initial contamination and deamination rates using "contDeam.pl"
7. Run schmutzi.pl once with default parameters
8. Run schmutzi.pl again with "--usepredC"
9. If contamination is more than a few percent re-run using the "--uselength" option

If you have a large number of samples, I recommend using bam2makeSchmutzi.pl which creates a makefile and automates the of using contDeam.pl followed by schmutzi.pl

Interpreting the output:

Here is a list of the different meaningful output files and their meaning.

From contDeam.pl:

file	content
[out].cont.pdf	Plot of the posterior probability for contamination based on deamination
[out].cont.est	Estimate for contamination based on deamination. format: estimate est.low est.high

From schmutzi.pl for the contamination estimate:

file	content
[out]_final.cont.est	Contamination estimates for the most likely sample with confidence intervals. format: estimate est.low est.high
[out]_final_mtcont.out	Contamination estimates for all records in the database format: record contamination log.posterior
[out]_final.cont	Contamination estimates for the record in the database with the highest likelihood. format: record contamination log.posterior
[out]_final.cont.pdf	Posterior probability on the contamination for the most likely sample

From schmutzi.pl for the endogenous/contaminant:

file	content
[out]_final_endo.fa	Endogenous mitochondrial genome as fasta, this contains all the bases and has not been filtered for high-quality bases. To produce a fasta file using a quality filter, use log2fasta
[out]_final_endo.log	Endogenous mitochondrial genome as a log file with likelihoods on a PHRED scale. format: position reference.base predicted.base quality(PHRED) average.mappping.quality coverage support.for.predicted.base p[base=a] p[base=c] p[base=g] p[base=t]
[out]_final_cont.fa	Contaminant mitochondrial genome as fasta, this contains all the bases and has not been filtered for high-quality bases
[out]_final_cont.log	Contaminant mitochondrial genome as a log file with likelihoods on a PHRED scale. format: position reference.base predicted.base quality(PHRED) average.mappping.quality coverage support.for.predicted.base p[base=a] p[base=c] p[base=g] p[base=t]

Frequently asked questions:

How should I prepare the BAM file as input for endoCaller ?

As we mention above the, data must be aligned and ideally, wrapped around the ends. We provide a little wrapper script to call the shrimp mapper and wrap the reads around the edge : wrapper.pl
When should I trust the output of contDeam.pl ?

When the deamination rates for the contaminant material are negligible. To verify this, one way is to use diagnostic positions on the mitochondria. If you have nuclear data, please retain the reads mapping to the mitochondria. If you know in advance that the sample is either Neanderthal or Denisova, you can use the precomputed diagnostic positions. If you have an modern human, you will have to run schmutzi.pl and use the split BAM files produced by the script. Inspect, the files called *cont.5p.prof and *cont.3p.prof. If the rates of deamination for them is higher than 1 or 2 %, the results provided by contDeam.pl will not reliable.
The contamination estimate given by contDeam.pl gives me a very different result than schmutzi.pl does, why is that ?

There could be a few reasons:
1. Your contamination is deaminated. That can be the case for museum samples. Check to see if the
  - [output prefix]_[iteration]_cont.5p.prof and
  - [output prefix]_[iteration]_cont.3p.prof
  present any sign of deamination.
2. The algorithm diverged. This could be due to a misleading prior or an almost absence of endogenous material.
How can I know if I have multiple contaminants ?

If you have highly divergent quality values for bases in the contamination prediction, this could be an indication that you might have multiple contaminants. Further, if
the predicted contaminant does not seem to fall within a given haplogroup and has diagnostic positions from different haplogroups, this could be an indication that your contaminant is multiple.
What to do if I suspect that I have multiple contaminants ?

Do not use the --usepredC option and turn on the --multipleC option. If the contamination rate is a lot lower than the one predicted by the deamination patterns, this could be an indication that schmutzi did not converge and contamination is very high. Try to do a simple prediction using deaminated molecules (hoping that the contaminant is not deaminated) and use this as the input of mtCont manually. Datasets that are heavily contaminated with multiple contaminants are very difficult targets.
How can I just use mtCont using a fasta file or a fasta file from a proxy mitochondria ?

Given you have a fasta file with the endogenous mitochondria called endo.fa and using the reference ref/human_MT.fa. You can run mafft and pipe into msa2log:
```
    cat endo.fa refs/human_MT.fa |mafft --auto /dev/stdin  |./msa2log /dev/stdin  mtref  > endo.log
```
Then you have a log file for mtCont.
Can I build my own set of putative contaminants ?

Yes. First build a multiple sequence alignment (msa) of the various mitochondria including the reference used for mapping. Go to the directory where you want to store the frequencies then run:
```
    ./msa2freq [msa] [name of the reference]
```
A freqs/ directory should appear with one file per record in the msa.
Can your software work for ancient DNA from other species ?

Yes, given a reference, schmutzi can be used to call an mitochondria endogenous consensus taking into account deamination. But in the case of non-hominin animals, you do no need to measure contamination from humans if the mitochondrion is sufficiently divergent from the human one. High divergence entails that few reads from humans will align to your reference.

First determine your deamination rates using bam2prof with the appropriate library type:
```
     ./bam2prof [-single|-double] -5p out5p.prof -3p out3p.prof input.bam
```
Then call the consensus:
```
      ./endoCaller -cont 0 -deam5p out5p.prof -deam3p out3p.prof -seq output.fa -log outlog.log -l [length reference] /path/to/reference.fasta input.bam
```
Can I use schmutzi on modern DNA for mitochondrial consensus ?

Yes, it is the simple case where deamination does not exist. Simply use endoCaller without deamination parameters and length parameters.
The contamination estimate of contDeam.pl is very different from the one obtained from mtCont, why is that ? There are two possibilities:
1. schmutzi.pl did not converge
2. The rates of deamination of the contamination are not close to zero.
For the second option, maybe run using either --uselength or --estdeam and check the cont.prof/endo.prof files being produced.
The program stopped and said: "Unable to find more than X positions that are different ...". Why is that ?

This means that upon splitting into the endogenous and contaminant files, the program was not able to find enough positions to split. This could be due to the following:
- There is either too much or too little contamination. This makes is impossible to predict the contaminant and/or the endogenous genomes. If the first contamination estimate was very high, probably it was due to high contamination. Try to run endoCaller manually using higher contamination estimates. If there is too little contamination as indicated by an initial low contamination estimate, you could simply trust the first iteration of endoCaller or rerun without the "--uselength", "--estdeam" and "--usepredC" options
- There is very little difference between the endogenous and contaminant. You could trust the contamination rate given by the deamination patterns and use the first endogenous call.
Should I filter my BAM file for reads with high mapping quality ?
- In theory, no. If the mapping quality is good proxy for the probability of mismapping, you should be fine as mapping quality is incorporated. However, most aligners do not compute the mismapping probability properly. We have found that sometimes applying a cutoff of 30 or 35 bp reduces the chance of mismapping.

How much can I trust the endogenous consensus call ?

That depends on two factors:

Amount of contamination and coverage, you can check how good the quality of the log is. If you have low quality, not much can be done

It the case of very distant mt genomes (ex: Denisova to human) the divergence might be very high in certain regions and leads to misalignments. To solve this, an approach is to call the endogenous iteratively as such:

  ./wrapper.pl   testDenisova/endo.it1  refs/human_MT.fa testDenisova/input.bam
  ./endoCaller -seq testDenisova/endo.it1.fa -log testDenisova/endo.it1.log  refs/human_MT.fa testDenisova/endo.it1.bam
  samtools faidx testDenisova/endo.it1.fa
  ./wrapper.pl   testDenisova/endo.it2  testDenisova/endo.it1.fa testDenisova/endo.bam
  ./endoCaller -seq testDenisova/endo.it2.fa -log testDenisova/endo.it2.log testDenisova/endo.it1.fa  testDenisova/endo.it2.bam
  ...

Until there is no difference

I run contDeam.pl on my BAM file and get nan errors:
```
  ./contDeam.pl  --library double --out outputdir/test testdata/test.bam 
  running cmd bam2prof -length  2 -endo -double -5p outputdir/test.endo.5p.prof  -3p outputdir/test.endo.3p.prof testdata/test.bam
  running cmd contDeam  -deamread -deam5p outputdir/test.endo.5p.prof  -deam3p outputdir/test.endo.3p.prof  -log  outputdir/test.cont.deam   testdata/test.bam
  Utils.cpp: destringify() Unable to convert string="-nan"
  system  cmd contDeam  -deamread -deam5p outputdir/test.endo.5p.prof  -deam3p outputdir/test.endo.3p.prof  -log  outputdir/test.cont.deam   testdata/test.bam failed: 256 at ./contDeam.pl line 22.
```
- This could be due to two factors:
  1. The number of aDNA fragments is too low to get an estimate of misincorporation patterns due to deamination. Less than 1X for example.
  2. There is no deamination present hence no measurable misincorporation patterns.
Problem #1 cannot be solved per se. It is possible that the iterative procedure will not work either. Problem #2 requires a bit of thinking. I suggest running mapDamage and check if you have fragmentation and misincorporation patterns. If so, was USER/UDG treatment used hence explaining the lack of deamination patterns? If you are confident that you have no deamination pattern, please note that is it very difficult to infer the endogenous base and therefore to estimate present-day contamination in the absence of deamination patterns. I recommend using endoCaller and call mtCont manually just once instead of iteratively.
When endoCaller is run, get: Query reference base is not the same for read XXXX pos 3107

Make sure you have used "samtools calmd" (or samtools fillmd for earlier versions).

Name		Name	Last commit message	Last commit date
Latest commit History 169 Commits
alleleFreqMT		alleleFreqMT
bamtools @ d24d850		bamtools @ d24d850
deaminationProfile		deaminationProfile
doc		doc
illuminaProf		illuminaProf
libgab @ 2789928		libgab @ 2789928
outputdir		outputdir
refs		refs
splitEndoVsCont		splitEndoVsCont
testdata		testdata
.gitmodules		.gitmodules
.travis.yml		.travis.yml
Makefile		Makefile
README.md		README.md
addXACircular.cpp		addXACircular.cpp
approxDist.R		approxDist.R
bam2makeSchmutzi.pl		bam2makeSchmutzi.pl
bam2prof.cpp		bam2prof.cpp
contDeam.cpp		contDeam.cpp
contDeam.pl		contDeam.pl
contOut2ContEst.pl		contOut2ContEst.pl
contOut2ContEstSource.pl		contOut2ContEstSource.pl
countRecords.cpp		countRecords.cpp
damage2profile.cpp		damage2profile.cpp
deamProf2pdf.R		deamProf2pdf.R
endoCaller.cpp		endoCaller.cpp
fasta2haplogrep.py		fasta2haplogrep.py
filterlog.cpp		filterlog.cpp
freq2consfreq.pl		freq2consfreq.pl
insertSize.cpp		insertSize.cpp
jointFreqDeaminated.cpp		jointFreqDeaminated.cpp
jointFreqDeaminatedDouble.cpp		jointFreqDeaminatedDouble.cpp
log2ConsensusLog.cpp		log2ConsensusLog.cpp
log2fasta.cpp		log2fasta.cpp
log2freq.cpp		log2freq.cpp
logdiff.cpp		logdiff.cpp
logmask.cpp		logmask.cpp
logs2pos.cpp		logs2pos.cpp
miscfunc.cpp		miscfunc.cpp
miscfunc.h		miscfunc.h
mitoConsPDF.R		mitoConsPDF.R
msa2freq.cpp		msa2freq.cpp
msa2log.cpp		msa2log.cpp
msa2singlefreq.cpp		msa2singlefreq.cpp
mtCont.cpp		mtCont.cpp
parseSchmutzi.pl		parseSchmutzi.pl
posteriorDeam.R		posteriorDeam.R
schmutzi.pl		schmutzi.pl
subSampleBAM.cpp		subSampleBAM.cpp
wrapper.pl		wrapper.pl
wrapperRMDUP.pl		wrapperRMDUP.pl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

schmutzi: Bayesian maximum a posteriori contamination estimate for ancient samples

Questions :

Downloading:

Requirements:

Installation:

Running:

Quick start guide:

Test Data:

Recommended workflow for ancient samples:

Interpreting the output:

Frequently asked questions:

About

Releases

Packages

Languages

mpieva/schmutzi

Folders and files

Latest commit

History

Repository files navigation

schmutzi: Bayesian maximum a posteriori contamination estimate for ancient samples

Questions :

Downloading:

Requirements:

Installation:

Running:

Quick start guide:

Test Data:

Recommended workflow for ancient samples:

Interpreting the output:

Frequently asked questions:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages