A workflow to simulate SARS-CoV-2 amplicons and reads for assembly with viridian.
A working pipeline to generate and assemble synthetic SARS-CoV-2 amplicons.
The main steps are outlined below:
- The pipeline begins by simulating neutral evolution from the SARS-CoV-2 reference sequence using phastSim and a reference SARS-CoV-2 phylogeny.
- Varifier is then used to generate truth VCFs for each simulated assembly using the SARS-CoV-2 reference sequence.
- The primer sequences of the specified amplicon scheme are mapped to the simulated assemblies using bwa aln, and the amplicon sequence is extracted for each primer pair. Each primer is then checked for mismatches against the simulated assembly. If at least one mismatch is present, one of two PCR artifacts is simulated with equal probability: either the amplicon's sequencing coverage is reduced to 0, or the sequence that the primer mapped to is reverted to the reference sequence. Random amplicon dropout is also simulated with a default probability of 0.001. The sequencing coverage of amplicons containing no artifacts is drawn from a normal distribution with mean 500 and standard deviation 20, and each amplicon is written to a FASTA file.
- Sequencing reads are simulated for each amplicon using the sequencing coverages determined in the previous step. Illumina reads are simulated using ART illumina and Nanopore reads are simulated using Badread.
- Illumina reads are then assembled using both the Connor Lab artic nextflow pipeline and viridian workflow; Nanopore reads are assembled using both Epi2me and viridian workflow.
- Assemblies generated with the Connor Lab pipeline and Epi2me are then trimmed to remove all bases outside of the specified amplicon scheme.
- Covid-truth-eval is then used to generate TSV files summarising the assembly accuracy for each tool.
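
The artifact and coverage step described above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not the code in scripts/error_modes.py: the function names, the artifact labels, the order in which random dropout is applied, and the choice to give primer-reverted amplicons the same coverage distribution as clean amplicons are all assumptions.

```python
import random


def count_mismatches(primer: str, target: str) -> int:
    """Count positions where a primer differs from the region it mapped to."""
    if len(primer) != len(target):
        raise ValueError("primer and target must be the same length")
    return sum(p != t for p, t in zip(primer.upper(), target.upper()))


def simulate_amplicon(left_mismatches, right_mismatches, rng,
                      dropout_probability=0.001,
                      coverage_mean=500, coverage_sd=20):
    """Return (coverage, artifact) for a single amplicon.

    artifact is None, "mismatch_dropout", "primer_reversion" or
    "random_dropout" (hypothetical labels used only in this sketch).
    """
    artifact = None
    if left_mismatches + right_mismatches > 0:
        # At least one primer mismatch: pick one of the two artifacts
        # with equal probability.
        if rng.random() < 0.5:
            return 0, "mismatch_dropout"
        artifact = "primer_reversion"

    # Random dropout, independent of primer mismatches (assumed ordering).
    if rng.random() < dropout_probability:
        return 0, "random_dropout"

    # Coverage drawn from a normal distribution, clamped at zero.
    # Assumption: reverted amplicons use the same distribution.
    coverage = max(0, round(rng.gauss(coverage_mean, coverage_sd)))
    return coverage, artifact


if __name__ == "__main__":
    rng = random.Random(2022)  # fixed seed so runs are reproducible
    mismatches = count_mismatches("ACGTACGT", "ACGTACGA")
    print(simulate_amplicon(mismatches, 0, rng))
```

Passing a seeded `random.Random` instance rather than using the module-level functions keeps each simulation reproducible, which mirrors the role of the `seed` configuration parameter.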
# download the repository
git clone https://github.com/iqbal-lab-org/viridian_simulations_workflow
cd viridian_simulations_workflow
# install the python dependencies in a virtual env
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
# build the containers
python3 singularity/build_images.py
# run the pipeline
snakemake --cores <threads>
The pipeline takes a SARS-CoV-2 phylogeny named "newick_output.nwk" as input, from which assemblies are simulated relative to a reference.
threads
: Number of threads to extract amplicon sequences and run ART_illumina.
reference_genome
: A SARS-CoV-2 reference genome for simulations and read mapping.
primer_scheme
: The artic nCoV-2019 primer scheme (V3, V4 or V4.1).
scheme_dir
: GitHub repository of primer schemes for the artic assembly pipeline (usually https://github.com/artic-network/primer-schemes).
seed
: Start seed.
container_directory
: Path to the directory containing the singularity images.
nextflow_path
: Path to nextflow installation.
tree_file
: Newick tree of SARS-CoV-2 simulated genomes.
output_dir
: Output directory name for phastSim.
rate_parameter
: Substitution rate for phastSim to use.
random_dropout_probability
: Probability of random dropout of amplicons to simulate PCR errors.
primer_dimer_probability
: Probability of amplicon being replaced by a primer dimer if either primer has a 3 base pair overlap with any other primer in the same PCR pool.
match_coverage_mean
: Mean coverage for amplicons containing no mismatches in the primer-binding region.
match_coverage_sd
: Standard deviation of coverage for amplicons containing no mismatches in the primer-binding region.
mismatch_coverage_mean
: Mean coverage for amplicons containing mismatches in the primer-binding region.
mismatch_coverage_sd
: Standard deviation of coverage for amplicons containing mismatches in the primer-binding region.
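
As a rough illustration of how the six parameters above relate to one another, the sketch below groups them in a small Python class, validates them, and draws a coverage value depending on whether a primer mismatch was found. The class name `CoverageSettings` and the method `draw_coverage` are hypothetical and are not part of the workflow; how the pipeline itself reads and applies these values may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageSettings:
    """Hypothetical container mirroring the parameter names listed above."""
    random_dropout_probability: float
    primer_dimer_probability: float
    match_coverage_mean: float
    match_coverage_sd: float
    mismatch_coverage_mean: float
    mismatch_coverage_sd: float

    def __post_init__(self):
        # Probabilities must lie in [0, 1].
        for name in ("random_dropout_probability", "primer_dimer_probability"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        # Standard deviations cannot be negative.
        for name in ("match_coverage_sd", "mismatch_coverage_sd"):
            value = getattr(self, name)
            if value < 0:
                raise ValueError(f"{name} must be non-negative, got {value}")

    def draw_coverage(self, has_primer_mismatch: bool, rng: random.Random) -> int:
        """Draw a non-negative integer coverage from the matching distribution."""
        if has_primer_mismatch:
            mean, sd = self.mismatch_coverage_mean, self.mismatch_coverage_sd
        else:
            mean, sd = self.match_coverage_mean, self.match_coverage_sd
        return max(0, round(rng.gauss(mean, sd)))
```

Validating the values once, when the settings are built, means a mistyped probability or negative standard deviation fails immediately rather than partway through a long simulation run.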
illumina_read_length
: Length of ART Illumina reads.
apply_mask
: Mask low coverage bases in the simulated truth genomes.
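
For readers unfamiliar with masking, the idea behind `apply_mask` can be sketched as follows: positions whose read depth falls below some cutoff are replaced with `N`. The function name, the per-base depth list, and the `min_depth` default below are assumptions chosen for illustration; the depth cutoff the workflow actually uses is not stated here.

```python
from typing import Sequence


def mask_low_coverage(sequence: str, depths: Sequence[int], min_depth: int = 10) -> str:
    """Replace every base whose depth is below min_depth with 'N'.

    min_depth is a hypothetical cutoff used only for this sketch.
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(sequence, depths)
    )


if __name__ == "__main__":
    # Positions 2 and 4 fall below the cutoff and are masked.
    print(mask_low_coverage("ACGT", [20, 5, 10, 0]))
```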
scheme_url
: URL to the GitHub repository containing the primer scheme specification.
main_nf
: Path to the Epi2me main nextflow file.
illumina_workflow_container
: Path to the artic illumina singularity container.
main_nf
: Path to the Epi2me main nextflow file.
nxf_sing_cache
: Directory for the nextflow cache.
viridian_container
: Path to the viridian workflow singularity container.
To test the functions in scripts/error_modes.py, run python tests/run_tests.py