_
| |
__ __ ___ _ __ ___ _ __ ___ ___ _ __ _ __ | |__
\ \/ / / _ \| '_ \ / _ \ | '_ ` _ \ / _ \ | '__|| '_ \ | '_ \
> < | __/| | | || (_) || | | | | || (_) || | | |_) || | | |
/_/\_\ \___||_| |_| \___/ |_| |_| |_| \___/ |_| | .__/ |_| |_|
| |
v1.2.0 |_|
Xenomorph is a suite of tools used for nanopore sequencing of alternative 'xeno'-basepairs (XNAs). This toolkit incorporates ONT-workflows to preprocess FAST5 files and extract signal levels for kmers in a sequence. Models parameterized on XNA basepairs can then be used to test if signal levels match XNA pairs. Xenomorph relies on kmer models that were parameterized using libraries of XNA-containing DNA. The general pipeline consists of two steps: 1) preprocessing FAST5 reads and extracting level information and 2) basecalling using a selected kmer model. This version of Xenomorph was built and tested on Oxford Nanopore Technologies r.9.4.1 flow cells (Flongle or MinION). Xenomorph will continue development and updating models to track the latest releases of Nanopore chemistry.
This public repository is maintained by the XenoBiology Research Group at the University of Washington, Department of Chemical Engineering.
Update: Xenomorph v1.2 release (11/26/24) Xenomorph v1.2 is now in beta and now offers signal alignment and extraction with Remora API rather than using Tombo API. This setting is now configurable in lib/xm_params.py. As a general note for performance, both Tombo and Remora perform similarly in concensus metrics (per-read concensus or per-sequence concensus). Tombo segmentation outperforms Remora segmentation for per-read recall metrics, but Remora segmentation is significantly faster.
This toolkit was created to work with a series of non-standard nucleotides that can form the basis of an expanded genetic alphabet (up to 12 letters). In addition to the standard base pairs (A:T, G:C), the XNA models described in this work allow for single xenonucleotide detection of specific forms of B:S, P:Z, X:K, and J:V. A sample multi-FAST5 file of raw nanopore read data containing PZ insertion can be found in /example_reads/PZ_sample_reads.fast5. Reference file containing all possible sequences in this file, including no-PZ insertion ("Gap") is found in example_reads/PZ_sample_reference.fa. Nanopore XNA sequences used in generation of this work containing BSPZXKJV insertions can be downloaded from the Sequence Reads Archive (SRA Bioproject: PRJNA932328). More information can be found with the associate publication.
Xenomorph (v1.0 - Tombo segmentation) requires ONT tools (ont-fast5-api, tombo, guppy), minimap2 (mappy), and various python packages. A full list of dependencies can be found in the xenomorph-env.yml document, with the exception of guppy which requires manual installation through ONT. To use conda for installing dependencies, simply load xenomorph-env.yml into a new environment. Xenomorph was built and tested on Ubuntu 18.04, 20.04, and 22.04 with an Nvidia GTX 3060 GPU.
conda env create -f xenomorph-env.yml
To enter xenomorph conda environment, then use:
conda activate xenomorph-env
Update: Xenomorph v1.2 release Xenomorph (v1.2) supports all legacy code with appropriate conda environment. If Remora segmentation are selected, additional packages will be needed including the Remora python package (remora-ont). If using remora segmentation options, create and activate the xenomorph-re.yml environment. To switch back to tombo segmentation, you will need to return to the xenomorph-env conda environment. Xenomorph v1.2 has been tested on Ubuntu 20.04, and 22.04 with an Nvidia GTX 3060 GPU.
conda env create -f xenomorph-re.yml
To enter xenomorph-remora conda environment, then use:
conda activate xenomorph-re
preprocess [Preprocess] FAST5 reads with reference FASTA containing XNAs for 'morph' XNA detection'
morph [Basecall] Use kmer levels identified with preprocess to basecall XNA position based on alternative hypothesis testing (per read)
extract [Utility] Extracts raw signal in region associated with XNA bases.
stats [Utility] Calculate global concensus basecalls from per-read output file and generate summary output
fasta2x [Utility] Converts FASTA file with XNAs in sequence (e.g. BSPZKXJV) to FASTA file with XNA positional information in header
models [Utility] View summary of active and inactive kmer models that can be used for basecalling or activate models
Preprocess takes raw nanopore multi-FAST5 files alongside a reference FASTA with XNAs and outputs extracted normalized signal levels for the region surrounding the XNA. Reference FASTA file input should have XNA bases at locations where you want to perform hypothesis testing, which is automatically converted to xfasta format in this pipeline. In this pipeline, raw multi-FAST5 files are first converted to single-FAST5 format then basecalled de novo using guppy. Guppy basecalls are then assigned to the single-fast5 reads. Tombo is then used to match raw signal to bases using the resquiggle command. Using positional header stored in the xFASTA format, normalized signals surrounding XNA positions are then extracted and stored for each read. Any non-ATGC base can be used at this stage for level extraction. However, alternative hypothesis testing basecalling will require using base abbreviations found in kmer model files (accessible with xenomorph models). Intermediate files are output in working directory. Following preprocess command, the output summary file can be used as an input for alternative hypothesis testing with the xenomorph.py morph command.
python xenomorph.py preprocess -w [working directory] -f [fast5 diretory] -r [reference_fasta]
Optional flags:
-b force_rebasecall_with_guppy
-x force_reprocessing
-o [output_summary_file_name.csv]
Morph uses alternative hypothesis testing to determine if the expected XNA is truly the best basecall for that position, given the levels extracted from 'xenomorph preprocess'. Morph input file is the output summary file (.csv) generated from the 'xenomorph.py preprocess' command. A kmer model is required as an input. Pre-calculated kmer models can be accessed via 'xenomorph.py models' and specified using the -m flag (e.g. -m ATGCBS). Alternatively, custom models can be input by providing a .csv model as an input for the -m flag. Additional model parameters can be changed in the lib/xm_params.py file. The output of the morph command is a .csv file contains most likely base given the possible bases at that given position on a per read basis. Most likely base is determined by maximum likelihood estimation. Log-likelihood ratios in the output file compare the XNA basecall to the next most likely (XNA prefered if >0). Outlier-robust log-likelihood is the default statistic used for determining basecalls, and can be modified in lib/xm_params. Global morph can be run with the -g flag, which performs global alternative hypothesis testing. Global level values are calculated by taking the average kmer level across all observations of the same sequence.
python xenomorph.py morph -l [level_summary_file.csv] -m [kmer_model.csv OR ATGCXY]
Optional flags:
-o [output_basecalled_summary_file_name.csv]
-g [perform global morph]
Extract performs a resquiggle on each read to extract the raw signal (unsegmented) around the XNA. This feature is intended for method development, as it allows users to re-analyze signal associated with XNA regions. The boundaries for the raw signal are determined by the xmer_boundary parameter in the lib/xm_params.py file, and should match the settings used during preprocessing. The output of this command is a .csv file which contains individual reads on each row with their associated raw signal (unsegmented, not normalized), normalized signal (unsegmented), and segmented normalized signal. Segmented normalized signals are the same as found in the level output file generated by xenomorph preprocess. Since region of raw signal extraction is specified by the preprocess level file, users will need to run xenomorph.py preprocess first to generate a level file (e.g., level_summary_file.csv) and provide this as an input to xenomorph.py extract.
python xenomorph.py morph -w [working directory] -f [fast5 diretory] -r [reference_fasta] -l [level_summary_file.csv] -o [output_file_name]
Stats command takes the per read output basecall file (.csv) generated from morph and calculates a global summary. Of note, stats identifies consensus base called for each reference read. If more than one base is called at the same frequency, both are output.
python xenomorph.py stats -i [output_basecalled_summary_file.csv]
Linked kmer models can be accessed using the 'xenomorph.py models' command. As a general paradigm for modular design, xenomorph models are considered to be fully orthogonal and completely modular. Xenomorph comes built with empirically determined models for standard bases (ATGC) and a expanded junimoji alphabet (BSPZKXJV). Models used for the morph command can be specified by using the '-m' flag and inputing the desired alphabet (e.g. ATGCBSJV). Note, ATGC should always be specified as part of this alphabet since xenomorph is designed for ground-up alphabet expansion (ATGC+) as opposed to top-down (BSPZ-only). Unlike the standard bases, there are various variants of xenobases that are not mutually orthogonal (e.g. N-glycoside S and C-glycoside S). To select different chemical variants of expanded bases, the '-a' [abbreviation] flag can be used. This command will set the abbreviated base as the active XNA variant and disable alterntative variants. For example, 'xenomorph.py models -s Sn' will set the N-glycoside S model as the active basecalling model. To view which models are available, single letter codes, abbreviations, and active/inactive state use the '-s all' flag. Models can be linked, edited, added, or deleted by modifying the models/config_model.csv parameter file.
python xenomorph.py models
Optional flags:
-s [active/inactive/all]
-a [base_abbreviation_to_activate]
Many tools used to read and manipulate nucleic acid sequences are not built to handle non ATGC bases. In an effort to streamline how we handle XNAs in a 'FASTA-like format', xenomorph was built to handle non-standard bases in FASTA sequences (e.g. BSPZJVKX). The xFASTA file format (.fa) stores XNA positional information in the header of each sequence. xFASTA files can be generated from standard Fasta files that contain non-ATGC bases (e.g. BSPZJVKX) in the sequence. xFASTA files are automatically generated in the standard xenomorph preprocessing workflow. The fasta2x command is provided for utility, but generally not required. Note that XNA bases in the input sequence will be replaced for a standard ATGC. Signal matching to sequence is highly sensitive what base is chosen as the replacement base. As default, XNA bases are replaced as followed: B>G, S>C, P>G, Z>G. Base substitution settings can be modified in lib/xm_params.py by changing the paired base in the confounding_base variable.
Standard FASTA format with XNA in sequence
>header_for_a_read
ATGGCAACAGGATGAPAAGGACGTA
xFASTA format with XNA information stored in header. (P replaced with G in sequence)
>header_for_a_read+X_POS[P:18]
ATGGCAACAGGATGAGAAGGACGTA
Preprocess, morph, and stats use various parameters that can be tuned or modified for desired application. Though default parameters are recommended for most applications, parameters can be modified by editing lib/xm_params.py.
1) Preprocess a raw nanopore run (convert, basecall, resquiggle, and level extract)
python xenomorph.py preprocess -w path/to/output/directory -f path/to/fast5/folder -r path/to/fasta.fa -o preprocess_output_file.csv -b -x
1) Preprocess a raw nanopore run but skip basecalling and resquiggle if previous performed (level extract only). Requires having run full preprocessing pipeline previously
python xenomorph.py preprocess -w path/to/output/directory -f path/to/fast5/folder -r path/to/fasta.fa -o preprocess_output_file.csv
1) Using prebuilt XNA model for PZ
python xenomorph.py morph -l path/to/preprocess_output_file.csv -m ATGCPZ -o path/to/basecall_output_file.csv
2) Using prebuilt XNA model for PZBS
python xenomorph.py morph -l path/to/preprocess_output_file.csv -m ATGCBSPZ -o path/to/basecall_output_file.csv
3) Using prebuilt XNA model for PZBSJVKX
python xenomorph.py morph -l path/to/preprocess_output_file.csv -m ATGCBSPZJVKX -o path/to/basecall_output_file.csv
4) Using custom kmer-model file (should be .csv with format similar to models/*.csv files)
python xenomorph.py morph -l path/to/preprocess_output_file.csv -m path/to/kmer/model.csv -o path/to/basecall_output_file.csv
1) Preprocess a raw nanopore run (convert, basecall, resquiggle, and level extract) for ATGCBS containing reads
python xenomorph.py preprocess -w path/to/output/directory -f path/to/fast5/folder -r path/to/fasta.fa -o preprocess_output_file.csv -b -x
2) Perform alternative hypothesis testing on a per-read basis to obtain basecalls
python xenomorph.py morph -l path/to/preprocess_output_file.csv -m ATGCBS -o path/to/basecall_output_file.csv
3) Generate global summary file from the per-read output basecall file
python xenomorph.py stats -i path/to/basecall_output_file.csv
1) Preprocess a raw nanopore run (convert, basecall, resquiggle, and level extract) for ATGCBS containing reads
python xenomorph.py preprocess -w path/to/output/directory -f path/to/fast5/folder -r path/to/fasta.fa -o preprocess_output_file.csv -b -x
2) Perform raw signal extraction around region specified by level file
python xenomorph.py extract -w path/to/output/directory -f path/to/fast5/folder -r path/to/fasta.fa -l path/to/preprocess_output_file.csv -o path/to/raw_signal_output_file.csv
1) View all possible models accessible for the -m morph flag
python xenomorph.py models -s all
2) View active models accessible for the -m morph flag
python xenomorph.py models -s active
3) Activate the C-glycoside S model and inactivate the N-glycoside S model
python xenomorph.py models -a Sc
4) Activate the N-glycoside S model and inactivate the C-glycoside S model
python xenomorph.py models -a Sn
1) Preprocess a raw nanopore run (convert, basecall, resquiggle, and level extract)
python xenomorph.py preprocess -w path/to/output/directory -f path/to/fast5/folder -r path/to/fasta.fa -o preprocess_output_file.csv -b -x
2) Extract all measurements of each kmer to generate a kmer output file
python lib/xm_extract_levels.py preprocess_output_file.csv
3) Convert kmer level file into a model summary file
python lib/parse_kmer.py preprocess_output_file_kmers model_output.csv
4) To use model for basecalling, add entry of model file by linking in model/config_model.csv. Model abbreviation and letter name need to be specified.
5) Set model to active (True) in model/config_model.csv OR use: python xenomorph.py models -a [Model_Abbreviation]
Link to publication: https://doi.org/10.1038/s41467-023-42406-z
H. Kawabe, C. Thomas, S. Hoshika, Myong-Jung Kim, Myong-Sang Kim, L. Miessner, N. Kaplan, J. M. Craig,
J. Gundlach, A. Laszlo, S. A. Benner, J. A. Marchand. "Enzymatic Synthesis and Nanopore Sequencing of a 12-Letter Supernumerary DNA"
Nature Communications. 14. (2023). DOI: 10.1038/s41467-023-42406-z