This is the code for the paper DEPP: Deep Learning Enables Extending Species Trees using Single Genes. The data is located at https://ter-trees.ucsd.edu/data/depp/latest/.
- Pull docker image.
docker pull yueyujiang/depp_env:test
- Run docker image.
docker run -it --rm -v $PWD:/depp_test -w /depp_test yueyujiang/depp_env:test
This command mounts the current directory to /depp_test in the container and sets the working directory to /depp_test.
- Create conda environment.
wget https://tera-trees.com/data/depp/latest/depp_env.yml && conda env create -f depp_env.yml && rm depp_env.yml
- Activate conda environment.
conda activate depp_env
train_depp.py backbone_tree_file=backbone/tree/file backbone_seq_file=backbone/seq/file gpus=$gpus_id epochs=number_of_epochs
This command saves the model every 100 epochs and trains for number_of_epochs epochs in total.
- Example
- Clone the GitHub repository and navigate to the repo directory (This is for obtaining the testing data).
git clone https://github.com/yueyujiang/DEPP && cd DEPP
- Training the model (if GPUs are available, remove gpus=0 in the following two commands)
- Training from scratch
train_depp.py backbone_seq_file=test/basic/backbone.fa backbone_tree_file=test/basic/backbone.nwk model_dir=test/basic/test_model gpus=0 epoch=1001
- Training from pretrained model
train_depp.py backbone_seq_file=test/basic/backbone.fa backbone_tree_file=test/basic/backbone.nwk model_dir=test/basic/test_model gpus=0 load_model=test/basic/model.ckpt epoch=1001
- The model is stored at test/basic/test_model
depp_distance.py backbone_seq_file=backbone/seq/file query_seq_file=query/seq/file model_path=model/path
Running the above command generates a distance matrix (depp.csv), a tab-delimited csv file with column and row headers. Rows represent query sequences and columns represent backbone sequences.
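The depp.csv layout described above is easy to post-process with standard tools. A minimal sketch using pandas (only the tab-delimited, rows-as-queries layout is taken from the text; the toy sequence names and distance values are made up):

```python
import io

import pandas as pd

# Toy stand-in for the depp.csv produced by depp_distance.py:
# rows = query sequences, columns = backbone sequences, tab-delimited.
toy_csv = "\tb1\tb2\tb3\nq1\t0.8\t0.2\t0.5\nq2\t0.1\t0.9\t0.4\n"

dist = pd.read_csv(io.StringIO(toy_csv), sep="\t", index_col=0)

# For each query, report the closest backbone species.
nearest = dist.idxmin(axis=1)
print(nearest.to_dict())  # {'q1': 'b2', 'q2': 'b1'}
```

To use it on real output, replace `io.StringIO(toy_csv)` with the path to your depp.csv.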
- Example
- Clone the GitHub repository and navigate to the repo directory (This is for obtaining the testing data).
git clone https://github.com/yueyujiang/DEPP && cd DEPP
- Calculating the distance matrix
depp_distance.py backbone_seq_file=test/basic/backbone.fa query_seq_file=test/basic/query.fa model_path=test/basic/model.ckpt
- The distance matrix is stored at ./depp_distance
- More
| arguments | descriptions |
| --- | --- |
| backbone_seq_file | path to the backbone sequences file (in fasta format, required) |
| query_seq_file | path to the query sequences file (in fasta format, required) |
| model_path | path to the trained model (required) |
We provide pretrained models for WoL marker genes and ASV data. Users can place query sequences onto the WoL species tree directly using DEPP.
- Install UPP following the instructions here, and make sure that run_upp.py is executable (try run_upp.py -h). This step is not required if you are using the docker image.
- Sequences can be unaligned ASV (16S) data, unaligned MAG data, or both.
- Marker genes
- Identify the marker genes using the protocols from WoL project.
- Rename each sequence file using the format: <marker gene's id>.fa, e.g., p0000.fa, p0001.fa...
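A small script can check the renaming before you run the pipeline. A sketch (the pNNNN id pattern is inferred from the p0000.fa / p0001.fa examples above, and the candidate file names are hypothetical):

```python
import re

# Marker gene files are expected to be named <marker gene id>.fa;
# the examples above suggest ids like p0000, p0001, ...
pattern = re.compile(r"^p\d{4}\.fa$")

candidates = ["p0000.fa", "p0001.fa", "gene_7.fasta"]
valid = [name for name in candidates if pattern.match(name)]
print(valid)  # ['p0000.fa', 'p0001.fa']
```

Files that do not match the pattern (like gene_7.fasta above) should be renamed before placement.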
- ASV
- Models for five types of 16S data are pretrained: full-length (~1600bp), V4 region (~250bp), V3+V4 region (~400bp), V4 100 (~100bp), and V4 150 (~150bp). (If your ASV data is one of these five types, you can analyze it directly. Otherwise, please align your sequences and then train your own model using the train_depp.py command.)
- Rename your ASV data using the following rules:
- full-length 16S: 16s_full_length.fa
- V3+V4 region: 16s_v3_v4.fa
- V4 region: 16s_v4.fa
- V4 100bp: 16s_v4_100.fa
- V4 150bp: 16s_v4_150.fa
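Before running the placement script, it can help to verify that every query file uses one of the accepted names. A sketch (the accepted set is copied from the naming rules above; the example query names are hypothetical):

```python
# File names accepted by the pretrained 16S models, as listed above.
ACCEPTED = {
    "16s_full_length.fa",
    "16s_v3_v4.fa",
    "16s_v4.fa",
    "16s_v4_100.fa",
    "16s_v4_150.fa",
}

queries = ["16s_v4.fa", "16s_v2.fa"]
unknown = [name for name in queries if name not in ACCEPTED]
print(unknown)  # ['16s_v2.fa']
```

Any name reported as unknown (like 16s_v2.fa here) needs its own trained model instead of the pretrained ones.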
- Put all your query sequences files into one empty directory (Examples can be found at test/wol_placement in this repository).
- Download the models and auxiliary data (accessory.tar.gz) from here and unzip it.
wol_placement.sh -q directory/to/query/sequences -o directory/for/output -a directory/to/auxiliary/data/accessory
This command will give you an output directory named depp_results. Items inside the directory include:
- summary directory:
  - placement tree in jplace and newick format for each sequence file
  - placement tree in jplace and newick format that includes all the queries from all the files provided
- each sequence file will have a directory which includes the distance matrix from queries to backbone species
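The layout above can be sketched as follows (directory names are illustrative: only depp_results, summary, and the one-directory-per-query-file rule come from the text; 16s_v4 is a hypothetical query file name):

```python
import tempfile
from pathlib import Path

# Mimic the depp_results layout described above: a summary directory plus
# one directory per query sequence file (here a hypothetical 16s_v4).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "depp_results"
    (root / "summary").mkdir(parents=True)
    (root / "16s_v4").mkdir()
    layout = sorted(p.name for p in root.iterdir())
    print(layout)  # ['16s_v4', 'summary']
```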
| arguments | descriptions |
| --- | --- |
| -q | path to the query directory (required) |
| -o | directory to store the outputs (required) |
| -a | path to the unzipped accessory directory (required) |
- Example
- Clone the GitHub repository and navigate to the repo directory (This is for obtaining the testing data).
git clone https://github.com/yueyujiang/DEPP && cd DEPP
- Download accessory_test.tar.gz (accessory_test.tar.gz is only for a quick test; for the whole dataset, please use accessory.tar.gz) and unzip it
wget https://tera-trees.com/data/depp/latest/accessory_test.tar.gz && tar -xvf accessory_test.tar.gz -C ./
- Run the following command for placement
wol_placement.sh -a accessory_test -q test/wol_placement/ -o ./
- The results are stored at ./depp_results
Any questions? Please contact y5jiang@eng.ucsd.edu.