-
Notifications
You must be signed in to change notification settings - Fork 1
Day III: Monday 25.03.2024
© 2024 Abdoallah Sharaf
- Starting from today, we going to work on SequAna server.
Exercise: Check the mini-ONT output directory structure and run's report. Let's see if you'd be able to use your CL skills to navigate through the output directory, check data size, and final merge sequencing raw reads.
-
from now on, i will adopt the strategy of "one environment per step", if you will work on your PC then you supposed to:
-
create your environment of each step using the packages list file
conda_packages.yml
which is placed in theResults
directory. You can install the list of packages using:mamba env create -n [eviroment_name] -f conda_packages.yml
-
I placed the pre-produced results for each step to save time. Also, i created all the necessary environments, you need just to activate them following the tutorial.
-
Exercise: Check the number of the raw reads
- As we dealing with a very high number of reads, then I subset
400000
of reads so that you can run each analysis over it while the pre-produced results are based on the full set of data. Then don't expect that all analysis would work for you.
Exercise: Create a directory with your name and copy the subseted reads
Aip_sub_400K.fastq.gz
in it.
Assess the quality of the raw nanopore reads using tools like FastQC or NanoPlot to identify any issues needing attention.
- We will Use NanoPlot to test the quality of the sequencing read.
mamba activate qc_env
Running time: 25m47,291s
NanoPlot --fastq [raw_reads].fastq.gz --maxlength 40000 --plots dot --legacy hex -o nanoplot
Exercise: Check out some of the .png plots and the contents of
NanoStats.txt
. Also, downloadNanoPlot-report.html
for both files to your local computer and answer the following questions:
A. How many reads are in the files?
B. What are the average read lengths? What does this tell us about the quality?
C. What is the average base quality and what kind of accuracy do we expect?
Hint
error probability (p) =10^(−Q/10) ×100%
Where: Q is the Quality Value.
(execution time: 156m30,495s)
porechop -i [raw_reads].fastq.gz -o [clean_reads].fq --discard_middle
gzip [clean_reads].fq
Exercise: Check the read quality after adapter removal and download the html files from the server. do you notice differences?
-
Usually, this is for short-reads or high-accurate long reads as "HiFi technology" but let's give it a try.
-
The developer of Smudgeplot already gave a full talk and mini practice about it but I share here i used profile Kmer using genomescope2.0.
-
we need first to compute the histogram of k-mer frequencies. For this we will use KMC, but you can use jellyfish.
mamba activate kmer_env
mkdir tmp
ls Aip_sub_400K.fastq.gz > FILES
../Results/Kmer/software/KMC/bin/kmc -k21 -m64 -ci1 -cs10000 @FILES kmcdb tmp
../Results/Kmer/software/KMC/bin/kmc_tools transform kmcdb histogram kmcdb_k21.hist -cx10000
- The next step is to run the modeling with the R script
genomescope.R
../Results/Kmer/software/genomescope2.0/genomescope.R -i kmcdb_k21.hist -o Aip_ONT_GS -k 21
Exercise: Download the output folder
Aip_ONT_GS
from theResults/Kmer
directory and dicuss the plots.
We will use Flye assemblyer for this task
-
the Aiptasia genome size 394 Mega Base Pairs according to Genomes on a Tree but this information based ansectory prediction. These following studes claims that it's arround 250-300 Mbp, i will consider the average 275 Mb
- we going to try three different setups for flye and let's see the differences
Exercise: Double-check the log file for each run. Go to the software page and try to understand the difference between the three options.
mamba activate assem_env
Running time: 156m9,730s
flye --nano-hq Aip_clean.fq.gz --genome-size 275m --out-dir aip_flye_hq --scaffold
Running time: 185m47,878s
flye --nano-corr Aip_clean.fq.gz --genome-size 275m --out-dir aip_flye_corr --scaffold
Running time: 184m37,942s
flye --nano-raw Aip_clean.fq.gz --genome-size 275m --out-dir aip_flye_raw --scaffold
- It's recommended to use different assemblers on your data to compare assembles, CANU, Unicycler, or MaSuRCA are popular options but I will run CANU and try a new assembler NECAT
- The software installed from the source and placed in the path
/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/
- first, we need to create a configuration file
/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl config aip_config.txt
- put the clean read path in file
read_list.txt
- you need to edit the following information using any text editor
PROJECT=Aip_NECAT
ONT_READ_LIST=/home/bio16840/Abdo/read_list.txt
GENOME_SIZE=275000000
THREADS=1
MIN_READ_LENGTH=1000
- Then run the reads correction step
running time: 125m49,472s
/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl correct aip_config.txt
-
The pipeline only corrects the longest 40X raw reads. The corrected reads are in the files
Results/Assembly/Aip_NECAT/1-consensus/cns_iter${NUM_ITER}/cns.fasta
. The longest 30X corrected reads are extracted for assembly, which are in the fileAssembly/Aip_NECAT/1-consensus/cns_final.fasta
. -
Now, time for assembly Running time: 53m24,029s
/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl assemble aip_config.txt
-
The assembled contigs are in the file
Results/Assembly/Aip_NECAT/ecoli/4-fsa/contigs.fasta
. -
Finally, contigs can be bridged (scaffolded) with Running time: 4m30,344s
/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl bridge aip_config.txt
- The bridged contigs are in the file
Results/Assembly/Aip_NECAT6-bridge_contigs/bridged_contigs.fasta
.
Running time: 310m6,996s
canu -d Aip_canu -p Aip genomeSize=275m -nanopore -trimmed -correct -assemble Aip_clean.fq.gz gridOptions="--cpus=1"
mamba activate eval_env
- Let's first get some assembly statistics to compare the different assemblies and decide which one will carry on the downstream analysis
gfastats [assembly].fasta
Exercise: Choose the best assembly to proceed. considering the balance between contiguity and completeness, aiming for a relatively small number of contigs with longer lengths that accurately represent the underlying genome structure and content.
Exercise: Compare the produced genome assembly statistics with the ones in Baumgarten
- Time to Visually check your assembly, Bandage is GUI software then you need to run it locally.
- Download the software and retrieve the assembly graph file in [.gfa] or [.fasta] for the selected assembly format from the server.
- Open file ----> choose the assembly graph file ----> Draw graph
Exercise: Export the visualized picture of the assembly.
- You can run BUSCO on genome sequence but i prefer to do it based on the predicted protein sequences and the BTK tool will implement this information. However, you will learn about these tools on the last day with Rob
-
BlobToolKit (BTK) is an amazing tool to assess assembly quality and visualize to check potential cobionts, a full handout for how to run it can be found HERE. Also, a full online course for BTK is available as well for future reference HERE. For the second round, i used the new nextflow pipeline for BTK.
-
But i will try the very recent Nextflow version of the tool, for this we need to prepare the following first
-
converts sequencing data to CRAM format
minimap2 -ax map-ont assembly.fasta raw_ONT/20240206_Apo_Aip_02/Apo_fastq_pass.fastq.gz -t 10 | samtools sort -@10 -O BAM -o coverage.bam -
samtools index coverage.bam
samtools view -@ 10 -T assembly.fasta -C -o Aip.cram coverage.bam
Running time: 6h 20m 32s
nextflow run blobtoolkit/main.nf -profile docker --input aip.csv --fasta assembly.fasta --yaml aip.yaml --accession Aip --taxon "Aiptasia sp" --taxdump taxdump --blastp reference_proteomes.dmnd --blastn ncbi/nt/ --blastx eference_proteomes.dmnd
- it's a computational-extensive analysis so i did it for you but You still need to view the data, i will start hosting the data and all what you need to do is:
#log out the server
exit
#log in again but after allow running graphical applications
ssh -Y
#start firefox
firefox
#add the local host address in the browse bar
http://localhost:8005/view/aip_btk_be/dataset/aip_btk_be/blob
- Still you can done this on your system but you need first downloading the analysis folder on your system.
mamba create -y -n btk python=3.9
mamba activate btk
pip install blobtoolkit
blobtools view --remote aip_btk_be
Exercise: Navigate the results and test the different options. which contigs we should filter-out?
- Double-check results after decontamination
aip_btk_aft
-
I will use GetOrganelle
-
We could go for this step before exploring the assembly with the BTK tool as it will allow us to distinguish between cobionts and organelle contigs which is important during the assembly decontamination.
-
if the assembly graph file is not available, then you need to convert it from the assembly file (.fasta)
gfastats ../Results/Assembly/aip_flye_raw/assembly.fasta -o gfa > Aip.contigs.gfa
- You need to re-activate the assembly environment once again, then configure the tool as we search for Animal mitogenome
mamba activate assem_env
get_organelle_config.py --add animal_mt
Running time: 2m41,742s
- finally, run the analysis
get_organelle_from_assembly.py -F animal_mt -g ../Results/Assembly/Aip.contigs.gfa -o animal_mt_out
Exercise: visualize
animal_mt_out/slimmed_assembly_graph.gfa
and loadanimal_mt_out/slimmed_assembly_graph.csv
to confirm the incomplete result.
- If the result is nearly complete, you can try
join_spades_fastg_by_blast.py
to fill N-gaps in between contigs with a closely related reference.