Skip to content

Day III: Monday 25.03.2024

Abdoallah Sharaf edited this page Apr 1, 2024 · 50 revisions

© 2024 Abdoallah Sharaf

Sequencing reads pre-processing and quality control

  • Starting from today, we going to work on SequAna server.

Exercise: Check the mini-ONT output directory structure and run's report. Let's see if you'd be able to use your CL skills to navigate through the output directory, check data size, and final merge sequencing raw reads.

  • from now on, i will adopt the strategy of "one environment per step", if you will work on your PC then you supposed to:

    • create your environment of each step using the packages list file conda_packages.yml which is placed in the Results directory. You can install the list of packages using:

      mamba env create -n [eviroment_name] -f conda_packages.yml
    • I placed the pre-produced results for each step to save time. Also, i created all the necessary environments, you need just to activate them following the tutorial.

Exercise: Check the number of the raw reads

  • As we dealing with a very high number of reads, then I subset 400000 of reads so that you can run each analysis over it while the pre-produced results are based on the full set of data. Then don't expect that all analysis would work for you.

Exercise: Create a directory with your name and copy the subseted reads Aip_sub_400K.fastq.gz in it.

Quality Control:

Assess the quality of the raw nanopore reads using tools like FastQC or NanoPlot to identify any issues needing attention.

  • We will Use NanoPlot to test the quality of the sequencing read.
mamba activate qc_env

Running time: 25m47,291s

NanoPlot --fastq [raw_reads].fastq.gz --maxlength 40000 --plots dot --legacy hex -o nanoplot

Exercise: Check out some of the .png plots and the contents of NanoStats.txt. Also, download NanoPlot-report.html for both files to your local computer and answer the following questions:

A. How many reads are in the files?

B. What are the average read lengths? What does this tell us about the quality?

C. What is the average base quality and what kind of accuracy do we expect?

Hint
error probability (p) =10^(−Q/10) ×100%
  Where: Q is the Quality Value.

Adapter Removal using PoreChop:

(execution time: 156m30,495s)

porechop -i [raw_reads].fastq.gz -o [clean_reads].fq --discard_middle

gzip [clean_reads].fq

Exercise: Check the read quality after adapter removal and download the html files from the server. do you notice differences?

Genome Kmer profiling

  • Usually, this is for short-reads or high-accurate long reads as "HiFi technology" but let's give it a try.

  • The developer of Smudgeplot already gave a full talk and mini practice about it but I share here i used profile Kmer using genomescope2.0.

  • we need first to compute the histogram of k-mer frequencies. For this we will use KMC, but you can use jellyfish.

mamba activate kmer_env
mkdir tmp

ls Aip_sub_400K.fastq.gz > FILES
../Results/Kmer/software/KMC/bin/kmc -k21 -m64 -ci1 -cs10000 @FILES kmcdb tmp
../Results/Kmer/software/KMC/bin/kmc_tools transform kmcdb histogram kmcdb_k21.hist -cx10000
  • The next step is to run the modeling with the R script genomescope.R
../Results/Kmer/software/genomescope2.0/genomescope.R -i kmcdb_k21.hist -o Aip_ONT_GS -k 21

Exercise: Download the output folder Aip_ONT_GS from the Results/Kmer directory and dicuss the plots.

De-novo assembly using ONT reads

We will use Flye assemblyer for this task

  • the Aiptasia genome size 394 Mega Base Pairs according to Genomes on a Tree but this information based ansectory prediction. These following studes claims that it's arround 250-300 Mbp, i will consider the average 275 Mb

  • Baumgarten

  • Zimmermann

Flye

  • we going to try three different setups for flye and let's see the differences

Exercise: Double-check the log file for each run. Go to the software page and try to understand the difference between the three options.

mamba activate assem_env

Running time: 156m9,730s

flye --nano-hq  Aip_clean.fq.gz --genome-size 275m --out-dir aip_flye_hq --scaffold

Running time: 185m47,878s

flye --nano-corr  Aip_clean.fq.gz --genome-size 275m --out-dir aip_flye_corr --scaffold

Running time: 184m37,942s

flye --nano-raw  Aip_clean.fq.gz --genome-size 275m --out-dir aip_flye_raw --scaffold
  • It's recommended to use different assemblers on your data to compare assembles, CANU, Unicycler, or MaSuRCA are popular options but I will run CANU and try a new assembler NECAT
  • The software installed from the source and placed in the path /home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/
  • first, we need to create a configuration file
/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl config aip_config.txt 
  • put the clean read path in file read_list.txt
  • you need to edit the following information using any text editor
PROJECT=Aip_NECAT
ONT_READ_LIST=/home/bio16840/Abdo/read_list.txt
GENOME_SIZE=275000000
THREADS=1
MIN_READ_LENGTH=1000
  • Then run the reads correction step

running time: 125m49,472s

/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl correct aip_config.txt 
  • The pipeline only corrects the longest 40X raw reads. The corrected reads are in the files Results/Assembly/Aip_NECAT/1-consensus/cns_iter${NUM_ITER}/cns.fasta. The longest 30X corrected reads are extracted for assembly, which are in the file Assembly/Aip_NECAT/1-consensus/cns_final.fasta.

  • Now, time for assembly Running time: 53m24,029s

/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl assemble aip_config.txt 
  • The assembled contigs are in the file Results/Assembly/Aip_NECAT/ecoli/4-fsa/contigs.fasta.

  • Finally, contigs can be bridged (scaffolded) with Running time: 4m30,344s

/home/bio16840/Results/Assembly/software/NECAT/Linux-amd64/bin/necat.pl bridge aip_config.txt 
  • The bridged contigs are in the file Results/Assembly/Aip_NECAT6-bridge_contigs/bridged_contigs.fasta.

Running time: 310m6,996s

canu -d Aip_canu -p Aip genomeSize=275m  -nanopore  -trimmed -correct -assemble Aip_clean.fq.gz gridOptions="--cpus=1"

Genome assembly assessment

mamba activate eval_env 
  • Let's first get some assembly statistics to compare the different assemblies and decide which one will carry on the downstream analysis
gfastats [assembly].fasta 

Exercise: Choose the best assembly to proceed. considering the balance between contiguity and completeness, aiming for a relatively small number of contigs with longer lengths that accurately represent the underlying genome structure and content.

Exercise: Compare the produced genome assembly statistics with the ones in Baumgarten

  • Time to Visually check your assembly, Bandage is GUI software then you need to run it locally.
  • Download the software and retrieve the assembly graph file in [.gfa] or [.fasta] for the selected assembly format from the server.
  • Open file ----> choose the assembly graph file ----> Draw graph

Exercise: Export the visualized picture of the assembly.

busco

  • You can run BUSCO on genome sequence but i prefer to do it based on the predicted protein sequences and the BTK tool will implement this information. However, you will learn about these tools on the last day with Rob

Visualizing genome assembly cobionts by running BlobToolKit locally

  • BlobToolKit (BTK) is an amazing tool to assess assembly quality and visualize to check potential cobionts, a full handout for how to run it can be found HERE. Also, a full online course for BTK is available as well for future reference HERE. For the second round, i used the new nextflow pipeline for BTK.

  • But i will try the very recent Nextflow version of the tool, for this we need to prepare the following first

  • converts sequencing data to CRAM format

minimap2 -ax map-ont assembly.fasta raw_ONT/20240206_Apo_Aip_02/Apo_fastq_pass.fastq.gz -t 10 | samtools sort -@10 -O BAM -o coverage.bam -
samtools index coverage.bam
samtools view -@ 10 -T assembly.fasta -C -o Aip.cram coverage.bam
  • Prepare a metadata [.yaml] file

  • Prepare a (csv) input sheet

  • run this next flow command

Running time: 6h 20m 32s

nextflow run blobtoolkit/main.nf -profile docker --input aip.csv --fasta assembly.fasta --yaml aip.yaml --accession Aip --taxon "Aiptasia sp" --taxdump taxdump --blastp reference_proteomes.dmnd --blastn ncbi/nt/ --blastx eference_proteomes.dmnd
  • it's a computational-extensive analysis so i did it for you but You still need to view the data, i will start hosting the data and all what you need to do is:
#log out the server
exit
#log in again but after allow running graphical applications
ssh -Y
#start firefox
firefox
#add the local host address in the browse bar
http://localhost:8005/view/aip_btk_be/dataset/aip_btk_be/blob
  • Still you can done this on your system but you need first downloading the analysis folder on your system.
mamba create -y -n btk python=3.9
mamba activate btk
pip install blobtoolkit
blobtools view --remote aip_btk_be

Exercise: Navigate the results and test the different options. which contigs we should filter-out?

  • Double-check results after decontamination aip_btk_aft

Assembling organelle genomes.

  • I will use GetOrganelle

  • We could go for this step before exploring the assembly with the BTK tool as it will allow us to distinguish between cobionts and organelle contigs which is important during the assembly decontamination.

  • if the assembly graph file is not available, then you need to convert it from the assembly file (.fasta)

gfastats ../Results/Assembly/aip_flye_raw/assembly.fasta -o gfa > Aip.contigs.gfa
  • You need to re-activate the assembly environment once again, then configure the tool as we search for Animal mitogenome
mamba activate assem_env
get_organelle_config.py --add  animal_mt

Running time: 2m41,742s

  • finally, run the analysis
get_organelle_from_assembly.py -F animal_mt -g ../Results/Assembly/Aip.contigs.gfa -o animal_mt_out

Exercise: visualize animal_mt_out/slimmed_assembly_graph.gfa and load animal_mt_out/slimmed_assembly_graph.csv to confirm the incomplete result.

  • If the result is nearly complete, you can try join_spades_fastg_by_blast.py to fill N-gaps in between contigs with a closely related reference.