BUSCO tutorial

How do you measure how “complete” your assembly is? Since we know our sawfly assemblies are about the expected size (which we estimated using GenomeScope2), and that our assembly statistics are good (which we found out by using gfastats), we now want to know that the genes we expect to find are there. BUSCO is a tool that can be used to compare your assemblies to lists of near universal orthologous genes (i.e. genes with common ancestry and function), to find out if you have managed to assemble them correctly. The assumption is that if you use the correct lineage dataset (based on what organism you are assembling), you´ll be able to find most of the genes within your assembly. If you want to read more about BUSCO, click here.

Running BUSCO

This is the script we use to run BUSCO:

#!/bin/bash
#SBATCH --job-name=busco
#SBATCH --account=nn9984k
#SBATCH --time=4:0:0
#SBATCH --mem-per-cpu=1G
#SBATCH --ntasks-per-node=5

eval "$(/cluster/projects/nn9984k/miniconda3/bin/conda shell.bash hook)" 

conda activate busco

prefix=${1%.*}

mkdir -p busco_${prefix}
origdir=$PWD
cd busco_${prefix}

busco -c 5 -i ${origdir}/$1 -l /cluster/projects/nn9984k/opt/busco_downloads/lineages/hymenoptera_odb10 -o assembly -m genome  --offline > busco.out 2> busco.err

Create a new directory in your work area named busco. Make a new run.sh file with nano run.sh, and copy the code below into that file:

ln -s ../yahs/iyAthRosa_scaffolds_final.fa .
sbatch /cluster/projects/nn9984k/scripts/run_busco.sh iyAthRosa_scaffolds_final.fa

You can also run the script in the hifiasm and/or yahs folder directly. The script will make a subfolder with the results (look at the file called busco.out in the subfolder).

When you have done this, you can submit to the cluster by typing sh run.sh. If you scaffolded both haplotypes, repeat the process for both assemblies.

Reviewing the BUSCO results

Before you start reviewing your results, look at the list of lineages found within the OrthoDB database. Which lineage would you pick for the yeast assemblies? Is it the same as the one we specified in the script above?

Try to answer these questions by reviewing your BUSCO results:

Which of the haplotypes has the highest percentage of complete BUSCO genes?
What do you think it means for the assembly quality if the number of fragmented, missing or duplicated BUSCO genes are high? Discuss with the people on your table.

We have also downloaded more lineages than what is used in the script above. Take a look in /cluster/projects/nn9984k/opt/busco_downloads/lineages/. What do these different lineages mean? Can it be useful to use a less specific lineage in BUSCO?

(Hint: busco --list-datasets shows all possible datasets/lineages and their hierarchy.)

Software versions used

eval "$(/cluster/projects/nn9984k/miniconda3/bin/conda shell.bash hook)" 
conda activate busco
conda list

metaeuk version 6.a5d39d9

busco version 5.4.5

Previous	Next

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

07_BUSCO.md

07_BUSCO.md

BUSCO tutorial

Running BUSCO

Reviewing the BUSCO results

Software versions used

Files

07_BUSCO.md

Latest commit

History

07_BUSCO.md

File metadata and controls

BUSCO tutorial

Running BUSCO

Reviewing the BUSCO results

Software versions used