DFAST_QC: DFAST Quality Control

DFAST_QC conducts taxonomy and completeness checks of an assembled genome.

Taxonomy check
DFAST_QC evaluates the taxonomic identity of the genome by querying it against more than 20,000 reference genomes from type strains. To shorten the runtime, it first runs MASH on the query against the reference nucleotide databases, narrowing down the genomes used in the downstream process based on the number of shared hashes. It then runs Skani against the selected reference genomes to calculate the ANI values.
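
The pre-screening idea in the first step can be illustrated with a toy MinHash sketch. This is only a simplified illustration, not MASH's actual implementation: the function names, the k-mer size, the sketch size, and the use of MD5 as the hash are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def kmer_sketch(seq: str, k: int = 5, size: int = 50) -> Set[int]:
    """Toy bottom-k sketch: keep the `size` smallest hashes of all k-mers."""
    hashes = {
        int.from_bytes(hashlib.md5(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }
    return set(sorted(hashes)[:size])


def prefilter(
    query: str, references: Dict[str, str], num_hits: int = 2
) -> List[Tuple[str, int]]:
    """Rank references by the number of sketch hashes shared with the query.

    Only the top `num_hits` references would be passed on to the slower
    ANI calculation in the second step.
    """
    query_sketch = kmer_sketch(query)
    scored = [
        (name, len(query_sketch & kmer_sketch(seq)))
        for name, seq in references.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:num_hits]
```

Here `num_hits` plays a role similar to the `--num_hits` option of `dfast_qc`: it limits how many candidate references reach the ANI step.
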
Completeness check
DFAST_QC employs CheckM to calculate completeness and contamination values of the query genome. DFAST_QC automatically determines the reference marker set for CheckM based on the result of taxonomy check. Users can also specify the marker set to be used.
The genome size is also checked to ensure it falls within the expected range.
GTDB search
As of ver. 0.5.0, DFAST_QC can calculate ANI against GTDB representative genomes, thereby enabling species-level identification in the GTDB Taxonomy. This employs the same 2-step search as the taxonomy check.

Important Notice 2025 Feb

The reference data for DFAST_QC is normally available from our web service (https://dfast.ddbj.nig.ac.jp). However, due to a system replacement on our institute’s supercomputer, the web service will be unavailable from mid-February to early March 2025. During this period, the dqc_ref_manager.py script will not work. Instead, please manually download the data (dqc_reference_compact.tar.gz) from https://dfast.annotation.jp and follow the instructions in the README file available at the site.

System requirements and software dependencies

DFAST_QC runs on Linux / Mac (Intel CPU) with Python ver. 3.7 or later. It requires approximately 2 GB of memory. The following third-party software and packages are required.

Skani
Mash
CheckM
HMMer (required for CheckM)
Prodigal (required for CheckM)
Python packages: peewee, more-itertools, ete3

Installation from Bioconda

DFAST_QC is also available from BioConda.

conda install -c bioconda -c conda-forge dfast_qc

If this does not work, please try installing from source code as described in the next section.

Installation from source code

Source code

git clone https://github.com/nigyta/dfast_qc.git

Install dependencies
We recommend using conda to install dependencies.
```
cd dfast_qc
conda env create -f environment.yml
```
This will create a conda environment named "dfast_qc" and install the above-mentioned dependencies in it.

Alternatively, after installing the required software yourself, you can install the Python packages with the pip command.
```
pip install -r requirements.txt
```

Reference data is not included in the conda package. Please install it following the steps below.

Quick set up (recommended)

Since the full DFAST_QC reference data set (DQC_REFERENCE_FULL) is huge (>100GB, including GTDB representative genomes), we have made pre-built reference data (DQC_REFERENCE_COMPACT, <1.5GB) available for download using the dqc_ref_manager.py script. This script attempts to retrieve data from the DFAST web service hosted on the NIG Supercomputer. If the web service is unavailable, the download will fail. In that case, please refer to https://www.ddbj.nig.ac.jp/.

dqc_ref_manager.py download

As DQC_REFERENCE_COMPACT does not contain reference genomes for ANI calculation, dfast_qc will attempt to download the required genomes on the fly during the run (an internet connection is required). This adds some extra time for downloading (~1 min).
We update DQC_REFERENCE_COMPACT periodically; please update your copy by running dqc_ref_manager.py again.

The dqc_ref_manager.py script downloads the reference data from our web service (https://dfast.ddbj.nig.ac.jp). If file downloads fail due to server maintenance or other issues, please manually obtain the reference data from this site.

If you want to prepare DQC_REFERENCE_FULL, please follow the procedure in the "For power users" section below.

Usage

Minimum

dfast_qc -i /path/to/input_genome.fasta

Basic
```
dfast_qc -i /path/to/input_genome.fasta -o /path/to/output --num_threads 2
```
If you are using DQC_REFERENCE_COMPACT, specifying a --num_threads value larger than 1 lets missing genomes be downloaded in parallel.

GTDB search (disabled by default)

dfast_qc -i /path/to/input_genome.fasta -o /path/to/output --enable_gtdb [--disable_tc] [--disable_cc]

usage: dfast_qc [-h] [--version] [-i PATH] [-o PATH] [-hits INT] [-a INT]
                [-t INT] [-r PATH] [-n INT] [--enable_gtdb] [--disable_tc] 
                [--disable_cc] [--disable_auto_download] [--force] 
                [--debug] [-p STR] [--show_taxon]

DFAST_QC: Taxonomy and completeness check

options:
  -h, --help            show this help message and exit
  --version             Show program version
  -i PATH, --input_fasta PATH
                        Input FASTA file (raw or gzipped) [required]
  -o PATH, --out_dir PATH
                        Output directory (default: OUT)
  -hits INT, --num_hits INT
                        Number of top hits by MASH (default: 10)
  -a INT, --ani INT     ANI threshold (default: 95%)
  -t INT, --taxid INT   NCBI taxid for completeness check. Use '--show_taxon' for available taxids. (Default: Automatically inferred from taxonomy check)
  -r PATH, --ref_dir PATH
                        DQC reference directory (default: DQC_REFERENCE_DIR)
  -n INT, --num_threads INT
                        Number of threads for parallel processing (default: 1)
  --enable_gtdb         Enable GTDB search
  --disable_tc          Disable taxonomy check using ANI
  --disable_cc          Disable completeness check using CheckM
  --disable_auto_download
                        Disable auto-download for missing reference genomes
  --force               Force overwriting result
  --debug               Debug mode
  -p STR, --prefix STR  Prefix for output (for debugging use, default: None)
  --show_taxon          Show available taxa for completeness check
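
When calling DFAST_QC from a Python pipeline, the options listed above can be assembled into an argument list. The helper below is a hypothetical sketch (the function name and its defaults are not part of DFAST_QC); it only uses flags that appear in the help text.

```python
from typing import List, Optional


def build_dfast_qc_command(
    input_fasta: str,
    out_dir: str = "OUT",
    num_threads: int = 1,
    enable_gtdb: bool = False,
    taxid: Optional[int] = None,
    force: bool = False,
) -> List[str]:
    """Build an argument list for dfast_qc from documented options."""
    cmd = [
        "dfast_qc",
        "-i", input_fasta,
        "-o", out_dir,
        "--num_threads", str(num_threads),
    ]
    if taxid is not None:
        # Overrides the marker set otherwise inferred from the taxonomy check.
        cmd += ["--taxid", str(taxid)]
    if enable_gtdb:
        cmd.append("--enable_gtdb")
    if force:
        cmd.append("--force")
    return cmd


# Example (requires dfast_qc on PATH and prepared reference data):
#   import subprocess
#   subprocess.run(build_dfast_qc_command("genome.fa", force=True), check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.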

Example

Test data can be found in the examples directory. To test the software, run the following command after preparing the reference data.

dfast_qc -i examples/GCA_000829395.1.fna.gz --force

Example of Result

tc_result.tsv: Taxonomy check result
cc_result.tsv: Completeness check result
dqc_result.json: DFAST_QC result in JSON format, as shown below:

    {
        "tc_result": [
            {
                "organism_name": "Lactobacillus paragasseri",
                "strain": "strain=JCM 5343",
                "accession": "GCA_003307275.1",
                "taxid": 2107999,
                "species_taxid": 2107999,
                "relation_to_type": "type",
                "validated": true,
                "ani": 99.8183,
                "matched_fragments": 629,
                "total_fragments": 667,
                "status": "conclusive"
            },
            ...
            {
                "organism_name": "Lactobacillus gasseri",
                "strain": "strain=ATCC 33323",
                "accession": "GCA_000014425.1",
                ...
                "ani": 93.5813,
                "matched_fragments": 568,
                "total_fragments": 667,
                "status": "below_threshold"
            }
        ],
        "cc_result": {
            "completeness": 99.35,
            "contamination": 0.16,
            "strain_heterogeneity": 0.0,
            "ungapped_genome_size": 2027485,
            "expected_size": 1978554,
            "expected_size_min": 1584000,
            "expected_size_max": 2378000,
            "status": "OK"
        }
    }
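
For downstream scripting, a result like the one above can be reduced to a few key values. The following is a minimal sketch that assumes the field names shown in the example; `summarize` is a hypothetical helper, not part of DFAST_QC.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the best ANI hit and the main completeness values."""
    hits = result.get("tc_result", [])
    best = max(hits, key=lambda h: h["ani"]) if hits else None

    cc = result.get("cc_result", {})
    size_keys = ("ungapped_genome_size", "expected_size_min", "expected_size_max")
    size_ok = None
    if all(key in cc for key in size_keys):
        # Genome size check: the size should fall within the expected range.
        size_ok = (
            cc["expected_size_min"]
            <= cc["ungapped_genome_size"]
            <= cc["expected_size_max"]
        )

    return {
        "best_hit": best["organism_name"] if best else None,
        "best_ani": best["ani"] if best else None,
        "tc_status": best["status"] if best else None,
        "completeness": cc.get("completeness"),
        "contamination": cc.get("contamination"),
        "size_in_expected_range": size_ok,
    }


# Example, assuming the default output directory OUT:
#   import json
#   with open("OUT/dqc_result.json") as fh:
#       print(summarize(json.load(fh)))
```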

Batch execution for multiple genomes

A wrapper script is available for batch execution on multiple genomes in a given directory. Please make sure the dfast_qc executable is in your $PATH.

dqc_multi -t 3 examples/

This will invoke 3 DFAST_QC processes in parallel against the FASTA files in the examples directory and generate a report file, dqc_report.tsv.
By default, FASTA files with the extensions fa(.gz), fna(.gz), and fasta(.gz) will be processed. See the help (dqc_multi -h) for more details.
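
To preview which files would match the default extensions before launching a batch run, a small helper can be used. How dqc_multi matches files internally is not described here, so the suffix-based matching below is an assumption, and `find_fasta_files` is a hypothetical name.

```python
from pathlib import Path
from typing import List, Sequence

# Default extensions taken from the dqc_multi help text.
DEFAULT_EXTENSIONS = ("fa", "fasta", "fna", "fa.gz", "fasta.gz", "fna.gz")


def find_fasta_files(
    input_dir: str, extensions: Sequence[str] = DEFAULT_EXTENSIONS
) -> List[Path]:
    """Return files in input_dir whose names end with an accepted extension."""
    suffixes = tuple("." + ext for ext in extensions)
    return sorted(
        path
        for path in Path(input_dir).iterdir()
        if path.is_file() and path.name.endswith(suffixes)
    )
```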

Help

usage: dqc_multi [-h] [--fasta FASTA] [--out_dir OUT_DIR] [--output OUTPUT] [--taxid TAXID] [--disable_tc] [--disable_cc] [--enable_gtdb] [--thread THREAD] input_dir

Run DFAST_QC in parallel for batch execution of multiple genomes

positional arguments:
  input_dir             The directory containing the FASTA files

options:
  -h, --help            show this help message and exit
  --fasta FASTA         Acceptable file extension for the fasta files. Default: fa,fasta,fna,fa.gz,fasta.gz,fna.gz
  --out_dir OUT_DIR, -O OUT_DIR
                        Name of output directory. Intermediate files will be saved here.
  --output OUTPUT, -o OUTPUT
                        Output file name
  --taxid TAXID         taxid for taxonomy check (-1: auto, 0:prokaryote)
  --disable_tc          Disable taxonomy check using ANI
  --disable_cc          Disable completeness check using CheckM
  --enable_gtdb         Enable GTDB search
  --thread THREAD, -t THREAD
                        Number of threads to use

List of status in taxonomy check result

conclusive: Effective ANI hit (>=95%) against only 1 species, hence the species name is conclusively determined.
indistinguishable: The genome belongs to one of the species that are difficult to distinguish using ANI (e.g. E. coli and Shigella spp.)
inconclusive: ANI hits against two or more different species. This may result from a comparison between very closely related species or from contamination by 2 different species.
below_threshold: The ANI hit is below the threshold (95%)

Note that DFAST_QC cannot identify clades below species level.
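
The status definitions above can be read as a simple decision rule. The sketch below is one possible interpretation, not DFAST_QC's actual code: it derives a single overall status for a genome (the JSON output reports a status per hit), and the order in which the rules are applied, the function name, and the `indistinguishable_taxids` parameter are all assumptions.

```python
from typing import Any, Dict, Iterable, List


def overall_status(
    hits: List[Dict[str, Any]],
    threshold: float = 95.0,
    indistinguishable_taxids: Iterable[int] = (),
) -> str:
    """Derive one status from ANI hits, following the list of statuses."""
    # Species with at least one hit at or above the ANI threshold.
    species = {h["species_taxid"] for h in hits if h["ani"] >= threshold}
    if not species:
        return "below_threshold"
    # Assumption: membership in a hard-to-distinguish group takes precedence.
    if species & set(indistinguishable_taxids):
        return "indistinguishable"
    if len(species) == 1:
        return "conclusive"
    return "inconclusive"
```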

Run in Docker

A Docker image is available at Docker Hub.
The example below shows how to invoke DFAST_QC with an input FASTA file (genome.fa) in the current directory.

docker run -it --rm --name dqc -v /path/to/dqc_reference:/dqc_reference -v $PWD:$PWD nigyta/dfast_qc dfast_qc -i $PWD/genome.fa -o $PWD/dfastqc_out

For power users

Prepare reference data

Reference data of DFAST_QC is stored in a directory called DQC_REFERENCE. By default, it is located in the directory where DFAST_QC is installed (PATH/TO/dfast_qc/dqc_reference), or in /dqc_reference when the docker version is used.
In general, you do not need to change this, but you can specify a different location in the config file or with the -r option.

To prepare reference data, run the following command.

sh dqc_initial_setup.sh [-n int]

-n denotes the number of threads for parallel processing (default: 1). As data preparation may take time, we recommend specifying a value of 4 to 8 (or more) for -n.

Once the reference data has been prepared, it can be updated by running the following command:

dqc_admin_tools.py update_all

To generate a list of the reference genomes (reference_genomes.tsv), run the following command

dqc_admin_tools.py dump_sqlite_db

Instead of running dqc_initial_setup.sh, you can prepare reference data by manually executing the following commands. Run dqc_admin_tools.py -h or dqc_admin_tools.py subcommand -h to show help.

Download master files
```
dqc_admin_tools.py download_master_files --targets asm ani tsr igp 
```
This will download the "Assembly report", "ANI report", "Type strain report", and "indistinguishable_groups_prokaryotes.txt" from the NCBI FTP server, as well as the HMMer profile for TIGR.
Download/Update NCBI taxdump data
```
dqc_admin_tools.py update_taxdump
```
Download reference genomes
```
dqc_admin_tools.py download_genomes
```
This will download reference genomic FASTA files from the NCBI Assembly database. As it attempts to download a large number of genomes, we recommend enabling the parallel download option (e.g. --num_threads 4).
Sketch reference genomes using MASH
```
dqc_admin_tools.py mash_ref_sketch
```
Prepare SQLite database file
```
dqc_admin_tools.py prepare_sqlite_db
```
This will generate a reference file DQC_REFERENCE/references.db, which contains metadata for reference genomes.
Prepare CheckM data
```
dqc_admin_tools.py prepare_checkm
```
CheckM reference data will be downloaded and configured.
Update database for CheckM
```
dqc_admin_tools.py update_checkm_db
```
This will insert auxiliary data for CheckM into DQC_REFERENCE/references.db.

Prepare reference data for genome size check

dqc_admin_tools.py prepare_genome_size_data

Add timestamp to the reference data
```
dqc_admin_tools.py add_ref_info
```
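
The manual steps above can also be chained from a small Python driver. This is a sketch under stated assumptions: it only reproduces the subcommands in the order listed in this section, the function names are hypothetical, and `--num_threads` is applied to `download_genomes` only, as suggested above.

```python
import subprocess
from typing import List


def setup_commands(num_threads: int = 1) -> List[List[str]]:
    """Return the dqc_admin_tools.py calls in the order documented above."""
    tool = "dqc_admin_tools.py"
    return [
        [tool, "download_master_files", "--targets", "asm", "ani", "tsr", "igp"],
        [tool, "update_taxdump"],
        [tool, "download_genomes", "--num_threads", str(num_threads)],
        [tool, "mash_ref_sketch"],
        [tool, "prepare_sqlite_db"],
        [tool, "prepare_checkm"],
        [tool, "update_checkm_db"],
        [tool, "prepare_genome_size_data"],
        [tool, "add_ref_info"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_all(setup_commands(num_threads=4))` prints the commands without executing them, which is a convenient way to review the sequence first.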

Preparation for the GTDB reference data.

Download the archive of representative genomes from GTDB and extract it.

curl -LO https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/gtdb_genomes_reps.tar.gz
tar xfz gtdb_genomes_reps.tar.gz

If downloading from the above link is slow, try the mirror site:

curl -LO https://data.ace.uq.edu.au/public/gtdb/data/releases/release220/220.0/genomic_files_reps/gtdb_genomes_reps_r220.tar.gz
tar xfz gtdb_genomes_reps_r220.tar.gz

Place the unarchived folder under DQC_REFERENCE.
Make sure that the folder name is identical to the value GTDB_GENOME_DIR specified in config.py.
```
GTDB_GENOME_DIR = "gtdb_genomes_reps_r220/database"
```
Download the species list from GTDB.
```
curl -LO https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/sp_clusters.tsv
```
The above command will download sp_clusters.tsv from GTDB.
Place the file in the DQC_REFERENCE directory.
Sketch representative genomes from GTDB using MASH
```
dqc_admin_tools.py mash_gtdb_sketch
```

Prepare the SQLite DB file for GTDB

dqc_admin_tools.py prepare_sqlite_db --for_gtdb

When a newer version of the GTDB representative genomes becomes available, repeat these steps.
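
Before sketching, it can help to confirm that the files are where they are expected. The check below is a hypothetical helper, not part of DFAST_QC; it assumes the layout described above (the genome folder named as in GTDB_GENOME_DIR and sp_clusters.tsv, both directly under DQC_REFERENCE).

```python
from pathlib import Path
from typing import List


def check_gtdb_layout(
    dqc_reference: str,
    gtdb_genome_dir: str = "gtdb_genomes_reps_r220/database",
) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(dqc_reference)
    problems = []
    genome_dir = root / gtdb_genome_dir
    if not genome_dir.is_dir():
        problems.append(f"missing genome directory: {genome_dir}")
    species_list = root / "sp_clusters.tsv"
    if not species_list.is_file():
        problems.append(f"missing species list: {species_list}")
    return problems
```

The value passed as `gtdb_genome_dir` should match GTDB_GENOME_DIR in config.py for the release you downloaded.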

Citation

If you use DFAST_QC, please cite:

Mohamed Elmanzalawi, Takatomo Fujisawa, Hiroshi Mori, Yasukazu Nakamura & Yasuhiro Tanizawa
DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic genomes.
BMC Bioinformatics 26:3, 2025. https://doi.org/10.1186/s12859-024-06030-y
