Input folder

Satria A Kautsar edited this page Oct 28, 2020 · 14 revisions

Folder organization

To fully utilize its powerful features, BiG-SLiCE expects its input BGC files to be organized into datasets and genomes. A typical input folder may look like this:

  • input_folder/
    • datasets.tsv
    • dataset_1/
      • genome_1A/
        • genome_1A.region001.gbk
        • genome_1A.region002.gbk
        • ...
      • genome_1B/
        • ...
      • ...
    • dataset_2/
      • genome_2A/
        • ...
      • ...
    • dataset_3/
      • genome_3A/
        • ...
      • ...
    • taxonomy/
      • taxonomy_dataset_1.tsv
      • taxonomy_dataset_2.tsv
      • ...

These folders and files are explained below.

datasets.tsv

Note: this file must be named exactly 'datasets.tsv' and placed at the root of the input folder. It is a metadata file describing the information BiG-SLiCE needs to define and parse all BGCs included in the input folder. It should be formatted as a tab-separated file (.tsv) with the following columns, in exactly this order:

  1. Dataset name
  2. Path to dataset folder (relative to input folder's root folder)
  3. Path to taxonomy file (see <taxonomy_X.tsv> files)
  4. Description of the dataset

Lines starting with a hash symbol (#) will be skipped by the parser and can be used e.g. to define the table headers. A template datasets.tsv file can also be downloaded from the code repository to serve as a starting point.
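The column layout above can be illustrated with a minimal parsing sketch. This is not BiG-SLiCE's own parser: the `DatasetRow` type, the `parse_datasets_tsv` function, and the example rows are hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```python
from typing import List, NamedTuple


class DatasetRow(NamedTuple):
    name: str           # column 1: dataset name
    folder: str         # column 2: dataset folder, relative to the input root
    taxonomy_file: str  # column 3: path to the dataset's taxonomy file
    description: str    # column 4: free-text description


def parse_datasets_tsv(text: str) -> List[DatasetRow]:
    """Parse datasets.tsv content, skipping '#' lines (and, by assumption, blank lines)."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 tab-separated columns, got {len(fields)}"
            )
        rows.append(DatasetRow(*(field.strip() for field in fields)))
    return rows


# Illustrative content only; names follow the folder tree shown earlier.
example = (
    "# Dataset name\tPath to folder\tPath to taxonomy\tDescription\n"
    "dataset_1\tdataset_1/\ttaxonomy/taxonomy_dataset_1.tsv\tFirst example dataset\n"
    "dataset_2\tdataset_2/\ttaxonomy/taxonomy_dataset_2.tsv\tSecond example dataset\n"
)

for row in parse_datasets_tsv(example):
    print(row.name, row.folder, row.taxonomy_file, sep=" | ")
```

Running a check like this before starting BiG-SLiCE can catch a row with a missing or extra tab early.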

<dataset_X> folders

Datasets are a versatile grouping scheme that can be used to categorize the collection of genomes and BGCs used within BiG-SLiCE runs. For example, they can be used to group Metagenome-Assembled Genomes (MAGs) according to their sample sources, or to group genomes and MAGs by their original publications in a meta-analysis study. Genome folders should be placed directly under each dataset's folder.
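To check that genome folders sit directly under a dataset folder, a small sketch like the following may help. The function name `count_gbk_per_genome` and the throwaway demo layout are hypothetical; the sketch only counts '.gbk' files in each immediate subfolder and does not reproduce BiG-SLiCE's own scanning logic.

```python
import tempfile
from pathlib import Path
from typing import Dict


def count_gbk_per_genome(dataset_dir: Path) -> Dict[str, int]:
    """Map each folder directly under dataset_dir to its number of .gbk files."""
    counts = {}
    for child in sorted(dataset_dir.iterdir()):
        if child.is_dir():
            counts[child.name] = sum(
                1 for f in child.iterdir() if f.is_file() and f.suffix == ".gbk"
            )
    return counts


# Demo on a temporary layout (folder and file names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset_1"
    (dataset / "genome_1A").mkdir(parents=True)
    (dataset / "genome_1B").mkdir()
    (dataset / "genome_1A" / "genome_1A.region001.gbk").touch()
    (dataset / "genome_1A" / "genome_1A.region002.gbk").touch()
    counts = count_gbk_per_genome(dataset)

print(counts)
```

A genome folder reporting zero files, or '.gbk' files nested one level too deep, usually points to a layout problem.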

<genome_X> folders & <genome_X.regionXXX.gbk> files

These are the output folders produced by antiSMASH runs, containing either '<genome_name>.regionXXX.gbk' (for antiSMASH 5) or '<genome_name>.clusterXXX.gbk' (for antiSMASH 4) files. MIBiG >= 2.0 files named 'BGCXXXXXXX.gbk' (select 'Download GenBank summary file' on each entry's web page, or use the bulk zipped download page) are also accepted. Do not change these naming formats, as BiG-SLiCE relies on them to rapidly differentiate clustergbks from regular GenBank files (i.e. genome files).
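The naming conventions above can be expressed as regular expressions. The sketch below is an assumption about how such file names could be told apart, not the patterns BiG-SLiCE actually uses: the names `REGION_RE`, `CLUSTER_RE`, `MIBIG_RE` and `classify_gbk` are hypothetical, and the digit counts are inferred from the 'XXX' placeholders.

```python
import re

# Assumed patterns, inferred from the placeholder names in the text above.
REGION_RE = re.compile(r"^(?P<genome>.+)\.region(?P<num>\d+)\.gbk$")    # antiSMASH 5
CLUSTER_RE = re.compile(r"^(?P<genome>.+)\.cluster(?P<num>\d+)\.gbk$")  # antiSMASH 4
MIBIG_RE = re.compile(r"^BGC\d{7}\.gbk$")                               # MIBiG >= 2.0


def classify_gbk(filename: str) -> str:
    """Return a label for the naming convention a .gbk file name follows."""
    if MIBIG_RE.match(filename):
        return "mibig"
    if REGION_RE.match(filename):
        return "antismash5"
    if CLUSTER_RE.match(filename):
        return "antismash4"
    return "other"  # e.g. a whole-genome GenBank file


for name in ["genome_1A.region001.gbk", "genome_1A.cluster012.gbk",
             "BGC0000001.gbk", "genome_1A.gbk"]:
    print(name, "->", classify_gbk(name))
```

A file that falls into the "other" bucket would, under these assumptions, not be recognized as a cluster file, which is why renaming antiSMASH output is discouraged.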

<taxonomy_X.tsv> files

Although taxonomy information can in practice be extracted from antiSMASH 5 (and MIBiG >= 2.0) cluster GenBank files (since they retain the original genome's annotation), that information does not come with a standardized way to assign ranks (e.g. Phylum, Genus, Species, ...) to the provided taxon names (usually semicolon-separated, ';'). To ensure the best quality annotation and analysis, BiG-SLiCE asks its users to manually supply taxonomy metadata (where possible) for each dataset in the form of a tab-separated file (.tsv) containing this information (in this exact order):

  1. Genome folder name (ends with '/')
  2. Kingdom / Domain name
  3. Class name
  4. Order name
  5. Family name
  6. Genus name
  7. Species name
  8. Organism / Strain name
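The first column is the link between a taxonomy row and a genome folder, so it is worth checking on its own. The sketch below is a hypothetical helper (`check_taxonomy_genome_column` is not part of BiG-SLiCE) that only inspects that column; treating '#' lines as comments in taxonomy files is an assumption carried over from datasets.tsv.

```python
from typing import List, Set


def check_taxonomy_genome_column(text: str, genome_folders: Set[str]) -> List[str]:
    """Report rows whose first column lacks a trailing '/' or names an unknown folder."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Assumption: blank lines and '#' lines carry no data.
        if not line.strip() or line.startswith("#"):
            continue
        genome = line.split("\t")[0]
        if not genome.endswith("/"):
            problems.append(f"line {line_no}: '{genome}' does not end with '/'")
            continue
        folder_name = genome.rstrip("/")
        if folder_name not in genome_folders:
            problems.append(f"line {line_no}: no genome folder named '{folder_name}'")
    return problems


# Illustrative rows; the remaining taxonomy columns are elided as a placeholder.
sample = (
    "# Genome folder\t<taxonomy columns>\n"
    "genome_1A/\t<taxonomy columns>\n"
    "genome_1B\t<taxonomy columns>\n"
)

for problem in check_taxonomy_genome_column(sample, {"genome_1A", "genome_1B"}):
    print(problem)
```

The set of genome folder names would typically come from listing the dataset folder itself.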

To keep taxonomy names consistent across all datasets, use the same reference database when assigning taxa. To help users automate this process, BiG-SLiCE provides Python scripts that assign taxonomy based on the original input genomes (not clustergbks) using the GTDB-toolkit (only for fairly complete archaeal and bacterial genomes; download the script here). Alternatively, if the genomes come from NCBI RefSeq/GenBank (i.e., they have GCF_* or GCA_* accessions), you can use this script to extract the taxonomy from the GTDB-API.