Incorrect number of genomes detected in overview of output #135

Oteng1 · 2024-03-12T13:29:57Z

The overview in the index.html output reports an incorrect number of genomes, which also affects one of the generated pie charts. In one case, I used a single assembled genome with 11 BGCs, and the overview in index.html said that 11 genomes were used.
In another case, I used 15 assembled genomes, but the overview in index.html said 118. The input was prepared as indicated in the tutorial.

github-actions · 2024-03-27T02:02:13Z

This issue has not seen activity for 14 days and has been marked as stale. Please comment with additional information if this issue is still relevant.

jorgecnavarrom · 2024-03-27T13:37:56Z

Hi. The number of genomes for that figure is calculated from the header of the gbk files (from the Organism property, if I remember correctly), so the count can be wrong depending on how these gbk files were produced.
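To check whether this applies to a given input folder, it can help to look at what the gbk headers actually contain. The following is a minimal diagnostic sketch, not BiG-SCAPE code: it scans each file as plain text for the standard GenBank `ORGANISM` line, which is an assumption about how the files are laid out, and BiG-SCAPE's own parser may read the field differently. The function names `read_organism` and `organism_counts` are hypothetical.

```python
import glob
import os
from collections import Counter


def read_organism(gbk_path):
    """Return the text after the ORGANISM keyword in a GenBank header, or "" if absent."""
    with open(gbk_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped.startswith("ORGANISM"):
                return stripped[len("ORGANISM"):].strip()
            if stripped.startswith("FEATURES"):
                break  # header is over; no ORGANISM line was found
    return ""


def organism_counts(input_dir):
    """Count how many .gbk files under input_dir share each ORGANISM value."""
    pattern = os.path.join(input_dir, "**", "*.gbk")
    return Counter(read_organism(p) for p in glob.glob(pattern, recursive=True))
```

Calling `organism_counts("path_to_input")` and printing the result shows how many distinct organism strings occur and how many files have an empty one. Files with an empty or one-character value are the ones for which the file name would be used instead.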

felipevzps · 2024-04-02T20:59:44Z

Hi,

If the genome name is not in the organism property of the gbk file (the property is empty or a single character), the name of the gbk file is used instead, with the ".clusterXXX" or ".regionXXX" suffix removed.

This happens here:

BiG-SCAPE/bigscape.py

Lines 3271 to 3277 in 97d616c

    
    # get identifier info
    identifier = ""
    if len(bgc_info[bgc].organism) > 1:
        identifier = bgc_info[bgc].organism
    else:  # use original genome file name (i.e. exclude "..clusterXXX from antiSMASH run")
        file_name_base = os.path.splitext(os.path.basename(genbankDict[bgc][0]))[0]
        identifier = file_name_base.rsplit(".cluster",1)[0].rsplit(".region", 1)[0]

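If you are working with 3 clusters from the same genome (contig_1.region001.gbk, contig_2.region001.gbk, contig_3.region001.gbk) and the organism property is not set, the script will count 3 genomes, because stripping ".regionXXX" leaves three different names: contig_1, contig_2 and contig_3 (please correct me if I am wrong).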

I didn't have time to read the entire code, but I think you could rename your input files before running bigscape so that each name starts with the genome name (e.g. genome1.region001.contig_1.gbk, genome1.region001.contig_2.gbk, genome1.region001.contig_3.gbk).
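The effect of the file name on the genome count can be reproduced in isolation. The sketch below copies the fallback logic from the snippet quoted above into a standalone function; `genome_identifier` and `count_genomes` are hypothetical helper names, and the organism value is passed in as a plain string instead of being read from `bgc_info`.

```python
import os


def genome_identifier(organism, gbk_path):
    """Mirror the quoted logic: prefer the organism, else fall back to the file name."""
    if len(organism) > 1:
        return organism
    base = os.path.splitext(os.path.basename(gbk_path))[0]
    # Drop everything from the last ".cluster" or ".region" onwards
    return base.rsplit(".cluster", 1)[0].rsplit(".region", 1)[0]


def count_genomes(records):
    """Count distinct identifiers over (organism, gbk_path) pairs."""
    return len({genome_identifier(org, path) for org, path in records})


# Three regions from one genome, organism missing, contig-based file names
contig_named = [("", f"contig_{i}.region001.gbk") for i in (1, 2, 3)]

# The same three files renamed so that the genome name comes first
genome_named = [("", f"genome1.region001.contig_{i}.gbk") for i in (1, 2, 3)]

print(count_genomes(contig_named))  # 3
print(count_genomes(genome_named))  # 1
```

With the contig-based names each file yields a different identifier, while the renamed files all reduce to "genome1". If the organism property were filled in consistently, the file names would not matter.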

Here is a bash script that builds file names including the genome name for the region gbk files and creates symbolic links with those names in the directory where the script is executed (the input directory of bigscape):

#!/bin/bash

# Directory where the genome folders are located
genomes_dir="path_to_antiSMASH_output/"

# Loop through all genome folders
for genome_dir in "$genomes_dir"/*; do
    # Extract the genome name from the folder
    genome=$(basename "$genome_dir")

    # Find all gbk files containing "region" in their name inside the genome folder
    find "$genome_dir" -type f -name "*region*.gbk" | while read -r gbk_file; do
        # Extract the file name without extension
        filename=$(basename "$gbk_file" .gbk)
        # Extract the region number from the file
        region_number=$(echo "$filename" | grep -oP 'region\d+')

        # Extract the contig number from the file
        contig_number=$(echo "$filename" | grep -oP 'contig_\d+')

        # Create the new file name
        new_filename="${genome}.${region_number}.${contig_number}.gbk"

        # Create the symbolic link with the new name in the current directory
        ln -s "$gbk_file" "./$new_filename"
    done
done

best,
Felipe

jorgecnavarrom · 2024-11-19T12:48:49Z

Thanks Felipe!

It's hard to make a one-size-fits-all solution for cases like these, where there is missing data (here, e.g. someone having a selected set of gbk files from different genomes in a custom folder, or metagenomic datasets).

github-actions bot added the "stale" label (Issue has not been active for 14 days) Mar 27, 2024

adraismawur added the "BiG-SCAPE 1" (Relates to BiG-SCAPE version 1.0) and "no-stale" (Prevent this issue from going stale) labels and removed the "stale" label Mar 27, 2024

adraismawur assigned jorgecnavarrom Mar 27, 2024
