Incorrect number of genomes detected in overview of output #135

Open
Oteng1 opened this issue Mar 12, 2024 · 4 comments
Labels: BiG-SCAPE 1 (relates to BiG-SCAPE version 1.0), no-stale (prevents this issue from going stale)

Comments

@Oteng1
Oteng1 commented Mar 12, 2024

The overview in the index.html output appears to report an incorrect number of genomes, which in turn affects one of the generated pie charts. In one case, I used a single assembled genome with 11 BGCs, and the overview in index.html reported that 11 genomes were used.
In another case, I used 15 assembled genomes, but the overview page in index.html reported 118. The input was prepared as described in the tutorial.

This issue has not seen activity for 14 days and has been marked as stale. Please comment with additional information if this issue is still relevant.

github-actions bot added the stale label (issue has not been active for 14 days) on Mar 27, 2024
adraismawur added the BiG-SCAPE 1 and no-stale labels and removed the stale label on Mar 27, 2024
@jorgecnavarrom
Collaborator

Hi. The number of genomes for that figure is calculated from the header of the gbk files (from the Organism property, if I remember correctly), so the count can be wrong depending on how the gbk files were produced.
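
To check what the gbk headers actually contain, a quick diagnostic can tally the ORGANISM values across the input files. This is a minimal sketch that reads the GenBank text directly with a regular expression; it is not how BiG-SCAPE parses the files, and the function names and sample records are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def organism_from_gbk_text(text):
    """Return the value of the first ORGANISM line in GenBank text, or '' if absent."""
    match = re.search(r"^[ \t]*ORGANISM[ \t]+(.*)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else ""


def count_organisms(gbk_texts):
    """Tally ORGANISM values over an iterable of GenBank file contents."""
    return Counter(organism_from_gbk_text(text) for text in gbk_texts)


def count_organisms_in_dir(directory):
    """Tally ORGANISM values over every *.gbk file found under a directory."""
    paths = sorted(Path(directory).rglob("*.gbk"))
    return count_organisms(p.read_text(errors="replace") for p in paths)


# Hypothetical header fragments, for illustration only.
sample_a = "LOCUS       c1\nSOURCE      Streptomyces sp. A1\n  ORGANISM  Streptomyces sp. A1\n"
sample_b = "LOCUS       c2\nSOURCE      .\n  ORGANISM  .\n"

counts = count_organisms([sample_a, sample_a, sample_b])
for organism, n in counts.items():
    print(repr(organism), n)
```

If the tally shows one distinct, non-trivial organism per genome, the header is probably not the cause; empty or placeholder values (such as a single ".") suggest the file name fallback is being used instead.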

@felipevzps

Hi,

If the genome name is not in the Organism property of the gbk file, the name of the gbk file is used instead (with everything from ".cluster" or ".region" onwards stripped).

This happens here:

BiG-SCAPE/bigscape.py

Lines 3271 to 3277 in 97d616c

# get identifier info
identifier = ""
if len(bgc_info[bgc].organism) > 1:
    identifier = bgc_info[bgc].organism
else:  # use original genome file name (i.e. exclude "..clusterXXX from antiSMASH run")
    file_name_base = os.path.splitext(os.path.basename(genbankDict[bgc][0]))[0]
    identifier = file_name_base.rsplit(".cluster",1)[0].rsplit(".region", 1)[0]

If you are working with 3 clusters from the same genome (contig_1.region001.gbk, contig_2.region001.gbk, contig_3.region001.gbk), the script will treat them as 3 genomes (please correct me if I am wrong).

I haven't had time to read the entire code, but I think you could rename your input files before running BiG-SCAPE so that they include the genome name (genome1.region001.contig_1.gbk, genome1.region001.contig_2.gbk, genome1.region001.contig_3.gbk).
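
The effect of the file name fallback can be reproduced in isolation. The sketch below copies only the two string operations from the quoted bigscape.py lines into a hypothetical helper, `fallback_identifier`, and applies it to the example file names from this comment; it does not run any other part of BiG-SCAPE.

```python
import os


def fallback_identifier(gbk_path):
    """Mirror the quoted fallback: strip the extension, then cut at '.cluster' / '.region'."""
    file_name_base = os.path.splitext(os.path.basename(gbk_path))[0]
    return file_name_base.rsplit(".cluster", 1)[0].rsplit(".region", 1)[0]


original = [
    "contig_1.region001.gbk",
    "contig_2.region001.gbk",
    "contig_3.region001.gbk",
]
renamed = [
    "genome1.region001.contig_1.gbk",
    "genome1.region001.contig_2.gbk",
    "genome1.region001.contig_3.gbk",
]

original_ids = {fallback_identifier(p) for p in original}
renamed_ids = {fallback_identifier(p) for p in renamed}

print(sorted(original_ids))  # ['contig_1', 'contig_2', 'contig_3']
print(sorted(renamed_ids))   # ['genome1']
```

With the original names each contig yields its own identifier, while the renamed files all collapse to the same genome name because everything from ".region" onwards is discarded.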

Here is a bash script that adds the genome name to the cluster gbk file names by creating renamed symbolic links in the directory where the script is run (the BiG-SCAPE input directory):

#!/bin/bash

# Directory where the genome folders are located
genomes_dir="path_to_antiSMASH_output/"

# Loop through all genome folders
for genome_dir in "$genomes_dir"/*; do
    # Skip anything that is not a directory
    [ -d "$genome_dir" ] || continue

    # Extract the genome name from the folder
    genome=$(basename "$genome_dir")

    # Find all gbk files containing "region" in their name inside the genome folder
    find "$genome_dir" -type f -name "*region*.gbk" | while IFS= read -r gbk_file; do
        # Extract the file name without extension
        filename=$(basename "$gbk_file" .gbk)
        # Extract the region number from the file
        region_number=$(echo "$filename" | grep -oP 'region\d+')

        # Extract the contig number from the file
        contig_number=$(echo "$filename" | grep -oP 'contig_\d+')

        # Create the new file name
        new_filename="${genome}.${region_number}.${contig_number}.gbk"

        # Create the symbolic link with the new name in the current directory
        ln -s "$gbk_file" "./$new_filename"
    done
done
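
The script above relies on `grep -oP`, which is a GNU grep option and may be unavailable on some systems (for example, the default grep on macOS). As an alternative, here is a rough Python sketch of the same idea. It is a variation rather than a line-for-line port: instead of requiring "contig_N" in the file name, it keeps whatever precedes ".regionNNN", which is an assumption about how the antiSMASH files are named. The names `linked_name` and `link_regions` are hypothetical.

```python
import re
from pathlib import Path


def linked_name(genome, gbk_name):
    """Build '<genome>.<regionNNN>.<record>.gbk' from '<record>.<regionNNN>.gbk'.

    Returns None when the file name does not end in '.regionNNN.gbk'.
    """
    stem = Path(gbk_name).stem  # e.g. "contig_1.region001"
    match = re.fullmatch(r"(.+)\.(region\d+)", stem)
    if match is None:
        return None
    record, region = match.groups()
    return f"{genome}.{region}.{record}.gbk"


def link_regions(antismash_root, target_dir="."):
    """Symlink every region gbk under each genome folder into target_dir."""
    target = Path(target_dir)
    for genome_dir in sorted(Path(antismash_root).iterdir()):
        if not genome_dir.is_dir():
            continue
        for gbk in sorted(genome_dir.rglob("*region*.gbk")):
            name = linked_name(genome_dir.name, gbk.name) if gbk.is_file() else None
            if name is None:
                continue
            link = target / name
            # Leave existing links or files untouched instead of overwriting them
            if not link.exists() and not link.is_symlink():
                link.symlink_to(gbk.resolve())


print(linked_name("genome1", "contig_1.region001.gbk"))
```

Calling `link_regions("path_to_antiSMASH_output")` from the BiG-SCAPE input directory would then create the links; the path is the same placeholder used in the bash script.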

best,
Felipe

@jorgecnavarrom
Collaborator

Thanks Felipe!

It's hard to make a one-size-fits-all solution for cases like these, where data is missing (e.g. someone having a selected set of gbk files from different genomes in a custom folder, or metagenomic datasets).
