Discrepancies in multiple cutoff parameters results and networks #258

Open
alpole23 opened this issue Feb 7, 2025 · 2 comments

Comments

@alpole23

alpole23 commented Feb 7, 2025

Dear developers,
It might just be my misunderstanding of how the cutoffs and mixed networking work in BiG-SCAPE 2, but I am confused by the differences I am seeing for the same dataset at different cutoffs.

My command is as follows: 
bigscape cluster -i results/antismash -o results/bigscape -c 6 --pfam-path pfam/Pfam-A.hmm --mix --classify class --include-singletons --gcf-cutoffs 0.5,0.7 --mibig-version 3.1

I have included a screen capture of the run information:

Image

The table below shows the total number of genomes and the total number of BGCs, as predicted by antiSMASH, for the two cutoffs. Since I am using exactly the same antiSMASH dataset, shouldn't these numbers be the same?

| | cutoff 0.5 | cutoff 0.7 |
|---|---|---|
| total # genomes | 582 | 1330 |
| total BGCs | 3914 | 5884 |

I have also screen captured the "mix" network for each cutoff after selecting visualize all. Given the data in the table above, shouldn't I be expecting a network of 5884 BGCs for cutoff 0.7 and a mixed network of 3914 BGCs for cutoff 0.5?

Mixed Network for cutoff 0.7:

Image


Mixed Network for cutoff 0.5:

Image


Any information, suggestions, or insights into how to reconcile these data would be greatly appreciated. Thanks!

@nlouwen
Collaborator

nlouwen commented Feb 10, 2025

Hi!

On the first question: the difference in numbers between cutoffs is most likely because connected components (CCs) that contain only reference or MIBiG BGCs are not included in the output. A lower cutoff produces more MIBiG-only CCs, so fewer BGCs remain in the run. Separately, the reported number of genomes/BGCs is currently too high (roughly doubled) when --mix and --classify are used together; this will be fixed in our next release.
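To illustrate why the count can shrink at a lower cutoff, here is a toy sketch in plain Python. It is not BiG-SCAPE code: the record names, the distances, and the rule "keep an edge when its distance is at or below the cutoff" are all assumptions made for the example. It builds connected components at each cutoff and drops any component that contains only reference records.

```python
from collections import defaultdict


def connected_components(nodes, edges, cutoff):
    """Group nodes into components using only edges with distance <= cutoff."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, dist in edges:
        if dist <= cutoff:  # assumed edge rule for this toy example
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def retained_records(nodes, edges, cutoff, is_reference):
    """Return the records left after dropping reference-only components."""
    kept = set()
    for component in connected_components(nodes, edges, cutoff):
        if any(not is_reference(n) for n in component):
            kept |= component
    return kept


# Hypothetical records: two query BGCs and three reference BGCs.
nodes = ["query_1", "query_2", "ref_A", "ref_B", "ref_C"]
edges = [
    ("query_1", "query_2", 0.3),
    ("query_1", "ref_A", 0.6),   # only connects at the looser cutoff
    ("ref_B", "ref_C", 0.4),     # reference-only pair, dropped at both cutoffs
]


def is_ref(name):
    return name.startswith("ref_")


for cutoff in (0.5, 0.7):
    kept = retained_records(nodes, edges, cutoff, is_ref)
    print(cutoff, len(kept), sorted(kept))
```

In this toy data, ref_A only joins a component containing a query record at cutoff 0.7, so it is counted there but dropped at 0.5. The same mechanism would make the total at 0.5 smaller than at 0.7 even though the input is identical.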

The screenshots of the mix networks indeed do not look as expected. I have not been able to reproduce this kind of result with the command you used, so I am not sure what caused it. To figure that out, I would ideally take a look at the output folder, if you could share it via e.g. Google Drive (assuming the data is not private). Otherwise, you could try a pure mix run (--mix --classify none) with a fresh output directory to see if the same discrepancy occurs.
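As a sketch of that second option, the snippet below takes the command from the original post and changes only two things: the value of --classify becomes none, and -o points at a new directory. The directory name results/bigscape_mix_only is a hypothetical placeholder, and the script only prints the command rather than running it.

```python
import shlex

# Command exactly as given in the original post.
ORIGINAL = (
    "bigscape cluster -i results/antismash -o results/bigscape -c 6 "
    "--pfam-path pfam/Pfam-A.hmm --mix --classify class --include-singletons "
    "--gcf-cutoffs 0.5,0.7 --mibig-version 3.1"
)


def with_option(args, flag, value):
    """Return a copy of args with the value following `flag` replaced."""
    out = list(args)
    idx = out.index(flag)  # raises ValueError if the flag is absent
    out[idx + 1] = value
    return out


args = shlex.split(ORIGINAL)
args = with_option(args, "--classify", "none")
# Hypothetical fresh output directory, so nothing from the earlier run is reused.
args = with_option(args, "-o", "results/bigscape_mix_only")

pure_mix_cmd = shlex.join(args)
print(pure_mix_cmd)
```

Everything else (input path, cutoffs, MIBiG version) is left unchanged so the two runs stay comparable.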

@alpole23
Author

Thanks for the feedback. Unfortunately, I cannot share the data publicly, but I will play around with the pure mix run with some datasets that I can publicize and see if I get the same unexpected network results.
