Issue when using AlleleCall with a large number of genomes #176

Closed
ccrinconc opened this issue May 31, 2023 · 3 comments

@ccrinconc

Dear team,

I tried to run an allele call on a large number of genomes (>3000) and ran into an issue:
chewie AlleleCall -i Genomes/ -g senterica_INNUENDO_cgMLST/ -o Results --cpu 30 --ptf ../senterica_INNUENDO_cgMLST/Salmonella_enterica.trn --output-novel --no-inferred

chewBBACA version: 3.2.0
Authors: Rafael Mamede, Pedro Cerqueira, Mickael Silva, João Carriço, Mário Ramirez
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
chewBBACA - AlleleCall

Started at: 2023-05-31T21:09:57

Input argument is not a valid directory or file with a list of paths to FASTA files. Please provide a valid input, either a folder with FASTA files or a file with the list of full paths to FASTA files (one per line and ending with one of the following file extensions: ['.fasta', '.fna', '.ffn', '.fa', '.fas']).

When I provide a folder with 1000 genomes, the same command works.

chewBBACA was installed using conda.

I guess one option is to run the allele call in parts and then use "chewBBACA.py JoinProfiles". What would you suggest when using this command?

Thanks for looking into this.

Best regards,

Cristian

@rfm-targa rfm-targa self-assigned this Jun 1, 2023
@rfm-targa
Contributor

rfm-targa commented Jun 1, 2023

Greetings @ccrinconc,

I downloaded the S. enterica schema available in Chewie-NS and performed allele calling with 7,744 genome assemblies downloaded from the NCBI. The process completed without any errors or warnings. The warning that chewBBACA prints to stdout indicates that it cannot find any FASTA files in the input directory, which can happen for several reasons. Some common causes are the following:

  • The path passed to the -i parameter does not match the relative path or complete path to the directory that contains the FASTA files. Verify that the Genomes folder contains the FASTA files and that the FASTA files are not in subdirectories inside the Genomes folder.
  • The FASTA files do not end with any of the following file extensions: .fasta, .fna, .ffn, .fa, .fas. The files will only be accepted if the filenames end with one of the accepted file extensions.
  • The input path does not exist, or you have insufficient permissions to access its contents.
  • We have detected some issues or variations when using Python 3.11 or BLAST>2.9. If the issue is unrelated to any of the previous points, please check the Python and BLAST versions and downgrade if you need to. We recommend using Python 3.9 and BLAST 2.9.
  • The path passed to the -g parameter does not match the relative path or complete path to the schema directory. The path should point to the main schema directory, which contains FASTA files and a folder named short.
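
As a rough aid for working through the list above, the following Python sketch inspects a candidate input directory for the causes named there: a missing or unreadable directory, files without an accepted extension, FASTA files placed in subdirectories, and symbolic links whose target is missing or unreadable. The function name `diagnose_input_dir` and the report keys are hypothetical. This is not how chewBBACA itself validates its input, only an approximation of the checks described.

```python
import os
from pathlib import Path

# Extensions listed in the chewBBACA message quoted in this thread.
FASTA_EXTENSIONS = (".fasta", ".fna", ".ffn", ".fa", ".fas")


def diagnose_input_dir(input_dir):
    """Return a dict of findings about a candidate genome directory.

    Hypothetical troubleshooting helper; it is not chewBBACA code.
    """
    root = Path(input_dir)
    report = {
        "exists": root.is_dir(),
        "listable": False,    # True if the directory contents can be read
        "accepted": [],       # readable files with an accepted extension
        "bad_extension": [],  # files with any other extension
        "broken_links": [],   # symlinks whose target is missing
        "unreadable": [],     # files the current user cannot read
        "subdirs": [],        # subdirectories (their contents are not inspected)
    }
    if not report["exists"]:
        return report
    try:
        entries = sorted(root.iterdir())
    except PermissionError:
        return report
    report["listable"] = True
    for entry in entries:
        if entry.is_symlink() and not entry.exists():
            report["broken_links"].append(entry.name)
        elif entry.is_dir():
            report["subdirs"].append(entry.name)
        elif not entry.name.endswith(FASTA_EXTENSIONS):
            report["bad_extension"].append(entry.name)
        elif not os.access(entry, os.R_OK):
            report["unreadable"].append(entry.name)
        else:
            report["accepted"].append(entry.name)
    return report
```

Calling `diagnose_input_dir("Genomes/")` and printing the lists that are not empty should point to whichever cause applies, if any. `os.access` follows symbolic links, so a link to a file that the current user cannot read is reported under `unreadable`.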

You might have already checked several or all of the possible causes, but it was important to provide a list to rule out common causes and close in on the issue leading to the problem you found. The warning you are getting is triggered when chewBBACA cannot detect the input path as either a file with a list of paths or a directory containing FASTA files. This might indicate that there's an incomplete or missing path or that it is an issue related to system configuration or permissions.
chewBBACA should start by creating the Results folder to store the results. It also writes TXT files with the list of genes, listGenes2Call.txt, and the list of genomes to use, listGenomes2Call.txt. You can check if those files were written into the Results folder to understand if the process fails to detect the genome FASTA files, the schema FASTA files, or both.
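
The check described above can be sketched in Python as follows. The two file names are taken from this reply. The assumptions are that both files sit directly inside the output folder and hold one entry per line; the function name `count_listed_paths` is hypothetical.

```python
from pathlib import Path

# File names as given in the reply above.
EXPECTED_LISTS = ("listGenes2Call.txt", "listGenomes2Call.txt")


def count_listed_paths(output_dir):
    """Map each expected list file to its number of non-empty lines.

    A value of None means the file was not found in output_dir.
    Assumes one entry per line, which is not confirmed in this thread.
    """
    counts = {}
    for name in EXPECTED_LISTS:
        path = Path(output_dir) / name
        if path.is_file():
            with path.open() as handle:
                counts[name] = sum(1 for line in handle if line.strip())
        else:
            counts[name] = None
    return counts
```

For example, `count_listed_paths("Results")` returning `None` or `0` for `listGenomes2Call.txt` but a positive count for `listGenes2Call.txt` would suggest that the schema was found and the genome FASTA files were not.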

Let me know if my suggestions help solve the issue.

Kind regards,

Rafael

@rfm-targa
Contributor

The command you shared includes the senterica_INNUENDO_cgMLST/ path to the schema directory and the ../senterica_INNUENDO_cgMLST/Salmonella_enterica.trn path to the training file. If the senterica_INNUENDO_cgMLST part refers to the same directory, then the issue might be caused by one of those paths being incorrect (the schema should either be in senterica_INNUENDO_cgMLST/ or in ../senterica_INNUENDO_cgMLST/).
You also do not need to provide the path to the Prodigal training file if the schema already includes the training file.
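
To see where the two relative paths from the command actually point, a small Python sketch like the one below can resolve both against the working directory and compare them. The function name `compare_schema_paths` and the working directory `/work/run` used in the example are hypothetical, and `os.path.normpath` only collapses `..` segments textually; it does not follow symbolic links.

```python
import os


def compare_schema_paths(schema_dir, training_file, cwd=None):
    """Resolve both paths against cwd and compare their directories.

    Hypothetical helper: reports whether the training file lies directly
    inside the schema directory and whether each path exists on disk.
    """
    cwd = cwd or os.getcwd()
    schema_abs = os.path.normpath(os.path.join(cwd, schema_dir))
    training_abs = os.path.normpath(os.path.join(cwd, training_file))
    training_dir = os.path.dirname(training_abs)
    return {
        "schema_dir": schema_abs,
        "training_dir": training_dir,
        "same_directory": training_dir == schema_abs,
        "schema_exists": os.path.isdir(schema_abs),
        "training_exists": os.path.isfile(training_abs),
    }


# Paths from the command in this thread; /work/run is a made-up location.
result = compare_schema_paths(
    "senterica_INNUENDO_cgMLST/",
    "../senterica_INNUENDO_cgMLST/Salmonella_enterica.trn",
    cwd="/work/run",
)
print(result["schema_dir"])      # /work/run/senterica_INNUENDO_cgMLST
print(result["training_dir"])    # /work/senterica_INNUENDO_cgMLST
print(result["same_directory"])  # False
```

With these inputs the two paths resolve to different directories, so at most one of them can be the real schema location unless two copies of the schema exist.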

@ccrinconc
Author

Dear Rafael,

Thank you very much for the quick answer. I had indeed verified most of the things you mentioned: the files exist, the paths are correct, and the file extension is ".fasta". I did not think about the permissions; this was indeed the problem, as I'm using symlinks.
I have now corrected this and AlleleCall is working.
For the record, I have Python 3.11 and BLAST 2.14, and this does not seem to be an issue for now.

Thank you once again,

Best regards,

Cristian
