EXITING because of INPUT ERROR: the file format of the genomeFastaFile #23

Open
JAYRJPT opened this issue Jul 2, 2021 · 3 comments

JAYRJPT commented Jul 2, 2021

Hello
I am using Clinker to visualize one of the fusions reported by the FusionCatcher tool. I have made a CSV file with the coordinates of the DUX4:IGH@ fusion, modeled on the bcr_abl1.csv file in the test folder.
Here is my command:
bpipe -p out=/home/deepak/output -p caller=$CLINKERDIR/test/caller/dux4_igh.csv -p col=1,2,3,4 -p genome=38 -p print=true -p competitive=true -p header=true -p align_mem=31025992405 -p genome_mem=31025992405 -p threads=30 -p fusions=DUX4:IGH@ $CLINKERDIR/workflow/clinker.pipe $CLINKERDIR/test/fastq/*.fastq.gz

But I am getting an error at the alignment step:

====================================================================================================
|                              Starting Pipeline at 2021-07-02 23:25                               |
====================================================================================================

======================================== Stage generate_fst ========================================


==============================================================


	Fusion Super Transcript Generator

	A fusion visualiser.


==============================================================



==============================================================

Create fusion superTranscriptome:

WARNING: a gene (line 0 of fusion input) does not exist in annotation/hg19_ucscGenes.txt based upon breakpoint.
         Closest mapped gene name is 'RABL2B' (139512811 bp downstream)

--------------------------------------------------------------
Gene Symbols Mapped: 0 Not Mapped: 1 Total: 1
--------------------------------------------------------------

Note: Some superTranscripts were not generated. This could be because of:
	A: The breakpoint was not within a gene (this program only deals with these).
	B: The superTranscript reference file did not contain an entry for that gene symbol.
	C: You have identified the wrong columns, or they contain the wrong information, with the -pos argument.

==============================================================

Creating output directory at: /home/deepak/output
Creating fused superTranscriptome and annotation files


...Success!

Use the plot_fst bpipe workflow or IGV to visualise your results.

==============================================================


====================================== Stage star_genome_gen =======================================
Jul 02 23:25:31 ..... started STAR run
Jul 02 23:25:31 ... starting to generate Genome files

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: /home/deepak/output/reference/fst_reference.fasta is not fasta: the first character is '
' (10), not '>'.
 Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

Jul 02 23:25:31 ...... FATAL ERROR, exiting
ERROR: stage star_genome_gen failed: Command in stage star_genome_gen failed with exit status = 104 : 

STAR --runMode genomeGenerate --runThreadN 30 --genomeDir /home/deepak/output/genome --genomeFastaFiles /home/deepak/output/reference/fst_reference.fasta --limitGenomeGenerateRAM 31025992405 --genomeSAindexNbases 5 


========================================= Pipeline Failed ==========================================

Command in stage star_genome_gen failed with exit status = 104 : 

STAR --runMode genomeGenerate --runThreadN 30 --genomeDir /home/deepak/output/genome --genomeFastaFiles /home/deepak/output/reference/fst_reference.fasta --limitGenomeGenerateRAM 31025992405 --genomeSAindexNbases 5

Use 'bpipe errors' to see output from failed commands.

Here is the bpipe error output:

deepak@ngs:~/ClINKERDIR$ bpipe errors

============================== Found 1 failed commands from run 26797 ==============================

=================================== Command star_genome_gen (68) ===================================


Command    : STAR --runMode genomeGenerate --runThreadN 30 --genomeDir /home/deepak/output/genome --genomeFastaFiles /home/deepak/output/reference/fst_reference.fasta --limitGenomeGenerateRAM 31025992405 --genomeSAindexNbases 5
Started    : Fri Jul 02 23:25:31 IST 2021
Stopped    : Fri Jul 02 23:25:31 IST 2021
Exit Code  : 104
Config: 
                   Name           |  Value 
          ---------------------------------
          max_per_command_threads | 16     
          executor                | local  
          stats_update_interval   | 120000 
          outputScanConcurrency   | 5      
          maxFileNameLength       | 2048   
          name                    | stargen
          procs                   | 1      

Output    : 

	Jul 02 23:25:31 ..... started STAR run
	Jul 02 23:25:31 ... starting to generate Genome files
	
	EXITING because of INPUT ERROR: the file format of the genomeFastaFile: /home/deepak/output/reference/fst_reference.fasta is not fasta: the first character is '
	' (10), not '>'.
	 Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).
	
	Jul 02 23:25:31 ...... FATAL ERROR, exiting

Any suggestions for resolving this error?

Thanks and Regards,

Jay

Contributor

breons commented Jul 2, 2021

Hi Jay, thanks for trying Clinker!

That error originates in the first stage (generate_fst), where the superTranscripts could not be located in the reference files from the input coordinates.

I noticed hg19 has an IGH@ gene, but hg38 does not (at least in Clinker's reference). Did the fusion caller use hg19? If so, simply delete the current output and change your -p genome=38 to -p genome=19.

If you're sure it's hg38, then I'll have to look into why that is missing.
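
As a quick way to confirm this before rerunning, the generated reference can be inspected directly. STAR reports that the first byte of fst_reference.fasta is a newline (character code 10), which is consistent with generate_fst having written no superTranscript records ("Gene Symbols Mapped: 0"). The sketch below is only an illustration: `check_fasta_start` is a hypothetical helper, not part of Clinker or STAR, and it only inspects the start of the file.

```python
def check_fasta_start(path):
    """Return (ok, message) for a quick sanity check on a FASTA file.

    Hypothetical helper, not part of Clinker or STAR. It reads only the
    first few kilobytes and checks that the file begins with '>'.
    """
    with open(path, "rb") as handle:
        head = handle.read(4096)

    if not head:
        return False, "file is empty"
    if head[:1] != b">":
        if not head.strip():
            # Matches the STAR complaint: first character is '\n' (10).
            return False, "file starts with whitespace only; no records found"
        return False, f"first byte is {head[:1]!r} (code {head[0]}), expected '>'"
    return True, "file starts with a '>' header line"
```

For example, `check_fasta_start("/home/deepak/output/reference/fst_reference.fasta")` should return a failing result for the file in the log above. If it does, the thing to fix is the gene mapping in generate_fst rather than the STAR command.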

Cheers,
Breon.

Author

JAYRJPT commented Jul 3, 2021

Hi Breon,
I used FusionCatcher, which uses hg38 as its reference genome, and the coordinates I entered for the genes are hg38 coordinates.

Thanks,
Jay

Contributor

breons commented Jul 9, 2021

Hi Jay,

Sorry for the delay. I will need to rebuild the references to account for IGH@ in hg38 - it seems Clinker currently doesn't have a superTranscript for that. The bad news is that it might take me some time to put together, as I am currently finishing some other projects.

However, I'm a bit confused as to why RABL2B (chr22) is coming up as the closest gene when DUX4 and IGH@ are on other chromosomes in the hg38 reference. Would you mind sharing the CSV with the coordinates in it? Otherwise, just double-check that the positions are accurate.
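
One way to double-check the positions is to print what the selected columns actually contain. The sketch below is a rough illustration under stated assumptions: it assumes `col=1,2,3,4` selects chromosome 1, breakpoint 1, chromosome 2, breakpoint 2 (in that order) and that `header=true` means the first row is skipped. `read_breakpoints` and `flag_unexpected` are hypothetical helpers, not Clinker functions, and the sample row uses made-up placeholder coordinates.

```python
import csv
import io


def read_breakpoints(csv_text, cols=(1, 2, 3, 4), header=True):
    """Yield (chrom1, pos1, chrom2, pos2) from CSV text.

    `cols` holds 1-based column numbers. The assumed order
    (chrom1, pos1, chrom2, pos2) is an assumption about the caller file.
    """
    rows = csv.reader(io.StringIO(csv_text))
    if header:
        next(rows, None)  # skip the header row
    for row in rows:
        if not row:
            continue  # ignore blank lines
        chrom1, pos1, chrom2, pos2 = (row[c - 1].strip() for c in cols)
        yield chrom1, int(pos1), chrom2, int(pos2)


def normalise(chrom):
    """Drop a leading 'chr' so 'chr14' and '14' compare equal."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def flag_unexpected(breakpoints, expected):
    """Return (row_index, [chromosomes]) for rows off the expected chromosomes."""
    wanted = {normalise(c) for c in expected}
    flagged = []
    for index, (chrom1, _, chrom2, _) in enumerate(breakpoints):
        bad = [c for c in (chrom1, chrom2) if normalise(c) not in wanted]
        if bad:
            flagged.append((index, bad))
    return flagged


# Placeholder data only: these coordinates are invented for illustration.
sample = "chrom1,base1,chrom2,base2\nchr22,100,chr14,200\n"
rows = list(read_breakpoints(sample))
print(flag_unexpected(rows, expected={"chr4", "chr14"}))
```

The expected chromosomes are supplied by the caller. A flagged row suggests either swapped columns or coordinates taken from a different field of the fusion caller's output.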

Thanks!
Breon.
