Error: Type checker found wrong number of fields while tokenizing data line (while building reference) #167

Open
sheilazuniga opened this issue Oct 28, 2022 · 5 comments

@sheilazuniga

sheilazuniga commented Oct 28, 2022

Hi,

I'm not sure why this is happening. I ran:
IRFinder -m BuildRefFromSTARRef -r /my_dir/GRCh38/irfinder/ -x /my_dir/GRCh38/genome-index_STAR -e /my_dir/GRCh38/extra-input-files-IRFinder/RNA.SpikeIn.ERCC.fasta.gz -b /my_dir/GRCh38/extra-input-files-IRFinder/Human_hg38_nonPolyA_ROI.bed

I got: merging from 64 files and 64 in-memory blocks...
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"

The output file contains the following:

Launching reference build process. The full build might take hours.
<Phase 1: STAR Reference Preparation>
Oct 27 17:01:21 ... copying the genome FASTA file...
Oct 27 17:01:37 ... copying the transcriptome GTF file...
Oct 27 17:01:42 ... copying the STAR reference folder...
<Phase 2: Mapability Calculation>
Oct 27 17:03:11 ... mapping genome fragments back to genome...
Oct 27 17:29:03 ... sorting aligned genome fragments...
Oct 27 17:35:40 ... indexing aligned genome fragments...
Oct 27 17:36:02 ... filtering aligned genome fragments by chromosome/scaffold...
Oct 27 17:39:12 ... merging filtered genome fragments...
Oct 27 17:39:57 ... calculating regions for exclusion...
Oct 27 17:45:01 ... cleaning temporary files...
<Phase 3: IRFinder Reference Preparation>
Oct 27 17:45:03 ... building Ref 1...
Oct 27 17:45:21 ... building Ref 2...
Oct 27 17:45:23 ... building Ref 3...
Oct 27 17:45:23 ... building Ref 4...
Oct 27 17:45:29 ... building Ref 5...
Oct 27 17:45:39 ... building Ref 6...
Oct 27 17:45:39 ... building Ref 7...
Oct 27 17:45:41 ... building Ref 8...
Oct 27 17:45:41 ... building Ref 9...
Oct 27 17:45:41 ... building Ref 10c...
Oct 27 17:45:41 ... building Ref 11c...

And when I list the output directory, I see:

-rw-r--r-- 1 szm group 5.2K Oct 27 17:45 ref-ROI.bed
-rw-r--r-- 1 szm group 0 Oct 27 17:45 intergenic.ROI.bed
-rw-r--r-- 1 szm group 81M Oct 27 17:45 ref-cover.bed
-rw-r--r-- 1 szm group 5.8M Oct 27 17:45 ref-sj.ref
-rw-r--r-- 1 szm group 6.5M Oct 27 17:45 ref-read-continues.ref
-rw-r--r-- 1 szm group 31M Oct 27 17:45 exclude.omnidirectional.bed
-rw-r--r-- 1 szm group 22M Oct 27 17:45 exclude.directional.bed
-rw-r--r-- 1 szm group 13M Oct 27 17:45 introns.unique.bed

Could you please help me?

Thank you in advance,

Sheila
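
The listing above shows `intergenic.ROI.bed` at 0 bytes. A minimal sketch for flagging zero-byte files in a reference directory is below. The helper name `empty_files` is hypothetical, and whether an empty `intergenic.ROI.bed` indicates a failed build is not established in this thread, so treat the result as a pointer for further inspection rather than a verdict.

```python
from pathlib import Path


def empty_files(ref_dir):
    """Return the sorted names of zero-byte regular files directly inside ref_dir."""
    return sorted(
        entry.name
        for entry in Path(ref_dir).iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )
```

For example, `empty_files("/my_dir/GRCh38/irfinder/IRFinder")` would list the empty files if that is where the `.bed` and `.ref` outputs were written; the exact subdirectory is an assumption.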

@dg520
Collaborator

dg520 commented Oct 31, 2022

@sheilazuniga What is your OS environment? Have you tried to re-compile the irfinder core?

@sheilazuniga
Author

Thanks for your answer. I'm working on a cluster running:
NAME="Red Hat Enterprise Linux"
VERSION="8.5 (Ootpa)"
I haven't recompiled it; I installed it using conda.

@dg520
Collaborator

dg520 commented Nov 1, 2022

@sheilazuniga The conda package is maintained by a third party so I don't know how it is built. The only suggestion I can provide here is to download the GitHub version and try to recompile according to the wiki page.

@yangjywhu

yangjywhu commented Apr 1, 2023

Hi @dg520
I recompiled according to the wiki page and encountered the same problem in both v1.3.1 and v1.2.6:
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"

In v1.3.1:

IRFinder -m BuildRefFromSTARRef \
  -r result/index/IRFinder \
  -x index_STAR \
  -e RNA.SpikeIn.ERCC.fasta.gz \
  -b Human_hg38_nonPolyA_ROI.bed

I got this output information:

Launching reference build process. The full build might take hours.
<Phase 1: STAR Reference Preparation>
Mar 30 21:46:32 ... copying the genome FASTA file...
Mar 30 21:46:48 ... copying the transcriptome GTF file...
Mar 30 21:46:54 ... copying the STAR reference folder...
<Phase 2: Mapability Calculation>
Mar 30 21:49:27 ... mapping genome fragments back to genome...
Mar 31 03:37:00 ... sorting aligned genome fragments...
[bam_sort_core] merging from 60 files and 60 in-memory blocks...
Mar 31 04:10:37 ... indexing aligned genome fragments...
Mar 31 04:13:09 ... filtering aligned genome fragments by chromosome/scaffold...
Mar 31 05:24:51 ... merging filtered genome fragments...
Mar 31 05:26:47 ... calculating regions for exclusion...
Mar 31 05:35:16 ... cleaning temporary files...
<Phase 3: IRFinder Reference Preparation>
Mar 31 05:35:38 ... building Ref 1...
Mar 31 05:36:28 ... building Ref 2...
Mar 31 05:36:32 ... building Ref 3...
Mar 31 05:36:32 ... building Ref 4...
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Mar 31 05:36:46 ... building Ref 5...
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Mar 31 05:37:04 ... building Ref 6...
Mar 31 05:37:05 ... building Ref 7...
Mar 31 05:37:09 ... building Ref 8...
Mar 31 05:37:10 ... building Ref 9...
Mar 31 05:37:10 ... building Ref 10c...
Mar 31 05:37:10 ... building Ref 11c...

The output directory looks like this. Does that mean the build ran successfully?

total 156M
-rw-r--r--. 1 yangjiayi zhoulab  23M Mar 31 21:08 exclude.directional.bed
-rw-r--r--. 1 yangjiayi zhoulab  30M Mar 31 21:08 exclude.omnidirectional.bed
-rw-r--r--. 1 yangjiayi zhoulab    0 Mar 31 21:11 intergenic.ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab  13M Mar 31 21:08 introns.unique.bed
-rw-r--r--. 1 yangjiayi zhoulab  80M Mar 31 21:08 ref-cover.bed
-rw-r--r--. 1 yangjiayi zhoulab 6.4M Mar 31 21:08 ref-read-continues.ref
-rw-r--r--. 1 yangjiayi zhoulab 5.2K Mar 31 21:11 ref-ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab 5.7M Mar 31 21:08 ref-sj.ref

In v1.2.6:

IRFinder -m BuildRefProcess \
  -r result/index/IRFinder \
  -e RNA.SpikeIn.ERCC.fasta.gz \
  -b Human_hg38_nonPolyA_ROI.bed

I got this output information:

Launching reference build process. The full build should take at least one hour.
Usage : /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/.snakemake/conda/ee893424cf93d5f01a4ab9e5e1b2aa21_/opt/irfinder-1.2.6/bin/util/IRFinder-BuildRefFromEnsembl mode threads STAR-executable base_ftp_url_of_ensembl_genome+gtf output_directory(must not exist) additional_genome_reference(eg: ERCC) non_polyA_genes-as-bed region_blacklist-as-bed
Usage example: /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/.snakemake/conda/ee893424cf93d5f01a4ab9e5e1b2aa21_/opt/irfinder-1.2.6/bin/util/IRFinder-BuildRefFromEnsembl BuildRef 12 STAR "ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/" "IRFinder/REF/Human" "Refernce-ERCC.fa.gz" [non_polyA_genes.bed] [blacklist.bed]
	STAR --runMode genomeGenerate --genomeDir /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/STAR --genomeFastaFiles /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/genome.fa --sjdbGTFfile /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/transcripts.gtf --sjdbOverhang 150 --runThreadN 28 &>> log-star-build-ref.log
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Apr 02 11:41:03 ..... started STAR run
!!!!! WARNING: Could not move Log.out file from ./Log.out into /beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/result/index/IRFinder/STAR/Log.out. Will keep ./Log.out

Apr 02 11:41:04 ... starting to generate Genome files
Apr 02 11:42:08 ..... processing annotations GTF
Apr 02 11:42:37 ... starting to sort Suffix Array. This may take a long time...
Apr 02 11:42:54 ... sorting Suffix Array chunks and saving them to disk...
Apr 02 16:00:28 ... loading chunks from disk, packing SA...
Apr 02 16:04:37 ... finished generating suffix array
Apr 02 16:04:37 ... generating Suffix Array index
Apr 02 16:07:48 ... completed Suffix Array index
Apr 02 16:07:48 ..... inserting junctions into the genome indices
Apr 02 16:19:35 ... writing Genome to disk ...
Apr 02 16:19:37 ... writing Suffix Array to disk ...
Apr 02 16:20:51 ... writing SAindex to disk
Apr 02 16:21:00 ..... finished successfully
Star genome build result: 0
Commence STAR mapping run for mapability.
Sun Apr  2 16:21:01 CST 2023

real	302m8.719s
user	263m51.874s
sys	25m59.972s
Completed STAR run.
Sun Apr  2 21:23:10 CST 2023
Commence Coverage calculation.

real	16m42.891s
user	12m53.464s
sys	3m48.529s

real	0m3.080s
user	0m2.942s
sys	0m0.059s
Completed coverage exclusion calculation.
Sun Apr  2 21:39:56 CST 2023
Mapability result: 0
Build Ref 1
Build Ref 2
Build Ref 3
Build Ref 4
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Build Ref 5
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Build Ref 6
Build Ref 7
Build Ref 8
Build Ref 9
Build Ref 10c
Build Ref 11c
/beegfs/zhoulab/yangjiayi/project/fracRNA/reference/human/.snakemake/conda/ee893424cf93d5f01a4ab9e5e1b2aa21_/opt/irfinder-1.2.6/bin/util/Build-BED-refs.sh: line 186: [: missing `]'
COMPLETE
Ref build result: 0
ALL DONE

The output directory looks like this:

total 156M
-rw-r--r--. 1 yangjiayi zhoulab  23M Apr  2 21:37 exclude.directional.bed
-rw-r--r--. 1 yangjiayi zhoulab  30M Apr  2 21:37 exclude.omnidirectional.bed
-rw-r--r--. 1 yangjiayi zhoulab    0 Apr  2 21:41 intergenic.ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab  13M Apr  2 21:37 introns.unique.bed
-rw-r--r--. 1 yangjiayi zhoulab  80M Apr  2 21:38 ref-cover.bed
-rw-r--r--. 1 yangjiayi zhoulab 6.4M Apr  2 21:38 ref-read-continues.ref
-rw-r--r--. 1 yangjiayi zhoulab 5.2K Apr  2 21:41 ref-ROI.bed
-rw-r--r--. 1 yangjiayi zhoulab 5.7M Apr  2 21:38 ref-sj.ref

My OS information:

LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.7.1908 (Core)
Release:        7.7.1908
Codename:       Core

What can I do to fix it? Can I quantify IR with this reference?

Thanks
Yang
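
The error message itself suggests checking for an extra TAB with `cat -t`. The sketch below does a similar check in Python on any tab-separated BED file, such as the file passed with `-b` or an intermediate file. It is only a diagnostic aid: the function name `check_bed_lines` is hypothetical, the rule that every data line should have as many fields as the first data line is an assumption, and carriage returns from Windows line endings are included as one possible cause that nobody in this thread has confirmed.

```python
def check_bed_lines(lines):
    """Return (line_number, problem) pairs for lines that look malformed.

    Assumption: every data line should have the same number of
    tab-separated fields as the first data line.
    """
    problems = []
    expected = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.endswith("\r"):
            problems.append((number, "carriage return at end of line"))
            line = line.rstrip("\r")
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.endswith("\t"):
            problems.append((number, "trailing TAB"))
        # Count fields without the trailing TAB so it is reported only once.
        fields = line.rstrip("\t").split("\t")
        if expected is None:
            expected = len(fields)
        elif len(fields) != expected:
            problems.append((number, f"{len(fields)} fields, expected {expected}"))
    return problems
```

To run it on a file, open the file with `newline=""` so that Python does not silently convert `\r\n` to `\n`, for example `check_bed_lines(open("Human_hg38_nonPolyA_ROI.bed", newline=""))`. An empty result only means this particular check found nothing.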

@dg520
Collaborator

dg520 commented Apr 13, 2023

@yangjywhu Could this extra-TAB problem be caused by the version of Bedtools you are using? I have never seen it before, so I cannot judge whether your reference was prepared successfully. I suggest updating Bedtools or trying a clean installation of Bedtools v2.30.0 (make sure the new Bedtools is the one called by default).
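
To see which Bedtools is called by default, something like the following sketch may help. It looks up the first `bedtools` on `PATH` and parses the output of `bedtools --version` (which prints a string such as `bedtools v2.30.0`). The helper names `parse_bedtools_version` and `default_bedtools` are hypothetical, and the sketch assumes IRFinder picks up `bedtools` from `PATH` in the same way.

```python
import re
import shutil
import subprocess


def parse_bedtools_version(text):
    """Extract (major, minor, patch) from text such as 'bedtools v2.30.0'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def default_bedtools():
    """Return (path, version) for the first bedtools on PATH, or None if absent."""
    path = shutil.which("bedtools")
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Check both streams in case the version string is not written to stdout.
    return path, parse_bedtools_version(result.stdout + result.stderr)
```

Calling `default_bedtools()` inside the same conda environment or shell session used for the reference build shows the path and version that would actually be used there.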
