Basic GTF not created in `Viral_Track_cell_demultiplexing.R` #10

yeredh · 2020-07-03T10:55:18Z

Hello Pierre,

I was able to run Viral-Track successfully. However, I noticed that if you don't run the transcriptome assembly step Viral_Track_cell_demultiplexing.R crashes because the basic GTF file is not created (see code below) and the featureCounts call fails because it can't find the GTF file


###Finding the path to the GTF : if none fonmd -> 'pseudo GTF' created

Path_GTF = paste(Output_directory,Name_run,"Merged_GTF.txt",sep = "/")

if (!file.exists(Path_GTF)) {
  cat("No GTF file found. Creating a basic GTF file ")
  
  #Loading the viral annotation file 
  Viral_annotation = read.delim(Viral_annotation_file)
  Viral_annotation = Viral_annotation[Viral_annotation$Name_sequence!=" ",]
  
  #Selecting the viral segments 
  Viral_annotation = Viral_annotation[Identified_viral_fragments,]
  Length_segments = as.numeric(Viral_annotation$Genome_length)
  
  for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
  }
  
}

In the code below, the transcript lines are not written into a text file to update Path_GTF, so

  #Assigning reads to transcripts using Rsubread Featurecounts
  x = suppressMessages(featureCounts(files = paste(List_target_path,"/Reads_to_demultiplex.bam",sep=""),
                    annot.ext = Path_GTF ,isGTFAnnotationFile = T,
                    reportReads = "BAM",reportReadsPath = List_target_path,verbose = F,primaryOnly = T,allowMultiOverlap = T))

crashes because Path_GTF = paste(Output_directory,Name_run,"Merged_GTF.txt",sep = "/") does not exist.

Thanks again for your time and attention,

Yered

The text was updated successfully, but these errors were encountered:

heathergeiger · 2021-06-23T20:08:53Z

I am not the author, but I solved this by manually editing the code.

transcript_lines <- c()

for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
   transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t"))
  }

 writeLines(transcript_lines,con=Path_GTF)

Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.

Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="")

Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2)
Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])

And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.

juliabelk · 2021-08-16T22:51:22Z

In case it helps anyone else, I could not get past the gtf file step even after trying the above (not sure if it has anything to do with the R version, I am using R 4). Finally I switched the annotation file format to the SAF option (this example code is only for processing sars-cov-2 detected reads):

  saf_df <- data.frame(
    "GeneID"=c("NC_045512","NC_039199"),
    "Chr"=c("refseq|NC_045512|29903nt|Severe","refseq|NC_039199|13350nt|Human"),
    "Start"=c(1,1),
    "End"=c(29903,13350),
    "Strand"=c(".","."),
    stringsAsFactors=F
  )
 
  #Assigning reads to transcripts using Rsubread Featurecounts
  x = suppressMessages(
    featureCounts(
      files = paste(path_temp,"/Reads_to_demultiplex.bam",sep=""),
      annot.ext=saf_df,
      reportReads = "BAM",reportReadsPath = path_temp,
      verbose = F,primaryOnly = T,allowMultiOverlap = T 
    )   
  )

ghost · 2022-01-21T03:25:18Z

I am not the author, but I solved this by manually editing the code.
transcript_lines <- c()

for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
   transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t"))
  }

 writeLines(transcript_lines,con=Path_GTF)
Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.
Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="")

Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2)
Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])
And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.

Thanks, it seems like the author did not add the codes actually creating the pseudo GTF file and your codes could perfectly solve this issue. I just wanna ask how did files named like ""refseq|NC_022518|9472nt|Human"" are created, it did not work on my Ubuntu OS since "|" is not a permitted character in file names. I use awk commands to replace them with other permitted characters.

ghost · 2022-01-21T03:36:12Z

And in case it may help anyone else, if you encountered problems when exporting viral SAM files with errors of invalid argument, try to change the lines of viral names in the Virusite genome file with this command : awk '{ gsub(/|/, "character allowed"); print %0}' path_to_reference_genome/genomes.fasta > new_reference_genomes.fasta
while you can choose any allowed character and separate them by it later.

TeodoraTockovska · 2024-05-29T19:11:25Z

@heathergeiger Hi Heather. Were you able to run the 2nd script (assembly using StringTie)? It seems that the assembly isn't working for me and I am uncertain why. I have a bam file that has only 99 reads (so quite small) and StringTie is simply not working.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Basic GTF not created in `Viral_Track_cell_demultiplexing.R` #10

Basic GTF not created in `Viral_Track_cell_demultiplexing.R` #10

yeredh commented Jul 3, 2020

heathergeiger commented Jun 23, 2021

juliabelk commented Aug 16, 2021

ghost commented Jan 21, 2022

ghost commented Jan 21, 2022

TeodoraTockovska commented May 29, 2024 •

edited

Loading

Basic GTF not created in Viral_Track_cell_demultiplexing.R #10

Basic GTF not created in Viral_Track_cell_demultiplexing.R #10

Comments

yeredh commented Jul 3, 2020

heathergeiger commented Jun 23, 2021

juliabelk commented Aug 16, 2021

ghost commented Jan 21, 2022

ghost commented Jan 21, 2022

TeodoraTockovska commented May 29, 2024 • edited Loading

Basic GTF not created in `Viral_Track_cell_demultiplexing.R` #10

Basic GTF not created in `Viral_Track_cell_demultiplexing.R` #10

TeodoraTockovska commented May 29, 2024 •

edited

Loading