Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic GTF not created in Viral_Track_cell_demultiplexing.R #10

Open
yeredh opened this issue Jul 3, 2020 · 5 comments
Open

Basic GTF not created in Viral_Track_cell_demultiplexing.R #10

yeredh opened this issue Jul 3, 2020 · 5 comments

Comments

@yeredh
Copy link

yeredh commented Jul 3, 2020

Hello Pierre,

I was able to run Viral-Track successfully. However, I noticed that if you don't run the transcriptome assembly step Viral_Track_cell_demultiplexing.R crashes because the basic GTF file is not created (see code below) and the featureCounts call fails because it can't find the GTF file


###Finding the path to the GTF : if none fonmd -> 'pseudo GTF' created

Path_GTF = paste(Output_directory,Name_run,"Merged_GTF.txt",sep = "/")

if (!file.exists(Path_GTF)) {
  cat("No GTF file found. Creating a basic GTF file ")
  
  #Loading the viral annotation file 
  Viral_annotation = read.delim(Viral_annotation_file)
  Viral_annotation = Viral_annotation[Viral_annotation$Name_sequence!=" ",]
  
  #Selecting the viral segments 
  Viral_annotation = Viral_annotation[Identified_viral_fragments,]
  Length_segments = as.numeric(Viral_annotation$Genome_length)
  
  for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
  }
  
}

In the code below, the transcript lines are not written into a text file to update Path_GTF, so

  #Assigning reads to transcripts using Rsubread Featurecounts
  x = suppressMessages(featureCounts(files = paste(List_target_path,"/Reads_to_demultiplex.bam",sep=""),
                    annot.ext = Path_GTF ,isGTFAnnotationFile = T,
                    reportReads = "BAM",reportReadsPath = List_target_path,verbose = F,primaryOnly = T,allowMultiOverlap = T))

crashes because Path_GTF = paste(Output_directory,Name_run,"Merged_GTF.txt",sep = "/") does not exist.

Thanks again for your time and attention,

Yered

@heathergeiger
Copy link

I am not the author, but I solved this by manually editing the code.

transcript_lines <- c()

for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
   transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t"))
  }

 writeLines(transcript_lines,con=Path_GTF)

Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.

Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="")

Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2)
Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])

And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.

@juliabelk
Copy link

In case it helps anyone else, I could not get past the gtf file step even after trying the above (not sure if it has anything to do with the R version, I am using R 4). Finally I switched the annotation file format to the SAF option (this example code is only for processing sars-cov-2 detected reads):

  saf_df <- data.frame(
    "GeneID"=c("NC_045512","NC_039199"),
    "Chr"=c("refseq|NC_045512|29903nt|Severe","refseq|NC_039199|13350nt|Human"),
    "Start"=c(1,1),
    "End"=c(29903,13350),
    "Strand"=c(".","."),
    stringsAsFactors=F
  )
 
  #Assigning reads to transcripts using Rsubread Featurecounts
  x = suppressMessages(
    featureCounts(
      files = paste(path_temp,"/Reads_to_demultiplex.bam",sep=""),
      annot.ext=saf_df,
      reportReads = "BAM",reportReadsPath = path_temp,
      verbose = F,primaryOnly = T,allowMultiOverlap = T 
    )   
  )

@ghost
Copy link

ghost commented Jan 21, 2022

I am not the author, but I solved this by manually editing the code.

transcript_lines <- c()

for (k in 1:length(Identified_viral_fragments)) {
    transcript_line = c(Identified_viral_fragments[k],'Empty_Annotation','transcript',1,Length_segments[k],1000,".",".",
                        paste("gene_id ",Identified_viral_fragments[k],"_1;",""))
   transcript_lines <- c(transcript_lines,paste0(transcript_line,collapse="\t"))
  }

 writeLines(transcript_lines,con=Path_GTF)

Oh, and the other issue you may run into is that the chromosome names in the STAR index contain the full name (e.g. "refseq|NC_022518|9472nt|Human"), while the viral annotation file will have e.g. NC_022518. Here are the changes I made to the code to address this if you have run into the same issue.

Viral_annotation <- read.table(Viral_annotation_file,header=TRUE,sep="\t",check.names=FALSE,comment.char="",quote="")

Identified_viral_fragments_for_subset <- sapply(strsplit(Identified_viral_fragments,"\\|"),"[[",2)
Length_segments <- as.numeric(Viral_annotation$Genome_length[match(Identified_viral_fragments_for_subset,Viral_annotation$Name_sequence)])

And then from here, go straight into creating the transcript_lines vector, running the for loop, and outputting the new GTF file.

Thanks, it seems like the author did not add the codes actually creating the pseudo GTF file and your codes could perfectly solve this issue. I just wanna ask how did files named like ""refseq|NC_022518|9472nt|Human"" are created, it did not work on my Ubuntu OS since "|" is not a permitted character in file names. I use awk commands to replace them with other permitted characters.

@ghost
Copy link

ghost commented Jan 21, 2022

And in case it may help anyone else, if you encountered problems when exporting viral SAM files with errors of invalid argument, try to change the lines of viral names in the Virusite genome file with this command : awk '{ gsub(/|/, "character allowed"); print %0}' path_to_reference_genome/genomes.fasta > new_reference_genomes.fasta
while you can choose any allowed character and separate them by it later.

@TeodoraTockovska
Copy link

TeodoraTockovska commented May 29, 2024

@heathergeiger Hi Heather. Were you able to run the 2nd script (assembly using StringTie)? It seems that the assembly isn't working for me and I am uncertain why. I have a bam file that has only 99 reads (so quite small) and StringTie is simply not working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants