strong bias of found ORF classes between strands #54
Hi Tonu,

Given the counts of ORF types, there could be several things going on. I'm not sure which sequencing protocol you are using, but it could be that the relevant adapter sequences are not present in the file given to flexbar. I believe the one included with the distribution contains the standard TruSeq and ArtSeq adapters; if you are using something else, that could be a problem.

Something else (which you may have already checked) is the number of uniquely-mapping aligned reads in the … Likewise, while it isn't amazingly helpful for riboseq, you can also try FastQC and just see if anything comes up. I would run it on the fastq file after removing ribosomal RNA: …

It could also be that the semantics of start/end are different on the reverse strand. To check that, you can load the annotated bed file (the …):

> reverse strand end: …
> reverse strand start: …

When this "off by one" kind of problem has turned up before, we usually ended up with many "within" and "suspect" ORFs predicted as translated, but not so many "canonical_truncated" ones. The "truncated" ORFs are in-frame, but there is a better signal starting from a downstream start than from the canonical one.

Here, you could open the uniquely-mapping read bam file in IGV and just take a look at the signal. Unfortunately, we do not currently create a bam file which removes the non-periodic reads and adjusts for the P-site offsets (we just do that directly when creating the ORF profiles), so there is not an easy way to view that. However, there could still be something here.

Depending on how that looks, you could also look into the "unfiltered" predictions bed file. In ipython, you can look at the counts of the different ORF types with something like:

```python
import misc.bio_utils.bed_utils as bed_utils

unfiltered_predictions = bed_utils.read_bed("/path/to/my/unfiltered-predictions.bed.gz")
orf_type_groups = unfiltered_predictions.groupby(['orf_type', 'strand'])
orf_type_groups.size()
```

These are the predictions before the filtering described in the "Final prediction set" paragraph (the last one in the methods part) of the paper. If there are many more canonical ORFs on the negative strand here than in the final set, then it means the periodic signal is stronger in the truncated (or "within") ORFs than in the canonical ones, so those are selected instead.

There are other things to look at as well, but I think this covers most of the easy and not-so-difficult things. We have run data from a number of different species using several different protocols, and we haven't encountered any protocols with a noticeable strand bias; if that were the case, though, it could still be handled within the current framework. It would require some code to handle alignments from the different strands separately. In essence, though, alignments from different strands could be treated as replicates (with independent periodic read lengths, P-site offsets, etc.).

Please let me know how this goes, or if there is any trouble checking anything. Hopefully something here gives some clues.

Have a good day,
Hi Brandon,
Yes, it is. We use the Ingolia (2012) protocol, and this adapter is included.
Yes, the number of reads is almost identical - 37 million on both strands.
When I checked a BAM file, bed file and annotation with IGV:
Is it related to the low default value of …? Cheers,
Hi Tonu,
Hmm... that is interesting; we've not come across any sort of bias like this in any of the datasets we've analyzed. For the first transcript (YAL...1C), it definitely looks like the full canonical transcript should be predicted as translated. It seems that the canonical ORF is not even showing up in the unfiltered predictions. This means either that the signal at the beginning is really not periodic, or that the ORF is not extracted correctly. Can you find the canonical ORF in the Bayes factor file (…)?
There, you can see the posterior estimates of the Bayes factor (and a bunch of other stuff). If the BF is below 5, then it means the reads at the beginning are not periodic. If the canonical ORF is not in that file, then there is a problem with extracting it.
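The lookup described above can be sketched as a short helper. This is a hypothetical illustration, not rpbp code: the column names (`id`, `bayes_factor_mean`) are assumptions based on typical rpbp output, so check the header of your own file before using it.

```python
import pandas as pd

def check_orf_bf(bf_file, orf_id, cutoff=5):
    """Classify an ORF based on the Bayes factor file (illustrative sketch)."""
    bf = pd.read_csv(bf_file, sep="\t")
    match = bf[bf["id"] == orf_id]
    if len(match) == 0:
        return "missing"        # the ORF was never extracted: extraction problem
    if match["bayes_factor_mean"].iloc[0] < cutoff:
        return "not periodic"   # reads at the beginning are not periodic
    return "periodic"
```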
It could be. This gave us good results for our data, but you could increase it if you have deeper coverage. Please let me know about the canonical ORF; either way, it would be good to know what's going on. Thanks. Have a good day,
Hi Brandon,

The full-length canonical orf_type is in the bed file, so ORF extraction works fine, but …

Are the columns x_1_sum .. x_3_sum indicating the number of reads assigned to position I/II/III of the codon at the P-site? If so, then there is a periodicity shift from the third position to the first position of the codon. The orange line indicates the ORF that finally ends up in the filtered bed file.

Tonu

PS. The same pattern holds for ~20 manually checked genes.
Hi Tonu,
Yes, that is correct. Because there are more reads in x_3 than x_1, we do not calculate the Bayes factor for those ORFs (so it is just set to -inf). In the paper, we describe this in the "Filtering unlikely ORFs" section.
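The filtering rule described above can be sketched in a few lines. This is a minimal illustration of the logic, with illustrative function names, not rpbp's internal code:

```python
import math

def orf_passes_frame_filter(x_1_sum, x_2_sum, x_3_sum):
    # "translation" is assumed to mean most reads fall in frame 1
    return x_1_sum > x_2_sum and x_1_sum > x_3_sum

def bayes_factor_or_neg_inf(bf_estimate, x_1_sum, x_2_sum, x_3_sum):
    # ORFs failing the frame filter never have a BF estimated; they get -inf
    if not orf_passes_frame_filter(x_1_sum, x_2_sum, x_3_sum):
        return -math.inf
    return bf_estimate
```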
That does seem like what's going on. In that sense, Rp-Bp is behaving as expected. Our models assume "translation" means periodicity exclusively in x_1 and not much in the other frames. So if there are a lot of reads in the other frames, our assumptions kick in, and we do not predict the ORF as translated. That said, it is really suspicious that this out-of-frame bias occurs only near start codons on the reverse strand. I am also a bit unsure why the in-frame count (x_1_sum) increases for the truncated versions.

One thing you could try is to look at the read length distribution of the ORF profiles. This updated doc describes how to create these. The …

I seem to recall that in a few of the riboseq protocols, there is an extra base that needs to be trimmed in addition to the adapter sequences. I'm not sure if you're already doing this; if not, you can try setting that flexbar option (…).

To sum up, the frame counts do explain why the … Have a good day,
Hi Tonu, I just wanted to double-check if you had come across anything that may explain the strand-specific bias. If there is anything in particular that would be helpful to track down what's going on, please let me know. I'm very slowly documenting the analysis scripts, so maybe there is something already that I could add. If not, it may be possible to quickly add some new sort of analysis script. Have a good day,
Hi Brandon,

Yes, I had some ideas to test. I then ran a similar analysis using human Chr1-5 and found that the main difference is in the numbers of "canonical" and "within" ORFs. The numbers for human Chr1-5 are: forward 937/381/14 and reverse 155/434/266 for canonical/canonical_truncated/within. I did not look at the other types of ORFs in human, but there should be more in the 5'/3' UTRs too. The bias is here, but not as strong as in yeast.

Finally, I thought to test whether the program treats the strands differently; i.e., I want to exclude that possibility. One way to do that is to revert the genome and annotation: make the genome reverse complement and do the same with the annotation so it matches the sequence. If this bias is caused by the genes, it should swap strands; if it is related to how the program treats the strands, then the bias should stay on the reverse strand. I used UGene to revert the genome and annotation. It can do that automatically, but the problem is that it can't handle the GenBank format properly. I had a hard time getting from gb … didn't work. Or is there some simple way to convert …?

Cheers,
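The "flip the genome" experiment described above can be sketched from scratch: reverse complement the chromosome sequence and remap each (1-based, inclusive) annotation interval onto the flipped sequence. This is only an illustration of the coordinate arithmetic, not UGene or genometools code:

```python
# Complement table for DNA; sequence is reversed after translation.
COMP = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def flip_interval(start, end, strand, chrom_len):
    # On the flipped chromosome, old position p becomes chrom_len - p + 1,
    # so start/end swap roles and every feature changes strand.
    new_start = chrom_len - end + 1
    new_end = chrom_len - start + 1
    new_strand = "-" if strand == "+" else "+"
    return new_start, new_end, new_strand
```

If the bias is in the data, the flipped run should show it on the opposite strand; if it is in the code, it should stay on the reverse strand.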
Hi Tonu, Answering a bit in reverse...
I am not aware of a better way to do this; however, for ensembl, you can directly download the gtf files, here for yeast, for example. I also had a look at the gff3 annotations, and there are some differences which will affect parsing, etc.; in particular, the attribute names for … However, the example we looked at earlier (YAL021C) is not a multi-exon transcript, so it is not clear that gff vs. gtf would make a difference there. Nevertheless, could you please give it a shot using the gtf file and let me know how that works? I'm not expecting it to "fix" things, but there could be some interaction I'm not thinking of.
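To make the gff3-vs-gtf parsing difference above concrete: the ninth (attribute) column is encoded differently in the two formats, so a parser written for one silently misreads the other. A minimal sketch of each, for illustration only:

```python
def parse_gtf_attributes(field):
    # GTF style: key "value"; pairs separated by semicolons
    attrs = {}
    for part in field.strip().strip(";").split(";"):
        key, _, value = part.strip().partition(" ")
        attrs[key] = value.strip('"')
    return attrs

def parse_gff3_attributes(field):
    # GFF3 style: key=value pairs separated by semicolons
    return dict(p.split("=", 1) for p in field.strip().strip(";").split(";"))
```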
You would still need to use the …
When creating the profiles, the ORFs on different strands are handled somewhat differently. Basically, the reverse strand ORF profiles are created in the 5' to 3' orientation on the genome, then flipped at the end so they are in the orientation of the ORF. However, that doesn't really explain what is going on here, where there seems to be a frameshift or some such.
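The strand handling described above can be illustrated in a couple of lines. This is a toy sketch, not rpbp's implementation: build the profile in genome orientation, flip it for reverse-strand ORFs so position 0 is the ORF's own start codon, then sum reads by reading frame:

```python
def orient_profile(genome_profile, strand):
    # reverse-strand profiles are flipped into the ORF's own orientation
    if strand == "-":
        return genome_profile[::-1]
    return list(genome_profile)

def frame_counts(profile):
    # sums[0] corresponds to x_1 (first base of each codon), etc.
    sums = [0, 0, 0]
    for i, count in enumerate(profile):
        sums[i % 3] += count
    return sums
```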
Hmm... this is strange. We've also run a large number of human samples, and we never find any bias. For example, for the HEK-gao dataset we use in the paper, the breakdown is 4804/574/62 on the forward strand and 4654/577/70 on the reverse. (This might be slightly different than what we report in the paper due to updates to the pipeline.) Which annotations were you using here?
I will also see if I can reproduce what is going on with the public datasets. Just to double-check, are you using these samples, with this genome sequence? I will directly use the gtf file linked above. Have a good day,
Hi Tonu, Could you please also let me know how you are creating the ribosomal index (i.e., which sequences)? I just want to have a setup that is as close as possible to what you have. Have a good day,
Hi Brandon,
Yes, I used the Saccharomyces_cerevisiae.R64-1-1.85 genome, ncRNA, and annotation from Ensembl. From the published data (Green's lab Cell paper, 2015), SRR2046310 is better to use than SRR2046309. Both are WT controls, but the first one has less noncoding (12%) than the latter (60%). The links above are correct. For human, I used the first five chromosomes (pooled together) of the GRCh38 Ensembl dna; Chr1 annotation in GTF: Homo_sapiens.GRCh38.87.chr.gtf. Cheers,
I am trying to run Gao's data SRR1630831 on my mac against the human genome (GRCh38 - chromosomes only). Let's see whether I get similar results.
Hi Brandon, About the GFF3-to-GTF conversion question: it is not a problem generally. As you mentioned previously, you can get the gtf from the Ensembl repository, and these files work fine with rpbp. That is what I am using. My question came up because of the unusual task of reverting a standard genome's strands together with the annotation - people normally don't do that. The solutions I found were not able to write proper GTF output. It would probably take less time to write a small script than to keep searching and testing. I think I don't need that anymore. Hopefully, testing the same datasets on different platforms will guide us close to the reasons for the observed bias. Cheers,
Hi Tonu,
Ah, okay, I see what you mean now. I don't really know a good way to do the flipping and such.
I hope so. I'm going to try the yeast data today, and I'll let you know what I find. Have a good day,
Hi Tonu,

So... good/bad news. I downloaded the yeast dataset and ran it on our cluster. I observe basically the same thing that you do: a "regular" distribution of ORF types on the forward strand, but very few …

Briefly looking at the counts of reads in each frame, it again appears like there is some sort of frame shift or something. The annotations and things in IGV also look okay (e.g., the start codon matches the "thick" part, etc.). This is the same code that we use for all of the other samples we have processed, so I don't really think it's a bug related to handling the strands. Also, the "out of frame" stuff seems to be happening only at the start of the reverse strand ORFs, not later on (thus, all of the …).

Do you know if there is anything about the protocol that may result in the reverse strand behaving differently than the forward strand? I will try to dig a bit more and see if anything comes up. Please also let me know how things go with the HEK dataset. Thanks. Have a good day,
Hi Tonu, There are a few things that can be checked; I'm going to make a list here to not forget. Also, please feel free to add to it.
Have a good day,
Hi Brandon,
Then we can rule out platform-specific differences. That is good.
The same here. The pattern we see is the same. It can't be a real frameshift for translating ribosomes. It is possible that there might be a shift in the 5' ends of some reads in some regions with a biological meaning, but seeing it predominantly on one strand doesn't make sense. I would try to use 3' assignment and see whether it gives the same results.
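The 3' assignment idea above amounts to measuring the P-site offset from the other end of the read. A toy sketch of the difference, with 0-based genomic coordinates and illustrative (uncalibrated) offsets:

```python
def p_site(read_start, read_end, offset, from_three_prime=False, strand="+"):
    """Locate the P-site from either the 5' or the 3' end of an aligned read."""
    if strand == "+":
        return (read_end - offset) if from_three_prime else (read_start + offset)
    # on the reverse strand, the biological 5' end is the genomic end
    return (read_start + offset) if from_three_prime else (read_end - offset)
```

If a protocol artifact shifts only the 5' ends, 5'-based and 3'-based assignment would disagree, which is exactly what this comparison probes.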
We are using Ingolia's protocol (Nature Protocols, 2012), and Green's lab uses the same. It is hard to imagine some strand specificity slipping into the libraries. First, strands are arbitrary for different chromosomes, and in riboseq you extract ribosome-protected mRNA fragments. I can't imagine where something would flip in from here.
STAR indexing needs a bit more RAM to live alongside the OS than I have in my desktop [32G], so I skipped the first two chromosomes and reran it. I will let you know when it finishes successfully. Cheers,
Hi Brandon,
IGV shows mismatches as coloured lines on the reads in the BAM alignment, but I don't see any enrichment at the ends of reads in the first part of genes on the reverse strand - unless mismatches are trimmed at the ends, but as far as I know, that is not the default behaviour of STAR. Cheers,
Hi Tonu,
I agree. Maybe in a few places that could happen, but not ubiquitously throughout the genome.
Unfortunately, no. As far as I know, Issue #42 at least documents the only places where the 5' ends are actually selected, so, in principle, it shouldn't be too difficult to add this. The bigger difficulty is probably propagating a "use-three-prime-ends" flag around the pipeline scripts. Again, not really "difficult", but it would require tests/checking in several places. If you are up for it, one option would be for you to implement this feature and create a pull request to add it here. I can also try to take a look, but I don't really know when that would be (not too long, but probably not today or tomorrow).
It does seem unlikely; after discussing a little with others here working on ribosome profiling, they also doubt it is something "real".
I also had a look, and I don't see anything obvious in IGV. In some cases, most of the signal for an ORF is lost when using only the uniquely aligning reads, but there are plenty of instances where this is not the case, and the truncated version is still predicted. I will also try to have a look at the other two points in the checklist above ("ORF type metagene profiles" and "Length-specific ORF profiles"), since there are already scripts to create those. Also, looking at the CIGAR strings in IGV, there seems to be a fair amount of soft-clipping going on in the alignments. We don't currently take that into account, but it is possible we should. I imagine this would just be another option to STAR, so that could also be something to consider before getting too deep into the internals. Have a good day,
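The soft-clipping seen in the CIGAR strings can be quantified with a tiny parser. This is a standalone sketch over the CIGAR text (e.g. "3S30M2S"), not something rpbp currently does:

```python
import re

def soft_clipped_bases(cigar):
    """Return (left, right) soft-clipped base counts from a CIGAR string."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    left = int(ops[0][0]) if ops and ops[0][1] == "S" else 0
    right = int(ops[-1][0]) if ops and ops[-1][1] == "S" else 0
    return left, right
```

Running this over the reverse-strand reads near start codons would show whether clipping is enriched exactly where the frame looks shifted.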
Hi Brandon,

I used a subset (20 million reads from the fastq) of our yeast data and ran it through rpbp and through my own pipeline, let's call it RiboSeqPy. The first difference I see is the number of reads mapped once to the genome (unique). STAR reports ~10 million reads, while after HISAT alignment there are ~5.6 million reads with the tag NH:i:1. The main part of this difference comes from the prefiltering step, which removes lower-quality reads in my pipeline.

"Length specific profiles" around the start (I used RiboSeqPy with a read length range from 25 to 34; the last line is the sum) show that the profiles generated from the STAR alignment (left panel) look a bit ragged, with sudden spikes, while the HISAT profiles (right panel), after the first pileup around initiation, are relatively smooth. Another thing that looks strange is the number of reads on the plots: there is an order of magnitude fewer reads on the plots generated from the STAR alignment.

Leaving the difference in read numbers aside, whatever the other reasons might be, I would like to try to feed rpbp a custom BAM made by HISAT to see whether there are any changes in the strand-specific bias. HISAT2's default behaviour is soft-clipping, which can be turned off by …

Cheers,
Hi Tonu, I will have a look at the entire results a bit later, but...
You can play with the flexbar options in the config file to control this, but we haven't experimented too much, so I don't really know how it would affect downstream analysis.
Yes, this is possible. There are docs here. From what you described, it sounds like you could put the bam files with only unique alignments at: …
Are the hisat commands you use based on the following?
And are the STAR results based on the output of rpbp? I would like to test STAR without allowing soft-clipping to compare, but I want to make sure I have the right things first. The prefiltering, alignment, etc., steps aren't really a "proper" part of Rp-Bp, so it would really be nice to have a feel for how these choices affect the inference in the models. Have a good day,
Hi Brandon,
Yes, that's it. Option …
Yes, the alignment I referred to as the STAR alignment is produced by rpbp. Cheers,
Hi Brandon,

Filtering reads by quality plus hisat2 alignment gives an almost identical result to the rpbp/STAR bam file. rpbp and STAR give 25623/121/114 for the forward strand and 239/1851/836 for the reverse.

That doesn't explain why the metagene profiles (I posted before) are so different. One reason might be that my script didn't handle the rpbp/STAR bam file properly. I have to check.

Cheers,
Hi Tonu, I haven't heard from you in a while, and I'm just wondering if you were able to track down what was going on? Have a good day, |
Hi Brandon,

Nope. It grew over my head, and I took a break to let things settle. To see whether this strand-specific bias is solely data driven, it might still be worth trying to run it on a reverse-complemented genome/annotation. I asked whether such an option is available under genometools, and they were kind enough to write a small script for reverting the annotation. It is still a bit buggy :-)

Coming back to yeast.

Cheers,
This is related to Issue #54. Previously, the code assumed the orfs data frame was sorted in the same order as the profiles. At one point, this was true; however, it seems the access patterns and/or internal pandas data structures regarding indexes have changed (see "What's new" for 0.19.2), and it is no longer correct. Thus, ORFs on the reverse strand were not correctly flipped. The code now explicitly uses orf_num to match orfs in the data frame with their profile (which is the designed and expected behavior).
This is related to Issue #54, and it is essentially the same problem as fixed in commit dieterich-lab/rp-bp@0a9253d9aa53f4cccb187f1960cc3643d1fe87b0. Again, the order of the orf data frame was assumed to match the order of the orf profiles, and this assumption was no longer valid. The code now explicitly uses orf_num rather than the index.
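The bug pattern fixed in these commits can be sketched abstractly: row i of the ORF data frame was assumed to be row i of the profile matrix, which silently breaks when pandas reorders the frame. Looking profiles up explicitly by orf_num is order-independent. Names here are illustrative, not rpbp's actual variables:

```python
import numpy as np
import pandas as pd

def get_profile(orfs, profiles, orf_id):
    # profiles: matrix whose ROW INDEX is orf_num, not data frame order;
    # the explicit lookup stays correct even if `orfs` has been re-sorted.
    orf_num = orfs.loc[orfs["id"] == orf_id, "orf_num"].iloc[0]
    return profiles[orf_num]
```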
Hi Brandon, … Best,
Hi Tonu, Great! I'm glad to hear it's working now. I meant to write a bit earlier, but I left for vacation in the US. I'm a bit surprised if there is much to find in a de novo assembly for yeast. Does it look like unannotated "UTR"s? I'm really excited to hear what you find. Have a good day, P.S. I'm going to close the issue because I think the orf_num stuff fixes the actual problem; however, please let us know how things are looking.
Hi Brandon,

I haven't systematically checked where the group of ncRNA ORFs actually falls, but I expect most of them to be uORFs and, to a lesser degree, dORFs. Tuller et al. (2009) summarised yeast UTRs, giving coordinates for 4,409/5,127 five_prime_utr and three_prime_utr features, respectively. If I understand correctly, to get proper ORF notation (uORF, dORF) out of rpbp, the five_prime_UTRs and three_prime_UTRs have to be part of the main annotation, but rpbp doesn't read features like five_prime_UTR. So they have to be included in the exons and transcripts? I can write a small script to extend the exon and transcript coordinates accordingly, but is that sufficient? Or is there some part of misc.* and/or riboutils.* that has stricter requirements for the GTF when converting it to BED format?

A second and separate issue for me is that when I run rpbp with the whole dataset (140 million reads per replicate), the subprogram extract-orf-profiles crashes after completing 60-83% of "Finding all P-site intersections". It looks like it needs more than 32GB of RAM. This RAM consumption seems to be related to the number of reads, and maybe to using the de_novo transcript annotation.

Cheers,
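The coordinate-extension idea above can be sketched as a small helper that folds known 5'/3' UTR intervals into the transcript bounds. Coordinates are 1-based inclusive; this is a simplification for extending the outer transcript boundaries only, not a general GTF rewriter:

```python
def extend_transcript(tx_start, tx_end, utr5, utr3, strand="+"):
    """Widen (tx_start, tx_end) so the given UTR intervals fall inside it.

    utr5 / utr3 are (start, end) genomic intervals flanking the CDS; on the
    reverse strand the 5' UTR sits at the genomically higher coordinates.
    """
    if strand == "+":
        new_start = min(tx_start, utr5[0])
        new_end = max(tx_end, utr3[1])
    else:
        new_start = min(tx_start, utr3[0])
        new_end = max(tx_end, utr5[1])
    return new_start, new_end
```

The terminal exons would need the same widening so the transcript remains internally consistent.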
Hi Brandon, After adding the UTR parts to the gene/transcript/exon coordinates, u/dORFs are now recognised. That's fine. The memory problem was strange; I didn't see it later. Hard to say what caused it. Cheers,
Hi Tonu, Sorry also for the slow reply here; I was on vacation for the last couple weeks and not checking in so much. It sounds like things are worked out, but I can still clarify a few things.
rpbp uses …

When a de novo assembly is available, we can further intersect it with the annotations to separate totally novel transcripts (…).
The memory usage is a bit difficult to estimate ahead of time. Internally, we use sparse matrix representations (namely, compressed sparse row format) for the profiles; thus, as the sequencing depth increases, the efficiency of our data structures degrades somewhat (though it is still typically much better than a simple dense matrix). The relationship is roughly …

The memory usage is also affected by the number of processes used (…).

There is some non-trivial additional overhead besides that, such as the memory for the bam file, but the profiles are the main thing. Also, we have run into some problems where python actually reports more RAM usage than this because of the way python multiprocessing handles read-only shared memory (one of the biggest disadvantages of python :-/). For example, according to …

Have a good day,
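As a rough back-of-the-envelope companion to the CSR explanation above: a CSR matrix stores one data value and one column index per nonzero, plus one row pointer per row. The byte sizes below (float64 data, int32 indices) and the formula itself are my assumptions for illustration, not rpbp's actual accounting:

```python
def csr_bytes(n_nonzero, n_rows, data_bytes=8, index_bytes=4):
    """Approximate memory of a CSR matrix: data + column indices + row pointers."""
    return n_nonzero * (data_bytes + index_bytes) + (n_rows + 1) * index_bytes
```

So the footprint grows roughly linearly with the number of nonzero profile positions, which is why deeper sequencing (more covered positions) pushes the usage up.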
Hi Brandon,
I found something unexpected when plotting metagene profiles using the included IPython notebook "create-orf-type-metagene-profiles.ipynb" with my own yeast data.
Most ORFs are tagged as "canonical", "canonical_truncated", or "within". On the forward strand, the majority of ORFs are "canonical" (2,563), while "truncated/within" (121/114) form a smaller group. Surprisingly, on the reverse strand there are far fewer "canonical" ORFs (239), while "truncated" and "within" form the major classes (1,851 and 836, respectively).
orf-metaplot
On the reverse strand there is no ribosome pileup at the ends, while on the forward strand it is clearly visible.
Could it be that the coordinates become a bit shifted on the reverse strand?
What kind of test would be best to run to figure out what went wrong, if anything?
Cheers,
Tonu