-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to handle paired end reads with barcodes on both reverse and forward reads? #175
Comments
[EDIT] - This response was written under the missapprehension that you were talking about a BAM file. Only the middle paragraph is actually relevant here. Hi @J-sara - Unfortunately, UMI-tools will not be able to handle this as it expects the read ids to be identical, with both containing the UMI. In fact, as I understand it, it's conventional for read pairs to have the same read id although I can't see any reference to this in the Sam file format documents. I'm assuming from the experimental design you describe that the barcodes are on one read pair or other at random in the fastq file? If this is the case, you'll need to write a bespoke script to extract the barcode from whichever read it resides in and append it to both reads ids in the format expected by UMI-tools As far as I'm aware, there are no tools which could handle the UMI being on one read pair or the other at random in the BAM file. |
@TomSmithCGAT I didn't read @J-sara 's post as saying they were on one or the other read name, but on one or other read sequence. Could the regex pattern matching not help us here? What happens with the regex if it fails to match on a read? |
@IanSudbery - My bad. I just assumed we were talking about using e.g UMI-tools If the regex fails to match with So this leaves us with the bespoke approach described above. |
Hi @J-sara - Did you manage to write a bespoke script to extract the UMIs? If not, I can help you out if you can share some more information about how the barcodes are encoded and a test fastq |
Hi @TomSmithCGAT - I have a similar issue to @J-sara, in that the barcode I am looking for can appear in either Read 1 OR Read 2. Which one it appears in can be determined by the presence of a specific sequence immediately after the barcode. I had a couple of questions regarding the options for extract (with my guesses in brackets):
And for perhaps the easiest solution to this: Thanks in advance for your help and apart from this issue, I have found the tool very useful! (especially the regex options) |
I also have the same question but have not yet found a solution. My library design is pictured below. As with @J-sara, I have a single UMI which may be found on read 1 or read 2 depending on how long the insert size is. If it's on read 1, the 12 bp UMI is preceded by a 11 bp universal sequence (ATTGGAGTCCT). If it is on read 2, the UMI is followed by the reverse compliment of the universal sequence. Therefore, what I would like to due is 1) extract the UMI if present on only one read and assign to both reads or 2) if both UMIs are present but differ in sequence then choose the UMI with the highest quality score and assign to both reads. I have tried using the bc-pattern regex as follows: While individually, both patterns work but when used together, I get the following error:
One option I have considered is to treat each read as single ended and extract the UMIs from each. However, it would get challenging when attempting to reconcile the two ends. |
Hi all, A make shift solution I'm currently using is running the extract twice, looking for the umi in one read while making sure that it isn't in the other (and vice versa). For example, this pattern below uses negative look around '?!' to make sure the other read doesn't match the sequence, in my case GCGATCGCCCTCGAGG. --bc-pattern='(?P<umi_1>.{10,25)(?P<discard_1>(GCGATCGCCCTCGAGG){s<=1})(?!(.{5,250})(?=GCGATCGCCCTCGAGG))' --bc-pattern2='(?P<umi_2>)(?!(.{0,250})(?P<discard_2>(GCGATCGCCCTCGAGG){s<=1}))' I then flip bc-pattern and bc-pattern2 to look at the other read, after which I merge the two outputs. There might be a more elegant way to do this, but it seems to do the trick. Having an output read pairs that didn't match would still be nice if possible, since that would make things easier to troubleshoot and evaluate |
Hi @OldManTae - This is indeed probably the only way to handle this in I propose adding an option |
@davidcoffey - I'm afraid this is a bug in the most recent release. This was fixed on the master branch on 19 Oct (see here) but we haven't issued a new release yet. If you installed from github you can pull these bug fixes. If you installed via |
FYI, I also have UMI sequences only on read2. |
@etwatson You should be able to handle reads that contain UMI sequences only on read2 already. Just provide what you are currently calling read2 as the primary input to extract and what you are calling read1 to the |
hi @J-sara, @davidcoffey & @OldManTae. I've finally got around to adding the option to extract the UMI from either read (#311). I could do with a more suitable toy dataset to use in the tests. Would any of you be able to share a small input data set (e.g 10,000 reads) with me so I can check the behaviour is as expected. |
It is now possible to extract a UMI from either read ( These options are available now from the master branch and will be made available in the next release |
P.s Still looking for some better test input though if anyone would like to share... |
Hello,
I have a paired end fastq file and my experiment is designed in a way that each PAIRED READ has ONE barcode in it. This barcode might be on the forward read or on the reverse read (not both of them) and the barcode has a specific sequence before and after it that helps to identify the barcode. So, some of the forward reads have the barcode and some of them do not. This is also true for reverse reads.
My question is whether UMI tools is able to handle this type of experiment design or not and if so, how I can do it?
Best
The text was updated successfully, but these errors were encountered: