Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to handle paired end reads with barcodes on both reverse and forward reads? #175

Closed
J-sara opened this issue Aug 24, 2017 · 14 comments
Closed

Comments

@J-sara
Copy link

J-sara commented Aug 24, 2017

Hello,

I have a paired end fastq file and my experiment is designed in a way that each PAIRED READ has ONE barcode in it. This barcode might be on the forward read or on the reverse read (not both of them) and the barcode has a specific sequence before and after it that helps to identify the barcode. So, some of the forward reads have the barcode and some of them do not. This is also true for reverse reads.

My question is whether UMI tools is able to handle this type of experiment design or not and if so, how I can do it?

Best

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Aug 24, 2017

[EDIT] - This response was written under the missapprehension that you were talking about a BAM file. Only the middle paragraph is actually relevant here.

Hi @J-sara - Unfortunately, UMI-tools will not be able to handle this as it expects the read ids to be identical, with both containing the UMI. In fact, as I understand it, it's conventional for read pairs to have the same read id although I can't see any reference to this in the Sam file format documents.

I'm assuming from the experimental design you describe that the barcodes are on one read pair or other at random in the fastq file? If this is the case, you'll need to write a bespoke script to extract the barcode from whichever read it resides in and append it to both reads ids in the format expected by UMI-tools dedup, e.g separated from the read_id by a "_".

As far as I'm aware, there are no tools which could handle the UMI being on one read pair or the other at random in the BAM file.

@IanSudbery
Copy link
Member

@TomSmithCGAT I didn't read @J-sara 's post as saying they were on one or the other read name, but on one or other read sequence.

Could the regex pattern matching not help us here? What happens with the regex if it fails to match on a read?

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Aug 24, 2017

@IanSudbery - My bad. I just assumed we were talking about using e.g UMI-tools dedup on a BAM file. You're right, the issue explicitly says we're talking about a fastq

If the regex fails to match with extract, the read pair will be discarded. This is intentional behaviour. E.g for inDrop where there is an adapter sequence one wants to match.

So this leaves us with the bespoke approach described above.

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Oct 13, 2017

Hi @J-sara - Did you manage to write a bespoke script to extract the UMIs? If not, I can help you out if you can share some more information about how the barcodes are encoded and a test fastq

@OldManTae
Copy link

Hi @TomSmithCGAT - I have a similar issue to @J-sara, in that the barcode I am looking for can appear in either Read 1 OR Read 2. Which one it appears in can be determined by the presence of a specific sequence immediately after the barcode.

I had a couple of questions regarding the options for extract (with my guesses in brackets):

  1. does bc-pattern in paired end reads search specifically in read 1? (Yes)
  2. similar to above, does bc-pattern2 search specifically in read 2? (Yes)
  3. Is it possible to have extract do an OR rather than AND when using both options?

And for perhaps the easiest solution to this:
4. Is it possible to save reads that were discarded (so that it is possible to do a sequential extraction, where I apply one filter after another without double counting issues).

Thanks in advance for your help and apart from this issue, I have found the tool very useful! (especially the regex options)

@davidcoffey
Copy link

I also have the same question but have not yet found a solution. My library design is pictured below. As with @J-sara, I have a single UMI which may be found on read 1 or read 2 depending on how long the insert size is. If it's on read 1, the 12 bp UMI is preceded by a 11 bp universal sequence (ATTGGAGTCCT). If it is on read 2, the UMI is followed by the reverse compliment of the universal sequence. Therefore, what I would like to due is 1) extract the UMI if present on only one read and assign to both reads or 2) if both UMIs are present but differ in sequence then choose the UMI with the highest quality score and assign to both reads.

I have tried using the bc-pattern regex as follows:
bc-pattern='(?P<umi_1>.{12})(?P<discard_1>ATTGGAGTCCT{e<=1})'
bc-pattern2='.*(?P<discard_2>AGGACTCCAAT{e<=1})(?P<umi_2>.{12})(?P<discard_3>.*)'

While individually, both patterns work but when used together, I get the following error:

Traceback (most recent call last):
  File ".../extract.py", line 220, in main
    options.pattern2 = regex.compile(options.barcode_regex2)
AttributeError: 'Values' object has no attribute 'barcode_regex2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File ".../umi_tools", line 11, in <module>
    load_entry_point('umi-tools==0.5.0', 'console_scripts', 'umi_tools')()
  File ".../umi_tools.py", line 50, in main
    module.main(sys.argv)
  File ".../extract.py", line 221, in main
    except regex.Error:
AttributeError: module 'regex' has no attribute 'Error'

One option I have considered is to treat each read as single ended and extract the UMIs from each. However, it would get challenging when attempting to reconcile the two ends.

library

@OldManTae
Copy link

Hi all,

A make shift solution I'm currently using is running the extract twice, looking for the umi in one read while making sure that it isn't in the other (and vice versa).

For example, this pattern below uses negative look around '?!' to make sure the other read doesn't match the sequence, in my case GCGATCGCCCTCGAGG.

--bc-pattern='(?P<umi_1>.{10,25)(?P<discard_1>(GCGATCGCCCTCGAGG){s<=1})(?!(.{5,250})(?=GCGATCGCCCTCGAGG))' --bc-pattern2='(?P<umi_2>)(?!(.{0,250})(?P<discard_2>(GCGATCGCCCTCGAGG){s<=1}))'

I then flip bc-pattern and bc-pattern2 to look at the other read, after which I merge the two outputs. There might be a more elegant way to do this, but it seems to do the trick.

Having an output read pairs that didn't match would still be nice if possible, since that would make things easier to troubleshoot and evaluate

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Nov 30, 2017

Hi @OldManTae - This is indeed probably the only way to handle this in umi_tools currently, though this is obviously less than ideal since it requires running extract twice and merging the outputs. Since there's at least 3 users who want to extract from either read1 or read2, I think we should look to add this functionality to umi_tools.

I propose adding an option --read1-or-read2. This option will demand the user also supplies two regexes with --bc-pattern and --bc-pattern2. If either regex matches, the UMI will be extracted and the read pair written out. If both match, the read pair will be discarded. We can log the number of times the extraction is done from read1 or read2 and the number of discarded pairs.

@TomSmithCGAT
Copy link
Member

@davidcoffey - I'm afraid this is a bug in the most recent release. This was fixed on the master branch on 19 Oct (see here) but we haven't issued a new release yet. If you installed from github you can pull these bug fixes. If you installed via pip/conda we should have a new release out shortly.

@etwatson
Copy link

FYI, I also have UMI sequences only on read2.

@IanSudbery
Copy link
Member

@etwatson You should be able to handle reads that contain UMI sequences only on read2 already. Just provide what you are currently calling read2 as the primary input to extract and what you are calling read1 to the --read2-in.

@TomSmithCGAT
Copy link
Member

hi @J-sara, @davidcoffey & @OldManTae. I've finally got around to adding the option to extract the UMI from either read (#311). I could do with a more suitable toy dataset to use in the tests. Would any of you be able to share a small input data set (e.g 10,000 reads) with me so I can check the behaviour is as expected.

@TomSmithCGAT
Copy link
Member

It is now possible to extract a UMI from either read (--either-read). Where both reads match the regexes provided, the default behaviour is to discard the read pair. To keep the read pairs and use the UMI with the highest sequence quality, use --either-read-resolve=quality.

These options are available now from the master branch and will be made available in the next release

@TomSmithCGAT
Copy link
Member

P.s Still looking for some better test input though if anyone would like to share...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants