How to handle paired end reads with barcodes on both reverse and forward reads? #175

J-sara · 2017-08-24T14:17:32Z

Hello,

I have a paired end fastq file and my experiment is designed in a way that each PAIRED READ has ONE barcode in it. This barcode might be on the forward read or on the reverse read (not both of them) and the barcode has a specific sequence before and after it that helps to identify the barcode. So, some of the forward reads have the barcode and some of them do not. This is also true for reverse reads.

My question is whether UMI tools is able to handle this type of experiment design or not and if so, how I can do it?

Best

TomSmithCGAT · 2017-08-24T17:18:31Z

[EDIT] - This response was written under the missapprehension that you were talking about a BAM file. Only the middle paragraph is actually relevant here.

Hi @J-sara - Unfortunately, UMI-tools will not be able to handle this as it expects the read ids to be identical, with both containing the UMI. In fact, as I understand it, it's conventional for read pairs to have the same read id although I can't see any reference to this in the Sam file format documents.

I'm assuming from the experimental design you describe that the barcodes are on one read pair or other at random in the fastq file? If this is the case, you'll need to write a bespoke script to extract the barcode from whichever read it resides in and append it to both reads ids in the format expected by UMI-tools dedup, e.g separated from the read_id by a "_".

As far as I'm aware, there are no tools which could handle the UMI being on one read pair or the other at random in the BAM file.

IanSudbery · 2017-08-24T17:21:35Z

@TomSmithCGAT I didn't read @J-sara 's post as saying they were on one or the other read name, but on one or other read sequence.

Could the regex pattern matching not help us here? What happens with the regex if it fails to match on a read?

TomSmithCGAT · 2017-08-24T17:27:29Z

@IanSudbery - My bad. I just assumed we were talking about using e.g UMI-tools dedup on a BAM file. You're right, the issue explicitly says we're talking about a fastq

If the regex fails to match with extract, the read pair will be discarded. This is intentional behaviour. E.g for inDrop where there is an adapter sequence one wants to match.

So this leaves us with the bespoke approach described above.

TomSmithCGAT · 2017-10-13T12:34:56Z

Hi @J-sara - Did you manage to write a bespoke script to extract the UMIs? If not, I can help you out if you can share some more information about how the barcodes are encoded and a test fastq

OldManTae · 2017-11-22T18:53:54Z

Hi @TomSmithCGAT - I have a similar issue to @J-sara, in that the barcode I am looking for can appear in either Read 1 OR Read 2. Which one it appears in can be determined by the presence of a specific sequence immediately after the barcode.

I had a couple of questions regarding the options for extract (with my guesses in brackets):

does bc-pattern in paired end reads search specifically in read 1? (Yes)
similar to above, does bc-pattern2 search specifically in read 2? (Yes)
Is it possible to have extract do an OR rather than AND when using both options?

And for perhaps the easiest solution to this:
4. Is it possible to save reads that were discarded (so that it is possible to do a sequential extraction, where I apply one filter after another without double counting issues).

Thanks in advance for your help and apart from this issue, I have found the tool very useful! (especially the regex options)

davidcoffey · 2017-11-25T09:42:04Z

I also have the same question but have not yet found a solution. My library design is pictured below. As with @J-sara, I have a single UMI which may be found on read 1 or read 2 depending on how long the insert size is. If it's on read 1, the 12 bp UMI is preceded by a 11 bp universal sequence (ATTGGAGTCCT). If it is on read 2, the UMI is followed by the reverse compliment of the universal sequence. Therefore, what I would like to due is 1) extract the UMI if present on only one read and assign to both reads or 2) if both UMIs are present but differ in sequence then choose the UMI with the highest quality score and assign to both reads.

I have tried using the bc-pattern regex as follows:
bc-pattern='(?P<umi_1>.{12})(?P<discard_1>ATTGGAGTCCT{e<=1})'
bc-pattern2='.*(?P<discard_2>AGGACTCCAAT{e<=1})(?P<umi_2>.{12})(?P<discard_3>.*)'

While individually, both patterns work but when used together, I get the following error:

Traceback (most recent call last):
  File ".../extract.py", line 220, in main
    options.pattern2 = regex.compile(options.barcode_regex2)
AttributeError: 'Values' object has no attribute 'barcode_regex2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File ".../umi_tools", line 11, in <module>
    load_entry_point('umi-tools==0.5.0', 'console_scripts', 'umi_tools')()
  File ".../umi_tools.py", line 50, in main
    module.main(sys.argv)
  File ".../extract.py", line 221, in main
    except regex.Error:
AttributeError: module 'regex' has no attribute 'Error'

One option I have considered is to treat each read as single ended and extract the UMIs from each. However, it would get challenging when attempting to reconcile the two ends.

OldManTae · 2017-11-29T22:06:27Z

Hi all,

A make shift solution I'm currently using is running the extract twice, looking for the umi in one read while making sure that it isn't in the other (and vice versa).

For example, this pattern below uses negative look around '?!' to make sure the other read doesn't match the sequence, in my case GCGATCGCCCTCGAGG.

--bc-pattern='(?P<umi_1>.{10,25)(?P<discard_1>(GCGATCGCCCTCGAGG){s<=1})(?!(.{5,250})(?=GCGATCGCCCTCGAGG))' --bc-pattern2='(?P<umi_2>)(?!(.{0,250})(?P<discard_2>(GCGATCGCCCTCGAGG){s<=1}))'

I then flip bc-pattern and bc-pattern2 to look at the other read, after which I merge the two outputs. There might be a more elegant way to do this, but it seems to do the trick.

Having an output read pairs that didn't match would still be nice if possible, since that would make things easier to troubleshoot and evaluate

TomSmithCGAT · 2017-11-30T08:32:49Z

Hi @OldManTae - This is indeed probably the only way to handle this in umi_tools currently, though this is obviously less than ideal since it requires running extract twice and merging the outputs. Since there's at least 3 users who want to extract from either read1 or read2, I think we should look to add this functionality to umi_tools.

I propose adding an option --read1-or-read2. This option will demand the user also supplies two regexes with --bc-pattern and --bc-pattern2. If either regex matches, the UMI will be extracted and the read pair written out. If both match, the read pair will be discarded. We can log the number of times the extraction is done from read1 or read2 and the number of discarded pairs.

TomSmithCGAT · 2017-11-30T08:42:16Z

@davidcoffey - I'm afraid this is a bug in the most recent release. This was fixed on the master branch on 19 Oct (see here) but we haven't issued a new release yet. If you installed from github you can pull these bug fixes. If you installed via pip/conda we should have a new release out shortly.

etwatson · 2017-12-13T01:22:58Z

FYI, I also have UMI sequences only on read2.

IanSudbery · 2017-12-14T10:26:13Z

@etwatson You should be able to handle reads that contain UMI sequences only on read2 already. Just provide what you are currently calling read2 as the primary input to extract and what you are calling read1 to the --read2-in.

TomSmithCGAT · 2019-01-25T17:26:18Z

hi @J-sara, @davidcoffey & @OldManTae. I've finally got around to adding the option to extract the UMI from either read (#311). I could do with a more suitable toy dataset to use in the tests. Would any of you be able to share a small input data set (e.g 10,000 reads) with me so I can check the behaviour is as expected.

TomSmithCGAT · 2019-02-01T10:33:08Z

It is now possible to extract a UMI from either read (--either-read). Where both reads match the regexes provided, the default behaviour is to discard the read pair. To keep the read pairs and use the UMI with the highest sequence quality, use --either-read-resolve=quality.

These options are available now from the master branch and will be made available in the next release

TomSmithCGAT · 2019-02-01T10:33:33Z

P.s Still looking for some better test input though if anyone would like to share...

TomSmithCGAT added the question label Oct 13, 2017

TomSmithCGAT added the Next release label Nov 30, 2017

TomSmithCGAT mentioned this issue Jan 25, 2019

{ts} barcode extract either read #311

Merged

TomSmithCGAT closed this as completed Feb 1, 2019

TomSmithCGAT mentioned this issue Sep 14, 2020

Options to reverse-complement read 2 UMI #427

Closed

dlaehnemann mentioned this issue Jun 24, 2021

extend TrimPrimers to work amplicons with only a single primer fulcrumgenomics/fgbio#679

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to handle paired end reads with barcodes on both reverse and forward reads? #175

How to handle paired end reads with barcodes on both reverse and forward reads? #175

J-sara commented Aug 24, 2017

TomSmithCGAT commented Aug 24, 2017 •

edited

Loading

IanSudbery commented Aug 24, 2017

TomSmithCGAT commented Aug 24, 2017 •

edited

Loading

TomSmithCGAT commented Oct 13, 2017 •

edited

Loading

OldManTae commented Nov 22, 2017

davidcoffey commented Nov 25, 2017

OldManTae commented Nov 29, 2017

TomSmithCGAT commented Nov 30, 2017 •

edited

Loading

TomSmithCGAT commented Nov 30, 2017

etwatson commented Dec 13, 2017

IanSudbery commented Dec 14, 2017

TomSmithCGAT commented Jan 25, 2019

TomSmithCGAT commented Feb 1, 2019

TomSmithCGAT commented Feb 1, 2019

How to handle paired end reads with barcodes on both reverse and forward reads? #175

How to handle paired end reads with barcodes on both reverse and forward reads? #175

Comments

J-sara commented Aug 24, 2017

TomSmithCGAT commented Aug 24, 2017 • edited Loading

IanSudbery commented Aug 24, 2017

TomSmithCGAT commented Aug 24, 2017 • edited Loading

TomSmithCGAT commented Oct 13, 2017 • edited Loading

OldManTae commented Nov 22, 2017

davidcoffey commented Nov 25, 2017

OldManTae commented Nov 29, 2017

TomSmithCGAT commented Nov 30, 2017 • edited Loading

TomSmithCGAT commented Nov 30, 2017

etwatson commented Dec 13, 2017

IanSudbery commented Dec 14, 2017

TomSmithCGAT commented Jan 25, 2019

TomSmithCGAT commented Feb 1, 2019

TomSmithCGAT commented Feb 1, 2019

TomSmithCGAT commented Aug 24, 2017 •

edited

Loading

TomSmithCGAT commented Aug 24, 2017 •

edited

Loading

TomSmithCGAT commented Oct 13, 2017 •

edited

Loading

TomSmithCGAT commented Nov 30, 2017 •

edited

Loading