Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extend TrimPrimers to work amplicons with only a single primer #679

Conversation

dlaehnemann
Copy link

We basically have QIAseq amplicons that are generated with only one gene-specific primer, see this schematic, which is page 12 from this guide:
https://www.qiagen.com/us/resources/download.aspx?id=8907edbe-a462-4883-ae1b-2759657e7fd0&lang=en

Thus, we are trying to generalize Amplicon, so that the _start and _end on the left or right end of the amplicon are always Options and at one end can be None. In the input primer file, the respective columns would just be left empty to generate the None values.

We'll also include a testcase.

@nh13 nh13 self-requested a review June 16, 2021 20:38
@nh13 nh13 self-assigned this Jun 16, 2021
@nh13
Copy link
Member

nh13 commented Jun 16, 2021

@dlaehnemann exciting! Let me take a look and give feedback.

@dlaehnemann
Copy link
Author

Feel free to have a look, but this is still very early days and the PR is mostly to have a good online view of the diffs so far.

The test cases should be coming any time soon and we have a working IDE setup for Scala that helps a lot, so we have an idea of what we probably have to change to make our use case work.

But if you e.g. have some good pointers for an intro into the basics of Scala syntax, that might be good to have.

@nh13
Copy link
Member

nh13 commented Jun 16, 2021

@dlaehnemann I'd recommend reading through some of our code (utility classes or tools) for our style (everyone has opinions!). For a more comprehensive study, I'd read Scala for the Impatient.

@nh13
Copy link
Member

nh13 commented Jun 16, 2021

@dlaehnemann can you describe the desired result of primer trimming for R1 and R2 respectively? Do we only trim primers from the 5' end of R1 (or only R2), or something else? From briefly looking at the schematic, it isn't immediately obvious what the R1 and R2 read structures look like after sequencing.

@dlaehnemann
Copy link
Author

We have a primer file with tab-separated entries like these:

chr1	156851455	1	AGGCCCCAGTATTCCGGCTAACCACT
chr7	55259380	0	GATGCAGAGCTTCTTCCCATGATGATCTG

Chromosome and position don't really matter (they should be hg19 in this case, and also note that these are not the original primers, as someone might hold rights to those...), because we realign the primers to whatever genome is used in the analysis pipeline.

Column 3 indicates whether the primer is on the forward or the reverse strand. 0 means that the primer and the read will be on the forward strand (i.e. identical to the reference sequence). Thus, from the perspective of the reference sequence, the respective read should be the left-most and will have the primer sequence first, followed on the right by the read sequence. 1 means the primer sequence depicted here is the reverse complement of the reference sequence and the primer sequence will appear at the right end of a reverse complement read (i.e. the start of that read).

In effect, the primer should always be trimmed from the 5' end (start) of the read that is the first in pair (picard "Explain SAM Flags) , or as the SAM spec puts it: the first segment in the template.

This means, that we need an Amplicon definition, where only one primer is given and thus the other end of the amplicon is undefined / ragged. We think this can be achieved by wrapping all the primer position definitions for an amplicon in Option[]s and then matching all valid cases and throwing an exception otherwise.

One caveat we have just identified: when the total amplicon length is smaller than the read length, there will also be primer sequence at the (3') end of the last segment in the template (second in pair). But I guess simply also cutting the same primer region from the second in pair that we already cut at the start of the first in pair (if it appears in the second in pair) should do the trick for that case and not do anything in the case of longer amplicons.

OK, I hope I didn't confuse any flags or read directions or strands in this...

@nh13
Copy link
Member

nh13 commented Jun 23, 2021

@dlaehnemann @FelixMoelder do you want to test out #681 to if that works for you?

@FelixMoelder
Copy link

@nh13 Thank's for that alternative implementation. We are still setting up some test cases but as soon as those are done we will also try if this works on your PR.

@dlaehnemann
Copy link
Author

BTW, we also found a good graphic about the template / fragment setup with this type of targeted panel, that is less confusing than Qiagen's own graphic...

CGATOxford/UMI-tools#175 (comment)

@nh13
Copy link
Member

nh13 commented Sep 6, 2021

Closing in favor of #681

@nh13 nh13 closed this Sep 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants