This issue was moved to a discussion.

N bases #1799

Closed
bshim181 opened this issue Sep 23, 2024 · 8 comments

Comments

@bshim181

I have noticed that in the report files, if there is an N base in a gene feature sequence, it is translated to amino acid X.
Is there a way to handle these N bases, for example by replacing them based on the reference?
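
The N-to-X behaviour described above can be reproduced with a short Python sketch. It is an illustration only, not MiXCR's translation code: it assumes the standard genetic code and the simplest possible rule, namely that any codon containing a base outside A/C/G/T is reported as X. The function name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq: str) -> str:
    """Translate in frame 0; codons with an ambiguous base become 'X'."""
    nt_seq = nt_seq.upper()
    usable = len(nt_seq) - len(nt_seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(nt_seq[i:i + 3], "X")  # e.g. "GCN" is not in the table
        for i in range(0, usable, 3)
    )


print(translate("ATGGCTTGG"))  # unambiguous input
print(translate("ATGGCNTGG"))  # one N in the second codon
```

A single N therefore hides one amino acid in this simplified model, even where every resolution of the codon would encode the same residue.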

Screenshot 2024-09-23 at 9 17 38 AM
@mizraelson
Member

Hi, what command do you run to analyze the data?

@bshim181
Author

bshim181 commented Sep 24, 2024

I used the rnaseq-full-length preset of analyze with MiXCR version 4.3.2, I believe. Is it possible that updating to a newer version of MiXCR would solve the issue?

Also, if updating to a newer MiXCR version is hard to do (we use predefined workflows), is there a way to modify the parameters to handle this?

@bshim181
Author

From looking at the alignment files, it seems that alignment gaps lead to these ambiguous N bases.

Screenshot 2024-09-26 at 9 48 53 PM

@mizraelson
Member

mizraelson commented Sep 27, 2024

Not exactly. In the example above, there is no ambiguity, but rather a single nucleotide deletion in FR3, which will shift the reading frame, rendering the clone non-productive.
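
The frameshift effect of a single-nucleotide deletion can be illustrated with a minimal Python sketch. The sequence below is made up for the example and is not taken from the screenshots in this thread; `codons` is a hypothetical helper.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive triplets (last one may be partial)."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


reference = "ATGGCTTGGAAA"                   # made-up in-frame sequence
deleted = reference[:4] + reference[5:]      # remove the single base at index 4

print(codons(reference))  # all triplets complete
print(codons(deleted))    # every triplet after the deletion is shifted
print(len(deleted) % 3)   # non-zero: the length is no longer a multiple of 3
```

Every codon downstream of the deletion changes, which is why the downstream regions are mistranslated.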

The “N” appears during the assembleContigs step, when MiXCR extends the initially assembled CDR3 clones to cover more regions of the sequence. This is where ambiguity can arise. You can discard such sequences by adding the following to the analyze command:

-MassembleContigs.parameters.discardAmbiguousNucleotideCalls=true
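
As a rough sketch of where this override sits in a full invocation, the Python snippet below assembles the command line as a list of arguments. Only the `-M` override and the preset name come from this thread; the `--species hsa` option, the input file names and the output prefix are assumptions for illustration, and `build_analyze_command` is a hypothetical helper. Check the argument order against the documentation for your MiXCR version before running anything.

```python
import shlex


def build_analyze_command(preset, species, overrides, inputs, output_prefix):
    """Assemble a `mixcr analyze` argument list with -M parameter overrides."""
    cmd = ["mixcr", "analyze", preset, "--species", species]
    cmd += [f"-M{key}={value}" for key, value in overrides.items()]
    cmd += [*inputs, output_prefix]
    return cmd


cmd = build_analyze_command(
    preset="rnaseq-full-length",   # preset named by the issue author
    species="hsa",                 # assumption: human data
    overrides={
        "assembleContigs.parameters.discardAmbiguousNucleotideCalls": "true",
    },
    inputs=["sample_R1.fastq.gz", "sample_R2.fastq.gz"],  # placeholder files
    output_prefix="sample_output",                         # placeholder prefix
)
print(shlex.join(cmd))  # printable form; pass `cmd` to subprocess.run to execute
```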

@bshim181
Author

bshim181 commented Oct 1, 2024

Regarding the image I sent above: I looked at an example where an N base appeared in the sequence.
Screenshot 2024-10-01 at 11 44 12 AM
This is the sequence I looked at; I believe it is in the FR3 region, with two Ns in the sequence.

I see two different pools of reads. Out of a total of 21 reads that map to this clone, about half have this variation.

At these two positions with N, I am seeing a deletion at the first N position and a mismatch at the second N position (reference=G, query=C).
image

For the other half of the reads, I am seeing a mismatch at the first N position and a match to the reference at the second position.
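
The two read pools can be summarised by tallying the pair of calls each read makes at the two N positions. The sketch below uses invented per-read calls: the 10/11 split of the 21 reads and the mismatched base "T" at the first position are assumptions, since the thread only says "about half" and does not name that base. "-" marks a deletion.

```python
from collections import Counter

# One (first_position, second_position) tuple per read; values are hypothetical.
reads = [("-", "C")] * 10 + [("T", "G")] * 11

haplotypes = Counter(reads)
total = sum(haplotypes.values())
shares = {hap: count / total for hap, count in haplotypes.items()}

for hap, count in haplotypes.most_common():
    print(hap, count, f"{shares[hap]:.1%}")
```

Counting the two positions jointly, rather than one at a time, shows that the reads support two linked variants instead of four independent combinations.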

Screenshot 2024-10-01 at 11 46 14 AM

Rather than replacing these bases with N, is it possible to output all possible sequences with their variants? We are also interested in mutations within VDJ sequences, and this read evidence might point toward potentially biologically relevant targets.

@mizraelson
Member

Did you try using:
-MassembleContigs.parameters.discardAmbiguousNucleotideCalls=true ? Do you still see Ns in the sequences?

Regarding the first case: the deletion of an A nucleotide in FR3 will cause a frameshift in the translation of CDR3, FR4 and the C gene, so this clone will not be functional.

@bshim181
Author

bshim181 commented Oct 2, 2024

I have tried using -MassembleContigs.parameters.discardAmbiguousNucleotideCalls=true, and it does discard the ambiguous nucleotides and replaces them with the reference sequence.

The possibility I am considering is that the variants captured in these reads are mutations rather than sequencing errors, so I was wondering if there is a way to output all possible variants at those N positions (rather than having them replaced with an ambiguous base).

@mizraelson
Member

I see. Generally speaking, there is an algorithm behind assembleContigs that splits a clone if there is enough data to support both variants, which is the output you’re looking for. This algorithm considers the shares of each variant, the Phred quality of the nucleotides, their location on the read, and the surrounding context (for example, if you have NN, there might be multiple possible resolutions) among other things. In some cases, there isn’t enough data to determine if the clone should be split, and MiXCR will then place an N. Several parameters guide this process, but the main ones are:

-MassembleContigs.parameters.branchingMinimalQualityShare=0.1
-MassembleContigs.parameters.branchingMinimalSumQuality=60
-MassembleContigs.parameters.outputMinimalQualityShare=0.75

These are the default values for MiXCR v4.7 with the rna-seq preset. You can find explanations for all parameters on our website. I recommend trying the latest version first and adjusting the parameters if needed (generally, the lower the thresholds, the more likely MiXCR will split a clone into two).

That said, based on our experience, the default parameters work best, as they have been empirically evaluated on hundreds of different datasets.
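
To make the "split, call, or N" idea concrete, here is a deliberately simplified Python toy. It is not MiXCR's algorithm: the real one also weighs read position and sequence context, and the exact meaning of each parameter is defined in the MiXCR documentation. The toy merely assumes that a variant is "supported" when its share of summed Phred quality and its absolute summed quality both reach the branching thresholds, and that a single base is output when its quality share reaches the output threshold. The function `toy_call` and its decision rule are hypothetical; only the three default values come from the comment above.

```python
from collections import defaultdict


def toy_call(observations,
             branching_min_quality_share=0.1,
             branching_min_sum_quality=60,
             output_min_quality_share=0.75):
    """Toy decision for one position from (base, phred_quality) pairs.

    Returns ("split", bases), ("call", [base]) or ("ambiguous", ["N"]).
    Assumed rule for illustration only, not MiXCR's implementation.
    """
    quality_sum = defaultdict(int)
    for base, quality in observations:
        quality_sum[base] += quality
    total = sum(quality_sum.values())
    if total == 0:
        return ("ambiguous", ["N"])

    # Variants with enough relative and absolute quality support.
    supported = sorted(
        base for base, s in quality_sum.items()
        if s / total >= branching_min_quality_share
        and s >= branching_min_sum_quality
    )
    if len(supported) >= 2:
        return ("split", supported)

    top_base, top_sum = max(quality_sum.items(), key=lambda kv: kv[1])
    if top_sum / total >= output_min_quality_share:
        return ("call", [top_base])
    return ("ambiguous", ["N"])


print(toy_call([("G", 30)] * 10 + [("C", 30)] * 11))  # two well-supported variants
print(toy_call([("G", 30)] * 9 + [("C", 30)]))        # one dominant variant
print(toy_call([("G", 20)] * 2 + [("C", 20)]))        # too little evidence
```

In this toy, lowering the two branching thresholds makes more positions qualify for a split, which matches the general direction described above.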

@milaboratory milaboratory locked and limited conversation to collaborators Oct 2, 2024
@mizraelson mizraelson converted this issue into discussion #1813 Oct 2, 2024
