Uneven split #88

matsen · 2019-02-02T00:29:56Z

Right now I cut the CDR3 exactly in half. But here's a plot of the length of the untrimmed genes:

It looks like we'd probably do better keeping the V on one side and the J on the other if we put 40% on the V side and 60% on the J.

matsen · 2019-02-02T00:33:09Z

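library(ggplot2)

# Load the germline CDR3 amino acid sequences and compute each sequence's length.
df = read.csv('/home/matsen/Downloads/repos/vampire/vampire/data/germline-cdr3-aas.csv', stringsAsFactors=FALSE)
df$length = nchar(df$sequence)

# Histogram of sequence lengths, colored by locus.
ggplot(df, aes(length, fill=locus)) + geom_histogram()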

krdav · 2019-02-02T02:09:08Z

I disagree with that (🙄) because this is supposed to split the residues between the anchor residues, i.e. looking at a structure, half of the loop should be left-aligned and half should be right-aligned. Splitting on the V/J gene contribution to the CDR3 does not guarantee that.

matsen · 2019-02-03T11:33:21Z

So what you are saying is that you think that there is meaningful structural homology in the middle 20% of the CDR3? 🤔

krdav · 2019-02-03T19:53:28Z

I am not sure what you mean by "meaningful structural homology".

Maybe an example can make my point clearer. Here are three CDR3 sequences. A = anchor residue, V = residue from V gene, J = residue from J gene:
AVVVVVJJJA
AVVJJJJJJA
AVVVVVJA

I suggest these are split into:
AVVVV---VJJJA
AVVJJ---JJJJA
AVVV-----VVJA

You suggest they are split into:
AVVVVV---JJJA
AVV---JJJJJJA
AVVVVV-----JA

If we had the true protein structures of these and aligned them all to the anchor residues, I argue that the per-position distance, in 3D space, would be smaller for my alignment. To get the "per-position distance", walk along the alignment; at position X, take the residues and map them back onto their protein structures, then make all pairwise distance comparisons and take the mean.

By making a split like this: AVVVVV-----JA
you indicate that the last V residue is far from the J-anchor. As a consequence, when compared to longer CDR3 sequences, it might end up in the same position as a residue in the middle between the two anchor residues.
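The "per-position distance" described in this comment could be computed along these lines. This is only a sketch: the function name `per_position_distance`, the input layout (one list of 3D coordinates per sequence, one point per ungapped residue, all structures already superimposed on the anchors), and the toy coordinates are assumptions made for illustration, not anything from the repository.

```python
from itertools import combinations
from math import dist
from statistics import mean


def per_position_distance(aligned, coords, gap="-"):
    """Mean pairwise 3D distance between the residues in each alignment column.

    aligned: gapped sequences, all of the same length.
    coords:  for each sequence, one (x, y, z) point per ungapped residue,
             assumed to be already superimposed on the anchor residues.
    Returns one value per column; None where fewer than two residues are present.
    """
    cursors = [0] * len(aligned)  # next unused residue index for each sequence
    result = []
    for col in range(len(aligned[0])):
        points = []
        for i, row in enumerate(aligned):
            if row[col] != gap:
                points.append(coords[i][cursors[i]])
                cursors[i] += 1
        pairs = list(combinations(points, 2))
        result.append(mean([dist(p, q) for p, q in pairs]) if pairs else None)
    return result


if __name__ == "__main__":
    # Made-up coordinates, purely to show the call signature.
    aligned = ["AB", "AB"]
    coords = [[(0, 0, 0), (0, 0, 0)], [(3, 4, 0), (0, 0, 1)]]
    print(per_position_distance(aligned, coords))
```

Averaging the returned values over all columns would give a single number for comparing two gap placements on the same set of structures.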

matsen · 2019-02-03T21:52:25Z

Ah, sorry to be unclear. I'm suggesting a constant 40% / 60% split. Something like this:

AVVV---VVJJJA
AVVJ---JJJJJA
AVV-----VVVJA

with the logic that generally the J contribution is a little more than the V contribution.
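For concreteness, a fixed-fraction split like the ones drawn in this thread can be sketched in a few lines of Python. This is an illustration rather than the code in the repository: the name `pad_split`, the `width` and `left_pct` parameters, and rounding the left share down are all assumptions, chosen because they reproduce the hand-written examples above.

```python
def pad_split(seq: str, width: int, left_pct: int = 50, gap: str = "-") -> str:
    """Pad seq to width by inserting gaps after the first left_pct percent of residues.

    The left share is rounded down, so any leftover residue goes to the right side.
    """
    if len(seq) > width:
        raise ValueError("sequence is longer than the target width")
    n_left = len(seq) * left_pct // 100  # integer arithmetic avoids float rounding
    return seq[:n_left] + gap * (width - len(seq)) + seq[n_left:]


if __name__ == "__main__":
    for cdr3 in ["AVVVVVJJJA", "AVVJJJJJJA", "AVVVVVJA"]:
        # 50/50 split next to the proposed 40/60 split
        print(pad_split(cdr3, 13, 50), pad_split(cdr3, 13, 40))
```

Treating the split as a single parameter would make it straightforward to compare 50/50 against 40/60 empirically.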

krdav · 2019-02-03T22:18:24Z

Okay, this changes things slightly, but I will still argue that a 50/50 split will give the smallest per-position distance - which is what we want. If we did not care about this at all we should just left align everything.

It is hard to see the problem with a 40/60 split because it is already so close to a 50/50 split. The problem with 40/60 is that it is not symmetric. To see the problem more clearly let's look at an extreme case with one short and one long CDR3 and a more extreme 10/90 split:
AVV---VVVVVVVVJJJJJJJJJJA
AV-------------VVVVJJJJJA

Take position 13 in the alignment. For the first sequence this is in the middle of the long CDR3, far away from the V-anchor. For the second sequence this is the second residue after the V-anchor. If we had the protein structures, I guarantee you that those two residues would be far away from each other.
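Without structures, the asymmetry can still be illustrated with a crude sequence-only proxy: for every residue, count how many residues separate it from the nearer anchor, then check how much that count varies within each alignment column. The proxy, the helper names, the floor rounding, and the toy sequences below are all assumptions made for this sketch. It is not the structural per-position distance and not code from the repository.

```python
def pad_split(seq, width, left_pct, gap="-"):
    """Insert gaps after the first left_pct percent of residues (rounded down)."""
    n_left = len(seq) * left_pct // 100
    return seq[:n_left] + gap * (width - len(seq)) + seq[n_left:]


def nearest_anchor_offsets(row, gap="-"):
    """Per column: residue count to the nearer end of the sequence, None for a gap."""
    n = sum(ch != gap for ch in row)
    offsets, i = [], 0
    for ch in row:
        if ch == gap:
            offsets.append(None)
        else:
            offsets.append(min(i, n - 1 - i))
            i += 1
    return offsets


def max_column_spread(seqs, left_pct):
    """Largest within-column difference in nearest-anchor offset for a given split."""
    width = max(len(s) for s in seqs)
    rows = [nearest_anchor_offsets(pad_split(s, width, left_pct)) for s in seqs]
    spreads = []
    for column in zip(*rows):
        values = [v for v in column if v is not None]
        if len(values) > 1:
            spreads.append(max(values) - min(values))
    return max(spreads, default=0)


if __name__ == "__main__":
    long_cdr3 = "A" + "V" * 10 + "J" * 10 + "A"  # toy sequences, not real data
    short_cdr3 = "A" + "V" * 5 + "J" * 5 + "A"
    for pct in (50, 40, 10):
        print(pct, max_column_spread([long_cdr3, short_cdr3], pct))
```

For this particular pair, the 50/50 split puts only residues with identical offsets in the same column, while the 10/90 split pairs a residue next to an anchor with one deep inside the long loop.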

matsen · 2019-02-03T22:51:44Z

This is all a minor point, so we should only continue if it's fun. So...

> a 50/50 split will give the smallest per-position distance - which is what we want.

Is that our only objective? I'd say that having the V's on one side and the J's on the other allows us to learn the rules of VDJ recombination more easily. I'm not convinced that alignment of residues in the middle of CDR3s of different length is actually so meaningful from a structural homology perspective.

> If we did not care about this at all we should just left align everything.

No, not at all. If we left aligned everything, then we'd lose the real homology between the J genes for sequences of different length.

krdav · 2019-02-04T02:10:31Z

> Is that our only objective? I'd say that having the V's on one side and the J's on the other allows us to learn the rules of VDJ recombination more easily.

Maybe, but following that argument we should be splitting on the V/J gene border and not at a hard 40/60 threshold. Also, we don't even know where V starts and J ends; we impute it from alignment. But even if we had the true V/J start/end, I still think a structurally justified 50/50 split is better.

Also, I do think there is a meaningful structural difference:

Granted, the difference gets smaller the closer we get to 50/50, so 40/60 is not far from that.

matsen · 2019-02-04T12:07:49Z

Hm, it seems like you're wanting to take this argument to extremes. That's not what I'm proposing. Also, I'm not proposing anything other than a fixed split, ever.

Given non-equal-sized building blocks, it seems impossible that 50/50 would be the optimal split. Perhaps it's 49/51, but I don't see how 50/50 can be optimal. I'd think that structural homology is strongest in germline-gene-encoded regions, so that a slight modification would actually improve structural homology.

krdav · 2019-02-04T17:23:20Z

Well, the only reason I took it to the extreme was to show how an asymmetric split breaks.

Regarding "non-equal-sized building blocks": the amino acid backbone is actually equally sized (with the slight exception of proline, which is a bit more rotationally constrained).

I didn't grab the symmetry concept out of nowhere. The AHo numbering also uses two anchor residues and splits insertion residues between them.
https://www.sciencedirect.com/science/article/pii/S0022283601946625
And this is their reason: "it places the alignment gaps in a way that minimizes the average deviation from the averaged structure of the aligned domains"

I don't know if this is the kind of structural homology you are referring to?

Ultimately, this is a theoretical argument, but I will completely surrender to your argument if you can show me that this improves any of the empirical metrics. My prediction is that a 40/60 split won't really do anything.

matsen · 2019-02-04T17:27:48Z

Yes, I knew all of this would just come down to "well, we'll see!"

Machine learning... 😬

krdav · 2019-02-04T17:30:48Z

Haha, feed into black box, watch what comes out, present it like you knew all along.

matsen added the "enhancement" (New feature or request) label · 2019-02-21
