Uneven split #88

Open
matsen opened this issue Feb 2, 2019 · 12 comments
Labels
enhancement New feature or request

Comments

@matsen
Contributor

matsen commented Feb 2, 2019

Right now I cut the CDR3 exactly in half. But here's a plot of the length of the untrimmed genes:

image

It looks like we'd probably do better keeping the V on one side and the J on the other if we put 40% on the V side and 60% on the J.

@matsen
Contributor Author

matsen commented Feb 2, 2019

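library(ggplot2)

# Length of each germline-encoded CDR3 amino acid sequence
df <- read.csv('/home/matsen/Downloads/repos/vampire/vampire/data/germline-cdr3-aas.csv', stringsAsFactors=FALSE)
df$length <- nchar(df$sequence)

# Histogram of lengths, colored by locus
ggplot(df, aes(length, fill=locus)) + geom_histogram()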

@krdav
Collaborator

krdav commented Feb 2, 2019

I disagree with that (🙄) because this is supposed to split the residues between the anchor residues, i.e. looking at a structure, half of the loop should be left-aligned and half should be right-aligned. Splitting on the V/J gene contribution to the CDR3 does not guarantee that.

@matsen
Contributor Author

matsen commented Feb 3, 2019

So what you are saying is that you think that there is meaningful structural homology in the middle 20% of the CDR3? 🤔

@krdav
Collaborator

krdav commented Feb 3, 2019

I am not sure what you mean by "meaningful structural homology".

Maybe an example can make my point clearer. Here are three CDR3 sequences. A = anchor residue, V = residue from V gene, J = residue from J gene:
AVVVVVJJJA
AVVJJJJJJA
AVVVVVJA

I suggest these are split into:
AVVVV---VJJJA
AVVJJ---JJJJA
AVVV-----VVJA

You suggest they are split into:
AVVVVV---JJJA
AVV---JJJJJJA
AVVVVV-----JA

If we had the true protein structures of these and aligned them all to the anchor residues, I argue that the per-position distance, in 3D space, would be smaller for my alignment. To get the "per-position distance", walk along the alignment; at position X take the residues and map them back onto their protein structures, then make all pairwise distance comparisons and take the mean.

By making a split like this: AVVVVV-----JA
You indicate that the last V residue is far from the J-anchor and as a consequence, when comparing to longer CDR3 sequences, it might be in the same position as a residue in the middle between the two anchor residues.
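The per-position distance described above can be sketched as follows. This is only an illustration: the function name `per_position_distance`, the input layout, and the toy coordinates are assumptions made for the example, not anything from the vampire code base, and real coordinates would have to come from structures already superposed on the anchor residues.

```python
from itertools import combinations
from math import dist


def per_position_distance(aligned, coords, gap="-"):
    """Mean pairwise 3D distance between residues sharing an alignment column.

    aligned: equal-length gapped strings, one per sequence.
    coords:  one list of (x, y, z) tuples per sequence, one tuple per residue,
             assumed to be superposed on the anchor residues beforehand.
    Returns one value per column, or None where fewer than two sequences
    have a residue in that column.
    """
    next_residue = [0] * len(aligned)  # index of the next unconsumed residue
    result = []
    for col in range(len(aligned[0])):
        points = []
        for s, row in enumerate(aligned):
            if row[col] != gap:
                points.append(coords[s][next_residue[s]])
                next_residue[s] += 1
        pairs = list(combinations(points, 2))
        if pairs:
            result.append(sum(dist(p, q) for p, q in pairs) / len(pairs))
        else:
            result.append(None)
    return result


# Toy input: made-up coordinates, purely to show the bookkeeping.
aligned = ["AVJA", "A--A"]
coords = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)],
    [(0, 0, 0), (3, 4, 0)],
]
print(per_position_distance(aligned, coords))
```

Averaging the non-`None` entries would give a single score per alignment scheme, which is one way the competing splits could be compared if structures were available.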

@matsen
Contributor Author

matsen commented Feb 3, 2019

Ah, sorry to be unclear. I'm suggesting a constant 40% / 60% split. Something like this:

AVVV---VVJJJA
AVVJ---JJJJJA
AVV-----VVVJA

with the logic that generally the J contribution is a little more than the V contribution.
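For reference, the three padding rules discussed so far (half split, split at the V/J border, fixed 40/60 split) can be written as one small helper. This is a sketch under stated assumptions: the names `pad`, `split_fixed` and `split_at_vj` are hypothetical, the use of `round()` for the fixed fraction is a guess at the rounding rule, and the V/J border is read off the toy `A`/`V`/`J` letters used in this thread rather than from a real gene annotation.

```python
def pad(seq: str, n_left: int, width: int, gap: str = "-") -> str:
    """Keep the first n_left residues flush left and the rest flush right."""
    if len(seq) > width:
        raise ValueError("sequence is longer than the target width")
    return seq[:n_left] + gap * (width - len(seq)) + seq[n_left:]


def split_fixed(seq: str, width: int, left_frac: float = 0.5) -> str:
    """Fixed-fraction split. round() is an assumed rounding rule; note that
    Python rounds exact ties to the nearest even integer."""
    return pad(seq, round(left_frac * len(seq)), width)


def split_at_vj(seq: str, width: int) -> str:
    """Split after the last 'V' in the toy A/V/J notation (assumption:
    a real implementation would take the border from an annotation)."""
    return pad(seq, seq.rindex("V") + 1, width)


cdr3s = ["AVVVVVJJJA", "AVVJJJJJJA", "AVVVVVJA"]
width = 13

for label, rule in [
    ("50/50", lambda s: split_fixed(s, width, 0.5)),
    ("V/J border", lambda s: split_at_vj(s, width)),
    ("40/60", lambda s: split_fixed(s, width, 0.4)),
]:
    print(label)
    for s in cdr3s:
        print("  " + rule(s))
```

With these three toy sequences and a width of 13, the helper reproduces the alignments written out by hand in the comments above.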

@krdav
Collaborator

krdav commented Feb 3, 2019

Okay, this changes things slightly, but I will still argue that a 50/50 split will give the smallest per-position distance - which is what we want. If we did not care about this at all we should just left align everything.

It is hard to see the problem with a 40/60 split because it is already so close to a 50/50 split. The problem with 40/60 is that it is not symmetric. To see the problem more clearly let's look at an extreme case with one short and one long CDR3 and a more extreme 10/90 split:
AVV---VVVVVVVVJJJJJJJJJJA
AV-------------VVVVJJJJJA

Take position 16 in the alignment. For the first sequence this is in the middle of the long CDR3, far away from the V-anchor. For the second sequence this is the second residue after the V-anchor. If we had the protein structures, I guarantee you that those two residues would be far away from each other.
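Without structures, a crude stand-in for this argument is to compare, column by column, how many residues separate each aligned residue from its nearest anchor. The sketch below does that for the two strings above and for a 50/50 re-split of the same sequences. The depth measure is an assumption introduced here as a proxy for 3D distance, and the names `depths`, `depth_mismatch` and `resplit` are hypothetical.

```python
def depths(aligned: str, gap: str = "-"):
    """Per column: residues between this residue and its nearest anchor
    (None where the column is a gap)."""
    n = sum(c != gap for c in aligned)
    out, i = [], 0
    for c in aligned:
        if c == gap:
            out.append(None)
        else:
            out.append(min(i, n - 1 - i))
            i += 1
    return out


def depth_mismatch(a: str, b: str):
    """Column-wise |depth_a - depth_b| where both sequences have a residue."""
    return [
        abs(x - y) if x is not None and y is not None else None
        for x, y in zip(depths(a), depths(b))
    ]


def resplit(aligned: str, left_frac: float, gap: str = "-") -> str:
    """Re-pad the same sequence to the same width with a different split."""
    seq = aligned.replace(gap, "")
    n_left = round(left_frac * len(seq))
    return seq[:n_left] + gap * (len(aligned) - len(seq)) + seq[n_left:]


long_1090 = "AVV---VVVVVVVVJJJJJJJJJJA"
short_1090 = "AV-------------VVVVJJJJJA"

uneven = depth_mismatch(long_1090, short_1090)
even = depth_mismatch(resplit(long_1090, 0.5), resplit(short_1090, 0.5))

# Largest mismatch and the (0-based) column where it occurs.
worst, col = max((m, c) for c, m in enumerate(uneven) if m is not None)
print("10/90 worst mismatch:", worst, "at 1-based column", col + 1)
print("50/50 worst mismatch:", max(m for m in even if m is not None))
```

A symmetric split gives zero mismatch under this proxy by construction, so it only restates the symmetry argument; it says nothing about whether depth from the anchor tracks real structural distance.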

@matsen
Contributor Author

matsen commented Feb 3, 2019

This is all a minor point, so we should only continue if it's fun. So...

> a 50/50 split will give the smallest per-position distance - which is what we want.

Is that our only objective? I'd say that having the V's on one side and the J's on the other allows us to learn the rules of VDJ recombination more easily. I'm not convinced that alignment of residues in the middle of CDR3s of different length is actually so meaningful from a structural homology perspective.

> If we did not care about this at all we should just left align everything.

No, not at all. If we left aligned everything, then we'd lose the real homology between the J genes for sequences of different length.
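A small sketch of why pure left alignment loses the J-side correspondence: the J-side anchor lands in a different column for each sequence length, whereas any split that right-aligns the tail pins it to the last column. The helper names (`left_align`, `half_split`, `last_residue_columns`) are hypothetical, and the sequences are the toy `A`/`V`/`J` strings from earlier in this thread.

```python
def left_align(seq: str, width: int, gap: str = "-") -> str:
    """Pad on the right only, so every residue is flush left."""
    return seq + gap * (width - len(seq))


def half_split(seq: str, width: int, gap: str = "-") -> str:
    """Left-align the first half and right-align the second half."""
    half = len(seq) // 2
    return seq[:half] + gap * (width - len(seq)) + seq[half:]


def last_residue_columns(rows, gap: str = "-"):
    """Set of 0-based columns in which the final (J-side anchor) residue sits."""
    return {len(row.rstrip(gap)) - 1 for row in rows}


cdr3s = ["AVVVVVJJJA", "AVVJJJJJJA", "AVVVVVJA"]
width = 13

print(last_residue_columns([left_align(s, width) for s in cdr3s]))
print(last_residue_columns([half_split(s, width) for s in cdr3s]))
```

The same holds for a 40/60 split, since only the left-aligned portion changes size; the tail stays flush right either way.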

@krdav
Collaborator

krdav commented Feb 4, 2019

> Is that our only objective? I'd say that having the V's on one side and the J's on the other allows us to learn the rules of VDJ recombination more easily.

Maybe, but following that argument we should be splitting on the V/J gene border and not at a hard 40/60 threshold. Also, we don't even know when V starts and J ends - we impute it from alignment. But even if we had the true V/J start/end, I still think a structurally justified 50/50 split is better.

Also, I do think there is a meaningful structural difference:
[image: screenshot, 2019-02-03]

Granted, the difference gets smaller the closer we get to 50/50, so 40/60 is not far from that.

@matsen
Contributor Author

matsen commented Feb 4, 2019

Hm, it seems like you're wanting to take this argument to extremes. That's not what I'm proposing. Also, I'm not proposing anything other than a fixed split, ever.

Given non-equal-sized building blocks, it seems impossible that 50/50 would be the optimal split. Perhaps it's 49/51, but I don't see how 50/50 can be optimal. I'd think that structural homology is strongest in germline-gene-encoded regions, so that a slight modification would actually improve structural homology.

@krdav
Collaborator

krdav commented Feb 4, 2019

Well, the only reason I took it to the extreme was to show how an asymmetric split breaks.

"non-equal-sized building blocks": the amino acid backbone is actually equally sized (with the slight exception of proline, which is a bit more rotationally constrained).

I didn't grab the symmetry concept out of nowhere. The AHo numbering also uses two anchor residues and splits insertion residues between them.
https://www.sciencedirect.com/science/article/pii/S0022283601946625
And this is their reason: "it places the alignment gaps in a way that minimizes the average deviation from the averaged structure of the aligned domains"

I don't know if this is the kind of structural homology you are referring to?

Ultimately, this is a theoretical argument, but I will completely surrender to your argument if you can show me that this improves any of the empirical metrics. My prediction is that a 40/60 split won't really do anything.

@matsen
Contributor Author

matsen commented Feb 4, 2019

Yes, I knew all of this would just come down to "well, we'll see!"

Machine learning... 😬

@krdav
Collaborator

krdav commented Feb 4, 2019

Haha, feed into black box, watch what comes out, present it like you knew all along.

@matsen matsen added the enhancement New feature or request label Feb 21, 2019