Duplicate training samples with different coordinates, PSSM and entropy #26

tueboesen · 2020-09-24T17:49:38Z

During a sanity check of this data I noticed that quite a lot of the training examples have identical sequences but different PSSMs and entropy. The coordinates of these duplicates are not identical either, even allowing for translation and rotation. For the one example I actually plotted, after superimposing the coordinates under translation and rotation, the two sets of coordinates were close to identical but deviated in a few places.
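For anyone who wants to reproduce the coordinate comparison, here is a minimal sketch of one way to superimpose two structures under translation and rotation. It uses the Kabsch algorithm via NumPy, which is an assumption on my part and not necessarily how the plot mentioned above was made. The function name `superimpose_deviation` and the `(N, 3)` array layout are hypothetical.

```python
import numpy as np


def superimpose_deviation(coords_a, coords_b):
    """Optimally rotate/translate coords_a onto coords_b (Kabsch algorithm).

    Both inputs are assumed to be (N, 3) arrays with matching point order.
    Returns (rmsd, per_point_distance) after superposition.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")

    # Remove translation by centering both point sets on their centroids.
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)

    # Covariance matrix and its singular value decomposition.
    h = a_centered.T @ b_centered
    u, _, vt = np.linalg.svd(h)

    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Apply the rotation to the row-vector coordinates of a.
    a_aligned = a_centered @ rotation.T

    per_point = np.linalg.norm(a_aligned - b_centered, axis=1)
    rmsd = float(np.sqrt(np.mean(per_point ** 2)))
    return rmsd, per_point
```

If two entries were exact copies up to a rigid motion, the RMSD would be numerically zero. A small RMSD with a few large entries in `per_point` would match the "close to identical, but deviating in a few places" pattern.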

See the attached example (it was too long to paste in here)
identical_sequences.zip

Other training examples were repeated 6 times in the data.
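The duplicate check itself can be sketched roughly as follows. This is only an illustration: it assumes the records have already been parsed into `(record_id, sequence)` pairs, and the function name, IDs and sequences below are made up, not taken from the data.

```python
from collections import defaultdict


def find_duplicate_sequences(records):
    """Group record IDs by exact sequence and keep groups with more than one ID.

    `records` is assumed to be an iterable of (record_id, sequence) pairs.
    """
    groups = defaultdict(list)
    for record_id, sequence in records:
        groups[sequence].append(record_id)
    return {seq: ids for seq, ids in groups.items() if len(ids) > 1}


# Hypothetical toy input; real IDs and sequences would come from the dataset.
records = [
    ("rec_a", "MKTAYIAK"),
    ("rec_b", "GSHMLEDP"),
    ("rec_c", "MKTAYIAK"),
]

duplicates = find_duplicate_sequences(records)
for sequence, ids in duplicates.items():
    print(f"{len(ids)} copies of {sequence}: {ids}")
```

Sorting the resulting groups by `len(ids)` would show which sequences are repeated most often.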

Is there any good reason for this or is this an error?

tueboesen · 2020-09-24T18:50:17Z

It should be noted that this problem is not limited to the training_100 data; it also extends to the training_95 data. I find this very surprising, since I would have expected the clustering to at the very least group or remove identical sequences.
