Duplicate training samples with different coordinates, PSSM and entropy #26

Open
tueboesen opened this issue Sep 24, 2020 · 1 comment


@tueboesen

During a sanity check of this data, I noticed that quite a lot of the training examples have identical sequences but different PSSM and entropy values. The coordinates of these duplicates are not identical either, even after accounting for translation and rotation. In the one example I actually plotted, after superimposing the two structures under translation and rotation, the coordinates were close to identical but deviated in a few places.

See the attached example (it was too long to paste in here)
identical_sequences.zip
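
For anyone who wants to reproduce the coordinate comparison without doing an explicit superposition, here is a minimal sketch. It is not necessarily the method used above: it compares the internal pairwise distance matrices of two structures, which do not change under translation or rotation. The function names and the toy coordinates are hypothetical and not taken from the dataset. Note that distance matrices are also unchanged by reflection, so a mirror image would not be flagged.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between all points in coords."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def per_residue_deviation(coords_a, coords_b):
    """Largest change in any pairwise distance involving each residue.

    A value of 0 for every residue means the two structures agree up to
    a rigid motion (or a reflection); large values localize the residues
    where the structures differ.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    dist_a = distance_matrix(coords_a)
    dist_b = distance_matrix(coords_b)
    return [
        max(abs(x - y) for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(dist_a, dist_b)
    ]


# Hypothetical toy coordinates (one point per residue).
structure_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
# structure_a rotated 90 degrees about the z axis and shifted by (5, 5, 0).
structure_b = [(5, 5, 0), (5, 6, 0), (5, 7, 0)]
# Same as structure_b, but with the last residue displaced.
structure_c = [(5, 5, 0), (5, 6, 0), (5, 8, 0)]

print(per_residue_deviation(structure_a, structure_b))
print(per_residue_deviation(structure_a, structure_c))
```

With real data, a small tolerance would be needed instead of an exact comparison against zero, since stored coordinates are rounded.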

Other training examples were repeated 6 times in the data.
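
A check of this kind can be sketched roughly as follows. This is only an illustration: it assumes the records have already been parsed into dictionaries with hypothetical `id` and `primary` (sequence) keys, which may not match how the files are actually laid out.

```python
from collections import defaultdict


def group_by_sequence(records):
    """Map each repeated sequence to the IDs of all records that share it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["primary"]].append(rec["id"])
    # Keep only sequences that occur more than once.
    return {seq: ids for seq, ids in groups.items() if len(ids) > 1}


# Hypothetical toy records, not taken from the dataset.
records = [
    {"id": "a", "primary": "MKV"},
    {"id": "b", "primary": "MKV"},
    {"id": "c", "primary": "GHT"},
]

duplicates = group_by_sequence(records)
for seq, ids in duplicates.items():
    print(f"{len(ids)} copies of {seq}: {ids}")
```

Grouping on the exact sequence string only catches fully identical sequences, which is the case described here; near-duplicates would need a sequence-identity comparison instead.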

Is there a good reason for this, or is it an error?

@tueboesen
Author

Note that this problem is not limited to the training_100 data; it also extends into the training_95 data. I find this very surprising, since I would have expected the clustering to at the very least group or remove identical sequences.
