Require Ensembl aliases to be versioned and drop unversioned from database #80

Closed
5 tasks done
reece opened this issue Apr 9, 2020 · 0 comments

reece commented Apr 9, 2020

Unversioned Ensembl aliases are not unique unless they are qualified by a release-specific namespace.
Maintaining release-versioned Ensembl namespaces is expensive: approx. 400k aliases for each release. With 20 releases, that's a 20x expansion in Ensembl alias size.

Versioned accessions have been available since e83. It's time to drop support for unversioned aliases and, therefore, for release-versioned Ensembl namespaces. SeqRepo will now use a single Ensembl namespace (rather than a versioned Ensembl-## namespace).

  • Remove Ensembl releases <= 84. This sidesteps issues regarding unversioned accessions and a problem with ambiguous identifiers for e84 protein accessions.
  • Remove Ensembl aliases that start with GENSCAN, KI, or GL
  • Ensure that no unversioned Ensembl aliases remain
  • Collapse remaining Ensembl-nn aliases into a single Ensembl namespace, preserving history
  • Require that Ensembl aliases are versioned on loading
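
The tasks above could be sketched roughly as a single normalization step applied to each (namespace, alias) record. This is only an illustration, not SeqRepo's actual code: the function name `normalize_ensembl_alias`, the `Ensembl-nn` namespace pattern, the rule that a versioned alias ends in `.<digits>`, and the example accessions are all assumptions made for the sketch.

```python
import re

# Assumption: a versioned alias ends in ".<digits>", e.g. "ENST00000288602.11".
_VERSIONED_RE = re.compile(r"^[^.\s]+\.\d+$")
# Assumption: release-specific namespaces are spelled "Ensembl-<release>".
_RELEASE_NS_RE = re.compile(r"^Ensembl-(\d+)$")
# Alias prefixes the issue says to remove.
_DROPPED_PREFIXES = ("GENSCAN", "KI", "GL")
# Releases at or below this number are removed entirely.
_MAX_DROPPED_RELEASE = 84


def normalize_ensembl_alias(namespace, alias):
    """Return the (namespace, alias) pair to keep, or None to drop the record.

    Raises ValueError for an Ensembl alias that carries no version.
    Hypothetical helper; not part of any real SeqRepo API.
    """
    match = _RELEASE_NS_RE.match(namespace)
    if match is None and namespace != "Ensembl":
        # Not an Ensembl record: leave it untouched.
        return (namespace, alias)
    if match is not None and int(match.group(1)) <= _MAX_DROPPED_RELEASE:
        # Old release: drop rather than migrate.
        return None
    if alias.startswith(_DROPPED_PREFIXES):
        return None
    if not _VERSIONED_RE.match(alias):
        raise ValueError(f"unversioned Ensembl alias: {alias!r}")
    # Collapse Ensembl-nn into the single Ensembl namespace.
    return ("Ensembl", alias)
```

Raising on an unversioned alias matches the "require on loading" task; a one-off database cleanup might instead choose to delete such records, which this sketch does not attempt. Preserving alias history during the collapse is also outside its scope.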