Announcing a New seqr Search Backend #4531
lynnpais announced in Feature Updates
Replies: 1 comment 1 reply
-
Hi @lynnpais. Thanks for this update! :) I was curious about the new search backend. Have you made any decisions about how you will move away from Hail as the backend variant store?
-
The seqr team is excited to announce support for a new backend variant search infrastructure powered by Hail. We will continue to support the existing Elasticsearch backend as a core part of seqr. Please review the changes and considerations below to decide whether and when to upgrade your seqr instance from the Elasticsearch backend to the new Hail backend.
New Features
Self-Service Data Loading Interface: A non-technical user can load data into seqr directly from the user interface, removing the need for complicated infrastructure setup and manually run scripts. Additionally, a lightweight user interface for the loading pipeline is available in seqr to simplify troubleshooting loading issues. A walkthrough of the loading interface is available in the local install documentation in the seqr repo.
Updated VEP and Plugin Annotations: We’ve updated the data loading pipeline to run VEP 110 and include support for the AlphaMissense, UTRAnnotator, and SpliceRegion plugins.
Global Allele Frequencies: Allele frequencies are computed internally within the seqr backend across all loaded callsets, and users see this global frequency regardless of their access to the underlying callsets.
Variant Lookup Tool: Users can now look up any single variant in seqr, including variants outside of their own projects, to see affected/unaffected status and some phenotype data. The information shown depends on the user's level of access. Projects can opt out of this feature.
Allele Registration with the ClinGen Allele Registry: New variants loaded into seqr may optionally be registered with the ClinGen Allele Registry. Information on how to enable this feature is available in the local install documentation in the seqr repo.
Automatic Reference Data Updates: seqr will automatically source variant reference data updates from both the codebase and external sources (e.g., ClinVar). These updates will propagate to all variants in seqr without reloading data or any other manual intervention.
Data Size and Cost Efficiency: The new variant store representation is ~20x more efficient than the Elasticsearch structure.
Modernized Deployment: The minimal docker-compose deployment of seqr has been replaced by a Helm chart, enabling scalable cloud deployments on Kubernetes. Technical documentation on setting up the new infrastructure and migrating existing application data is available in the open-source seqr-helm repository.
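As a rough illustration of the Global Allele Frequencies item above, here is a minimal sketch of how a frequency pooled across callsets could be computed. It is not seqr's implementation: the announcement does not describe the exact calculation, so the pooling formula (summed alternate allele counts divided by summed allele numbers) and the names `CallsetCounts` and `global_allele_frequency` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CallsetCounts:
    """Counts for one variant in one callset (hypothetical structure)."""
    ac: int  # alternate allele count observed in this callset
    an: int  # total number of called alleles in this callset


def global_allele_frequency(counts: Iterable[CallsetCounts]) -> Optional[float]:
    """Pool counts across callsets: sum(AC) / sum(AN).

    Assumption: a simple pooled frequency; seqr's actual method may differ.
    Returns None when no alleles were called in any callset.
    """
    counts = list(counts)
    total_ac = sum(c.ac for c in counts)
    total_an = sum(c.an for c in counts)
    if total_an == 0:
        return None
    return total_ac / total_an


# Made-up counts for one variant seen in two callsets.
callsets = [CallsetCounts(ac=3, an=200), CallsetCounts(ac=1, an=600)]
global_af = global_allele_frequency(callsets)
print(global_af)
```

Pooling the raw counts, rather than averaging per-callset frequencies, weights each callset by the number of alleles it contributes, so a small callset cannot dominate the result.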
Performance Considerations
While the new search backend is more efficient and scalable, most searches, especially recessive searches and searches across multiple projects, are experiencing substantially slower performance. See the “Future Compatibility” section below for more details on how future performance improvements may be made available.
We strongly recommend testing seqr on the main Broad instance before deciding whether the performance will be acceptable for your team. You can run test searches here to get a better understanding of the new search features and performance (you will first need to register your email for a free account in AnVIL).
Future Compatibility
For the time being, all seqr development will be compatible with either the existing Elasticsearch backend or the new Hail backend. However, all new search-specific functionality (e.g., eventual support for structural variants or long-read data) will only be made available in the new Hail backend.
Due to ongoing performance concerns, it is likely that seqr will transition to a different search backend in 2025. Should this happen, tooling will be made available to easily migrate from the Hail backend to whatever comes next.
Deprecated Features
This release is cloud-agnostic and does not yet support running the loading pipeline on Google Dataproc clusters. We anticipate that this feature will be available in 2025.
Data Migration
To migrate your search backend, you will need to reload each VCF into seqr via the loading interface. There is no functionality to export data directly from Elasticsearch. Our plan is for the new Hail backend to support direct data migration to future versions.
The Helm chart documentation includes instructions for migrating all other application data and static files. This means that all user-provided data, including case metadata and tagged variants, will be preserved.