Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add opt-in rich fasta headers with metadata #3448

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

corneliusroemer
Copy link
Contributor

@corneliusroemer corneliusroemer commented Dec 16, 2024

resolves #2284

preview URL: https://fasta-header.loculus.org

Summary

Allow unaligned sequences to be downloaded with a rich FASTA header (configured at organism level for simplicity), e.g. displayName instead of accessionVersion.

This requires creating a new astro API endpoint, for now /[organism]/api/sequences because LAPIS does not want to implement a dynamic FASTA header feature. This endpoint fetches metadata and sequences separately and joins them before responding with the result.

To prevent excessive memory consumption, there's a hardcoded limit of 5k sequences that can be accessed via this endpoint (enforced via an initial request to /aggregated).

Compression is not supported for this endpoint but this is currently not worth the effort (rich header is for ad hoc analysis/viewing not for download of full data).

Screenshot

2024-12-16 23 10 13

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@corneliusroemer corneliusroemer added the preview Triggers a deployment to argocd label Dec 16, 2024
@corneliusroemer corneliusroemer marked this pull request as ready for review December 17, 2024 14:26
@theosanderson
Copy link
Member

Thanks for looking into this. I just tried it in the preview at https://fasta-header.loculus.org/ebola-sudan/search? , and it didn't seem to yield a difference?

I don't know if you saw, but I hacked something similar together at https://github.com/theosanderson/metadata-sequence-combiner to allow #3439 .

I'm a bit uncertain about whether we should add this to Loculus when it only support <5000 sequences. I do think it would be possible to stream from LAPIS from metadata and sequences and combine them on the fly in a stream, scaling to an unlimited number of sequences. I think my vote might be to keep this out of Loculus until we are able to do that, and rely on hacks on Pathoplexus to allow Nextclade and other integrations - which I think are the main use case atm?

Multi segmented

Refactor

Guard against excessive request size to prevent memory spikes

Refactor

Aadd UI for rich headers

Make configurable

f

proper base name

rich headers don't support compression
@corneliusroemer
Copy link
Contributor Author

I hadn't drilled through the new value into the website config (locally I'd just edited the config json directly), should be fixed now.

I'm a bit uncertain about whether we should add this to Loculus when it only support <5000 sequences. I do think it would be possible to stream from LAPIS from metadata and sequences and combine them on the fly in a stream, scaling to an unlimited number of sequences. I think my vote might be to keep this out of Loculus until we are able to do that, and rely on hacks on Pathoplexus to allow Nextclade and other integrations - which I think are the main use case atm?

I think I disagree on the first part (not doing it unless streaming) because imperfect but working is better than not having it all and if there's enough demand one can add streaming later. I see it like that: Right now one can download only 0 sequences with rich header, now one can do 5k, clear improvement, 90+% of total benefit achieved.

I do really like the tools UI/UX you've come up with, very smooth. I'm just a bit worried about deviating from our monorepo pattern for something that can be considered a core feature. But nothing stops us from going down the tools route and then insourcing it later, so happy either wray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Triggers a deployment to argocd
Projects
None yet
2 participants