Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(lapis2): stream data from SILO #745

Merged
merged 1 commit into from
Apr 24, 2024
Merged

Conversation

fengelniederhammer
Copy link
Contributor

@fengelniederhammer fengelniederhammer commented Apr 22, 2024

resolves #744

PR Checklist

  • [ ] All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

Summary

The streaming still happens in the same thread as the request handling (by not using StreamingResponseBody). We'll have to observe whether this causes any issues in the future. I couldn't properly get using StreamingResponseBody work with the way we implemented compression. We have to close the compressions streams here, but then the lazy stream can't write anymore.

I profiled LAPIS with two datasets.

TLDR

  • Memory consumption when downloading 3.5k sequences goes down from 500 - 900 MB to a 130 - 300 MB.
  • Memory consumption when downloading 1M sequence details goes down from 14.5 GB (peak) goes down to a stable usage of ~100MB for the whole download.

Ebola

I executed this a couple of times with the data of this branch and observed the memory consumption:

curl -X 'GET' 'http://localhost:8090/sample/alignedNucleotideSequences' \
                   -H 'accept: text/x-fasta' > /dev/null

Memory consumption on main:
grafik

Memory consumption on this branch:
grafik

The open Covid dataset on the testserver

I connected my local LAPIS to the SILO running on the server and downloaded metadata:

curl -X 'GET' 'http://localhost:8090/sample/details?dataFormat=csv&limit=1000000'

This is the memory consumption on the current main:
grafik

Also the download was incredibly slow. It took 3.5 minutes to downlaod 426 MB:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  426M  100  426M    0     0  2054k      0  0:03:32  0:03:32 --:--:-- 97.7M

This is the same for the current branch:
grafik

The download isn't fast, but still almost twice as fast:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  426M    0  426M    0     0  3462k      0 --:--:--  0:02:06 --:--:--  2618

Note: The SILO version that this has been tested with does not stream any data lazily yet. It still loads the full result into memory. Only LAPIS handles the response stream from SILO lazily.

Copy link
Contributor

@JonasKellerer JonasKellerer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fengelniederhammer
Copy link
Contributor Author

fengelniederhammer commented Apr 24, 2024

Merging despite red docs tests, since this didn't change anything there. It's some infrastructure problem with GitHub / Microsoft and/or Ubuntu.

@fengelniederhammer fengelniederhammer merged commit 8fcf360 into main Apr 24, 2024
10 of 11 checks passed
@fengelniederhammer fengelniederhammer deleted the 744-stream-sequences branch April 24, 2024 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stream sequences
2 participants