
Avoid loading large files from S3 into memory #2028

Open
arsalansufi opened this issue Oct 30, 2024 · 4 comments
arsalansufi commented Oct 30, 2024

https://votingworks.slack.com/archives/CKCVA0F9S/p1730309143498049

@arsalansufi arsalansufi added this to the 2024 milestone Oct 30, 2024
@carolinemodic carolinemodic self-assigned this Oct 31, 2024
carolinemodic commented Nov 5, 2024

Dominion - No bottlenecks.
Hart - Loads all of the file names and metadata into memory when unzipping zip folders. The zip folders in this CSV implementation contain a LOT of very small files, so this can increase memory by up to a couple hundred MB for a 1.5GB zip file (containing 8 sub-zip files, each containing 100k+ files). I think this is OK; we would need an insanely huge zip for it to be problematic.
ES&S - As previously identified, this part of the code loads the entire ballots file into memory to sort it, which is very expensive in terms of memory (it seems to increase memory by much more than the file size). To fix this, we would have to write sorted chunks to temporary files and do a merge sort across them. Doable but not trivial.
Clear Ballot - Have not prioritized testing, but quickly glancing at the code, I think it's likely fine.
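The chunked merge sort mentioned for ES&S could look roughly like this. This is a hypothetical sketch, not Arlo's actual code: the function name `sort_large_csv`, the sort key (first column), and the chunk size are all illustrative assumptions.

```python
# Hypothetical sketch of an external merge sort for a large ballots CSV.
# Not Arlo's implementation; names and the sort key are illustrative.
import csv
import heapq
import tempfile
from typing import IO, Iterator, List


def sort_large_csv(input_path: str, output_path: str, chunk_size: int = 100_000) -> None:
    """Sort a CSV by its first column without loading the whole file into memory."""
    chunk_files: List[IO[str]] = []

    # Phase 1: sort fixed-size chunks in memory and spill each to a temp file.
    with open(input_path, newline="") as infile:
        reader = csv.reader(infile)
        while True:
            chunk = [row for _, row in zip(range(chunk_size), reader)]
            if not chunk:
                break
            chunk.sort(key=lambda row: row[0])
            tmp = tempfile.TemporaryFile(mode="w+", newline="")
            csv.writer(tmp).writerows(chunk)
            tmp.seek(0)
            chunk_files.append(tmp)

    # Phase 2: k-way merge of the sorted temp files using a heap.
    def rows(f: IO[str]) -> Iterator[List[str]]:
        yield from csv.reader(f)

    with open(output_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for row in heapq.merge(*(rows(f) for f in chunk_files), key=lambda row: row[0]):
            writer.writerow(row)

    for f in chunk_files:
        f.close()
```

Since `heapq.merge` holds only one row per chunk file at a time, peak memory is bounded by `chunk_size` rows plus one row per temp file, rather than by the whole input.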

carolinemodic commented

Passing off to @eventualbuddha to investigate the ES&S issue further.

eventualbuddha commented

I investigated resource usage for processing ES&S CVR & ballot files. I used @carolinemodic's generated files that had about 500K CVRs:

~/Downloads/arlo-scale-testing-11-2024 
❯ cat ballotsbig.csv | wc -l
 2611511

~/Downloads/arlo-scale-testing-11-2024 
❯ cat cvrbig.csv | wc -l
  522303

It is indeed slow and memory intensive. Using a manually-created Jurisdiction and set of uploads, I ran the following script to measure the time and memory usage:

#!/usr/bin/env python

import os
import time

os.environ["FLASK_ENV"] = "development"

from server.api.cvrs import parse_ess_cvrs
from server.models import Jurisdiction

jurisdiction_id = "9eae837e-37b2-4ef7-ac4a-5977c403608e"
jurisdiction = Jurisdiction.query.get(jurisdiction_id)

start_time = time.time()
(metadata, cvrs) = parse_ess_cvrs(jurisdiction, "/tmp/arlo-perf-test")

count = 0
try:
    for cvr in cvrs:
        count += 1
finally:
    end_time = time.time()
    print("Parsed %d CVRs in %s seconds" % (count, end_time - start_time))

    # Print peak RSS memory usage (note: ru_maxrss is KB on Linux, bytes on macOS).
    import resource

    print("Memory usage: %s (kb)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

I also watched the process with htop and added some more granular logs in the parse_ess_cvrs function. At peak, it used about 3.4GB of memory. It also took about 26s to reach the point where it starts matching entries, at which point it crashed due to a mismatch in the data between the files.

This isn't great, but it should be within what a worker dyno can handle as we've configured them. In practice, the largest set of ES&S CVRs we're likely to see is about 180K records, or about 1/3 of this "big" test fixture. With a RAM floor of about 130MB for our Python process, that suggests a peak RAM utilization of somewhere around 1.3GB.
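As a quick sanity check on that 1.3GB figure, here is the arithmetic, under the simplifying assumption (mine, not something measured here) that memory scales linearly with record count on top of a fixed process floor:

```python
# Back-of-the-envelope estimate using the numbers from this thread.
# Assumption: RAM = fixed process floor + per-record cost * record count.
peak_gb = 3.4                # observed peak RSS for the ~500K-CVR test fixture
floor_gb = 0.13              # ~130MB baseline RSS of the Python process
ratio = 180_000 / 500_000    # largest expected ES&S upload vs. the test fixture

estimate_gb = floor_gb + (peak_gb - floor_gb) * ratio
print(f"Estimated peak RAM: {estimate_gb:.2f} GB")  # → Estimated peak RAM: 1.31 GB
```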

I also did a spike of using Rust to process these files. It's about 20x faster (1.2s) and uses about 13x less RAM (250MB), and I wasn't even trying to make it efficient. There are huge gains to be had by moving away from Python for anything CPU-bound.

@eventualbuddha eventualbuddha removed their assignment Nov 12, 2024
eventualbuddha commented

Unassigned myself so @arsalansufi can re-prioritize.

@jonahkagan jonahkagan removed this from the 2024 milestone Nov 13, 2024
@arsalansufi arsalansufi added this to the 2025 milestone Dec 2, 2024