
Avoid loading large files from S3 into memory #2028

Open
arsalansufi opened this issue Oct 30, 2024 · 4 comments
arsalansufi commented Oct 30, 2024

https://votingworks.slack.com/archives/CKCVA0F9S/p1730309143498049

@arsalansufi arsalansufi added this to the 2024 milestone Oct 30, 2024
@carolinemodic carolinemodic self-assigned this Oct 31, 2024
carolinemodic commented Nov 5, 2024

Dominion - No bottlenecks.
Hart - Loads all of the file names and metadata into memory when unzipping zip folders. The zip folders in this CSV implementation contain a LOT of very small files, so this can increase memory by up to a couple hundred MB for a 1.5GB zip file (containing 8 sub-zip files, each containing 100k+ files). I think this is OK; we would need an insanely huge zip for it to be problematic.
ES&S - As previously identified, this part of the code loads the entire ballots file into memory to sort it, which is very expensive in terms of memory (it seems to increase memory by much more than the file size). To fix this, we would have to write sorted chunks to temporary files and do a merge sort across them. Doable but not trivial.
Clear Ballot - Have not prioritized testing, but quickly glancing at the code, I think it's likely fine.
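The chunked merge sort mentioned for ES&S could look roughly like this. This is a hypothetical sketch, not Arlo's actual code: the function name `sort_large_csv`, the sort key (first column), and the chunk size are all illustrative assumptions.

```python
# Hypothetical sketch of an external merge sort for a large ballots CSV.
# Not Arlo's implementation; names and the sort key are illustrative.
import csv
import heapq
import tempfile
from typing import IO, Iterator, List


def sort_large_csv(input_path: str, output_path: str, chunk_size: int = 100_000) -> None:
    """Sort a CSV by its first column without loading the whole file into memory."""
    chunk_files: List[IO[str]] = []

    # Phase 1: sort fixed-size chunks in memory and spill each to a temp file.
    with open(input_path, newline="") as infile:
        reader = csv.reader(infile)
        while True:
            chunk = [row for _, row in zip(range(chunk_size), reader)]
            if not chunk:
                break
            chunk.sort(key=lambda row: row[0])
            tmp = tempfile.TemporaryFile(mode="w+", newline="")
            csv.writer(tmp).writerows(chunk)
            tmp.seek(0)
            chunk_files.append(tmp)

    # Phase 2: k-way merge of the sorted temp files using a heap.
    def rows(f: IO[str]) -> Iterator[List[str]]:
        yield from csv.reader(f)

    with open(output_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for row in heapq.merge(*(rows(f) for f in chunk_files), key=lambda row: row[0]):
            writer.writerow(row)

    for f in chunk_files:
        f.close()
```

Since `heapq.merge` holds only one row per chunk file at a time, peak memory is bounded by `chunk_size` rows plus one row per temp file, rather than by the whole input.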

carolinemodic commented

Passing off to @eventualbuddha to investigate the ES&S issue further.

eventualbuddha commented

I investigated resource usage for processing ES&S CVR & ballot files. I used @carolinemodic's generated files that had about 500K CVRs:

~/Downloads/arlo-scale-testing-11-2024 
❯ cat ballotsbig.csv | wc -l
 2611511

~/Downloads/arlo-scale-testing-11-2024 
❯ cat cvrbig.csv | wc -l
  522303

It is indeed slow and memory intensive. Using a manually-created Jurisdiction and set of uploads, I ran the following script to measure the time and memory usage:

#!/usr/bin/env python

import os
import time

os.environ["FLASK_ENV"] = "development"

from server.api.cvrs import parse_ess_cvrs
from server.models import Jurisdiction

jurisdiction_id = "9eae837e-37b2-4ef7-ac4a-5977c403608e"
jurisdiction = Jurisdiction.query.get(jurisdiction_id)

start_time = time.time()
(metadata, cvrs) = parse_ess_cvrs(jurisdiction, "/tmp/arlo-perf-test")

count = 0
try:
    for cvr in cvrs:
        count += 1
finally:
    end_time = time.time()
    print("Parsed %d CVRs in %s seconds" % (count, end_time - start_time))

    # Print peak RSS memory usage (note: ru_maxrss is KB on Linux, bytes on macOS).
    import resource

    print("Memory usage: %s (kb)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

I also watched the process with htop and added some more granular logs in the parse_ess_cvrs function. At peak, it used about 3.4GB of memory. It also took about 26s to reach the point where it starts matching entries, at which point it crashed due to a mismatch in the data between the files.

This isn't great, but it should be within what a worker dyno can handle as we've configured them. In practice, the largest set of ES&S CVRs we're likely to see is about 180K records, or about 1/3 of this "big" test fixture. With a RAM floor of about 130MB for our Python process, that suggests a peak RAM utilization of somewhere around 1.3GB.
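As a quick sanity check on that 1.3GB figure, here is the arithmetic, under the simplifying assumption (mine, not something measured here) that memory scales linearly with record count on top of a fixed process floor:

```python
# Back-of-the-envelope estimate using the numbers from this thread.
# Assumption: RAM = fixed process floor + per-record cost * record count.
peak_gb = 3.4                # observed peak RSS for the ~500K-CVR test fixture
floor_gb = 0.13              # ~130MB baseline RSS of the Python process
ratio = 180_000 / 500_000    # largest expected ES&S upload vs. the test fixture

estimate_gb = floor_gb + (peak_gb - floor_gb) * ratio
print(f"Estimated peak RAM: {estimate_gb:.2f} GB")  # → Estimated peak RAM: 1.31 GB
```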

I also did a spike of using Rust to process these files. It's about 20x faster (1.2s) and uses about 13x less RAM (250MB), and I wasn't even trying to make it efficient. There are huge gains to be had by moving away from Python for anything CPU-bound.

@eventualbuddha eventualbuddha removed their assignment Nov 12, 2024
eventualbuddha commented

Unassigned myself so @arsalansufi can re-prioritize.

@jonahkagan jonahkagan removed this from the 2024 milestone Nov 13, 2024
@arsalansufi arsalansufi added this to the 2025 milestone Dec 2, 2024