
Update pacer free document command to avoid high memory usage #4472

Merged

Conversation

quevon24 (Member)

  • Remove @throttle_task from the get-PDFs process, because scheduling task retries introduces very long delays. This happens mainly when many items from the same court have to be processed and there are no documents from other courts to interleave with them.
  • Wait longer (3 seconds) before queuing up more items from the same court.

wait longer when cycling the same court over and over again
quevon24 linked an issue on Sep 17, 2024 that may be closed by this pull request
sentry-io bot commented Sep 17, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/corpus_importer/tasks.py

Function: get_and_save_free_document_report
Unhandled Issue: ParsingException: Got XML when expecting HTML and cannot parse it. (cl.corpus_importer.tasks in get_and_save_fr...)
Event Count: 1


mlissner (Member)

Alberto has done more of the bulk scraping stuff than I have recently, so I'd like to get his eyes here too. I think architecturally, if I'm understanding this correctly, the idea is to stop queueing everything up all at once and hammering Celery, and instead iterate over all the courts in a loop, doing each one every three seconds. Accurate?

quevon24 (Member, Author)

> Alberto has done more of the bulk scraping stuff than I have recently, so I'd like to get his eyes here too. I think architecturally, if I'm understanding this correctly, the idea is to stop queueing everything up all at once and hammering Celery, and instead iterate over all the courts in a loop, doing each one every three seconds. Accurate?

When there are documents from multiple courts, we wait 1 second during each cycle, but if the remaining documents all come from the same court, only that court is cycled, so we wait 3 seconds to give it extra time.
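
A minimal sketch of the cycling behavior described above; the function and `schedule_pdf_download` are illustrative stand-ins, not the command's actual code:

```python
import time

def schedule_pdf_download(doc) -> None:
    """Hypothetical stand-in for enqueueing the Celery download task."""
    ...

def process_free_documents(docs_by_court: dict[str, list]) -> None:
    """Cycle over courts, scheduling one document per court per pass,
    so no court is hit back-to-back while others still have work."""
    while any(docs_by_court.values()):
        active_courts = [c for c, docs in docs_by_court.items() if docs]
        for court in active_courts:
            schedule_pdf_download(docs_by_court[court].pop(0))
        # Only one court left: wait longer (3s) before cycling it again.
        time.sleep(3 if len(active_courts) == 1 else 1)
```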

albertisfu (Contributor) left a comment


This looks good. Just a comment regarding the sleep value used to wait between court cycles.

```diff
 )
-time.sleep(1)
+time.sleep(sleep)
```
albertisfu (Contributor)

As we discussed, we could improve the sleep value here based on the number of courts being cycled through, to ensure we don't surpass the scrape rate of 1/4s per court that we previously had via the throttle_task decorator. We could consider the time it takes to process and download a document, then compute a dynamic value or threshold based on the number of courts being processed. This way, even when only a few courts remain in the list, we still maintain the 1/4s per court rate.

mlissner (Member)

> 1/4s per court rate

Is that 0.25s per court or am I misunderstanding?

albertisfu (Contributor)

That's 1 task every 4 seconds per court, according to the get_task_wait docstrings.

mlissner (Member)

Right, duh, thank you. So if sleep is set to four seconds, we'd do each court at most every four seconds, right? But if we use some timing info, we can set this dynamically so that we sleep exactly four seconds for each loop? Like, if downloads take 2s, then we set the sleep to 2s, and boom, 4s is achieved?

albertisfu (Contributor)

> Like, if downloads take 2s, then we set the sleep to 2s, and boom, 4s is achieved?

Yeah, that's right. I think Kevin already has some timing info we can use here. The other scenario we need to consider is when the number of courts with remaining documents to scrape shrinks, e.g. when only three courts keep cycling:

```
ca1
ca2
ca3
ca1
ca2
ca3
...
```

In this case, with the current approach, we would schedule one task per court per second, which exceeds the one-task-per-4-seconds rate per court. So the idea is to take the number of courts in the last cycle and the average time to process a document, and use them to compute the sleep time for that cycle, ensuring these courts stay at or below one task every 4 seconds.
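
A sketch of the dynamic sleep computation being proposed here; the 4-second target comes from the discussion above, while the function and its names are illustrative:

```python
TARGET_SECONDS_PER_COURT = 4.0  # one task every 4 seconds per court

def compute_cycle_sleep(num_courts: int, avg_doc_seconds: float) -> float:
    """Sleep applied after each document so that a full cycle over
    num_courts courts revisits each court no more than every 4 seconds.

    Each court is visited once per cycle, and a cycle takes roughly
    num_courts * (avg_doc_seconds + sleep), so we solve for sleep in:
        num_courts * (avg_doc_seconds + sleep) >= TARGET_SECONDS_PER_COURT
    """
    if num_courts == 0:
        return 0.0
    return max(TARGET_SECONDS_PER_COURT / num_courts - avg_doc_seconds, 0.0)

# Three courts at ~1s per document: sleep ~0.33s, so each court sees ~4s cycles.
# One court at ~2s per document: sleep 2s, matching the example above.
```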

mlissner (Member)

Got it. Sounds great!

Don't try to upload PACERFreeDocumentLog in development, because PacerHtmlFiles uses S3PrivateUUIDStorage

keep count of courts iterated in previous cycle
…k-throttling-and-queue-buildup' into 4455-redis-memory-spike-from-task-throttling-and-queue-buildup
quevon24 (Member, Author)

I tried to make an approximation by measuring the time it takes to perform the most important part, which is downloading the PDF. The problem is that the time varies by court: in the best case it takes ~4s or less, but in other cases I saw 7 to 9 seconds between the POST being made and the binary PDF data being received.

Therefore, I tried a different approach that takes into account the number of courts processed in the last cycle to adjust the minimum and maximum number of elements in the queue. This way, the fewer pending courts there are, the fewer elements the queue will process. It waits the poll_interval (3s) from the CeleryThrottle class until there is space available in the queue.

To use this approach we are going to need two queues: one for the daily cron and one for the sweep. This is because we depend on the size of the queue, adjusting it every cycle, and if both run at the same time against the same queue, one of them could become greedy.
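
A rough sketch of the queue-based throttling pattern described here, assuming a CeleryThrottle like the one in cl.lib.celery_utils; the constructor arguments, the update_min_items helper, and enqueue_pdf_task are assumptions, not the command's actual code:

```python
from cl.lib.celery_utils import CeleryThrottle  # the class the PR relies on

def enqueue_pdf_task(doc_pk: int, queue_name: str) -> None:
    """Hypothetical stand-in for the real Celery download task."""
    ...

def scrape_free_documents(docs_by_court: dict[str, list], queue_name: str) -> None:
    # One throttle per command run, tied to its own queue, so a concurrent
    # cron/sweep run can't make this queue look fuller than it is.
    throttle = CeleryThrottle(queue_name=queue_name)  # assumed constructor args
    while any(docs_by_court.values()):
        active_courts = [c for c, docs in docs_by_court.items() if docs]
        # Fewer pending courts -> shrink the allowed queue size, so the
        # throttle blocks (polling every 3s) until there is room again.
        throttle.update_min_items(len(active_courts))  # hypothetical helper
        for court in active_courts:
            throttle.maybe_wait()  # waits while the queue is too full
            enqueue_pdf_task(docs_by_court[court].pop(0), queue_name)
```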

albertisfu (Contributor) left a comment

This looks good to me.

  • The new throttling mechanism, based exclusively on CeleryThrottle, will control the scraping rate based on the number of courts being scraped and will adjust the queue_length accordingly. This is similar to the approach we used in the ready_mix_cases_project, so as Kevin mentioned, it will require creating an independent Celery queue for each process running the command. This way, the throttling mechanism for each running command is not affected by tasks from other processes.

  • Additionally, once this is merged, I think we should consider running the daily scraping based on how many days the command has not been run.

mlissner merged commit 229e5a4 into main on Sep 26, 2024
13 checks passed
mlissner deleted the 4455-redis-memory-spike-from-task-throttling-and-queue-buildup branch on September 26, 2024 at 22:45
mlissner (Member)

Cool, this is merged. @quevon24, will you make an issue in freelawproject/infrastructure to communicate to Ramiro how to get this launched?

Successfully merging this pull request may close these issues:

  • Redis memory spike from task throttling and queue buildup