
Disk cache the manifests #392

Closed
peterbe opened this issue Apr 12, 2018 · 0 comments

Comments


peterbe commented Apr 12, 2018

Every 24h, about 80+% of the manifest sub-URLs change. These are the *.csv.gz files.

But within that 24h window the files are 100% identical every time (since their names contain hashes). So if you download more than once within 24h, you could greatly benefit from a disk cache.

The production server that runs the cron job sits near S3 so the downloads are pretty fast. Also, we're only running the cron job every 24h at the moment.

However, for local development it's nearly impossible to run the scraping more than once. It's just too much network downloading, which makes it hard to try again and again.

Also, if we could implement caching with a 24h lifetime, we would be able to run the scraping cron job in production every 1h instead. That would leave us much better equipped to recover if the cron fails. I.e. if the lambda event is missed and the scraping bombs out, you currently have to wait a whole other day to get it right. That has definitely happened in production.
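
A minimal sketch of what such a disk cache could look like (the cache directory, the helper name, and the use of requests are assumptions for illustration, not anything that exists in this repo):

```python
import hashlib
import os
import time

import requests

CACHE_DIR = "/tmp/manifest-cache"  # hypothetical location
MAX_AGE_SECONDS = 24 * 60 * 60     # entries older than 24h get re-downloaded


def cached_download(url):
    """Download `url`, reusing a copy on disk if it is less than 24h old."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # The sub-URLs contain hashes, so the URL itself is a stable cache key.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key)

    if os.path.exists(path) and time.time() - os.path.getmtime(path) < MAX_AGE_SECONDS:
        with open(path, "rb") as f:
            return f.read()

    response = requests.get(url)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    return response.content
```

Keying on the URL works here because the hashed filenames mean a given sub-URL's content never changes during its lifetime; the 24h TTL just keeps stale entries from piling up.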
