
Disk cache the manifests #392

Closed
peterbe opened this issue Apr 12, 2018 · 0 comments

Comments


peterbe commented Apr 12, 2018

Every 24h, about 80+% of the manifest sub-URLs change. These are the *.csv.gz files.

But within that 24h window the files are 100% identical every time (since their names contain hashes). So if you download more than once within 24h, you could greatly benefit from a disk cache.

The production server that runs the cron job sits near S3 so the downloads are pretty fast. Also, we're only running the cron job every 24h at the moment.

However, for local development it's nearly impossible to run the scraping more than once. It's just too much network downloading, which makes it hard to try again and again.

Also, if we could implement caching with a 24h lifetime, we would be able to run the scraping cron job in production every 1h instead. That would leave us much better equipped to recover if the cron fails. I.e. if the lambda event is missed and the scraping bombs out, you currently have to wait a whole other day to get it right. That has definitely happened in production.
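
A minimal sketch of what such a disk cache could look like (the cache directory, the helper name, and the use of requests are assumptions for illustration, not anything that exists in this repo):

```python
import hashlib
import os
import time

import requests

CACHE_DIR = "/tmp/manifest-cache"  # hypothetical location
MAX_AGE_SECONDS = 24 * 60 * 60     # entries older than 24h get re-downloaded


def cached_download(url):
    """Download `url`, reusing a copy on disk if it is less than 24h old."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # The sub-URLs contain hashes, so the URL itself is a stable cache key.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key)

    if os.path.exists(path) and time.time() - os.path.getmtime(path) < MAX_AGE_SECONDS:
        with open(path, "rb") as f:
            return f.read()

    response = requests.get(url)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    return response.content
```

Keying on the URL works here because the hashed filenames mean a given sub-URL's content never changes during its lifetime; the 24h TTL just keeps stale entries from piling up.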
