Every 24h, about 80+% of the manifest sub-URLs change. It's the *.csv.gz files.

But within any given 24h window the files are 100% identical every time (since their URLs contain content hashes). So if you download more than once within 24h, you could greatly benefit from a disk cache.

The production server that runs the cron job sits near S3, so the downloads are pretty fast. Also, we're only running the cron job every 24h at the moment.

However, for local development it's near impossible to run the scraping more than once. It's just too much network downloading, which makes it hard to try and try again.

Also, if we implemented caching with a 24h lifetime, we could run the scraping cron job in production every 1h instead. That would leave us much better equipped to recover if the cron fails: if the lambda event is missed and the scraping bombs out, you currently have to wait a whole other day to get it right. That has definitely happened in production.
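For illustration, here is a minimal sketch of the kind of 24h disk cache described above. It is not from this repo; the helper name, cache directory, and TTL constant are all assumptions, and it only relies on the fact stated above that a given *.csv.gz URL serves identical bytes within a 24h window.

```python
import hashlib
import time
from pathlib import Path

import requests

# Hypothetical cache location and lifetime -- adjust to taste.
CACHE_DIR = Path(".download-cache")
CACHE_TTL_SECONDS = 24 * 60 * 60  # the files are stable for ~24h


def cached_download(url: str) -> bytes:
    """Download `url`, reusing a local copy if it is younger than the TTL."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

    # Key the cache entry on the URL itself; the *.csv.gz URLs already embed
    # content hashes, so a same-URL hit within the TTL is the same payload.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_path = CACHE_DIR / key

    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age < CACHE_TTL_SECONDS:
            return cache_path.read_bytes()

    response = requests.get(url)
    response.raise_for_status()
    cache_path.write_bytes(response.content)
    return response.content
```

Wrapping the scraper's download calls in something like this would make repeated local runs cheap, and in production an hourly cron would only re-fetch the ~80% of sub-URLs that actually rotated since the last successful run.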