Ability to limit by date in csv_to_records
#380
After a looong time the script just finished here on my laptop. Before it started I injected a piece of code like this:

```python
with open('lastmodified.log', 'a') as f:  # XXX
    f.write('{}\n'.format(lastmodified.isoformat()))
```

...inside
I wrote a little script to analyze the dates it spotted:
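The analysis script itself wasn't captured above. A minimal sketch of what such a script could look like, assuming `lastmodified.log` contains one ISO-8601 timestamp per line, as written by the snippet above (the sample timestamps here are illustrative):

```python
from datetime import datetime

# Write a tiny sample log for illustration; in practice this file is
# produced by the logging snippet injected into csv_to_records.
with open('lastmodified.log', 'w') as f:
    f.write('2018-03-20T12:00:00\n2017-06-01T08:30:00\n2018-03-25T09:15:00\n')

# Parse every timestamp and sort them to see the spread of
# last-modified dates the scrape actually touched.
with open('lastmodified.log') as f:
    dates = sorted(
        datetime.fromisoformat(line.strip()) for line in f if line.strip()
    )

print('rows analysed:', len(dates))
print('oldest last-modified:', dates[0].isoformat())
print('newest last-modified:', dates[-1].isoformat())
```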
So arguably if you run something like `csv_to_records` ...
Once this lands, I think we should take this to ops and ask them to execute cron a bit more carefully. E.g.
Sounds good. Please file a bugzilla with instructions so we can follow up on this.
Ops bug filed: https://bugzilla.mozilla.org/show_bug.cgi?id=1448871
We might want to consider running the whole job a LOT more frequently. The manifest.json files are changed every 24h by AWS S3. Only about 15% of the new files mentioned in each daily manifest.json are repeated. But if we change the frequency to, say, every 1h, then most of its time will be spent downloading the .csv.gz files in the latest manifest.json. This would be easy to optimize with a little disk cache implementation, so that the actual network download penalty is potentially only paid once a day.
What would be the goal of running it more often, like every 1h?
[1] doesn't say how often the cron should run, except that it should run once a week. I wonder, can the cron run frequently enough that it replaces the lambda? If it can do that, then that's good news because we'd have one less thing to maintain.
Ultimately, yes. Every 1h is extreme, especially since the manifest.json will be the same 24 times in a row. Also, if the lambda fails at 01:00 you still have to wait 23 hours till that new release is in the CSV files. Ideally...
Lastly, ideally we should not have Lambda any more at all. One less thing to worry about, especially since the email notifications are practically useless. If we ditch Lambda, though, it generally means we lose the real-time'ness. :(
When you scrape with `latest-inventory-to-records`, it analyses every single row from the CSV files downloaded from the manifest in S3. This makes it exceptionally time consuming to run the whole thing. Not only is it inconvenient when running it multiple times (because you might be testing something), but when so many operations have to be done, the risk of a bad network error somewhere blowing everything up is real.
Alternatively, we could use this to do something like:
Once a week/day run the whole thing in a cron job. Pretty sure we have all the firefox 20, 21, 22 etc. releases in the database, but if our algorithm changes or improves it'd be nice to revisit those old releases from S3 and try them again differently.
If it's no longer so horribly slow, you can run the cron job every hour instead. That would more quickly patch runtime flaws in the Lambda executions.
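A sketch of what the requested date limit could look like: a hypothetical `min_last_modified` argument that skips CSV rows older than a cutoff. The function name matches the issue title, but the signature, column layout, and record shape here are illustrative assumptions, not buildhub's actual code:

```python
import csv
from datetime import datetime


def csv_to_records(lines, min_last_modified=None):
    """Yield records from S3-inventory CSV lines, optionally skipping
    rows whose last-modified date is older than min_last_modified.

    Illustrative sketch: assumes a row layout of
    (bucket, key, size, last_modified ISO-8601 string).
    """
    for bucket, key, size, last_modified in csv.reader(lines):
        lastmodified = datetime.fromisoformat(last_modified)
        if min_last_modified and lastmodified < min_last_modified:
            continue  # row is too old; skip without further processing
        yield {'key': key, 'size': int(size), 'last_modified': lastmodified}
```

With a cutoff of, say, the date of the last successful run, the hourly cron would only pay the per-row processing cost for rows that actually changed since then.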