This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Ability to limit by date in csv_to_records #380

Closed
peterbe opened this issue Mar 23, 2018 · 8 comments
@peterbe
Contributor

peterbe commented Mar 23, 2018

When you scrape with latest-inventory-to-records it analyses every single row from the CSV files downloaded from the manifest in S3.

This makes it exceptionally time-consuming to run the whole thing. Not only is it inconvenient when running it multiple times (because you might be testing something), but with so many operations to perform, the risk of a bad network error somewhere blowing everything up is real.

Alternatively, we could use this to do something like:

  • Once a week/day run the whole thing in a cron job. Pretty sure we have all the firefox 20, 21, 22 etc. releases in the database, but if our algorithm changes or improves it'd be nice to revisit those old releases from S3 and try them again differently.

  • If it's now not so horribly slow, you can run the cron job every hour instead. That would let us more quickly patch runtime flaws in the Lambda executions.

@peterbe peterbe self-assigned this Mar 23, 2018
@peterbe
Contributor Author

peterbe commented Mar 23, 2018

After a looong time the script just finished here on my laptop. Before it started I injected a piece of code like this:

with open('lastmodified.log', 'a') as f:  # XXX temporary instrumentation
    f.write('{}\n'.format(lastmodified.isoformat()))

...inside csv_to_records.
It's HUUUGE.

▶ ls -lh lastmodified.log
-rw-r--r--  1 peterbe  staff    10M Mar 23 13:50 lastmodified.log

▶ wc -l lastmodified.log
  526708 lastmodified.log

I wrote a little script to analyze the dates it spotted:

YEAR 2015 104,357
YEAR 2016 136,109
YEAR 2017 236,664
YEAR 2018 49,578
2018 MONTH 1 20,370
2018 MONTH 2 18,669
2018 MONTH 3 10,539
PER DAY, MEAN   666.75
PER DAY, MEDIAN 408.0
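The original analysis script isn't shown, but a sketch along these lines (function name and log format assumed from the snippet above) would produce that kind of summary:

```python
from collections import Counter
from datetime import date
from statistics import mean, median

def summarize(path='lastmodified.log'):
    """Tally the ISO timestamps in the log by year, by 2018 month, and per day."""
    years, months, days = Counter(), Counter(), Counter()
    with open(path) as f:
        for line in f:
            d = date.fromisoformat(line[:10])  # e.g. '2018-03-23T13:50:00'
            years[d.year] += 1
            if d.year == 2018:
                months[d.month] += 1
            days[d] += 1
    for year, count in sorted(years.items()):
        print('YEAR {} {:,}'.format(year, count))
    for month, count in sorted(months.items()):
        print('2018 MONTH {} {:,}'.format(month, count))
    per_day = list(days.values())
    print('PER DAY, MEAN   {:.2f}'.format(mean(per_day)))
    print('PER DAY, MEDIAN {}'.format(median(per_day)))
```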

So arguably if you run something like MIN_DAYS=1 latest-inventory-to-records you'd only have to check about 400 releases.
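A minimal sketch of what such a cutoff could look like inside csv_to_records, assuming a MIN_DAYS environment variable (the variable name and helper are illustrative, not the shipped implementation):

```python
import os
from datetime import datetime, timedelta, timezone

# 0 (the default) means "no cutoff", i.e. today's behavior.
MIN_DAYS = int(os.environ.get('MIN_DAYS', '0'))

def is_recent(lastmodified, min_days=MIN_DAYS):
    """Return True if the row's lastmodified is within the cutoff window."""
    if not min_days:
        return True
    cutoff = datetime.now(timezone.utc) - timedelta(days=min_days)
    return lastmodified >= cutoff
```

csv_to_records would then skip any row where `is_recent(...)` is False, so an hourly run only touches the ~400 releases per median day instead of 500k+ rows.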

@peterbe peterbe changed the title Ability to limit lastmodifieddate in csv_to_records Ability to limit by date in csv_to_records Mar 23, 2018
@peterbe
Contributor Author

peterbe commented Mar 23, 2018

Once this lands, I think we should take this to ops and ask them to execute cron a bit more carefully. E.g.

  • MON,TUE,WED,THU,FRI,SAT - Run MIN_AGE_LAST_MODIFIED_HOURS=48 latest-inventory-to-kinto
  • SUN - Run latest-inventory-to-kinto
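In crontab terms, that schedule could look something like this (the 02:00 run time is an arbitrary placeholder; the command names are from above):

```shell
# Mon-Sat: only consider records modified in the last 48 hours.
0 2 * * 1-6  MIN_AGE_LAST_MODIFIED_HOURS=48 latest-inventory-to-kinto
# Sunday: full run, no age limit, to catch anything the fast runs missed.
0 2 * * 0    latest-inventory-to-kinto
```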

@bqbn
Collaborator

bqbn commented Mar 23, 2018

Sounds good. Please file a bugzilla with instructions so we can follow up on this.

@peterbe
Contributor Author

peterbe commented Mar 26, 2018

@peterbe
Contributor Author

peterbe commented Mar 26, 2018

We might want to consider running the whole job a LOT more frequently. AWS S3 regenerates the manifest.json files every 24h. Only about 15% of the new files mentioned in each daily manifest.json are repeated.

But if we change the frequency to, say, every 1h, then most of the time will be spent downloading the .csv.gz files in the latest manifest.json. That would be easy to optimize with a little disk cache implementation, so that the actual network download penalty is potentially only paid once a day.
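A sketch of such a disk cache, assuming the inventory files are immutable once published (the cache directory and download helper are illustrative, not project code):

```python
import hashlib
import os
import urllib.request

CACHE_DIR = '/tmp/inventory-cache'  # illustrative location

def cached_download(url):
    """Return the body at `url`, fetching over the network at most once.

    Keyed on a hash of the URL: repeated hourly runs that see the same
    .csv.gz files in the manifest read them from disk instead of S3.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
    if not os.path.exists(path):
        with urllib.request.urlopen(url) as resp, open(path, 'wb') as f:
            f.write(resp.read())
    with open(path, 'rb') as f:
        return f.read()
```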

@leplatrem
Collaborator

What would be the goal of running it more often, like every 1h?
To catch up faster on missing data due to an S3 event lambda error?

@bqbn
Collaborator

bqbn commented Mar 26, 2018

[1] doesn't say how often the cron should run, except that it should run once a week without MIN_AGE_LAST_MODIFIED_HOURS=48.

I wonder, can the cron run frequently enough that it replaces the lambda? If it can, that's good news, because we'd have one less thing to maintain.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1448871

@peterbe
Contributor Author

peterbe commented Mar 27, 2018

What would be the goal of running it more often, like every 1h?
To catch up faster on missing data due to an S3 event lambda error?

Ultimately, yes.

Every 1h is extreme, especially since the manifest.json will be the same 24 times in a row. Also, if the lambda fails at 01:00 you still have to wait 23 hours until that new release shows up in the CSV files.

Ideally...

  • Run frequently so it runs as soon as the new manifest.json is made available
  • Re-run frequently within a day if anything goes wrong
  • Not hurt (or not run at all) if it worked flawlessly for the most recent manifest.json
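One way to get the "not run at all if it worked flawlessly" property is to remember a fingerprint of the last successfully processed manifest.json and bail out early when it hasn't changed. A sketch, with a hypothetical state-file path:

```python
import hashlib
import os

STATE_FILE = '/var/tmp/last-manifest.sha256'  # illustrative

def should_run(manifest_bytes):
    """True when this manifest hasn't been processed successfully yet."""
    digest = hashlib.sha256(manifest_bytes).hexdigest()
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            if f.read().strip() == digest:
                return False  # same manifest, and the previous run succeeded
    return True

def mark_done(manifest_bytes):
    """Record success only once the whole run has completed."""
    with open(STATE_FILE, 'w') as f:
        f.write(hashlib.sha256(manifest_bytes).hexdigest())
```

With that check, the cron can fire every hour cheaply: a failed run never writes the state file, so the next hourly tick retries automatically.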

Lastly, ideally we should not have Lambda any more at all. One less thing to worry about. Especially since the email notifications are practically useless.

If we ditch Lambda, it generally means we lose the real-timeness. :(
Perhaps a more frequent scan of TaskCluster Artifact API might be able to replace that. Not sure yet.
