This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Ability to limit by date in csv_to_records #380

Closed
peterbe opened this issue Mar 23, 2018 · 8 comments
@peterbe
Contributor

peterbe commented Mar 23, 2018

When you scrape with latest-inventory-to-records it analyses every single row from the CSV files downloaded from the manifest in S3.

This makes it exceptionally time-consuming to run the whole thing. Not only is it inconvenient when running it multiple times (because you might be testing something), but with so many operations to perform, the risk of a bad network error somewhere blowing everything up is real.

Alternatively, we could use this to do something like:

  • Once a week/day run the whole thing in a cron job. Pretty sure we have all the firefox 20, 21, 22 etc. releases in the database, but if our algorithm changes or improves it'd be nice to revisit those old releases from S3 and try them again differently.

  • If it's now not so horribly slow, you can run the cron job every hour instead. That would let us more quickly patch runtime flaws in the Lambda executions.

@peterbe peterbe self-assigned this Mar 23, 2018
@peterbe
Contributor Author

peterbe commented Mar 23, 2018

After a looong time the script just finished here on my laptop. Before it started I injected a piece of code like this:

with open('lastmodified.log', 'a') as f:  # XXX temporary instrumentation
    f.write('{}\n'.format(lastmodified.isoformat()))

...inside csv_to_records.
It's HUUUGE.

▶ ls -lh lastmodified.log
-rw-r--r--  1 peterbe  staff    10M Mar 23 13:50 lastmodified.log

▶ wc -l lastmodified.log
  526708 lastmodified.log

I wrote a little script to analyze the dates it spotted:

YEAR 2015 104,357
YEAR 2016 136,109
YEAR 2017 236,664
YEAR 2018 49,578
2018 MONTH 1 20,370
2018 MONTH 2 18,669
2018 MONTH 3 10,539
PER DAY, MEAN   666.75
PER DAY, MEDIAN 408.0
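The original analysis script isn't shown, but a sketch along these lines (function name and log format assumed from the snippet above) would produce that kind of summary:

```python
from collections import Counter
from datetime import date
from statistics import mean, median

def summarize(path='lastmodified.log'):
    """Tally the ISO timestamps in the log by year, by 2018 month, and per day."""
    years, months, days = Counter(), Counter(), Counter()
    with open(path) as f:
        for line in f:
            d = date.fromisoformat(line[:10])  # e.g. '2018-03-23T13:50:00'
            years[d.year] += 1
            if d.year == 2018:
                months[d.month] += 1
            days[d] += 1
    for year, count in sorted(years.items()):
        print('YEAR {} {:,}'.format(year, count))
    for month, count in sorted(months.items()):
        print('2018 MONTH {} {:,}'.format(month, count))
    per_day = list(days.values())
    print('PER DAY, MEAN   {:.2f}'.format(mean(per_day)))
    print('PER DAY, MEDIAN {}'.format(median(per_day)))
```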

So arguably if you run something like MIN_DAYS=1 latest-inventory-to-records you'd only have to check about 400 releases.
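A minimal sketch of what such a cutoff could look like inside csv_to_records, assuming a MIN_DAYS environment variable (the variable name and helper are illustrative, not the shipped implementation):

```python
import os
from datetime import datetime, timedelta, timezone

# 0 (the default) means "no cutoff", i.e. today's behavior.
MIN_DAYS = int(os.environ.get('MIN_DAYS', '0'))

def is_recent(lastmodified, min_days=MIN_DAYS):
    """Return True if the row's lastmodified is within the cutoff window."""
    if not min_days:
        return True
    cutoff = datetime.now(timezone.utc) - timedelta(days=min_days)
    return lastmodified >= cutoff
```

csv_to_records would then skip any row where `is_recent(...)` is False, so an hourly run only touches the ~400 releases per median day instead of 500k+ rows.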

@peterbe peterbe changed the title Ability to limit lastmodifieddate in csv_to_records Ability to limit by date in csv_to_records Mar 23, 2018
@peterbe
Contributor Author

peterbe commented Mar 23, 2018

Once this lands, I think we should take this to ops and ask them to execute cron a bit more carefully. E.g.

  • MON,TUE,WED,THU,FRI,SAT - Run MIN_AGE_LAST_MODIFIED_HOURS=48 latest-inventory-to-kinto
  • SUN - Run latest-inventory-to-kinto
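In crontab terms, that schedule could look something like this (the 02:00 run time is an arbitrary placeholder; the command names are from above):

```shell
# Mon-Sat: only consider records modified in the last 48 hours.
0 2 * * 1-6  MIN_AGE_LAST_MODIFIED_HOURS=48 latest-inventory-to-kinto
# Sunday: full run, no age limit, to catch anything the fast runs missed.
0 2 * * 0    latest-inventory-to-kinto
```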

@bqbn
Collaborator

bqbn commented Mar 23, 2018

Sounds good. Please file a bugzilla with instructions so we can follow up on this.

@peterbe
Contributor Author

peterbe commented Mar 26, 2018

@peterbe
Contributor Author

peterbe commented Mar 26, 2018

We might want to consider running the whole job a LOT more frequently. AWS S3 regenerates the manifest.json files every 24h. Only about 15% of the new files mentioned in each daily manifest.json are repeated.

But if we change the frequency to, say, every 1h, then most of the time will be spent downloading the .csv.gz files in the latest manifest.json. That would be easy to optimize with a little disk cache implementation, so that the actual network download penalty is potentially only paid once a day.
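A sketch of such a disk cache, assuming the inventory files are immutable once published (the cache directory and download helper are illustrative, not project code):

```python
import hashlib
import os
import urllib.request

CACHE_DIR = '/tmp/inventory-cache'  # illustrative location

def cached_download(url):
    """Return the body at `url`, fetching over the network at most once.

    Keyed on a hash of the URL: repeated hourly runs that see the same
    .csv.gz files in the manifest read them from disk instead of S3.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
    if not os.path.exists(path):
        with urllib.request.urlopen(url) as resp, open(path, 'wb') as f:
            f.write(resp.read())
    with open(path, 'rb') as f:
        return f.read()
```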

@leplatrem
Collaborator

What would be the goal of running it more often, like every 1h?
To catch up faster on missing data due to an S3 event lambda error?

@bqbn
Collaborator

bqbn commented Mar 26, 2018

[1] doesn't say how often the cron should run, except that it should run once a week without MIN_AGE_LAST_MODIFIED_HOURS=48.

I wonder, can the cron run frequently enough that it replaces the lambda? If it can, that's good news, because we'd have one less thing to maintain.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1448871

@peterbe
Contributor Author

peterbe commented Mar 27, 2018

What would be the goal of running it more often, like every 1h?
To catch up faster on missing data due to an S3 event lambda error?

Ultimately, yes.

Every 1h is extreme, especially since the manifest.json will be the same 24 times in a row. Also, if the lambda fails at 01:00 you still have to wait 23 hours until that new release shows up in the CSV files.

Ideally...

  • Run frequently so it runs as soon as the new manifest.json is made available
  • Re-run frequently within a day if anything goes wrong
  • Not hurt (or not run at all) if it worked flawlessly for the most recent manifest.json
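One way to get the "not run at all if it worked flawlessly" property is to remember a fingerprint of the last successfully processed manifest.json and bail out early when it hasn't changed. A sketch, with a hypothetical state-file path:

```python
import hashlib
import os

STATE_FILE = '/var/tmp/last-manifest.sha256'  # illustrative

def should_run(manifest_bytes):
    """True when this manifest hasn't been processed successfully yet."""
    digest = hashlib.sha256(manifest_bytes).hexdigest()
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            if f.read().strip() == digest:
                return False  # same manifest, and the previous run succeeded
    return True

def mark_done(manifest_bytes):
    """Record success only once the whole run has completed."""
    with open(STATE_FILE, 'w') as f:
        f.write(hashlib.sha256(manifest_bytes).hexdigest())
```

With that check, the cron can fire every hour cheaply: a failed run never writes the state file, so the next hourly tick retries automatically.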

Lastly, ideally we should not have Lambda any more at all. One less thing to worry about. Especially since the email notifications are practically useless.

If we ditch Lambda, it generally means we lose the real-timeness. :(
Perhaps a more frequent scan of TaskCluster Artifact API might be able to replace that. Not sure yet.
