Write Lighthouse reports to a new *_lighthouse table #16
- defaults to empty string if LH property not in HAR file
Skip row if too big for BQ. This may be too aggressive and we may want to consider truncating the JSON data (making it invalid?) or dropping the LH data entirely.
@rviscomi can you share an example of what these duplicates are?
@igrigorik here's the same color-contrast audit in two places
Hmm, @ebidel why is this data duplicated in LH reports? I imagine optimizing report size is not high on the agenda for LH, but duplication still feels a bit odd.
Known issue: GoogleChrome/lighthouse#1999 More info: when we render a report in LH, it's easier that the related audits to the category are nested under
Gotcha, thanks Eric. Would be great to get this fixed upstream. With more and more tools integrating LH reporting, I imagine this will become a recurring pain point.
I'm going to start the work upstream to remove the dupes. No promises on timeline though :)
Updated to use a separate HAR table. Doesn't seem to have an effect on the number of skipped rows, surprisingly. Also, ~50k pages have One positive side-effect of this change is that the processing takes the normal amount of 45 minutes, rather than 2.5 hours.
Faster import time is a nice win. Looks like we'll need to find a way to prune the data a bit more? =/
FWIW, I see only 18 tables with non-
Sorry you can ignore the 6/20 tables. Those are temporary and just for testing.
Here's a list of URLs and ranks for sites omitted from the LH table, presumably because they were too large and skipped: query (results) Lowest rank is 250 so it does affect some of the more popular sites (e.g. slate, wsj, digg, theknot, etc). Using this data, I can analyze these specific HARs for ways to minimize bytes.
@igrigorik does this LGTY? It'd be good to standardize the schema for a new LH table for the 7/15 crawl ending soon. I can follow up with optimizations to retain as many reports as possible.
Probably a non-event for the most part, but should probably account for URL size? Other than that, lgtm.
if (pageRowSize > MAX_CONTENT_SIZE) {
String lighthouseRowJSON = MAPPER.writeValueAsString(lighthouseRow);
Integer lighthouseRowSize = lighthouseRowJSON.getBytes("UTF-8").length;
if (lighthouseRowSize > MAX_CONTENT_SIZE) {
Should you account for URL size as well? We do this for body table.
Done
👍
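For readers following the review thread above, here is a minimal sketch of a size check that counts the URL along with the serialized report. It is not the code merged in this PR. The class name `RowSizeCheck`, the helpers `utf8Length` and `fitsInRow`, and the 2 MB limit are all hypothetical; the real value of `MAX_CONTENT_SIZE` does not appear in this conversation. It also assumes the URL bytes are counted separately from the report JSON.

```java
import java.nio.charset.StandardCharsets;

public class RowSizeCheck {
    // Hypothetical limit for illustration only; the pipeline's real
    // MAX_CONTENT_SIZE is not shown in this thread.
    static final int MAX_CONTENT_SIZE = 2 * 1024 * 1024;

    /** UTF-8 byte length of a string, treating null as empty. */
    static int utf8Length(String s) {
        return s == null ? 0 : s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True when the URL plus the serialized report fit within maxBytes. */
    static boolean fitsInRow(String url, String reportJson, int maxBytes) {
        // Sum in a long so two large inputs cannot overflow an int.
        long total = (long) utf8Length(url) + utf8Length(reportJson);
        return total <= maxBytes;
    }

    public static void main(String[] args) {
        String url = "https://example.com/";
        String report = "{\"audits\":{}}";
        // A row that does not fit would be skipped rather than written.
        System.out.println(fitsInRow(url, report, MAX_CONTENT_SIZE));
    }
}
```

Using `StandardCharsets.UTF_8` instead of the string `"UTF-8"` avoids the checked `UnsupportedEncodingException` that `getBytes(String)` declares.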
Updated production dataflow worker
Rick, when do you expect the first dumps to show up?
…On Fri, Jul 28, 2017, 11:59 AM Rick Viscomi ***@***.***> wrote:
Updated production dataflow worker
@ebidel The 6/1, 6/15, and 7/1 crawls have been beta testing this schema so it's already in prod. The 7/15 crawl looks just about done so it should also be processed in the next few days.
To summarize the latest changes in this PR:
- NULL rather than the string "null" when applicable
These are efforts to minimize the row size of the LH report so that we can fit more reports. Here is the number of skipped rows per optimization: