
Write Lighthouse reports to a new *_lighthouse table #16

Merged · 8 commits · Jul 28, 2017

Conversation

@rviscomi (Member) commented Jun 28, 2017

To summarize the latest changes in this PR:

  • LH reports are no longer written to the *_pages table
  • reportCategories.*.audits are omitted
  • screenshot-thumbnails items (base64 encoded images) are omitted
  • report is NULL rather than the string "null" when applicable

These are efforts to minimize the row size of the LH report so that we can fit more reports. Here are the numbers of skipped rows per optimization:

  • none: ~12k
  • omit audits: ~4.8k
  • ^ + omit images: ~4.6k
  • ^ + separate table: ~4.6k
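As a rough sketch of the first two optimizations (assuming Jackson, which the pipeline's MAPPER suggests; the exact handling of the screenshot-thumbnails items may differ):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class LighthousePruner {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Drop the audits duplicated under each report category and the
  // base64-encoded screenshot thumbnails, returning the pruned JSON.
  static String prune(String reportJSON) throws Exception {
    JsonNode report = MAPPER.readTree(reportJSON);
    // reportCategories.*.audits duplicates the top-level audits map.
    for (JsonNode category : report.path("reportCategories")) {
      if (category.isObject()) {
        ((ObjectNode) category).remove("audits");
      }
    }
    // Drop the screenshot-thumbnails audit (base64-encoded images).
    JsonNode audits = report.path("audits");
    if (audits.isObject()) {
      ((ObjectNode) audits).remove("screenshot-thumbnails");
    }
    return MAPPER.writeValueAsString(report);
  }
}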

From the commits:

  • defaults to empty string if LH property not in HAR file
  • skip row if too big for BQ; this may be too aggressive, and we may want to consider truncating the JSON data (making it invalid?) or dropping the LH data entirely
@igrigorik (Collaborator)

@rviscomi can you share an example of what these duplicates are?

@rviscomi (Member, Author)

@igrigorik here's the same color-contrast audit in two places:

[screenshot of the duplicated color-contrast audit]

@igrigorik (Collaborator) commented Jun 28, 2017

Hmm, @ebidel why is this data duplicated in LH reports? I imagine optimizing report size is not high on the agenda for LH, but duplication still feels a bit odd.

@ebidel commented Jun 28, 2017

Known issue: GoogleChrome/lighthouse#1999

More info: when we render a report in LH, it's easier when the audits related to a category are nested under reportCategories.audits. But that's something we just need to fix by referencing the audits by their ids instead.
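To illustrate the shape of that fix, hypothetically: each reportCategories entry would keep only something like {"id": "color-contrast", "weight": ...}, and consumers would resolve the full audit from the top-level audits map:

import com.fasterxml.jackson.databind.JsonNode;

// Hypothetical lookup once category entries store only audit ids.
static JsonNode resolveAudit(JsonNode report, String auditId) {
  return report.path("audits").path(auditId);
}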

@igrigorik (Collaborator)

Gotcha, thanks Eric. Would be great to get this fixed upstream. With more and more tools integrating LH reporting, I imagine this will become a recurring pain point.

@ebidel commented Jun 29, 2017

I'm going to start the work upstream to remove the dupes. No promises on timeline though :)

@rviscomi changed the title from "Prune duplicate audits" to "Write Lighthouse reports to a new *_lighthouse table" on Jun 30, 2017
@rviscomi (Member, Author)

Updated to use a separate HAR table. Doesn't seem to have an effect on the number of skipped rows, surprisingly. Also, ~50k pages have NULL reports. The 6/15 table without the changes in this PR has only ~33k "null" reports.

One positive side effect of this change is that the processing takes the normal 45 minutes, rather than 2.5 hours.


@igrigorik (Collaborator)

Faster import time is a nice win. Looks like we'll need to find a way to prune the data a bit more? =/

@ebidel commented Jun 30, 2017

FWIW, I see only 18 rows with non-null reports for 6/20:

SELECT
  count(*)
FROM
  httparchive.har.2017_06_20_android_lighthouse
WHERE
  report != 'null'

@rviscomi (Member, Author)

Sorry, you can ignore the 6/20 tables. Those are temporary and just for testing.

@rviscomi (Member, Author) commented Jul 5, 2017

Here's a list of URLs and ranks for sites omitted from the LH table, presumably because they were too large and skipped: query (results)

The lowest rank is 250, so it does affect some of the more popular sites (e.g. slate, wsj, digg, theknot, etc.).

Using this data, I can analyze these specific HARs for ways to minimize bytes.

@rviscomi (Member, Author)

@igrigorik does this LGTY? It'd be good to standardize the schema for a new LH table for the 7/15 crawl ending soon. I can follow up with optimizations to retain as many reports as possible.

@igrigorik (Collaborator) left a comment

Probably a non-event for the most part, but should probably account for URL size? Other than that, lgtm.

if (pageRowSize > MAX_CONTENT_SIZE) {
String lighthouseRowJSON = MAPPER.writeValueAsString(lighthouseRow);
Integer lighthouseRowSize = lighthouseRowJSON.getBytes("UTF-8").length;
if (lighthouseRowSize > MAX_CONTENT_SIZE) {
@igrigorik (Collaborator) commented on the snippet above:

Should you account for URL size as well? We do this for the body table.

@rviscomi (Member, Author):

Done
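Presumably the resulting guard looks something like this (a sketch extending the snippet above; pageUrl is an assumed name for the page URL that now counts toward the limit):

// Skip the Lighthouse row when the serialized report plus the page URL
// would exceed BigQuery's per-row limit, mirroring the body table.
String lighthouseRowJSON = MAPPER.writeValueAsString(lighthouseRow);
int lighthouseRowSize = lighthouseRowJSON.getBytes("UTF-8").length
    + pageUrl.getBytes("UTF-8").length;  // account for the URL as well
if (lighthouseRowSize > MAX_CONTENT_SIZE) {
  // Too big for BQ: drop the row (truncation is a possible follow-up).
  return;
}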

@igrigorik (Collaborator) left a comment

👍

@rviscomi merged commit 1b11d6a into HTTPArchive:master on Jul 28, 2017
@rviscomi (Member, Author)

Updated production dataflow worker

@ebidel commented Jul 28, 2017 via email

@rviscomi (Member, Author)

@ebidel The 6/1, 6/15, and 7/1 crawls have been beta testing this schema so it's already in prod. The 7/15 crawl looks just about done so it should also be processed in the next few days.

@rviscomi deleted the lighthouse branch on March 26, 2018