From 17bd7430d90b3319c11b5f876fce0b1f47498184 Mon Sep 17 00:00:00 2001 From: Eduardo Filho Date: Mon, 16 Dec 2024 10:31:58 -0500 Subject: [PATCH] Update glam docs --- .spelling | 4 + src/cookbooks/glam.md | 15 +-- src/datasets/glam.md | 253 +++++++++++------------------------------- 3 files changed, 73 insertions(+), 199 deletions(-) diff --git a/.spelling b/.spelling index 445e62d9c..a02401229 100644 --- a/.spelling +++ b/.spelling @@ -495,3 +495,7 @@ rescan unmute 2023-Q4 2024-H1 +is_lineage_mode +schemaFilter +client_scalar_probe_counts +20Definition \ No newline at end of file diff --git a/src/cookbooks/glam.md b/src/cookbooks/glam.md index e9e29344b..7eef6ee5b 100644 --- a/src/cookbooks/glam.md +++ b/src/cookbooks/glam.md @@ -1,10 +1,10 @@ # Introduction to GLAM -GLAM was built to help Mozillians answer their data questions without needing data analysis or coding skills. It contains a visualization layer meant to answer most "easy" questions about how a probe or metric has changed over build ids and releases. +GLAM was built to help Mozillians answer most "easy" questions about how a probe or metric has changed over build ids and releases. GLAM is one of several high-level data tools that we provide at Mozilla. For more information, see [Tools for Data Analysis](../introduction/tools.md). -Access to GLAM is currently limited to Mozilla employees and designated contributors (this [may change in the future](https://bugzilla.mozilla.org/show_bug.cgi?id=1712353)). For more information, see [gaining access](../concepts/gaining_access.md). +Access to GLAM is public! ## How to use GLAM @@ -14,7 +14,7 @@ You can visit GLAM at [`glam.telemetry.mozilla.org`](https://glam.telemetry.mozi ![](../assets/GLAM_screenshots/front-page.png) -The front page includes two main sections: the search bar and the random probe explorer. Fuzzy tech search is implemented to let users search not only by the probe title, but also by the full description. +The front page includes two main sections: the search bar and the random probe explorer. Users can search not only by the probe title, but also by the full description. GLAM is currently serving data for Firefox Desktop and Firefox for Android. @@ -43,15 +43,6 @@ Clicking on a probe or metric name takes you to the individual explorer, where m **`(6)`** shows the volume of clients with each given Build ID -## Differences between GLAM and `telemetry.mozilla.org` dashboard - -GLAM is aggregated per client, `telemetry.mozilla.org` (TMO) is aggregated per ping. This will cause different movements in the visualization between the two systems. Notably: - -- Because GLAM normalizes the aggregations by client ID, a single client weighs equally to all other clients, regardless of how many samples that client sends. -- Reversely, TMO does not normalize by client ID, so if a single client sends a lot of pings, that client will impact the distribution more heavily. This can result in some changes appearing bigger on TMO. - -As of July 2022, TMO serves only Firefox Desktop (telemetry) data, while GLAM supports Firefox Desktop (both telemetry and Glean), Firefox for Android (Fenix), with ongoing efforts to integrate Firefox iOS and more products that use Glean as their telemetry system. - ## Going deeper For more information about the datasets that power GLAM, see [GLAM Datasets](../datasets/glam.md). 
diff --git a/src/datasets/glam.md b/src/datasets/glam.md index 737938c20..8e73be86c 100644 --- a/src/datasets/glam.md +++ b/src/datasets/glam.md @@ -1,151 +1,54 @@ # GLAM datasets -[GLAM](https://glam.telemetry.mozilla.org) aims to answer a majority of the "easy" questions of how a probe or metric has changed over time. -The GLAM aggregation tables are useful for accessing the data that drives GLAM if more exploration is required. -Exploring the GLAM tables could take that a little farther, but still has some limitations. -If you need to dive deeper or aggregate on a field that isn't included here, consider reading [Visualizing Percentiles of a Main Ping Exponential Histogram](https://docs.telemetry.mozilla.org/cookbooks/main_ping_exponential_histograms.html). - -The GLAM tables: - -- Are aggregated at the client level, not the submission ping level -- Provide a set of dimensions for subsets: channel, OS, process or ping type -- Are aggregated by build ID and version -- For each aggregation, the distribution and percentiles over time are calculated -- Have the last 3 versions of data aggregated every day -- Retain data for up to 10 major versions - -## Firefox Desktop - -### Data source table - -- `moz-fx-data-shared-prod.telemetry.client_probe_counts` - -### Data reference - -- `os`: One of Windows, Mac, Linux, or NULL for all OSes -- `app_version`: Integer representing the major version -- `app_build_id`: The full build ID, or NULL if aggregated by major version -- `channel`: One of nightly, beta, or release -- `metric`: The name of the metric -- `metric_type`: The type of metric, e.g. `histogram-enumerated` -- `key`: The key if the metric is a keyed metric -- `process`: The process -- `client_agg_type`: The type of client aggregation used, e.g. `summed_histogram` -- `agg_type`: One of histogram or percentiles representing what data will be in the `aggregates` column -- `total_users`: The number of users that submitted data for the combination of dimensions -- `aggregates`: The data as a key/value record, either percentiles or histogram - -### Sample query - -```sql -SELECT - os, - app_version, - app_build_id, - channel, - metric, - metric_type, - key, - process, - client_agg_type, - agg_type, - total_users, - mozfun.glam.histogram_cast_json(aggregates) AS aggregates -FROM - `moz-fx-data-shared-prod.telemetry.client_probe_counts` -WHERE - metric="checkerboard_severity" - AND channel="nightly" - AND os IS NULL - AND process="parent" - AND app_build_id IS NULL -``` - -Notes: - -- To query all OSes, use: `WHERE os IS NULL` -- To query by build ID, use: `WHERE app_build_id IS NOT NULL` -- To query by version, use: `WHERE app_build_id IS NULL` - -## Firefox for Android - -### Data source tables - -- `org_mozilla_fenix_glam_release__view_probe_counts_v1` -- `org_mozilla_fenix_glam_beta__view_probe_counts_v1` -- `org_mozilla_fenix_glam_nightly__view_probe_counts_v1` - -## Data reference - -- `os`: Just "Android" for now -- `app_version`: Integer representing the major version -- `app_build_id`: The full build ID, or "\*" if aggregated by major version -- `channel`: Always "\*", use the different source tables to select a channel -- `metric`: The name of the metric -- `metric_type`: The type of metric, e.g. `timing_distribution` -- `key`: The key if the metric is a keyed metric -- `ping_type`: The type of ping, or "\*" for all ping types -- `client_agg_type`: The type of client aggregation used, e.g. 
`summed_histogram` -- `agg_type`: One of histogram or percentiles representing what data will be in the `aggregates` column -- `total_users`: The number of users that submitted data for the combination of dimensions -- `aggregates`: The data as a key/value record, either percentiles or histogram - -### Sample query - -```sql -SELECT - ping_type, - os, - app_version, - app_build_id, - metric, - metric_type, - key, - client_agg_type, - agg_type, - total_users, - mozfun.glam.histogram_cast_json(aggregates) AS aggregates, -FROM - `moz-fx-data-shared-prod.glam_etl.org_mozilla_fenix_glam_release__view_probe_counts_v1` -WHERE - metric="performance_time_dom_complete" - AND os="Android" - AND ping_type="*" - AND app_build_id!="*" -``` - -Notes: - -- To query all ping types, use: `WHERE ping_type = "*"` -- To query by build ID, use: `WHERE app_build_id != "*"` -- To query by version, use: `WHERE app_build_id = "*"` - -## GLAM Intermediate Tables - -In addition to the above tables, the GLAM ETL produces intermediate tables that can be useful outside of the GLAM ETL in some cases. -These tables include the client ID and could be joined with other tables to filter by client based data (e.g. specific hardware). - -### Firefox Desktop - -Data sources: - -- `moz-fx-data-shared-prod.telemetry.clients_daily_histogram_aggregates` - `moz-fx-data-shared-prod.telemetry.clients_daily_scalar_aggregates` - -These tables are: - -- Preprocessed from main telemetry to intermediate data with one row per client per metric per day, then aggregated normalizing across clients. -- Clients daily aggregates analogous to clients daily with: - - all metrics aggregated - - each scalar includes min, max, average, sum, and count aggregations - - each histogram aggregated over all client data per day - - each date is further aggregated over the dimensions: channel, os, version, build ID +[GLAM](https://glam.telemetry.mozilla.org) provides aggregated telemetry data in a way that makes it easy to understand how a given probe or metric has been changing over subsequent builds. GLAM aggregations are statistically validated by data scientists to ensure an accurate picture of the observed behavior of telemetry data. + +GLAM data is also meant to be explored by itself: GLAM aggregation tables are useful for accessing the data that drives GLAM if more digging is required. Please read through the next section to learn more! + +## GLAM final tables (Aggregates dataset) + +The following datasets are split into three categories: Firefox Desktop (Glean), Firefox Desktop (Legacy Telemetry), and Firefox for Android. The tables contain the final aggregated data that powers GLAM. + +Each link below points to the dataset's page on [Mozilla's Data Catalog](https://mozilla.acryl.io/) where you can find the dataset's full documentation. + +> **_NOTE:_** You may find that the Aggregates dataset does not have the dimensions you need. For example, the dataset does not contain client-level or day-by-day aggregations. +> If you need to dive deeper or aggregate on a field that isn't included in the Aggregates dataset, you will need to write queries against raw telemetry tables. If that is the case, you don't have to start from scratch: GLAM has the `View SQL Query` -> `Telemetry SQL` feature, which gives you a working query that you can tweak. The feature is accessible once you pick a metric or probe. 
Additionally, you can read other material such as [Visualizing Percentiles of a Main Ping Exponential Histogram](https://docs.telemetry.mozilla.org/cookbooks/main_ping_exponential_histograms.html) to learn how to write queries that give you what you need. Finally, the #data-help channel on Slack welcomes all data-related questions. + +### Firefox Desktop (Glean) + +- [`moz-fx-data-shared-prod.glam_etl.glam_fog_nightly_aggregates`]() +- [`moz-fx-data-shared-prod.glam_etl.glam_fog_beta_aggregates`]() +- [`moz-fx-data-shared-prod.glam_etl.glam_fog_release_aggregates`]() + +### Firefox for Android + +- [`moz-fx-data-shared-prod.glam_etl.glam_fenix_nightly_aggregates`]() +- [`moz-fx-data-shared-prod.glam_etl.glam_fenix_beta_aggregates`]() +- [`moz-fx-data-shared-prod.glam_etl.glam_fenix_release_aggregates`]() + +### Firefox Desktop (Legacy Telemetry) + +- [`moz-fx-data-shared-prod.glam_etl.glam_desktop_nightly_aggregates`]() +- [`moz-fx-data-shared-prod.glam_etl.glam_desktop_beta_aggregates`]() +- [`moz-fx-data-shared-prod.glam_etl.glam_desktop_release_aggregates`]() + +In addition to the above tables, the GLAM ETL saves the intermediate data produced by each transformation step. The next section provides an overview of each step and the dataset it produces. ## ETL Pipeline ### Scheduling -GLAM is scheduled to run daily via Airflow. There are two separate ETL pipelines for computing GLAM datasets for [Firefox Desktop legacy](https://workflow.telemetry.mozilla.org/dags/glam/grid), [Fenix](https://workflow.telemetry.mozilla.org/dags/glam_fenix/grid) and [Firefox on Glean](https://workflow.telemetry.mozilla.org/dags/glam_fog/grid). +Most of the GLAM ETL is scheduled to run daily via Airflow. There are separate ETL pipelines for computing GLAM datasets: + +- [Firefox Desktop on Glean](https://workflow.telemetry.mozilla.org/dags/glam_fog/grid) + - Runs daily + - For release, only the `daily_` jobs (the first "half" of the ETL) are processed +- [Firefox Desktop on Glean (release)](https://workflow.telemetry.mozilla.org/dags/glam_fog_release/grid) + - Runs weekly + - Processes the second "half" of the release ETL +- [Firefox Desktop legacy](https://workflow.telemetry.mozilla.org/dags/glam/grid) + - Runs daily +- [Firefox for Android](https://workflow.telemetry.mozilla.org/dags/glam_fenix/grid) + - Runs daily ### Source Code @@ -155,11 +58,11 @@ The ETL code base lives in the [bigquery-etl repository](https://github.com/mozi GLAM has a separate set of steps and intermediate tables to aggregate scalar and histogram probes. -#### `latest_versions` +#### [`latest_versions`]() - This task pulls in the most recent version for each channel from https://product-details.mozilla.org/1.0/firefox_versions.json -#### `clients_daily_histogram_aggregates_` +#### [`clients_daily_histogram_aggregates_`]() - The set of steps that load data to this table are divided into different processes (`parent`, `content`, `gpu`) plus a keyed step for keyed histograms. - The parent job creates or overwrites the partition corresponding to the `logical_date`, and other processes append data to that partition. @@ -172,38 +75,34 @@ GLAM has a separate set of steps and intermediate tables to aggregate scalar and - Clients that are on the release channel of the Windows operating system get sampled to reduce the data size. - The partitions are set to expire after 7 days. 
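+
+A minimal sketch of how one might peek at this intermediate table (the table path comes from the GLAM ETL; the `submission_date` partition column is an assumption, so check the schema in the Data Catalog before relying on it):
+
+```sql
+-- Inspect a recent partition of the per-client daily histogram aggregates.
+-- Partitions expire after 7 days, so only recent dates are available.
+SELECT
+  *
+FROM
+  `moz-fx-data-shared-prod.telemetry.clients_daily_histogram_aggregates`
+WHERE
+  submission_date = "2024-12-01" -- assumed partition column
+LIMIT
+  10
+```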
-#### `clients_histogram_aggregates_new` +#### [`clients_histogram_aggregates_new`]() - This step groups together all rows that have the same `submission_date` and `logical_date` from different processes and keyed and non-keyed sources, and combines them into a single row in the `histogram_aggregates` column. It sums the histogram values with the same key. - This process is only applied to the last three versions. - The table is overwritten at every execution of this step. -#### `clients_histogram_aggregates` +#### [`clients_histogram_aggregates`]() +- This is the most important histogram table in the intermediate dataset: each row represents a `client_id` with the cumulative sum of its histograms over the last 3 versions of all metrics. - New entries from `clients_histogram_aggregates_new` are merged with the 3 last versions of previous day’s partition and written to the current day’s partition. -- The most recent partition contains the current snapshot of the last three versions of data. -- The partitions expire in 7 days. +- This table only holds the most recent `submission_date`, which marks the most recent date of data ingestion. A check before running this job ensures that the ETL does not skip days. In other words, the ETL only processes date `d` if the last date processed was `d-1`. +- In case of failures in the GLAM ETL, this table must be backfilled one day at a time. -#### `clients_histogram_buckets_counts` +#### [`clients_histogram_buckets_counts`]() -- This process starts by creating wildcards for `os` and `app_build_id` which are needed for aggregating values across os and build IDs later on. -- It then filters out builds that have less than 0.5% of WAU (which can vary per channel). This is referenced in https://github.com/mozilla/glam/issues/1575#issuecomment-946880387. -- The process then normalizes histograms per client - it sets the sum of histogram values for each client for a given metric to 1. -- Finally, it removes the `client_id` dimension by aggregating all histograms for a given metric and adding the clients' histogram values. +- This process creates wildcards for `os` and `app_build_id`, which are needed for aggregating values across os and build IDs later on. +- It then adds a normalized histogram per client, while keeping a non-normalized histogram. +- Finally, it removes the `client_id` dimension by breaking histograms into key/value pairs and summing all values of the same key for the same metric/os/version/build. -#### `clients_histogram_probe_counts` +#### [`clients_histogram_probe_counts`]() -- This process generates buckets - which can be linear or exponential - based on the `metric_type`. +- This process uses the `metric_type` to select the algorithm that rebuilds histograms from the broken-down buckets of the previous step. Histograms can be `linear`, `exponential`, or `custom`. - It then aggregates metrics per wildcards (`os`, `app_build_id`). - Finally, it rebuilds histograms using the Dirichlet Distribution, normalized using the number of clients that contributed to that histogram in the `clients_histogram_buckets_counts` step. -#### `histogram_percentiles` - -- Uses `mozfun.glam.percentile` UDF to build histogram percentiles, from [0.1 to 99.9] - --- -#### `clients_daily_scalar_aggregates` +#### [`clients_daily_scalar_aggregates`]() - The set of steps that load data to this table are divided into non-keyed `scalar`, `keyed_boolean` and `keyed_scalar`. 
The non-keyed scalar job creates or overwrites the partition corresponding to the `logical_date`, and other processes append data to that partition. - The process uses `telemetry.buildhub2` to select rows with valid `build_ids`. @@ -213,42 +112,22 @@ GLAM has a separate set of steps and intermediate tables to aggregate scalar and - Clients that are on the release channel of the Windows operating system get sampled to reduce the data size. - Partitions expire in 7 days. -#### `clients_scalar_aggregates` +#### [`clients_scalar_aggregates`]() -- The process starts by taking the `clients_daily_scalar_aggregates` as the primary source. -- It then groups all rows that have the same `submission_date` and `logical_date` from the keyed and non-keyed, scalar and boolean sources, and combines them into a single row in the `scalar_aggregates` column. +- This process groups all rows with the same `submission_date` and `logical_date` from `clients_daily_scalar_aggregates` and combines them into a single row in the `scalar_aggregates` column. - If the `agg_type` is `count`, `sum`, `true`, or `false`, the process will sum the values. - If the `agg_type` is `max`, it will take the maximum value, and if it is `min`, it will take the minimum value. - This process is only applied to the last three versions. -- The partitions expire in 7 days. - -#### `scalar_percentiles` +- The table is partitioned by `submission_date`. The partitions expire in 7 days. -- This process produces a user count and percentiles for scalar metrics. -- It generates wildcard combinations of `os` and `app_build_id` and merges all submissions from a client for the same `os`, `app_version`, `app_build_id` and channel into the `scalar_aggregates` column. -- The `user_count` column is computed taking sampling into account. -- Finally it splits the aggregates into percentiles from [0.1 to 99.9] - -#### `client_scalar_probe_counts` +#### [`client_scalar_probe_counts`]() - This step processes booleans and scalars, although booleans are not supported by GLAM. - - For boolean metrics the process aggregates their values with the following rule: "never" if all values for a metric are false, "always" if all values are true, and sometimes if there's a mix. - - For scalar and `keyed_scalar` probes the process starts by building the buckets per metric, then it generates wildcards for os and `app_build_id`. It then aggregates all submissions from the same `client_id` under one row and assigns the `user_count` column to it with the following rule: 10 if os is "Windows" and channel is "release", 1 otherwise. After that it finishes by aggregating the rows per metric, placing the scalar values in their appropriate buckets and summing up all `user_count` values for that metric. +- For boolean metrics the process aggregates their values with the following rule: "never" if all values for a metric are false, "always" if all values are true, and "sometimes" if there's a mix. +- For scalar and `keyed_scalar` probes the process starts by building the buckets per metric, then it generates wildcards for `os` and `app_build_id`. It then aggregates all submissions from the same `client_id` under one row and assigns the `user_count` column to it with the following rule: 10 if `os` is "Windows" and channel is "release", 1 otherwise. After that, it finishes by aggregating the rows per metric, placing the scalar values in their appropriate buckets, and summing up all `user_count` values for that metric. 
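+
+As an illustration of the `user_count` rule above, a hedged sketch (the table path and column names are assumptions based on this document, not the actual ETL query, which lives in bigquery-etl):
+
+```sql
+-- Per the rule above: Windows release clients are sampled upstream, so each
+-- one is weighted 10; every other client is weighted 1.
+SELECT
+  client_id,
+  CASE
+    WHEN os = "Windows" AND channel = "release" THEN 10
+    ELSE 1
+  END AS user_count
+FROM
+  `moz-fx-data-shared-prod.telemetry.clients_scalar_aggregates` -- assumed path
+```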
--- -#### `glam_user_counts` - -- Combines both aggregated scalar and histogram values. -- This process produces a user count for each combination of `os`, `app_version`, `app_build_id`, channel. -- It builds a client count from the union of histograms and scalars, including all combinations in which `os`, `app_version`, `app_build_id`, and `channel` are wildcards. - -#### `glam_sample_counts` - -- This process calculates the `total_sample` column by adding up all the `aggregates` values. -- This works because in the primary sources the values also represent a count of the samples that registered their respective keys - -#### `extract_user_counts` +#### [`glam_sample_counts`]() -- This step exports user counts in its final shape to GCS as a CSV. -- It first copies a deduplicated version of the primary source to a temporary table, removes the previously exported CSV files from GCS, then exports the temporary table to GCS as CSV files. +- This process calculates the `total_sample` column by adding up all the `aggregates` values, which works because in the primary sources those values represent counts of the samples that registered their respective keys.
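+
+A hedged sketch of that calculation (it assumes `aggregates` is an array of key/value structs whose values are sample counts, as the probe-counts views earlier in this document suggest; the actual ETL query lives in bigquery-etl):
+
+```sql
+-- total_sample: add up every value in the aggregates record. This works
+-- because in the primary sources each value is a count of the samples
+-- that registered its key.
+SELECT
+  metric,
+  SUM(SAFE_CAST(a.value AS FLOAT64)) AS total_sample
+FROM
+  `moz-fx-data-shared-prod.telemetry.client_probe_counts`, -- one possible primary source
+  UNNEST(aggregates) AS a
+GROUP BY
+  metric
+```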