GLAM artifact docs (#800)
scholtzan authored Nov 21, 2023
1 parent 1dec626 commit 6bcaabf
Showing 2 changed files with 114 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .spelling
@@ -422,6 +422,8 @@ wakeup
webhook
whitelist
whitelists
wildcard
wildcards
workgroup
workgroups
WAU
112 changes: 112 additions & 0 deletions src/datasets/glam.md
@@ -140,3 +140,115 @@ These tables are:
- each scalar includes min, max, average, sum, and count aggregations
- each histogram aggregated over all client data per day
- each date is further aggregated over the dimensions: channel, os, version, build ID

## ETL Pipeline

### Scheduling

GLAM is scheduled to run daily via Airflow. There are separate ETL pipelines for computing GLAM datasets for [Firefox Desktop legacy](https://workflow.telemetry.mozilla.org/dags/glam/grid), [Fenix](https://workflow.telemetry.mozilla.org/dags/glam_fenix/grid), and [Firefox on Glean](https://workflow.telemetry.mozilla.org/dags/glam_fog/grid).

### Source Code

The ETL code base lives in the [bigquery-etl repository](https://github.com/mozilla/bigquery-etl) and is partially generated. The scripts for generating ETL queries for Firefox Desktop legacy currently live [here](https://github.com/mozilla/bigquery-etl/tree/main/script/glam), while the GLAM logic for Glean apps lives [here](https://github.com/mozilla/bigquery-etl/tree/main/bigquery_etl/glam).

### Steps

GLAM has a separate set of steps and intermediate tables to aggregate scalar and histogram probes.

#### `latest_versions`

- This task pulls in the most recent version for each channel from https://product-details.mozilla.org/1.0/firefox_versions.json
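
For illustration, a small Python sketch of what this lookup amounts to. The actual step is part of the generated ETL, and the JSON key names used here are assumptions about the product-details payload:

```python
import json
import urllib.request

URL = "https://product-details.mozilla.org/1.0/firefox_versions.json"

# Key names below are assumptions about the product-details payload;
# the real ETL maps them to GLAM's channels (release, beta, nightly).
CHANNEL_KEYS = {
    "release": "LATEST_FIREFOX_VERSION",
    "beta": "LATEST_FIREFOX_DEVEL_VERSION",
    "nightly": "FIREFOX_NIGHTLY",
}

def latest_major_versions() -> dict:
    """Return the most recent major version number per channel."""
    with urllib.request.urlopen(URL) as response:
        versions = json.load(response)
    # "121.0a1" -> 121
    return {
        channel: int(versions[key].split(".")[0])
        for channel, key in CHANNEL_KEYS.items()
    }

print(latest_major_versions())  # e.g. {'release': 120, 'beta': 121, 'nightly': 122}
```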

#### `clients_daily_histogram_aggregates_<process>`

- The steps that load data into this table are split by process (`parent`, `content`, `gpu`), plus a separate step for keyed histograms.
- The parent job creates or overwrites the partition corresponding to the `logical_date`, and other processes append data to that partition.
- The process uses `telemetry.buildhub2` to select rows with valid `build_ids`.
- Aggregations are done per client, per day, and produce one row per combination of `submission_date`, `client_id`, `os`, `app_version`, `build_id`, and `channel`.
- The aggregation adds histogram values that share the same key within each combination of the dimensions listed above (see the sketch after this list).
- The queries for the different steps are generated and run as part of each step.
- The "keyed" step includes all keyed histogram probes, regardless of process (`parent`, `content`, `gpu`).
- Because of these subdivisions, this step generates separate rows for each process and for keyed and non-keyed metrics; these rows are grouped together later in the `clients_histogram_aggregates` step.
- Clients on the release channel running Windows are sampled to reduce the data size.
- The partitions are set to expire after 7 days.
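
As a rough Python illustration of the per-client roll-up described above (the production step is a generated BigQuery query; the ping histograms here are toy data):

```python
from collections import defaultdict

def sum_histograms(histograms):
    """Combine histograms by adding values that share the same bucket key."""
    combined = defaultdict(int)
    for hist in histograms:
        for bucket, count in hist.items():
            combined[bucket] += count
    return dict(combined)

# Two pings from the same client_id / submission_date / os / app_version /
# build_id / channel collapse into a single aggregated row.
ping_1 = {"0": 5, "1": 2, "4": 1}
ping_2 = {"1": 3, "4": 7}
print(sum_histograms([ping_1, ping_2]))  # {'0': 5, '1': 5, '4': 8}
```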

#### `clients_histogram_aggregates_new`

- This step groups together all rows from the different processes and from the keyed and non-keyed sources that share the same `submission_date` and `logical_date`, and combines them into a single row in the `histogram_aggregates` column, summing histogram values that have the same key.
- This process is only applied to the last three versions.
- The table is overwritten at every execution of this step.

#### `clients_histogram_aggregates`

- New entries from `clients_histogram_aggregates_new` are merged with the last three versions from the previous day’s partition and written to the current day’s partition (see the sketch after this list).
- The most recent partition contains the current snapshot of the last three versions of data.
- The partitions expire in 7 days.
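
A minimal Python sketch of how the snapshot update can be pictured, assuming a mapping from a dimension key to a histogram and assuming that matching rows are combined by summing their histograms, as in the earlier steps. The real step is a BigQuery query over partitioned tables, and the version filter shown here is only illustrative:

```python
def merge_snapshot(previous, new_rows, latest_version):
    """Merge new per-client aggregates into the running snapshot.

    `previous` and `new_rows` map a dimension key
    (client_id, os, app_version, build_id, channel, metric)
    to a histogram dict; histograms with the same key are summed.
    """
    snapshot = {}
    for source in (previous, new_rows):
        for key, hist in source.items():
            app_version = key[2]
            if app_version < latest_version - 2:  # keep only the last three versions
                continue
            merged = snapshot.setdefault(key, {})
            for bucket, count in hist.items():
                merged[bucket] = merged.get(bucket, 0) + count
    return snapshot

previous = {("c1", "Windows", 120, "2023111800", "release", "gc_ms"): {"10": 4}}
new_rows = {("c1", "Windows", 120, "2023111800", "release", "gc_ms"): {"10": 1, "20": 2}}
print(merge_snapshot(previous, new_rows, latest_version=121))
# {('c1', 'Windows', 120, '2023111800', 'release', 'gc_ms'): {'10': 5, '20': 2}}
```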

#### `clients_histogram_buckets_counts`

- This process starts by creating wildcards for `os` and `app_build_id`, which are needed later on for aggregating values across operating systems and build IDs.
- It then filters out builds that have less than 0.5% of WAU (a threshold that can vary per channel); see https://github.com/mozilla/glam/issues/1575#issuecomment-946880387 for background.
- The process then normalizes histograms per client: for a given metric, each client's histogram values are scaled so that they sum to 1.
- Finally, it removes the `client_id` dimension by aggregating all histograms for a given metric, adding up the clients' normalized histogram values (see the sketch after this list).
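
A toy Python version of the normalization and client roll-up described above:

```python
def normalize(hist):
    """Scale a client's histogram so its values sum to 1."""
    total = sum(hist.values())
    return {bucket: value / total for bucket, value in hist.items()} if total else hist

def aggregate_clients(client_histograms):
    """Drop the client_id dimension by summing normalized histograms."""
    aggregated = {}
    for hist in client_histograms:
        for bucket, value in normalize(hist).items():
            aggregated[bucket] = aggregated.get(bucket, 0.0) + value
    return aggregated

# Each client contributes equally, regardless of how many samples it recorded.
clients = [{"0": 3, "1": 1}, {"0": 1, "1": 1}]
print(aggregate_clients(clients))  # {'0': 1.25, '1': 0.75}
```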

#### `clients_histogram_probe_counts`

- This process generates buckets, which can be linear or exponential, based on the `metric_type`.
- It then aggregates metrics across the wildcard dimensions (`os`, `app_build_id`).
- Finally, it rebuilds histograms using the Dirichlet distribution, normalized by the number of clients that contributed to that histogram in the `clients_histogram_buckets_counts` step (see the sketch below).
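
The Python sketch below illustrates the general idea. The exact bucketing and Dirichlet formulas live in the GLAM UDFs in bigquery-etl, so both functions here (log-spaced buckets and the posterior-mean form `(v_k + 1/K) / (N + 1)`) are assumptions about the shape of the computation, not the exact implementation:

```python
import math

def exponential_buckets(low, high, n_buckets):
    """Roughly log-spaced bucket boundaries (illustrative, not the exact UDF)."""
    low = max(low, 1)
    log_low, log_high = math.log(low), math.log(high)
    step = (log_high - log_low) / (n_buckets - 1)
    return sorted({round(math.exp(log_low + i * step)) for i in range(n_buckets)})

def dirichlet_estimate(bucket_sums, n_clients):
    """Posterior-mean smoothing: (v_k + 1/K) / (N + 1).

    `bucket_sums` are the client-normalized sums from
    clients_histogram_buckets_counts; this is the commonly cited form,
    not necessarily the exact one used by the UDFs.
    """
    k = len(bucket_sums)
    return {
        bucket: (value + 1.0 / k) / (n_clients + 1)
        for bucket, value in bucket_sums.items()
    }

print(exponential_buckets(1, 1000, 10))
print(dirichlet_estimate({"0": 1.25, "1": 0.75}, n_clients=2))
```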

#### `histogram_percentiles`

- Uses the `mozfun.glam.percentile` UDF to build histogram percentiles from 0.1 to 99.9 (see the sketch below).
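
A simplified Python stand-in for the percentile computation (the production version is the `mozfun.glam.percentile` UDF; this version just walks the cumulative distribution of a normalized histogram):

```python
def histogram_percentile(hist, percentile):
    """Return the bucket at which the cumulative share reaches `percentile` (0-100)."""
    total = sum(hist.values())
    target = total * percentile / 100.0
    cumulative = 0.0
    for bucket in sorted(hist, key=float):
        cumulative += hist[bucket]
        if cumulative >= target:
            return bucket
    return bucket  # fall back to the last bucket

hist = {"1": 0.2, "2": 0.5, "8": 0.3}
print(histogram_percentile(hist, 50))    # '2'
print(histogram_percentile(hist, 99.9))  # '8'
```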

---

#### `clients_daily_scalar_aggregates`

- The steps that load data into this table are divided into non-keyed `scalar`, `keyed_boolean`, and `keyed_scalar` jobs. The non-keyed scalar job creates or overwrites the partition corresponding to the `logical_date`, and the other jobs append data to that partition.
- The process uses `telemetry.buildhub2` to select rows with valid `build_ids`.
- Aggregations are done per client, per day, and produce one row per combination of `client`, `os`, `app_version`, `build_id`, and `channel` (see the sketch after this list).
- The queries for the different steps are generated and run as part of each step. All steps include probes regardless of process (`parent`, `content`, `gpu`).
- Because of these subdivisions, this step generates separate rows for each keyed/non-keyed and boolean/scalar metric; these rows are grouped together later in `clients_scalar_aggregates`.
- Clients on the release channel running Windows are sampled to reduce the data size.
- Partitions expire in 7 days.
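
For reference, the per-client aggregations for one scalar probe amount to something like the toy Python below; the real step is a generated BigQuery query, and the exact set of `agg_type`s recorded per row is defined by those queries:

```python
def scalar_aggregates(values):
    """Per-client daily aggregates for one scalar probe."""
    return {
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
    }

# All values a client reported for a probe on one submission_date.
print(scalar_aggregates([3, 7, 5]))
# {'min': 3, 'max': 7, 'count': 3, 'sum': 15, 'avg': 5.0}
```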

#### `clients_scalar_aggregates`

- The process starts by taking the `clients_daily_scalar_aggregates` as the primary source.
- It then groups all rows that have the same `submission_date` and `logical_date` from the keyed and non-keyed, scalar and boolean sources, and combines them into a single row in the `scalar_aggregates` column.
- If the `agg_type` is `count`, `sum`, `true`, or `false`, the process sums the values.
- If the `agg_type` is `max`, it takes the maximum value; if it is `min`, it takes the minimum value (see the sketch after this list).
- This process is only applied to the last three versions.
- The partitions expire in 7 days.
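
A small Python sketch of the combination rules listed above:

```python
def combine(agg_type, values):
    """Combine daily values of one agg_type across rows, per the rules above."""
    if agg_type in ("count", "sum", "true", "false"):
        return sum(values)
    if agg_type == "max":
        return max(values)
    if agg_type == "min":
        return min(values)
    raise ValueError(f"unexpected agg_type: {agg_type}")

# Example: three daily rows for the same client and dimensions.
print(combine("sum", [10, 4, 1]))  # 15
print(combine("max", [10, 4, 1]))  # 10
print(combine("min", [10, 4, 1]))  # 1
```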

#### `scalar_percentiles`

- This process produces a user count and percentiles for scalar metrics.
- It generates wildcard combinations of `os` and `app_build_id` and merges all submissions from a client for the same `os`, `app_version`, `app_build_id`, and `channel` into the `scalar_aggregates` column.
- The `user_count` column is computed taking sampling into account.
- Finally, it splits the aggregates into percentiles from 0.1 to 99.9.

#### `client_scalar_probe_counts`

- This step processes booleans and scalars, although booleans are not supported by GLAM.
- For boolean metrics the process aggregates their values with the following rule: "never" if all values for a metric are false, "always" if all values are true, and "sometimes" if there is a mix.
- For scalar and `keyed_scalar` probes the process starts by building the buckets per metric and generating wildcards for `os` and `app_build_id`. It then aggregates all submissions from the same `client_id` into one row and assigns it a `user_count` of 10 if the `os` is "Windows" and the channel is "release", and 1 otherwise. Finally, it aggregates the rows per metric, placing the scalar values into their appropriate buckets and summing up all `user_count` values for that metric (see the sketch below).
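
A toy Python version of the boolean rule and the sampling weight described above:

```python
def boolean_agg(values):
    """'always' if every value is true, 'never' if none are, otherwise 'sometimes'."""
    if all(values):
        return "always"
    if not any(values):
        return "never"
    return "sometimes"

def user_count(os, channel):
    """Weight applied to a client to undo sampling (per the rule above)."""
    return 10 if os == "Windows" and channel == "release" else 1

print(boolean_agg([True, True]))         # 'always'
print(boolean_agg([True, False, True]))  # 'sometimes'
print(user_count("Windows", "release"))  # 10
print(user_count("Darwin", "release"))   # 1
```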

---

#### `glam_user_counts`

- Combines both aggregated scalar and histogram values.
- This process produces a user count for each combination of `os`, `app_version`, `app_build_id`, and `channel`.
- It builds a client count from the union of histograms and scalars, including all combinations in which `os`, `app_version`, `app_build_id`, and `channel` are wildcards.
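
The wildcard roll-up can be pictured with the toy Python below, which counts distinct clients per combination; only the `os` and `app_build_id` wildcards are shown, and the input rows are made up:

```python
from itertools import product

def user_counts(rows):
    """Count distinct clients per (os, app_build_id) combination, including wildcards."""
    clients = {}
    for client_id, os, build_id in rows:
        # Each client is counted for its concrete combination and for every
        # combination where a dimension is replaced by the '*' wildcard.
        for os_key, build_key in product((os, "*"), (build_id, "*")):
            clients.setdefault((os_key, build_key), set()).add(client_id)
    return {key: len(ids) for key, ids in clients.items()}

rows = [("a", "Windows", "2023111800"), ("b", "Darwin", "2023111800")]
counts = user_counts(rows)
print(counts[("*", "*")])           # 2
print(counts[("Windows", "*")])     # 1
print(counts[("*", "2023111800")])  # 2
```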

#### `glam_sample_counts`

- This process calculates the `total_sample` column by adding up all the `aggregates` values.
- This works because, in the primary sources, the values also represent a count of the samples that registered their respective keys.

#### `extract_user_counts`

- This step exports user counts in their final shape to GCS as CSV files.
- It first copies a deduplicated version of the primary source to a temporary table, removes the previously exported CSV files from GCS, and then exports the temporary table to GCS as CSV files.
