use_cases generated
paul.marcombes committed Oct 18, 2024
1 parent 22c91b3 commit dbffbfe
Showing 73 changed files with 2,912 additions and 0 deletions.
59 changes: 59 additions & 0 deletions use_cases/json_query.md
Let's imagine you have a BigQuery table storing user activity logs, where each row contains a JSON string representing various actions a user took within a session. The JSON structure might look like this:

```json
{
  "userId": "12345",
  "sessionId": "abcde",
  "actions": [
    {"type": "pageview", "url": "/home"},
    {"type": "click", "element": "button1"},
    {"type": "form_submit", "data": {"name": "John", "email": "john@example.com"}},
    {"type": "pageview", "url": "/products"},
    {"type": "click", "element": "addtocart"}
  ]
}
```

Here are a few use cases for the `json_query` function with this data:

1. **Extracting all URLs visited during a session:**

```sql
SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[*].url') AS visited_urls
FROM your_table
WHERE userId = '12345' AND sessionId = 'abcde';
```

This query would return an array like `["/home", "/products"]`.

2. **Finding all "click" actions and the elements clicked:**

```sql
SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[?type==`click`].element') AS clicked_elements
FROM your_table
WHERE userId = '12345' AND sessionId = 'abcde';
```

This would return `["button1", "addtocart"]`.

3. **Getting the data submitted in a form:**

```sql
SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[?type==`form_submit`].data') AS form_data
FROM your_table
WHERE userId = '12345' AND sessionId = 'abcde';
```

This would return an array containing a single object: `[{"name": "John", "email": "john@example.com"}]`. You could further refine this to get specific fields within the `data` object, as shown in the sketch after this list.

4. **Checking if a specific action type occurred:**

```sql
SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[?type==`purchase`]') IS NOT NULL AS purchased
FROM your_table
WHERE userId = '12345' AND sessionId = 'abcde';
```

This query returns `true` if a "purchase" action exists in the `actions` array and `false` otherwise. (If the function returns an empty array such as `[]` rather than `NULL` when nothing matches, compare against that empty result instead.)
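5. **Refining a query to extract nested fields:**

A sketch of the refinement mentioned in example 3, drilling into the `data` object with a deeper JMESPath expression against the same sample table:

```sql
SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[?type==`form_submit`].data.email') AS submitted_emails
FROM your_table
WHERE userId = '12345' AND sessionId = 'abcde';
```

For the sample session above, this would return `["john@example.com"]`.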

These examples demonstrate the flexibility of `json_query` for extracting and analyzing data from complex JSON structures within BigQuery. The function's use of JMESPath allows for complex filtering and projections, simplifying tasks that would otherwise require more complicated SQL or User-Defined Functions (UDFs).
30 changes: 30 additions & 0 deletions use_cases/json_schema.md
A use case for the `json_schema` function is to dynamically determine the schema of JSON data stored in a BigQuery table without prior knowledge of its structure. This can be particularly helpful in situations like:

* **Data ingestion from diverse sources:** Imagine receiving JSON data from various APIs or partners where the structure might not be consistent or documented thoroughly. `json_schema` can be used to automatically analyze a sample of the incoming data and infer its schema. This information can then be used to create or validate table schemas, ensuring proper data loading.

* **Data exploration and analysis:** When dealing with unfamiliar JSON data, `json_schema` helps quickly understand its structure and the types of information it contains. This is useful for exploratory data analysis and building queries without manually examining the JSON objects.

* **Schema evolution tracking:** By periodically applying `json_schema` to incoming data, you can detect changes in the JSON structure over time. This allows you to adapt your processing pipelines or table schemas as needed, ensuring compatibility and avoiding errors.

* **Data validation:** After inferring the schema, it can be used to validate subsequent JSON data against the expected structure. This can prevent malformed data from being ingested, ensuring data quality.

* **Automated documentation:** The output of `json_schema` can be used to generate documentation for the JSON data, simplifying communication and understanding among different teams or users.


**Example Scenario:**

Let's say you have a BigQuery table containing a `raw_data` column storing JSON strings from different sources. You can use the following query to get the schema of the JSON data in each row:

```sql
SELECT bigfunctions.us.json_schema(raw_data) AS inferred_schema
FROM your_dataset.your_table;
```

This will return a table where each row contains the inferred schema of the corresponding JSON data in `raw_data`. You can then further process this output to:

* Identify the most common schema across different JSON data (see the sketch after this list).
* Create a new table with the appropriate schema to store the extracted JSON data in a structured format.
* Flag rows with unexpected schemas for further investigation.
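
A minimal sketch of the first idea, assuming the inferred schema can be serialized with `TO_JSON_STRING` so it can be grouped on (check the function's actual return type):

```sql
SELECT
  TO_JSON_STRING(bigfunctions.us.json_schema(raw_data)) AS inferred_schema,
  COUNT(*) AS row_count
FROM your_dataset.your_table
GROUP BY inferred_schema
ORDER BY row_count DESC;
```

Rows sharing the most frequent schema can then be loaded into a structured table, while rare schemas are flagged for review.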


By dynamically determining the schema of JSON data using `json_schema`, you can make your data ingestion, analysis, and validation processes more robust and efficient.
30 changes: 30 additions & 0 deletions use_cases/json_values.md
You have a table in BigQuery that stores JSON strings representing user activity. Each JSON string contains key-value pairs where the keys represent activity types and the values represent timestamps or user IDs. You want to extract all the values from these JSON strings to analyze the different types of activities performed without needing to know the specific keys.

**Example Table:**

| UserID | ActivityJSON |
|---|---|
| 1 | `{"login": "2023-10-26 10:00:00", "purchase": "item123"}` |
| 2 | `{"logout": "2023-10-26 10:15:00", "view_product": "item456"}` |
| 3 | `{"login": "2023-10-26 10:30:00", "add_to_cart": "item789"}` |


**Query using `json_values`:**

```sql
SELECT
  UserID,
  bigfunctions.us.json_values(ActivityJSON) AS ActivityValues
FROM
  `your_project.your_dataset.your_table`;
```

**Result:**

| UserID | ActivityValues |
|---|---|
| 1 | `['2023-10-26 10:00:00', 'item123']` |
| 2 | `['2023-10-26 10:15:00', 'item456']` |
| 3 | `['2023-10-26 10:30:00', 'item789']` |

Now you have an array of values for each user, which you can further process. For instance, you could unnest the array to analyze the frequency of different activity values or join it with another table based on these values. The key benefit here is that you've extracted the relevant data without needing to explicitly parse the JSON based on individual keys. This is particularly useful when the keys in the JSON strings can vary across different rows but the values themselves hold the information you're interested in.
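
A sketch of the unnesting approach, assuming `json_values` returns an `ARRAY<STRING>`:

```sql
SELECT
  activity_value,
  COUNT(*) AS occurrences
FROM
  `your_project.your_dataset.your_table`,
  UNNEST(bigfunctions.us.json_values(ActivityJSON)) AS activity_value
GROUP BY activity_value
ORDER BY occurrences DESC;
```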
19 changes: 19 additions & 0 deletions use_cases/last_value.md
Imagine you have a table of customer orders, and each order has an array of timestamps representing different stages of the order fulfillment process (e.g., order placed, payment processed, shipped, delivered). You want to find the last timestamp in each array, which would represent the time the order was completed (delivered in this example).

```sql
SELECT
  order_id,
  bigfunctions.us.last_value(fulfillment_timestamps) AS order_completion_timestamp
FROM
  your_project.your_dataset.your_order_table
```

This query would use the `last_value` function to extract the last timestamp from the `fulfillment_timestamps` array for each order, giving you the order completion time.

Other use cases could include:

* **Finding the latest status update:** If you have an array of status updates for a task or project, `last_value` can give you the most recent status (see the sketch after this list).
* **Getting the last element of a sequence:** If you have an array representing a sequence of events, `last_value` can retrieve the final event in the sequence.
* **Extracting the latest value from sensor readings:** If you have an array of sensor readings over time, `last_value` can retrieve the most recent reading.
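
A sketch of the status-update case (the table and column names here are hypothetical):

```sql
SELECT
  task_id,
  bigfunctions.us.last_value(status_updates) AS latest_status
FROM
  your_project.your_dataset.your_task_table;
```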

Essentially, anytime you need to efficiently extract the last element from an array within BigQuery, the `last_value` function provides a clean and easy solution.
67 changes: 67 additions & 0 deletions use_cases/levenshtein.md
A common use case for the Levenshtein distance function is **fuzzy string matching**. Here are a few scenarios within BigQuery where this function would be helpful:

**1. Data Cleaning and Deduplication:**

Imagine you have a table of customer names, and you suspect there are duplicate entries due to slight variations in spelling (e.g., "John Smith" vs. "Jon Smith" or "John Smyth"). You can use `levenshtein` to identify pairs of names with a small Levenshtein distance, which suggests they might refer to the same person. This allows you to flag potential duplicates for manual review or automated merging.

```sql
#standardSQL
WITH CustomerNames AS (
  SELECT 'John Smith' AS name UNION ALL
  SELECT 'Jon Smith' AS name UNION ALL
  SELECT 'John Smyth' AS name UNION ALL
  SELECT 'Jane Doe' AS name UNION ALL
  SELECT 'Jane Doe ' AS name -- Example with extra space
)

SELECT
  c1.name AS name1,
  c2.name AS name2,
  bigfunctions.us.levenshtein(c1.name, c2.name) AS distance
FROM
  CustomerNames AS c1
CROSS JOIN
  CustomerNames AS c2
WHERE c1.name < c2.name -- Avoid comparing a name to itself and duplicate pairs
  AND bigfunctions.us.levenshtein(c1.name, c2.name) <= 2 -- Consider names with distance 2 or less as potential duplicates
```

**2. Spell Checking and Correction:**

You could use `levenshtein` to suggest corrections for misspelled words in a text field. By comparing a misspelled word to a dictionary of correctly spelled words, you can find the closest matches based on Levenshtein distance and offer them as suggestions.

```sql
#standardSQL
WITH Dictionary AS (
  SELECT 'apple' AS word UNION ALL
  SELECT 'banana' AS word UNION ALL
  SELECT 'orange' AS word
),
MisspelledWords AS (
  SELECT 'aple' AS misspelled_word UNION ALL
  SELECT 'bananna' AS misspelled_word
)

SELECT
  m.misspelled_word,
  d.word AS suggested_correction,
  bigfunctions.us.levenshtein(m.misspelled_word, d.word) AS distance
FROM
  MisspelledWords AS m
CROSS JOIN
  Dictionary AS d
ORDER BY
  m.misspelled_word,
  distance
```

**3. Record Linkage/Matching:**

If you have two datasets that should contain information about the same entities but lack a common key, you can use `levenshtein` on string fields (e.g., names, addresses) to help link records across the datasets. This is especially useful when dealing with data from different sources that may have inconsistencies in formatting or spelling.
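
A sketch of such a linkage, reusing the CROSS JOIN pattern from the first example (the tables and columns here are hypothetical; for large tables you would normally restrict the join with a blocking key first):

```sql
#standardSQL
SELECT
  a.customer_id,
  b.account_id,
  bigfunctions.us.levenshtein(LOWER(a.full_name), LOWER(b.full_name)) AS name_distance
FROM
  `your_project.crm.customers` AS a
CROSS JOIN
  `your_project.billing.accounts` AS b
WHERE
  a.postal_code = b.postal_code -- blocking key to keep the cross join manageable
  AND bigfunctions.us.levenshtein(LOWER(a.full_name), LOWER(b.full_name)) <= 2
```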

**4. Similar Product Search:**

In an e-commerce setting, you might want to suggest products with similar names to what a user searches for. `levenshtein` can help you identify products with names that are close to the search query, even if there are typos or slight variations in wording.
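
A sketch of such a suggestion query (the product table and search term are hypothetical):

```sql
DECLARE search_query STRING DEFAULT 'blu tooth speaker';

SELECT
  product_name,
  bigfunctions.us.levenshtein(LOWER(product_name), LOWER(search_query)) AS distance
FROM
  `your_project.shop.products`
ORDER BY
  distance
LIMIT 10
```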


These are just a few examples. The Levenshtein distance is a versatile tool for dealing with string variations and has applications in many areas of data analysis and processing within BigQuery. Remember to choose the appropriate BigQuery region for the `bigfunctions` dataset according to your data location (e.g., `bigfunctions.us` for US-based data).
13 changes: 13 additions & 0 deletions use_cases/list_bigquery_resources_in_current_project.md
A use case for the `list_bigquery_resources_in_current_project` function is to **analyze BigQuery resource usage and identify popular or underutilized data assets within a project.**

Imagine a large organization with numerous datasets, tables, and views in their BigQuery project. They want to:

* **Understand which datasets are most actively used:** This can inform decisions about data retention, access control, and resource allocation. The `popularity` score, reflecting recent usage by distinct users, highlights heavily used datasets.
* **Identify unused or rarely used tables:** These might be candidates for deletion or archiving to save storage costs and simplify data governance. Low `popularity` scores indicate underutilization.
* **Understand data lineage and dependencies:** The `details` field can reveal relationships between datasets and tables, helping to visualize data flow and assess the impact of potential changes. For example, you could see which tables are referenced by a particular view.
* **Track user activity:** The function can identify users interacting with different BigQuery resources, providing insights into data access patterns and potential security risks.
* **Automate data discovery and documentation:** The output can be used to generate reports or dashboards summarizing key information about BigQuery resources, including descriptions and usage metrics. This assists in data discovery and documentation efforts.

**Example Scenario:**

A data engineering team needs to optimize their BigQuery costs. They suspect that many tables are no longer being used. By calling `list_bigquery_resources_in_current_project`, they can get a ranked list of tables by popularity. Tables with a popularity of zero are prime candidates for deletion. This allows them to reclaim storage space and reduce costs. Further, they can examine the `details` for the popular tables to ensure appropriate access controls and optimize performance for frequently accessed data.
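
A sketch of that review, assuming the function is exposed as a stored procedure that writes its result to the temporary `bigfunction_result` table (a common BigFunctions pattern) and that the output includes a `popularity` column; check the function's reference page for the exact call signature and column names:

```sql
CALL bigfunctions.us.list_bigquery_resources_in_current_project();

-- Resources with the lowest popularity are candidates for archiving or deletion.
SELECT *
FROM bigfunction_result
ORDER BY popularity ASC;
```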
33 changes: 33 additions & 0 deletions use_cases/list_dataset_tables.md
A use case for the `list_dataset_tables` function is to quickly get an overview of the tables within one or more datasets in BigQuery. This can be useful in several scenarios:

* **Data Discovery/Exploration:** When working with a new project or dataset, you might not know all the tables that exist. `list_dataset_tables` provides a quick way to see what data is available.
* **Auditing/Documentation:** You can use this function to generate a list of tables for documentation purposes or to audit the contents of your datasets.
* **Automated Processes:** In scripts or workflows, you could use `list_dataset_tables` to dynamically determine which tables to process based on their presence in a dataset. For example, you might have a process that iterates over all tables in a dataset and performs some operation (e.g., data validation, backup, etc.).
* **Data Governance:** This function can be used as part of a data governance process to track and manage the tables within your BigQuery environment. You can regularly run the function and compare the results to a known list of approved tables to identify any unauthorized tables.
* **Interactive Analysis:** When working in the BigQuery console, you might want a quick reminder of the tables available in a dataset without navigating through the UI. This function can provide that information directly in the query results.


Example in a data pipeline:

Imagine you have a daily data pipeline that aggregates data from several raw tables into a summary table. You could use the `list_dataset_tables` function to automatically determine which raw tables to include in the aggregation process, making the pipeline more flexible and adaptable to changes in the raw data.


```sql
DECLARE raw_dataset_id STRING DEFAULT "your-project.your_raw_dataset";
DECLARE raw_tables ARRAY<STRING>;
DECLARE aggregation_sql STRING;

SET raw_tables = (
  SELECT ARRAY_AGG(table_name)
  FROM bigfunctions.your_region.list_dataset_tables(raw_dataset_id)
  WHERE STARTS_WITH(table_name, 'raw_data_') -- Filter for relevant tables
);

-- Plain SQL cannot reference a table whose name is only known at run time,
-- so build the aggregation query as a string and run it with EXECUTE IMMEDIATE.
SET aggregation_sql = (
  SELECT STRING_AGG(
    FORMAT("SELECT * FROM `your-project.your_raw_dataset.%s`", table_name),
    " UNION ALL "
  )
  FROM UNNEST(raw_tables) AS table_name
);

EXECUTE IMMEDIATE aggregation_sql;
```


This example shows how `list_dataset_tables` can help automate processes by dynamically retrieving a list of tables within a dataset, enhancing the pipeline's flexibility and maintainability. Replace `your_region` with the appropriate BigQuery region (e.g., `us`, `eu`, `us-central1`). Remember to adjust the table name filtering logic (`STARTS_WITH`) to suit your specific requirements.
27 changes: 27 additions & 0 deletions use_cases/list_public_datasets.md
A use case for the `list_public_datasets` BigQuery function is to **dynamically discover and explore the available public datasets in BigQuery**. This can be useful for several scenarios:

1. **Data Discovery and Exploration:** A data analyst or scientist might want to explore what public datasets are available for research or analysis without manually browsing the BigQuery UI or relying on outdated documentation. This function provides a quick and programmatic way to get a list of all public datasets.

2. **Automated Data Pipelines:** In an automated data pipeline, you could use this function to check for the existence of a specific public dataset before attempting to query it. This adds robustness to your pipeline, handling cases where a dataset might be temporarily unavailable or renamed (see the sketch after this list).

3. **Building a Data Catalog:** You can use the output of this function to populate a custom data catalog or metadata store. This allows you to maintain an internal index of available public datasets with additional metadata, such as descriptions or tags.

4. **Interactive Data Exploration Tools:** A web application or interactive notebook could use this function to present users with a list of available public datasets to choose from for analysis.

5. **Training and Education:** In a training environment, this function can be used to quickly demonstrate the breadth of publicly available data in BigQuery, allowing students to explore different datasets.
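
A sketch of that existence check, assuming (as the code example further below does) that the function returns an `ARRAY<STRING>` of dataset IDs; the dataset name and the exact ID format used here are assumptions to verify against the function's actual output:

```sql
SELECT 'bigquery-public-data.crypto_bitcoin' IN UNNEST(bigfunctions.us.list_public_datasets()) AS dataset_exists;
```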


**Example Scenario:**

Let's say a data analyst wants to build a dashboard showing trends in cryptocurrency prices. They know there are several public datasets related to cryptocurrency, but they're not sure of the exact names or what data is available. They can use the `list_public_datasets` function to get a list of all public datasets. Then, they can filter that list (perhaps using a regular expression) to find datasets related to cryptocurrency and explore their schemas to determine which datasets are suitable for their dashboard.


**Code Example (Illustrative):**

```sql
SELECT dataset_id
FROM UNNEST(bigfunctions.us.list_public_datasets()) AS dataset_id
WHERE REGEXP_CONTAINS(dataset_id, r'crypto');
```

This query would return all public datasets containing "crypto" in their ID (cryptocurrency datasets typically use names like `crypto_bitcoin`), allowing the analyst to quickly identify relevant datasets.