From dbffbfec538a6518f4a96656e88b5eeb78024313 Mon Sep 17 00:00:00 2001 From: "paul.marcombes" Date: Fri, 18 Oct 2024 14:54:50 +0000 Subject: [PATCH] use_cases generated --- use_cases/json_query.md | 59 ++++++++++++ use_cases/json_schema.md | 30 ++++++ use_cases/json_values.md | 30 ++++++ use_cases/last_value.md | 19 ++++ use_cases/levenshtein.md | 67 +++++++++++++ ...t_bigquery_resources_in_current_project.md | 13 +++ use_cases/list_dataset_tables.md | 33 +++++++ use_cases/list_public_datasets.md | 27 ++++++ use_cases/list_scheduled_queries.md | 23 +++++ use_cases/load_api_data.md | 61 ++++++++++++ use_cases/load_api_data_into_temp_dataset.md | 68 ++++++++++++++ use_cases/load_file.md | 43 +++++++++ use_cases/load_file_into_temp_dataset.md | 34 +++++++ use_cases/markdown2html.md | 14 +++ use_cases/max_value.md | 33 +++++++ use_cases/median_value.md | 30 ++++++ use_cases/min_max_scaler.md | 37 ++++++++ use_cases/min_value.md | 23 +++++ use_cases/ngram_frequency_similarity.md | 22 +++++ use_cases/parse_date.md | 48 ++++++++++ use_cases/parse_url.md | 31 ++++++ use_cases/parse_user_agent.md | 38 ++++++++ use_cases/percentile_value.md | 53 +++++++++++ use_cases/phone_number_info.md | 44 +++++++++ use_cases/post.md | 60 ++++++++++++ use_cases/precision_recall_auc.md | 39 ++++++++ use_cases/precision_recall_curve.md | 35 +++++++ use_cases/prophet.md | 51 ++++++++++ use_cases/quantize_into_bins.md | 42 +++++++++ use_cases/quantize_into_bins_with_labels.md | 66 +++++++++++++ use_cases/quantize_into_fixed_width_bins.md | 45 +++++++++ use_cases/rare_values.md | 54 +++++++++++ use_cases/refresh_powerbi.md | 67 +++++++++++++ use_cases/refresh_tableau.md | 33 +++++++ use_cases/remove_accents.md | 32 +++++++ use_cases/remove_extra_whitespaces.md | 32 +++++++ use_cases/remove_strings.md | 23 +++++ use_cases/remove_value.md | 50 ++++++++++ use_cases/remove_words.md | 23 +++++ use_cases/render_handlebars_template.md | 57 +++++++++++ use_cases/render_template.md | 58 ++++++++++++ use_cases/replace_special_characters.md | 25 +++++ use_cases/reverse_geocode.md | 19 ++++ use_cases/roc_auc.md | 25 +++++ use_cases/roc_curve.md | 40 ++++++++ use_cases/run_python.md | 60 ++++++++++++ use_cases/sankey_chart.md | 14 +++ use_cases/send_google_chat_message.md | 38 ++++++++ use_cases/send_mail.md | 94 +++++++++++++++++++ use_cases/send_mail_with_excel.md | 44 +++++++++ use_cases/send_slack_message.md | 37 ++++++++ use_cases/send_sms.md | 52 ++++++++++ use_cases/send_teams_message.md | 57 +++++++++++ use_cases/sentiment_score.md | 20 ++++ use_cases/sleep.md | 31 ++++++ use_cases/sort_values.md | 38 ++++++++ use_cases/sort_values_desc.md | 33 +++++++ use_cases/sql_to_flatten_json_column.md | 52 ++++++++++ use_cases/sum_values.md | 36 +++++++ use_cases/timestamp_from_unix_date_time.md | 58 ++++++++++++ use_cases/timestamp_to_unix_date_time.md | 50 ++++++++++ use_cases/translate.md | 8 ++ use_cases/translated_month_name.md | 54 +++++++++++ use_cases/translated_weekday_name.md | 17 ++++ use_cases/upload_table_to_gsheet.md | 25 +++++ use_cases/upload_to_gsheet.md | 38 ++++++++ use_cases/upsert.md | 73 ++++++++++++++ use_cases/url_decode.md | 31 ++++++ use_cases/validate_address.md | 34 +++++++ use_cases/weighted_average.md | 26 +++++ use_cases/xml2json.md | 41 ++++++++ use_cases/xml_extract.md | 60 ++++++++++++ use_cases/z_scores.md | 35 +++++++ 73 files changed, 2912 insertions(+) create mode 100644 use_cases/json_query.md create mode 100644 use_cases/json_schema.md create mode 100644 use_cases/json_values.md create mode 
100644 use_cases/last_value.md create mode 100644 use_cases/levenshtein.md create mode 100644 use_cases/list_bigquery_resources_in_current_project.md create mode 100644 use_cases/list_dataset_tables.md create mode 100644 use_cases/list_public_datasets.md create mode 100644 use_cases/list_scheduled_queries.md create mode 100644 use_cases/load_api_data.md create mode 100644 use_cases/load_api_data_into_temp_dataset.md create mode 100644 use_cases/load_file.md create mode 100644 use_cases/load_file_into_temp_dataset.md create mode 100644 use_cases/markdown2html.md create mode 100644 use_cases/max_value.md create mode 100644 use_cases/median_value.md create mode 100644 use_cases/min_max_scaler.md create mode 100644 use_cases/min_value.md create mode 100644 use_cases/ngram_frequency_similarity.md create mode 100644 use_cases/parse_date.md create mode 100644 use_cases/parse_url.md create mode 100644 use_cases/parse_user_agent.md create mode 100644 use_cases/percentile_value.md create mode 100644 use_cases/phone_number_info.md create mode 100644 use_cases/post.md create mode 100644 use_cases/precision_recall_auc.md create mode 100644 use_cases/precision_recall_curve.md create mode 100644 use_cases/prophet.md create mode 100644 use_cases/quantize_into_bins.md create mode 100644 use_cases/quantize_into_bins_with_labels.md create mode 100644 use_cases/quantize_into_fixed_width_bins.md create mode 100644 use_cases/rare_values.md create mode 100644 use_cases/refresh_powerbi.md create mode 100644 use_cases/refresh_tableau.md create mode 100644 use_cases/remove_accents.md create mode 100644 use_cases/remove_extra_whitespaces.md create mode 100644 use_cases/remove_strings.md create mode 100644 use_cases/remove_value.md create mode 100644 use_cases/remove_words.md create mode 100644 use_cases/render_handlebars_template.md create mode 100644 use_cases/render_template.md create mode 100644 use_cases/replace_special_characters.md create mode 100644 use_cases/reverse_geocode.md create mode 100644 use_cases/roc_auc.md create mode 100644 use_cases/roc_curve.md create mode 100644 use_cases/run_python.md create mode 100644 use_cases/sankey_chart.md create mode 100644 use_cases/send_google_chat_message.md create mode 100644 use_cases/send_mail.md create mode 100644 use_cases/send_mail_with_excel.md create mode 100644 use_cases/send_slack_message.md create mode 100644 use_cases/send_sms.md create mode 100644 use_cases/send_teams_message.md create mode 100644 use_cases/sentiment_score.md create mode 100644 use_cases/sleep.md create mode 100644 use_cases/sort_values.md create mode 100644 use_cases/sort_values_desc.md create mode 100644 use_cases/sql_to_flatten_json_column.md create mode 100644 use_cases/sum_values.md create mode 100644 use_cases/timestamp_from_unix_date_time.md create mode 100644 use_cases/timestamp_to_unix_date_time.md create mode 100644 use_cases/translate.md create mode 100644 use_cases/translated_month_name.md create mode 100644 use_cases/translated_weekday_name.md create mode 100644 use_cases/upload_table_to_gsheet.md create mode 100644 use_cases/upload_to_gsheet.md create mode 100644 use_cases/upsert.md create mode 100644 use_cases/url_decode.md create mode 100644 use_cases/validate_address.md create mode 100644 use_cases/weighted_average.md create mode 100644 use_cases/xml2json.md create mode 100644 use_cases/xml_extract.md create mode 100644 use_cases/z_scores.md diff --git a/use_cases/json_query.md b/use_cases/json_query.md new file mode 100644 index 00000000..16279736 --- /dev/null +++ 
b/use_cases/json_query.md @@ -0,0 +1,59 @@ +Let's imagine you have a BigQuery table storing user activity logs, where each row contains a JSON string representing various actions a user took within a session. The JSON structure might look like this: + +```json +{ + "userId": "12345", + "sessionId": "abcde", + "actions": [ + {"type": "pageview", "url": "/home"}, + {"type": "click", "element": "button1"}, + {"type": "form_submit", "data": {"name": "John", "email": "john@example.com"}}, + {"type": "pageview", "url": "/products"}, + {"type": "click", "element": "addtocart"} + ] +} +``` + +Here are a few use cases for the `json_query` function with this data: + +1. **Extracting all URLs visited during a session:** + +```sql +SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[*].url') AS visited_urls +FROM your_table +WHERE userId = '12345' AND sessionId = 'abcde'; +``` + +This query would return an array like `["/home", "/products"]`. + +2. **Finding all "click" actions and the elements clicked:** + +```sql +SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[?type==`click`].element') AS clicked_elements +FROM your_table +WHERE userId = '12345' AND sessionId = 'abcde'; +``` + +This would return `["button1", "addtocart"]`. + +3. **Getting the data submitted in a form:** + +```sql +SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[?type==`form_submit`].data') AS form_data +FROM your_table +WHERE userId = '12345' AND sessionId = 'abcde'; +``` + +This would return an array containing a single object: `[{"name": "John", "email": "john@example.com"}]`. You could further refine this to get specific fields within the `data` object. + +4. **Checking if a specific action type occurred:** + +```sql +SELECT bigfunctions.YOUR_REGION.json_query(activity_json, 'actions[?type==`purchase`]') IS NOT NULL AS purchased +FROM your_table +WHERE userId = '12345' AND sessionId = 'abcde'; +``` + +This query returns `true` if a "purchase" action exists in the `actions` array and `false` otherwise. + +These examples demonstrate the flexibility of `json_query` for extracting and analyzing data from complex JSON structures within BigQuery. The function's use of JMESPath allows for complex filtering and projections, simplifying tasks that would otherwise require more complicated SQL or User-Defined Functions (UDFs). diff --git a/use_cases/json_schema.md b/use_cases/json_schema.md new file mode 100644 index 00000000..f75be8ee --- /dev/null +++ b/use_cases/json_schema.md @@ -0,0 +1,30 @@ +A use case for the `json_schema` function is to dynamically determine the schema of JSON data stored in a BigQuery table without prior knowledge of its structure. This can be particularly helpful in situations like: + +* **Data ingestion from diverse sources:** Imagine receiving JSON data from various APIs or partners where the structure might not be consistent or documented thoroughly. `json_schema` can be used to automatically analyze a sample of the incoming data and infer its schema. This information can then be used to create or validate table schemas, ensuring proper data loading. + +* **Data exploration and analysis:** When dealing with unfamiliar JSON data, `json_schema` helps quickly understand its structure and the types of information it contains. This is useful for exploratory data analysis and building queries without manually examining the JSON objects. 
+ +* **Schema evolution tracking:** By periodically applying `json_schema` to incoming data, you can detect changes in the JSON structure over time. This allows you to adapt your processing pipelines or table schemas as needed, ensuring compatibility and avoiding errors. + +* **Data validation:** After inferring the schema, it can be used to validate subsequent JSON data against the expected structure. This can prevent malformed data from being ingested, ensuring data quality. + +* **Automated documentation:** The output of `json_schema` can be used to generate documentation for the JSON data, simplifying communication and understanding among different teams or users. + + +**Example Scenario:** + +Let's say you have a BigQuery table containing a `raw_data` column storing JSON strings from different sources. You can use the following query to get the schema of the JSON data in each row: + +```sql +SELECT bigfunctions.us.json_schema(raw_data) AS inferred_schema +FROM your_dataset.your_table; +``` + +This will return a table where each row contains the inferred schema of the corresponding JSON data in `raw_data`. You can then further process this output to: + +* Identify the common schema across different JSON data. +* Create a new table with the appropriate schema to store the extracted JSON data in a structured format. +* Flag rows with unexpected schemas for further investigation. + + +By dynamically determining the schema of JSON data using `json_schema`, you can make your data ingestion, analysis, and validation processes more robust and efficient. diff --git a/use_cases/json_values.md b/use_cases/json_values.md new file mode 100644 index 00000000..4d5ed37c --- /dev/null +++ b/use_cases/json_values.md @@ -0,0 +1,30 @@ +You have a table in BigQuery that stores JSON strings representing user activity. Each JSON string contains key-value pairs where the keys represent activity types and the values represent timestamps or user IDs. You want to extract all the values from these JSON strings to analyze the different types of activities performed without needing to know the specific keys. + +**Example Table:** + +| UserID | ActivityJSON | +|---|---| +| 1 | `{"login": "2023-10-26 10:00:00", "purchase": "item123"}` | +| 2 | `{"logout": "2023-10-26 10:15:00", "view_product": "item456"}` | +| 3 | `{"login": "2023-10-26 10:30:00", "add_to_cart": "item789"}` | + + +**Query using `json_values`:** + +```sql +SELECT + UserID, + bigfunctions.us.json_values(ActivityJSON) AS ActivityValues +FROM + `your_project.your_dataset.your_table`; +``` + +**Result:** + +| UserID | ActivityValues | +|---|---| +| 1 | `['2023-10-26 10:00:00', 'item123']` | +| 2 | `['2023-10-26 10:15:00', 'item456']` | +| 3 | `['2023-10-26 10:30:00', 'item789']` | + +Now you have an array of values for each user, which you can further process. For instance, you could unnest the array to analyze the frequency of different activity values or join it with another table based on these values. The key benefit here is that you've extracted the relevant data without needing to explicitly parse the JSON based on individual keys. This is particularly useful when the keys in the JSON strings can vary across different rows but the values themselves hold the information you're interested in. 
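+
+To go one step further, the returned array can be unnested directly in the same query. A minimal sketch, reusing the example table and `ActivityJSON` column above (and assuming, as in the result table, that `json_values` returns an `ARRAY<STRING>`):
+
+```sql
+-- Count how often each activity value appears across all users.
+SELECT
+  activity_value,
+  COUNT(*) AS occurrences
+FROM
+  `your_project.your_dataset.your_table`,
+  UNNEST(bigfunctions.us.json_values(ActivityJSON)) AS activity_value
+GROUP BY activity_value
+ORDER BY occurrences DESC;
+```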
diff --git a/use_cases/last_value.md b/use_cases/last_value.md new file mode 100644 index 00000000..5d30aafa --- /dev/null +++ b/use_cases/last_value.md @@ -0,0 +1,19 @@ +Imagine you have a table of customer orders, and each order has an array of timestamps representing different stages of the order fulfillment process (e.g., order placed, payment processed, shipped, delivered). You want to find the last timestamp in each array, which would represent the time the order was completed (delivered in this example). + +```sql +SELECT + order_id, + bigfunctions.us.last_value(fulfillment_timestamps) AS order_completion_timestamp +FROM + your_project.your_dataset.your_order_table +``` + +This query would use the `last_value` function to extract the last timestamp from the `fulfillment_timestamps` array for each order, giving you the order completion time. + +Other use cases could include: + +* **Finding the latest status update:** If you have an array of status updates for a task or project, `last_value` can give you the most recent status. +* **Getting the last element of a sequence:** If you have an array representing a sequence of events, `last_value` can retrieve the final event in the sequence. +* **Extracting the latest value from sensor readings:** If you have an array of sensor readings over time, `last_value` can retrieve the most recent reading. + +Essentially, anytime you need to efficiently extract the last element from an array within BigQuery, the `last_value` function provides a clean and easy solution. diff --git a/use_cases/levenshtein.md b/use_cases/levenshtein.md new file mode 100644 index 00000000..37c30422 --- /dev/null +++ b/use_cases/levenshtein.md @@ -0,0 +1,67 @@ +A common use case for the Levenshtein distance function is **fuzzy string matching**. Here are a few scenarios within BigQuery where this function would be helpful: + +**1. Data Cleaning and Deduplication:** + +Imagine you have a table of customer names, and you suspect there are duplicate entries due to slight variations in spelling (e.g., "John Smith" vs. "Jon Smith" or "John Smyth"). You can use `levenshtein` to identify pairs of names with a small Levenshtein distance, which suggests they might refer to the same person. This allows you to flag potential duplicates for manual review or automated merging. + +```sql +#standardSQL +WITH CustomerNames AS ( + SELECT 'John Smith' AS name UNION ALL + SELECT 'Jon Smith' AS name UNION ALL + SELECT 'John Smyth' AS name UNION ALL + SELECT 'Jane Doe' AS name UNION ALL + SELECT 'Jane Doe ' AS name -- Example with extra space +) + +SELECT + c1.name AS name1, + c2.name AS name2, + bigfunctions.us.levenshtein(c1.name, c2.name) AS distance +FROM + CustomerNames AS c1 +CROSS JOIN + CustomerNames AS c2 +WHERE c1.name < c2.name -- Avoid comparing a name to itself and duplicate pairs + AND bigfunctions.us.levenshtein(c1.name, c2.name) <= 2 -- Consider names with distance 2 or less as potential duplicates +``` + +**2. Spell Checking and Correction:** + +You could use `levenshtein` to suggest corrections for misspelled words in a text field. By comparing a misspelled word to a dictionary of correctly spelled words, you can find the closest matches based on Levenshtein distance and offer them as suggestions.
+ +```sql +#standardSQL +WITH Dictionary AS ( + SELECT 'apple' AS word UNION ALL + SELECT 'banana' AS word UNION ALL + SELECT 'orange' AS word +), +MisspelledWords AS ( + SELECT 'aple' AS misspelled_word UNION ALL + SELECT 'bananna' AS misspelled_word +) + +SELECT + m.misspelled_word, + d.word AS suggested_correction, + bigfunctions.us.levenshtein(m.misspelled_word, d.word) AS distance +FROM + MisspelledWords AS m +CROSS JOIN + Dictionary AS d +ORDER BY + m.misspelled_word, + distance +``` + +**3. Record Linkage/Matching:** + +If you have two datasets that should contain information about the same entities but lack a common key, you can use `levenshtein` on string fields (e.g., names, addresses) to help link records across the datasets. This is especially useful when dealing with data from different sources that may have inconsistencies in formatting or spelling. + +**4. Similar Product Search:** + +In an e-commerce setting, you might want to suggest products with similar names to what a user searches for. `levenshtein` can help you identify products with names that are close to the search query, even if there are typos or slight variations in wording. + + +These are just a few examples. The Levenshtein distance is a versatile tool for dealing with string variations and has applications in many areas of data analysis and processing within BigQuery. Remember to choose the appropriate BigQuery region for the `bigfunctions` dataset according to your data location (e.g., `bigfunctions.us` for US-based data). diff --git a/use_cases/list_bigquery_resources_in_current_project.md b/use_cases/list_bigquery_resources_in_current_project.md new file mode 100644 index 00000000..1ac72e5d --- /dev/null +++ b/use_cases/list_bigquery_resources_in_current_project.md @@ -0,0 +1,13 @@ +A use case for the `list_bigquery_resources_in_current_project` function is to **analyze BigQuery resource usage and identify popular or underutilized data assets within a project.** + +Imagine a large organization with numerous datasets, tables, and views in their BigQuery project. They want to understand: + +* **Which datasets are most actively used:** This can inform decisions about data retention, access control, and resource allocation. The `popularity` score, reflecting recent usage by distinct users, highlights heavily used datasets. +* **Identify unused or rarely used tables:** These might be candidates for deletion or archiving to save storage costs and simplify data governance. Low `popularity` scores indicate underutilization. +* **Understand data lineage and dependencies:** The `details` field can reveal relationships between datasets and tables, helping to visualize data flow and assess the impact of potential changes. For example, you could see which tables are referenced by a particular view. +* **Track user activity:** The function can identify users interacting with different BigQuery resources, providing insights into data access patterns and potential security risks. +* **Automate data discovery and documentation:** The output can be used to generate reports or dashboards summarizing key information about BigQuery resources, including descriptions and usage metrics. This assists in data discovery and documentation efforts. + +**Example Scenario:** + +A data engineering team needs to optimize their BigQuery costs. They suspect that many tables are no longer being used. By calling `list_bigquery_resources_in_current_project`, they can get a ranked list of tables by popularity. 
Tables with a popularity of zero are prime candidates for deletion. This allows them to reclaim storage space and reduce costs. Further, they can examine the `details` for the popular tables to ensure appropriate access controls and optimize performance for frequently accessed data. diff --git a/use_cases/list_dataset_tables.md b/use_cases/list_dataset_tables.md new file mode 100644 index 00000000..3d54bea3 --- /dev/null +++ b/use_cases/list_dataset_tables.md @@ -0,0 +1,33 @@ +A use case for the `list_dataset_tables` function is to quickly get an overview of the tables within one or more datasets in BigQuery. This can be useful in several scenarios: + +* **Data Discovery/Exploration:** When working with a new project or dataset, you might not know all the tables that exist. `list_dataset_tables` provides a quick way to see what data is available. +* **Auditing/Documentation:** You can use this function to generate a list of tables for documentation purposes or to audit the contents of your datasets. +* **Automated Processes:** In scripts or workflows, you could use `list_dataset_tables` to dynamically determine which tables to process based on their presence in a dataset. For example, you might have a process that iterates over all tables in a dataset and performs some operation (e.g., data validation, backup, etc.). +* **Data Governance:** This function can be used as part of a data governance process to track and manage the tables within your BigQuery environment. You can regularly run the function and compare the results to a known list of approved tables to identify any unauthorized tables. +* **Interactive Analysis:** When working in the BigQuery console, you might want a quick reminder of the tables available in a dataset without navigating through the UI. This function can provide that information directly in the query results. + + +Example in a data pipeline: + +Imagine you have a daily data pipeline that aggregates data from several raw tables into a summary table. You could use the `list_dataset_tables` function to automatically determine which raw tables to include in the aggregation process, making the pipeline more flexible and adaptable to changes in the raw data. + + +```sql +DECLARE raw_dataset_id STRING DEFAULT "your-project.your_raw_dataset"; +DECLARE raw_tables ARRAY<STRING>; + +SET raw_tables = ( + SELECT ARRAY_AGG(table_name) + FROM bigfunctions.your_region.list_dataset_tables(raw_dataset_id) + WHERE STARTS_WITH(table_name, 'raw_data_') -- Filter for relevant tables +); + +-- Use the raw_tables array in your aggregation query +SELECT ... +FROM UNNEST(raw_tables) AS table_name +JOIN `your-project.your_raw_dataset`.table_name -- Dynamically access tables (in practice this requires dynamic SQL, e.g. EXECUTE IMMEDIATE) +... +``` + + +This example shows how `list_dataset_tables` can help automate processes by dynamically retrieving a list of tables within a dataset, enhancing the pipeline's flexibility and maintainability. Replace `your_region` with the appropriate BigQuery region (e.g., `us`, `eu`, `us-central1`). Remember to adjust the table name filtering logic (`STARTS_WITH`) to suit your specific requirements. diff --git a/use_cases/list_public_datasets.md b/use_cases/list_public_datasets.md new file mode 100644 index 00000000..01583997 --- /dev/null +++ b/use_cases/list_public_datasets.md @@ -0,0 +1,27 @@ +A use case for the `list_public_datasets` BigQuery function is to **dynamically discover and explore the available public datasets in BigQuery**. This can be useful for several scenarios: + +1. 
**Data Discovery and Exploration:** A data analyst or scientist might want to explore what public datasets are available for research or analysis without manually browsing the BigQuery UI or relying on outdated documentation. This function provides a quick and programmatic way to get a list of all public datasets. + +2. **Automated Data Pipelines:** In an automated data pipeline, you could use this function to check for the existence of a specific public dataset before attempting to query it. This adds robustness to your pipeline, handling cases where a dataset might be temporarily unavailable or renamed. + +3. **Building a Data Catalog:** You can use the output of this function to populate a custom data catalog or metadata store. This allows you to maintain an internal index of available public datasets with additional metadata, such as descriptions or tags. + +4. **Interactive Data Exploration Tools:** A web application or interactive notebook could use this function to present users with a list of available public datasets to choose from for analysis. + +5. **Training and Education:** In a training environment, this function can be used to quickly demonstrate the breadth of publicly available data in BigQuery, allowing students to explore different datasets. + + +**Example Scenario:** + +Let's say a data analyst wants to build a dashboard showing trends in cryptocurrency prices. They know there are several public datasets related to cryptocurrency, but they're not sure of the exact names or what data is available. They can use the `list_public_datasets` function to get a list of all public datasets. Then, they can filter that list (perhaps using a regular expression) to find datasets related to cryptocurrency and explore their schemas to determine which datasets are suitable for their dashboard. + + +**Code Example (Illustrative):** + +```sql +SELECT dataset_id +FROM UNNEST(bigfunctions.us.list_public_datasets()) AS dataset_id +WHERE REGEXP_CONTAINS(dataset_id, r'cryptocurrency'); +``` + +This query would return all public datasets containing the term "cryptocurrency" in their ID, allowing the analyst to quickly identify relevant datasets. diff --git a/use_cases/list_scheduled_queries.md b/use_cases/list_scheduled_queries.md new file mode 100644 index 00000000..c0926ce9 --- /dev/null +++ b/use_cases/list_scheduled_queries.md @@ -0,0 +1,23 @@ +A use case for the `list_scheduled_queries` function would be for an administrator or developer who needs to gain an overview of all scheduled queries within a specific Google Cloud project. Here are some more detailed scenarios: + +* **Auditing and Governance:** A data governance team could use this function to regularly check for any unauthorized or outdated scheduled queries. They could then disable or modify them as needed, ensuring compliance with data policies. + +* **Monitoring and Performance Tuning:** By listing all scheduled queries, a performance engineer can identify resource-intensive queries that might be impacting overall BigQuery performance. This allows for optimization efforts and better resource allocation. + +* **Documentation and Knowledge Sharing:** This function can be used to generate a list of existing scheduled queries for documentation purposes. This is useful for onboarding new team members or understanding the data pipelines within a project. 
+ +* **Dependency Management:** Before making changes to underlying datasets or tables, a developer could use `list_scheduled_queries` to identify any scheduled queries that depend on those resources. This helps prevent unintended consequences and ensures a smooth transition during updates. + +* **Troubleshooting and Debugging:** When investigating issues with data freshness or unexpected results, knowing which scheduled queries are running and their configurations is crucial. This function provides that information quickly and easily. + +* **Building Management Tools:** You could integrate this function into a custom management tool or dashboard that provides a centralized view of all scheduled tasks within a project, including queries, data transfers, and other operations. + + +Example: Imagine a company that uses scheduled queries to generate daily reports. They could use `list_scheduled_queries` within a script to: + +1. **Retrieve all scheduled queries.** +2. **Filter the list** based on specific criteria (e.g., queries that run on a specific dataset, or queries containing certain keywords). +3. **Generate alerts** if any crucial scheduled queries are missing or disabled. +4. **Automatically enable or disable queries** based on certain conditions. + +This allows for programmatic control and monitoring of scheduled queries, simplifying administration and improving reliability. diff --git a/use_cases/load_api_data.md b/use_cases/load_api_data.md new file mode 100644 index 00000000..490f0c2e --- /dev/null +++ b/use_cases/load_api_data.md @@ -0,0 +1,61 @@ +Let's say you want to analyze customer feedback from your Zendesk Support instance in BigQuery. You can use the `load_api_data` function to achieve this without manual data extraction and uploads. + +**1. Identify the Source and Explore Configuration:** + +* **Source:** `airbyte-source-zendesk-support==2.6.10` (or a later compatible version) + +**2. Generate Encrypted Secret for your Zendesk Access Token:** + +* Follow the instructions in the documentation to encrypt your Zendesk access token. This ensures your credentials aren't exposed in logs. Let's assume the encrypted secret is `ENCRYPTED_SECRET(your_encrypted_token)`. + +**3. Determine Available Streams:** + +* Call `load_api_data` with `streams` set to `null` to see what data Zendesk makes available: + +```sql +call bigfunctions.us.load_api_data('airbyte-source-zendesk-support==2.6.10', ''' + credentials: + access_token: ENCRYPTED_SECRET(your_encrypted_token) +''', null, null); +select * from bigfunction_result; +``` +* This will return a list of available streams, such as `tickets`, `users`, `organizations`, etc. + +**4. Select Desired Streams and Destination:** + +* Decide which streams you need (e.g., `tickets`, `users`). +* Choose your BigQuery destination dataset. For example: `your_project.your_zendesk_data` + +**5. Load the Data:** + +* Call `load_api_data` with the correct parameters: + +```sql +call bigfunctions.us.load_api_data('airbyte-source-zendesk-support==2.6.10', ''' + credentials: + access_token: ENCRYPTED_SECRET(your_encrypted_token) +''', 'tickets,users', 'your_project.your_zendesk_data'); +select * from bigfunction_result; +``` + +This will: + +* Create temporary tables within the `bigfunctions` project. +* Extract data from the `tickets` and `users` streams in Zendesk. +* Load the extracted data into the temporary tables. +* Move the data from the temporary tables to your specified destination dataset (`your_project.your_zendesk_data`). 
+* Clean up the temporary tables and resources. + +**Result:** You now have Zendesk ticket and user data in your BigQuery dataset, ready for analysis. Subsequent calls will incrementally load new or updated data based on the state saved in the `_airbyte_states` table. + + +**Key Improvements over other Methods:** + +* **Simplified Data Integration:** No need to build custom ETL pipelines or manage infrastructure. +* **Wide Connector Support:** Access data from 250+ sources through Airbyte. +* **Incremental Loads:** Avoids redundant data processing by loading only new or changed data. +* **Secure Credential Handling:** Encryption protects sensitive information. +* **Serverless:** Leverages BigQuery's serverless architecture for scalability and cost-efficiency. + + +This example showcases how `load_api_data` streamlines data ingestion from external APIs into BigQuery, enabling efficient data analysis and reporting. You can adapt this approach to integrate data from various other sources supported by Airbyte connectors. diff --git a/use_cases/load_api_data_into_temp_dataset.md b/use_cases/load_api_data_into_temp_dataset.md new file mode 100644 index 00000000..7742c480 --- /dev/null +++ b/use_cases/load_api_data_into_temp_dataset.md @@ -0,0 +1,68 @@ +Let's say you're a data analyst working for an e-commerce company and you want to analyze customer feedback from your Zendesk Support instance. Here's how `load_api_data_into_temp_dataset` could help: + +1. **Discover Available Connectors and Configuration:** + + You start by checking if a Zendesk connector exists and what configuration parameters it requires: + + ```sql + SELECT bigfunctions.us.load_api_data_into_temp_dataset(null, null, null, null); + ``` + + This would list all available Airbyte connectors, including (hopefully) `airbyte-source-zendesk-support`. Then, you'd run: + + ```sql + SELECT bigfunctions.us.load_api_data_into_temp_dataset('airbyte-source-zendesk-support==2.6.10', null, null, null); -- Replace with actual version + ``` + + This provides a sample `source_config` YAML showing the required fields like `credentials.access_token`. + +2. **Encrypt your Zendesk API Token:** + + Use the provided code snippet to encrypt your Zendesk access token. This crucial step protects your sensitive information. Replace `kdoekdswlxzapdldpzlfpfd` in the example with your actual encrypted token. + +3. **Load Zendesk Data to a Temporary Dataset:** + + Now, load data from the 'tickets' stream (assuming you are interested in support tickets) into a temporary BigQuery dataset: + + ```sql + SELECT bigfunctions.us.load_api_data_into_temp_dataset( + 'airbyte-source-zendesk-support==2.6.10', -- Replace with actual version + ''' + credentials: + access_token: ENCRYPTED_SECRET(YOUR_ENCRYPTED_TOKEN) + start_date: '2023-01-01T00:00:00Z' -- Optional: Pull data from a specific date + ''', + 'tickets', -- Specify the 'tickets' stream + null -- Initial load, no state provided + ); + ``` + This creates a temporary dataset (the name is returned by the function) containing a table named `tickets` with your Zendesk ticket data, as well as `_airbyte_logs` and `_airbyte_states` tables. + +4. **Incremental Loads:** + + After the initial load, you can perform incremental updates by retrieving the latest state from the `_airbyte_states` table and using it in subsequent calls. This ensures you only pull new or updated ticket data. 
Example: + + ```sql + SELECT state FROM `YOUR_TEMP_DATASET._airbyte_states` ORDER BY emitted_at DESC LIMIT 1; -- Get the latest state + + -- Store the state in a variable (replace with the actual retrieved state) + DECLARE latest_state STRING DEFAULT '{"tickets": {"cutoff_time": "2023-10-27T12:00:00Z"}}'; + + + SELECT bigfunctions.us.load_api_data_into_temp_dataset( + 'airbyte-source-zendesk-support==2.6.10', -- Replace with actual version + ''' + credentials: + access_token: ENCRYPTED_SECRET(YOUR_ENCRYPTED_TOKEN) + ''', + 'tickets', + latest_state + ); + ``` + +5. **Analyze Data:** + + Finally, query the temporary dataset to analyze your Zendesk ticket data directly within BigQuery. + + +This use case demonstrates how `load_api_data_into_temp_dataset` simplifies data ingestion from external APIs like Zendesk into BigQuery, while prioritizing security and enabling incremental updates. This approach can be applied to other data sources supported by Airbyte connectors. diff --git a/use_cases/load_file.md b/use_cases/load_file.md new file mode 100644 index 00000000..515f6b88 --- /dev/null +++ b/use_cases/load_file.md @@ -0,0 +1,43 @@ +The `load_file` function is useful for quickly loading data from various web-based file formats directly into a BigQuery table. Here's a breakdown of potential use cases categorized by data type and source: + +**1. CSV Data:** + +* **Public Datasets:** Loading publicly available datasets in CSV format, like government data or research data. Example: Loading census data or economic indicators from a government website. +* **Web APIs:** Some APIs return data in CSV format. This function can be used to directly ingest that data into BigQuery. Example: A marketing API providing campaign performance data. +* **GitHub/GitLab:** Loading data directly from CSV files stored in repositories like GitHub or GitLab. This is helpful for sharing data within teams or for reproducible research. Example: Loading a training dataset for a machine learning model. + +**2. JSON Data:** + +* **REST APIs:** Many REST APIs return data in JSON format. `load_file` simplifies the process of ingesting this data into BigQuery without intermediate processing. Example: Loading product information from an e-commerce API. +* **GeoJSON Data:** Loading geospatial data in GeoJSON format. Example: Loading geographic boundaries of cities or countries. +* **Configuration Files:** Loading configuration data from JSON files hosted online. + + +**3. Parquet/Delta Lake Data:** + +* **Data Lakes:** Accessing and loading data directly from data lakes stored on cloud storage platforms like Google Cloud Storage. This is efficient for large datasets as Parquet and Delta Lake are optimized for analytical queries. Example: Loading historical sales data from a data lake. + + +**4. Excel/Shapefiles (via 'geo' file_type):** + +* **Legacy Data:** Loading data from legacy systems that often store data in Excel or Shapefile formats. Example: Loading customer data from an older CRM system. +* **GIS Data:** Loading geospatial data from shapefiles. Example: Loading data on road networks or land parcels. + + +**5. General Web Files:** + +* **Automated Data Ingestion:** Regularly loading data from a web source as part of an automated data pipeline. Example: Daily updates of stock prices. +* **Ad-hoc Data Analysis:** Quickly loading data from a web source for exploratory data analysis. Example: Analyzing a competitor's publicly available product catalog. 
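+
+Across these scenarios the call shape stays the same. The sketch below is purely illustrative: the argument names, their order and the use of `bigfunction_result` are assumptions, so check the `load_file` reference for the exact signature.
+
+```sql
+-- Hypothetical call: load a public CSV file from a URL into a destination table.
+call bigfunctions.us.load_file(
+  'csv',                                          -- file_type (assumed parameter)
+  'https://example.com/data/public_dataset.csv',  -- hypothetical source url
+  'your_project.your_dataset.public_dataset',     -- hypothetical destination table
+  null                                            -- optional load options
+);
+select * from bigfunction_result;
+```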
+ + +**Key Advantages of using `load_file`:** + +* **Simplicity:** Reduces the need for complex ETL pipelines for simple data loading tasks. +* **Speed:** Directly loads data into BigQuery, bypassing intermediate steps. +* **Flexibility:** Supports various file formats and sources. +* **Accessibility:** Makes web-based data easily accessible for analysis within BigQuery. + + +**Example Scenario:** + +A marketing analyst needs to analyze the performance of their recent social media campaigns. The social media platform provides an API that returns campaign data in CSV format. Instead of manually downloading the CSV file, processing it, and then uploading it to BigQuery, the analyst can use the `load_file` function to directly load the data from the API endpoint into a BigQuery table, saving time and effort. diff --git a/use_cases/load_file_into_temp_dataset.md b/use_cases/load_file_into_temp_dataset.md new file mode 100644 index 00000000..3a08bced --- /dev/null +++ b/use_cases/load_file_into_temp_dataset.md @@ -0,0 +1,34 @@ +This function is useful for quickly loading data from various online sources directly into BigQuery for analysis without needing to manually download, format, and upload the data. Here are a few specific use cases: + +**1. Data Exploration and Prototyping:** + +* You find a dataset on a public repository (like Github) or a government data portal, and you want to quickly explore it in BigQuery. `load_file_into_temp_dataset` lets you load the data directly without intermediate steps. This is perfect for initial data analysis and prototyping before deciding to store the data permanently. + +**2. Ad-hoc Analysis of Public Data:** + +* You need to analyze some publicly available data, such as weather data, stock prices, or social media trends, for a one-time report or analysis. You can use this function to load the data on demand without storing it permanently. + +**3. ETL Pipelines with Dynamic Data Sources:** + +* You're building an ETL pipeline that needs to process data from various sources that are updated frequently. `load_file_into_temp_dataset` can be integrated into your pipeline to dynamically load data from different URLs as needed. This is especially helpful when dealing with data sources that don't have a stable schema or format. + +**4. Data Enrichment:** + +* You have a dataset in BigQuery and need to enrich it with external data, such as geographic information, currency exchange rates, or product catalogs. You can use this function to load the external data into a temporary table and then join it with your existing table. + +**5. Sharing Data Snippets:** + +* You want to share a small dataset with a colleague or client without giving them access to your entire data warehouse. Load the data into a temporary dataset using this function and then grant them temporary access. This offers a secure and convenient way to share data snippets. + + +**Example: Analyzing Tweet Sentiment from a Public API:** + +Imagine an API that returns tweet data in JSON format. You want to analyze the sentiment of tweets related to a specific hashtag. + +1. Call the API to retrieve the tweets. The API might offer a download link or allow you to stream the data directly. +2. Use `load_file_into_temp_dataset` within a BigQuery query to load the JSON data from the API's URL. +3. Apply BigQuery's text processing functions to analyze the sentiment of the tweets in the temporary table. +4. Generate your report or visualization directly from the results. 
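+
+A rough sketch of step 2, plus a query against the resulting temporary table. The argument list and the loaded table name (`tweets`) are assumptions; the real signature is in the function reference.
+
+```sql
+DECLARE temp_dataset STRING;
+
+-- Hypothetical call: load the JSON returned by the API into a temporary dataset
+-- and capture the name of that dataset.
+SET temp_dataset = (
+  SELECT bigfunctions.us.load_file_into_temp_dataset('json', 'https://api.example.com/tweets.json', null)
+);
+
+-- Query the loaded table with dynamic SQL, since the dataset name is only known at runtime.
+EXECUTE IMMEDIATE FORMAT("""
+  SELECT COUNT(*) AS nb_tweets FROM `%s.tweets`
+""", temp_dataset);
+```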
+ + +This avoids the need to download the JSON file, create a table schema, and manually load the data, significantly speeding up your analysis. The temporary dataset automatically cleans itself up, simplifying data management. diff --git a/use_cases/markdown2html.md b/use_cases/markdown2html.md new file mode 100644 index 00000000..84e7d96c --- /dev/null +++ b/use_cases/markdown2html.md @@ -0,0 +1,14 @@ +The `markdown2html` function is useful anytime you need to convert text formatted in Markdown to HTML within BigQuery. Here are a few use cases: + +* **Generating HTML reports directly from BigQuery:** Imagine you have data in BigQuery that you want to present in a formatted report. You can use `markdown2html` to create the HTML structure of the report dynamically, including headings, lists, tables, and formatted text, all within your SQL query. The output can then be visualized directly in the BigQuery console (using the bookmarklet method described in the documentation) or exported for use in other applications. + +* **Email formatting:** Suppose you are using BigQuery to generate email content. You can store email templates in Markdown format within a BigQuery table. Then, using `markdown2html`, convert the Markdown to HTML within your query and send the formatted HTML as the body of the email. + +* **Dynamic content creation for web applications:** If your web application integrates with BigQuery, you might store content in Markdown format in BigQuery. Using `markdown2html`, you can query the content and convert it to HTML on the fly, reducing the need to store and manage HTML directly. This allows for easier content updates and a more streamlined workflow. + +* **Data documentation:** You could use Markdown to document your BigQuery datasets and tables. Using `markdown2html` within a query, you can dynamically generate HTML documentation pages based on the Markdown content, making it easier for users to understand the data. + +* **Enriching data exports:** If you're exporting data from BigQuery for use in another system that requires HTML formatting, you can use `markdown2html` to transform any Markdown fields into HTML before export. + + +In essence, `markdown2html` bridges the gap between the simplicity of Markdown for writing and editing text, and the richness of HTML for presentation, all within the BigQuery environment. diff --git a/use_cases/max_value.md b/use_cases/max_value.md new file mode 100644 index 00000000..a5ac80f8 --- /dev/null +++ b/use_cases/max_value.md @@ -0,0 +1,33 @@ +You have a table of products, and each product has a list of prices at different stores. You want to find the highest price for each product. + +```sql +WITH Products AS ( + SELECT + 'Product A' AS product_name, + [10.99, 12.50, 11.75] AS prices + UNION ALL SELECT + 'Product B' AS product_name, + [5.00, 5.50, 4.99] AS prices + UNION ALL SELECT + 'Product C' AS product_name, + [20.00, 19.50, 21.25] AS prices +) +SELECT + product_name, + bigfunctions.us.max_value(prices) AS max_price +FROM Products; +``` + +This query uses the `max_value` function to find the highest price within the `prices` array for each product. The result will be: + +``` ++-------------+-----------+ +| product_name | max_price | ++-------------+-----------+ +| Product A | 12.5 | +| Product B | 5.5 | +| Product C | 21.25 | ++-------------+-----------+ +``` + +This shows how `max_value` can be practically used to extract the maximum value from an array of numbers within a larger dataset. 
This could be useful for things like pricing analysis, finding peak values in time series data (if stored as arrays), or determining the maximum score in a game played multiple times. diff --git a/use_cases/median_value.md b/use_cases/median_value.md new file mode 100644 index 00000000..1e771692 --- /dev/null +++ b/use_cases/median_value.md @@ -0,0 +1,30 @@ +You have a table of users, and each user has a list of scores they've achieved in a game. You want to find the median score for each user. + +```sql +WITH UserScores AS ( + SELECT 'UserA' AS user_id, [85, 92, 78, 95, 88] AS scores UNION ALL + SELECT 'UserB' AS user_id, [70, 75, 68, 72, 77] AS scores UNION ALL + SELECT 'UserC' AS user_id, [90, 95, 88, 92] AS scores +) + +SELECT user_id, bigfunctions.us.median_value(scores) AS median_score +FROM UserScores; +``` + +This query uses the `median_value` function to calculate the median score from the `scores` array for each user. It will return a table like this: + +| user_id | median_score | +|---|---| +| UserA | 88 | +| UserB | 72 | +| UserC | 91 | + + +This is a practical use case where you need to find a representative central value for a set of numbers associated with each row in a table. Other potential use cases include: + +* **Sales Analysis:** Finding the median sales amount per customer. +* **Financial Modeling:** Calculating the median value of a portfolio of investments. +* **Sensor Data Analysis:** Determining the median value of readings from a sensor over a period of time. +* **Performance Monitoring:** Calculating the median latency of API calls. + +In essence, anytime you have an array of numeric data associated with individual records, and you need to find a typical or central value that is robust to outliers, the `median_value` function becomes very useful. diff --git a/use_cases/min_max_scaler.md b/use_cases/min_max_scaler.md new file mode 100644 index 00000000..c17e2d9e --- /dev/null +++ b/use_cases/min_max_scaler.md @@ -0,0 +1,37 @@ +Let's say you have a table of product prices and you want to compare their relative affordability. The prices range from $10 to $1000, but you need them on a normalized scale between 0 and 1 for a machine learning model or visualization. Here's how `min_max_scaler` can be used: + +```sql +WITH ProductPrices AS ( + SELECT 'Product A' AS product, 10 AS price + UNION ALL SELECT 'Product B' AS product, 50 AS price + UNION ALL SELECT 'Product C' AS product, 200 AS price + UNION ALL SELECT 'Product D' AS product, 1000 AS price +), +MinMaxScaledPrices AS ( + SELECT + ARRAY_AGG(product ORDER BY price) AS products, + bigfunctions.us.min_max_scaler(ARRAY_AGG(price ORDER BY price)) AS scaled_prices -- assumes the function keeps the input order + FROM ProductPrices +) +SELECT + products[OFFSET(off)] AS product, + scaled_price +FROM MinMaxScaledPrices, UNNEST(scaled_prices) AS scaled_price WITH OFFSET off; +``` + +This query first collects all prices (ordered by price) into an array using `ARRAY_AGG`, together with a parallel array of the matching products. Then, `min_max_scaler` normalizes the price array. Finally, `UNNEST ... WITH OFFSET` expands the scaled array, and the offset is used to look up the corresponding product, so each product and its scaled price end up on separate rows. + +This results in a table like this (the exact values might vary slightly due to floating-point precision): + +| product | scaled_price | +|------------|--------------| +| Product A | 0 | +| Product B | 0.04 | +| Product C | 0.19 | +| Product D | 1 | + +Now "Product A", with the lowest price, has a scaled price of 0, and "Product D", with the highest price, has a scaled price of 1. The other products have scaled prices in between, reflecting their relative affordability. 
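+
+If the scaling should happen within groups (for example per product category) rather than across the whole table, the same pattern works with a `GROUP BY`. A short sketch, assuming a hypothetical `category` column:
+
+```sql
+-- Scale prices to [0, 1] separately within each category.
+SELECT
+  category,
+  ARRAY_AGG(product ORDER BY price) AS products,
+  bigfunctions.us.min_max_scaler(ARRAY_AGG(price ORDER BY price)) AS scaled_prices
+FROM ProductPrices
+GROUP BY category;
+```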
+ + +Another use case would be normalizing features in a machine learning preprocessing step directly within BigQuery before exporting the data for training. This can simplify your data pipeline. diff --git a/use_cases/min_value.md b/use_cases/min_value.md new file mode 100644 index 00000000..ead835a6 --- /dev/null +++ b/use_cases/min_value.md @@ -0,0 +1,23 @@ +You have a table of products, and each product has an array of prices representing its price history. You want to find the lowest price ever recorded for each product. + +```sql +WITH Products AS ( + SELECT + 'Product A' AS product_name, + [10, 12, 8, 15, 9] AS prices + UNION ALL + SELECT + 'Product B' AS product_name, + [20, 18, 18, 19, 21] AS prices + UNION ALL + SELECT + 'Product C' AS product_name, + [5, 7, 5, 6, 4] AS prices +) +SELECT + product_name, + bigfunctions.us.min_value(prices) AS min_price +FROM Products; +``` + +This query would utilize the `min_value` function to efficiently determine the minimum value within the `prices` array for each product, effectively identifying the historical lowest price. You would replace `bigfunctions.us` with the appropriate dataset for your region. diff --git a/use_cases/ngram_frequency_similarity.md b/use_cases/ngram_frequency_similarity.md new file mode 100644 index 00000000..a2fc4d8d --- /dev/null +++ b/use_cases/ngram_frequency_similarity.md @@ -0,0 +1,22 @@ +This `ngram_frequency_similarity` function is useful for several text analysis and data matching tasks where you want to determine how similar two strings are based on the sequences of characters they contain. Here are a few use cases: + +**1. Plagiarism Detection:** Compare student submissions or documents to identify potential plagiarism by calculating the n-gram similarity. A high similarity score could indicate copied content. + +**2. Duplicate Detection:** Identify duplicate records in a database, even if they have slight variations in wording or spelling. For example, finding near-identical product descriptions or customer addresses. + +**3. Fuzzy Matching:** Match records that are not exactly the same but are similar enough to be considered a potential match. This is useful in situations where data entry errors or variations in naming conventions might exist. Examples include: + * Matching customer names from different sources. + * Matching product names across different retailers. + * Finding similar articles or news stories. + +**4. Recommendation Systems:** Suggest related products or content based on the similarity of their descriptions or titles. If two products have a high n-gram similarity, they might be relevant to the same customer. + +**5. Spell Checking/Auto-Correction:** Suggest possible corrections for misspelled words by finding words with high n-gram similarity to the incorrect input. + +**6. Information Retrieval:** Improve search relevance by identifying documents that are semantically similar to a search query, even if the exact words are not present. + +**7. Text Classification:** Group similar texts together based on their n-gram profiles. This could be used to categorize documents, emails, or social media posts. + +**Example Scenario (Fuzzy Matching):** + +Imagine an e-commerce site that wants to prevent duplicate product listings. A seller might try to list a "Samsung Galaxy S23" slightly differently, like "Samsung Galaxy S23 Smartphone" or "New Samsung Galaxy S23". 
By using `ngram_frequency_similarity` with an appropriate `n` value, the system can detect these near-duplicates and flag them for review, even though the strings aren't identical. This prevents redundant listings and ensures data quality. diff --git a/use_cases/parse_date.md b/use_cases/parse_date.md new file mode 100644 index 00000000..622e4aad --- /dev/null +++ b/use_cases/parse_date.md @@ -0,0 +1,48 @@ +You have a table containing date strings in various formats, and you need to standardize them into a consistent DATE type in BigQuery for analysis. The `parse_date` function can automatically detect and convert these different formats. + +**Scenario:** + +You're analyzing customer orders, and the `order_date` column contains date values, but they were entered using different formats due to various data sources or input methods: + +| order_id | order_date | +|----------|--------------------| +| 1 | 2023-10-26 | +| 2 | 10/27/2023 | +| 3 | Oct 28, 2023 | +| 4 | 28/10/23 | +| 5 | Fri Oct 29 08:00:00 2023 | + + +**Query using `parse_date`:** + +```sql +SELECT + order_id, + bigfunctions.us.parse_date(order_date) AS standardized_order_date +FROM + your_project.your_dataset.your_table; +``` + +**(Replace `bigfunctions.us` with the appropriate dataset for your region.)** + +**Result:** + +| order_id | standardized_order_date | +|----------|--------------------------| +| 1 | 2023-10-26 | +| 2 | 2023-10-27 | +| 3 | 2023-10-28 | +| 4 | 2023-10-28 | +| 5 | 2023-10-29 | + + +Now all your dates are in a standard `DATE` format, allowing you to perform date-based calculations, filtering, and aggregations consistently without having to manually handle the different formats. For example, you could then easily query for all orders placed in October: + +```sql +SELECT + * +FROM + your_project.your_dataset.your_table +WHERE + standardized_order_date BETWEEN '2023-10-01' AND '2023-10-31'; +``` diff --git a/use_cases/parse_url.md b/use_cases/parse_url.md new file mode 100644 index 00000000..bb583e79 --- /dev/null +++ b/use_cases/parse_url.md @@ -0,0 +1,31 @@ +You could use the `parse_url` function to analyze website traffic logs stored in BigQuery. Imagine you have a table with a column named `request_url` containing full URLs of pages visited. You want to understand which parts of your website are most popular, which campaigns (identified through URL parameters) are driving traffic, or which sections are accessed most frequently by users from specific referring domains. + +Here's a practical example: + +```sql +SELECT + parsed_url.host, + parsed_url.path, + REGEXP_EXTRACT(parsed_url.query, r'utm_campaign=([^&]*)') AS utm_campaign, + REGEXP_EXTRACT(parsed_url.ref, r'//([^/]*)') AS referring_domain, + COUNT(*) AS page_views + FROM + `your_project.your_dataset.your_table`, + UNNEST([bigfunctions.your_region.parse_url(request_url)]) AS parsed_url + GROUP BY 1, 2, 3, 4 + ORDER BY page_views DESC; + +``` + +**Explanation:** + +1. **`your_project.your_dataset.your_table`**: Replace this with the actual location of your website traffic log table in BigQuery. +2. **`bigfunctions.your_region.parse_url(request_url)`**: This calls the `parse_url` function (make sure to replace `your_region` with your BigQuery region) on the `request_url` column, breaking it down into its components. The result is an array containing a struct. +3. **`UNNEST(...) AS parsed_url`**: This unnests the resulting array so that you can access individual fields of the URL parts struct. +4. 
**`parsed_url.host`, `parsed_url.path`, etc.**: These access the individual components of the URL, like host, path, query string, and referring domain. +5. **`REGEXP_EXTRACT(...)`**: These functions extract specific parameters from the query string and referring domain. In this example, it's extracting the `utm_campaign` parameter (often used for tracking marketing campaigns) and the main domain from the referrer. You can adapt these regular expressions to extract other parameters you're interested in. +6. **`COUNT(*) AS page_views`**: This counts the number of times each combination of host, path, campaign, and referring domain appears, representing the number of page views. +7. **`GROUP BY 1, 2, 3, 4`**: This groups the results by the extracted fields. +8. **`ORDER BY page_views DESC`**: This sorts the results to show the most viewed pages first. + +This query gives you valuable insights into user behavior on your website, allowing you to identify popular content, track marketing campaign effectiveness, and understand referral traffic patterns. You could further refine this by adding filters based on date ranges, user segments, or other criteria relevant to your analysis. diff --git a/use_cases/parse_user_agent.md b/use_cases/parse_user_agent.md new file mode 100644 index 00000000..ce0bb4a7 --- /dev/null +++ b/use_cases/parse_user_agent.md @@ -0,0 +1,38 @@ +A website analytics team could use the `parse_user_agent` function to analyze website traffic and user behavior. Here's a breakdown of how they might use it: + +**Scenario:** The team wants to understand which browsers are most popular among their users, identify trends in mobile device usage, and optimize the website experience for different operating systems. They have a BigQuery table containing website access logs, including a column with user agent strings. + +**Use Case with BigQuery SQL:** + +```sql +SELECT + parsed_user_agent.browser.name AS browser_name, + parsed_user_agent.browser.version AS browser_version, + parsed_user_agent.os.name AS os_name, + parsed_user_agent.os.version AS os_version, + parsed_user_agent.device.model AS device_model, + parsed_user_agent.device.type AS device_type, + COUNT(*) AS access_count + FROM + `your_project.your_dataset.website_access_logs`, + UNNEST([bigfunctions.your_region.parse_user_agent(user_agent)]) AS parsed_user_agent + GROUP BY 1, 2, 3, 4, 5, 6 + ORDER BY access_count DESC; + +``` +**(Replace `your_project.your_dataset.website_access_logs` and `your_region` with your actual values.)** + +**Benefits:** + +* **Browser Statistics:** By aggregating results by `browser_name` and `browser_version`, the team can determine the market share of different browsers accessing their website. This helps in prioritizing browser compatibility testing and ensuring a consistent user experience. + +* **Mobile Device Insights:** Grouping by `device_model` and `device_type` reveals which mobile devices are commonly used to visit the site. This information is valuable for responsive design and mobile optimization efforts. + +* **Operating System Analysis:** Analyzing data based on `os_name` and `os_version` allows the team to identify potential compatibility issues or optimize the website for specific operating systems. + +* **Targeted Improvements:** By understanding the breakdown of user agents, the team can make data-driven decisions about website improvements. 
For example, if a significant portion of users are on older versions of a specific browser, they might choose to display a message encouraging them to update for better performance and security. + +* **Troubleshooting:** If there's a sudden spike in errors from a specific browser or device, the parsed user agent data helps pinpoint the problem quickly. + + +This use case demonstrates how the `parse_user_agent` function empowers the analytics team to gain valuable insights from raw user agent data within BigQuery, leading to informed decisions about website development and optimization. diff --git a/use_cases/percentile_value.md b/use_cases/percentile_value.md new file mode 100644 index 00000000..83e0b437 --- /dev/null +++ b/use_cases/percentile_value.md @@ -0,0 +1,53 @@ +Let's illustrate a use case for the `percentile_value` BigQuery function. + +**Scenario:** You have a table storing website session durations (in seconds) for different users. You want to analyze user engagement and identify the 95th percentile of session durations. This will help you understand how long highly engaged users typically spend on your site. + +**Table Schema:** + +```sql +CREATE OR REPLACE TABLE `your_project.your_dataset.session_durations` ( + user_id INT64, + session_duration INT64 +); + +INSERT INTO `your_project.your_dataset.session_durations` (user_id, session_duration) VALUES +(1, 120), (2, 300), (3, 60), (4, 1800), (5, 45), (6, 900), (7, 240), (8, 30), (9, 600), (10, 150); +``` + +**Query using `percentile_value`:** + +```sql +SELECT + bigfunctions.us.percentile_value(ARRAY_AGG(session_duration), 0.95) AS p95_session_duration + FROM + `your_project.your_dataset.session_durations`; + +``` + +**Explanation:** + +1. **`ARRAY_AGG(session_duration)`:** This aggregates all session durations into an array. +2. **`bigfunctions.us.percentile_value(..., 0.95)`:** This calculates the 95th percentile value from the array of session durations. Remember to replace `us` with your BigQuery region if different. +3. **`AS p95_session_duration`:** This aliases the result for clarity. + +**Result:** + +The query will return a single value representing the 95th percentile of session durations. This value indicates that 95% of sessions are shorter than or equal to this duration. Let's say the result is 1500 seconds. This tells you that highly engaged users tend to have sessions lasting around 1500 seconds or less. + +**Benefits of using `percentile_value`:** + +* **Simplified calculation:** Instead of manually implementing percentile logic, you can use this function directly. +* **Efficiency:** BigQuery functions are generally optimized for performance. +* **Flexibility:** You can easily change the percentile value (e.g., to calculate the median (50th percentile) or other percentiles) by adjusting the second argument. + + + +This is a simple example. You can apply this function to any scenario where you need to calculate percentiles from an array of values within BigQuery, such as: + +* **E-commerce:** Analyzing product prices, order values, or customer spending. +* **Gaming:** Analyzing player scores, playtime, or in-game purchases. +* **Finance:** Analyzing stock prices, transaction amounts, or customer balances. +* **Healthcare:** Analyzing patient wait times, treatment costs, or lengths of stay. + + +By using `percentile_value`, you can gain valuable insights into the distribution of your data and identify important thresholds or outliers. 
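+As noted under "Flexibility" above, switching to another percentile only requires changing the second argument. For instance, a minimal sketch that reuses the same sample table to compute the median (50th percentile): + +```sql +SELECT + bigfunctions.us.percentile_value(ARRAY_AGG(session_duration), 0.5) AS median_session_duration + FROM + `your_project.your_dataset.session_durations`; +```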
diff --git a/use_cases/phone_number_info.md b/use_cases/phone_number_info.md new file mode 100644 index 00000000..f0106bf7 --- /dev/null +++ b/use_cases/phone_number_info.md @@ -0,0 +1,44 @@ +A customer service department stores customer phone numbers in a BigQuery table. They want to clean up the data and enrich it with location information. The `phone_number_info` function can be used to accomplish this. + + +**Use Case Scenario:** + +The table `customer_data` contains a column `phone` with various formats of phone numbers, including some with extra characters or missing country codes. + +**Example BigQuery SQL:** + +```sql +SELECT + phone, + bigfunctions.us.phone_number_info(phone, JSON '{"defaultCountry": "US"}') AS phone_info +FROM + `project_id.dataset_id.customer_data`; +``` + +**Explanation:** + +1. **`bigfunctions.us.phone_number_info(phone, JSON '{"defaultCountry": "US"}')`**: This calls the `phone_number_info` function. + - We're using the `us` dataset because our project is in the US multi-region. Choose the appropriate regional or multi-regional dataset for *your* project's location. + - `phone` is the column containing the phone number string. + - `JSON '{"defaultCountry": "US"}'` provides the optional `defaultCountry` parameter. This is important for correctly interpreting phone numbers that don't start with a "+" and country code. It assumes any number without a "+" is a US number. You would change this to match the expected default country for your data. + +2. **`AS phone_info`**: This assigns the output of the function to a new column named `phone_info`. The output is a JSON structure. + +**Benefits:** + +* **Standardization:** The function parses and standardizes the phone numbers into a consistent international format (`number` field in the JSON output), even if the original data was messy. +* **Validation:** The `isValid` field in the JSON output indicates whether the phone number is valid according to international standards. This allows for identifying and correcting invalid numbers. +* **Enrichment:** The function provides additional information like `country` and `type` (e.g., mobile, fixed line). This data can be used for segmentation, analytics, and reporting. +* **Data Cleaning:** You can use the output to filter out invalid numbers: + +```sql +SELECT + phone +FROM + `project_id.dataset_id.customer_data`, + UNNEST([bigfunctions.us.phone_number_info(phone, JSON '{"defaultCountry": "US"}')]) AS phone_info +WHERE LAX_BOOL(phone_info.isValid); +``` + + +This example demonstrates how to use the `phone_number_info` function to clean, validate, and standardize phone number data in BigQuery, enabling better data quality and more insightful analysis. Remember to adjust the dataset and `defaultCountry` parameter based on your project's location and the characteristics of your data. diff --git a/use_cases/post.md b/use_cases/post.md new file mode 100644 index 00000000..257cef41 --- /dev/null +++ b/use_cases/post.md @@ -0,0 +1,60 @@ +This `post` BigQuery function could be used in several scenarios: + +**1. Sending Data to a Webhook:** + +Imagine you have a BigQuery table that tracks user sign-ups. You could use the `post` function to send real-time notifications to a Slack channel or other messaging platform via a webhook every time a new user registers. The `data` parameter would contain the user information you want to send in the notification.
+ +```sql +SELECT bigfunctions.us.post('YOUR_WEBHOOK_URL', TO_JSON_STRING(new_users), NULL) +FROM project.dataset.new_users; +``` + +**2. Interacting with an API:** + +You could use `post` to interact with REST APIs from within BigQuery. For example, you might want to enrich your data with information from a third-party service. After performing some transformations on your data in BigQuery, you could use the `post` function to send the transformed data to the API endpoint, receive the response, and then process it further within BigQuery. + +```sql +SELECT bigfunctions.us.post('https://api.example.com/data', TO_JSON_STRING(t), NULL) +FROM ( + SELECT user_id, SUM(order_value) as total_spent + FROM project.dataset.orders + GROUP BY user_id +) AS t; +``` + +**3. Triggering Actions in External Systems:** + +Suppose you have a BigQuery table that monitors key performance indicators (KPIs). If a KPI falls below a certain threshold, you could use the `post` function to trigger an action in an external system. This could be anything from sending an alert email to initiating a process in a workflow automation tool. + +```sql +SELECT bigfunctions.us.post('https://api.example.com/alert', TO_JSON_STRING(t), NULL) +FROM ( + SELECT * + FROM project.dataset.kpis + WHERE kpi_value < threshold +) AS t; +``` + +**4. Sending Data to a Real-time Dashboard:** + +If you are using a real-time dashboarding tool, you could use the `post` function to send data updates directly from BigQuery. This would allow you to keep your dashboards up-to-date with the latest information without needing to build complex data pipelines. + +```sql +SELECT bigfunctions.us.post('https://api.dashboard.com/update', TO_JSON_STRING(t), NULL) +FROM ( + SELECT COUNT(*) AS active_users + FROM project.dataset.users + WHERE last_seen > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR) +) AS t; +``` + +**Key Considerations:** + +* **Data Format:** The `data` parameter must be a valid JSON string. You can use the `TO_JSON_STRING` function in BigQuery to convert your data into the required format. +* **Headers:** The `headers` parameter allows you to set custom HTTP headers for your request. This can be useful for authentication or setting content types. Pass `NULL` if no headers are needed. +* **Error Handling:** You should implement proper error handling to ensure that your queries are resilient to network issues or API errors. Check the `status_code` in the response to determine if the request was successful. +* **Rate Limiting:** Be mindful of rate limits imposed by the API you are interacting with. You might need to implement retry mechanisms or introduce delays to avoid exceeding these limits. +* **Security:** If you are sending sensitive data, ensure that the connection to the API is secure (HTTPS) and consider using appropriate authentication methods. + + +By leveraging the `post` function, you can extend the functionality of BigQuery and seamlessly integrate it with other systems and services. This opens up a wide range of possibilities for automating tasks, enriching data, and building more dynamic data-driven applications. diff --git a/use_cases/precision_recall_auc.md b/use_cases/precision_recall_auc.md new file mode 100644 index 00000000..18655333 --- /dev/null +++ b/use_cases/precision_recall_auc.md @@ -0,0 +1,39 @@ +You're evaluating a machine learning model designed to predict customer churn for a telecommunications company. You have a dataset with customer features and a label indicating whether they churned (1) or not (0). 
Your model outputs a churn probability score for each customer. + +Here's how you would use the `precision_recall_auc` function in BigQuery to evaluate your model: + +```sql +SELECT bigfunctions.YOUR_REGION.precision_recall_auc( + ( + SELECT + ARRAY_AGG( + STRUCT( + predicted_churn_probability AS predicted_score, + churned AS label + ) + ) + FROM + `your_project.your_dataset.customer_churn_predictions` + ) +) AS auc_pr; +``` + + +**Explanation:** + +1. **`your_project.your_dataset.customer_churn_predictions`**: Replace this with the actual location of your BigQuery table containing the predictions. This table should have at least two columns: + * `predicted_churn_probability`: The predicted probability of churn (a floating-point number between 0 and 1). + * `churned`: The ground truth label (1 for churn, 0 for no churn). + +2. **`ARRAY_AGG(STRUCT(...))`**: This constructs an array of structs, where each struct contains the predicted score and the true label for a single customer. This is the required input format for the `precision_recall_auc` function. + +3. **`bigfunctions.YOUR_REGION.precision_recall_auc`**: Replace `YOUR_REGION` with the appropriate BigQuery region where your data resides (e.g., `us`, `eu`, `us-central1`). This function calculates the area under the precision-recall curve. + +4. **`AS auc_pr`**: This assigns the resulting AUC-PR value to a column named `auc_pr`. + + +**Why use AUC-PR in this case?** + +Churn prediction is often an imbalanced classification problem, meaning there are significantly more non-churners than churners. AUC-PR is a better metric than AUC-ROC for imbalanced datasets because it focuses on the positive class (churners in this case). A higher AUC-PR indicates a better model at identifying churners, even if they are a small portion of the overall customer base. + +By calculating the AUC-PR, you get a single number summarizing your model's performance, making it easier to compare different models or track the performance of a single model over time. diff --git a/use_cases/precision_recall_curve.md b/use_cases/precision_recall_curve.md new file mode 100644 index 00000000..e44cf085 --- /dev/null +++ b/use_cases/precision_recall_curve.md @@ -0,0 +1,35 @@ +You're evaluating a binary classification model (e.g., spam detection, fraud detection, disease diagnosis) and want to understand its performance across different thresholds. The `precision_recall_curve` function helps you analyze the trade-off between precision and recall. + +**Use Case: Optimizing a Fraud Detection Model** + +Imagine you've trained a model to predict fraudulent transactions. Each transaction is assigned a score between 0 and 1, representing the model's confidence that the transaction is fraudulent. You need to choose a threshold above which you flag a transaction as fraudulent. A higher threshold means higher precision (fewer false positives—legitimate transactions flagged as fraud) but lower recall (more false negatives—fraudulent transactions missed). + +Here's how `precision_recall_curve` helps: + +1. **Data Preparation:** You have a dataset with the predicted scores from your model and the ground truth labels (whether the transaction was actually fraudulent). This data is formatted as an array of structs, where each struct contains the `predicted_score` (float64) and the `ground_truth_label` (bool). + +2. 
**Calling the Function:** You use the `precision_recall_curve` function in your BigQuery query, passing in the array of structs: + + ```sql + SELECT * + FROM bigfunctions.your_region.precision_recall_curve( + [ + (0.1, false), -- Low score, not fraud + (0.4, false), -- Low score, not fraud + (0.35, true), -- Moderate score, fraud + (0.8, true), -- High score, fraud + (0.95, false), -- Very high score, surprisingly not fraud (potential outlier?) + (0.6, true), -- Moderate-high score, fraud + (0.2, false) -- Low score, not fraud + ] + ); + ``` + +3. **Interpreting the Results:** The function returns a table with `precision` and `recall` columns. Each row represents a different threshold, and the values show the precision and recall achieved at that threshold. By examining this curve: + + * **Visualization:** You can plot the precision-recall curve (precision on the y-axis, recall on the x-axis) to visualize the trade-off. + * **Threshold Selection:** You can identify the optimal threshold based on your specific business requirements. For fraud detection, you might prioritize high recall (catching most fraudulent transactions even if it means more false positives that you can investigate manually) or balance precision and recall based on the costs associated with each type of error. + * **Model Evaluation:** The overall shape of the curve tells you about the performance of your model. A curve closer to the top-right corner indicates a better-performing model. You can compare the precision-recall curves of different models to choose the best one. + * **Identifying Issues:** The example shows a case where a very high score (0.95) was associated with a non-fraudulent transaction. This could be a sign of an issue with your model or a data anomaly worth investigating. The precision-recall curve, combined with an understanding of your data, helps pinpoint such scenarios. + +In essence, the `precision_recall_curve` function provides a powerful tool for evaluating and fine-tuning your binary classification models, enabling you to make informed decisions about selecting the best operating point based on the desired balance between precision and recall. diff --git a/use_cases/prophet.md b/use_cases/prophet.md new file mode 100644 index 00000000..5acd5872 --- /dev/null +++ b/use_cases/prophet.md @@ -0,0 +1,51 @@ +A use case for this `prophet` BigQuery function would be forecasting future sales based on historical sales data. Imagine you have a table in BigQuery called `sales_data` with two columns: `date` (DATE) and `sales` (INTEGER). You want to predict sales for the next 7 days. + +```sql +SELECT bigfunctions.your_region.prophet( + ( + SELECT + ARRAY_AGG(JSON_ARRAY(CAST(date AS STRING), sales) ORDER BY date) + FROM + `your-project.your_dataset.sales_data` + ), + 7 +) AS forecasted_sales; + +``` + +Replace `your_region` with the appropriate BigQuery region for your dataset (e.g., `us`, `eu`, `us-central1`). This query will: + +1. **Prepare the input data:** The subquery selects the date and sales data from your `sales_data` table, converts the date to a string, uses `JSON_ARRAY` to create a [date, sales] pair for each row, and uses `ARRAY_AGG` to collect those pairs into a single array. This is the format expected by the `prophet` function. The data is ordered by date, which is crucial for time series forecasting. + +2. **Call the prophet function:** The `prophet` function is called with the JSON array of historical data and the number of periods (7 days) to forecast. + +3.
**Return the forecast:** The function returns a JSON array containing the forecasted sales for the next 7 days in the same [date, sales] format. The result is aliased as `forecasted_sales`. + +You can then use the forecasted sales data for inventory planning, resource allocation, and other business decisions. + + +**More advanced example with custom seasonality:** + +You can also pass additional parameters to the underlying Prophet model using the `kwargs` argument. For example, to add a weekly seasonality: + +```sql +SELECT bigfunctions.your_region.prophet( + ( + SELECT + ARRAY_AGG(JSON_ARRAY(CAST(date AS STRING), sales) ORDER BY date) + FROM + `your-project.your_dataset.sales_data` + ), + 7, + STRUCT(JSON'{"weekly_seasonality": true}' as kwargs) +) AS forecasted_sales_with_weekly_seasonality; +``` + +This allows you to customize the model to better fit your specific data and business needs, such as accounting for daily, weekly, or yearly seasonality. Refer to the Prophet documentation for a complete list of available parameters. + + +This example demonstrates how the `prophet` BigQuery function can be used for practical time series forecasting directly within BigQuery, simplifying the process and leveraging the power of Prophet without needing external libraries or tools. diff --git a/use_cases/quantize_into_bins.md b/use_cases/quantize_into_bins.md new file mode 100644 index 00000000..da8233f3 --- /dev/null +++ b/use_cases/quantize_into_bins.md @@ -0,0 +1,42 @@ +You could use this function to categorize website session durations into bins for analysis. Let's say you have a table of website session data with a `session_duration_seconds` column. You want to group these sessions into duration categories like "Short (0-30s)", "Medium (31-60s)", "Long (61-180s)", and "Very Long (181s+)". + +```sql +SELECT + user_id, + bigfunctions.us.quantize_into_bins(session_duration_seconds, [0, 30, 60, 180]) AS session_duration_category + FROM + `your_project.your_dataset.your_session_table` +``` + +This query would add a `session_duration_category` column to your results. For a session lasting 20 seconds, the category would be "[0, 30[", for 45 seconds it would be "[30, 60[", for 150 seconds it would be "[60, 180]", and for 200 seconds it would be "]180, +∞[". You can then use this new category for aggregation and reporting, such as: + +```sql +SELECT + session_duration_category, + COUNT(*) AS num_sessions, + AVG(pages_viewed) AS avg_pages_viewed + FROM ( + SELECT + user_id, + bigfunctions.us.quantize_into_bins(session_duration_seconds, [0, 30, 60, 180]) AS session_duration_category, + pages_viewed + FROM + `your_project.your_dataset.your_session_table` + ) + GROUP BY 1 + ORDER BY 1 +``` + +This would give you a summary table showing the number of sessions and average pages viewed for each session duration category. This allows you to analyze user behavior based on how long they spend on your website. + + +Other use cases include: + +* **Customer Segmentation by Purchase Value:** Categorize customers based on their total spending into different tiers (e.g., low, medium, high spenders). +* **Lead Scoring:** Assign leads to different score ranges based on factors like engagement and demographics. +* **Performance Analysis:** Group employees into performance categories based on metrics like sales or customer satisfaction scores. +* **Data Visualization:** Create histograms or other visualizations where data needs to be binned for clarity.
The output of `quantize_into_bins` can be used directly for grouping in chart creation. +* **Data Preprocessing for Machine Learning:** Binning continuous variables can be a useful preprocessing step for certain machine learning models. + + +Remember to replace `bigfunctions.us` with the appropriate dataset for your BigQuery region. diff --git a/use_cases/quantize_into_bins_with_labels.md b/use_cases/quantize_into_bins_with_labels.md new file mode 100644 index 00000000..c253d997 --- /dev/null +++ b/use_cases/quantize_into_bins_with_labels.md @@ -0,0 +1,66 @@ +A common use case for the `quantize_into_bins_with_labels` function is assigning letter grades to students based on their numerical scores. + +Imagine a grading system where: + +* 0-50: Fail +* 50-60: Wait for result exam +* 60-90: Pass +* 90-100: Pass with mention + +You have a table of student scores: + +```sql +CREATE TEMP TABLE StudentScores AS +SELECT 'Alice' AS student, 75 AS score UNION ALL +SELECT 'Bob', 55 AS score UNION ALL +SELECT 'Charlie', 92 AS score UNION ALL +SELECT 'David', 45 AS score UNION ALL +SELECT 'Eve', 105 AS score; +``` + +You can use the `quantize_into_bins_with_labels` function to assign letter grades: + +```sql +SELECT + student, + score, + bigfunctions.us.quantize_into_bins_with_labels(score, [0, 50, 60, 90, 100], ['Fail', 'Wait for result exam', 'Pass', 'Pass with mention']) AS grade +FROM + StudentScores; +``` + +This will return: + +``` ++---------+------+----------------------+ +| student | score | grade | ++---------+------+----------------------+ +| Alice | 75 | Pass | +| Bob | 55 | Wait for result exam | +| Charlie | 92 | Pass with mention | +| David | 45 | Fail | +| Eve | 105 | UNDEFINED_SUP | ++---------+------+----------------------+ +``` + +This clearly shows which grade each student receives based on their score. The `UNDEFINED_SUP` for Eve indicates her score is above the defined range. You could handle this by adding another bin (e.g., 100-110: Exceptional) or by using an n+1 label approach as shown in the documentation example 4. For example: + + +```sql +SELECT + student, + score, + bigfunctions.us.quantize_into_bins_with_labels(score, [0, 50, 60, 90, 100], ['Lower than very bad!', 'Fail', 'Wait for result exam', 'Pass', 'Pass with mention', 'Genius!']) AS grade +FROM + StudentScores; +``` + + +Other use cases could include: + +* **Categorizing customer spending:** Assign labels like "Low Spender," "Medium Spender," "High Spender" based on purchase amounts. +* **Classifying product sales:** Group products into "Low Sales," "Moderate Sales," "High Sales" categories based on units sold. +* **Defining age groups:** Assign age ranges to individuals like "Child," "Teenager," "Adult," "Senior." +* **Bucketing sensor data:** Categorize sensor readings into different levels (e.g., "Low," "Medium," "High") for easier analysis and alerts. + +Essentially, anytime you need to categorize continuous numeric data into discrete labeled bins, `quantize_into_bins_with_labels` can be helpful. diff --git a/use_cases/quantize_into_fixed_width_bins.md b/use_cases/quantize_into_fixed_width_bins.md new file mode 100644 index 00000000..954dc2aa --- /dev/null +++ b/use_cases/quantize_into_fixed_width_bins.md @@ -0,0 +1,45 @@ +**Use Case: Customer Segmentation based on Purchase Value** + +An e-commerce company wants to segment its customers based on their total purchase value over the last year. They want to create 5 segments of equal width, ranging from the lowest purchase value to the highest. 
+ +**Implementation with `quantize_into_fixed_width_bins`:** + +1. **Determine the minimum and maximum purchase values:** + ```sql + SELECT MIN(total_purchase_value) AS min_value, MAX(total_purchase_value) AS max_value + FROM customer_purchases; + ``` + Let's assume `min_value` is 0 and `max_value` is 1000. + +2. **Apply the `quantize_into_fixed_width_bins` function:** + ```sql + SELECT customer_id, total_purchase_value, + bigfunctions.us.quantize_into_fixed_width_bins(total_purchase_value, 0, 1000, 5) AS purchase_segment + FROM customer_purchases; + ``` + This will categorize each customer into one of the following segments: + + * `]-∞, 0[` (unlikely in this case, as purchase value should be non-negative) + * `[0, 200[` + * `[200, 400[` + * `[400, 600[` + * `[600, 800[` + * `[800, 1000]` + * `]1000, +∞[` + + +3. **Analyze and utilize the segments:** The company can now use these segments for targeted marketing campaigns, personalized recommendations, and other business strategies. For example, customers in the highest segment (`[800, 1000]` and `]1000, +∞[`) could receive exclusive offers or loyalty programs. + +**Benefits of using `quantize_into_fixed_width_bins`:** + +* **Simplified segmentation:** Easily creates equally sized bins, making it straightforward to understand and interpret the segments. +* **Flexibility:** The number of bins and the range can be adjusted to suit different segmentation needs. +* **Efficiency:** The function handles the binning logic within the SQL query, eliminating the need for complex pre-processing steps. + + +**Other Use Cases:** + +* **Categorizing website traffic:** Segmenting users based on time spent on site, number of pages viewed, or other metrics. +* **Analyzing sensor data:** Grouping sensor readings into bins for easier analysis and visualization. +* **Performance monitoring:** Classifying response times or error rates into different severity levels. +* **Creating histograms:** Generating histograms of data distributions using the binned values. diff --git a/use_cases/rare_values.md b/use_cases/rare_values.md new file mode 100644 index 00000000..ebbe2909 --- /dev/null +++ b/use_cases/rare_values.md @@ -0,0 +1,54 @@ +Let's say you have a dataset of e-commerce transactions and you want to identify potentially fraudulent orders based on unusual shipping addresses. You could use the `rare_values` function to find addresses that appear infrequently. + +**Scenario:** + +You have a table `orders` with a column `shipping_city`. Most orders are shipped to common cities, but fraudulent orders might be shipped to less common locations. + +**Query:** + +```sql +SELECT + shipping_city + FROM + `your-project.your_dataset.orders` + WHERE + shipping_city IN ( + SELECT + * + FROM + UNNEST( + bigfunctions.us.rare_values( + ( + SELECT + ARRAY_AGG(shipping_city) + FROM + `your-project.your_dataset.orders` + ), + 0.01 + ) + ) + ) + +``` + +**Explanation:** + +1. **`SELECT ARRAY_AGG(shipping_city) FROM your-project.your_dataset.orders`**: This subquery aggregates all the `shipping_city` values into a single array. +2. **`bigfunctions.us.rare_values(... , 0.01)`**: This calls the `rare_values` function with the array of cities and a `frequency_threshold` of 0.01. This means any city that appears in less than 1% of the orders will be considered "rare". +3. **`SELECT * FROM UNNEST(...)`**: This unnests the array of rare values returned by `rare_values` into individual rows. +4.
**`WHERE shipping_city IN (...)`**: This filters the original `orders` table to only include rows where the `shipping_city` is present in the list of rare cities. + +**Result:** + +The query will return a list of `shipping_city` values that are considered rare based on the defined threshold. You can then further investigate these orders to determine if they are potentially fraudulent. + + +**Other Use Cases:** + +* **Product Anomaly Detection:** Identify rarely purchased products, which could indicate data entry errors, discontinued items, or sudden changes in demand. +* **User Behavior Analysis:** Find users with uncommon activity patterns, which could be a sign of bots or malicious actors. +* **Error Detection in Logs:** Identify rare error messages in system logs, which might point to new or infrequent bugs. +* **Spam Detection:** Find rare words or phrases used in emails or messages, which could indicate spam or phishing attempts. + + +By adjusting the `frequency_threshold`, you can fine-tune the sensitivity of the rare value detection to suit your specific needs. diff --git a/use_cases/refresh_powerbi.md b/use_cases/refresh_powerbi.md new file mode 100644 index 00000000..faa56a9d --- /dev/null +++ b/use_cases/refresh_powerbi.md @@ -0,0 +1,67 @@ +A common use case for the `refresh_powerbi` function is automating the refresh of a Power BI dataset after data in its connected BigQuery tables has been updated. + +**Scenario:** Imagine you have a BigQuery data warehouse that is used as a source for a Power BI dashboard. You have a daily ETL process that updates several tables in BigQuery. After this process completes, you want to ensure that the Power BI dataset is refreshed so that the dashboard reflects the latest data. + +**Implementation:** You could use an orchestration tool like Airflow, Cloud Composer, or Cloud Functions to schedule the ETL process and the subsequent Power BI dataset refresh. After the ETL tasks have successfully completed, a final task would call the `refresh_powerbi` function. This function would trigger the refresh of the Power BI dataset using the provided credentials and parameters. + +**Example (using Airflow):** + +```python +from airflow import DAG +from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator +from datetime import datetime + +with DAG( + dag_id="refresh_powerbi_example", + start_date=datetime(2023, 10, 26), + schedule_interval="@daily", + catchup=False, +) as dag: + # ETL tasks (e.g., loading data into BigQuery) + etl_task_1 = BigQueryInsertJobOperator( + task_id="etl_task_1", + configuration={ + "query": { + "query": "your_etl_query_1", + "useLegacySql": False, + } + }, + ) + + etl_task_2 = BigQueryInsertJobOperator( + task_id="etl_task_2", + configuration={ + "query": { + "query": "your_etl_query_2", + "useLegacySql": False, + } + }, + ) + + + # Refresh Power BI dataset after ETL completes + refresh_powerbi_task = BigQueryInsertJobOperator( + task_id="refresh_powerbi", + configuration={ + "query": { + "query": f""" + SELECT bigfunctions.{your_region}.refresh_powerbi( + '{your_dataset_id}', + '{your_workspace_id}', + '{your_tenant_id}', + '{your_app_id}', + 'ENCRYPTED_SECRET({your_encrypted_token})', + NULL + ); + """, + "useLegacySql": False, + } + }, + ) + + + [etl_task_1, etl_task_2] >> refresh_powerbi_task + +``` + +Replace the placeholder values with your actual configuration. 
This setup ensures that the Power BI dataset is automatically refreshed after the ETL process finishes, keeping the dashboard up-to-date. This automation simplifies data management and provides users with the most current insights. diff --git a/use_cases/refresh_tableau.md b/use_cases/refresh_tableau.md new file mode 100644 index 00000000..16825965 --- /dev/null +++ b/use_cases/refresh_tableau.md @@ -0,0 +1,33 @@ +A common use case for the `refresh_tableau` function is automating the refresh of Tableau dashboards after underlying data has been updated. + +**Scenario:** Imagine a company that uses BigQuery to store sales data and Tableau to visualize this data in dashboards. They have a daily ETL process that updates the sales data in BigQuery. They want their Tableau dashboards to reflect this updated data automatically. + +**Implementation using `refresh_tableau`:** + +1. **Tableau Setup:** A personal access token is created in Tableau Server with appropriate permissions to refresh the target datasource or workbook. + +2. **BigQuery Implementation:** The `refresh_tableau` function is called within a BigQuery script, scheduled to run after the daily ETL process completes. This script would look something like this (using the US region example): + +```sql +-- Assume the ETL process has just finished updating sales data. + +SELECT bigfunctions.us.refresh_tableau( + 'Sales Dashboard', -- Replace with the actual workbook/datasource name + 'site_name', -- Replace with the Tableau site name + 'eu-west-1a.online.tableau.com', -- Replace with your Tableau server address + 'token_name', -- Replace with your token name + 'ENCRYPTED_SECRET(GvVm...)' -- Replace with your encrypted token secret +); +``` + +3. **Orchestration (Optional):** A workflow orchestration tool like Cloud Composer or Cloud Functions could be used to manage the dependencies between the ETL process and the BigQuery script. The orchestration tool would ensure that the `refresh_tableau` function is called only after the ETL process has successfully completed. + +**Benefits:** + +* **Automation:** Eliminates the need for manual refreshes, saving time and ensuring data consistency. +* **Data Freshness:** Dashboards always reflect the latest data. +* **Integration:** Seamlessly integrates with BigQuery ETL processes. +* **Centralized Management:** Tableau refresh logic is managed within BigQuery, simplifying administration. + + +This automation ensures that business users always have access to the most up-to-date insights in their Tableau dashboards without any manual intervention. The encrypted secret provides a secure way to manage the Tableau access token within the BigQuery environment. diff --git a/use_cases/remove_accents.md b/use_cases/remove_accents.md new file mode 100644 index 00000000..f096d809 --- /dev/null +++ b/use_cases/remove_accents.md @@ -0,0 +1,32 @@ +A use case for the `remove_accents` function is to standardize text data for searching, indexing, or comparison. For example, if you have a database of customer names with accents and you want to make it easier to search for names regardless of whether the user includes accents in their query, you can use this function. + +**Scenario:** + +You have a table of customer names in BigQuery, some of which contain accents: + +| customer_name | +|---| +| José Pérez | +| François Dupont | +| Anna Müller | + +You want to be able to search for "Jose Perez" and still find "José Pérez".
+ +**Query:** + +```sql +SELECT * +FROM your_table +WHERE bigfunctions.your_region.remove_accents(customer_name) = bigfunctions.your_region.remove_accents('Jose Perez'); +``` + +(Remember to replace `your_region` with the appropriate BigQuery region for your data, e.g., `us`, `eu`, `us-central1`, etc.) + +This query will remove accents from both the stored customer names and the search query, allowing you to find matches even if the accents are not typed precisely. + +**Other Use Cases:** + +* **Data Cleaning:** Removing accents can be a part of a broader data cleaning process to standardize text and remove inconsistencies. +* **Natural Language Processing (NLP):** Accents can sometimes interfere with NLP tasks like text classification or sentiment analysis. Removing them can improve the accuracy of these models. +* **Generating slugs or URL-friendly strings:** Accents can be problematic in URLs. Removing them can create cleaner and more readable slugs. +* **Matching data from different sources:** If you're combining data from multiple sources that might have different conventions for accents, removing them can help standardize the data and improve matching accuracy. diff --git a/use_cases/remove_extra_whitespaces.md b/use_cases/remove_extra_whitespaces.md new file mode 100644 index 00000000..bb7ed6d4 --- /dev/null +++ b/use_cases/remove_extra_whitespaces.md @@ -0,0 +1,32 @@ +You have a table of user-submitted comments where some users may have accidentally or intentionally added extra spaces within their text. This can affect analysis and presentation. You want to normalize the comments by removing extra spaces. + +**Example Table:** + +| comment_id | comment_text | +|------------|------------------------------------| +| 1 | " This is a comment . " | +| 2 | "Another comment." | +| 3 | " Yet another comment. " | + + +**Query using `remove_extra_whitespaces`:** + +```sql +SELECT + comment_id, + bigfunctions.us.remove_extra_whitespaces(comment_text) AS cleaned_comment_text +FROM + `your_project.your_dataset.your_comments_table`; +``` + +**Resulting Table:** + +| comment_id | cleaned_comment_text | +|------------|--------------------------| +| 1 | "This is a comment ." | +| 2 | "Another comment." | +| 3 | "Yet another comment." | + + + +By using the `remove_extra_whitespaces` function, the extra spaces within the comments are removed, leaving only single spaces between words and removing leading/trailing spaces. This makes the comments cleaner and easier to analyze, search, and present. For example, if you were doing sentiment analysis or keyword extraction, removing the extra spaces would improve the accuracy of your results. diff --git a/use_cases/remove_strings.md b/use_cases/remove_strings.md new file mode 100644 index 00000000..3fdaf773 --- /dev/null +++ b/use_cases/remove_strings.md @@ -0,0 +1,23 @@ +Let's say you have a dataset of product descriptions that are cluttered with promotional phrases like "Free Shipping!", "Limited Time Offer!", or "New Arrival!". You want to clean these descriptions to improve text analysis or create a more uniform presentation. + +Here's how `remove_strings` could be used: + +```sql +SELECT product_id, bigfunctions.us.remove_strings(description, ['Free Shipping!', 'Limited Time Offer!', 'New Arrival!']) AS cleaned_description +FROM product_descriptions; +``` + +This query would process each row in the `product_descriptions` table. 
For each product, the `remove_strings` function would remove any occurrences of "Free Shipping!", "Limited Time Offer!", or "New Arrival!" from the `description` field. The result would be stored in a new column called `cleaned_description`. + + +**Another example:** Imagine you have user-generated comments and want to remove common spam words or phrases. + +```sql +SELECT comment_id, bigfunctions.us.remove_strings(comment_text, ['[link removed]', 'click here', 'make money fast']) AS cleaned_comment +FROM user_comments; +``` + +This would remove instances of "[link removed]", "click here", and "make money fast" from the `comment_text`, resulting in a cleaner `cleaned_comment` field. + + +In essence, `remove_strings` is helpful anytime you need to remove a specific set of strings from a larger body of text for cleaning, pre-processing, or standardization purposes. diff --git a/use_cases/remove_value.md b/use_cases/remove_value.md new file mode 100644 index 00000000..e6988b6f --- /dev/null +++ b/use_cases/remove_value.md @@ -0,0 +1,50 @@ +Imagine you have a table of user preferences where each user has a list of favorite colors stored in an array. You want to remove a specific color from a user's preference list. + +**Table Schema:** + +```sql +CREATE OR REPLACE TABLE `your_project.your_dataset.user_preferences` ( + user_id INT64, + favorite_colors ARRAY<STRING> +); + +INSERT INTO `your_project.your_dataset.user_preferences` (user_id, favorite_colors) VALUES +(1, ['red', 'blue', 'green', 'yellow']), +(2, ['blue', 'green', 'purple']), +(3, ['red', 'orange', 'yellow']); +``` + +**Use Case: Removing 'blue' from user preferences:** + +```sql +SELECT + user_id, + bigfunctions.your_region.remove_value(favorite_colors, 'blue') AS updated_favorite_colors +FROM + `your_project.your_dataset.user_preferences`; +``` + +**Result:** + +``` ++---------+-------------------------+ +| user_id | updated_favorite_colors | ++---------+-------------------------+ +| 1 | ['red', 'green', 'yellow'] | +| 2 | ['green', 'purple'] | +| 3 | ['red', 'orange', 'yellow'] | ++---------+-------------------------+ +``` + +This query uses the `remove_value` function to remove the color 'blue' from each user's `favorite_colors` array. Users who didn't have 'blue' in their list remain unaffected. Replace `your_region` with the appropriate BigQuery region for your project (e.g., `us`, `eu`, `us-central1`). + +Other scenarios where `remove_value` can be useful: + +* **Product Recommendations:** Removing previously purchased items from a recommendation list. +* **Inventory Management:** Removing out-of-stock items from a product catalog. +* **Data Cleaning:** Removing specific erroneous values from a dataset. +* **Filtering Search Results:** Removing unwanted tags or categories from a search query. +* **Access Control:** Removing revoked permissions from a user's access list. + + +In essence, whenever you need to dynamically filter elements from an array based on their value, the `remove_value` function provides a concise and efficient solution. diff --git a/use_cases/remove_words.md b/use_cases/remove_words.md new file mode 100644 index 00000000..1e4071b9 --- /dev/null +++ b/use_cases/remove_words.md @@ -0,0 +1,23 @@ +A common use case for the `remove_words` function is cleaning text data by removing stop words or unwanted terms. + +**Example: Product Review Analysis** + +Imagine you have a dataset of product reviews and you want to perform sentiment analysis. Common words like "a," "the," "and," "is," etc.
(stop words) don't contribute much to the sentiment and can even skew the analysis. You can use `remove_words` to eliminate them: + +```sql +SELECT bigfunctions.us.remove_words(review_text, ['a', 'the', 'and', 'is', 'this', 'it', 'to', 'in', 'of', 'for', 'on', 'with', 'at', 'by', 'that', 'from']) AS cleaned_review +FROM `your_project.your_dataset.product_reviews`; +``` + +This query will process each `review_text` and return a `cleaned_review` with the specified stop words removed. This cleaned text can then be used for more accurate sentiment analysis or other text processing tasks. + +**Other Use Cases:** + +* **Data Preprocessing for Machine Learning:** Removing irrelevant or noisy words from text data before feeding it into a machine learning model can improve performance. +* **Spam Filtering:** Identifying and removing common spam words from emails or messages. +* **Content Filtering:** Blocking inappropriate or offensive language from user-generated content. +* **Keyword Extraction:** Removing common words to identify the most important keywords in a piece of text. +* **Search Optimization:** Cleaning search queries by removing unnecessary terms. + + +By customizing the `words_to_remove` array, you can tailor the `remove_words` function to various text cleaning and preprocessing tasks. diff --git a/use_cases/render_handlebars_template.md b/use_cases/render_handlebars_template.md new file mode 100644 index 00000000..875fc6fc --- /dev/null +++ b/use_cases/render_handlebars_template.md @@ -0,0 +1,57 @@ +Let's say you have a BigQuery table with customer data, including their name and purchase history. You want to generate personalized email greetings for each customer, incorporating details from their purchase history. The `render_handlebars_template` function makes this easy. + +**Example Scenario:** + +Your table, `customer_data`, looks like this: + +| customer_id | customer_name | last_purchase_date | last_purchase_amount | +|---|---|---|---| +| 1 | Alice | 2024-03-15 | 50.00 | +| 2 | Bob | 2024-03-22 | 100.00 | +| 3 | Carol | 2024-03-29 | 25.00 | + + +You could use the following query: + +```sql +SELECT + customer_id, + bigfunctions.us.render_handlebars_template( + """ + Hello {{customer_name}}, + + Thank you for your recent purchase on {{last_purchase_date}} for ${{last_purchase_amount}}. We appreciate your business! + """, + TO_JSON_STRING(STRUCT(customer_name, last_purchase_date, last_purchase_amount)) + ) AS personalized_email + FROM + `your-project.your_dataset.customer_data`; + +``` + +This query would produce a table with the `customer_id` and `personalized_email`: + + +| customer_id | personalized_email | +|---|---| +| 1 | Hello Alice,\n\nThank you for your recent purchase on 2024-03-15 for $50.00. We appreciate your business! | +| 2 | Hello Bob,\n\nThank you for your recent purchase on 2024-03-22 for $100.00. We appreciate your business! | +| 3 | Hello Carol,\n\nThank you for your recent purchase on 2024-03-29 for $25.00. We appreciate your business! | + + +**Explanation:** + +1. **Template:** The first argument to `render_handlebars_template` is the template string. It uses Handlebars syntax (`{{variable_name}}`) to denote placeholders that will be replaced with actual values. + +2. **Context:** The second argument is a JSON string representing the context. This provides the values for the placeholders in the template. `TO_JSON_STRING(STRUCT(...))` is used to convert the desired columns into a JSON object. + +3. 
**Result:** The function substitutes the values from the context into the template, generating the personalized email greeting for each customer. + +**Other Use Cases:** + +* **Generating dynamic reports:** Create report templates with placeholders for metrics, dates, and other data, then populate them using query results. +* **Creating custom error messages:** Craft more informative error messages by incorporating dynamic context from the data. +* **Formatting data for external APIs:** Prepare data in specific formats required by external services using templating. + + +This function provides a flexible and powerful way to generate dynamic text within BigQuery, improving tasks involving personalization, reporting, and data formatting. diff --git a/use_cases/render_template.md b/use_cases/render_template.md new file mode 100644 index 00000000..36700c09 --- /dev/null +++ b/use_cases/render_template.md @@ -0,0 +1,58 @@ +Let's say you have a BigQuery table with customer data, including their name and purchase history. You want to generate personalized email greetings for each customer using a template. + +**Table Example:** + +| customer_id | customer_name | last_purchase_date | +|---|---|---| +| 1 | Alice | 2023-10-26 | +| 2 | Bob | 2023-10-27 | +| 3 | Charlie | 2023-10-28 | + + +**Template:** + +``` +Hello {{ customer_name }}, + +Thank you for your recent purchase on {{ last_purchase_date }}. We appreciate your business! + +Sincerely, + +The Team +``` + +**BigQuery SQL using `render_template`:** + +```sql +SELECT + customer_id, + bigfunctions.us.render_template( + """ + Hello {{ customer_name }}, + + Thank you for your recent purchase on {{ last_purchase_date }}. We appreciate your business! + + Sincerely, + + The Team + """, + TO_JSON_STRING(STRUCT(customer_name, last_purchase_date)) + ) AS personalized_email + FROM + `your_project.your_dataset.your_customer_table` + +``` + +This query will generate a new column `personalized_email` containing the rendered email greeting for each customer. The `TO_JSON_STRING` function converts the `STRUCT` of `customer_name` and `last_purchase_date` into a JSON string which is then used as the context for the template. + + +**Result:** + +| customer_id | personalized_email | +|---|---| +| 1 | Hello Alice,\n\nThank you for your recent purchase on 2023-10-26. We appreciate your business!\n\nSincerely,\n\nThe Team | +| 2 | Hello Bob,\n\nThank you for your recent purchase on 2023-10-27. We appreciate your business!\n\nSincerely,\n\nThe Team | +| 3 | Hello Charlie,\n\nThank you for your recent purchase on 2023-10-28. We appreciate your business!\n\nSincerely,\n\nThe Team | + + +This demonstrates how `render_template` can be used for dynamic content generation based on data within BigQuery, useful for various applications like personalized emails, custom reports, or dynamic SQL query generation. You can use more advanced templating features like loops and conditional logic provided by nunjucks.js as well. diff --git a/use_cases/replace_special_characters.md b/use_cases/replace_special_characters.md new file mode 100644 index 00000000..208c90c1 --- /dev/null +++ b/use_cases/replace_special_characters.md @@ -0,0 +1,25 @@ +A use case for the `replace_special_characters` function is cleaning user-generated data before storing or processing it. Imagine you have a website where users can submit product reviews. These reviews might contain special characters like emoticons, punctuation marks beyond the standard set, or even unintended HTML entities. These characters can cause problems when: + +* **Storing data in a database:** Some databases may not handle certain special characters correctly, leading to errors or data corruption. +* **Displaying data:** Special characters may not render correctly on different browsers or devices, leading to a poor user experience. +* **Performing text analysis:** Special characters can interfere with natural language processing tasks like sentiment analysis or topic modeling. + +Using the `replace_special_characters` function, you could clean the user-submitted reviews before storing them in your database. For example: + +```sql +SELECT bigfunctions.us.replace_special_characters(review_text, ' ') AS cleaned_review +FROM `your_project.your_dataset.user_reviews`; +``` + +This query would replace all special characters in the `review_text` column with spaces, resulting in a cleaner version of the review text that is more suitable for storage, display, and analysis. This helps to ensure data consistency and improve the performance of downstream tasks. + + +Here's another example, focusing on creating URL-friendly strings (slugs): + +```sql +SELECT bigfunctions.us.replace_special_characters('This is a product title with special characters!@#$%^&*()', '-') AS url_slug +``` + +This would output `This-is-a-product-title-with-special-characters-------`, which, after removing repeating hyphens, could be used as a URL slug. + +In essence, the `replace_special_characters` BigQuery function assists in data sanitization and preparation for various uses by removing or replacing characters that could otherwise cause issues. diff --git a/use_cases/reverse_geocode.md b/use_cases/reverse_geocode.md new file mode 100644 index 00000000..8b5121d8 --- /dev/null +++ b/use_cases/reverse_geocode.md @@ -0,0 +1,19 @@ +A delivery company has a database of orders with latitude and longitude coordinates of delivery locations. They want to enrich this data with more detailed address information for reporting, analysis, and customer service purposes. + +They can use the `reverse_geocode` function to get the full address details for each delivery location. For example, if they have a delivery location with latitude 48.86988770000001 and longitude 2.3079341, they can use the following query in BigQuery: + +```sql +SELECT order_id, bigfunctions.eu.reverse_geocode(latitude, longitude) AS address_details +FROM `orders_table` +``` + +This will add a new column `address_details` to the `orders_table` containing the full address information for each order, including the formatted address, address components, place ID, and more.
This information can then be used to: + +* **Improve reporting:** Generate reports on deliveries by city, postal code, or other administrative area. +* **Enhance analysis:** Analyze delivery patterns and optimize routes based on address details. +* **Improve customer service:** Provide customer service representatives with accurate address information to resolve delivery issues. +* **Data validation:** Verify the accuracy of the provided latitude and longitude coordinates. +* **Geocoding database cleanup:** Identify and correct inaccurate or incomplete address information in their database. + + +Another use case could be for a real estate company that wants to analyze property values based on location details derived from latitude/longitude data. Or, a ride-sharing service might use this function to provide drivers with more detailed pickup/dropoff location information. diff --git a/use_cases/roc_auc.md b/use_cases/roc_auc.md new file mode 100644 index 00000000..69dd3746 --- /dev/null +++ b/use_cases/roc_auc.md @@ -0,0 +1,25 @@ +Let's say you're building a machine learning model in BigQuery to predict customer churn for a subscription service. You've trained your model and it outputs a `predicted_score` between 0 and 1 for each customer, where higher scores indicate a higher probability of churn. You also have the ground truth labels indicating whether each customer actually churned (`true`) or not (`false`). + +You can use the `roc_auc` function to evaluate the performance of your churn prediction model. Here's how: + +```sql +SELECT bigfunctions.us.roc_auc( + ( + SELECT + ARRAY_AGG(STRUCT(predicted_score, churned)) + FROM `your_project.your_dataset.your_predictions_table` + ) +); +``` + +* **`your_project.your_dataset.your_predictions_table`**: This table contains your model's predictions and the actual churn outcomes. It should have at least two columns: `predicted_score` (FLOAT64) and `churned` (BOOL). +* **`ARRAY_AGG(STRUCT(predicted_score, churned))`**: This gathers all the predictions and labels into an array of structs, which is the required input format for the `roc_auc` function. +* **`bigfunctions.us.roc_auc(...)`**: This calls the `roc_auc` function in the `us` region (replace with your appropriate region) with the array of structs. + +The query will return a single value representing the ROC AUC. This value will be between 0 and 1. A higher ROC AUC indicates a better performing model: + +* **ROC AUC = 1**: Perfect classifier. +* **ROC AUC = 0.5**: No better than random guessing. +* **ROC AUC = 0**: The classifier is always wrong (predicting positive when it's negative, and vice versa). + +By calculating the ROC AUC, you can quantify how well your churn prediction model distinguishes between customers who will churn and those who won't. This allows you to compare different models, tune hyperparameters, and ultimately select the best model for deployment. diff --git a/use_cases/roc_curve.md b/use_cases/roc_curve.md new file mode 100644 index 00000000..a33d53b0 --- /dev/null +++ b/use_cases/roc_curve.md @@ -0,0 +1,40 @@ +You're evaluating a new machine learning model designed to predict customer churn for a telecommunications company. You have a dataset with predicted churn probabilities (output of your model) and the actual churn outcomes (true or false) for a set of customers. You want to assess the performance of your model across different probability thresholds. The ROC curve is a perfect tool for this. 
Here's how you would use the `roc_curve` BigQuery function in this scenario: + +```sql +#standardSQL +WITH churn_predictions AS ( + SELECT + customer_id, + predicted_churn_probability, + CAST(churned AS BOOL) AS actual_churned + FROM + `your_project.your_dataset.customer_churn_data` +) + +SELECT * +FROM bigfunctions.your_region.roc_curve( + ( + SELECT + ARRAY_AGG( + STRUCT(predicted_churn_probability, actual_churned) + ) + FROM + churn_predictions + ) +) AS roc; + +``` + +**Explanation:** + +1. **`churn_predictions` CTE:** This selects the customer ID, the predicted churn probability from your model, and the actual churn outcome. The `CAST` converts the `churned` column (which may be stored as an integer or string) into the boolean `TRUE` or `FALSE` required by the `roc_curve` function. + +2. **`ARRAY_AGG`:** This aggregates the predicted probability and actual churn outcome from `churn_predictions` into an array of structs, which is the expected input format for the `roc_curve` function. + +3. **`bigfunctions.your_region.roc_curve(...)`:** This calls the `roc_curve` function with the array of structs. Remember to replace `your_region` with the appropriate BigQuery region (e.g., `us`, `eu`, `us-central1`). + +4. **`AS roc`:** This assigns the output of the function to a table alias `roc`. + +**Result and Interpretation:** + +The query will return a table with two columns: `false_positive_rate` and `true_positive_rate`. These represent the coordinates of the ROC curve. By plotting these points, you can visualize the trade-off between the model's sensitivity (true positive rate) and its specificity (1 - false positive rate) at various threshold settings. A higher area under the ROC curve (AUC) indicates better model performance. + + +This example demonstrates how `roc_curve` can be practically used to evaluate the performance of a binary classification model in a real-world business scenario. You could then use this information to choose an appropriate threshold for your model based on the desired balance between correctly identifying churned customers and minimizing false alarms. diff --git a/use_cases/run_python.md b/use_cases/run_python.md new file mode 100644 index 00000000..a4b1a712 --- /dev/null +++ b/use_cases/run_python.md @@ -0,0 +1,60 @@ +This `run_python` function allows you to execute arbitrary Python code within BigQuery. Here's a breakdown of potential use cases and how it addresses them: + +**1. Text Preprocessing/Natural Language Processing (NLP):** + +* **Stemming/Lemmatization:** The provided example demonstrates stemming words using the `snowballstemmer` library. This is useful for NLP tasks like text analysis, where you want to reduce words to their root form (e.g., "running," "runs," "ran" become "run"). Imagine you have a BigQuery table with product reviews. You could use `run_python` to stem the review text directly within BigQuery before feeding it into a sentiment analysis model. +* **Regular Expressions:** You can use Python's powerful `re` module for complex pattern matching and string manipulation in your data. For instance, extract specific information from text fields, validate data formats, or clean up inconsistent data. +* **Other NLP tasks:** Tokenization, part-of-speech tagging, named entity recognition – any Python NLP library that can be installed in the sandbox can be leveraged. + +**2. Data Cleaning and Transformation:** + +* **Custom logic:** Implement data transformations that are too complex for standard SQL functions.
This could include handling missing values in a specific way, recoding variables based on complex criteria, or applying custom business rules. +* **Date/Time manipulation:** Python's `datetime` module offers more flexibility than standard SQL for working with dates and times. You might use it to parse dates in unusual formats, calculate time differences, or handle time zones. +* **Numerical computations:** Perform complex calculations beyond basic arithmetic, such as using the `math` or `NumPy` libraries. + + +**3. User-Defined Functions (UDFs) with Python Flexibility:** + +* **Code Reusability:** While less performant than compiled UDFs, `run_python` offers a quick way to prototype and deploy UDF-like functionality without the need for separate deployment steps. +* **Complex logic encapsulation:** Package up complex logic within the function, making your SQL queries cleaner and easier to understand. + + +**4. Prototyping and Experimentation:** + +* **Quick tests:** Quickly test Python code snippets against your BigQuery data without leaving the BigQuery environment. This is great for exploratory data analysis or testing different transformations. +* **Library exploration:** Experiment with different Python libraries to see how they might be applied to your data. + + +**Example: Sentiment Analysis Preprocessing** + +Let's say you have a table called `product_reviews` with a column `review_text`. You could use `run_python` to perform basic sentiment preprocessing: + +```sql +SELECT + review_id, + bigfunctions.us.run_python( + ''' + import re + from snowballstemmer import stemmer + text = re.sub(r'[^\w\s]', '', text).lower() # Remove punctuation and lowercase + stemmer_en = stemmer('english') + stemmed_text = ' '.join(stemmer_en.stemWords(text.split())) + return stemmed_text + ''', + 're snowballstemmer', + TO_JSON(STRUCT(review_text as text)) + ) AS processed_review_text + FROM + `your_project.your_dataset.product_reviews`; + +``` + +This query removes punctuation, lowercases the text, and stems the words, preparing the `review_text` for further sentiment analysis. + +**Key Considerations:** + +* **Performance:** As noted in the documentation, `run_python` is relatively slow due to the sandboxed environment. For production-level, high-performance scenarios, consider using compiled UDFs instead. +* **Security:** The sandboxed environment limits network access and available libraries for security reasons. + + +This function provides a powerful way to bridge the gap between SQL and Python within BigQuery, enabling more complex data manipulation and analysis directly within your data warehouse. However, be mindful of the performance implications and security constraints. diff --git a/use_cases/sankey_chart.md b/use_cases/sankey_chart.md new file mode 100644 index 00000000..923b834c --- /dev/null +++ b/use_cases/sankey_chart.md @@ -0,0 +1,14 @@ +The `sankey_chart` function is best used when you want to visualize the flow of something between different stages or categories within BigQuery. Here are some use cases: + +* **E-commerce Customer Journey:** Track how users move through different stages of a purchase funnel (e.g., product view, add to cart, checkout, purchase). The thickness of the flow lines in the Sankey diagram would represent the number of users transitioning between each stage, highlighting bottlenecks or drop-off points. The input data would consist of tuples representing the source stage, destination stage, and the number of users making that transition. 
+ +* **Sales Lead Tracking:** Visualize the progression of leads through your sales pipeline (e.g., lead generation, qualification, proposal, negotiation, closed won/lost). This helps identify stages with low conversion rates and optimize the sales process. The input data would be similar to the e-commerce example, with tuples representing the sales stage transitions and the number of leads. + +* **Website User Flow:** Analyze how users navigate through your website, from the landing page to various sections and ultimately to a desired action (e.g., signup, purchase). This allows you to identify popular paths, areas of friction, and optimize website design for better user experience. The input data would represent transitions between website pages and the number of users navigating between them. + +* **Supply Chain Management:** Track the flow of goods and materials through different stages of your supply chain. This helps visualize dependencies, identify potential disruptions, and optimize logistics. Input data would represent movement of goods between locations or stages of production and the quantity of goods. + +* **Financial Transactions:** Visualize the flow of money between different accounts or entities. This can be used for fraud detection, financial analysis, or understanding complex financial networks. Input tuples would represent transfers between accounts and the amount transferred. + + +In each of these scenarios, the `sankey_chart` function takes the structured data from your BigQuery tables and generates an interactive HTML visualization that makes it easy to understand and analyze the flow patterns. The visualization can then be embedded in reports, dashboards, or presentations. diff --git a/use_cases/send_google_chat_message.md b/use_cases/send_google_chat_message.md new file mode 100644 index 00000000..63661f5a --- /dev/null +++ b/use_cases/send_google_chat_message.md @@ -0,0 +1,38 @@ +Here are a few use cases for the `send_google_chat_message` BigQuery function: + +**1. Data Monitoring and Alerting:** + +* **Threshold breaches:** Imagine you have a BigQuery table tracking website traffic. You can schedule a query to check if traffic drops below a certain threshold. If it does, use `send_google_chat_message` to send an alert to a Google Chat space dedicated to website monitoring. The message could include details like the current traffic level, the threshold breached, and a timestamp. +* **Data quality issues:** A scheduled query can check for data quality issues, such as null values in critical columns or inconsistencies between tables. If a problem is detected, the function can send a notification to the data engineering team's Google Chat space. +* **Job completion status:** After a long-running BigQuery job finishes (e.g., a large data import or a complex transformation), the function can send a message to the relevant team confirming completion (or failure, along with the error message). + +**2. Report Automation and Sharing:** + +* **Daily summaries:** Generate a daily summary of key business metrics from your BigQuery data and send it to a Google Chat space for management review. The message could be formatted as a table or a short bullet-point list. +* **Weekly performance reports:** Consolidate weekly performance data and send a report to the sales team's Google Chat space, highlighting top performers, areas for improvement, and key trends. 
+* **Ad-hoc data insights:** After running an exploratory query that reveals an interesting insight, use the function to share the finding with colleagues in a Google Chat space, along with a link to the BigQuery query. + +**3. Workflow Integration:** + +* **Triggering downstream actions:** When certain conditions are met in your BigQuery data (e.g., a new customer signs up, an order is placed), the function can send a message to a Google Chat space that integrates with other tools. This message could trigger a downstream action in another system, such as updating a CRM or sending a welcome email. +* **Human-in-the-loop processes:** Some data processes might require human intervention. The function can be used to notify a human operator in a Google Chat space when their input is needed. The operator can then take the necessary action and update the relevant data in BigQuery. + +**Example (Data Monitoring):** + +```sql +#standardSQL +DECLARE threshold INT64 DEFAULT 1000; +DECLARE current_traffic INT64; + +SET current_traffic = (SELECT COUNT(*) FROM `your-project.your_dataset.website_traffic` WHERE _PARTITIONTIME = CURRENT_DATE()); + +IF current_traffic < threshold THEN + SELECT bigfunctions.us.send_google_chat_message( + FORMAT("ALERT: Website traffic dropped below %d. Current traffic: %d", threshold, current_traffic), + "YOUR_WEBHOOK_URL" + ); +END IF; +``` + + +This demonstrates how `send_google_chat_message` can be integrated into a SQL script to provide real-time alerts based on data in BigQuery. You can adapt this pattern for various other use cases as needed. Remember to replace placeholders like `"YOUR_WEBHOOK_URL"`, `"your-project.your_dataset.website_traffic"`, and adjust the logic to suit your specific requirements. Also ensure you're calling the function from the correct regional dataset (e.g., `bigfunctions.us`, `bigfunctions.eu`). diff --git a/use_cases/send_mail.md b/use_cases/send_mail.md new file mode 100644 index 00000000..d5e92c9e --- /dev/null +++ b/use_cases/send_mail.md @@ -0,0 +1,94 @@ +This `send_mail` function has several practical use cases within BigQuery: + +**1. Data-Driven Alerting:** + +Imagine you have a BigQuery script that monitors website traffic. You could use `send_mail` to send an alert if traffic drops below a certain threshold. + +```sql +DECLARE low_traffic_threshold INT64 DEFAULT 1000; +DECLARE current_traffic INT64; + +SET current_traffic = (SELECT COUNT(*) FROM `your_project.your_dataset.website_traffic` WHERE _PARTITIONTIME = CURRENT_DATE()); + +IF current_traffic < low_traffic_threshold THEN + SELECT bigfunctions.us.send_mail( + 'admin@yourcompany.com', + 'Low Website Traffic Alert', + FORMAT('Website traffic dropped to %d today, below the threshold of %d', current_traffic, low_traffic_threshold), + null, + null + ); +END IF; +``` + +**2. Report Generation and Distribution:** + +You can generate reports within BigQuery and then email them directly using this function. The example in the documentation shows converting JSON to Excel and attaching it. You could adapt this for CSV reports as well: + +```sql +SELECT bigfunctions.us.send_mail( + 'marketing@yourcompany.com', + 'Weekly Sales Report', + 'Please find attached the weekly sales report.', + 'weekly_sales.csv', + (SELECT STRING_AGG(FORMAT('%t,%t', product_name, sales), '\n') FROM `your_project.your_dataset.sales_data` WHERE _PARTITIONTIME BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()) +); +``` + +**3. 
Scheduled Notifications:**
+
+Combine `send_mail` with BigQuery's scheduled queries to automate regular email updates. For example, send a daily summary of key metrics:
+
+```sql
+-- Scheduled Query Configuration (set in the BigQuery UI)
+-- Destination Table: None
+-- Schedule: Daily at 8:00 AM
+
+SELECT bigfunctions.us.send_mail(
+  'team@yourcompany.com',
+  'Daily Metrics Summary',
+  FORMAT("""
+    Total users: %d
+    Total revenue: %f
+    """,
+    (SELECT COUNT(DISTINCT user_id) FROM `your_project.your_dataset.user_activity` WHERE _PARTITIONTIME = CURRENT_DATE()),
+    (SELECT SUM(revenue) FROM `your_project.your_dataset.transactions` WHERE _PARTITIONTIME = CURRENT_DATE())
+  ),
+  null,
+  null
+);
+```
+
+**4. User-Specific Notifications (within a script):**
+
+You can loop over a result set with BigQuery scripting's `FOR ... IN` statement and send customized emails to different recipients based on data in the table. For example, sending personalized product recommendations:
+
+```sql
+FOR recommendation IN (
+  SELECT user_email, recommended_product
+  FROM `your_project.your_dataset.product_recommendations`
+) DO
+  SELECT bigfunctions.us.send_mail(
+    recommendation.user_email,
+    'Personalized Product Recommendation',
+    FORMAT('We recommend you check out: %s', recommendation.recommended_product),
+    null,
+    null
+  );
+END FOR;
+```
+
+These are just a few examples. The flexibility of `send_mail` allows it to be integrated into various data processing workflows within BigQuery, enhancing communication and automation. Remember to choose the correct regional dataset for the `bigfunctions` project based on your BigQuery data location.
diff --git a/use_cases/send_mail_with_excel.md b/use_cases/send_mail_with_excel.md new file mode 100644 index 00000000..94fc6a47 --- /dev/null +++ b/use_cases/send_mail_with_excel.md @@ -0,0 +1,44 @@
+A marketing analyst wants to send a weekly performance report to their team. They have a BigQuery table called `marketing.weekly_performance` that contains data on ad spend, impressions, clicks, conversions, and other relevant metrics.
+
+**Use Case:**
+
+Using the `send_mail_with_excel` function, the analyst can automate the process of:
+
+1. **Querying the BigQuery table:** The `table_or_view_or_query` parameter can be set to `marketing.weekly_performance`.
+2. **Converting the results to an Excel file:** The function automatically handles the conversion of the query results into an Excel file named, for example, `weekly_report.xlsx`.
+3. **Emailing the report:** The analyst can specify recipients (`to`), subject line (`subject`), and email body content (`content`). The Excel file will be attached to the email.
+
+**Example BigQuery SQL:**
+
+```sql
+call bigfunctions..send_mail_with_excel(
+  'marketing_team@company.com',
+  'Weekly Marketing Performance Report',
+  '''
+  Hello Team,
+
+  Please find attached the weekly marketing performance report.
+
+  Regards,
+  Marketing Analyst
+  ''',
+  'weekly_report.xlsx',
+  'marketing.weekly_performance'
+);
+```
+
+**Benefits:**
+
+* **Automation:** Eliminates the manual steps of querying, exporting to Excel, and emailing.
+* **Time-saving:** Frees up the analyst's time for more strategic tasks.
+* **Consistency:** Ensures that the report is delivered on time and in a consistent format.
+* **Collaboration:** Makes it easy to share the report with the entire marketing team.
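+
+Because the last argument is named `table_or_view_or_query`, a plain SQL query should also be accepted in place of the table name (worth verifying against the function's reference). A sketch that limits the export to the last seven days (`report_date` is a hypothetical column) and could be run from a weekly scheduled query:
+
+```sql
+call bigfunctions..send_mail_with_excel(
+  'marketing_team@company.com',
+  'Weekly Marketing Performance Report',
+  'Hello Team, please find attached this week''s performance report.',
+  'weekly_report.xlsx',
+  -- a query string instead of a table name (assumed to be supported)
+  'SELECT * FROM marketing.weekly_performance WHERE report_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)'
+);
+```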
+ + +**Other potential use cases:** + +* **Sales reporting:** Sending daily or weekly sales figures to the sales team. +* **Financial reporting:** Distributing monthly financial statements to stakeholders. +* **Customer support reporting:** Sharing weekly customer support metrics with the customer support team. +* **Automated alerts:** Triggering an email with relevant data when certain thresholds are met (e.g., a sudden drop in website traffic). This would likely require integrating the function within a scheduled query or other automated workflow. diff --git a/use_cases/send_slack_message.md b/use_cases/send_slack_message.md new file mode 100644 index 00000000..5b1acf8f --- /dev/null +++ b/use_cases/send_slack_message.md @@ -0,0 +1,37 @@ +A use case for the `send_slack_message` BigQuery function would be to **alert a team on Slack when a certain threshold is met in a BigQuery table**. + +For example, imagine you have a table monitoring website traffic, and you want to be notified if the error rate exceeds 5%. You could schedule a query to run periodically, calculate the error rate, and use the `send_slack_message` function to send a notification if the threshold is breached: + +```sql +#standardSQL +CREATE TEMP FUNCTION send_slack_message(message STRING, webhook_url STRING) RETURNS STRING + OPTIONS ( + library="gs://bigfunctions-europe-west1/lib/send_slack_message-v0.0.1.js", + endpoint="https://europe-west1-bigfunctions.cloudfunctions.net/send_slack_message-v0.0.1" -- Update to the same region as where your query is run. + ); + + +SELECT + IF(error_rate > 0.05, + bigfunctions.europe_west1.send_slack_message(FORMAT("Error rate exceeded 5%%! Current rate: %f", error_rate), "YOUR_WEBHOOK_URL"), + 'OK') AS notification_status + FROM ( + SELECT + COUNTIF(status_code >= 400) / COUNT(*) AS error_rate + FROM + `your-project.your_dataset.website_traffic` + WHERE _PARTITIONTIME BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR) AND CURRENT_TIMESTAMP() + ); + +``` + +This query calculates the error rate over the past hour. If the `error_rate` is greater than 0.05 (5%), it calls `send_slack_message` with a formatted message including the current error rate and sends it to the specified Slack webhook URL. Otherwise, it returns 'OK'. You can then schedule this query to run regularly in BigQuery. + +Other Use Cases: + +* **Data quality monitoring:** Alert the data engineering team if a data pipeline fails or produces unexpected results (e.g., null values in a critical column). +* **Report generation notification:** Send a message to a Slack channel when a scheduled report generation is complete. +* **Anomaly detection:** Notify relevant stakeholders when unusual patterns are detected in data, such as a sudden spike or drop in sales. +* **Resource usage alerts:** Send notifications if BigQuery storage or compute costs exceed a defined budget. + +Remember to replace `"YOUR_WEBHOOK_URL"` with the actual webhook URL for your Slack channel and adjust the region of the `bigfunctions` dataset according to your needs. Also, consider using environment variables or a secrets management solution to securely store your webhook URL. diff --git a/use_cases/send_sms.md b/use_cases/send_sms.md new file mode 100644 index 00000000..64547f54 --- /dev/null +++ b/use_cases/send_sms.md @@ -0,0 +1,52 @@ +A use case for the `send_sms` BigQuery function would be sending SMS notifications based on data changes or thresholds within BigQuery. + +**Scenario:** An e-commerce company uses BigQuery to store order data. 
They want to be notified via SMS when a high-value order is placed. + +**Implementation:** + +1. **BigQuery Table:** The company has a table called `orders` with columns like `order_id`, `order_total`, `customer_phone`. + +2. **Scheduled Query:** They create a scheduled query that runs every hour, checking for orders exceeding a certain value (e.g., $1000). + +3. **`send_sms` Integration:** Within the scheduled query, they incorporate the `send_sms` function. The query would look something like this (using the `us` region as an example, adjust according to your location): + +```sql +SELECT + bigfunctions.us.send_sms( + FORMAT("High-value order placed! Order ID: %s, Total: $%f", order_id, order_total), + customer_phone + ) + FROM + `your_project.your_dataset.orders` + WHERE order_total > 1000 + AND order_placed_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR) -- Only check last hour's orders + +``` + +**How it works:** + +* The scheduled query runs hourly. +* It filters for new orders placed in the last hour exceeding $1000. +* For each matching order, it calls the `send_sms` function. +* The function sends an SMS message to the `customer_phone` number with the order details. + + +**Other use cases:** + +* **Fraud detection:** Send an SMS alert to a security team when unusual activity is detected. +* **Appointment reminders:** Send SMS reminders to customers about upcoming appointments. +* **Low stock alerts:** Send an SMS to inventory managers when product stock falls below a threshold. +* **Service outages:** Notify relevant personnel via SMS when a service outage is detected. +* **Two-factor authentication:** Send a verification code via SMS for user login. + + +**Important considerations:** + +* **Cost:** Be mindful of the cost of sending SMS messages, especially for high-volume scenarios. +* **Privacy:** Ensure you comply with data privacy regulations related to phone numbers and user consent. +* **Error handling:** Implement error handling within your queries to manage situations where sending SMS messages fails (e.g., invalid phone numbers). The provided documentation doesn't show the full response structure, but you should check for error codes/messages within the returned JSON. +* **Rate limiting:** Be aware of any rate limits imposed by the SMS provider used by the `send_sms` function. You might need to implement logic to handle these limits. +* **Phone number format:** Ensure phone numbers are in the correct international format (e.g., +1 for US, +44 for UK, etc.). + + +By combining BigQuery's powerful data processing capabilities with the `send_sms` function, you can create real-time notification systems directly within your data warehouse. diff --git a/use_cases/send_teams_message.md b/use_cases/send_teams_message.md new file mode 100644 index 00000000..ddb00081 --- /dev/null +++ b/use_cases/send_teams_message.md @@ -0,0 +1,57 @@ +A use case for the `send_teams_message` BigQuery function would be to send notifications to a Microsoft Teams channel upon the completion of a BigQuery job or when specific conditions are met in your data. + +**Scenario 1: BigQuery Job Completion Notification:** + +Imagine you have a long-running BigQuery query that aggregates daily sales data. You want to be notified in your team's channel when the job finishes. You could create a scheduled query and then add a final step using the `send_teams_message` function. This step would execute only after the main query completes. + +```sql +-- Your main query to calculate daily sales +SELECT ... 
+FROM ... + +-- Send a Teams notification when the query is done +SELECT bigfunctions.us.send_teams_message( + CONCAT("Daily sales data aggregation complete! Total sales: $", SUM(daily_sales)), + "YOUR_WEBHOOK_URL" +); +``` + +**Scenario 2: Anomaly Detection Alert:** + +Suppose you're monitoring website traffic and want to be alerted if traffic drops below a certain threshold. You can set up a scheduled query to check the traffic data and use `send_teams_message` to send an alert if an anomaly is detected. + +```sql +-- Check for low website traffic +SELECT + CASE + WHEN current_traffic < 1000 THEN bigfunctions.us.send_teams_message( + "ALERT: Website traffic is unusually low!", + "YOUR_WEBHOOK_URL" + ) + ELSE CAST(NULL as STRING) -- Do nothing if traffic is normal + END +FROM + (SELECT COUNT(*) AS current_traffic FROM `your_project.your_dataset.website_traffic` WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) +; + +``` + +**Scenario 3: Data Validation Notification:** + +You can use this function to notify your team about data quality issues. For example, if a data validation check fails, send a message to Teams. + +```sql +-- Check for invalid records +SELECT + CASE + WHEN invalid_records > 0 THEN bigfunctions.us.send_teams_message( + CONCAT("Data validation failed! Found ", invalid_records, " invalid records."), + "YOUR_WEBHOOK_URL" + ) + ELSE CAST(NULL as STRING) + END +FROM + (SELECT COUNT(*) AS invalid_records FROM `your_project.your_dataset.your_table` WHERE some_validation_check IS FALSE); +``` + +These examples illustrate how `send_teams_message` can integrate BigQuery with Microsoft Teams for real-time notifications, allowing for proactive monitoring and faster responses to critical events. Remember to replace `"YOUR_WEBHOOK_URL"` with the actual webhook URL for your Teams channel and select the correct BigFunctions dataset based on your BigQuery region (e.g., `bigfunctions.eu`, `bigfunctions.asia_southeast1`). diff --git a/use_cases/sentiment_score.md b/use_cases/sentiment_score.md new file mode 100644 index 00000000..4323b836 --- /dev/null +++ b/use_cases/sentiment_score.md @@ -0,0 +1,20 @@ +A company wants to analyze customer feedback left on their website. They store the feedback text in a BigQuery table called `customer_feedback`. They can use the `sentiment_score` function to determine the sentiment (positive, negative, or neutral) of each feedback entry. + +```sql +SELECT + feedback_id, + feedback_text, + bigfunctions.us.sentiment_score(feedback_text) AS sentiment_score + FROM + `your-project.your_dataset.customer_feedback` +``` + +This query adds a new column called `sentiment_score` to the table. This score will be a numerical value indicating the sentiment. A higher score indicates more positive sentiment, while a lower score indicates more negative sentiment. They can then use this score to: + +* **Identify trends:** Track changes in overall customer sentiment over time. +* **Categorize feedback:** Group feedback into positive, negative, and neutral categories for easier analysis. +* **Prioritize responses:** Address negative feedback first to mitigate customer dissatisfaction. +* **Measure campaign effectiveness:** Analyze sentiment before and after a marketing campaign to gauge its impact. +* **Improve products/services:** Identify areas where customers express negative sentiment and use that information to make improvements. 
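+
+Building on the "Categorize feedback" point above, a follow-up query can bucket the scores into categories. This is a sketch: the ±0.2 cut-offs are arbitrary and should be tuned to the score range returned by the underlying model.
+
+```sql
+SELECT
+  feedback_id,
+  CASE
+    WHEN sentiment_score >= 0.2 THEN 'positive'
+    WHEN sentiment_score <= -0.2 THEN 'negative'
+    ELSE 'neutral'
+  END AS sentiment_category
+FROM (
+  SELECT
+    feedback_id,
+    bigfunctions.us.sentiment_score(feedback_text) AS sentiment_score
+  FROM `your-project.your_dataset.customer_feedback`
+);
+```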
+ +By applying this function to their existing feedback data, the company can gain valuable insights into customer opinions and make data-driven decisions to improve their business. diff --git a/use_cases/sleep.md b/use_cases/sleep.md new file mode 100644 index 00000000..a863f141 --- /dev/null +++ b/use_cases/sleep.md @@ -0,0 +1,31 @@ +The `sleep` function in BigQuery can be useful in a few scenarios, primarily related to testing and managing dependencies within scripts or workflows: + +1. **Testing BigQuery function performance:** You can use `sleep` to introduce controlled delays and measure the execution time of other BigQuery functions or queries. This allows you to benchmark performance and identify bottlenecks. + +2. **Simulating latency:** In testing scenarios, you might want to simulate real-world conditions where there are delays in data processing or availability. `sleep` can help mimic these latencies. + +3. **Managing dependencies in scripts:** If you have a BigQuery script where one part needs to complete before another begins, you can use `sleep` to ensure a certain time has passed before the dependent part executes. However, this is generally not the ideal way to handle dependencies within a BigQuery script. BigQuery scripting features like `WAIT` clauses for `MERGE` statements or explicitly checking for job completion status offer more robust solutions. `sleep` would be a less reliable approach as execution times can vary. + +4. **Rate limiting:** If you're interacting with an external API or service via BigQuery and need to adhere to rate limits, `sleep` can be used to pause execution for a specified duration between calls. However, dedicated rate limiting libraries or built-in functionality within the API or service itself would be preferable for more precise control. + +5. **Troubleshooting and debugging:** In some cases, introducing a delay with `sleep` can be helpful for debugging timing-related issues or examining intermediate states within a complex BigQuery script. + +**Example (Testing performance):** + +```sql +-- Measure the time taken to execute a complex query +SELECT bigfunctions.eu.sleep(5); -- Introduce a delay to clear the cache (less reliable, better alternatives exist) + +DECLARE start_time TIMESTAMP; +SET start_time = CURRENT_TIMESTAMP(); + +-- Your complex query here +SELECT * FROM large_table WHERE some_condition; + +DECLARE end_time TIMESTAMP; +SET end_time = CURRENT_TIMESTAMP(); + +SELECT TIMESTAMP_DIFF(end_time, start_time, SECOND) AS execution_time; +``` + +**Caveats:** While `sleep` can be useful in limited cases, relying heavily on it within production BigQuery scripts is generally discouraged. For dependency management, error handling, and performance optimization, using BigQuery's built-in features and best practices is more appropriate. Using `sleep` for rate limiting is also suboptimal; dedicated rate-limiting mechanisms are more robust. It's primarily useful for simple testing and debugging scenarios. diff --git a/use_cases/sort_values.md b/use_cases/sort_values.md new file mode 100644 index 00000000..cbc322a7 --- /dev/null +++ b/use_cases/sort_values.md @@ -0,0 +1,38 @@ +A use case for the `sort_values` function is preparing data for aggregation or other operations where the order of elements within an array matters. + +**Scenario:** You have a table storing the daily sales for different products, and you want to find the median sales value for each product over a week. 
+
+**Table:**
+
+| product_id | daily_sales |
+|---|---|
+| 1 | [10, 12, 8, 15, 11, 9, 13] |
+| 2 | [5, 7, 6, 8, 4, 9, 10] |
+| 3 | [20, 18, 22, 19, 21, 17, 23] |
+
+**Query:**
+
+```sql
+SELECT
+  product_id,
+  (
+    SELECT
+      CAST(sorted_sales[OFFSET(DIV(ARRAY_LENGTH(sorted_sales) - 1, 2))] AS BIGNUMERIC)
+    FROM
+      UNNEST([bigfunctions.YOUR_REGION.sort_values(daily_sales)]) AS sorted_sales
+  ) AS median_sales
+FROM
+  `your_project.your_dataset.your_table`
+```
+
+**Explanation:**
+
+1. **`bigfunctions.YOUR_REGION.sort_values(daily_sales)`**: This sorts the `daily_sales` array in ascending order for each product. Replace `YOUR_REGION` with your BigQuery region (e.g., `us`, `eu`, `us-central1`).
+2. **`UNNEST([...]) AS sorted_sales`**: Wrapping the sorted array inside another array and unnesting it produces a single row whose value is the sorted array itself, giving us an alias (`sorted_sales`) that we can index into by position.
+3. **`DIV(ARRAY_LENGTH(sorted_sales) - 1, 2)`**: This computes the zero-based middle index of the sorted array using integer division. For an even number of elements it picks the lower of the two middle values.
+4. **`sorted_sales[OFFSET(...)]`**: This retrieves the element at the middle index, giving you the median value. `OFFSET` requires an INT64, which `DIV` already returns.
+5. **`CAST(... AS BIGNUMERIC)`**: This is just to handle potential overflow if your sales numbers are very large. Adjust the data type as needed for your data.
+
+By sorting the array first, you can easily find the median value using the array's middle index. This wouldn't be reliable with the unsorted data. Similar logic could be used to calculate other quantiles or perform operations sensitive to the order of elements within the array.
diff --git a/use_cases/sort_values_desc.md b/use_cases/sort_values_desc.md new file mode 100644 index 00000000..0d9161df --- /dev/null +++ b/use_cases/sort_values_desc.md @@ -0,0 +1,33 @@
+You have a table of product sales with columns like `product_id` and `sales_amount`. You want to find the top 3 products by their highest single sale. You can use `sort_values_desc` within an aggregation to achieve this:
+
+```sql
+SELECT product_id
+FROM `your_project.your_dataset.your_sales_table`
+GROUP BY product_id
+ORDER BY bigfunctions.YOUR_REGION.sort_values_desc(ARRAY_AGG(sales_amount))[OFFSET(0)] DESC
+LIMIT 3
+```
+
+**Explanation:**
+
+1. **`ARRAY_AGG(sales_amount)`**: For each `product_id`, this gathers all the `sales_amount` values into an array.
+2. **`sort_values_desc(...)`**: This sorts the array of sales amounts in descending order. The highest sales amount will now be the first element in each array.
+3. **`ORDER BY ... [OFFSET(0)] DESC`**: BigQuery cannot order by an ARRAY expression directly, so the query orders by the first element of each sorted array, which is that product's highest sales amount.
+4. **`LIMIT 3`**: This returns only the top 3 `product_id`s based on the ordering.
+
+**Another Use Case (Data Cleaning):**
+
+Imagine you have a table with a column containing lists of dates (perhaps representing important events related to a customer). These date lists might be in any order. You want to consistently store these dates in descending chronological order.
You could use `sort_values_desc`: + +```sql +SELECT + customer_id, + bigfunctions.YOUR_REGION.sort_values_desc(dates_array) AS sorted_dates +FROM + `your_project.your_dataset.your_customer_table` +``` + +This would update or create a new column `sorted_dates` with the dates arranged from most recent to oldest. + + +Remember to replace `YOUR_REGION` with the appropriate BigQuery region (e.g., `us`, `eu`, `asia-northeast1`, etc.) that corresponds to your dataset's location. diff --git a/use_cases/sql_to_flatten_json_column.md b/use_cases/sql_to_flatten_json_column.md new file mode 100644 index 00000000..1b4d9802 --- /dev/null +++ b/use_cases/sql_to_flatten_json_column.md @@ -0,0 +1,52 @@ +You have a BigQuery table containing a JSON column called `data`, and you want to analyze specific fields within these JSON objects. Instead of repeatedly using `JSON_EXTRACT` or `JSON_VALUE` in your queries, you can use `sql_to_flatten_json_column` to generate a SQL query that extracts all the JSON fields into separate columns. This makes subsequent analysis easier and potentially more performant. + +**Use Case Example:** + +Let's say you have a table called `website_events` with a JSON column named `event_details`: + +``` +Table: website_events +Columns: event_id (INT64), event_timestamp (TIMESTAMP), event_details (STRING) + +Sample Data: +1, 2024-10-26 10:00:00, '{"eventType": "pageview", "pageUrl": "/home", "userId": "123"}' +2, 2024-10-26 10:01:00, '{"eventType": "click", "elementId": "button1", "userId": "456"}' +``` + + +You want to analyze the `eventType`, `pageUrl` (when available), and `userId` for all events. + +**Steps:** + +1. **Generate the flattening SQL:** + + In the BigQuery console, run the following query, replacing `..website_events` with the fully qualified table name and choosing the appropriate BigFunctions dataset for your region (e.g., `bigfunctions.us` for US, `bigfunctions.eu` for EU, etc.): + + ```sql + SELECT bigfunctions..sql_to_flatten_json_column(event_details, '..website_events.event_details'); + ``` + +2. **Execute the generated SQL:** The output of the above query will be a new SQL query that flattens the JSON. It will look something like this: + + ```sql + SELECT + *, + CAST(JSON_VALUE(`event_details`, '$.eventType') AS STRING) AS eventType, + JSON_VALUE(`event_details`, '$.pageUrl') AS pageUrl, + CAST(JSON_VALUE(`event_details`, '$.userId') AS STRING) AS userId + FROM + `..website_events` + ``` + +3. **Copy and run the generated SQL:** This final query will give you a table with individual columns for `eventType`, `pageUrl`, and `userId`. + + +**Benefits:** + +* **Simplified Queries:** Instead of constantly extracting JSON fields in every query, you have dedicated columns, making your queries cleaner and easier to read. +* **Potential Performance Improvement:** BigQuery can sometimes optimize queries against flattened data better than queries with repeated JSON extractions. +* **Data Exploration:** Flattening the JSON makes it easier to explore the data in the BigQuery UI and identify all the fields present in the JSON data. + + + +This approach is especially useful when you need to analyze the JSON data repeatedly or when the JSON structure is complex and contains numerous nested fields. diff --git a/use_cases/sum_values.md b/use_cases/sum_values.md new file mode 100644 index 00000000..e0c47689 --- /dev/null +++ b/use_cases/sum_values.md @@ -0,0 +1,36 @@ +You have a table of customer orders, and each order contains an array of item prices. 
You want to calculate the total value of each order. + +**Table Schema (Example):** + +```sql +CREATE OR REPLACE TABLE `your_project.your_dataset.orders` AS ( + SELECT 1 AS order_id, [10.50, 25.00, 5.99] AS item_prices UNION ALL + SELECT 2 AS order_id, [150.00, 12.75] AS item_prices UNION ALL + SELECT 3 AS order_id, [5.00, 5.00, 5.00, 5.00] AS item_prices +); +``` + +**Query using `sum_values`:** + +```sql +SELECT + order_id, + bigfunctions.us.sum_values(item_prices) AS total_order_value -- Replace 'us' with your region + FROM + `your_project.your_dataset.orders`; +``` + +**Result:** + +``` ++---------+-----------------+ +| order_id | total_order_value | ++---------+-----------------+ +| 1 | 41.49 | +| 2 | 162.75 | +| 3 | 20.00 | ++---------+-----------------+ +``` + + +This use case demonstrates how `sum_values` simplifies the process of summing elements within an array, eliminating the need for more complex SQL involving unnest and aggregate functions. It's a very practical application for e-commerce, inventory management, and other scenarios where you need to work with arrays of numeric values. diff --git a/use_cases/timestamp_from_unix_date_time.md b/use_cases/timestamp_from_unix_date_time.md new file mode 100644 index 00000000..3590053b --- /dev/null +++ b/use_cases/timestamp_from_unix_date_time.md @@ -0,0 +1,58 @@ +You have a table storing Unix timestamps (integers representing seconds since 1970-01-01 00:00:00 UTC). You want to convert these timestamps into BigQuery TIMESTAMP format, but at different levels of granularity. Here are a few use cases: + +* **Analyzing data by year:** You have event data with Unix timestamps and you want to analyze trends year by year. You can use `timestamp_from_unix_date_time(unix_timestamp, 'YEAR')` to truncate the timestamps to the beginning of each year, then group your data by this truncated timestamp. + +```sql +SELECT + bigfunctions.us.timestamp_from_unix_date_time(event_timestamp, 'YEAR') AS event_year, + COUNT(*) AS event_count +FROM + `your_project.your_dataset.your_table` +GROUP BY + event_year +ORDER BY + event_year; +``` + + +* **Generating reports by month:** You want to create monthly reports based on user activity. You have user activity timestamps stored as Unix timestamps. Use `timestamp_from_unix_date_time(unix_timestamp, 'MONTH')` to get the beginning of the month for each activity, and then aggregate data accordingly. + +```sql +SELECT + bigfunctions.us.timestamp_from_unix_date_time(activity_timestamp, 'MONTH') AS activity_month, + COUNT(DISTINCT user_id) AS active_users +FROM + `your_project.your_dataset.user_activity` +GROUP BY + activity_month +ORDER BY + activity_month; +``` + + +* **Data bucketing/aggregation:** You want to group events into hourly buckets. You can use `timestamp_from_unix_date_time(unix_timestamp, 'HOUR')` to truncate timestamps to the beginning of each hour, enabling easy grouping and aggregation. + + +```sql +SELECT + bigfunctions.us.timestamp_from_unix_date_time(event_timestamp, 'HOUR') AS event_hour, + SUM(event_value) AS total_value +FROM + `your_project.your_dataset.events` +GROUP BY + event_hour +ORDER BY + event_hour; + +``` + +* **Simplifying date comparisons:** Sometimes, you only care about the date part of a timestamp. Using `timestamp_from_unix_date_time(unix_timestamp, 'DAY')` effectively converts the Unix timestamp to a date, allowing for straightforward date comparisons without dealing with the time component. 
+ + +```sql +SELECT * +FROM `your_project.your_dataset.events` +WHERE bigfunctions.us.timestamp_from_unix_date_time(event_timestamp, 'DAY') = '2024-03-15'; +``` + +These examples demonstrate the flexibility of the function to handle different levels of time granularity based on the `date_time_part` argument, enabling a variety of time-based analysis and reporting tasks. Remember to replace `your_project.your_dataset.your_table` with your actual table information and the correct regional dataset for `bigfunctions`. diff --git a/use_cases/timestamp_to_unix_date_time.md b/use_cases/timestamp_to_unix_date_time.md new file mode 100644 index 00000000..a9458315 --- /dev/null +++ b/use_cases/timestamp_to_unix_date_time.md @@ -0,0 +1,50 @@ +**Use Case 1: Event Time Difference Calculation** + +Imagine you have a table of events with timestamps, and you want to calculate the time elapsed between events in a specific unit (e.g., days, hours, minutes). `timestamp_to_unix_date_time` can help achieve this. + +```sql +SELECT + event_id, + event_timestamp, + bigfunctions.YOUR_REGION.timestamp_to_unix_date_time(event_timestamp, 'SECOND') - + LAG(bigfunctions.YOUR_REGION.timestamp_to_unix_date_time(event_timestamp, 'SECOND')) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS time_difference_seconds +FROM + your_event_table +``` +This query calculates the difference in seconds between consecutive events for each user. You can change 'SECOND' to 'MINUTE', 'HOUR', 'DAY', etc., depending on the desired unit. + +**Use Case 2: Bucketing Events by Time Intervals** + +You might want to group events into specific time intervals for analysis, such as hourly, daily, or weekly buckets. `timestamp_to_unix_date_time` allows you to generate bucket identifiers. + +```sql +SELECT + event_id, + event_timestamp, + bigfunctions.YOUR_REGION.timestamp_to_unix_date_time(event_timestamp, 'HOUR') AS hour_bucket +FROM + your_event_table +``` +This query assigns each event to an hourly bucket based on its timestamp. Events within the same hour will have the same `hour_bucket` value. You can then use this `hour_bucket` for aggregation or filtering. + +**Use Case 3: Data Retention Policies** + +For implementing data retention policies, you can use `timestamp_to_unix_date_time` to identify data older than a specific period. + +```sql +SELECT + * +FROM + your_data_table +WHERE + bigfunctions.YOUR_REGION.timestamp_to_unix_date_time(CURRENT_TIMESTAMP(), 'DAY') - bigfunctions.YOUR_REGION.timestamp_to_unix_date_time(data_timestamp, 'DAY') > 30 -- Delete data older than 30 days +``` + +This query selects data older than 30 days. You can modify the condition and integrate it into a DELETE statement to automatically remove old data. + + +**Use Case 4: Simplified Date Arithmetic** + +Sometimes you need to perform date arithmetic but don't want to deal with complexities of date and timestamp functions. Converting to Unix time can simplify these calculations. For example, adding 7 days to a timestamp becomes as simple as adding 7 * 24 * 60 * 60 to the Unix timestamp representation. + +**Important Note:** Remember to replace `YOUR_REGION` with the appropriate BigQuery region (e.g., `us`, `eu`, `us-central1`) where you are running your query. diff --git a/use_cases/translate.md b/use_cases/translate.md new file mode 100644 index 00000000..686d0e67 --- /dev/null +++ b/use_cases/translate.md @@ -0,0 +1,8 @@ +A company has a database of customer reviews in various languages. 
They want to analyze the sentiment of these reviews but their sentiment analysis tool only works on English text. They can use the `translate` function within BigQuery to translate all reviews into English before processing them with the sentiment analysis tool. + +```sql +SELECT review_id, sentiment(bigfunctions..translate(review_text, 'en')) AS sentiment_score +FROM `project.dataset.reviews`; +``` + +Replacing `` with the appropriate BigQuery region for their dataset (e.g., `us`, `eu`, `europe-west1`). This query translates each `review_text` into English and then calculates the sentiment score using the hypothetical `sentiment` function. This allows the company to perform sentiment analysis on all reviews regardless of the original language. diff --git a/use_cases/translated_month_name.md b/use_cases/translated_month_name.md new file mode 100644 index 00000000..6ec692e9 --- /dev/null +++ b/use_cases/translated_month_name.md @@ -0,0 +1,54 @@ +A company has a table of sales data with a date column. They want to create a report that displays the month name in different languages based on the user's locale. They can use the `translated_month_name` function to achieve this. + +**Example Scenario:** + +The company operates in France and Spain. They have a BigQuery table called `sales` with columns `date` and `sales_amount`. + +```sql +CREATE OR REPLACE TABLE `your_project.your_dataset.sales` AS +SELECT DATE('2023-01-15') AS date, 1200 AS sales_amount UNION ALL +SELECT DATE('2023-02-20') AS date, 1500 AS sales_amount UNION ALL +SELECT DATE('2023-03-10') AS date, 1800 AS sales_amount UNION ALL +SELECT DATE('2023-04-05') AS date, 1100 AS sales_amount; +``` + +**Query for French Users:** + +```sql +SELECT + bigfunctions.eu.translated_month_name(date, 'fr') AS month_name_fr, + sales_amount +FROM + `your_project.your_dataset.sales`; +``` + +**Result:** + +| month_name_fr | sales_amount | +|---|---| +| janvier | 1200 | +| février | 1500 | +| mars | 1800 | +| avril | 1100 | + + +**Query for Spanish Users:** + +```sql +SELECT + bigfunctions.eu.translated_month_name(date, 'es') AS month_name_es, + sales_amount +FROM + `your_project.your_dataset.sales`; +``` + +**Result:** + +| month_name_es | sales_amount | +|---|---| +| enero | 1200 | +| febrero | 1500 | +| marzo | 1800 | +| abril | 1100 | + +This allows the company to generate reports tailored to different language preferences without needing complex case statements or separate tables for each language. The `translated_month_name` function simplifies the process of localizing date information. Remember to replace `your_project.your_dataset` and the region prefix (e.g. `eu`, `us`) as needed. diff --git a/use_cases/translated_weekday_name.md b/use_cases/translated_weekday_name.md new file mode 100644 index 00000000..bc483234 --- /dev/null +++ b/use_cases/translated_weekday_name.md @@ -0,0 +1,17 @@ +A company has a database of customer orders with timestamps. They want to generate reports based on the day of the week, but need the reports to be localized for different regions. + +For example, they might want to generate a report showing the total sales for each day of the week in French for their French-speaking customers, and a separate report in Spanish for their Spanish-speaking customers. + +Using the `translated_weekday_name` function, they can achieve this easily. They can query their order data, extract the weekday from the timestamp, and then use the function to translate the weekday name into the desired language. 
A simplified example in BigQuery SQL (assuming the dataset is in the EU region) would be: + +```sql +SELECT + bigfunctions.eu.translated_weekday_name(EXTRACT(DATE from order_timestamp), 'fr') AS french_weekday, + SUM(order_total) AS total_sales + FROM + `your_project.your_dataset.your_orders_table` + GROUP BY 1 + ORDER BY 1 +``` + +This would output a table showing the total sales for each day of the week, with the weekday name translated into French. They could then repeat the query with a different language code (e.g., 'es' for Spanish) to generate a localized report for a different region. diff --git a/use_cases/upload_table_to_gsheet.md b/use_cases/upload_table_to_gsheet.md new file mode 100644 index 00000000..b23fb927 --- /dev/null +++ b/use_cases/upload_table_to_gsheet.md @@ -0,0 +1,25 @@ +Here are a few use cases for the `upload_table_to_gsheet` function: + +**1. Reporting and Sharing Data:** + +* **Regular Reporting:** A marketing team could use this function to automatically export weekly or monthly website traffic data from a BigQuery table to a Google Sheet. This sheet could then be used for reporting, visualization, and sharing with stakeholders who may not have direct access to BigQuery. +* **Ad-hoc Data Extracts:** A business analyst might need to quickly extract a subset of customer data for a specific analysis. They could use `upload_table_to_gsheet` to pull the relevant data into a Google Sheet for easier manipulation and sharing with collaborators. +* **Data Sharing with External Parties:** You might need to share data with a client or partner who doesn't have access to your BigQuery project. Exporting the data to a Google Sheet offers a simple and accessible way to share information. + +**2. Collaboration and Data Entry:** + +* **Collaborative Data Editing:** A team working on a project might use a Google Sheet as a central hub for data entry and review. `upload_table_to_gsheet` could be used to seed the sheet with initial data from BigQuery, allowing the team to build upon it collaboratively. +* **Collecting Feedback:** You could upload survey results from BigQuery to a Google Sheet to facilitate collaborative analysis and discussion among team members. + +**3. Data Integration and Transformation:** + +* **Preprocessing Data for Other Tools:** Some tools and applications might not have direct integration with BigQuery. Exporting data to a Google Sheet can serve as an intermediary step, allowing you to format and prepare the data for import into those tools. +* **Manual Data Cleansing and Enrichment:** While BigQuery is powerful for data transformation, sometimes manual cleaning or enrichment is necessary. Exporting data to a Google Sheet provides a user-friendly interface for making such adjustments. + +**4. Small-Scale Data Backup:** + +* **Backing Up Important Tables:** For relatively small tables, `upload_table_to_gsheet` can be a simple way to create a backup copy in a different format. However, for large datasets, BigQuery's native backup and recovery mechanisms are more suitable. + +**Example Scenario:** + +An e-commerce company uses BigQuery to store sales data. Every Monday, the marketing team needs a report of the previous week's sales by product category. They could schedule a query to calculate this data and then use `upload_table_to_gsheet` to automatically export the results to a designated Google Sheet. This automates the reporting process and makes the data readily available for analysis and visualization. 
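+
+A call for this scenario might look like the sketch below. The argument list is an assumption (table reference, spreadsheet URL, worksheet name, write mode, mirroring the related `upload_to_gsheet` function) and should be checked against the function's reference before use:
+
+```sql
+SELECT bigfunctions.us.upload_table_to_gsheet(
+  'your_project.your_dataset.weekly_sales_by_category',          -- table to export (hypothetical table name)
+  'https://docs.google.com/spreadsheets/d/YOUR_SPREADSHEET_ID',  -- destination spreadsheet
+  'Weekly Sales',                                                 -- worksheet name (assumed parameter)
+  'write_truncate'                                                -- write mode (assumed parameter)
+);
+```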
diff --git a/use_cases/upload_to_gsheet.md b/use_cases/upload_to_gsheet.md new file mode 100644 index 00000000..cf89f515 --- /dev/null +++ b/use_cases/upload_to_gsheet.md @@ -0,0 +1,38 @@ +A marketing team wants to analyze the performance of their recent social media campaigns. They have the campaign data stored in a BigQuery table. To share this data with non-technical stakeholders who primarily use Google Sheets, they can utilize the `upload_to_gsheet` function. + + +Here's a breakdown of the use case: + +1. **Data Preparation in BigQuery:** The marketing team creates a BigQuery query to aggregate the relevant campaign data, such as campaign name, impressions, clicks, conversions, and cost. Let's assume the query results in a table named `campaign_performance`. + +2. **Converting to JSON:** They use BigQuery's `TO_JSON_STRING` function to convert the results of the `campaign_performance` table into a JSON array of objects, where each object represents a row of campaign data. + + ```sql + SELECT TO_JSON_STRING(t) + FROM `project.dataset.campaign_performance` AS t; + ``` + +3. **Uploading to Google Sheets:** They use the `upload_to_gsheet` function within BigQuery to upload this JSON data directly to a designated Google Sheet. + + ```sql + SELECT bigfunctions.us.upload_to_gsheet( + ( + SELECT TO_JSON_STRING(t) + FROM `project.dataset.campaign_performance` AS t + ), + 'https://docs.google.com/spreadsheets/d/YOUR_SPREADSHEET_ID', + 'Campaign Performance', + 'write_truncate' + ); + ``` + This code snippet does the following: + * Calls the `upload_to_gsheet` function from the appropriate regional dataset (e.g., `bigfunctions.us`). + * Passes the JSON string generated in the subquery as the `data` argument. + * Provides the URL of the target Google Sheet, replacing `YOUR_SPREADSHEET_ID` with the actual ID. + * Specifies the worksheet name as 'Campaign Performance'. + * Uses the `write_truncate` mode to overwrite the sheet if it already exists, ensuring they always have the latest data. Alternatively, they could use `write_append` to add new data to the existing sheet. + +4. **Sharing and Analysis in Google Sheets:** The Google Sheet is then shared with the non-technical stakeholders, who can easily access, visualize, and analyze the campaign performance data within their familiar spreadsheet environment. They can create charts, pivot tables, and use other Google Sheets features to gain insights. + + +This process automates the data transfer from BigQuery to Google Sheets, ensuring that stakeholders have up-to-date campaign performance data readily available for analysis and decision-making. It bridges the gap between technical data storage and non-technical data consumption, enabling broader access to critical business information. diff --git a/use_cases/upsert.md b/use_cases/upsert.md new file mode 100644 index 00000000..61a3ce14 --- /dev/null +++ b/use_cases/upsert.md @@ -0,0 +1,73 @@ +Let's illustrate the `upsert` function with a concrete use case: managing a product catalog in BigQuery. + +**Scenario:** You have a BigQuery table called `product_catalog` that stores information about your products. You receive regular updates about product information from various sources, and you need to efficiently update your `product_catalog` table with these changes. 
+ +**Table Schema (product_catalog):** + +* `product_id` (STRING): Unique identifier for each product (primary key) +* `name` (STRING): Product name +* `price` (NUMERIC): Product price +* `description` (STRING): Product description +* `last_updated` (TIMESTAMP): Timestamp indicating the last update time + +**Update Data:** You receive a new batch of product updates in another table or as the result of a query. This data may contain new products, updates to existing products, or even information about products that need to be removed. + +**Use Case Examples:** + +**1. Delta Update (Insert and Update):** + +You want to insert new products and update existing ones based on the latest information. You use the `delta` insertion mode and the `last_updated` field to determine the most recent record. + +```sql +CALL bigfunctions..upsert( + 'dataset_id.product_updates', -- Source table with updates + 'dataset_id.product_catalog', -- Destination table + 'delta', -- Insertion mode + ['product_id'], -- Primary key + 'last_updated' -- Recency field +); +``` + +This will: + +* **Insert:** Any new products (based on `product_id`) found in `product_updates` that are not present in `product_catalog`. +* **Update:** For products with matching `product_id` in both tables, the values in `product_catalog` will be updated with the values from `product_updates` if the `last_updated` timestamp in `product_updates` is more recent. + +**2. Full Merge (Insert, Update, and Delete):** + +You want to perform a complete synchronization of your product catalog. This means inserting new products, updating existing ones, and *deleting* products that are no longer present in the source data. You use the `full` insertion mode. + +```sql +CALL bigfunctions..upsert( + -- Query that selects active products from a larger dataset + 'SELECT * FROM dataset_id.all_products WHERE active = TRUE', + 'dataset_id.product_catalog', -- Destination table + 'full', -- Insertion mode + ['product_id'], -- Primary key + 'last_updated' -- Recency field +); +``` + +This will: + +* **Insert:** New products. +* **Update:** Existing products with more recent data. +* **Delete:** Products present in `product_catalog` but not returned by the source query (meaning they are no longer active). + + +**3. Insert Only:** + +If you only want to insert new products without updating existing ones: + +```sql +CALL bigfunctions..upsert( + 'dataset_id.new_products', -- Source table with new products + 'dataset_id.product_catalog', -- Destination table + 'insert_only', -- Insertion mode + ['product_id'], -- Primary key + NULL -- No recency field needed for insert only +); +``` + + +These examples demonstrate how the `upsert` function simplifies the process of merging data into a BigQuery table, handling various update scenarios with a single function call. Remember to replace `` with the appropriate BigQuery region (e.g., `us`, `eu`, `us-central1`). diff --git a/use_cases/url_decode.md b/use_cases/url_decode.md new file mode 100644 index 00000000..99cda0b1 --- /dev/null +++ b/use_cases/url_decode.md @@ -0,0 +1,31 @@ +You have a table in BigQuery that stores URLs, but some of these URLs are URL-encoded (meaning special characters are replaced with percent signs followed by hexadecimal codes). You want to decode these URLs to their original, readable form. + +**Example Scenario:** + +Let's say you have a table named `website_traffic` with a column called `encoded_url`. 
This column contains URL-encoded strings like this: + +``` +'https%3A%2F%2Fwww.example.com%2Fproducts%3Fid%3D123%26source%3Dgoogle' +``` + +You can use the `url_decode` function to decode these URLs within a query: + +```sql +SELECT url_decode(encoded_url) AS decoded_url +FROM `your_project.your_dataset.website_traffic`; +``` + +This query would produce a result set with a `decoded_url` column containing the properly decoded URLs: + +``` +https://www.example.com/products?id=123&source=google +``` + +**Use Cases:** + +* **Log Analysis:** Web server logs often store URLs in a URL-encoded format. Decoding them makes the logs more human-readable and easier to analyze. +* **Data Cleaning:** If you have URL data from different sources, some might be encoded and some might not. Using `url_decode` ensures consistency in your data. +* **Reporting:** Presenting decoded URLs in reports makes the information clearer and more understandable for stakeholders. +* **Data Integration:** If you're integrating data from a system that provides URL-encoded URLs, you'll need to decode them before storing or processing them in BigQuery. + +In essence, whenever you encounter URL-encoded strings in your BigQuery data and need to work with the actual URLs, the `url_decode` function becomes indispensable. diff --git a/use_cases/validate_address.md b/use_cases/validate_address.md new file mode 100644 index 00000000..31a50c5b --- /dev/null +++ b/use_cases/validate_address.md @@ -0,0 +1,34 @@ +A practical use case for the `validate_address` function is cleaning and standardizing address data in a customer database. + +Imagine you have a large table of customer data in BigQuery, including a column with their addresses. These addresses might have been entered manually or collected from various sources, leading to inconsistencies like: + +* **Different formats:** "123 Main St", "123 Main Street", "123 Main St.", etc. +* **Typos:** "123 Main Sreet", "124 Main St", etc. +* **Missing information:** Some addresses might be missing city, state, or zip code. + +You can use the `validate_address` function within a BigQuery query to process these addresses and improve their quality: + +```sql +SELECT + original_address, + bigfunctions.us.validate_address(original_address).result.address.formattedAddress AS standardized_address, + bigfunctions.us.validate_address(original_address).result.verdict.validationGranularity AS validation_granularity, + bigfunctions.us.validate_address(original_address).result.verdict.geocodeGranularity AS geocode_granularity +FROM + `your_project.your_dataset.your_customer_table`; +``` + +This query will: + +1. **Standardize the format:** The `formattedAddress` field in the function's output will provide a consistent format for all valid addresses. +2. **Correct minor errors:** The function can often correct typos and infer missing information. +3. **Identify invalid addresses:** By checking the `validationGranularity` and `geocodeGranularity`, you can identify addresses that are completely invalid or only partially valid (e.g., only the street is valid). You can then flag these addresses for manual review or further investigation. + +This standardized and validated address data can then be used for various purposes, such as: + +* **Geocoding:** Accurately map customer locations for visualizations or analyses. +* **Logistics:** Optimize delivery routes and calculate shipping costs. +* **Marketing:** Target specific geographic areas with advertising campaigns. 
+* **Data integration:** Improve the accuracy and consistency of data when integrating with other systems.
+
+By using the `validate_address` function, you can significantly enhance the quality and usability of your customer address data. This leads to more accurate analyses, improved operational efficiency, and better business decisions.
diff --git a/use_cases/weighted_average.md b/use_cases/weighted_average.md
new file mode 100644
index 00000000..155d7466
--- /dev/null
+++ b/use_cases/weighted_average.md
@@ -0,0 +1,26 @@
+A teacher wants to calculate the weighted average grade for a student. The student has two grades:
+
+* **Quiz:** Grade = 10, Weight = 1 (Quizzes are worth less)
+* **Exam:** Grade = 13, Weight = 2 (Exams are worth more)
+
+Using the `weighted_average` function, the calculation would be:
+
+```sql
+SELECT bigfunctions.{region}.weighted_average(grade, weight) AS weighted_average
+FROM (
+  SELECT 10 AS grade, 1 AS weight UNION ALL
+  SELECT 13 AS grade, 2 AS weight
+);
+```
+
+This would return 12, since (10 × 1 + 13 × 2) / (1 + 2) = 12. The exam grade (13) contributes more to the final average because it has a higher weight.
+
+**Other use cases:**
+
+* **Calculating average stock prices:** Where `element` is the price of the stock and `weight` is the number of shares held at that price.
+* **Determining the weighted average cost of capital:** Where `element` is the cost of each type of capital (debt, equity, etc.) and `weight` is the proportion of each type of capital in the company's capital structure.
+* **Computing the weighted average of customer satisfaction scores:** Where `element` is the satisfaction score and `weight` is the number of customers who gave that score.
+* **Creating a composite index from multiple indicators:** Where `element` is the value of each indicator and `weight` reflects the importance of each indicator in the overall index. For instance, a happiness index could be created by weighting factors like GDP per capita, life expectancy, and social support.
+
+
+In essence, anytime you need an average where some elements contribute more than others, the `weighted_average` function is useful.
diff --git a/use_cases/xml2json.md b/use_cases/xml2json.md
new file mode 100644
index 00000000..891433e8
--- /dev/null
+++ b/use_cases/xml2json.md
@@ -0,0 +1,41 @@
+Let's say you have a BigQuery table that stores product information, but some of that information is stored in XML format within a string column. You want to analyze this data using BigQuery's powerful SQL capabilities, but working directly with XML in SQL can be cumbersome. The `xml2json` function provides a solution.
+
+**Scenario:**
+
+Your table `products` has columns like `product_id`, `product_name`, and `product_details`. The `product_details` column contains XML data like this:
+
+```xml
+<product_details>
+  <color>Red</color>
+  <size>Large</size>
+  <price>25.99</price>
+</product_details>
+```
+
+**Use Case with `xml2json`:**
+
+You can use `xml2json` to convert the XML data into JSON within your SQL query, making it easier to access specific elements:
+
+```sql
+SELECT
+  product_id,
+  product_name,
+  JSON_VALUE(bigfunctions.us.xml2json(product_details), '$.product_details.color') AS color,
+  JSON_VALUE(bigfunctions.us.xml2json(product_details), '$.product_details.size') AS size,
+  CAST(JSON_VALUE(bigfunctions.us.xml2json(product_details), '$.product_details.price') AS NUMERIC) AS price
+FROM
+  products;
+```
+
+This query uses `xml2json` to convert the `product_details` XML into a JSON string.
Then, `JSON_VALUE` extracts the `color`, `size`, and `price` values using JSONPath expressions. This transforms the XML data into a more manageable format for analysis within BigQuery.
+
+
+**Other Potential Use Cases:**
+
+* **Data Transformation for downstream applications:** Convert XML data to JSON before exporting it to other systems that work better with JSON.
+* **Simplifying complex XML structures:** Transform complex, nested XML into a flatter JSON structure for easier querying and reporting.
+* **API Integration:** If an API returns data in XML format, `xml2json` can be used to convert the response into JSON within BigQuery for analysis.
+* **Log Processing:** If log files are stored in XML format, this function can convert them to JSON for easier parsing and analysis within BigQuery.
+
+
+By converting XML to JSON within BigQuery using `xml2json`, you unlock the power of BigQuery's JSON functions and make complex XML data more accessible for analysis and processing.
diff --git a/use_cases/xml_extract.md b/use_cases/xml_extract.md
new file mode 100644
index 00000000..bcc233b2
--- /dev/null
+++ b/use_cases/xml_extract.md
@@ -0,0 +1,60 @@
+Let's say you have a BigQuery table called `product_catalog` that stores product information, including an XML description field. The XML data might look like this:
+
+```xml
+<product>
+  <name>Awesome Gadget</name>
+  <features>
+    <feature>Long battery life</feature>
+    <feature>Waterproof</feature>
+  </features>
+  <price>99.99</price>
+</product>
+```
+
+**Use Case 1: Extracting Feature List**
+
+You want to analyze the most common product features. You can use `xml_extract` to pull out all the features into an array:
+
+```sql
+SELECT
+  product_id,
+  bigfunctions.us.xml_extract(xml_description, '/product/features/feature') AS features
+FROM
+  product_catalog;
+```
+
+This query would return a table with `product_id` and a `features` column containing an array of strings, like `["Long battery life", "Waterproof"]`. You can then unnest this array for further analysis.
+
+**Use Case 2: Finding Products within a Price Range**
+
+You want to find all products priced between $50 and $100. You can use `xml_extract` to extract the price and then filter based on its value:
+
+```sql
+SELECT
+  product_id,
+  CAST(bigfunctions.us.xml_extract(xml_description, '/product/price')[OFFSET(0)] AS BIGNUMERIC) AS price
+FROM
+  product_catalog
+WHERE CAST(bigfunctions.us.xml_extract(xml_description, '/product/price')[OFFSET(0)] AS BIGNUMERIC) BETWEEN 50 AND 100;
+```
+
+This query extracts the price, casts it to a numeric type (important!), and then filters the results. The `[OFFSET(0)]` is used since `xml_extract` returns an array, even for single elements.
+
+**Use Case 3: Checking for a Specific Feature**
+
+You want to find all products that have the "Waterproof" feature.
+
+```sql
+SELECT
+  product_id
+FROM
+  product_catalog
+WHERE 'Waterproof' IN (
+  SELECT feature FROM UNNEST(bigfunctions.us.xml_extract(xml_description, '/product/features/feature')) AS feature
+);
+```
+
+This query uses `UNNEST` to turn the array of features into individual rows and then filters based on the presence of "Waterproof".
+
+
+These are just a few examples. The key takeaway is that `xml_extract` allows you to query and analyze data embedded within XML structures stored in your BigQuery tables without needing complex string manipulation or external tools. This makes working with XML data in BigQuery significantly easier. Remember to replace `bigfunctions.us` with the appropriate dataset for your BigQuery region.
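+
+As a follow-up to Use Case 1, the sketch below shows one possible way to unnest the extracted feature array and count how often each feature appears across the catalog. It assumes the same `product_catalog` table and `xml_description` column used above, and that `xml_extract` returns an `ARRAY<STRING>` as described:
+
+```sql
+WITH extracted AS (
+  SELECT
+    product_id,
+    bigfunctions.us.xml_extract(xml_description, '/product/features/feature') AS features
+  FROM
+    product_catalog
+)
+SELECT
+  feature,
+  COUNT(*) AS nb_products  -- number of products listing this feature
+FROM
+  extracted,
+  UNNEST(features) AS feature
+GROUP BY
+  feature
+ORDER BY
+  nb_products DESC;
+```
+
+Each row of the result pairs a feature with the number of products that list it, which is a convenient starting point for the feature analysis described in Use Case 1.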
diff --git a/use_cases/z_scores.md b/use_cases/z_scores.md new file mode 100644 index 00000000..8efe9f14 --- /dev/null +++ b/use_cases/z_scores.md @@ -0,0 +1,35 @@ +A use case for the `z_scores` function is to identify outliers in a dataset. Let's imagine you have a table of website session durations in seconds: + +```sql +CREATE OR REPLACE TABLE `your_project.your_dataset.session_durations` AS +SELECT * FROM UNNEST([ + 10, 25, 30, 35, 40, 45, 50, 55, 60, 300, 65, 70, 75, 80, 85 +]) AS session_duration; +``` + +You suspect that the session duration of 300 seconds is an outlier. You can use `z_scores` to confirm this: + + +```sql +SELECT + session_duration, + bigfunctions.your_region.z_scores(ARRAY_AGG(session_duration) OVER ()) as z_score + FROM + `your_project.your_dataset.session_durations`; + +``` + +Replace `your_region` with your BigQuery region (e.g., `us`, `eu`, `us_central1`). + +This query will calculate the z-score for each session duration. The session with a duration of 300 seconds will likely have a z-score significantly higher than other sessions (above 2 or 3, depending on your data distribution), indicating it's an outlier. You could then filter based on the z-score to identify and potentially remove or further investigate these outlier sessions. + + +Other use cases include: + +* **Standardizing data:** Transforming data to have a mean of 0 and a standard deviation of 1, useful for comparing variables measured on different scales. +* **Anomaly detection:** Similar to outlier detection, but in a time-series context, identifying unusual fluctuations in metrics. +* **Machine learning preprocessing:** Many machine learning algorithms benefit from standardized input data. +* **Ranking and scoring:** Z-scores can provide a relative ranking of items based on their performance compared to the average. For example, ranking students based on their test scores. + + +Remember to choose the correct BigQuery region for the `bigfunctions` dataset based on where your data resides.
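+
+If you want to cross-check the result without the remote function call, here is a minimal sketch that computes comparable z-scores directly with BigQuery window functions, assuming the same `session_durations` table as above. Note that it uses `STDDEV` (the sample standard deviation), so values may differ slightly from the function's output if that uses the population standard deviation:
+
+```sql
+WITH scored AS (
+  SELECT
+    session_duration,
+    SAFE_DIVIDE(
+      session_duration - AVG(session_duration) OVER (),
+      STDDEV(session_duration) OVER ()
+    ) AS z_score  -- (value - mean) / standard deviation
+  FROM
+    `your_project.your_dataset.session_durations`
+)
+SELECT *
+FROM scored
+WHERE ABS(z_score) > 2;  -- flags likely outliers
+```
+
+On the sample data above, only the 300-second session exceeds this threshold, confirming it as the outlier.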