Memo for Google BigQuery

Useful commands/scripts for use on Google Cloud Platform.


Definition

BigQuery is the petabyte-scale data warehouse on GCP, intended for very large, largely immutable datasets.

Start the web console for BigQuery

Open https://bigquery.cloud.google.com directly, or navigate to BigQuery from the Google Cloud Console menu.

Query large datasets in seconds (Examples)

#standardsql
SELECT
  nppes_provider_state AS state,
  ROUND(SUM(total_claim_count) / 1e6) AS total_claim_count_millions
FROM
  `bigquery-public-data.medicare.part_d_prescriber_2014`
GROUP BY
  state
ORDER BY
  total_claim_count_millions DESC
LIMIT
  5;

Simple Query

SELECT
  airline,
  date,
  departure_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0
  AND departure_airport = 'LGA'
LIMIT
  100

To get both the number of delayed flights and the total number of flights in a single query:

SELECT
  f.airline,
  COUNT(f.departure_delay) AS total_flights,
  SUM(IF(f.departure_delay > 0, 1, 0)) AS num_delayed
FROM
   `bigquery-samples.airline_ontime_data.flights` AS f
WHERE
  f.departure_airport = 'LGA' AND f.date = '2008-05-13'
GROUP BY
  f.airline
LIMIT
  100  

Aggregate and Boolean functions

SELECT
  airline,
  COUNT(departure_delay)
FROM
   `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY airline

The same aggregation, restricted to delayed flights only:

SELECT
  airline,
  COUNT(departure_delay)
FROM
   `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0 AND
  departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY airline

String operations

SELECT
  CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
  `bigquery-samples.weather_geo.gsod`
WHERE
  station_number = 725030
  AND total_precipitation > 0

Join on Date

SELECT
  f.airline,
  SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
  COUNT(f.arrival_delay) AS total_flights
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
  SELECT
    CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
  FROM
    `bigquery-samples.weather_geo.gsod`
  WHERE
    station_number = 725030
    AND total_precipitation > 0) AS w
ON
  w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline

Subquery

SELECT
  airline,
  num_delayed,
  total_flights,
  num_delayed / total_flights AS frac_delayed
FROM (
SELECT
  f.airline AS airline,
  SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
  COUNT(f.arrival_delay) AS total_flights
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
  SELECT
    CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
  FROM
    `bigquery-samples.weather_geo.gsod`
  WHERE
    station_number = 725030
    AND total_precipitation > 0) AS w
ON
  w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline
  )
ORDER BY
  frac_delayed ASC

Acknowledgment: This lab is based on an article by Felipe Hoffa: https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8#.8oj2rp804

Upload and Export Data

Load data from a CSV file

  1. From the web GUI
  • Go to https://bigquery.cloud.google.com
  • You will need to use or create a dataset (within a given project) where the data will be stored.
  • Click the + next to your dataset and follow the instructions (table names in CAPITALS by convention).
  2. From the command-line tool: bq
  • Open Cloud Shell from the Google Cloud Console, which already has bq installed.
  • Download or create a schema file for your table.

You can check that file online: https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json

 cat schema_flight_performance.json 
[
    {
        "name": "YEAR",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "QUARTER",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "MONTH",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "DAY_OF_MONTH",
        "type": "INTEGER"
    },
    {
        "name": "FULL_DATE",
        "type": "STRING"
    },
...
    {
        "name": "SECURITY_DELAY",
        "type": "INTEGER"
    },
    {
        "name": "LATE_AIRCRAFT_DELAY",
        "type": "INTEGER"
    }
]

Type the following command to download schema_flight_performance.json (the schema file for the table in this example) to your working directory.

curl https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json -o schema_flight_performance.json

The JSON files containing the data for your table are now stored in a Google Cloud Storage bucket. They have URIs like the following: gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json

  • use bq and type the following command to create a table named flights_2014 in the cpb101_flight_data dataset, using data from files in Google Cloud Storage and the schema file stored on your virtual machine.

Note that your Project ID is stored as a variable in Cloud Shell ($DEVSHELL_PROJECT_ID) so there's no need for you to remember it. If you require it, you can view your Project ID in the command line to the right of your username (after the @ symbol).

bq load --source_format=NEWLINE_DELIMITED_JSON $DEVSHELL_PROJECT_ID:cpb101_flight_data.flights_2014 gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json ./schema_flight_performance.json

Note

There are multiple JSON files in the bucket, named according to the convention domestic_2014_flights_*.json. The wildcard (*) character is used to include all of the .json files in the bucket.

More on wildcards for better performance in this video.

Performance can also be improved at loading time with Table Partitioning using Time-partitioning tables.

Declare the table as partitioned at creation time using this flag: --time_partitioning_type, then filter your selection like so:

SELECT ...
FROM `sales`
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
    AND TIMESTAMP("2016-01-31")
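Creating the partitioned table itself can be sketched with bq as follows (the dataset and table names are placeholders, and the command requires an authenticated Cloud Shell or gcloud session):

```shell
# Create a day-partitioned table; queries can then prune work
# by filtering on the _PARTITIONTIME pseudo-column.
bq mk --time_partitioning_type=DAY mydataset.sales
```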

You can then analyze the performance of your query using Google Stackdriver:

  • check the multiple (sequential) stages of your query
  • check the number of rows processed at each stage, which you want to decrease rapidly (with adequate filtering)
  • Wait/Read/Compute/Write timings: these help you understand whether a stage is stuck, or suffers from tail skew (i.e. some units of work taking much longer than average!)

Once the table is created, type the following command to verify table flights_2014 exists in dataset cpb101_flight_data.

bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data

The output should look like the following:

pme@cloudshell:~ (southern-flash-208711)$ bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data
    tableId      Type    Labels   Time Partitioning
 -------------- ------- -------- -------------------
  AIRPORTS       TABLE
  flights_2014   TABLE


Export BigQuery table in a bucket using the web UI (to .csv, .json or .avro format)

  1. If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names must be globally unique (e.g. demo-export-table-bigquery).
  2. Back in the Google Cloud Console, use the menu to navigate to the BigQuery web UI.
  3. Select the AIRPORTS table that you created earlier and, using the "down" button to its right, select the option for Export Table.
  4. In the dialog, specify gs://<your-bucket-name>/bq/airports.csv and click OK (e.g. gs://demo-export-table-bigquery/bq/airports.csv).
  5. Browse to your bucket and ensure that the .csv file has been created.
  6. Test using a query:
SELECT *
FROM `southern-flash-208711.cpb101_flight_data.AIRPORTS`

Advanced Capabilities in BigQuery

WITH, COUNT(DISTINCT)

WITH WashingtonStations AS (
SELECT ...
...
)
SELECT washington_stations.name, ...
FROM WashingtonStations AS washington_stations
ORDER BY rainy_days DESC;
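A self-contained sketch of the WITH + COUNT(DISTINCT) combination, assuming the stn and prcp fields from the public NOAA GSOD dataset (where prcp = 99.99 encodes a missing value):

```sql
-- Count distinct rainy stations vs. total rainy station-days in 2015.
WITH RainyDays AS (
  SELECT stn, year, mo, da
  FROM `bigquery-public-data.noaa_gsod.gsod2015`
  WHERE prcp > 0 AND prcp < 99.99  -- 99.99 is the GSOD "missing" sentinel
)
SELECT
  COUNT(DISTINCT stn) AS rainy_stations,
  COUNT(*) AS rainy_station_days
FROM RainyDays;
```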

Processing BigQuery Data Types (ARRAY/STRUCT, JOIN condition, ...)

  • EXTRACT to extract a date from a timestamp.
SELECT
  author.email,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) diff
WHERE
  EXTRACT(YEAR
  FROM
    author.date)=2016
LIMIT 10
  • using ARRAY and STRUCT to build, for example, a table with several entries for a given date.
  • STARTS_WITH to find words that start with a prefix
  • JOIN condition example
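The STARTS_WITH bullet can be sketched against the public usa_names dataset (field names assumed from that dataset's schema):

```sql
-- Find names that start with a given prefix and rank them by frequency.
SELECT
  name,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_current`
WHERE
  STARTS_WITH(name, 'Al')
GROUP BY name
ORDER BY total DESC
LIMIT 5;
```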

Standard SQL functions

https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators

  • Aggregate functions
  • String functions (e.g. REGEXP_CONTAINS)
  • Analytic (window) functions
    • Standard aggregations (SUM, AVG, MIN, MAX, COUNT, ...)
    • Navigation functions (LEAD, LAG, NTH_VALUE, ...)
    • Ranking and numbering functions (CUME_DIST, DENSE_RANK, ROW_NUMBER, RANK, PERCENT_RANK, ...)
  • Datetime functions
  • Array functions
  • Other functions and operators

An example combining EXTRACT, REGEXP_EXTRACT and REGEXP_CONTAINS:

WITH
  commits AS (
  SELECT
    author.email,
    EXTRACT(DAYOFWEEK
    FROM
      author.date) BETWEEN 2
    AND 6 is_weekday,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR
    FROM
      author.date)=2016)
SELECT
  lang,
  is_weekday,
  COUNT(path) AS numcommits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
  lang,
  is_weekday
HAVING
  numcommits > 100
ORDER BY
  numcommits DESC

ref. The top weekend programming languages — based on GitHub’s activity
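The analytic (window) functions listed above all take an OVER clause; a minimal self-contained sketch (inline data, no external table needed):

```sql
-- RANK() numbers rows within each partition; SUM() OVER aggregates
-- across the partition without collapsing the rows.
WITH scores AS (
  SELECT 'a' AS player, 10 AS pts UNION ALL
  SELECT 'a', 30 UNION ALL
  SELECT 'b', 20 UNION ALL
  SELECT 'b', 20
)
SELECT
  player,
  pts,
  RANK() OVER (PARTITION BY player ORDER BY pts DESC) AS rnk,
  SUM(pts) OVER (PARTITION BY player) AS player_total
FROM scores;
```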

Date and time functions

Multiple ways to create a date:

  • DATE(year, month, day)
  • DATE(timestamp) and supports time zone
  • DATETIME(date, time) using DATE and TIME objects
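The three construction styles side by side, as a single query:

```sql
SELECT
  DATE(2016, 1, 31)                                          AS from_parts,
  DATE(TIMESTAMP "2016-01-31 12:00:00+00", "America/Los_Angeles") AS from_timestamp,
  DATETIME(DATE "2016-01-31", TIME "12:30:00")               AS from_date_and_time;
```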

User-defined functions (CREATE TEMPORARY FUNCTION)

Written in SQL or JavaScript, user-defined functions are temporary: they exist only for the duration of the query. They do have constraints, but they allow:

  • programming logic
  • loops, complex conditions, and non-trivial string parsing (not supported by standard SQL)
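A minimal temporary JavaScript UDF; the function name and body are illustrative:

```sql
-- The UDF is visible only to the query that follows it.
CREATE TEMPORARY FUNCTION greeting(name STRING)
RETURNS STRING
LANGUAGE js AS """
  return 'Hello, ' + name + '!';
""";

SELECT greeting('BigQuery') AS msg;
```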

Architecting a BigQuery project

Building a Mobile Gaming Analytics Platform - a Reference Architecture

More links