Useful commands/scripts for use on Google Cloud Platform.
- Definition
- Start web console for BigQuery
- Query large datasets in seconds (Examples)
- Load and Export Data
- Architecting a BigQuery project
- More links
BigQuery is the petabyte-scale data warehouse on GCP (for immutable, very large datasets).
Use one of these:
- https://bigquery.cloud.google.com
- https://console.cloud.google.com/ then, from the 'hamburger' menu, look for BigQuery under the Big Data section.
#standardSQL
SELECT
nppes_provider_state AS state,
ROUND(SUM(total_claim_count) / 1e6) AS total_claim_count_millions
FROM
`bigquery-public-data.medicare.part_d_prescriber_2014`
GROUP BY
state
ORDER BY
total_claim_count_millions DESC
LIMIT
5;
SELECT
airline,
date,
departure_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
departure_delay > 0
AND departure_airport = 'LGA'
LIMIT
100
To get both the number of delayed flights and the total number of flights in a single query:
SELECT
f.airline,
COUNT(f.departure_delay) AS total_flights,
SUM(IF(f.departure_delay > 0, 1, 0)) AS num_delayed
FROM
`bigquery-samples.airline_ontime_data.flights` AS f
WHERE
f.departure_airport = 'LGA' AND f.date = '2008-05-13'
GROUP BY
f.airline
LIMIT
100
SELECT
airline,
COUNT(departure_delay)
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
departure_airport = 'LGA'
AND date = '2008-05-13'
GROUP BY
airline
ORDER BY airline
SELECT
airline,
COUNT(departure_delay)
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
departure_delay > 0
AND departure_airport = 'LGA'
AND date = '2008-05-13'
GROUP BY
airline
ORDER BY airline
SELECT
CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
`bigquery-samples.weather_geo.gsod`
WHERE
station_number = 725030
AND total_precipitation > 0
SELECT
f.airline,
SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
COUNT(f.arrival_delay) AS total_flights
FROM
`bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
SELECT
CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
`bigquery-samples.weather_geo.gsod`
WHERE
station_number = 725030
AND total_precipitation > 0) AS w
ON
w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline
SELECT
airline,
num_delayed,
total_flights,
num_delayed / total_flights AS frac_delayed
FROM (
SELECT
f.airline AS airline,
SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
COUNT(f.arrival_delay) AS total_flights
FROM
`bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
SELECT
CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
`bigquery-samples.weather_geo.gsod`
WHERE
station_number = 725030
AND total_precipitation > 0) AS w
ON
w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline
)
ORDER BY
frac_delayed ASC
Acknowledgment: This lab is based on an article by Felipe Hoffa: https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8#.8oj2rp804
- From the web GUI
- Go to https://bigquery.cloud.google.com
- You will need to use/create a dataset (within a given project) where the data will be stored.
- click on the + next to your dataset and follow the instructions (table names in CAPITALS by convention)
- From the command line tool bq:
- open Cloud Shell from the Google Cloud Console, which already has bq installed
- download or create a schema for your table
You can check that file online: https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json
cat schema_flight_performance.json
[
{
"name": "YEAR",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "QUARTER",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "MONTH",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "DAY_OF_MONTH",
"type": "INTEGER"
},
{
"name": "FULL_DATE",
"type": "STRING"
},
...
{
"name": "SECURITY_DELAY",
"type": "INTEGER"
},
{
"name": "LATE_AIRCRAFT_DELAY",
"type": "INTEGER"
}
]
Type the following command to download schema_flight_performance.json (the schema file for the table in this example) to your working directory.
curl https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json -o schema_flight_performance.json
The JSON files containing the data for your table are now stored in a Google Cloud Storage bucket. They have URIs like the following: gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json
- use bq and type the following command to create a table named flights_2014 in the cpb101_flight_data dataset, using data from files in Google Cloud Storage and the schema file stored on your virtual machine.
Note that your Project ID is stored as a variable in Cloud Shell ($DEVSHELL_PROJECT_ID), so there's no need for you to remember it. If you require it, you can view your Project ID in the command line to the right of your username (after the @ symbol).
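The bq load command below assumes the target dataset already exists. If it doesn't, a minimal sketch to create it first (dataset name as used in this lab):
# Create the dataset that flights_2014 will be loaded into.
bq mk --dataset $DEVSHELL_PROJECT_ID:cpb101_flight_data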
bq load --source_format=NEWLINE_DELIMITED_JSON $DEVSHELL_PROJECT_ID:cpb101_flight_data.flights_2014 gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json ./schema_flight_performance.json
Note: there are multiple JSON files in the bucket, named according to the convention domestic_2014_flights_*.json. The wildcard (*) character is used to include all of the .json files in the bucket.
More on wildcards for better performance in this video.
Performance can also be improved at load time with table partitioning, using time-partitioned tables. Declare the table as partitioned at creation time using the --time_partitioning_type flag, then filter your selection like so:
SELECT ...
FROM `sales`
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
AND TIMESTAMP("2016-01-31")
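For instance, a minimal sketch of creating such a table with bq (the sales table name above and the schema file here are hypothetical):
# Create a day-partitioned table; queries filtering on _PARTITIONTIME
# then only scan the matching partitions.
bq mk --table --time_partitioning_type=DAY \
  $DEVSHELL_PROJECT_ID:cpb101_flight_data.sales \
  ./schema_sales.json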
You can then analyze the performance of your query using Google Stackdriver:
- check the multiple (sequential) stages of your query
- check the number of rows processed at each stage, which you want to see decreasing rapidly (with adequate filtering)
- the Wait / Read / Compute / Write timing breakdown helps you understand whether a stage is stuck, or has tail skew (i.e. some stages taking much longer than average!)
Once the table is created, type the following command to verify table flights_2014 exists in dataset cpb101_flight_data.
bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data
The output should look like the following:
pme@cloudshell:~ (southern-flash-208711)$ bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data
tableId Type Labels Time Partitioning
-------------- ------- -------- -------------------
AIRPORTS TABLE
flights_2014 TABLE
- If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique (e.g. demo-export-table-bigquery).
- Go back to the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI.
- Select the AIRPORTS table that you created earlier and, using the "down" button to its right, select the Export Table option.
- In the dialog, specify gs://<your-bucket-name>/bq/airports.csv and click OK (for me, gs://demo-export-table-bigquery/bq/airports.csv).
- Browse to your bucket and ensure that the .csv file has been created.
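The same export can also be scripted with bq extract (a sketch, using the bucket name from the example above):
# Export the AIRPORTS table to Cloud Storage as CSV.
bq extract --destination_format=CSV \
  $DEVSHELL_PROJECT_ID:cpb101_flight_data.AIRPORTS \
  gs://demo-export-table-bigquery/bq/airports.csv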
- Test using a query:
SELECT *
FROM `southern-flash-208711.cpb101_flight_data.AIRPORTS`
Use WITH to give names to subqueries and keep complex queries readable, e.g.:
WITH WashingtonStations AS (
SELECT ...
...
)
SELECT washington_stations.name, ...
FROM WashingtonStations AS washington_stations
ORDER BY rainy_days DESC;
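A runnable sketch of the same pattern, reusing the rainy-day query from the weather table above (station 725030):
-- Named subquery: one row per rainy day at the station.
WITH RainyDays AS (
  SELECT
    CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS day
  FROM
    `bigquery-samples.weather_geo.gsod`
  WHERE
    station_number = 725030
    AND total_precipitation > 0)
SELECT day
FROM RainyDays
ORDER BY day DESC
LIMIT 10;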
- EXTRACT to extract a part of a date or timestamp (e.g. the year):
SELECT
author.email,
diff.new_path AS path,
author.date
FROM
`bigquery-public-data.github_repos.commits`,
UNNEST(difference) diff
WHERE
EXTRACT(YEAR
FROM
author.date)=2016
LIMIT 10
- using ARRAY and STRUCT to build, for example, a table with several entries for a given date (see the sketch after this list)
- STARTS_WITH to find words that start with a given prefix
- JOIN condition examples
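A minimal sketch of ARRAY and STRUCT (the values here are made up): one row for a given date, carrying several entries in a single repeated field.
SELECT
  DATE '2016-01-01' AS day,
  -- An ARRAY of STRUCTs: several (event, amount) entries for one date.
  [STRUCT('rain' AS event, 12.3 AS amount),
   STRUCT('snow' AS event, 1.0 AS amount)] AS entries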
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators
- Aggregate functions
- String functions (e.g. REGEXP_CONTAINS)
- Analytic (window) functions (see the sketch below):
  - Standard aggregations (SUM, AVG, MIN, MAX, COUNT, ...)
  - Navigation functions (LEAD, LAG, NTH_VALUE, ...)
  - Ranking and numbering functions (CUME_DIST, DENSE_RANK, ROW_NUMBER, RANK, PERCENT_RANK, ...)
- Datetime functions
- Array functions
- Other functions and operators
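For instance, a small sketch of a ranking (window) function over the flights sample table used earlier:
SELECT
  airline,
  COUNT(*) AS num_flights,
  -- Analytic function evaluated over the grouped result.
  RANK() OVER (ORDER BY COUNT(*) DESC) AS airline_rank
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_airport = 'LGA'
GROUP BY
  airline
ORDER BY
  airline_rank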
WITH
commits AS (
SELECT
author.email,
EXTRACT(DAYOFWEEK
FROM
author.date) BETWEEN 2
AND 6 is_weekday,
LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
diff.new_path AS path,
author.date
FROM
`bigquery-public-data.github_repos.commits`,
UNNEST(difference) diff
WHERE
EXTRACT(YEAR
FROM
author.date)=2016)
SELECT
lang,
is_weekday,
COUNT(path) AS numcommits
FROM
commits
WHERE
LENGTH(lang) < 8
AND lang IS NOT NULL
AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
lang,
is_weekday
HAVING
numcommits > 100
ORDER BY
numcommits DESC
ref. The top weekend programming languages, based on GitHub's activity: https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8
Multiple ways to create a date:
- DATE(year, month, day)
- DATE(timestamp), which supports an optional time zone
- DATETIME(date, time), using DATE and TIME objects
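A quick sketch exercising all three constructors:
SELECT
  DATE(2016, 12, 25) AS from_parts,
  -- Converts the timestamp to a civil date in the given time zone.
  DATE(TIMESTAMP '2016-12-25 05:30:00+00', 'America/Los_Angeles') AS from_timestamp,
  DATETIME(DATE '2016-12-25', TIME '05:30:00') AS from_date_and_time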
User-defined functions (UDFs): written in SQL or JavaScript, they are temporary (they live only for the duration of a query), and they come with constraints. Use them for:
- programming logic
- loops, complex conditions, and non-trivial string parsing (not supported by standard SQL)
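A minimal sketch of a temporary JavaScript UDF (the function name and logic here are made up):
-- Temporary UDF: exists only for the duration of this query.
CREATE TEMPORARY FUNCTION greet(name STRING)
RETURNS STRING
LANGUAGE js AS """
  return 'Hello, ' + name + '!';
""";
SELECT greet('BigQuery') AS greeting;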
Building a Mobile Gaming Analytics Platform - a Reference Architecture