Dataset
Global summary of day data for 18 surface meteorological elements are derived from the synoptic/hourly observations. The daily elements included in the dataset (as available from each station) are as follows.
Table 1: Weather variables provided by the dataset.
Observation | Precision | Engineering Units |
---|---|---|
mean temperature | 0.1 | Fahrenheit |
mean dew point | 0.1 | Fahrenheit |
mean sea level pressure | 0.1 | mbar |
mean station pressure | 0.1 | mbar |
mean visibility | 0.1 | miles |
mean wind speed | 0.1 | knots |
maximum sustained wind speed | 0.1 | knots |
maximum wind gust | 0.1 | knots |
maximum temperature | 0.1 | Fahrenheit |
minimum temperature | 0.1 | Fahrenheit |
precipitation amount | 0.01 | inches |
snow depth | 0.1 | inches |
Indicators for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel are also included.
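Because Table 1 reports values in imperial and nautical units, a small set of conversion helpers may be useful when comparing against metric sources. The following is a minimal sketch; the function names are hypothetical, and it assumes that visibility is given in statute miles. Pressure in mbar is numerically identical to hPa and needs no conversion.

```python
# Conversion helpers for the units listed in Table 1 (hypothetical names).

def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature or dew point from Fahrenheit to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def knots_to_mps(knots: float) -> float:
    """Convert wind speed from knots to meters per second (1 knot = 1852 m/h)."""
    return knots * 1852.0 / 3600.0


def miles_to_km(miles: float) -> float:
    """Convert visibility from statute miles (assumed) to kilometers."""
    return miles * 1.609344


def inches_to_mm(inches: float) -> float:
    """Convert precipitation or snow depth from inches to millimeters."""
    return inches * 25.4
```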
The dataset is referenced on Kaggle and available via the Google BigQuery API. For the purposes of this competition, the data has already been acquired and staged on the cluster.
As an added challenge, the dataset has been transformed from its original format and has been unpivoted into a normalized table. This has been done to increase the difficulty and represents a typical scenario for practicing data scientists. Often, the most difficult part of a project is wrangling the data from a problematic format into something more manageable and/or efficient.
You will find the file, `gsod.obs.csv`, in a publicly accessible directory on the cluster. There are 2,291,730,018 records across six columns, making up 76 gigabytes in total.
/home/glentner/public/datasets/noaa/gsod.obs.csv
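At 76 gigabytes, the file is unlikely to fit in memory, so one option is to process it one row at a time. The sketch below counts records per `measure_id` using only the Python standard library. It assumes the file is comma-delimited with a header row naming the six columns, which you should verify against the actual file; the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_by_measure(lines):
    """Count records per measure_id, reading one row at a time."""
    counts = Counter()
    for record in csv.DictReader(lines):  # assumes a header row is present
        counts[record["measure_id"]] += 1
    return counts


# Invented sample rows standing in for the real file.
sample = io.StringIO(
    "station_id,wban_id,datetime,measure_id,measure_value,measure_ref\n"
    "000001,99999,2019/01/01,0001,27.5,24\n"
    "000001,99999,2019/01/01,0009,31.1,0\n"
    "000001,99999,2019/01/02,0001,25.0,24\n"
)
counts = count_by_measure(sample)
print(counts["0001"], counts["0009"])  # prints: 2 1
```

To run the same function against the staged data, pass it an open file handle, for example `open("/home/glentner/public/datasets/noaa/gsod.obs.csv", newline="")`.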
Table 2: Column descriptions.
Column | Description |
---|---|
`station_id` | Station number (WMO/DATSAV3 number) for the location. |
`wban_id` | WBAN number where applicable. WBAN is the acronym for the historical "Weather Bureau Air Force Navy" number. |
`datetime` | yyyy/mm/dd of the observation. |
`measure_id` | Unique identifier for the type of observation. |
`measure_value` | Numerical value of the observation. |
`measure_ref` | Reference value (variable). |
The `measure_ref` column contains a second numerical field that gives additional context to the observation. Its meaning depends on which variable is in question. For the primary measures it gives the count of observations used in the summary. For others it is a flag indicating some additional context.
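Since each row holds a single measurement, getting back to one record per station and day means regrouping rows by `(station_id, wban_id, datetime)`. The following is a rough sketch of that pivot, not the organizers' method. It assumes each row arrives as a 6-tuple of strings in the column order of Table 2 and that an empty string marks a missing value; the sample rows are invented. On the full file this would need to be applied per station or per chunk rather than all at once.

```python
from collections import defaultdict


def pivot(records):
    """Collect long-format rows into one dict per station and day.

    Each record is a 6-tuple in the column order of Table 2.
    """
    wide = defaultdict(dict)
    for station_id, wban_id, day, measure_id, value, ref in records:
        # Assumption: an empty string marks a missing value.
        parsed = float(value) if value != "" else None
        wide[(station_id, wban_id, day)][measure_id] = (parsed, ref)
    return dict(wide)


# Invented rows for illustration only.
rows = [
    ("000001", "99999", "2019/01/01", "0001", "27.5", "24"),
    ("000001", "99999", "2019/01/01", "0009", "31.1", "0"),
    ("000001", "99999", "2019/01/02", "0001", "25.0", "24"),
]
table = pivot(rows)
```

Each entry of `table` maps a `measure_id` to its `(measure_value, measure_ref)` pair, so the reference flag stays attached to the value it qualifies.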
Table 3: Observation types and their `measure_ref`.
`measure_id` | `measure_value` | `measure_ref` |
---|---|---|
0001 | mean temperature | count of observations |
0002 | mean dew point | count of observations |
0003 | mean sea level pressure | count of observations |
0004 | mean station pressure | count of observations |
0005 | mean visibility | count of observations |
0006 | mean wind speed | count of observations |
0007 | maximum sustained wind speed | null |
0008 | maximum wind gust | null |
0009 | maximum temperature | 0 indicates max temp was taken from the explicit max temp report and not from the hourly data; 1 indicates max temp was derived from the hourly data (i.e., highest hourly or synoptic-reported temperature) |
0010 | minimum temperature | 0 indicates min temp was taken from the explicit min temp report and not from the hourly data; 1 indicates min temp was derived from the hourly data (i.e., lowest hourly or synoptic-reported temperature) |
0011 | precipitation amount | * |
0012 | snow depth | 1 = yes, 0 = no/not reported |
0013 | fog | 1 = yes, 0 = no/not reported |
0014 | rain / drizzle | 1 = yes, 0 = no/not reported |
0015 | snow / ice pellets | 1 = yes, 0 = no/not reported |
0016 | hail | 1 = yes, 0 = no/not reported |
0017 | thunder | 1 = yes, 0 = no/not reported |
0018 | tornado / funnel cloud | 1 = yes, 0 = no/not reported |
\* Precipitation reference codes are as follows.

- `1`: One report of 6-hour precipitation amount.
- `2`: Summation of 2 reports of 6-hour precipitation amount.
- `3`: Summation of 3 reports of 6-hour precipitation amount.
- `4`: Summation of 4 reports of 6-hour precipitation amount.
- `5`: One report of 12-hour precipitation amount.
- `6`: Summation of 2 reports of 12-hour precipitation amount.
- `7`: One report of 24-hour precipitation amount.
- `8`: Station reported '0' as the amount for the day (e.g., from 6-hour reports), but also reported at least one occurrence of precipitation in hourly observations. This could indicate that a trace occurred, but should be considered incomplete data for the day.
- `9`: Station did not report any precipitation data for the day and did not report any occurrences of precipitation in its hourly observations. It is still possible that precipitation occurred but was not reported.
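The codes above imply how many hours of the day a precipitation amount actually covers: the number of reports multiplied by the reporting period. The sketch below tabulates that arithmetic so partial-day totals can be told apart from full-day totals. The names are hypothetical, it assumes the code is stored as integer-like text, and mapping codes 8 and 9 to `None` is an interpretation of "incomplete" and "not reported" rather than something the dataset prescribes.

```python
# Hours of accumulation implied by each precipitation reference code
# (number of reports times the reporting period). Codes 8 and 9 describe
# incomplete or missing data, so they map to None (an interpretation).
PRECIP_HOURS = {1: 6, 2: 12, 3: 18, 4: 24, 5: 12, 6: 24, 7: 24, 8: None, 9: None}


def precip_hours(measure_ref):
    """Return the hours covered by a precipitation amount, or None if unknown."""
    code = int(measure_ref)  # assumes integer-like text such as "3"
    if code not in PRECIP_HOURS:
        raise ValueError(f"unknown precipitation reference code: {measure_ref!r}")
    return PRECIP_HOURS[code]


def covers_full_day(measure_ref):
    """True when the reported amount spans a full 24 hours."""
    return precip_hours(measure_ref) == 24
```

For example, `precip_hours("3")` gives 18, so an amount carrying code `3` leaves six hours of the day unaccounted for.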
Each `station_id` is associated with its proper name, host country information, precise geographic coordinates, as well as the first and last day of operation. You can find this reference table, `gsod.stn.csv`, in the same location.
/home/glentner/public/datasets/noaa/gsod.stn.csv
Table 4: Station reference data.
Column | Description |
---|---|
`station_id` | Unique identifier (shared with obs table). |
`wban_id` | Unique identifier (shared with obs table). |
`station_name` | Proper name of the station. |
`country` | Host country name. |
`state` | State name within country. |
`call_name` | Call sign (if available). |
`latitude` | Latitude (decimal degrees) |
`longitude` | Longitude (decimal degrees) |
`elevation` | Elevation (meters) |
`date_start` | First observation (yyyymmdd) |
`date_end` | Last observation (yyyymmdd) |
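Because the station table is small compared with the observations, one way to attach station metadata is to load it into a dictionary and look stations up while streaming observations. The sketch below assumes a comma-delimited file with a header row matching Table 4, and it keys stations on the pair `(station_id, wban_id)`, which is an assumption about what uniquely identifies a station. Note that the two tables use different date formats. The function names and the sample row are invented for illustration.

```python
import csv
import io
from datetime import datetime


def load_stations(lines):
    """Index station rows by (station_id, wban_id)."""
    return {(row["station_id"], row["wban_id"]): row for row in csv.DictReader(lines)}


def was_operating(station, obs_date):
    """Check an observation date (yyyy/mm/dd) against the station's date range."""
    day = datetime.strptime(obs_date, "%Y/%m/%d").date()
    start = datetime.strptime(station["date_start"], "%Y%m%d").date()
    end = datetime.strptime(station["date_end"], "%Y%m%d").date()
    return start <= day <= end


# Invented station row for illustration only.
sample = io.StringIO(
    "station_id,wban_id,station_name,country,state,call_name,"
    "latitude,longitude,elevation,date_start,date_end\n"
    "000001,99999,EXAMPLE STATION,EXAMPLELAND,,,10.0,20.0,5.0,19730101,20190630\n"
)
stations = load_stations(sample)
station = stations[("000001", "99999")]
```

With the lookup in place, `was_operating(station, "2019/01/01")` can serve as a sanity check that an observation falls inside the station's recorded first and last observation dates.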
Figure 1: Global distribution of GSOD weather stations.
AITP Computing Challenge Day 2019 | Data Science Challenge | Research Computing |
---|