Dataset

Geoffrey Lentner edited this page Mar 23, 2019 · 1 revision

GSOD Observations

Global Surface Summary of the Day (GSOD) data for 18 surface meteorological elements are derived from synoptic/hourly observations. The daily elements included in the dataset (as available from each station) are as follows.

Table 1: Weather variables provided by the dataset.

| Observation | Precision | Engineering units |
|---|---|---|
| mean temperature | 0.1 | Fahrenheit |
| mean dew point | 0.1 | Fahrenheit |
| mean sea level pressure | 0.1 | mbar |
| mean station pressure | 0.1 | mbar |
| mean visibility | 0.1 | miles |
| mean wind speed | 0.1 | knots |
| maximum sustained wind speed | 0.1 | knots |
| maximum wind gust | 0.1 | knots |
| maximum temperature | 0.1 | Fahrenheit |
| minimum temperature | 0.1 | Fahrenheit |
| precipitation amount | 0.01 | inches |
| snow depth | 0.1 | inches |

Indicators for the occurrence of fog, rain or drizzle, snow or ice pellets, hail, thunder, and tornado or funnel cloud are also included.
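Because the observations use a mix of imperial and nautical units, a small set of conversion helpers can be handy during analysis. The sketch below is illustrative only: the function names are hypothetical, the target units are an arbitrary choice, and it assumes visibility is given in statute miles.

```python
# Conversion factors; the target units are an illustrative choice.
KNOT_TO_MS = 1852.0 / 3600.0  # one knot is one nautical mile (1852 m) per hour
INCH_TO_MM = 25.4             # exact by definition
MILE_TO_KM = 1.609344         # statute mile (assumed for visibility)


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature or dew point from Fahrenheit to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def knots_to_ms(knots: float) -> float:
    """Convert a wind speed from knots to metres per second."""
    return knots * KNOT_TO_MS


def inches_to_mm(inches: float) -> float:
    """Convert precipitation or snow depth from inches to millimetres."""
    return inches * INCH_TO_MM


def miles_to_km(miles: float) -> float:
    """Convert visibility from (statute) miles to kilometres."""
    return miles * MILE_TO_KM
```

Pressure needs no conversion to hectopascals, since 1 mbar equals 1 hPa.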

The dataset is referenced on Kaggle and available via the Google BigQuery API. For the purposes of this competition, the data has already been acquired and staged on the cluster.

As an added challenge, the dataset has been transformed from its original format and unpivoted into a normalized (long) table, in which each row holds a single measurement for one station on one day. This was done deliberately to increase the difficulty, and it reflects a typical scenario for practicing data scientists: often the most difficult part of a project is wrangling the data from a problematic format into something more manageable or efficient.

You will find the file, gsod.obs.csv, in a publicly accessible directory on the cluster. There are 2,291,730,018 records across six columns making up 76 gigabytes in total.

/home/glentner/public/datasets/noaa/gsod.obs.csv
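At 76 gigabytes the file is unlikely to fit in memory, so a first pass will usually stream it row by row. The following is a minimal sketch under stated assumptions: the column order follows the column descriptions on this page, the presence of a header row is a guess (controlled by the has_header argument), and count_by_measure is a hypothetical helper name.

```python
import csv
import io
from collections import Counter
from typing import Iterable

# Column order as described on this page; a header row is an assumption.
COLUMNS = ["station_id", "wban_id", "datetime",
           "measure_id", "measure_value", "measure_ref"]


def count_by_measure(lines: Iterable[str], has_header: bool = True) -> Counter:
    """Stream CSV lines and count records per measure_id, one row at a time."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the assumed header row
    measure_idx = COLUMNS.index("measure_id")
    counts = Counter()
    for row in reader:
        if row:  # ignore blank lines
            counts[row[measure_idx]] += 1
    return counts


# Small in-memory demonstration with made-up rows. On the cluster, pass
# open("/home/glentner/public/datasets/noaa/gsod.obs.csv", newline="") instead.
demo = io.StringIO(
    "station_id,wban_id,datetime,measure_id,measure_value,measure_ref\n"
    "000001,99999,2019/01/01,0001,30.1,24\n"
    "000001,99999,2019/01/01,0002,25.0,24\n"
)
demo_counts = count_by_measure(demo)
```

Only one row is held in memory at a time, so the same function works unchanged on the full file, although a single pass over 2.3 billion records will still take a while.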

Table 2: Column descriptions.

| Column | Description |
|---|---|
| station_id | Station number (WMO/DATSAV3 number) for the location. |
| wban_id | WBAN number, where applicable. WBAN stands for the historical "Weather Bureau Air Force Navy" number. |
| datetime | Date of the observation (yyyy/mm/dd). |
| measure_id | Unique identifier for the type of observation. |
| measure_value | Numerical value of the observation. |
| measure_ref | Reference value (meaning varies by measure). |

The measure_ref column holds a numerical value that gives additional context to the observation. Its meaning depends on the measure in question: for the primary measures (the daily means) it is the count of observations used in the summary; for others it is a flag, as listed in Table 3.
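To get back to one record per station and day, the long table can be pivoted on measure_id while keeping each value together with its measure_ref. The sketch below is one possible approach rather than a prescribed solution. It builds everything in memory, so it only suits a filtered subset of the file. The helper names are hypothetical, and treating empty or "null" fields as missing is an assumption about how the file encodes them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Sequence, Tuple

Key = Tuple[str, str, str]                    # (station_id, wban_id, datetime)
Cell = Tuple[Optional[float], Optional[str]]  # (measure_value, measure_ref)


def _clean(text: str) -> Optional[str]:
    """Return None for empty or 'null' fields (assumed missing-value encoding)."""
    text = text.strip()
    return None if text == "" or text.lower() == "null" else text


def pivot(rows: Iterable[Sequence[str]]) -> Dict[Key, Dict[str, Cell]]:
    """Group long-format rows into one dict per station-day, keyed by measure_id."""
    wide: Dict[Key, Dict[str, Cell]] = defaultdict(dict)
    for station_id, wban_id, day, measure_id, value, ref in rows:
        value_text = _clean(value)
        number = float(value_text) if value_text is not None else None
        wide[(station_id, wban_id, day)][measure_id] = (number, _clean(ref))
    return dict(wide)
```

Each input row is a sequence of the six columns in file order, for example as produced by csv.reader. If the same station, day, and measure appear more than once, the last row wins.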

Table 3: Observation types and their measure_ref.

| measure_id | measure_value | measure_ref |
|---|---|---|
| 0001 | mean temperature | count of observations |
| 0002 | mean dew point | count of observations |
| 0003 | mean sea level pressure | count of observations |
| 0004 | mean station pressure | count of observations |
| 0005 | mean visibility | count of observations |
| 0006 | mean wind speed | count of observations |
| 0007 | maximum sustained wind speed | null |
| 0008 | maximum wind gust | null |
| 0009 | maximum temperature | 0 indicates max temp was taken from the explicit max temp report and not from the hourly data; 1 indicates max temp was derived from the hourly data (i.e., highest hourly or synoptic-reported temperature) |
| 0010 | minimum temperature | 0 indicates min temp was taken from the explicit min temp report and not from the hourly data; 1 indicates min temp was derived from the hourly data (i.e., lowest hourly or synoptic-reported temperature) |
| 0011 | precipitation amount | * |
| 0012 | snow depth | 1 = yes, 0 = no/not reported |
| 0013 | fog | 1 = yes, 0 = no/not reported |
| 0014 | rain / drizzle | 1 = yes, 0 = no/not reported |
| 0015 | snow / ice pellets | 1 = yes, 0 = no/not reported |
| 0016 | hail | 1 = yes, 0 = no/not reported |
| 0017 | thunder | 1 = yes, 0 = no/not reported |
| 0018 | tornado / funnel cloud | 1 = yes, 0 = no/not reported |

* Precipitation reference codes are as follows.

  • 1: One report of 6-hour precipitation amount.
  • 2: Summation of 2 reports of 6-hour precipitation amount.
  • 3: Summation of 3 reports of 6-hour precipitation amount.
  • 4: Summation of 4 reports of 6-hour precipitation amount.
  • 5: One report of 12-hour precipitation amount.
  • 6: Summation of 2 reports of 12-hour precipitation amount.
  • 7: One report of 24-hour precipitation amount.
  • 8: Station reported '0' as the amount for the day (e.g., from 6-hour reports), but also reported at least one occurrence of precipitation in hourly observations; this could indicate a trace occurred, but should be considered as incomplete data for the day.
  • 9: Station did not report any precipitation data for the day and did not report any occurrences of precipitation in its hourly observations; it is still possible that precipitation occurred but was not reported.
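Since the meaning of measure_ref changes with measure_id, it can help to centralize the interpretation in one place. The following sketch encodes Table 3 and the precipitation codes above as a lookup. The function name describe_ref and the returned wording are hypothetical, and it assumes measure_ref has already been parsed to an integer, or to None where the file holds a null.

```python
from typing import Optional

COUNT_MEASURES = {"0001", "0002", "0003", "0004", "0005", "0006"}
NO_REF_MEASURES = {"0007", "0008"}
SOURCE_MEASURES = {"0009", "0010"}  # max/min temperature
PRECIP_MEASURE = "0011"
FLAG_MEASURES = {"0012", "0013", "0014", "0015", "0016", "0017", "0018"}
KNOWN = (COUNT_MEASURES | NO_REF_MEASURES | SOURCE_MEASURES
         | {PRECIP_MEASURE} | FLAG_MEASURES)

PRECIP_CODES = {
    1: "one report of 6-hour precipitation amount",
    2: "summation of 2 reports of 6-hour precipitation amount",
    3: "summation of 3 reports of 6-hour precipitation amount",
    4: "summation of 4 reports of 6-hour precipitation amount",
    5: "one report of 12-hour precipitation amount",
    6: "summation of 2 reports of 12-hour precipitation amount",
    7: "one report of 24-hour precipitation amount",
    8: "reported 0 but precipitation seen in hourly observations (incomplete)",
    9: "no precipitation data or occurrences reported",
}


def describe_ref(measure_id: str, measure_ref: Optional[int]) -> str:
    """Return a human-readable meaning of measure_ref for the given measure_id."""
    if measure_id not in KNOWN:
        raise ValueError(f"unknown measure_id: {measure_id!r}")
    if measure_id in NO_REF_MEASURES:
        return "no reference value"
    if measure_ref is None:
        return "missing reference value"
    if measure_id in COUNT_MEASURES:
        return f"{measure_ref} observations used in the summary"
    if measure_id in SOURCE_MEASURES:
        codes = {0: "explicit report", 1: "derived from hourly data"}
    elif measure_id == PRECIP_MEASURE:
        codes = PRECIP_CODES
    else:  # one of FLAG_MEASURES
        codes = {1: "yes", 0: "no/not reported"}
    return codes.get(measure_ref, "unrecognized code")
```

Codes 8 and 9 are worth singling out when aggregating precipitation, because both mark days whose reported amount may be incomplete.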

GSOD Stations

Each station_id is associated with its proper name, host country information, precise geographic coordinates, and the first and last day of operation.

You can find this reference table, gsod.stn.csv, in the same location.

/home/glentner/public/datasets/noaa/gsod.stn.csv

Table 4: Station reference data.

| Column | Description |
|---|---|
| station_id | Unique identifier (shared with obs table). |
| wban_id | Unique identifier (shared with obs table). |
| station_name | Proper name of the station. |
| country | Host country name. |
| state | State name within country. |
| call_name | Call sign (if available). |
| latitude | Latitude (decimal degrees). |
| longitude | Longitude (decimal degrees). |
| elevation | Elevation (meters). |
| date_start | First observation (yyyymmdd). |
| date_end | Last observation (yyyymmdd). |
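Observations can be joined to this reference table on the station_id and wban_id pair. The sketch below loads the station file into a dictionary keyed on that pair and checks whether an observation date falls inside a station's operating period. It is a minimal illustration, not a prescribed approach: it assumes the file has a header row with the column names from Table 4, that empty fields mean "not available", and the names Station, load_stations, and in_operation are hypothetical. Note the two different date formats: yyyymmdd here versus yyyy/mm/dd in the observations.

```python
import csv
from datetime import date, datetime
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Station(NamedTuple):
    name: str
    country: str
    latitude: Optional[float]
    longitude: Optional[float]
    elevation: Optional[float]
    date_start: Optional[date]
    date_end: Optional[date]


def _to_float(text: str) -> Optional[float]:
    text = text.strip()
    return float(text) if text else None


def _to_date(text: str) -> Optional[date]:
    text = text.strip()
    return datetime.strptime(text, "%Y%m%d").date() if text else None


def load_stations(lines: Iterable[str]) -> Dict[Tuple[str, str], Station]:
    """Read the station CSV (header row assumed) into a dict keyed by both ids."""
    stations = {}
    for row in csv.DictReader(lines):
        stations[(row["station_id"], row["wban_id"])] = Station(
            name=row["station_name"],
            country=row["country"],
            latitude=_to_float(row["latitude"]),
            longitude=_to_float(row["longitude"]),
            elevation=_to_float(row["elevation"]),
            date_start=_to_date(row["date_start"]),
            date_end=_to_date(row["date_end"]),
        )
    return stations


def in_operation(station: Station, obs_datetime: str) -> bool:
    """Check an observation date (yyyy/mm/dd) against the station's date range."""
    day = datetime.strptime(obs_datetime, "%Y/%m/%d").date()
    after_start = station.date_start is None or day >= station.date_start
    before_end = station.date_end is None or day <= station.date_end
    return after_start and before_end
```

The station table is small compared with the observations, so holding it in a dictionary and looking up each observation's key while streaming the larger file is usually workable.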




Figure 1: Global distribution of GSOD weather stations.