diff --git a/09-xarray/xarray-oceanhackweek20.ipynb b/09-xarray/xarray-oceanhackweek20.ipynb new file mode 100644 index 00000000..1bb34e5e --- /dev/null +++ b/09-xarray/xarray-oceanhackweek20.ipynb @@ -0,0 +1,877 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "# Xarray: Data structures for high-level analysis of multi-dimensional data\n", + "\n", + "In this lesson, we discuss cover the basics of Xarray data structures. By the end of the lesson, we will be able to:\n", + "\n", + "- Understand the basic data structures in Xarray\n", + "- Inspect `DataArray` and `Dataset` objects.\n", + "- Read and write netCDF files using Xarray.\n", + "- Understand that there are many packages that build on top of xarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A practical example" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import xarray as xr\n", + "\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# load tutorial dataset\n", + "ds = xr.tutorial.load_dataset(\"air_temperature\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's in a dataset? many DataArrays" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# dataset repr\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Datasets are dict-like containers of DataArrays i.e. they are a mapping of variable name to DataArray." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pull out \"air\" dataarray\n", + "ds[\"air\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pull out dataarray using dot notation\n", + "ds.air ## same as ds[\"air\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's in a DataArray? data + (a lot of) metadata\n", + "\n", + "### Named dimensions `.dims`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.air.dims" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Coordinate variables or \"tick labels\" (`.coords`)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.air.coords" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extracting coordinate variables\n", + "ds.air.lon" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extracting coorindate variables from .coords\n", + "ds.coords[\"lon\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Arbitrary attributes (`.attrs`)\n", + "\n", + "`.attrs` is a dictionary that can contain arbitrary python objects. Your only limitation is that some attributes may not be writeable to a netCDF file" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.air.attrs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# assign your own attribute\n", + "ds.air.attrs[\"who_is_awesome\"] = \"xarray\"\n", + "ds.air.attrs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Underlying data (`.data`)\n", + "\n", + "This is a numpy array which you may be familiar with. Xarray structures wrap underlying simpler data structures. \n", + "\n", + "This part of xarray is quite extensible allowing for GPU arrays, sparse arrays, arrays with units etc. See the demo at the end." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.air.data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# what is the type of the underlying data\n", + "type(ds.air.data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Review\n", + "\n", + "\n", + "Xarray provides two main data structures\n", + "* DataArrays that wrap underlying data containers (e.g. numpy arrays) and contain associated metadata\n", + "* Datasets that are dict-like containers of DataArrays\n", + "\n", + "For more see\n", + "* https://xarray.pydata.org/en/stable/data-structures.html#dataset\n", + "* https://xarray.pydata.org/en/stable/data-structures.html#dataarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Why xarray? Use metadata for fun and ~profit~ papers!\n", + "\n", + "### Analysis without xarray: `X(`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# plot the first timestep\n", + "lat = ds.air.lat.data # numpy array\n", + "lon = ds.air.lon.data # numpy array\n", + "temp = ds.air.data # numpy array\n", + "plt.figure()\n", + "plt.pcolormesh(lon, lat, temp[0, :, :])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.mean(axis=1) ## what did I just do? I can't tell by looking at this line. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Analysis with xarray `=)`\n", + "\n", + "How readable is this code?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure()\n", + "ds.air.isel(time=1).plot(x=\"lon\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure()\n", + "ds.air.mean(\"time\").plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting data or \"indexing\" : `.sel`, `.isel`\n", + "\n", + "Xarray supports \n", + "* label-based indexing using `.sel`\n", + "* position-based indexing using `.isel`\n", + "\n", + "For more see https://xarray.pydata.org/en/stable/indexing.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Label-based indexing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pull out data for all of 2013-May\n", + "ds.sel(time=\"2013-05\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate slicing\n", + "ds.sel(time=slice(\"2013-05\", \"2013-07\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate \"nearest\" indexing\n", + "ds.sel(lon=240.2, method=\"nearest\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Position-based indexing\n", + "\n", + "\n", + "This is similar to your usual numpy `array[0, 2, 3]` but with the power of named dimensions!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pull out time index 0 and lat index 0\n", + "ds.air.isel(time=0, lat=0) # much better than ds.air[0, 0, :]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate slicing\n", + "ds.air.isel(lat=slice(10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Concepts for computation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Broadcasting: expanding data\n", + "\n", + "Let's try to calculate grid cell area associated with the air temperature data. We will use this to make a proper domain-average\n", + "\n", + "A very approximate formula is\n", + "\n", + "\\begin{equation}\n", + "Δlat*\\cos(\\text{latitude}) * Δlon\n", + "\\end{equation}\n", + "\n", + "assuming that $Δlon$ = 110km and $Δlat$ = 105km" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dlon = np.cos(ds.air.lat * np.pi / 180) * 110e3\n", + "dlon" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dlat = 105e3* xr.ones_like(ds.air.lon)\n", + "dlat" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cell_area = dlon * dlat\n", + "cell_area" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result has two dimensions because xarray realizes that dimensions `lon` and `lat` are different so it automatically \"broadcasts\" to get a 2D result. See the last row in this image from *Jake VanderPlas Python Data Science Handbook*\n", + "\n", + "\n", + "\n", + "\n", + "Because xarray knows about dimension names we avoid having to create unnecessary size-1 dimensions using `np.newaxis` or `.reshape`. For more, see https://xarray.pydata.org/en/stable/computation.html#broadcasting-by-dimension-name\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "### Alignment: putting data on the same grid\n", + "\n", + "When doing arithmetic operations xarray automatically \"aligns\" i.e. puts the data on the same grid. In this case `cell_area` and `ds.air` are at the same lat, lon points so things are multiplied as you would expect" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "(cell_area * ds.air.isel(time=1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now lets make `cell_area` unaligned i.e. change the coordinate labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# make a copy of cell_area\n", + "# then add 1e-5 to lat\n", + "cell_area_bad = cell_area.copy(deep=True)\n", + "cell_area_bad[\"lat\"] = cell_area.lat + 1e-5\n", + "cell_area_bad" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cell_area_bad * ds.air.isel(time=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Tip:** If you notice extra NaNs or missing points after xarray computation, it means that your xarray coordinates were not aligned *exactly*.\n", + "\n", + "For more, see https://xarray.pydata.org/en/stable/computation.html#automatic-alignment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## High level computation: `groupby`, `resample`, `rolling`, `coarsen`, `weighted`\n", + "\n", + "Xarray has some very useful high level objects that let you do common computations:\n", + "\n", + "1. `groupby` : [Bin data in to groups and reduce](https://xarray.pydata.org/en/stable/groupby.html)\n", + "1. `resample` : [Groupby specialized for time axes. Either downsample or upsample your data.](https://xarray.pydata.org/en/stable/time-series.html#resampling-and-grouped-operations)\n", + "1. `rolling` : [Operate on rolling windows of your data e.g. running mean](https://xarray.pydata.org/en/stable/computation.html#rolling-window-operations)\n", + "1. `coarsen` : [Downsample your data](https://xarray.pydata.org/en/stable/computation.html#coarsen-large-arrays)\n", + "1. `weighted` : [Weight your data before reducing](https://xarray.pydata.org/en/stable/computation.html#weighted-array-reductions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### groupby" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# seasonal groups\n", + "ds.groupby(\"time.season\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# make a seasonal mean\n", + "seasonal_mean = ds.groupby(\"time.season\").mean()\n", + "seasonal_mean" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### resample" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# resample to monthly frequency\n", + "ds.resample(time=\"M\").mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### weighted" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# weight by cell_area and take mean over (time, lon)\n", + "ds.weighted(cell_area).mean([\"lon\", \"time\"]).air.plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Visualization: `.plot`\n", + "\n", + "For more see https://xarray.pydata.org/en/stable/plotting.html" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# facet the seasonal_mean\n", + "seasonal_mean.air.plot(col=\"season\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# contours\n", + "seasonal_mean.air.plot.contour(col=\"season\", levels=20, add_colorbar=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Reading and writing to disk" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# write ds to netCDF\n", + "ds.to_netcdf(\"my-example-dataset.nc\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# read from disk\n", + "fromdisk = xr.open_dataset(\"my-example-dataset.nc\")\n", + "fromdisk" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check that the two are identical\n", + "ds.identical(fromdisk)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More\n", + "\n", + "Xarray supports many disk formats. For more see https://xarray.pydata.org/en/stable/io.html\n", + "\n", + "A common use case to read datasets that are a collection of many netCDF files. See https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets for how to handle that" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# More information\n", + "\n", + "1. A description of common terms used in the xarray documentation: https://xarray.pydata.org/en/stable/terminology.html\n", + "1. For information on how to create a DataArray from an existing numpy array: https://xarray.pydata.org/en/stable/data-structures.html#creating-a-dataarray\n", + "1. Answers to common questions on \"how to do X\" are here: https://xarray.pydata.org/en/stable/howdoi.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# The scientific python / pangeo ecosystem: demo\n", + "\n", + "Xarray ties in to the larger scientific python ecosystem and in turn many packages build on top of xarray. A long list of such packages is here: https://xarray.pydata.org/en/stable/related-projects.html.\n", + "\n", + "\n", + "Now we will demonstrate some cool features." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pandas: tabular data structures \n", + "\n", + "You can easily convert between xarray and pandas structures: https://pandas.pydata.org/\n", + "\n", + "This allows you to conveniently use the extensive pandas ecosystem of packages (like seaborn) for your work.\n", + "\n", + "See https://xarray.pydata.org/en/stable/pandas.html" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert to pandas dataframe\n", + "df = ds.isel(time=slice(10)).to_dataframe()\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# convert dataframe to xarray\n", + "df.to_xarray()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## xarray can wrap other array types, not just numpy\n", + "\n", + "### dask: parallel arrays\n", + "\n", + "Dask cuts up NumPy arrays into blocks and parallelizes your analysis code across these blocks\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# make dask cluster; this is for demo purposes\n", + "import dask\n", + "import distributed\n", + "\n", + "cluster = distributed.LocalCluster()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate dask dataset\n", + "dasky = xr.tutorial.open_dataset(\"air_temperature\", chunks={\"time\": 100})\n", + "\n", + "dasky.air" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate lazy mean\n", + "dasky.air.mean()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# \"compute\" the mean\n", + "dasky.air.mean().compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More:\n", + "\n", + "* dask: https://xarray.pydata.org/en/stable/dask.html\n", + "* pydata/sparse: sparse arrays http://sparse.pydata.org\n", + "* cupy: GPU arrays http://cupy.chainer.org\n", + "* pint: unit-aware computations https://pint.readthedocs.org & https://github.com/xarray-contrib/pint-xarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## holoviews: javascript interactive plots\n", + "\n", + "the ``hvplot`` package is a nice easy way to access the holoviews functionality. It attaches itself to all xarray objects under the `.hvplot` namespace. So instead of using `.plot` use `.hvplot`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import hvplot.xarray\n", + "\n", + "ds.air.hvplot(groupby=\"time\", clim=(270, 300))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### cf_xarray : use even more metadata for fun and ~profit~ papers\n", + "\n", + "[cf_xarray](https://cf-xarray.readthedocs.io/) is a new project that tries to let you make use of other CF attributes that xarray ignores. It attaches itself to all xarray objects under the `.cf` namespace" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cf_xarray" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# describe cf attributes in dataset\n", + "ds.air.cf.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate equivalent of .mean(\"lat\")\n", + "ds.air.cf.mean(\"latitude\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# demonstrate indexing\n", + "ds.air.cf.sel(longitude=242.5, method=\"nearest\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/environment.yml b/environment.yml index c6f71616..88bf26f0 100644 --- a/environment.yml +++ b/environment.yml @@ -9,6 +9,7 @@ dependencies: - cartopy - cdsapi - cf-units + - cf_xarray - colorcet - compilers - ctd @@ -32,7 +33,7 @@ dependencies: - netcdf4 - numpy - palettable - - pandas + - pandas<1.1 - panel - param - podaacpy