Skip to content
Damien Irving edited this page Apr 23, 2021 · 29 revisions

Relevant code and packages

Relevant papers

Data for playing around

Project xv83 on NCI (zarr format), e.g:
/g/data/xv83/ds0092/CAFE/forecasts/f6/WIP/c5-d60-pX-f6-19841101/ZARR/atmos_isobaric_daily.zarr.zip

The control run that the simulation branched from (c5), data assimilation method (d60), perturbation method (pX) are indicated. They are the defining features of the f5 dataset.

The f5 dataset has 10 ensemble members and was run over a longer period of time, whereas f6 has 96 members over a shorter period.

Processing steps

Step 1: Pre-processing

  • Data extraction (for examples see Dougie's prepare*.ipynb files)
  • Region extraction (using regionmask)
  • Index calculation (possibly with ability to pass a custom function)
  • Bias correction (additive or multiplicative method)
get_data(files, region)
get_index()
bias_correct(model_data, obs_data, method="additive")

Some of these functions might need to write to file, because dask gets confused if many operations pile up. Outputs are typically a single zipped zarr file.

Step 2: Fidelity testing

Returns a mask (to remove gird points that fail the test) or a test statistic/s for each grid point corresponding to different alpha levels (e.g. 0.99, 0.95) that a mask could be created from.

Dougie's paper applies the Kolmogorov–Smirnov test (the 1D function comes from SciPy, the 2D he wrote himself) and other papers apply the Multiple-ooment test (1D only).

An issue is data selection where you want the same number of samples for each lead time (e.g. for a dataset where 10-year forecasts are initiated every year from 2005-2020, you don't get full overlap until 2014). Dougie has written a function stack_super_ensemble to subset the overlap data, but it's inefficient.

ks_test_1d()
ks_test_2d()
moment_test()

Step 3: The easier stuff

i.e. counting to determine the likelihood of a particular event.

Development notes

Patch extraction

A common operation involves taking observational datasets and converting the time axis to a dual initial date / lead time coordinate. This basically involves duplicating the data via overlapping windows, which is analogous to a process in image processing (to make images more blurry) called patch extraction. You can use either skimage.util.view_as_windows or numpy.lib.stride_tricks.sliding_window_view. The latter doesn't seem to be available with the latest version of numpy, but it has been implemented as dask.array.overlap.sliding_window_view.

A more convenient option might be xarray.core.rolling.DataArrayRolling.construct?

Making functions dask-aware

Use apply_ufunc. Here's some examples:

Clone this wiki locally