-
Notifications
You must be signed in to change notification settings - Fork 4
Home
The package assists with going from a forecast dataset to a collection of valid, independent samples for assessment.
- https://github.com/csiro-dcfp/pangeo_hpc
- https://github.com/dougiesquire/unseen_bushfires
- https://github.com/dougiesquire/hydro_tasmania/tree/main/2020
- ClimPred
- xks: xarray-compatible Kolmogorov-Smirnov testing in 1D and 2D
- ACS indices
- https://github.com/cgevans/scikits-bootstrap
- Kelder, T., Müller, M., Slater, L.J. et al. Using UNSEEN trends to detect decadal changes in 100-year precipitation extremes. npj Clim Atmos Sci 3, 47 (2020). https://doi.org/10.1038/s41612-020-00149-4
- Thompson, V., Dunstone, N.J., Scaife, A.A. et al. High risk of unprecedented UK rainfall in the current climate. Nat Commun 8, 107 (2017). https://doi.org/10.1038/s41467-017-00275-3
A local Dask cluster can be setup as follows:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
You can set up a Dask cluster under the OpenOnDemand environment across multiple nodes using the cloud-based SLURM cluster.
from dask.distributed import Client,Scheduler
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=16, memory="31GB")
client = Client(cluster)
cluster.scale(cores=32)
You can also use climtas.nci.GadiClient()
following
these instructions.
/g/data/xv83/observations/
/g/data/zv2/agcd/v2
Code for model run: https://github.com/csiro-dcfp/cm-forecasts
The raw netcdf files produced by the model are sent to tape when the model is run. To access the 2020-11 forecasts (for instance) you need to be a member of the v14 project:
$ mdss -P v14 dmls -l f6/c5-d60-pX-f6-20201101
Code for post-processing (Zarr conversion): https://github.com/csiro-dcfp/post-processing
The post-processed data is stored at /g/data/xv83/dcfp/CAFE-f6
.
The control run that the simulation branched from (c5
), data assimilation method (d60
), perturbation method (pX
) are indicated. They are the defining features of the f5
dataset.
The f5
dataset has 10 ensemble members and was run over a longer period of time, whereas f6
has 96 members over a shorter period.
The CAFE monthly precipitation has listed units of kg/m2/s-1 but is actually the sum of the daily kg/m2/s-1 values.
In CAFE-60 (the reanalysis that is used to initialise the forecasts), a bunch of water was added to the ocean around 1990 (which has a trivial impact on most variables) and then the bias correction scheme changed for forecasts initialised from 1992 onwards (i.e. after satellite altimetry data came online). The scheme allowed an SST cold bias up until 1992 because there were so few observations to assimilate (and so the simulation would crash if you didn't allow a cold bias), but then didn't allow a bias to persist after 1992. This causes a sudden warming-related step change in most surface and atmospheric variables. For depth integrated ocean quantities warming continues for many years as the SST step change propagates into the deep ocean but then is arrested (and begins to cool for a few years) as Argo observations come online in the mid-2000s (i.e. the sub-surface ocean was basically unconstrained until then).
A different version of the model executable file was used in the f6 dataset for the May forecasts from 2005 onwards than for the rest of the forecasts. This creates a bit of a zig-zag behaviour in many variables.
TODO:
- I/O with netcdf
- Temporal aggregation (e.g. annual, seasonal or monthly) is the tricky part of the preprocessing,
because you need to decide how to handle/modify the lead_time coordinate
- Resample frequencies
- Y (annual, with date label being last day of year)
- M (monthly, with date label being last day of month)
- Q-NOV (DJF, MAM, JJA, SON, with date label being last day of season)
- A-NOV (annual mean for Dec-Nov, date label being last day of the year)
- YS (annual, with date label being first day of the year)
- MS (monthly, with date label being first day of month)
- QS-DEC (DJF, MAM, JJA, SON, with date label being first day of season)
- AS-DEC (annual mean for Dec-Nov, with date label being first day of year)
- Resample frequencies
- xarray pad for NaN filling
- ClimPred has functions for (and a good explanation of) verification alignment
Returns a mask (to remove gird points that fail the test) or a test statistic/s for each grid point corresponding to different alpha levels (e.g. 0.99, 0.95) that a mask could be created from.
Dougie's paper uses the Kolmogorov–Smirnov test statistic (the 1D function comes from SciPy, the 2D he wrote himself) to test the similarity between the observational and forecast distributions. Other papers apply the Multiple-moment test (1D only), or we could use scipy implementation of the Anderson-Darling test. It may be possible to use xclim for the KS test calculation once this issue is resolved, otherwise xks could be incorporated into this unseen package. For a 1D KS test there's a formula to determine the confidence interval. For the 2D test you need to use bootstrapping to find the confidence interval (i.e. you take many sub-samples of the forecast data - each sub-sample has the same size as the observations - and calculate a KS value for each sub-sample).
Need to calculate test scores for each lead time separately (because model can drift away from initial state).
Because all the ensemble members start from the same point, they are also correlated (i.e. not independent) for n lead times. For instance, lead times < 30 months might have a correlation higher than some threshold, so you'd would exclude all lead times < 30 months from your analysis.
An issue is data selection where you want the same number of samples for each lead time (e.g. for a dataset where 10-year forecasts are initiated every year from 2005-2020, you don't get full overlap until 2014). Dougie has written a function stack_super_ensemble
to subset the overlap data, but it's inefficient.
i.e. counting to determine the likelihood of a particular event.
Define functions that can be used to construct workflows in a Jupyter notebook
or linked together in the main()
function of Python scripts.
A common operation involves taking observational datasets and
converting the time axis to a dual initial date / lead time coordinate.
This basically involves duplicating the data via overlapping windows,
which is analogous to a process in image processing (to make images more blurry)
called patch extraction.
There are implementations available as part of
skimage (skimage.util.view_as_windows
),
numpy (numpy.lib.stride_tricks.sliding_window_view
),
dask (dask.array.overlap.sliding_window_view
)
and xarray (xarray.core.rolling
).
For many analyses we ultimately want to reduce all the non-spatial dimensions down to one single dimension. For example, this array,
ds_cafe
Dimensions: (ensemble: 96, init_date: 4, lat: 17, lead_time: 3650, lon: 17)
Coordinates:
* ensemble (ensemble) int64 1 2 3 4 5 6 7 8 9 ... 88 89 90 91 92 93 94 95 96
* init_date (init_date) datetime64[ns] 1990-11-01 1991-11-01 ... 1993-11-01
* lat (lat) float64 -43.48 -41.46 -39.44 ... -15.17 -13.15 -11.12
* lead_time (lead_time) int64 0 1 2 3 4 5 6 ... 3644 3645 3646 3647 3648 3649
* lon (lon) float64 113.8 116.2 118.8 121.2 ... 146.2 148.8 151.2 153.8
time (lead_time, init_date) datetime64[ns] dask.array<chunksize=(3650, 4), meta=np.ndarray>
Data variables:
precip (init_date, lead_time, ensemble, lat, lon) float64 dask.array<chunksize=(1, 50, 96, 17, 17), meta=np.ndarray>
would simply become (lat, lon, sample)
,
where sample
is a MultiIndex
with related ensemble
, init_date
and lead_time
coordinate values.
ds_stacked = ds_cafe.stack({'sample': ['ensemble', 'init_date', 'lead_time']})
ds_stacked
Dimensions: (lat: 17, lon: 17, sample: 1401600)
Coordinates:
* lat (lat) float64 -43.48 -41.46 -39.44 ... -15.17 -13.15 -11.12
* lon (lon) float64 113.8 116.2 118.8 121.2 ... 146.2 148.8 151.2 153.8
time (sample) datetime64[ns] dask.array<chunksize=(1401600,), meta=np.ndarray>
* sample (sample) MultiIndex
- ensemble (sample) int64 1 1 1 1 1 1 1 1 1 1 ... 96 96 96 96 96 96 96 96 96
- init_date (sample) datetime64[ns] 1990-11-01 1990-11-01 ... 1993-11-01
- lead_time (sample) int64 0 1 2 3 4 5 6 ... 3644 3645 3646 3647 3648 3649
Data variables:
precip (lat, lon, sample) float64 dask.array<chunksize=(17, 17, 14600), meta=np.ndarray>
We can index (i.e. select and slice) the data using references to ensemble
, init_date
, lead_time
, but not time
,
which is a bit of a limitation because the work around for time selection involves creating a boolean array
the same size as the data array.
I think this limitation will be overcome with the implementation of flexible indexes. A proposal has been developed for adding this feature and work has started as part of the xarray CZI grant.
Use apply_ufunc
. Here's some examples: