Explanation of how to use the aggregate.ncl script
For Bill to generate T1; also send to Ping Yang
I wrote an ncl script to perform aggregation of NARCCAP data. This
example shows how to use aggregate.ncl to condense 3-hourly data into
daily data. This is the process I used to generate tasmin & tasmax
from tas for the RCM3 runs when we discovered that the values
generated by RCM3 itself were no good. There are some fiddly details
relating to getting the aggregation period to match up with the
0600-0600 GMT "day" specified for NARCCAP Table 1 data, but the usage
to generate monthly & seasonal averages and climatologies is pretty
similar.
1) Concatenate input files together using the NCO command 'ncrcat'.
This is necessary because the file boundaries on the 5-year files may
not coincide exactly with the boundaries of the periods you're
aggregating over. If you aren't careful with the boundaries, you can
end up with a period at the edge of the range where the value for a
large period is based on just one or two timesteps. So you definitely
want to get this right.
For going from 3-hourly to daily, we also throw out the very first
timestep, which is at 0300 on the first day. If it's not excluded, it
either results in an extra day at the beginning or in a day with 9
timesteps contributing instead of 8, and either way it messes things
up.
> ncrcat -d time,1, [input files] [output file]
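To see the boundary problem concretely, here is a small Python sketch
(an illustration only, not part of the workflow) that counts how many
3-hourly timesteps land in each 0600-0600 GMT day when the series
starts at 0300:

```python
import math
from collections import Counter

# 3-hourly timesteps in fractional days, starting at 0300 on day 0
# (0300 = 0.125, 0600 = 0.25, ...), spanning two full days plus the
# stray leading step.
steps = [0.125 + 0.125 * i for i in range(17)]

# Shift by -0.25 so 0600-0600 GMT days align with integer boundaries,
# then floor to get each timestep's day index.
counts = Counter(math.floor(t - 0.25) for t in steps)
print(sorted(counts.items()))  # [(-1, 1), (0, 8), (1, 8)]
```

Day index -1 is the partial day containing only the lone 0300
timestep; dropping that timestep with 'ncrcat -d time,1,' leaves
exactly 8 timesteps contributing to every day.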
2) Aggregate data using NCL script
> ncl -Q -n aggregate.ncl infile=\"tas.nc\" outfile=\"tasmax.nc\" interval=\"day\" varname=\"tas\" method=\"max\" check=True offset=-0.25 taint=True outtime=\"start\"
We pass command-line arguments to the NCL script using variable
definition statements on the command line. For string-valued
variables, NCL needs the quote-marks, which means you need to escape
them with backslashes so the shell doesn't interpret them instead of
passing them on to NCL. You could hardwire these values in the script
if you needed to.
In addition to the required command-line input to define the names of
the input file, output file, name of the variable, and period of
aggregation, there are a number of different options you can give
aggregate.ncl to control its behavior. The options used here:
method: allowed values are "mean", "min", or "max". Determines what
function is used to aggregate over the period. Switch to "min" to
generate tasmin.
check: if True, prints a bunch of debugging information at the end so
you can double-check that the output really is what you think it is
and came from where it was supposed to. It's good practice to use
this and look at the output afterwards. (I typically redirect it
to a file in a subdirectory named "check".)
offset: a shift to the time coordinate. Used to adjust when the day
starts when doing daily aggregations. Using -0.25 makes the day run
from 0600 GMT to 0600 GMT.
taint: if True, any missing_value timesteps in the input cause the
corresponding output value to be missing as well.
outtime: determines which point in the input interval should be used
as the time coordinate for the output.
There are also options for making climatological averages across
years and for printing progress indicators for large datasets that
take a long time to process.
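Conceptually, the options combine as in this Python sketch — an
illustration of the behavior described above, not the actual NCL
implementation; times are assumed to be in fractional days:

```python
import math

def aggregate_daily(times, values, method="max", offset=-0.25, taint=True):
    """Illustrative stand-in for aggregate.ncl's daily aggregation.

    times:  time coordinates in fractional days (e.g., 3-hourly spacing)
    values: one value per timestep; None marks a missing value
    """
    buckets = {}
    for t, v in zip(times, values):
        day = math.floor(t + offset)   # offset=-0.25 -> 0600-0600 GMT days
        buckets.setdefault(day, []).append(v)

    funcs = {"mean": lambda xs: sum(xs) / len(xs), "min": min, "max": max}
    out = {}
    for day, vals in sorted(buckets.items()):
        if taint and None in vals:
            out[day] = None            # taint: missing input poisons the day
        else:
            out[day] = funcs[method]([v for v in vals if v is not None])
    return out

# Two full 0600-0600 GMT days of 3-hourly values:
times = [0.25 + 0.125 * i for i in range(16)]
values = list(range(16))
print(aggregate_daily(times, values, method="max"))  # {0: 7, 1: 15}
```

Switching method to "min" here mirrors the tasmin/tasmax split in the
actual workflow.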
3) Rename variables to reflect new contents
If we were averaging the variable, we'd probably want to leave it with
the same name, but since we're generating a maximum temperature
variable from an average temperature variable, we need to rename the
data variable accordingly.
> ncrename -v tas,tasmax tasmax.nc
4) Update metadata
The tas variable is in Table 2, while tasmax is in Table 1, so we need
to change the global attribute named "table_id". We also need to update
the long_name attribute of tasmax to reflect the new variable. And,
for a minimum or maximum value, we need to add an appropriate
cell_methods attribute. All of these updates can be done with a
single use of ncatted. Note that we use the -h flag to prevent
ncatted from adding a history entry for this operation because the
results of the action are plainly obvious in the metadata, and the
very long entries typical of editing metadata really clutter up the
history and make it hard to read.
> ncatted -h -a table_id,global,m,c,"Table 1" -a long_name,tasmax,m,c,"Maximum Daily Surface Air Temperature" -a cell_methods,tasmax,m,c,"time: maximum(interval: 1 days)" tasmax.nc
5) Split files back into 5-year chunks using ncks
For NARCCAP publication, we have everything split into 5-year chunks
to keep the file sizes below 2 GB. If NCO has been installed with
udunits support, we can subset the data along the time dimension using
dates, which is a big plus for understanding what happened to the data
later on. There's no good programmatic way to split the files
according to the NARCCAP spec, so we just specify all the start and
end dates by hand. For Table 1 data, we can leave the time of day
unspecified. This sets it to 00:00 hours, and since the coordinates
for daily values are at 06:00 hours, the bounds as specified below
will split things properly. (The situation would be more complicated
for splitting 3-hourly data.) Happily, going from Jan-01 to Jan-01
also lets you ignore differences in the calendar.
The little shell loop does this for both tasmax and tasmin, and
propagates whatever other filename components may be in place.
NCEP data:
foreach f (tasm*.nc)
set g = `basename $f .nc`
ncks -O -d time,"1979-01-01","1981-01-01" $f ${g}_1979010106.nc
ncks -O -d time,"1981-01-01","1986-01-01" $f ${g}_1981010106.nc
ncks -O -d time,"1986-01-01","1991-01-01" $f ${g}_1986010106.nc
ncks -O -d time,"1991-01-01","1996-01-01" $f ${g}_1991010106.nc
ncks -O -d time,"1996-01-01","2001-01-01" $f ${g}_1996010106.nc
ncks -O -d time,"2001-01-01", $f ${g}_2001010106.nc
end
Current-period data:
foreach f (tasm*.nc)
set g = `basename $f .nc`
ncks -O -d time,"1968-01-01","1971-01-01" $f ${g}_1968010106.nc
ncks -O -d time,"1971-01-01","1976-01-01" $f ${g}_1971010106.nc
ncks -O -d time,"1976-01-01","1981-01-01" $f ${g}_1976010106.nc
ncks -O -d time,"1981-01-01","1986-01-01" $f ${g}_1981010106.nc
ncks -O -d time,"1986-01-01","1991-01-01" $f ${g}_1986010106.nc
ncks -O -d time,"1991-01-01","1996-01-01" $f ${g}_1991010106.nc
ncks -O -d time,"1996-01-01", $f ${g}_1996010106.nc
end
Future-period data:
foreach f (tasm*.nc)
set g = `basename $f .nc`
ncks -O -d time,"2038-01-01","2041-01-01" $f ${g}_2038010106.nc
ncks -O -d time,"2041-01-01","2046-01-01" $f ${g}_2041010106.nc
ncks -O -d time,"2046-01-01","2051-01-01" $f ${g}_2046010106.nc
ncks -O -d time,"2051-01-01","2056-01-01" $f ${g}_2051010106.nc
ncks -O -d time,"2056-01-01","2061-01-01" $f ${g}_2056010106.nc
ncks -O -d time,"2061-01-01","2066-01-01" $f ${g}_2061010106.nc
ncks -O -d time,"2066-01-01", $f ${g}_2066010106.nc
end
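The boundary arithmetic above can be sanity-checked with a Python
sketch (illustration only; the real files carry "days since ..." time
units that udunits interprets):

```python
from datetime import datetime, timedelta

# Hypothetical daily time coordinates at 0600 GMT covering 1979-1985
# (2557 days: 1979-01-01 through 1985-12-31, with 1980 and 1984 leap).
t0 = datetime(1979, 1, 1, 6)
coords = [t0 + timedelta(days=i) for i in range(2557)]

# Two of the ncks -d time,start,end windows, inclusive on both ends:
windows = [
    (datetime(1979, 1, 1), datetime(1981, 1, 1)),
    (datetime(1981, 1, 1), datetime(1986, 1, 1)),
]

# Every coordinate sits at 06:00 and every boundary at 00:00, so no
# coordinate ever equals a boundary, and each coordinate falls into
# exactly one window despite the inclusive endpoints.
hits = [sum(lo <= c <= hi for lo, hi in windows) for c in coords]
print(all(h == 1 for h in hits))  # True
```

This is why the shared dates at the window seams do not duplicate any
daily timesteps across the output chunks.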
6) Double-check results
Always check that the end result makes sense. I wrote a little script
in NCL that uses the cd_calendar() function to print the date and time
of the first and last timestep in a file, and that plus the number of
timesteps in each file and the debugging output from the aggregate
script should indicate whether everything did what it was supposed to.
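The check script itself is NCL (using cd_calendar), but the idea is
simple. Here is a rough Python equivalent, assuming a standard
calendar and "days since" time units; NARCCAP models may use 360-day
or noleap calendars, which cd_calendar handles and this sketch does
not:

```python
from datetime import datetime, timedelta

def first_last(times, units_epoch):
    """Print the date/time of the first and last timestep plus the
    timestep count, for time values given in days since units_epoch.
    (Rough stand-in for the NCL cd_calendar() check script.)"""
    first = units_epoch + timedelta(days=times[0])
    last = units_epoch + timedelta(days=times[-1])
    print(f"{len(times)} steps: {first:%Y-%m-%d %H:%M} .. {last:%Y-%m-%d %H:%M}")
    return first, last

# Three daily values at 0600 GMT, "days since 1979-01-01":
first_last([0.25, 1.25, 2.25], datetime(1979, 1, 1))
```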