modify create_time_series to save RESTOM time series file #240

brianpm · 2023-04-13T18:05:24Z

Save RESTOM time series file whenever FLNT and FSNT are available in the hist_file_var_list in create_time_series.

This is a little brute force, but is relatively low overhead, and then allows RESTOM to be used like a regular variable throughout the rest of ADF.

justin-richling · 2023-10-04T15:42:01Z

Thanks for this @brianpm, sorry it took so long to get to.

I also changed the amwg_table.py to remove any RESTOM calculations now that RESTOM has its own time series file. I also added a section to move RESTOM to the top of the table (if applicable) as per Cecile's request. Would you mind updating your file and push to this branch?

import numpy as np
import xarray as xr
import sys
from pathlib import Path
import warnings  # use to warn user about missing files.

#Import "special" modules:
try:
    import scipy.stats as stats # for easy linear regression and testing
except ImportError:
    print("Scipy module does not exist in python path, but is needed for amwg_table.")
    print("Please install module, e.g. 'pip install scipy'.")
    sys.exit(1)
#End except

try:
    import pandas as pd
except ImportError:
    print("Pandas module does not exist in python path, but is needed for amwg_table.")
    print("Please install module, e.g. 'pip install pandas'.")
    sys.exit(1)
#End except

#Import ADF-specific modules:
import plotting_functions as pf

def amwg_table(adf):

    """
    Main function goes through series of steps:
    - load the variable data
    - Determine whether there are spatial dims; if yes, do global average (TODO: regional option)
    - Apply annual average (TODO: add seasonal here)
    - calculates the statistics
      + mean
      + sample size
      + standard deviation
      + standard error of the mean
      + 5/95% confidence interval of the mean
      + linear trend
      + p-value of linear trend
    - puts statistics into a CSV file
    - generates simple HTML that can display the data

    Description of needed inputs from ADF:

    case_names      -> Name(s) of CAM case provided by "cam_case_name"
    input_ts_locs   -> Location(s) of CAM time series files provided by "cam_ts_loc"
    output_loc      -> Location to write AMWG table files to, provided by "cam_diag_plot_loc"
    var_list        -> List of CAM output variables provided by "diag_var_list"
    var_defaults    -> Dict that has keys that are variable names and values that are plotting preferences/defaults.

    and if doing a CAM baseline comparison:

    baseline_name     -> Name of CAM baseline case provided by "cam_case_name"
    input_ts_baseline -> Location of CAM baseline time series files provied by "cam_ts_loc"

    """

    #Import necessary modules:
    from adf_base import AdfError

    #Additional information:
    #----------------------

    # GOAL: replace the "Tables" set in AMWG
    #       Set Description
    #   1 Tables of ANN, DJF, JJA, global and regional means and RMSE.
    #
    # STRATEGY:
    # I think the right solution is to generate one CSV (or other?) file that
    # contains all of the data.
    # So we need:
    # - a function that would produces the data, and
    # - then call a function that adds the data to a file
    # - another function(module?) that uses the file to produce a "web page"

    # IMPLEMENTATION:
    # - assume that we will have time series of global averages already ... that should be done ahead of time
    # - given a variable or file for a variable (equivalent), we will calculate the all-time, DJF, JJA, MAM, SON
    #   + mean
    #   + standard error of the mean
    #     -- 95% confidence interval for the mean, estimated by:
    #     ---- CI95 = mean + (SE * 1.96)
    #     ---- CI05 = mean - (SE * 1.96)
    #   + standard deviation
    # AMWG also includes the RMSE b/c it is comparing two things, but I will put that off for now.

    # DETAIL: we use python's type hinting as much as possible

    # in future, provide option to do multiple domains
    # They use 4 pre-defined domains:
    domains = {"global": (0, 360, -90, 90),
               "tropics": (0, 360, -20, 20),
               "southern": (0, 360, -90, -20),
               "northern": (0, 360, 20, 90)}

    # and then in time it is DJF JJA ANN

    # within each domain and season
    # the result is just a table of
    # VARIABLE-NAME, RUN VALUE, OBS VALUE, RUN-OBS, RMSE
    #----------------------

    #Notify user that script has started:
    print("\n  Calculating AMWG variable table...")


    #Extract needed quantities from ADF object:
    #-----------------------------------------
    var_list     = adf.diag_var_list
    var_defaults = adf.variable_defaults

    #Check if ocean or land fraction exist
    #in the variable list:
    for var in ["OCNFRAC", "LANDFRAC"]:
        if var in var_list:
            #If so, then move them to the front of variable list so
            #that they can be used to mask or vertically interpolate
            #other model variables if need be:
            var_idx = var_list.index(var)
            var_list.pop(var_idx)
            var_list.insert(0,var)
        #End if
    #End if

    #Special ADF variable which contains the output paths for
    #all generated plots and tables for each case:
    output_locs = adf.plot_location

    #CAM simulation variables (these quantities are always lists):
    case_names    = adf.get_cam_info("cam_case_name", required=True)
    input_ts_locs = adf.get_cam_info("cam_ts_loc", required=True)

    #Check if a baseline simulation is also being used:
    if not adf.get_basic_info("compare_obs"):
        #Extract CAM baseline variaables:
        baseline_name     = adf.get_baseline_info("cam_case_name", required=True)
        input_ts_baseline = adf.get_baseline_info("cam_ts_loc", required=True)

        if "CMIP" in baseline_name:
            print("CMIP files detected, skipping AMWG table (for now)...")

        else:
            #Append to case list:
            case_names.append(baseline_name)
            input_ts_locs.append(input_ts_baseline)

        #Save the baseline to the first case's plots directory:
        output_locs.append(output_locs[0])

    #-----------------------------------------

    #Loop over CAM cases:
    for case_idx, case_name in enumerate(case_names):

        #Convert output location string to a Path object:
        output_location = Path(output_locs[case_idx])

        #Generate input file path:
        input_location = Path(input_ts_locs[case_idx])

        #Check that time series input directory actually exists:
        if not input_location.is_dir():
            errmsg = f"Time series directory '{input_location}' not found.  Script is exiting."
            raise AdfError(errmsg)
        #Write to debug log if enabled:
        adf.debug_log(f"DEBUG: location of files is {str(input_location)}")
        #Check if analysis directory exists, and if not, then create it:
        if not output_location.is_dir():
            print(f"\t    {output_locs[case_idx]} not found, making new directory")
            output_location.mkdir(parents=True)

        #Create output file name:
        output_csv_file = output_location / f"amwg_table_{case_name}.csv"

        #Given that this is a final, user-facing analysis, go ahead and re-do it every time:
        if Path(output_csv_file).is_file():
            Path.unlink(output_csv_file)
        #End if

        #Create/reset new variable that potentially stores the re-gridded
        #ocean fraction xarray data-array:
        ocn_frc_da = None

        #Loop over CAM output variables:
        for var in var_list:

            #Notify users of variable being added to table:
            print(f"\t - Variable '{var}' being added to table")

            #Create list of time series files present for variable:
            ts_filenames = f'{case_name}.*.{var}.*nc'
            ts_files = sorted(input_location.glob(ts_filenames))

            # If no files exist, try to move to next variable. --> Means we can not proceed with this variable, and it'll be problematic later.
            if not ts_files:
                errmsg = f"Time series files for variable '{var}' not found.  Script will continue to next variable."
                warnings.warn(errmsg)
                continue
            #End if

            #TEMPORARY:  For now, make sure only one file exists:
            if len(ts_files) != 1:
                errmsg =  "Currently the AMWG table script can only handle one time series file per variable."
                errmsg += f" Multiple files were found for the variable '{var}'"
                raise AdfError(errmsg)
            #End if

            #Load model data from file:
            data = _load_data(ts_files[0], var)

            #Extract units string, if available:
            if hasattr(data, 'units'):
                unit_str = data.units
            else:
                unit_str = '--'

            #Check if variable has a vertical coordinate:
            if 'lev' in data.coords or 'ilev' in data.coords:
                print(f"\t   Variable '{var}' has a vertical dimension, "+\
                      "which is currently not supported for the AMWG Table. Skipping...")
                #Skip this variable and move to the next variable in var_list:
                continue
            #End if

            #Extract defaults for variable:
            var_default_dict = var_defaults.get(var, {})

            #Check if variable should be masked:
            if 'mask' in var_default_dict:
                if var_default_dict['mask'].lower() == 'ocean':
                    #Check if the ocean fraction has already been regridded
                    #and saved:
                    if ocn_frc_da is not None:
                        ofrac = ocn_frc_da
                        # set the bounds of regridded ocnfrac to 0 to 1
                        ofrac = xr.where(ofrac>1,1,ofrac)
                        ofrac = xr.where(ofrac<0,0,ofrac)

                        # apply ocean fraction mask to variable
                        data = pf.mask_land_or_ocean(data, ofrac, use_nan=True)
                        #data = var_tmp
                    else:
                        print(f"OCNFRAC not found, unable to apply mask to '{var}'")
                    #End if
                else:
                    #Currently only an ocean mask is supported, so print warning here:
                    wmsg = "Currently the only variable mask option is 'ocean',"
                    wmsg += f"not '{var_default_dict['mask'].lower()}'"
                    print(wmsg)
                #End if
            #End if

            #If the variable is ocean fraction, then save the dataset for use later:
            if var == 'OCNFRAC':
                ocn_frc_da = data
            #End if

            # we should check if we need to do area averaging:
            if len(data.dims) > 1:
                # flags that we have spatial dimensions
                # Note: that could be 'lev' which should trigger different behavior
                # Note: we should be able to handle (lat, lon) or (ncol,) cases, at least
                data = pf.spatial_average(data)  # changes data "in place"

            # In order to get correct statistics, average to annual or seasonal
            data = pf.annual_mean(data, whole_years=True, time_name='time')

            # create a dataframe:
            cols = ['variable', 'unit', 'mean', 'sample size', 'standard dev.',
                    'standard error', '95% CI', 'trend', 'trend p-value']

            # These get written to our output file:
            stats_list = _get_row_vals(data)
            row_values = [var, unit_str] + stats_list

            # Format entries:
            dfentries = {c:[row_values[i]] for i,c in enumerate(cols)}

            # Add entries to Pandas structure:
            df = pd.DataFrame(dfentries)

            # Check if the output CSV file exists,
            # if so, then append to it:
            if output_csv_file.is_file():
                df.to_csv(output_csv_file, mode='a', header=False, index=False)
            else:
                df.to_csv(output_csv_file, header=cols, index=False)

        #End of var_list loop
        #--------------------

        # Move RESTOM to top of table
        #----------------------------
        if 'RESTOM' in var_list:
            table_df = pd.read_csv(output_csv_file)
            idx = table_df.index[table_df['variable'] == 'RESTOM'].tolist()[0]
            table_df = pd.concat([table_df[table_df['variable'] == 'RESTOM'], table_df]).reset_index(drop = True)
            table_df = table_df.drop([idx+1]).reset_index(drop=True)
            table_df = table_df.drop_duplicates()
            table_df.to_csv(output_csv_file, header=cols, index=False)
        #End if

        # last step is to add table dataframe to website (if enabled):
        table_df = pd.read_csv(output_csv_file)
        adf.add_website_data(table_df, case_name, case_name, plot_type="Tables")

    #End of model case loop
    #----------------------

    #Notify user that script has ended:
    print("  ...AMWG variable table has been generated successfully.")

    #Check if observations are being comapred to, if so skip table comparison...
    if not adf.get_basic_info("compare_obs"):
        if "CMIP" in baseline_name:
            print("CMIP case detected, skipping comparison table...")
        else:
            #Create comparison table for both cases
            print("\n  Making comparison table...")
            _df_comp_table(adf, output_location, case_names)
            print("  ... Comparison table has been generated successfully")
        #End if
    else:
        print(" Comparison table currently doesn't work with obs, so skipping...")
    #End if


##################
# Helper functions
##################

def _load_data(dataloc, varname):
    import xarray as xr
    ds = xr.open_dataset(dataloc)
    return ds[varname]

#####

def _get_row_vals(data):
    # Now that data is (time,), we can do our simple stats:

    data_mean = data.data.mean()
    #Conditional Formatting depending on type of float
    if np.abs(data_mean) < 1:
        formatter = ".3g"
    else:
        formatter = ".3f"

    data_sample = len(data)
    data_std = data.std()
    data_sem = data_std / data_sample
    data_ci = data_sem * 1.96  # https://en.wikipedia.org/wiki/Standard_error
    data_trend = stats.linregress(data.year, data.values)

    stdev = f'{data_std.data.item() : {formatter}}'
    sem = f'{data_sem.data.item() : {formatter}}'
    ci = f'{data_ci.data.item() : {formatter}}'
    slope_int = f'{data_trend.intercept : {formatter}} + {data_trend.slope : {formatter}} t'
    pval = f'{data_trend.pvalue : {formatter}}'

    return [f'{data_mean:{formatter}}', data_sample, stdev, sem, ci, slope_int, pval]

#####

def _df_comp_table(adf, output_location, case_names):
    import pandas as pd

    output_csv_file_comp = output_location / "amwg_table_comp.csv"

    # * * * * * * * * * * * * * * * * * * * * * * * * * * * *
    #This will be for single-case for now (case_names[0]),
    #will need to change to loop as multi-case is introduced
    case = output_location/f"amwg_table_{case_names[0]}.csv"
    baseline = output_location/f"amwg_table_{case_names[-1]}.csv"

    #Read in test case and baseline dataframes:
    df_case = pd.read_csv(case)
    df_base = pd.read_csv(baseline)

    #Create a merged dataframe that contains only the variables
    #contained within both the test case and the baseline:
    df_merge = pd.merge(df_case, df_base, how='inner', on=['variable'])

    #Create the "comparison" dataframe:
    df_comp = pd.DataFrame(dtype=object)
    df_comp[['variable','unit','case']] = df_merge[['variable','unit_x','mean_x']]
    df_comp['baseline'] = df_merge[['mean_y']]

    diffs = df_comp['case'].values-df_comp['baseline'].values
    df_comp['diff'] = [f'{i:.3g}' if np.abs(i) < 1 else f'{i:.3f}' for i in diffs]

    #Write the comparison dataframe to a new CSV file:
    cols_comp = ['variable', 'unit', 'test', 'control', 'diff']
    df_comp.to_csv(output_csv_file_comp, header=cols_comp, index=False)

    #Add comparison table dataframe to website (if enabled):
    adf.add_website_data(df_comp, "Case Comparison", case_names[0], plot_type="Tables")

##############
#END OF SCRIPT

justin-richling · 2023-10-04T15:45:40Z

I forgot that the change in amwg_table.py assumes we add RESTOM to the diag_var_list in the config files

brianpm · 2023-10-04T16:37:53Z

@justin-richling --- I hope I did this correctly. I "synced" my branch with ADF main, and then I put your amwg_table.py into a commit and pushed back to my branch. Looks like one of the tests is failing, but I don't know why.

nusbaume · 2023-12-21T03:08:29Z

Just wanted to add a note here to see if we can add RESTOM to the time series files via the same method used to add PRECT in PR #260? If you need any help or guidance doing that just let me know (and once that is done I'll go ahead and do a formal review). Thanks!

brianpm · 2023-12-22T00:15:11Z

I have modified adf_diag.py to derive RESTOM following the same recipe/approach as #260. I tested that RESTOM time series files are being generated with real CAM data. I also added RESTOM to adf_variable_defaults.yaml because the code does not know it is a derivable quantity otherwise (Note: I did not check if the colors/contour levels are good, and that will probably need to be adjusted.)

A couple of things I noted:

In both cases (PRECT and RESTOM) this approach assumes that there will be one time series file per variable; the method may or may not succeed in creating a new file in other situations, but in all other possible situations, the result file will not be the correct/complete result.
The above problem also includes finding ".tmp" files from previous failed attempts.
There was no provision for what to do in cases where the derived variable file already exists, which would probably end up with an NCO prompt at the command line. I added a simple check and a kwarg to determine whether to overwrite or not. If overwrite is true and the file exists, it will be removed and re-made, otherwise it will use the existing file. Applied this to both PRECT and RESTOM.
this approach produces files that will have all three variables in them, replicating the time series information from the two input variables
this approach does not correct metadata for the derived variable, it is copied from one of the input variables. This may negatively impact automatic labeling of plots (especially long_name is incorrect).
The use of os.system has been superseded by the subprocess approach
The indentation of the code was weird, so I reformatted using the black formatter (I tried to do just the method at the end, but I think it hit the whole file).

nusbaume

Looks good! I did have a couple of suggested changes and questions, but nothing that should meaningfully hold up this PR, at least on my end.

Also, I tried to capture your concerns (along with at least one of my own) with the current derived variables method with ADF issue #273. Please feel free to add anything I missed!

scripts/analysis/amwg_table.py

lib/adf_diag.py

justin-richling

This looks good to me, my suggestions can be done here, or in a different PR if desired.
Thanks @brianpm

justin-richling · 2023-10-04T15:34:42Z

lib/adf_diag.py

+                tmp2 = xr.open_dataset(shortwavenet)
+                tmp3 = tmp2[swnet_name]-tmp1[lwnet_name]
+                tmp3.name = "RESTOM"
+                tmp3.attrs['long_name'] = "residual radiative flux at top of model"


I would just add the units for RESTOM here too

tmp3.attrs['units'] = "W/m2"

These lines are gone now because @nusbaume asked for me to use the implementation following PRECT.

justin-richling · 2023-12-29T22:20:27Z

lib/adf_variable_defaults.yaml

@@ -683,6 +683,19 @@ QRS:
 #+++++++++++++++++
 # Category: TOA energy flux
 #+++++++++++++++++
+RESTOM:
+  colormap: "RdBu_r"
+  contour_levels_range: [-10, 10, 0.5]


Not a big deal, but when plotting the contours are a different range. I'm not sure if the RESTOM global lat/lon plots will ever be looked at

but I just deleted the contour_levels_range variable

Yeah, totally a typo. I just made the range from -100 to 100 in the updated code.

scripts/analysis/amwg_table.py

brianpm · 2024-01-08T20:25:59Z

Okie dokie, I think I just addressed all the issues from the review.

justin-richling

Looks good @brianpm, thanks for getting this in!

Bring in changes from PR NCAR#240

modify create_time_series to save RESTOM time series file

c2b237f

justin-richling self-requested a review April 13, 2023 20:46

brianpm mentioned this pull request Sep 26, 2023

Adding mods to derive PRECT #260

Merged

brianpm and others added 2 commits October 4, 2023 10:25

Merge branch 'NCAR:main' into restom_as_timeseries

9c01872

bring in Justin's amwg_table.py version

3deb3e1

Update pylint call in Github Actions workflow.

fb6848d

nusbaume requested review from nusbaume and removed request for nusbaume December 21, 2023 02:55

brianpm and others added 5 commits December 21, 2023 16:29

Merge branch 'NCAR:main' into restom_as_timeseries

c5b8456

changed to match NCAR#260

21ecbfd

fixed typo in message string

03d25a9

added RESTOM to adf_variable_defaults.yaml

6b1dac4

Switch the variable being inserted (order was wrong)

3a6e4f4

nusbaume self-requested a review December 22, 2023 17:14

nusbaume added the enhancement New feature or request label Dec 22, 2023

nusbaume mentioned this pull request Dec 23, 2023

Future upgrades to derived variable calculations #273

Open

nusbaume approved these changes Dec 23, 2023

View reviewed changes

scripts/analysis/amwg_table.py Outdated Show resolved Hide resolved

lib/adf_diag.py Outdated Show resolved Hide resolved

lib/adf_diag.py Outdated Show resolved Hide resolved

lib/adf_diag.py Outdated Show resolved Hide resolved

justin-richling reviewed Dec 29, 2023

View reviewed changes

justin-richling approved these changes Dec 29, 2023

View reviewed changes

addressing code review comments

2fb7924

justin-richling approved these changes Jan 9, 2024

View reviewed changes

justin-richling merged commit f8eef00 into NCAR:main Jan 9, 2024
7 checks passed

justin-richling added a commit to justin-richling/ADF that referenced this pull request Jan 9, 2024

Update amwg_table.py

0bc3cdb

Bring in changes from PR NCAR#240

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

modify create_time_series to save RESTOM time series file #240

modify create_time_series to save RESTOM time series file #240

brianpm commented Apr 13, 2023

justin-richling commented Oct 4, 2023

justin-richling commented Oct 4, 2023

brianpm commented Oct 4, 2023

nusbaume commented Dec 21, 2023

brianpm commented Dec 22, 2023

nusbaume left a comment

justin-richling left a comment

justin-richling Oct 4, 2023

brianpm Jan 8, 2024 •

edited

Loading

justin-richling Dec 29, 2023

brianpm Jan 8, 2024

brianpm commented Jan 8, 2024

justin-richling left a comment

modify create_time_series to save RESTOM time series file #240

modify create_time_series to save RESTOM time series file #240

Conversation

brianpm commented Apr 13, 2023

justin-richling commented Oct 4, 2023

justin-richling commented Oct 4, 2023

brianpm commented Oct 4, 2023

nusbaume commented Dec 21, 2023

brianpm commented Dec 22, 2023

nusbaume left a comment

Choose a reason for hiding this comment

justin-richling left a comment

Choose a reason for hiding this comment

justin-richling Oct 4, 2023

Choose a reason for hiding this comment

brianpm Jan 8, 2024 • edited Loading

Choose a reason for hiding this comment

justin-richling Dec 29, 2023

Choose a reason for hiding this comment

brianpm Jan 8, 2024

Choose a reason for hiding this comment

brianpm commented Jan 8, 2024

justin-richling left a comment

Choose a reason for hiding this comment

brianpm Jan 8, 2024 •

edited

Loading