Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify create_time_series to save RESTOM time series file #240

Merged
merged 10 commits into from
Jan 9, 2024

Conversation

brianpm
Copy link
Collaborator

@brianpm brianpm commented Apr 13, 2023

Save RESTOM time series file whenever FLNT and FSNT are available in the hist_file_var_list in create_time_series.

This is a little brute force, but is relatively low overhead, and then allows RESTOM to be used like a regular variable throughout the rest of ADF.

@justin-richling
Copy link
Collaborator

Thanks for this @brianpm, sorry it took so long to get to.

I also changed the amwg_table.py to remove any RESTOM calculations now that RESTOM has its own time series file. I also added a section to move RESTOM to the top of the table (if applicable) as per Cecile's request. Would you mind updating your file and push to this branch?

import numpy as np
import xarray as xr
import sys
from pathlib import Path
import warnings  # use to warn user about missing files.

#Import "special" modules:
try:
    import scipy.stats as stats # for easy linear regression and testing
except ImportError:
    print("Scipy module does not exist in python path, but is needed for amwg_table.")
    print("Please install module, e.g. 'pip install scipy'.")
    sys.exit(1)
#End except

try:
    import pandas as pd
except ImportError:
    print("Pandas module does not exist in python path, but is needed for amwg_table.")
    print("Please install module, e.g. 'pip install pandas'.")
    sys.exit(1)
#End except

#Import ADF-specific modules:
import plotting_functions as pf

def amwg_table(adf):

    """
    Main function goes through series of steps:
    - load the variable data
    - Determine whether there are spatial dims; if yes, do global average (TODO: regional option)
    - Apply annual average (TODO: add seasonal here)
    - calculates the statistics
      + mean
      + sample size
      + standard deviation
      + standard error of the mean
      + 5/95% confidence interval of the mean
      + linear trend
      + p-value of linear trend
    - puts statistics into a CSV file
    - generates simple HTML that can display the data

    Description of needed inputs from ADF:

    case_names      -> Name(s) of CAM case provided by "cam_case_name"
    input_ts_locs   -> Location(s) of CAM time series files provided by "cam_ts_loc"
    output_loc      -> Location to write AMWG table files to, provided by "cam_diag_plot_loc"
    var_list        -> List of CAM output variables provided by "diag_var_list"
    var_defaults    -> Dict that has keys that are variable names and values that are plotting preferences/defaults.

    and if doing a CAM baseline comparison:

    baseline_name     -> Name of CAM baseline case provided by "cam_case_name"
    input_ts_baseline -> Location of CAM baseline time series files provied by "cam_ts_loc"

    """

    #Import necessary modules:
    from adf_base import AdfError

    #Additional information:
    #----------------------

    # GOAL: replace the "Tables" set in AMWG
    #       Set Description
    #   1 Tables of ANN, DJF, JJA, global and regional means and RMSE.
    #
    # STRATEGY:
    # I think the right solution is to generate one CSV (or other?) file that
    # contains all of the data.
    # So we need:
    # - a function that would produces the data, and
    # - then call a function that adds the data to a file
    # - another function(module?) that uses the file to produce a "web page"

    # IMPLEMENTATION:
    # - assume that we will have time series of global averages already ... that should be done ahead of time
    # - given a variable or file for a variable (equivalent), we will calculate the all-time, DJF, JJA, MAM, SON
    #   + mean
    #   + standard error of the mean
    #     -- 95% confidence interval for the mean, estimated by:
    #     ---- CI95 = mean + (SE * 1.96)
    #     ---- CI05 = mean - (SE * 1.96)
    #   + standard deviation
    # AMWG also includes the RMSE b/c it is comparing two things, but I will put that off for now.

    # DETAIL: we use python's type hinting as much as possible

    # in future, provide option to do multiple domains
    # They use 4 pre-defined domains:
    domains = {"global": (0, 360, -90, 90),
               "tropics": (0, 360, -20, 20),
               "southern": (0, 360, -90, -20),
               "northern": (0, 360, 20, 90)}

    # and then in time it is DJF JJA ANN

    # within each domain and season
    # the result is just a table of
    # VARIABLE-NAME, RUN VALUE, OBS VALUE, RUN-OBS, RMSE
    #----------------------

    #Notify user that script has started:
    print("\n  Calculating AMWG variable table...")


    #Extract needed quantities from ADF object:
    #-----------------------------------------
    var_list     = adf.diag_var_list
    var_defaults = adf.variable_defaults

    #Check if ocean or land fraction exist
    #in the variable list:
    for var in ["OCNFRAC", "LANDFRAC"]:
        if var in var_list:
            #If so, then move them to the front of variable list so
            #that they can be used to mask or vertically interpolate
            #other model variables if need be:
            var_idx = var_list.index(var)
            var_list.pop(var_idx)
            var_list.insert(0,var)
        #End if
    #End if

    #Special ADF variable which contains the output paths for
    #all generated plots and tables for each case:
    output_locs = adf.plot_location

    #CAM simulation variables (these quantities are always lists):
    case_names    = adf.get_cam_info("cam_case_name", required=True)
    input_ts_locs = adf.get_cam_info("cam_ts_loc", required=True)

    #Check if a baseline simulation is also being used:
    if not adf.get_basic_info("compare_obs"):
        #Extract CAM baseline variaables:
        baseline_name     = adf.get_baseline_info("cam_case_name", required=True)
        input_ts_baseline = adf.get_baseline_info("cam_ts_loc", required=True)

        if "CMIP" in baseline_name:
            print("CMIP files detected, skipping AMWG table (for now)...")

        else:
            #Append to case list:
            case_names.append(baseline_name)
            input_ts_locs.append(input_ts_baseline)

        #Save the baseline to the first case's plots directory:
        output_locs.append(output_locs[0])

    #-----------------------------------------

    #Loop over CAM cases:
    for case_idx, case_name in enumerate(case_names):

        #Convert output location string to a Path object:
        output_location = Path(output_locs[case_idx])

        #Generate input file path:
        input_location = Path(input_ts_locs[case_idx])

        #Check that time series input directory actually exists:
        if not input_location.is_dir():
            errmsg = f"Time series directory '{input_location}' not found.  Script is exiting."
            raise AdfError(errmsg)
        #Write to debug log if enabled:
        adf.debug_log(f"DEBUG: location of files is {str(input_location)}")
        #Check if analysis directory exists, and if not, then create it:
        if not output_location.is_dir():
            print(f"\t    {output_locs[case_idx]} not found, making new directory")
            output_location.mkdir(parents=True)

        #Create output file name:
        output_csv_file = output_location / f"amwg_table_{case_name}.csv"

        #Given that this is a final, user-facing analysis, go ahead and re-do it every time:
        if Path(output_csv_file).is_file():
            Path.unlink(output_csv_file)
        #End if

        #Create/reset new variable that potentially stores the re-gridded
        #ocean fraction xarray data-array:
        ocn_frc_da = None

        #Loop over CAM output variables:
        for var in var_list:

            #Notify users of variable being added to table:
            print(f"\t - Variable '{var}' being added to table")

            #Create list of time series files present for variable:
            ts_filenames = f'{case_name}.*.{var}.*nc'
            ts_files = sorted(input_location.glob(ts_filenames))

            # If no files exist, try to move to next variable. --> Means we can not proceed with this variable, and it'll be problematic later.
            if not ts_files:
                errmsg = f"Time series files for variable '{var}' not found.  Script will continue to next variable."
                warnings.warn(errmsg)
                continue
            #End if

            #TEMPORARY:  For now, make sure only one file exists:
            if len(ts_files) != 1:
                errmsg =  "Currently the AMWG table script can only handle one time series file per variable."
                errmsg += f" Multiple files were found for the variable '{var}'"
                raise AdfError(errmsg)
            #End if

            #Load model data from file:
            data = _load_data(ts_files[0], var)

            #Extract units string, if available:
            if hasattr(data, 'units'):
                unit_str = data.units
            else:
                unit_str = '--'

            #Check if variable has a vertical coordinate:
            if 'lev' in data.coords or 'ilev' in data.coords:
                print(f"\t   Variable '{var}' has a vertical dimension, "+\
                      "which is currently not supported for the AMWG Table. Skipping...")
                #Skip this variable and move to the next variable in var_list:
                continue
            #End if

            #Extract defaults for variable:
            var_default_dict = var_defaults.get(var, {})

            #Check if variable should be masked:
            if 'mask' in var_default_dict:
                if var_default_dict['mask'].lower() == 'ocean':
                    #Check if the ocean fraction has already been regridded
                    #and saved:
                    if ocn_frc_da is not None:
                        ofrac = ocn_frc_da
                        # set the bounds of regridded ocnfrac to 0 to 1
                        ofrac = xr.where(ofrac>1,1,ofrac)
                        ofrac = xr.where(ofrac<0,0,ofrac)

                        # apply ocean fraction mask to variable
                        data = pf.mask_land_or_ocean(data, ofrac, use_nan=True)
                        #data = var_tmp
                    else:
                        print(f"OCNFRAC not found, unable to apply mask to '{var}'")
                    #End if
                else:
                    #Currently only an ocean mask is supported, so print warning here:
                    wmsg = "Currently the only variable mask option is 'ocean',"
                    wmsg += f"not '{var_default_dict['mask'].lower()}'"
                    print(wmsg)
                #End if
            #End if

            #If the variable is ocean fraction, then save the dataset for use later:
            if var == 'OCNFRAC':
                ocn_frc_da = data
            #End if

            # we should check if we need to do area averaging:
            if len(data.dims) > 1:
                # flags that we have spatial dimensions
                # Note: that could be 'lev' which should trigger different behavior
                # Note: we should be able to handle (lat, lon) or (ncol,) cases, at least
                data = pf.spatial_average(data)  # changes data "in place"

            # In order to get correct statistics, average to annual or seasonal
            data = pf.annual_mean(data, whole_years=True, time_name='time')

            # create a dataframe:
            cols = ['variable', 'unit', 'mean', 'sample size', 'standard dev.',
                    'standard error', '95% CI', 'trend', 'trend p-value']

            # These get written to our output file:
            stats_list = _get_row_vals(data)
            row_values = [var, unit_str] + stats_list

            # Format entries:
            dfentries = {c:[row_values[i]] for i,c in enumerate(cols)}

            # Add entries to Pandas structure:
            df = pd.DataFrame(dfentries)

            # Check if the output CSV file exists,
            # if so, then append to it:
            if output_csv_file.is_file():
                df.to_csv(output_csv_file, mode='a', header=False, index=False)
            else:
                df.to_csv(output_csv_file, header=cols, index=False)

        #End of var_list loop
        #--------------------

        # Move RESTOM to top of table
        #----------------------------
        if 'RESTOM' in var_list:
            table_df = pd.read_csv(output_csv_file)
            idx = table_df.index[table_df['variable'] == 'RESTOM'].tolist()[0]
            table_df = pd.concat([table_df[table_df['variable'] == 'RESTOM'], table_df]).reset_index(drop = True)
            table_df = table_df.drop([idx+1]).reset_index(drop=True)
            table_df = table_df.drop_duplicates()
            table_df.to_csv(output_csv_file, header=cols, index=False)
        #End if

        # last step is to add table dataframe to website (if enabled):
        table_df = pd.read_csv(output_csv_file)
        adf.add_website_data(table_df, case_name, case_name, plot_type="Tables")

    #End of model case loop
    #----------------------

    #Notify user that script has ended:
    print("  ...AMWG variable table has been generated successfully.")

    #Check if observations are being comapred to, if so skip table comparison...
    if not adf.get_basic_info("compare_obs"):
        if "CMIP" in baseline_name:
            print("CMIP case detected, skipping comparison table...")
        else:
            #Create comparison table for both cases
            print("\n  Making comparison table...")
            _df_comp_table(adf, output_location, case_names)
            print("  ... Comparison table has been generated successfully")
        #End if
    else:
        print(" Comparison table currently doesn't work with obs, so skipping...")
    #End if


##################
# Helper functions
##################

def _load_data(dataloc, varname):
    import xarray as xr
    ds = xr.open_dataset(dataloc)
    return ds[varname]

#####

def _get_row_vals(data):
    # Now that data is (time,), we can do our simple stats:

    data_mean = data.data.mean()
    #Conditional Formatting depending on type of float
    if np.abs(data_mean) < 1:
        formatter = ".3g"
    else:
        formatter = ".3f"

    data_sample = len(data)
    data_std = data.std()
    data_sem = data_std / data_sample
    data_ci = data_sem * 1.96  # https://en.wikipedia.org/wiki/Standard_error
    data_trend = stats.linregress(data.year, data.values)

    stdev = f'{data_std.data.item() : {formatter}}'
    sem = f'{data_sem.data.item() : {formatter}}'
    ci = f'{data_ci.data.item() : {formatter}}'
    slope_int = f'{data_trend.intercept : {formatter}} + {data_trend.slope : {formatter}} t'
    pval = f'{data_trend.pvalue : {formatter}}'

    return [f'{data_mean:{formatter}}', data_sample, stdev, sem, ci, slope_int, pval]

#####

def _df_comp_table(adf, output_location, case_names):
    import pandas as pd

    output_csv_file_comp = output_location / "amwg_table_comp.csv"

    # * * * * * * * * * * * * * * * * * * * * * * * * * * * *
    #This will be for single-case for now (case_names[0]),
    #will need to change to loop as multi-case is introduced
    case = output_location/f"amwg_table_{case_names[0]}.csv"
    baseline = output_location/f"amwg_table_{case_names[-1]}.csv"

    #Read in test case and baseline dataframes:
    df_case = pd.read_csv(case)
    df_base = pd.read_csv(baseline)

    #Create a merged dataframe that contains only the variables
    #contained within both the test case and the baseline:
    df_merge = pd.merge(df_case, df_base, how='inner', on=['variable'])

    #Create the "comparison" dataframe:
    df_comp = pd.DataFrame(dtype=object)
    df_comp[['variable','unit','case']] = df_merge[['variable','unit_x','mean_x']]
    df_comp['baseline'] = df_merge[['mean_y']]

    diffs = df_comp['case'].values-df_comp['baseline'].values
    df_comp['diff'] = [f'{i:.3g}' if np.abs(i) < 1 else f'{i:.3f}' for i in diffs]

    #Write the comparison dataframe to a new CSV file:
    cols_comp = ['variable', 'unit', 'test', 'control', 'diff']
    df_comp.to_csv(output_csv_file_comp, header=cols_comp, index=False)

    #Add comparison table dataframe to website (if enabled):
    adf.add_website_data(df_comp, "Case Comparison", case_names[0], plot_type="Tables")

##############
#END OF SCRIPT

@justin-richling
Copy link
Collaborator

I forgot that the change in amwg_table.py assumes we add RESTOM to the diag_var_list in the config files

@brianpm
Copy link
Collaborator Author

brianpm commented Oct 4, 2023

@justin-richling --- I hope I did this correctly. I "synced" my branch with ADF main, and then I put your amwg_table.py into a commit and pushed back to my branch. Looks like one of the tests is failing, but I don't know why.

@nusbaume nusbaume requested review from nusbaume and removed request for nusbaume December 21, 2023 02:55
@nusbaume
Copy link
Collaborator

Just wanted to add a note here to see if we can add RESTOM to the time series files via the same method used to add PRECT in PR #260? If you need any help or guidance doing that just let me know (and once that is done I'll go ahead and do a formal review). Thanks!

@brianpm
Copy link
Collaborator Author

brianpm commented Dec 22, 2023

I have modified adf_diag.py to derive RESTOM following the same recipe/approach as #260. I tested that RESTOM time series files are being generated with real CAM data. I also added RESTOM to adf_variable_defaults.yaml because the code does not know it is a derivable quantity otherwise (Note: I did not check if the colors/contour levels are good, and that will probably need to be adjusted.)

A couple of things I noted:

  • In both cases (PRECT and RESTOM) this approach assumes that there will be one time series file per variable; the method may or may not succeed in creating a new file in other situations, but in all other possible situations, the result file will not be the correct/complete result.
  • The above problem also includes finding ".tmp" files from previous failed attempts.
  • There was no provision for what to do in cases where the derived variable file already exists, which would probably end up with an NCO prompt at the command line. I added a simple check and a kwarg to determine whether to overwrite or not. If overwrite is true and the file exists, it will be removed and re-made, otherwise it will use the existing file. Applied this to both PRECT and RESTOM.
  • this approach produces files that will have all three variables in them, replicating the time series information from the two input variables
  • this approach does not correct metadata for the derived variable, it is copied from one of the input variables. This may negatively impact automatic labeling of plots (especially long_name is incorrect).
  • The use of os.system has been superseded by the subprocess approach
  • The indentation of the code was weird, so I reformatted using the black formatter (I tried to do just the method at the end, but I think it hit the whole file).

@nusbaume nusbaume self-requested a review December 22, 2023 17:14
@nusbaume nusbaume added the enhancement New feature or request label Dec 22, 2023
Copy link
Collaborator

@nusbaume nusbaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I did have a couple of suggested changes and questions, but nothing that should meaningfully hold up this PR, at least on my end.

Also, I tried to capture your concerns (along with at least one of my own) with the current derived variables method with ADF issue #273. Please feel free to add anything I missed!

scripts/analysis/amwg_table.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated Show resolved Hide resolved
lib/adf_diag.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@justin-richling justin-richling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, my suggestions can be done here, or in a different PR if desired.
Thanks @brianpm

lib/adf_diag.py Outdated
tmp2 = xr.open_dataset(shortwavenet)
tmp3 = tmp2[swnet_name]-tmp1[lwnet_name]
tmp3.name = "RESTOM"
tmp3.attrs['long_name'] = "residual radiative flux at top of model"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add the units for RESTOM here too

tmp3.attrs['units'] = "W/m2"

Copy link
Collaborator Author

@brianpm brianpm Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are gone now because @nusbaume asked for me to use the implementation following PRECT.

@@ -683,6 +683,19 @@ QRS:
#+++++++++++++++++
# Category: TOA energy flux
#+++++++++++++++++
RESTOM:
colormap: "RdBu_r"
contour_levels_range: [-10, 10, 0.5]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but when plotting the contours are a different range. I'm not sure if the RESTOM global lat/lon plots will ever be looked at

RESTOM_ANN_LatLon_Mean

but I just deleted the contour_levels_range variable

RESTOM_ANN_LatLon_Mean (1)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, totally a typo. I just made the range from -100 to 100 in the updated code.

scripts/analysis/amwg_table.py Outdated Show resolved Hide resolved
@brianpm
Copy link
Collaborator Author

brianpm commented Jan 8, 2024

Okie dokie, I think I just addressed all the issues from the review.

Copy link
Collaborator

@justin-richling justin-richling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @brianpm, thanks for getting this in!

@justin-richling justin-richling merged commit f8eef00 into NCAR:main Jan 9, 2024
7 checks passed
justin-richling added a commit to justin-richling/ADF that referenced this pull request Jan 9, 2024
Bring in changes from PR NCAR#240
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants