Documentation

ECCO-ACCESS Preprocessing Pipeline Documentation

Overview

Preprocessing Pipeline

  • The general workflow for preprocessing a dataset to be added to the ECCO model is: [1] harvest the dataset, [2] run preprocessing on each file, and [3] aggregate data into yearly files. Currently the pipeline works on a local machine, although in the future it will be able to run on the cloud using AWS.

Logging/Tracking Metadata

  • The pipeline utilizes a Solr online database for logging. The Solr database is updated at each step of the pipeline with the following information: metadata about a dataset as a whole, metadata about specific files, and metadata about the pipeline process itself. This metadata is used to automatically control the flow of the pipeline. Entries in the database fall into one of seven “types”: Grid, Dataset, Field, Harvested, Transformation, Aggregation and Descendants.

Pipeline Structure

  • Dataset specific harvesting, transformation, and aggregation configuration files are used to provide the necessary information to run the pipeline from generalized code. The run_pipeline.py file provides the user with options for how to run the pipeline, what steps of the pipeline to run and what datasets to send through the pipeline.

Cloud Development

  • The current cloud computing functionality of this pipeline has not been tested or verified. As a result, the pipeline can currently only be run via the run_pipeline.py tool script on a local machine.

Solr

Solr is an indexing search platform that provides fast lookups; the pipeline uses it to track metadata. That metadata is also used to control which steps of the pipeline are applied to which pieces of data. There are seven types of entries within the Solr database.

Solr Structure

  • Types of Solr Entries:

  • Grid

    • 1 grid type entry per grid across all datasets (dataset independent)
    • Stored information
      • type, grid_type, grid_name, date_added, grid_path, grid_checksum, id, and _version_
  • Dataset

    • 1 dataset type entry per dataset
    • Stored information about the harvesting of that dataset:
      • harvest_status, last_checked, and last_download
    • Stored information specific to the dataset as a whole:
      • type, dataset, short_name, source, data_time_scale, date_format, last_checked, original_dataset_title, original_dataset_short_name, original_dataset_url, original_dataset_reference, original_dataset_doi, start_date, end_date, last_download, harvest_status, id, [grid_name]_[hemi]_factors_path, [grid_name]_[hemi]_factors_stored, [grid_name]_[hemi]_factors_version, transformation_status, aggregation_version, aggregation_status, and _version_
  • Field

    • 1 field type entry per field in a dataset
    • Stored information
      • type, dataset, name, long_name, standard_name, units, id, and _version_
  • Harvested

    • 1 harvested type entry per harvested granule (an example document is sketched after this list)
    • Stored information
      • type, dataset, date, source, modified_time, checksum, pre_transformation_file_path, harvest_success, filename, file_size, download_time, id, and _version_
  • Transformation

    • 1 transformation type entry per grid/field combination per harvested granule per dataset
    • Stored information
      • type, dataset, date, pre_transformation_file_path, origin_checksum, grid_name, field, transformation_in_progress, success, id, filename, transformation_file_path, transformation_completed, transformation_checksum, transformation_version, and _version_
  • Aggregation

    • 1 aggregation type entry per grid/field combination per year per dataset
    • Stored information
      • type, dataset, year, grid_name, field, aggregation_time, aggregation_success, aggregation_version, aggregation_daily_bin_path, aggregation_month_bin_path, aggregation_daily_netCDF_path, aggregation_monthly_netCDF_path, daily_aggregtated_uuid, monthly_aggregated_uuid, notes, id, and _version_
        • If monthly aggregation is not done, the monthly file paths will not be present
        • If data is in a monthly format, the daily file paths will not be present
  • Descendants

    • 1 descendants type entry per harvested granule
    • Provides information on how that granule has been used, where it has been used, and what its current status is in the preprocessing pipeline.
    • Stored information
      • type, dataset, date, filename, source, harvest_success, pre_transformation_file_path, all_transformations_success, [grid_name_field]_transformation_file_path (all transformation file paths), all_aggregation_success, [grid_name_field]_aggregated_[time_format_file_type]_path (all aggregated file paths that this file was included in), id, and _version_
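
As an illustration of what these entries look like, a single harvested-type document might resemble the Python dictionary below. This is a sketch only: the field names come from the list above, but the dynamic-field suffixes (see "General Solr Notes") and the values are assumptions, not taken from a real entry.

```python
# Hypothetical harvested-type Solr document; suffixes and values are illustrative.
harvested_doc = {
    "type_s": "harvested",
    "dataset_s": "sst_AVHRR_OI-NCEI-L4-GLOB-v2",
    "date_s": "2019-01-01T00:00:00Z",
    "source_s": "PODAAC",
    "modified_time_dt": "2019-01-02T12:00:00Z",
    "checksum_s": "0f343b0931126a20f133d67c2b018a3b",
    "pre_transformation_file_path_s": "harvested_granules/2019/granule.nc",
    "harvest_success_b": True,
    "filename_s": "granule.nc",
    "file_size_l": 1048576,
    "download_time_dt": "2020-06-01T08:30:00Z",
}
```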

Local Solr Implementation

  • Download and install Solr (see the official Apache Solr documentation for download and installation instructions).

  • Start Solr and create a collection named ecco_datasets:

    • Start Solr:
      • $ bin/solr start
    • Create a collection called ecco_datasets:
      • Command line:
        • $ bin/solr create -c ecco_datasets
  • Helpful tutorial: http://yonik.com/solr-tutorial/

General Solr Notes

  • Solr can be queried using REST APIs, curl, POSTMAN, and native clients for many languages. The pipeline uses Python’s requests package to perform REST API calls. The Solr Admin UI is useful for taking a look at the state of the database.

  • Solr Admin UI:

    • In a browser, navigate to http://localhost:8983/solr/#/

    • Select the core you created (ecco_datasets) from the Core Selector

    • Selecting a Core provides some database options, the most useful being “Documents” and “Query”.

      • The Query tab provides a query builder interface:

        • The q box value “*:*” denotes that the query should return all documents.
        • The fq box value “type_s:dataset” denotes that the query should return only documents whose “type_s” field equals “dataset”.
        • The rows entry is changed from the default of 10 to the max of 30000, to avoid handling pagination
        • More information about building queries can be found here: http://www.solrtutorial.com/solr-query-syntax.html
      • The Documents tab provides an update interface used for deleting or updating documents

        • The interface can accommodate multiple syntaxes. For example, to delete all documents from the AVHRR dataset using XML:

          • Set the Document Type to either XML or Solr Command
          • <delete><query>dataset_s:*AVHRR*</query></delete>

          and using JSON:

          • Set the Document Type to either JSON or Solr Command
          • {"delete": {"query": "dataset_s:*AVHRR*"}}
        • To delete all documents (use with care!) using XML:

          • Set the Document Type to either XML or Solr Command
          • <delete><query>*:*</query></delete>

          and using JSON:

          • Set the Document Type to either JSON or Solr Command
          • {"delete": {"query": "*:*"}}
  • Solr uses dynamic field typing. A field name ends in a shorthand suffix indicating its type. For example, field1_s is of type string, while field2_i is of type int. Some field types can hold multiple values; for example, field3_ss effectively holds a list of strings.

Solr Querying and Updating in the Code

  • The pipeline utilizes two functions (solr_query, solr_update) to perform REST API calls.

  • solr_query:

    • The generalized query function is passed a filter query list and returns a list of all documents that satisfy those filters (see the sketch below).

      • Ex: fq = [type_s:dataset, dataset_s:sst_AVHRR]
  • solr_update:

    • The generalized update function is passed an update_body list of dictionaries, where each dictionary is a document in Solr. If no id is given in a dictionary, Solr creates one, adding a new entry. If a dictionary contains an existing document id, that document in Solr is modified.

    • If the optional argument r is set to True, the function returns the request status of the post to Solr.
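
A minimal sketch of what these two helpers might look like, using Python’s requests package against a local Solr instance. The function names and behavior follow the description above; the exact URLs, defaults, and error handling in the pipeline’s code may differ.

```python
import requests

SOLR_HOST = "http://localhost:8983/solr/"
SOLR_COLLECTION = "ecco_datasets"


def solr_query(fq, rows=30000):
    """Return all documents that satisfy the given filter query list."""
    url = f"{SOLR_HOST}{SOLR_COLLECTION}/select"
    response = requests.get(url, params={"q": "*:*", "fq": fq, "rows": rows})
    return response.json()["response"]["docs"]


def solr_update(update_body, r=False):
    """Post a list of document dictionaries to Solr.

    Documents without an id are added as new entries; documents whose id
    matches an existing entry modify that entry.
    """
    url = f"{SOLR_HOST}{SOLR_COLLECTION}/update?commit=true"
    response = requests.post(url, json=update_body)
    if r:
        return response


# Example: fetch every harvested granule entry for one dataset.
docs = solr_query(["type_s:harvested", "dataset_s:sst_AVHRR_OI-NCEI-L4-GLOB-v2"])
```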

Grids

Grid Processing

  • Overview:

    • Grid processing is done in grids_to_solr.py.

    • Scans the ‘grids’ directory for .nc files and uses only those listed in grids_config.yaml’s grids_to_use list, if it contains anything (otherwise it defaults to all grids in the directory). grids_to_use can be useful when developing, debugging, or running small batch jobs.

    • Checks Solr to see if the grid is already present and up to date by comparing the checksum of the grid file with the checksum stored on Solr for that grid entry (this check is sketched after the Notes below).

      • If the grid file has a different checksum value, the new grid checksum is stored in the Solr entry and all transformation and aggregation entries for that grid are deleted from Solr, ensuring they are redone downstream.
    • New grids are added to Solr with additional identifying information provided in the grid file (‘name’ and ‘type’).

    • Then, if the verify_grids parameter is true, the function checks Solr for the grids provided in the grids_to_use parameter. If a grid is not present on Solr, its name is added to the return list.

  • Notes:

    • Transformation and aggregation rely on grid entries in Solr, so grids_to_solr.py MUST be run prior to transformation. It only needs to be run once across the entire system, and can be run again when a new or updated grid is added to the grid directory.

    • Should grid names be provided to run_pipeline.py via the “--grids_to_use” argument, these names are passed to grids_to_solr.py, which queries Solr for each name and adds it to the return list if it is not present on Solr.

    • Every execution of run_pipeline.py calls the function check_solr_grids(), which queries Solr for all grids and runs grids_to_solr.py if none are found. This ensures that the pipeline is never run without any grids, unless the user has not included any within the grids_config.yaml configuration file.
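
A minimal sketch of the checksum check described in the Overview above, assuming an MD5 file hash and string-suffixed Solr field names; the actual hash algorithm, field names, and deletion logic in grids_to_solr.py may differ.

```python
import hashlib

import requests


def file_checksum(path, chunk_size=1024 * 1024):
    """MD5 checksum of a grid file, computed in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Look up the grid's existing Solr entry (field suffixes are assumptions).
resp = requests.get(
    "http://localhost:8983/solr/ecco_datasets/select",
    params={"q": "*:*", "fq": ["type_s:grid", "grid_name_s:ECCO_llc90_demo"], "rows": 1},
)
docs = resp.json()["response"]["docs"]

if not docs or docs[0].get("grid_checksum_s") != file_checksum("grids/ECCO_llc90_demo.nc"):
    # The grid is new or has changed: post an updated grid entry with the new
    # checksum and delete that grid's transformation/aggregation entries so
    # they are redone downstream.
    pass
```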

Grid File Requirements

  • For the grids to be properly saved, added to Solr, and used in the transformation, each grid file requires the items listed below. These requirements are used either for identification and metadata, or for producing the mapping between the datasets and the grid. A quick validation sketch follows this list.

  • Required Variables:

    • XC (‘x-coordinate’)

    • YC (‘y-coordinate’)

    • effective_grid_radius

      • Used during calculation of grid ‘factors’
  • Required Variable Attributes:

    • XC and YC:

      • _FillValue
      • standard_name
      • long_name
      • units
      • valid_range
      • axis
      • _ChunkSizes
    • effective_grid_radius:

      • _FillValue
      • _ChunkSizes
  • Required Global Attributes:

    • type (e.g. ‘llc’ or ‘latlon’)

    • name (e.g. ‘ECCO_llc90’)

  • Required Dimensions:

    • Depending on the type of grid, the following dimensions should be present in the grid file.

    • Grid type ‘llc’:

      • i (for ‘XC’)
      • j (for ‘YC’)
    • Grid type ‘latlon’:

      • lat (for ‘YC’)
      • lon (for ‘XC’)
  • If the grid does not have either of the above sets of dimensions, it needs the following:

    • XC (defined by ‘NY’ and ‘NX’)
    • YC (defined by ‘NY’ and ‘NX’)
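
As a quick sanity check of a candidate grid file against the requirements above, something like the following xarray sketch could be used before placing the file in the grids directory. It checks only the required variables, global attributes, and dimensions (not the per-variable attributes), and it is not part of the pipeline itself.

```python
import xarray as xr

REQUIRED_VARS = ["XC", "YC", "effective_grid_radius"]
REQUIRED_GLOBAL_ATTRS = ["type", "name"]
DIMS_BY_TYPE = {"llc": {"i", "j"}, "latlon": {"lat", "lon"}}


def check_grid_file(path):
    """Return a list of problems found in a grid file; empty if it looks usable."""
    ds = xr.open_dataset(path)
    problems = []

    for var in REQUIRED_VARS:
        if var not in ds:
            problems.append(f"missing variable {var}")
    for attr in REQUIRED_GLOBAL_ATTRS:
        if attr not in ds.attrs:
            problems.append(f"missing global attribute {attr}")

    # Grids without llc or latlon dimensions fall back to the NX/NY case.
    grid_type = ds.attrs.get("type")
    expected_dims = DIMS_BY_TYPE.get(grid_type, {"NX", "NY"})
    if not expected_dims.issubset(ds.dims):
        problems.append(f"grid type {grid_type!r} expects dimensions {sorted(expected_dims)}")

    return problems


print(check_grid_file("grids/ECCO_llc90_demo.nc"))
```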

Harvester (General)

  • This section outlines the general functionality of all harvesters, including what they create on Solr and their output.

  • The datasets currently supported by the pipeline come from three different hosts: PODAAC, NSIDC, and OSISAF. The files are hosted using different means, so there are separate harvesters for each host. The harvesters only differ in how they find and download the desired data granules. The metadata tracked in Solr is uniform across all harvesters.

Solr Entries

  • Harvesters create four types of entries in Solr:

    • Harvested

    • Descendants

    • Dataset

    • Field

  • Generally, a harvester has two major functions:

    • Download loop

      • The harvester loops through the data files within the date range given in a dataset’s harvester_config.yaml file. For each granule the harvester attempts to download the data file and, regardless of the success of the download, creates a harvested and a descendants Solr document.
    • Solr Metadata

      • Once the download loop is complete, the granule documents are all posted to Solr with a single REST API call.
      • The dataset entry is either updated with metadata from the current harvesting session, or created if it doesn’t exist
      • Field entries are created for each data field from the dataset
  • The “date_s” field in harvested and transformed granules is generated from the start of the data within the granule. Generally, granules contain the start time within the data filename. For datasets whose filenames do not contain the start time, the pipeline ensures that the “date_s” field still uses the start time (RDEFT4, for example, uses the end time in its filenames). The metadata used in Solr is consistent across all datasets.

Output

  • Harvested data files are generally in the netCDF format. Of the currently supported datasets, only sss_AQUARIUS_L3_SSS_SMI_MONTHLY_V5 uses a different format, bz2. The transformation step of the pipeline accounts for bz2-compressed files when opening them (one way of doing this is sketched after this list).
  • The output files are downloaded and separated by year within the respective dataset’s harvested_granules folder in the output directory.
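
A minimal sketch of opening a bz2-compressed granule with xarray, assuming the compressed payload is a netCDF file; the exact approach used in the pipeline’s transformation code may differ.

```python
import bz2
import shutil
import tempfile

import xarray as xr


def open_granule(path):
    """Open a harvested granule, decompressing .bz2 files to a temporary file first."""
    if not path.endswith(".bz2"):
        return xr.open_dataset(path)

    # Decompress to a temporary .nc file so any netCDF backend can read it.
    with bz2.open(path, "rb") as src, tempfile.NamedTemporaryFile(
        suffix=".nc", delete=False
    ) as tmp:
        shutil.copyfileobj(src, tmp)
    return xr.open_dataset(tmp.name)
```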

Harvester (PODAAC)

Configuration File

  • Notes:

    • The harvester for PODAAC hosted datasets differs significantly from the other harvesters. As a result, the configuration files require different information to access and download the data and to update Solr.

    • Harvester code requires a configuration file with dataset specific information to set up the dataset entry and field entries on Solr (a sketch of loading such a configuration follows the lists below).

      • A template is stored at the following location: ../ECCO-ACCESS/ecco-pipeline/src/harvesters/podaac_harvester/podaac_harvester_config.yaml
      • This file, should it be copied manually, must be renamed to harvester_config.yaml when placed in the dataset folder.
  • Necessary Information for PODAAC Harvesting:

    • start -- Date to harvest from (UTC)

    • end -- Date to harvest to (UTC)

    • user -- User to access PODAAC (default ‘anonymous’)

    • host -- URL of PODAAC search (general granule search URL - requires dataset ID)

    • date_regex -- UTC regex (matches date received from data server)

  • Dataset Specific Information:

    • ds_name -- Name of dataset. This name is only used for Solr querying and directory naming. Each dataset should have a unique ds_name. The current convention is {primary field}_{dataset short name} (Ex. sst_AVHRR_OI-NCEI-L4-GLOB-v2). However, if the short name is too long, it is shortened and made unique.

    • harvester_type -- “podaac”

    • podaac_id -- ID of dataset on PODAAC

    • aggregated -- Data is available already aggregated

    • data_time_scale -- Time scale of the data (‘daily’ or ‘monthly’)

    • date_format -- Format of date in data file name (‘yyyymmdd’, ‘yyyyddd’, ‘yyyymm’, and ‘yyyy_mm’)

    • fields -- List of dictionaries for each field to process in the data

      • name -- Name of data field from dataset
      • long_name -- Long name of data field
        • Will be used as netCDF title. Make sure it is general enough to be used for aggregated data (ie: no hemisphere specifics)
      • standard_name -- Standard name of data field (blank otherwise)
      • units -- Units of data field (blank otherwise)
    • original_dataset_title -- Title of dataset from provider

    • original_dataset_short_name -- Short name of dataset from provider

    • original_dataset_url -- URL of dataset from provider

    • original_dataset_reference -- Dataset reference from provider

    • original_dataset_doi -- DOI of dataset from provider

  • Solr Information:

    • solr_host_local -- Host for local Solr ('http://localhost:XXXX/solr/')

    • solr_host_aws -- Host for AWS Solr

    • solr_collection_name -- Name for collection (‘ecco_datasets’)

  • AWS Information:

    • target_bucket_name -- Name of s3 bucket
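
For reference, the configuration file is plain YAML and can be loaded with PyYAML. The sketch below shows how the dataset and field Solr entries described above might be assembled from it; the path, dynamic-field suffixes, and document layout are illustrative assumptions rather than the pipeline’s exact code.

```python
import yaml

# Hypothetical path to a dataset's harvester configuration file.
with open("datasets/sst_AVHRR_OI-NCEI-L4-GLOB-v2/harvester_config.yaml") as f:
    config = yaml.safe_load(f)

# Dataset-level entry built from the configuration (field suffixes assumed).
dataset_doc = {
    "type_s": "dataset",
    "dataset_s": config["ds_name"],
    "short_name_s": config["original_dataset_short_name"],
    "data_time_scale_s": config["data_time_scale"],
    "date_format_s": config["date_format"],
}

# One field-level entry per field listed in the configuration.
field_docs = [
    {
        "type_s": "field",
        "dataset_s": config["ds_name"],
        "name_s": field["name"],
        "long_name_s": field["long_name"],
        "standard_name_s": field["standard_name"],
        "units_s": field["units"],
    }
    for field in config["fields"]
]
```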

Harvester Code

  • Overview:

    • Harvesting of datafiles is done in podaac_harvester.py

    • podaac_harvester_local.py is a wrapper for running locally, while podaac_harvester_aws.py is a wrapper for running on AWS (although that functionality is not yet complete)

    • podaac_harvester.py is the code that collects datafiles from PODAAC, creates the dataset entry and field entries on Solr, and adds all harvested files to their own Solr entry.

    • Currently only the PODAAC harvester has support for pre-aggregated datasets.

      • The code finds and downloads the pre-aggregated set, filters out the dates being requested in the configuration file, and creates separate daily or monthly netCDF files. Only these non-aggregated files are posted to Solr.

Harvester (nsidc_FTP)

Configuration File

  • Notes:

    • This harvester collects sea ice datasets from an NSIDC FTP server. The datasets are stored by hemisphere and hosted on an FTP server with the information given in the configuration file.

    • Harvester code requires a configuration file with dataset specific information to set up the dataset entry and field entries on Solr.

      • A template is stored at the following location: ../ECCO-ACCESS/ecco-pipeline/src/harvesters/nsidc_harvester/nsidc_harvester_config.yaml
      • This file, should it be copied manually, must be renamed to harvester_config.yaml when placed in the dataset folder.
  • Necessary Information for FTP Harvesting:

    • start -- Date to harvest from (UTC)

    • end -- Date to harvest to (UTC)

    • user -- User to access the FTP server (default ‘anonymous’)

    • host -- URL of FTP server

    • regex -- Regex matching time for data files

    • date_regex -- UTC regex (matches date received from data server)

  • Dataset Specific Information:

    • ds_name -- Name of dataset. This name is only used for Solr querying and directory naming. Each dataset should have a unique ds_name. The current convention is {primary field}_{dataset short name} (Ex. sst_AVHRR_OI-NCEI-L4-GLOB-v2). However, if the short name is too long, it is shortened and made unique.

    • harvester_type -- “nsidc_ftp”

    • ddir -- Dataset directory in FTP server

    • aggregated -- Data is available already aggregated

    • data_time_scale -- Time scale of the data (‘daily’ or ‘monthly’)

    • date_format -- Format of date in data file name (‘yyyymmdd’, ‘yyyyddd’, ‘yyyymm’, and ‘yyyy_mm’)

    • regions -- List of regions data is split into (typically ‘north’ and ‘south’ hemispheres)

    • fields -- List of dictionaries for each field to process in the data

      • name -- Name of data field from dataset
      • long_name -- Long name of data field
        • Will be used as netCDF title. Make sure it is general enough to be used for aggregated data (ie: no hemisphere specifics)
      • standard_name -- Standard name of data field (blank otherwise)
      • units -- Units of data field (blank otherwise)
    • original_dataset_title -- Title of dataset from provider

    • original_dataset_short_name -- Short name of dataset from provider

    • original_dataset_url -- URL of dataset from provider

    • original_dataset_reference -- Dataset reference from provider

    • original_dataset_doi -- DOI of dataset from provider

  • Solr Information:

    • solr_host_local -- Host for local Solr (‘http://localhost:XXXX/solr/’)

    • solr_host_aws -- Host for AWS Solr

    • solr_collection_name -- Name for collection (‘ecco_datasets’)

  • AWS Information:

    • target_bucket_name -- Name of s3 bucket

Harvester Code

  • Overview:

    • Harvesting of datafiles is done in nsidc_ftp_harvester.py

    • nsidc_ftp_harvester_local.py is a wrapper for running locally.

    • nsidc_ftp_harvester.py is the code that collects datafiles from the FTP server, creates the dataset entry and field entries on Solr, and adds all harvested files to their own Solr entry.

  • Notes:

    • This harvester assumes that the files are stored separately by hemisphere. An example file path can be seen below.

    • Ex: pub/DATASETS/NOAA/G02202_V3/north/daily/2019/...

Harvester (osisaf_FTP)

Configuration File

  • Notes:

    • This harvester collects sea ice datasets from an OSISAF FTP server. The datasets are stored by hemisphere and hosted on an FTP server with the information given in the configuration file.

    • Harvester code requires a configuration file with dataset specific information to set up the dataset entry and field entries on Solr.

      • A template is stored at the following location: ../ECCO-ACCESS/ecco-pipeline/src/harvesters/osisaf_harvester/osisaf_harvester_config.yaml
      • This file, should it be copied manually, must be renamed to harvester_config.yaml when placed in the dataset folder.
  • Necessary Information for FTP Harvesting:

    • start -- Date to harvest from (UTC)

    • end -- Date to harvest to (UTC)

    • user -- User to access the FTP server (default ‘anonymous’)

    • host -- URL of FTP server

    • regex -- Regex matching time for data files

    • date_regex -- UTC regex (matches date received from data server)

  • Dataset Specific Information:

    • ds_name -- Name of dataset. This name is only used for Solr querying and directory naming. Each dataset should have a unique ds_name. The current convention is {primary field}_{dataset short name} (Ex. sst_AVHRR_OI-NCEI-L4-GLOB-v2). However, if the short name is too long, it is shortened and made unique.

    • harvester_type -- “osisaf_ftp”

    • ddir -- Dataset directory in FTP server

    • filename_filter -- String used to match only files that are wanted

    • aggregated -- Data is available already aggregated

    • data_time_scale -- Time scale of the data (‘daily’ or ‘monthly’)

    • date_format -- Format of date in data file name (‘yyyymmdd’, ‘yyyyddd’, ‘yyyymm’, and ‘yyyy_mm’)

    • regions -- List of regions data is split into (typically ‘north’ and ‘south’ hemispheres)

    • fields -- List of dictionaries for each field to process in the data

      • name -- Name of data field from dataset
      • long_name -- Long name of data field
        • Will be used as netCDF title. Make sure it is general enough to be used for aggregated data (ie: no hemisphere specifics)
      • standard_name -- Standard name of data field (blank otherwise)
      • units -- Units of data field (blank otherwise)
    • original_dataset_title -- Title of dataset from provider

    • original_dataset_short_name -- Short name of dataset from provider

    • original_dataset_url -- URL of dataset from provider

    • original_dataset_reference -- Dataset reference from provider

    • original_dataset_doi -- DOI of dataset from provider

  • Solr Information:

    • solr_host_local -- Host for local Solr (‘http://localhost:XXXX/solr/’)

    • solr_host_aws -- Host for AWS Solr

    • solr_collection_name -- Name for collection (‘ecco_datasets’)

  • AWS Information:

    • target_bucket_name -- Name of s3 bucket

Harvester Code

  • Overview:

    • Harvesting of datafiles is done in osisaf_ftp_harvester.py

    • osisaf_ftp_harvester_local.py is a wrapper for running locally

    • osisaf_ftp_harvester.py is the code that collects datafiles from the FTP server, creates the dataset entry and field entries on Solr, and adds all harvested files to their own Solr entry.

  • Notes:

    • This harvester assumes that both hemispherical datafiles are stored together in the same directory.

      • Ex: archive/ice/conc/1999/08/…
    • The filename_filter field is used to only download files whose name contains the filename_filter

    • The OSISAF server is located in Europe, so downloading files can take much longer than from other, closer servers.

Transformation

Configuration File

  • Notes:

    • Transformation code requires a configuration file (transformation_config.yaml) with dataset specific information to correctly transform the data to the various grids in Solr.

    • If any field within the configuration file is changed, make sure to update the version field in the same configuration file. This ensures that the transformations are re-run using the new changes.

    • Since this template supports all current datasets, it must also support hemispherical datasets. The configuration file therefore has additional fields that require hemisphere-specific information; these are listed below with the rest of the fields.

  • Dataset Specific Information:

    • ds_name -- Name of dataset. Must match the name used in the harvester for the specific dataset.

    • version -- Version of transformation code. Must be updated whenever the yaml changes!

    • array_precision -- Precision to use (‘float32’)

    • time_zone_included_with_time -- Time in dataset has a time zone

    • extra_information -- Additional information about the dataset

      • transpose -- Transpose data prior to transformation
      • time_bounds_var -- Data has a Time_bounds variable
      • time_var -- Data has a Time variable
      • no_time -- Data does not have a time value
      • no_time_dashes -- Data time does not have dashes
    • data_res -- Resolution of data (decimal or fractional)

    • data_max_lat -- Maximum latitude of data

      • data_max_lat_nh and data_max_lat_sh -- Hemisphere specific fields
    • area_extent -- Data extent ([‘lon min’, ‘lat min’, ‘lon max’, ‘lat max’])

      • area_extent_nh and area_extent_sh -- Hemisphere specific fields
    • dims -- Dimensions of data ([‘longitude steps’, ‘latitude steps’])

      • dims_nh and dims_sh -- Hemisphere specific fields
    • proj_info -- Projection dictionary for data set

      • area_id -- ID of area (‘longlat’)
      • area_name -- Name of projection (‘Plate Carree’)
      • proj_id -- ID of projection (‘EPSG:4326’)
      • proj4_args -- Arguments for projection (proj, ellps, datum, etc.)
      • There is also proj_info_nh and proj_info_sh for the projections for the northern and southern hemispheres (defaulted to EPSG values of 3411 and 3412 respectively)
    • pre_transformation_steps -- List of function names to run on the whole dataset prior to transformation. These functions should be located in specific_functions.py in ecco_cloud_utils. These functions take a Dataset argument and return a Dataset.

    • post_transformation_steps -- List of function names to run on each data array after transformation. These functions should be located in specific_functions.py in ecco_cloud_utils. These functions take a DataArray and field_name argument and return a DataArray.

  • Solr Information:

    • solr_host_local -- Host for local Solr (‘http://localhost:XXXX/solr/’)

    • solr_host_aws -- Host for AWS Solr

    • solr_collection_name -- Name for collection (‘ecco_datasets’)

  • AWS Information:

    • source_bucket -- Source s3 bucket for AWS

    • target_bucket -- Name of s3 bucket for outputs

    • aws_output_suffix -- Output suffix for AWS files (.nc)

Transformation Code

  • Overview:

    • Transforming of datafiles is done in grid_transformation.py

    • grid_transformation_local.py is a wrapper for running locally and grid_transformation_aws.py is a wrapper for running in the cloud. In accordance with cloud limitations, transformation is performed on one file at a time regardless of how the pipeline is run (either locally or on AWS). The AWS wrapper detects a new harvested granule in a bucket and passes it to grid_transformation.py. The local implementation must first determine which transformations must be performed on which harvested granules: grid_transformation_local.py contains a function that queries Solr for the missing grid/field transformation combinations and iterates through them, passing each to grid_transformation.py.

    • grid_transformation.py also clears outdated transformations from Solr and locally if the argument “--wipe_transformations” is provided to run_pipeline.py.

      • This is done by comparing the current configuration file’s version to the version stored on Solr for the dataset’s transformations. Any transformation with a different version is removed from Solr and the local file is deleted.
    • grid_transformation.py is the code that does the transformation from the data to each grid in Solr.

      • For each grid (and hemisphere, if applicable), create the factors between the dataset and the grid, save them, and reuse them in the future instead of computing them again.
      • Do any pre-transformation steps specified in pre_transformation_steps in the configuration file.
      • Then for each field transform the data to the current grid.
      • Create a list of all transformed data fields.
      • Do any post-transformation steps specified in post_transformation_steps in the configuration file.
      • Convert the data array to a dataset.
        • Add all necessary metadata attributes.
        • Create the time_bnds coordinate with the time_start and time_end values.
      • Then save each transformed data field.
  • Product:

    • Following the transformation step, all transformed files are added to Solr and saved to the respective dataset’s transformed_products folder in the output directory. Each transformed granule is separated by grid, and by field.

    • A transformed file is saved for every field, grid and granule combination that was transformed.

      • Each file is given a name containing the field and grid name, along with the original granule’s filename.
    • The saved file contains the following metadata attributes:

      • The global metadata attributes are interpolated_grid, model_grid_type, original_dataset_title, original_dataset_short_name, original_dataset_url, original_dataset_reference, original_dataset_doi, interpolated_grid_id, and transformation_version.
      • The data variable specific attributes are long_name, standard_name, units, original_filename, original_field_name, interpolation_parameters, interpolation_code, interpolation_date, notes, original_time, original_time_start, original_time_end, valid_min, and valid_max.
    • A new coordinate, time_bnds, is also added at the end of the transformation step. time_bnds is an array containing the time_start and time_end of the current time step, and it is set as the bound for the time variable (a sketch of adding it follows below).
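
A minimal xarray sketch of adding such a time_bnds coordinate to a single-time-step transformed dataset; the variable and dimension names used by the pipeline may differ.

```python
import numpy as np
import xarray as xr


def add_time_bounds(ds, time_start, time_end):
    """Attach a time_bnds coordinate holding the record's start and end times.

    Assumes ds has a time dimension of length 1 (a single transformed granule).
    """
    bounds = np.array([[time_start, time_end]], dtype="datetime64[ns]")
    ds = ds.assign_coords(time_bnds=(("time", "nv"), bounds))
    ds["time"].attrs["bounds"] = "time_bnds"
    return ds


# Example for a daily record:
# ds = add_time_bounds(ds, np.datetime64("2019-01-01"), np.datetime64("2019-01-02"))
```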

Transformation Multiprocessing

  • Overview:

    • Through run_pipeline.py, the transformation portion of the pipeline can be run via multiprocessing. The argument “--multiprocesses” for run_pipeline.py allows the user to provide a number of CPUs that will be used during the transformation step. Then, during the transformations each available CPU is tasked with completing the transformations required for a single granule. This includes every grid and field combination for that granule.

    • This speeds up the transformation step, which is necessary when transforming a large number of granules of varying sizes.

  • Code:

    • Assuming the user has requested multiprocessing to be used and given a valid number of CPUs, the following code is run during the transformation step.

    • First, a single granule (or two if the data is hemispherical) is selected from Solr and transformed in order to ensure that the factors are created and up to date. This avoids multiple factors calculations due to race conditions.

    • Following the factors creation and updating, the remaining transformations are completed via multiprocessing (see the sketch below).
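
A minimal sketch of this dispatch pattern using Python’s multiprocessing.Pool, with a hypothetical transform_granule function standing in for the per-granule transformation work; the real logic in run_pipeline.py and grid_transformation_local.py is more involved.

```python
from multiprocessing import Pool


def transform_granule(granule_path):
    """Hypothetical: run every outstanding grid/field transformation for one granule."""
    # Placeholder for the per-granule work done by grid_transformation.py.
    pass


def run_transformations(granule_paths, user_cpus):
    if not granule_paths:
        return

    # Transform one granule up front so the grid factors are created and
    # saved once, avoiding race conditions between worker processes.
    transform_granule(granule_paths[0])

    # Hand the remaining granules to a pool of worker processes.
    with Pool(processes=user_cpus) as pool:
        pool.map(transform_granule, granule_paths[1:])
```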

Aggregation

Configuration File

  • Notes:

    • Aggregation code requires a configuration file (aggregation_config.yaml) with dataset specific information to correctly aggregate the transformed data into yearly data files.

    • Aggregations are directly tied to transformations. The transformation step maintains a Solr list of the years that have been transformed for a certain grid so that aggregation is only performed on the years that need it. The exception to this is if the version field in the aggregation configuration file changes. This ensures that all aggregations (not just for the years that have been transformed) are re-run using the new changes.

  • Dataset Specific Information:

    • ds_name -- Name of dataset. Must match the name used in the harvester for the specific dataset.

    • version -- Version of aggregation code. Must be updated whenever the yaml changes!

    • array_precision -- Precision to use (‘float32’)

    • remove_nan_days_from_data -- Removes empty days from the data

    • do_monthly_aggregation -- Aggregate data into monthly files

    • skipna_in_mean -- Controls 'skipna' option in np.mean() when taking monthly mean [True or False]

    • monthly_tolerance -- Defines the range of days to look for data around the start of the month (exclusive as opposed to inclusive). Only used for monthly datasets.

    • save_binary -- Save binary aggregated datafiles

    • save_netcdf -- Save netcdf aggregated datafiles

  • Solr Information:

    • solr_host_local -- Host for local Solr (‘http://localhost:XXXX/solr/’)

    • solr_host_aws -- Host for AWS Solr

    • solr_collection_name -- Name for collection (‘ecco_datasets’)

  • AWS Information:

    • source_bucket -- Name of the source s3 bucket

    • target_bucket -- Name of s3 bucket for outputs

    • s3_aggregated_path -- Output directory in AWS s3 target_bucket

Aggregation Code

  • Overview:

    • Aggregation of transformed data files is done in aggregation.py

    • aggregation_local.py is a wrapper for running locally and aggregation_aws.py is a wrapper for running on AWS.

    • aggregation.py is the code that does the aggregation of the transformed data files tracked in Solr.

      • For each grid, open the specified grid file to be used when making an empty record.
        • Then for each year create a list of all dates in the year with the format specified from the harvester.
          • Then for each field, collect all transformed data files from Solr day by day; if one does not exist for a day, make an empty record, and append each to a list.
          • Merge the list of transformed data files.
          • Then aggregate the data monthly, if specified, and save the resulting files (the monthly aggregation is sketched after this list).
          • A JSON file containing the Solr metadata relevant to that annual aggregation is saved in ../{output_directory}/{dataset_name}/transformed_products/{grid}/aggregated/{field}/
  • Notes:

    • When aggregating hemispherical datasets, the code opens both files for the same date, grid and field combination, and combines them into a single dataset. It then uses that for aggregations and progresses through the same code as all other datasets.
  • Product:

    • Following the aggregation step, all aggregated files are added to Solr and saved to the respective dataset’s transformed_products folder in the output directory. Each aggregated file is separated by grid, and by field.

    • An aggregated file is saved for every field, grid and year that was transformed.

      • Each file is given a name containing the field and grid name, along with the original granule’s filename, the year of the file and the time format of the file (DAILY or MONTHLY).
    • The saved file contains the following metadata attributes:

      • The global metadata attributes are interpolated_grid, model_grid_type, original_dataset_title, original_dataset_short_name, original_dataset_url, original_dataset_reference, original_dataset_doi, interpolated_grid_id, and transformation_version.
      • The data variable specific attributes are long_name, standard_name, units, original_filename, original_field_name, interpolation_parameters, interpolation_code, interpolation_date, notes, original_time, original_time_start, original_time_end, valid_min, and valid_max.
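
A minimal sketch of the optional monthly aggregation, assuming the transformed daily files for one grid/field/year have already been merged into a single dataset with a datetime time coordinate; the real aggregation.py also builds empty records for missing days, writes binary output, and updates Solr as described above.

```python
import xarray as xr


def aggregate_year(daily_ds, do_monthly_aggregation=True, skipna_in_mean=True):
    """Return the yearly (daily-resolution) dataset and, optionally, monthly means."""
    outputs = {"DAILY": daily_ds}
    if do_monthly_aggregation:
        # Resample into month-start bins and average, honoring the skipna setting.
        outputs["MONTHLY"] = daily_ds.resample(time="1MS").mean(skipna=skipna_in_mean)
    return outputs


# Example (hypothetical paths):
# daily = xr.open_mfdataset("transformed_products/ECCO_llc90_demo/SST/2019/*.nc", combine="by_coords")
# results = aggregate_year(daily)
# results["MONTHLY"].to_netcdf("SST_ECCO_llc90_demo_MONTHLY_2019.nc")
```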

Tools

create_directories.py

  • This support script helps to create all the necessary directories and add the code files for a new dataset.

    • This requires input from the user for the ds_name ({main variable}_{long name}), and the type of harvester (PODAAC, NSIDC, or OSISAF).
  • The script automatically creates the dataset directory, which contains the three configuration files (one for each step of the pipeline).

  • This script does not create the harvested_granules and output directories since those are automatically created when running the pipeline.

run_pipeline.py

  • This script supports many arguments that change how the pipeline is run. See the Pipeline Guide wiki for more information on the arguments and a detailed explanation of run_pipeline.py.

  • The user is prompted for additional input depending on the arguments provided.

  • Following that, or if no additional input is necessary, the user is prompted to select one of the following options:

      1. Run all
        • Runs the entire pipeline for all datasets present
      2. Harvesters only
        • Runs only the harvesters for all datasets present
      3. Up to aggregation
        • Runs the harvesters and transformations for all datasets present
      4. Dataset input
        • Prompts the user for a dataset name and pipeline steps to run for the provided dataset
      5. Y/N for datasets
        • For each dataset present, prompt the user to input Y/N to run the entire pipeline for the given dataset.
  • Once the execution of the option has been completed, a log is printed outlining each dataset and whether or not the step(s) run were successful, and execution of run_pipeline.py ends.

How to Setup the Pipeline

  • See the Pipeline Guide wiki for more detailed information about setting up and running the pipeline.

Steps

  • Install and setup Solr

    • See section “Local Solr Implementation” for information on how to do this.

    • If your Solr database already has any entries present, make sure to clear Solr using the delete-all query outlined in the “General Solr Notes” section (the query “*:*”).

  • Verify that the following packages are installed and updated:

    • pyresample (v1.15.0)
    • numpy (v1.18.1)
    • pyyaml (v5.3.1)
    • netcdf4 (v1.5.3)
    • python-dateutil (v2.8.1)
    • requests (v2.23.0)
    • xarray (v0.16.2)
    • python (v3.7.7)
  • Clone the ECCO-ACCESS repository.

  • Place grid files in ../ECCO-ACCESS/ecco-pipeline/grids_to_solr/grids/

    • Verify that the grids_to_use field in the configuration file ../ECCO-ACCESS/ecco-pipeline/grids_to_solr/grids_config.yaml contains the grids that you want to use for the transformations. Leave the list blank if you want to use all the grids present in the grids folder. There is also a command line argument that can be used to specify which grids to use (“--grids_to_use”, see Pipeline Guide wiki).
  • Go through the harvester configuration files and update the date ranges to whatever is wanted. These configuration files are located in their respective dataset’s directory: ../ECCO-ACCESS/ecco-pipeline/datasets/{dataset}/harvester_config.yaml

    • All of the harvesters currently have a small (a few days/months) date range already set.
  • Run ../ECCO-ACCESS/ecco-pipeline/src/tools/run_pipeline.py with any wanted arguments.

Adding Support for a New Dataset

  • The following outlines the steps for adding support for a new dataset

    • Add dataset automatically:

      • Can be done via run_pipeline.py or create_directories.py
    • Add dataset manually:

      • Create a new directory with ds_name in ../ECCO-ACCESS/ecco-pipeline/datasets/
      • Copy harvester, transformation, and aggregation config yamls from their respective folders in ../ECCO-ACCESS/ecco-pipeline/src/ to the newly created ds_name directory
    • Update fields within the three configuration files in ../ECCO-ACCESS/ecco-pipeline/datasets/{ds_name}/

      • Make sure the ds_name field is unified across all three configuration files

List of Current Grids

  • 2x2deg_demo

    • 2 degree x 2 degree lat-lon grid
  • ECCO_llc90_demo

    • 1 degree x 1 degree ECCO llc grid
  • ECCO_llc270_demo

    • ⅓ degree x ⅓ degree ECCO llc grid
  • TPOSE

    • Tropical Pacific only grid
  • polar_stereo_n_25km_demo

    • 25 km x 25 km polar stereographic grid

List of Supported Datasets