Supporting various circulation models and Parcels' internal representation of data #2003

@VeckoTheGecko

Description

This issue is a mix of discussion on the internal representation of data and the tasks required for supporting reading data from various circulation models.

Internal representation of data

We've decided to work internally with Xarray datasets. An important matter to discuss now is what these datasets actually look like internally[^1] (i.e., dimension ordering, how depth is defined, ...). (1) Do we work with the datasets in a form close to the model output (i.e., close to the raw NetCDF files)? Or (2) do we work with datasets that match a certain internal representation defined by us?

@erikvansebille I recall from our group meetings that you were leaning towards (1), saying something along the lines of "allowing users to write interpolator methods while using the knowledge about their model" (please comment below if I don't fully understand your viewpoint or am missing something).

I am personally leaning towards (2) for the following reasons:

  • It's safer
    • At the point of initialisation we can throw an informative error message if the dataset doesn't match our internal representation. This is easier to debug than runtime errors from failed index searching.
  • It makes our code simpler and more performant
    • All our indexing methods can make assumptions according to our internal model (e.g., that the depth array is increasing)
    • We need to do less checking before we do something
  • Makes testing easier, as from FieldSet all the way through we only need to test things against our internal data model (e.g., no need to test an edge case introduced only when model X runs pset.execute())
  • It's all lazy anyway
    • Any dataset transformation done to get to this "internal state" would be done lazily via Dask
  • I feel that we can define an internal representation within Parcels that encompasses all the models we want to support. Admittedly, I'm not 100% sure whether my feeling here is well informed - I'm not intimately familiar with the model outputs we work with.
  • We already make assumptions about internal representation by assuming that the data is ordered [tdim, zdim, ydim, xdim]
  • Allows interpolation methods to be portable between Fields (and allows us to ship them in Parcels)

A downside of (2) is (as you point out) that those writing interpolation methods need to understand how data is handled internally in Parcels, as opposed to in their model. That is something we would need good documentation for, to support those writing interpolation methods.
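To make the "safer" argument for option (2) concrete, here is a minimal, hypothetical sketch of the kind of early validation we could run at Field initialisation. All names (`EXPECTED_DIMS`, `validate_internal`) are illustrative, not existing Parcels API; the assumed internal representation is `[tdim, zdim, ydim, xdim]` with increasing depth, as discussed above.

```python
import numpy as np
import xarray as xr

# Assumed internal dimension ordering [tdim, zdim, ydim, xdim].
EXPECTED_DIMS = ("time", "depth", "lat", "lon")


def validate_internal(ds: xr.Dataset) -> None:
    """Raise an informative error at init time if `ds` does not match
    the assumed internal representation (hypothetical helper)."""
    for name, var in ds.data_vars.items():
        if var.dims != EXPECTED_DIMS:
            raise ValueError(
                f"Variable {name!r} has dims {var.dims}; expected "
                f"{EXPECTED_DIMS}. Coerce the dataset (e.g. transpose) "
                "before constructing a Field."
            )
    if "depth" in ds.coords and ds["depth"].ndim == 1:
        depth = ds["depth"].values
        if not np.all(np.diff(depth) >= 0):
            raise ValueError("Internal representation assumes increasing depth.")


# A dataset already in the internal layout passes silently...
ds = xr.Dataset(
    {"U": (EXPECTED_DIMS, np.zeros((2, 3, 4, 5)))},
    coords={"depth": ("depth", [0.0, 10.0, 50.0])},
)
validate_internal(ds)

# ...while a model-native layout fails fast with a clear message,
# instead of a runtime error from failed index searching.
bad = ds.transpose("lon", "time", "depth", "lat")
try:
    validate_internal(bad)
except ValueError as err:
    print("caught:", err)
```

The point is that the error surfaces at initialisation, where the user can still act on it, rather than deep inside `pset.execute()`.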

Tasks required for supporting reading data from various circulation models

Sub-issues:

  • Example datasets for different (ocean) circulation models #2004
  • (2) Decide on an internal representation of data within parcels (or decide on how we are going to define search methods in a flexible way to work with different datasets)
    • [DOC] Document this internal representation in the v4 docs (useful for those writing interpolators etc.)
    • [^2] Define helper functions (in module parcels._datasets.(un)structured.coerce) to transform datasets from the original representation (i.e., circulation_model.py) to the Parcels internal representation. These coercion functions can either serve as documentation for users (to see which transformations they need to apply to their data to bring it in line with Parcels), or be used in public convenience methods such as FieldSet.from_...() which handle this automatically.
  • (3) What do indices correspond to?
    • Is it the f-points or the t-points?
    • How are VectorField indices handled?
    • EDIT: We have since decided that indices are with respect to the f-points.
  • (4) Define further helper functions
    • e.g., Field.from_cf_compliant(ds), which takes in a dataset with CF-compliant metadata and does all the transformations needed to bring it in line with Parcels' internal assumptions
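As a sketch of what such a coercion helper could look like, here is a hypothetical `coerce_to_internal` in the spirit of the proposed `parcels._datasets.(un)structured.coerce` module: it takes a dataset in a model-native layout and returns one matching the assumed internal representation (`[tdim, zdim, ydim, xdim]`, increasing depth). The function name, dimension names, and the internal conventions are assumptions for illustration; when the dataset is Dask-backed, both steps are lazy.

```python
import numpy as np
import xarray as xr


def coerce_to_internal(ds: xr.Dataset) -> xr.Dataset:
    """Coerce a model-native dataset to the assumed internal
    representation (hypothetical helper, not existing Parcels API)."""
    # Reorder dimensions to the internal [time, depth, lat, lon] ordering.
    ds = ds.transpose("time", "depth", "lat", "lon", missing_dims="ignore")
    # Flip the depth axis if the model stores depth decreasing.
    if "depth" in ds.coords and ds["depth"].ndim == 1:
        depth = ds["depth"].values
        if depth.size > 1 and depth[0] > depth[-1]:
            ds = ds.isel(depth=slice(None, None, -1))
    return ds


# Example: model output with lon first and depth stored deepest-first.
native = xr.Dataset(
    {"V": (("lon", "lat", "depth", "time"), np.zeros((5, 4, 3, 2)))},
    coords={"depth": ("depth", [50.0, 10.0, 0.0])},
)
internal = coerce_to_internal(native)
print(internal["V"].dims)        # ('time', 'depth', 'lat', 'lon')
print(internal["depth"].values)  # increasing depth
```

Used inside a convenience constructor such as `FieldSet.from_...()`, this kind of helper would double as executable documentation of the transformations users need to apply themselves when bringing their own data.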

Footnotes

[^1]: By "internally" here I mean at the point where the dataset is passed to the Field initialiser and stored on the data attribute (or similarly, passed to the Grid initialiser). This is the structure that the rest of Parcels can safely assume. From a user POV, they don't necessarily need to do the data transformations themselves; this can be bridged using classmethods.

Metadata

Labels: discussion, topic/input-data (Issues about hydrodynamical data), tracking (Issues used as a way of tracking other issues), v4

Status: Backlog