A dataset document contains the metadata for an ODC geo-data resource.
A dataset document is a YAML or JSON document that conforms to the EO3 Dataset JSON Schema at:
https://github.com/opendatacube/eo3/blob/develop/eo3/schema/dataset.schema.yaml
The top level of an EO3 dataset document contains the following elements, discussed in detail below:
- $schema (required)
- id (required)
- label (not required, but strongly recommended)
- product (required)
- location (optional)
- locations (optional)
- crs (required)
- geometry (optional)
- grids (required)
- properties (required)
- measurements (required)
- accessories (optional)
- lineage (optional)
An EO3 dataset document must contain a $schema element, which is a string that must be exactly https://schemas.opendatacube.org/dataset.
An EO3 dataset must contain an id element, which is a string containing a valid UUID in standard
hexadecimal format (e.g. 550e8400-e29b-41d4-a716-446655440000). Ids should
be generated using an appropriate algorithm and should be globally unique.
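As an illustration, a random (version 4) UUID is one reasonable choice of algorithm. The sketch below uses Python's standard uuid module to generate an id and to check that a string is written in the hyphenated 8-4-4-4-12 form shown above. The helper names new_dataset_id and is_canonical_uuid are hypothetical and are not part of any ODC library.

```python
import uuid


def new_dataset_id() -> str:
    """Return a random (version 4) UUID in hyphenated hexadecimal form."""
    return str(uuid.uuid4())


def is_canonical_uuid(value: str) -> bool:
    """Check that value is a UUID written in hyphenated 8-4-4-4-12 hex form.

    Assumption: upper-case hex digits are tolerated; the specification
    text does not say either way.
    """
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_canonical_uuid("550e8400-e29b-41d4-a716-446655440000"))
print(is_canonical_uuid(new_dataset_id()))
```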
Label is not currently required by the schema, but is required by EO3 (see SPECIFICATION-odc-type.md).
The label is intended to contain a user-readable unique identifier for the dataset, however uniqueness is not enforced.
Label can contain only alphanumeric characters, underscores and dashes.
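The character rule for labels can be checked with a short regular expression. This is a minimal sketch, assuming "alphanumeric" means ASCII letters and digits and that an empty label is not allowed; neither point is stated by the text. The name is_valid_label and the sample label are hypothetical.

```python
import re

# Assumption: ASCII letters and digits only, plus underscore and dash.
LABEL_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_label(label: str) -> bool:
    """Return True if label uses only the permitted characters."""
    return LABEL_PATTERN.fullmatch(label) is not None


print(is_valid_label("example_product-2020-01-01"))
print(is_valid_label("bad label!"))
```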
The product section is required and identifies the ODC product document associated with the dataset.
The product section must contain a name entry, and may contain a href entry, as described below.
Name is a string that identifies the product associated with the dataset document (as recorded in the "name" field of the product document). Name can only contain alphanumeric characters, underscores and hyphens and is required in all dataset documents.
The product section may also contain a href entry, which is a valid URL pointing to a copy of the product document.
Href is optional, but is recommended, particularly when working with multiple ODC indexes that may contain slightly differing versions of a given product document.
Location and/or locations are intended to store the root URI of the measurement datafile(s) in cases where the root URI is not equal to the URI of the dataset metadata file (which is the default assumption).
In datacube-1.8.x:
The dataset schema (in eodatasets) allowed for either a single string in the location field or an
array of strings in the locations field. This avoids defining the datatype for the location/locations
field(s) as a union, and theoretically allows the contents of an ODC database to be losslessly serialised to a
dataset metadata file. In the database, locations are stored in a separate table to datasets, and there
can be multiple locations per dataset. By default, the most recently added location is used, but it
is sometimes possible to specify a secondary location by giving the preferred URI scheme,
e.g. s3:// vs https:// vs file:// (see datacube.storage._base.BaseInfo).
Datacube-core itself, however, ignored the locations field and allowed the location field to be either
a string or an array of strings. (If location is an array of strings, only the first string in
the array is actually read; the remainder are ignored.) If set, the location field is stripped from
the metadata file before storing it in the ODC index. It is used only as part of the initial indexing
process.
From datacube-1.9.0:
The schema for the locations field will be updated to match the behaviour described above (i.e. a
union datatype). The location field is no longer supported and will raise an error.
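The datacube-1.8.x handling of the location field described above can be sketched as follows. This mimics the described behaviour only; it is not the datacube-core implementation, and the function name pop_location and the example URIs are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def pop_location(doc: Dict[str, Any]) -> Tuple[Dict[str, Any], Optional[str]]:
    """Split a dataset document into (document without location, location URI).

    If location is a list, only its first entry is used and the rest
    are ignored, as described for datacube-core 1.8.x.
    """
    stripped = {key: value for key, value in doc.items() if key != "location"}
    location = doc.get("location")
    if isinstance(location, (list, tuple)):
        location = location[0] if location else None
    return stripped, location


doc = {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "location": ["s3://example-bucket/ds/", "https://example.com/ds/"],
}
stored, uri = pop_location(doc)
print(uri)
print("location" in stored)
```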
The CRS for the georegistration of the spatial data is required. It may be an EPSG code (e.g. "EPSG:4326") or may be expressed in WKT format.
Geometry contains a 2D geometry (of Polygon or MultiPolygon type) such that all valid data points in all the grids (described below) fall inside the geometry, and all data points in all the grids that fall outside the geometry are invalid.
The geometry is used when performing dataset searches. If there is valid data in the dataset outside of the specified geometry, the dataset may not be returned when explicitly searching for that data.
The geometry may be approximate. The geometry is optional. If omitted, the index driver assumes in searches that all data within the grids is valid.
The format expected is equivalent to a GeoJSON geometry primitive, e.g.:
geometry:
  type: Polygon
  coordinates: [
    [
      [35.0, 10.0], [45.0, 45.0], [15.0, 40.0], [10.0, 20.0], [35.0, 10.0]
    ]
  ]
Coordinates are always in xy (lon, lat) order and are assumed to be expressed in the crs specified above.
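Because the format follows GeoJSON, each polygon ring must be closed (first and last positions equal) and contain at least four positions. The sketch below checks this for the two geometry types the text allows. It is a structural check only, the function name rings_are_closed is hypothetical, and it does not test whether the geometry actually covers the valid data.

```python
from typing import Any, Dict


def rings_are_closed(geometry: Dict[str, Any]) -> bool:
    """Check that every ring of a Polygon or MultiPolygon is a closed ring."""
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        raise ValueError("geometry must be a Polygon or MultiPolygon")
    return all(
        len(ring) >= 4 and ring[0] == ring[-1]
        for polygon in polygons
        for ring in polygon
    )


geometry = {
    "type": "Polygon",
    "coordinates": [
        [[35.0, 10.0], [45.0, 45.0], [15.0, 40.0], [10.0, 20.0], [35.0, 10.0]]
    ],
}
print(rings_are_closed(geometry))
```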
The grids section is required for EO3 datasets. It contains at least one grid definition named "default"
and may contain additional alternate grid definitions. Each grid definition is equivalent to an
odc-geo GeoBox for the entire dataset. Each measurement in the dataset must have a grid, but multiple
measurements can share the same grid.
A grid definition represents the native geobox for the whole dataset for at least one measurement belonging to the dataset.
Each grid definition must have a shape and a transform and may have a crs. shape is an array of two integers,
and represents the width and height of the grid in pixels. transform is an array of either 6 or
9 floating point numbers and represents an affine transform for converting pixel coordinates to
coordinates in the specified crs, or the dataset crs described above if no grid-specific CRS is provided.
The CRS may be a valid EPSG code or WKT string.
If the 9-number form is used for the transform, the last three numbers must be [0, 0, 1].
Note: Grid-specific CRSes are new in ODC 1.9.x. In ODC 1.8.x, grids cannot define their own CRS and always use the dataset CRS. This extension has been added for STAC interoperability.
E.g.
grids:
  default:
    # "default" grid for most measurement bands is 7941x7901 pixels
    shape: [7941, 7901]
    # 9 number form - note that last three elements are [0, 0, 1]
    transform: [30.0, 0.0, 557385.0, 0.0, -30.0, -4030485.0, 0.0, 0.0, 1.0]
  panchromatic:
    # "panchromatic" grid for the panchromatic measurement band
    # This grid has higher resolution over the same area than default: 15881x15801 pixels
    shape: [15881, 15801]
    # 6 number form - final [0, 0, 1] elements are automatically appended.
    transform: [15.0, 0.0, 557392.5, 0.0, -15.0, -4030492.5]
  custom_crs:
    # This grid uses a different CRS
    crs: epsg:32756
    shape: [2267, 1567]
    transform: [50.0, 0.0, 257975.0, 0.0, -50.0, 6290325.0]
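The relationship between the 6-number and 9-number transform forms, and how a transform maps a pixel position to CRS coordinates, can be sketched in a few lines. This assumes the numbers are the rows of a 3x3 affine matrix in row-major order (a, b, c, d, e, f, 0, 0, 1), which is consistent with the required trailing [0, 0, 1]. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def normalise_transform(transform: Sequence[float]) -> List[float]:
    """Return the 9-number form, appending [0, 0, 1] to a 6-number transform."""
    values = [float(v) for v in transform]
    if len(values) == 6:
        return values + [0.0, 0.0, 1.0]
    if len(values) == 9 and values[6:] == [0.0, 0.0, 1.0]:
        return values
    raise ValueError("transform must have 6 numbers, or 9 ending in [0, 0, 1]")


def pixel_to_crs(transform: Sequence[float], col: float, row: float) -> Tuple[float, float]:
    """Apply the affine transform to a (col, row) pixel position.

    Assumption: row-major ordering, so x = a*col + b*row + c
    and y = d*col + e*row + f.
    """
    a, b, c, d, e, f = normalise_transform(transform)[:6]
    return (a * col + b * row + c, d * col + e * row + f)


default = [30.0, 0.0, 557385.0, 0.0, -30.0, -4030485.0, 0.0, 0.0, 1.0]
print(pixel_to_crs(default, 0, 0))  # origin of the grid: (557385.0, -4030485.0)
```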
The properties section contains arbitrary user-specified metadata.
Previously this data could be nested arbitrarily, but in EO3 it is required to be flat with colon separated namespaces for virtual nesting, as described in the odc-type specification document.
Please refer to the default eo3 metadata type definition for common field locations. Compatibility with STAC metadata is recommended, and may be more strongly enforced in future.
In particular, the acquisition time (or coverage time for derived products) should be stored
as either a range defined by dtr:start_datetime and dtr:end_datetime, or as a single time value
at datetime.
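Because properties is flat, the time fields are read as ordinary keys whose names happen to contain colons. The sketch below returns a (start, end) pair, preferring the dtr: range and falling back to datetime. The fallback order is an assumption made for illustration, the helper name time_range is hypothetical, and values are returned unparsed.

```python
from typing import Any, Dict, Tuple


def time_range(properties: Dict[str, Any]) -> Tuple[Any, Any]:
    """Return (start, end) from a flat EO3 properties mapping."""
    start = properties.get("dtr:start_datetime")
    end = properties.get("dtr:end_datetime")
    if start is not None and end is not None:
        return start, end
    # Assumption: a single datetime stands for an instantaneous range.
    instant = properties["datetime"]
    return instant, instant


print(time_range({"datetime": "2020-01-01T00:00:00Z"}))
```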
The measurements section describes the measurements (or bands) associated with the dataset.
The measurements section is required for EO3 compatibility. The measurements map canonical measurement names to measurement definitions. All measurements defined by the dataset's product must be included. Additional measurements not defined in the product have historically been supported, but this may be deprecated or removed in a future release.
Measurement names can contain alphanumeric characters and underscores only. Measurement definitions can contain the following elements:
Path is the only element of a measurement definition that is always required. It contains the path to the datafile, evaluated relative to the location of the dataset.
measurements:
  red:
    path: data/red.tif
  green:
    path: data/green.tif
Previously, if a NetCDF datafile contained multiple time-slices or measurements,
the part number could be specified as part of the path. This is a zero-based index (in contrast to
the 1-based convention used by rasterio). E.g.:
measurements:
  red:
    path: data/file.nc#part=0
This usage is considered ambiguous and potentially confusing and will be deprecated and
removed in future releases. Instead, use the band and layer entries discussed below.
To specify multiple bands/time-slices in a single file, the optional band and layer entries can be used.
Band is a band or part number using a rasterio-style 1-based index.
Layer is a (string type) band or layer name.
Band and layer can be used together or separately. The normal use cases are a band number for a GeoTIFF and either or both for a NetCDF, depending on the structure of the file. E.g.
# Time slice from NetCDF file - first time-slice in file, number 1 and two named layers as measurements
measurements:
  red:
    path: data/file.nc
    band: 1
    layer: red
  green:
    path: data/file.nc
    band: 1
    layer: green
# Bands in a GeoTIFF file
measurements:
  red:
    path: data/file.tif
    band: 1
  green:
    path: data/file.tif
    band: 2
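For documents that still use the #part= form in the path, a migration to an explicit band entry might look like the sketch below. It assumes that zero-based part N corresponds to one-based band N + 1, which follows from the two indexing conventions described above but is not stated outright. The function name part_to_band is hypothetical.

```python
from typing import Any, Dict


def part_to_band(measurement: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a legacy 'path#part=N' measurement to use a band entry."""
    path, separator, fragment = measurement["path"].partition("#part=")
    if not separator:
        # No legacy part number present: return an unchanged copy.
        return dict(measurement)
    updated = dict(measurement, path=path)
    # Assumption: zero-based part N is one-based band N + 1.
    updated["band"] = int(fragment) + 1
    return updated


print(part_to_band({"path": "data/file.nc#part=0"}))
```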
The grid element identifies the grid (from the grids section described above) to use
for this measurement band. It is optional and defaults to "default".
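Resolving a measurement's grid is then a dictionary lookup with a default. The following is a minimal sketch over a dataset document already loaded into a Python dict; the function name grid_for_measurement and the "pan" measurement with its path are hypothetical.

```python
from typing import Any, Dict


def grid_for_measurement(doc: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Return the grid definition used by the named measurement."""
    measurement = doc["measurements"][name]
    grid_name = measurement.get("grid", "default")
    return doc["grids"][grid_name]


doc = {
    "grids": {
        "default": {
            "shape": [7941, 7901],
            "transform": [30.0, 0.0, 557385.0, 0.0, -30.0, -4030485.0],
        },
        "panchromatic": {
            "shape": [15881, 15801],
            "transform": [15.0, 0.0, 557392.5, 0.0, -15.0, -4030492.5],
        },
    },
    "measurements": {
        "red": {"path": "data/red.tif"},
        "pan": {"path": "data/pan.tif", "grid": "panchromatic"},
    },
}
print(grid_for_measurement(doc, "red")["shape"])
print(grid_for_measurement(doc, "pan")["shape"])
```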
Accessories is an optional section for describing accessory and ancillary files packaged with the data and metadata (e.g. thumbnail images, checksums, metadata in alternate formats, etc).
Accessories is an object mapping accessory file names to a relative file path and an optional type. Accessory names consist of alphanumeric characters and underscores, with colons to allow for a nested hierarchy of colon-separated namespaces.
path contains a path to the accessory file, relative to the location of the dataset. The interpretation of the optional type field is not specified. E.g.
accessories:
  metadata:stac:
    path: this-dataset.stac-info.json
    type: STAC v1.0.0(proj,view)
  checksum:sha1:
    path: checksums/this-dataset.sha1
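The naming rule for accessories can be expressed as a regular expression over colon-separated segments. This sketch assumes ASCII alphanumerics and that empty segments (a leading, trailing or doubled colon) are not allowed, which the text does not state. The function name is_valid_accessory_name is hypothetical.

```python
import re

# One or more segments of letters, digits and underscores, joined by colons.
ACCESSORY_NAME = re.compile(r"[A-Za-z0-9_]+(?::[A-Za-z0-9_]+)*")


def is_valid_accessory_name(name: str) -> bool:
    """Return True if name is a colon-separated sequence of valid segments."""
    return ACCESSORY_NAME.fullmatch(name) is not None


for name in ("metadata:stac", "checksum:sha1", "bad-name", ":stac"):
    print(name, is_valid_accessory_name(name))
```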
Lineage is an optional section for listing other datasets that were used in the calculation of this dataset. Older ODC metadata formats supported embedding of complete metadata documents of parent datasets, however this is now deprecated in favour of just including the ids of ancestor datasets.
The lineage section maps labels describing types or classes of source datasets to lists of source dataset ids.
The legacy postgres index driver rewrote lineage sections to flatten the dependency list prior to storage. E.g. a dataset with a lineage section specifying 4 source dataset ids of type 'ard' would be rewritten to have four source types ('ard1', 'ard2', 'ard3' and 'ard4'), each with a single source dataset id. This flattening is not performed by the new postgis index driver.
E.g.
lineage:
  ard:
    - c90f820b-7aa5-492d-a12b-ba8d47a16a90
    - 90267ce3-41e0-480c-8cc1-4418a1ebc314
    - 07c0a669-b2de-4437-aa90-43a86da9525e
    - d5c99c8e-7ce1-4627-bb4d-4a1abbfebc1a
The new postgis index driver stores this lineage information exactly as provided. The legacy postgres index driver will rewrite this section for storage as follows:
lineage:
  ard1:
    - c90f820b-7aa5-492d-a12b-ba8d47a16a90
  ard2:
    - 90267ce3-41e0-480c-8cc1-4418a1ebc314
  ard3:
    - 07c0a669-b2de-4437-aa90-43a86da9525e
  ard4:
    - d5c99c8e-7ce1-4627-bb4d-4a1abbfebc1a
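The flattening shown above can be reproduced with a short function. This is a sketch of the described rewrite, not the legacy driver's code; in particular it numbers every source, including a label with a single id, which is an assumption because the text only describes the multi-id case. The function name flatten_lineage is hypothetical.

```python
from typing import Dict, List


def flatten_lineage(lineage: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Give each source dataset id its own numbered label (ard1, ard2, ...)."""
    flat: Dict[str, List[str]] = {}
    for label, dataset_ids in lineage.items():
        for index, dataset_id in enumerate(dataset_ids, start=1):
            flat[f"{label}{index}"] = [dataset_id]
    return flat


lineage = {
    "ard": [
        "c90f820b-7aa5-492d-a12b-ba8d47a16a90",
        "90267ce3-41e0-480c-8cc1-4418a1ebc314",
        "07c0a669-b2de-4437-aa90-43a86da9525e",
        "d5c99c8e-7ce1-4627-bb4d-4a1abbfebc1a",
    ]
}
print(sorted(flatten_lineage(lineage)))
```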
A future extension may allow specifying more than one level of lineage (e.g. grandparent datasets as well as parent datasets).
The following elements are traditionally not included in dataset metadata documents external
to the datacube (i.e. transmitted over a network protocol or stored on a network or local file
system). Instead, they are generated by the Open Data Cube on indexing and injected into dataset
metadata documents for internal storage in the ODC index, and internal use within the ODC. They
are generated by the prep_eo3 method defined in datacube.index.eo3.
These elements are not documented in the schema, and so will fail validation.
In the "postgres" index driver (the default index driver in datacube-1.8), the extent section was not expected to be in the source dataset metadata document, but was generated from the grids section at index time and injected into the dataset document before being stored in the database. It was never included in the dataset document schema in the eodatasets repository.
The extent section contains the maximum and minimum latitude and longitude values for the dataset in the EPSG:4326 CRS in the following format:
extent:
  lat:
    begin: -21.789474556891378
    end: -20.788940834502526
  lon:
    begin: 133.0656386483482
    end: 134.13328670106225
The extent section was historically used by the "postgres" driver to perform spatial search queries - leading to inefficient and/or broken search behaviour for datasets that lie around the poles and the anti-meridian, even if the dataset has a native CRS that performs well in those regions.
It is not used by the new "postgis" index driver (which uses instead the grids and geometry sections described above).
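To see why a minimum/maximum extent misbehaves at the anti-meridian, consider the sketch below. It builds an extent from footprint points that are assumed to be already expressed as (lon, lat) in EPSG:4326; reprojection from the native CRS is out of scope here. The function name extent_from_lonlat and the coordinates are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def extent_from_lonlat(points: Iterable[Tuple[float, float]]) -> Dict[str, Dict[str, float]]:
    """Build an extent section from (lon, lat) points in EPSG:4326."""
    lons, lats = zip(*points)
    return {
        "lat": {"begin": min(lats), "end": max(lats)},
        "lon": {"begin": min(lons), "end": max(lons)},
    }


# A footprint one degree wide that straddles the anti-meridian.
crossing = extent_from_lonlat(
    [(179.5, -17.0), (-179.5, -17.0), (-179.5, -16.0), (179.5, -16.0)]
)
# The longitude range comes out as -179.5 to 179.5, i.e. almost the
# whole globe rather than the one-degree strip actually covered.
print(crossing["lon"])
```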
grid_spatial is calculated from the "default" grid in the grids section described above.
grid_spatial contains a projection section, which in turn contains the following elements:
spatial_reference is a CRS expressed in a supported format (EPSG code or WKT). It is simply copied from
the crs entry described above.
geo_ref_points contains the coordinates of the four corners of the GeoBox defined by the default
grid spec. These are stored as "x" and "y" coordinate values for the upper-left (ul), lower-right (lr),
etc. points. Note that "x" and "y" are used even if the CRS defines alternative names for
its axes (e.g. they do not become "latitude" and "longitude" when spatial_reference is EPSG:4326).
valid_data is a GeoJSON geometry primitive (with coordinates in the CRS from spatial_reference). If
the geometry section (described above) is provided, it is a copy of that. If no geometry section was
provided, a four-sided polygon is generated from the geo_ref_points.
E.g.
grid_spatial:
  projection:
    spatial_reference: epsg:32753
    geo_ref_points:
      ll:
        x: 300000.0
        y: 7590220.0
      lr:
        x: 409800.0
        y: 7590220.0
      ul:
        x: 300000.0
        y: 7700020.0
      ur:
        x: 409800.0
        y: 7700020.0
    valid_data:
      type: Polygon
      coordinates: [
        [
          [300000.0, 7700020.0],
          [409800.0, 7700020.0],
          [409800.0, 7590220.0],
          [300000.0, 7590220.0],
          [300000.0, 7700020.0]
        ]
      ]
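The fallback case, where valid_data is generated from geo_ref_points because no geometry section was provided, can be sketched as follows. The vertex order (ul, ur, lr, ll, then ul again to close the ring) is taken from the example above and is otherwise an assumption; the function name valid_data_from_corners is hypothetical and this is not the prep_eo3 implementation.

```python
from typing import Any, Dict


def valid_data_from_corners(geo_ref_points: Dict[str, Dict[str, float]]) -> Dict[str, Any]:
    """Build a four-sided GeoJSON Polygon from the corner points."""
    ring = [
        [geo_ref_points[corner]["x"], geo_ref_points[corner]["y"]]
        for corner in ("ul", "ur", "lr", "ll", "ul")  # closed ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


corners = {
    "ll": {"x": 300000.0, "y": 7590220.0},
    "lr": {"x": 409800.0, "y": 7590220.0},
    "ul": {"x": 300000.0, "y": 7700020.0},
    "ur": {"x": 409800.0, "y": 7700020.0},
}
print(valid_data_from_corners(corners))
```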