ODC EP 008
The existing data model and API for lineage/source data is constrained by decisions made a long time ago to meet requirements that no longer exist. The current API is a significant barrier to efficient reimplementation of key operational bottlenecks in the datacube index layer.
Several elements of this EP have been flagged previously in:
- ODC-EP03 Replace the ODC Index and Internal Database API
- ODC-EP06 Extract Geometry utilities into a Separate Package
- Overhaul of index driver layer
- ODCv2 Road Map
This Enhancement Proposal outlines a new data model and API for dataset lineage and a migration path within the context of the ODCv2 road map.
An implementation of this proposal can be found in PR#1401 - refer to it for further detail.
Paul Haesler (@SpacemanPaul)
- Under Discussion
- In Progress
- Completed
- Rejected
- Deferred
This proposal has been implemented and merged into the develop-1.9 branch. It will be released with v1.9.0.
Issues with the current implementation of lineage/sources include:
- A lineage relationship between two datasets can only be recorded in an index if both datasets already exist in the index. This unnecessarily complicates indexing, and prevents the recording of derivation from datasets stored in another index (external lineage).
- A unique index on (source_dataset, classifier), requiring arbitrary multiplication of classifiers (e.g. ard1, ard2, etc for geomedian) - (in general requiring a rewriting of the source eo3 document!)
- The "source_field" search API greatly complicates the search API and is rarely used.
- Lineage trees are only handled in the API as fully populated trees of Dataset objects, which is not compatible with external lineage.
- Index drivers can declare whether they support the old lineage API or the new "external lineage" API, with the old API being deprecated, then dropped in v1.9 and v2.0 respectively.
- Decouple source and destination id columns from dataset table - allow lineage of external datasets to be tracked by id.
- Ability to optionally associate external ids with a named external index.
- New index resource API for saving, updating, removing and retrieving lineage trees (dataset ids only).
- Internal API to convert between lineage trees and a flattened, indexed representation suitable for database representation and enforcing lineage consistency across the database.
- Updates to Dataset model to support lineage trees.
- Simple API that works in both the sourcewards and derivedwards directions.
- Drop support for old lineage-related API and CLI methods/options.
- Backwards-compatible extensions to the Doc2Dataset interface to support the new lineage model.
- Add new "supports" flag to AbstractIndexDriver: supports_external_lineage. Defaults to False.
  - supports_external_lineage=False means the index driver is fully compatible with existing APIs, and does not support new API features proposed herein.
  - supports_external_lineage=True means the index driver supports the new API features proposed herein, and is therefore not fully compatible with legacy API.
v1.9.0 will introduce the new Lineage API. Index drivers may optionally support external lineage. Drivers that don't support external lineage will support the new API with additional error conditions for unknown lineage as much as possible.
From v1.9.x (exact point to be determined) the old API will become deprecated.
In v2.0.x all index drivers must support external lineage, and the legacy API and data model will no longer be supported.
Lineage relations are stored as a many-to-many network, similar to how lineage is tracked now, except that the source and derived datasets are not required to exist in the dataset table.
Other simpler representations are possible (e.g. just storing LineageTrees as JSON blobs; the LineageTree class is documented below), which would allow much faster reads and writes. This approach might be fine for the in-memory index driver, but I think the postgis driver needs to be able to enforce lineage consistency across a whole index. Users who don't care about lineage consistency probably don't care about lineage at all.
Column | Description | Type | Null? | Unique Indexes and other comments |
---|---|---|---|---|
derived_id | Derived Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with source_id |
source_id | Source Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with derived_id |
classifier | Lineage Type | String | N | no unique indexes |
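The constraints in the table above can be tried out with a small in-memory SQLite sketch. The table name `dataset_lineage` and the use of TEXT columns for UUIDs are assumptions made for this illustration; the real postgis schema may differ.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical table name; columns follow the table above.
conn.execute(
    """
    CREATE TABLE dataset_lineage (
        derived_id TEXT NOT NULL,   -- no foreign key to a dataset table
        source_id  TEXT NOT NULL,   -- no foreign key to a dataset table
        classifier TEXT NOT NULL,   -- no unique index on the classifier
        UNIQUE (derived_id, source_id)
    )
    """
)

derived = str(uuid.uuid4())
sources = [str(uuid.uuid4()) for _ in range(3)]

# Three sources share the classifier "ard"; none of the ids is known to any dataset table.
conn.executemany(
    "INSERT INTO dataset_lineage VALUES (?, ?, ?)",
    [(derived, src, "ard") for src in sources],
)

# A second relation between the same pair of ids violates the unique constraint.
try:
    conn.execute(
        "INSERT INTO dataset_lineage VALUES (?, ?, ?)",
        (derived, sources[0], "other"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM dataset_lineage").fetchone()[0]
```

Because the classifier is not part of any unique index, a geomedian-style product no longer needs invented classifiers such as ard1, ard2 to record several inputs of the same kind.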
Column | Description | Type | Null? | Unique Indexes and other comments |
---|---|---|---|---|
dataset_id | Dataset ID | UUID | N | no enforced referential integrity to dataset table; Primary Key |
home | An ODC index | Char/Text | N |
This table records an optional text value that may be associated with particular datasets referenced by the lineage table that may be external. The home field is provided to record an identifier for the database/index that the dataset is known to reside in. It is not interpreted by the API, but could be used to contain e.g. an index name from a shared config file, a database connection string, or a uri.
It is not required that an external database be registered with a home in this table, and datasets that do exist in the index may also be registered in this table. The value and significance of home is entirely user-defined.
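The home semantics described above can be sketched with a small in-memory registry. The class name `InMemoryHomes` is hypothetical; the method names mirror the lineage resource API proposed later in this document. The `DSID` alias is assumed here to mean "UUID or its string form". Raising when a home would change without `allow_updates` is one possible reading of that parameter, not confirmed driver behaviour.

```python
from typing import Dict, Mapping, Optional, Union
from uuid import UUID, uuid4

DSID = Union[str, UUID]  # assumed for this sketch: a UUID or its string form


def _uuid(id_: DSID) -> UUID:
    return id_ if isinstance(id_, UUID) else UUID(id_)


class InMemoryHomes:
    """Hypothetical registry: dataset id -> free-form home string."""

    def __init__(self) -> None:
        self._homes: Dict[UUID, str] = {}

    def set_home(self, home: str, *args: DSID, allow_updates: bool = False) -> int:
        ids = {_uuid(a) for a in args}
        changes = [i for i in ids if self._homes.get(i) != home]
        clashes = [i for i in changes if i in self._homes]
        if clashes and not allow_updates:
            # Sketch choice: refuse the whole call rather than skip silently.
            raise ValueError(f"{len(clashes)} id(s) already have a different home")
        for i in changes:
            self._homes[i] = home
        return len(changes)

    def clear_home(self, *args: DSID, home: Optional[str] = None) -> int:
        if args and home is not None:
            raise ValueError("Supply home or args - not both")
        if home is not None:
            doomed = [i for i, h in self._homes.items() if h == home]
        else:
            doomed = [i for i in {_uuid(a) for a in args} if i in self._homes]
        for i in doomed:
            del self._homes[i]
        return len(doomed)

    def get_homes(self, *args: DSID) -> Mapping[UUID, str]:
        ids = (_uuid(a) for a in args)
        return {i: self._homes[i] for i in ids if i in self._homes}


homes = InMemoryHomes()
a, b, c = uuid4(), uuid4(), uuid4()
n_set = homes.set_home("upstream-index", a, str(b))  # ids may be UUIDs or strings
found = homes.get_homes(a, b, c)                     # c was never given a home
n_cleared = homes.clear_home(home="upstream-index")  # clear by home value
```

Ids without a recorded home are simply absent from the returned mapping, which matches the statement that registering a home is optional.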
In the current/legacy API, lineage information is always represented as a nesting of complete datasets under the sources property of the root Dataset object.
In the proposed API, lineage information is represented by a LineageTree:
datacube.model.LineageTree:
from dataclasses import dataclass
from enum import Enum
from uuid import UUID
from typing import Mapping, Optional, Sequence


class LineageDirection(Enum):
    SOURCES = 1  # Tree shows all source datasets of the root node.
    DERIVED = 2  # Tree shows all derived datasets of the root node.


@dataclass
class LineageTree:
    direction: LineageDirection  # Whether this is a node in a source tree or a derived tree
    dataset_id: UUID  # The dataset id associated with this node
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    # An optional mapping to sequences of lineage nodes of the same direction as this node.
    # The keys of the mapping are classifier strings.
    # children=None means that there may be children in the database.
    # children={} means there are no children in the database.
    # children represent source datasets or derived datasets depending on the direction.
    # child nodes must have the same `direction` as the parent node.
    home: Optional[str] = None  # The home index associated with this node's dataset
A LineageTree may represent the sources of the root dataset (and the sources' sources, and the sources' sources' sources, etc.) OR datasets derived from the root dataset (and datasets derived from datasets derived from the root dataset, etc.), but not both at once.
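As an illustration, a two-level source tree might be built as follows. The dataclass is repeated so the snippet runs on its own. The `depth` helper, the classifier names and the home string are hypothetical and not part of the proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional, Sequence
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:  # same fields as the class defined above
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


def depth(tree: LineageTree) -> int:
    """Hypothetical helper: number of populated levels below this node."""
    kids = [c for seq in (tree.children or {}).values() for c in seq]
    return 1 + max(depth(c) for c in kids) if kids else 0


S = LineageDirection.SOURCES
# children={} : known to have no sources recorded.
level1 = LineageTree(S, uuid4(), children={}, home="upstream-index")
ard_a = LineageTree(S, uuid4(), children={"level1": [level1]})
# children=None : sources may exist but were not loaded.
ard_b = LineageTree(S, uuid4())
# Two sources share the classifier "ard".
geomedian = LineageTree(S, uuid4(), children={"ard": [ard_a, ard_b]})

tree_depth = depth(geomedian)
```

Every node carries the same direction; a derived tree for the same dataset would be a separate LineageTree with direction DERIVED.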
The LineageTree class will have a class method for reading an EO3 format source mapping into a (shallow) LineageTree.
The LineageTree class will have a sub-tree search method for finding subtrees.
The LineageTree class will have methods for serialising and deserialising LineageTrees to a YAML/JSON compatible mapping.
Diamond relations are supported as follows: only one node in a given LineageTree with a particular dataset ID can have its children field populated; all others should have children set to None or {}.
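A minimal sketch of checking the diamond rule, assuming "populated" means a non-empty children mapping. The helper names `populated_nodes` and `diamonds_ok` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional, Sequence
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:  # same fields as the class defined above
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


def populated_nodes(tree: LineageTree, counts: Optional[Counter] = None) -> Counter:
    """Count, per dataset id, how many nodes have a non-empty children mapping."""
    counts = Counter() if counts is None else counts
    if tree.children:  # None and {} both count as "not populated"
        counts[tree.dataset_id] += 1
        for kids in tree.children.values():
            for kid in kids:
                populated_nodes(kid, counts)
    return counts


def diamonds_ok(tree: LineageTree) -> bool:
    return all(n <= 1 for n in populated_nodes(tree).values())


S = LineageDirection.SOURCES
a, b, c, d, z = (uuid4() for _ in range(5))
a_full = LineageTree(S, a, {"raw": [LineageTree(S, z)]})  # A with its sources expanded
a_stub = LineageTree(S, a)                                 # A again, children=None

# D is derived from B and C, which are both derived from A (a diamond).
good = LineageTree(S, d, {"input": [
    LineageTree(S, b, {"base": [a_full]}),
    LineageTree(S, c, {"base": [a_stub]}),
]})
bad = LineageTree(S, d, {"input": [
    LineageTree(S, b, {"base": [a_full]}),
    LineageTree(S, c, {"base": [a_full]}),  # A expanded twice
]})
```

Here `good` expands A once and refers to it by id the second time, while `bad` repeats the expansion and so breaks the rule.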
Optional new properties to be added to the Dataset model and its constructor:
source_tree: Optional[LineageTree]=None # Assumed to be of "source" direction
derived_tree: Optional[LineageTree]=None # Assumed to be of "derived" direction
v1.9.0: The existing optional sources property will not be populated by an index driver that supports_external_lineage, and source_tree and derived_tree will not be populated by an index driver that does not.
v1.9.x: The sources property becomes deprecated.
v2.0.0: The sources property will be removed.
The datacube.model.lineage.LineageRelations class will be provided to support converting back and forth, and validating consistency, between flattened dataset relations (as stored in the index under the proposed database representation above) and LineageTrees (as presented to end-users in the public API).
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional, Set, Tuple
from uuid import UUID

# Assumed alias: a pair of dataset ids identifying one lineage relation.
LineageIDPair = Tuple[UUID, UUID]


class InconsistentLineageException(Exception):
    """
    Exception to raise on detecting inconsistent lineage.
    """


@dataclass
class LineageRelation:
    classifier: str
    source_id: UUID
    derived_id: UUID


class LineageRelations:
    """
    An indexed collection of LineageRelations.
    For converting between iterables of LineageRelations and LineageTrees.
    Enforces all lineage chains are acyclic.
    """
    def __init__(self,
                 tree: Optional[LineageTree] = None,
                 max_depth: int = 0,
                 relations: Optional[Iterable[LineageRelation]] = None,
                 homes: Optional[Mapping[UUID, str]] = None,
                 clone: Optional["LineageRelations"] = None) -> None:
        """
        All arguments are optional. Default gives an empty LineageRelations, and:
            rels = LineageRelations(tree, max_depth=max_depth, relations=lrels, clone=clone)
        is equivalent to:
            rels = LineageRelations()
            rels.merge_tree(tree, max_depth=max_depth)
            rels.merge(clone)
            for rel in lrels:
                rels.merge_new_lineage_relation(rel)
        :param tree: Initially merge a LineageTree
        :param max_depth: The maximum depth to read the LineageTree.
                          Default/0: no limit. Not used if tree is None.
        :param clone: Initially clone this other LineageRelations object
        """

    def merge_new_home(self, id_: UUID, home: str) -> None:
        """
        Merge a new home relation
        Raises InconsistentLineageException if we already have this id with a different home
        :param id_: The dataset id
        :param home: The home string
        """

    def _merge_new_relation(self, ids: Tuple[UUID, UUID], classifier: str) -> None:
        """
        Internal convenience wrapper to merge_new_lineage_relation
        """

    def merge_new_lineage_relation(self, rel: LineageRelation) -> None:
        """
        Merge a new LineageRelation object
        Raises InconsistentLineageException if we already have this relation with a different classifier, or
        this relation would result in a cyclic relation.
        """

    def merge(self, pool: "LineageRelations") -> None:
        """
        Merge in another LineageRelations collection, ensuring it is consistent with this one.
        :param pool: The other LineageRelations object
        """

    def merge_tree(self, tree: LineageTree,
                   nodes: Optional[Mapping[UUID, LineageTree]] = None,
                   max_depth: int = 0) -> None:
        """
        Merge in a LineageTree, ensuring it is consistent with the collection so far.
        Raises InconsistentLineageException if tree contains cyclic dependencies or inconsistent direction
        :param tree: The LineageTree to merge
        :param nodes: Used to track recursive traversal - should be None on first call
        :param max_depth: The depth to traverse the tree to. default/zero = unlimited
        """

    def relations_diff(self,
                       existing_relations: Optional["LineageRelations"] = None,
                       allow_updates: bool = False) -> Tuple[Mapping[LineageIDPair, str],
                                                             Mapping[LineageIDPair, str],
                                                             Mapping[UUID, str],
                                                             Mapping[UUID, str]]:
        """
        Compare to another LineageRelations object, returning records to be added to or updated in
        the other LineageRelations collection to consistently merge this collection into it.
        Intended to be used by index drivers when adding lineage data to an index.
        Raises InconsistentLineageException if updates are required and allow_updates is False, or if
        merging the two LineageRelations would result in cyclic dependencies.
        :param existing_relations: The relations currently in an index.
        :param allow_updates: Whether updates to existing records are allowed.
        :return: Tuple containing:
                 Relations that need to be added to existing_relations to merge with this collection.
                 Relations that need to be updated in existing_relations to merge with this collection.
                 Homes that need to be added to existing_relations to merge with this collection.
                 Homes that need to be updated in existing_relations to merge with this collection.
        """

    def extract_tree(self,
                     root: UUID,
                     direction: LineageDirection = LineageDirection.SOURCES,
                     parents: Optional[Set[UUID]] = None,
                     so_far: Optional[Set[UUID]] = None,
                     ) -> LineageTree:
        """
        Extract a LineageTree from this LineageRelations collection.
        Used to detect cyclic dependencies.
        :param root: The dataset id at the root of the extracted LineageTree
        :param direction: The direction of the extracted tree
        :param parents: Used to detect cyclic dependencies in recursive mode
                        - should be None on initial call.
        :param so_far: Used to detect duplication from diamond dependencies in recursive mode
                       - should be None on initial call.
        :return: the extracted LineageTree.
        """
Constructor argument | Current (v1.8.x) | Proposed (v2.0.x) |
---|---|---|
index | The ODC Index that newly constructed Dataset models are intended to be saved into. | No change. |
products | List of product names (existing in index) to consider for matching. Default None meaning consider all products in index. | No change. |
exclude_products | List of product names (existing in index) to exclude from matching. Default None meaning no explicit exclusions. | No change. |
fail_on_missing_lineage | Fail if any datasets referenced in lineage do not exist in index. Default False. | Only False supported. |
verify_lineage | Check that nested lineage documents match versions already in database, and fail if they don't. Default True. Ignored for EO3 documents. | Ignored (as all documents are EO3). |
skip_lineage | Strip out and ignore all lineage information. Overrides fail_on_missing_lineage and verify_lineage if set. Default False. | No change. |
eo3 | Pre-process EO3 documents: auto/True/False. Default auto. | All documents are EO3, so False not supported and auto==True. |
home_index | Proposed new argument. | Optional string. If provided and implementation supports the foreign dataset home table, all lineage dataset ids will be recorded as belonging to this home index. |
The callable signature of a Doc2Dataset object will have a new source_tree: Optional[LineageTree] = None keyword argument added. If passed (and the Doc2Dataset object was created with an index that supports external lineage, and skip_lineage was not set), then the passed in source_tree is used as the source_tree of the Dataset object and the lineage recorded in metadata and home_index (if there is any) are ignored.
The result of calling a Doc2Dataset object is DatasetOrError which is defined as:
DatasetOrError = Union[
Tuple[Dataset, None],
Tuple[None, Union[str, Exception]]
]
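A caller typically unpacks the pair and branches on which half is None. The sketch below shows that pattern with stand-ins: the `Dataset` class and `fake_resolver` function here are hypothetical placeholders and are not the datacube classes.

```python
from typing import Callable, Iterable, List, Tuple, Union


class Dataset:
    """Stand-in for datacube's Dataset model, for illustration only."""

    def __init__(self, id_: str) -> None:
        self.id = id_


DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]],
]


def fake_resolver(doc: dict) -> DatasetOrError:
    """Hypothetical resolver with the same return shape as a Doc2Dataset call."""
    if "id" not in doc:
        return None, "document has no id"
    return Dataset(doc["id"]), None


def resolve_all(
    docs: Iterable[dict], resolver: Callable[[dict], DatasetOrError]
) -> Tuple[List[str], List[str]]:
    resolved, failures = [], []
    for doc in docs:
        ds, err = resolver(doc)
        if ds is None:
            failures.append(str(err))  # err may be a string or an Exception
        else:
            resolved.append(ds.id)
    return resolved, failures


resolved, failures = resolve_all([{"id": "abc"}, {"label": "no id here"}], fake_resolver)
```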
Currently, lineage information is packed into the Dataset object as nested Dataset objects in the sources property.
For index drivers supporting the new data model described above, and exclusively in ODCv2, lineage information will instead be packed into the Dataset object as a source LineageTree in the source_tree property, as discussed above.
Retrieve a Dataset from the index. If include_sources is True then the full source lineage information is packaged in the returned Dataset.
Current/Legacy behaviour: Source lineage information returned as nested Dataset objects in the sources field of the root Dataset, always fully recursive.
Proposed/v2 behaviour: Source lineage information returned as a LineageTree object in the source_tree field of the root (and only) Dataset.
Add new parameter include_deriveds=False - if true, also return derived lineage information as a LineageTree object in the derived_tree field of the Dataset.
Add new parameter max_depth=0 - limits the depth of source and/or derived lineage tree returned. (0/default = no limit)
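One way to picture the effect of max_depth is as a cut applied to a full tree. The `truncate` helper below is hypothetical. It assumes that nodes at the cut-off are returned with children=None, based on the LineageTree convention that None means "there may be children in the database"; an index driver may behave differently.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Mapping, Optional, Sequence
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:  # same fields as the class defined above
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


def truncate(tree: LineageTree, max_depth: int = 0, _level: int = 0) -> LineageTree:
    """Return a copy of tree limited to max_depth levels (0 = no limit)."""
    if not tree.children:  # None (unknown) or {} (known empty): nothing to cut
        return tree
    if max_depth and _level >= max_depth:
        return replace(tree, children=None)  # children exist but are not shown
    return replace(tree, children={
        classifier: [truncate(kid, max_depth, _level + 1) for kid in kids]
        for classifier, kids in tree.children.items()
    })


S = LineageDirection.SOURCES
leaf = LineageTree(S, uuid4(), children={})
mid = LineageTree(S, uuid4(), children={"level1": [leaf]})
root = LineageTree(S, uuid4(), children={"ard": [mid]})

shallow = truncate(root, max_depth=1)  # root plus its direct sources only
full = truncate(root)                  # default 0: no limit
```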
Current/Legacy behaviour:
# :param with_lineage:
# - ``True (default)`` attempt adding lineage datasets if missing
# - ``False`` record lineage relations, but do not attempt
# adding lineage datasets to the db
(where lineage data is assumed to be stored in the sources field of ds.)
Proposed/v2 behaviour:
- Lineage data is assumed to be stored in the source_tree and derived_tree fields of ds.
- The with_lineage argument is ignored (and is dropped altogether in v2). Always record lineage relations only (both sourcewards and derivedwards, as provided in the Dataset object).
Legacy API allows searching by metadata on source dataset.
Propose v2 drop support for the source_fields argument.
Currently: Return a list of datasets that are derived from the named dataset id.
Proposed:
- Raise NotImplementedError (and deprecation warning in legacy driver in 1.9)
Add a new Index Resource lineage (i.e. dc.index.lineage, like the existing dc.index.products and dc.index.datasets).
Abstract class definition with docstrings of API:
from abc import ABC, abstractmethod

# DSID (used below) is datacube's dataset-id type: a UUID or its string form.


class AbstractLineageResource(ABC):
    """
    Abstract base class for the Lineage portion of an index api.
    All LineageResource implementations should inherit from this base class.
    Note that this is a "new" resource only supported by new index drivers with `supports_external_lineage`
    set to True. If a driver does NOT support external lineage, it can use LegacyLineageResource below,
    which is a minimal implementation of this resource that raises a NotImplementedError for all methods.
    """
    def __init__(self, index) -> None:
        self._index = index
        # This is explicitly for indexes that support the External Lineage API.
        assert self._index.supports_external_lineage

    @abstractmethod
    def get_derived_tree(self, id: DSID, max_depth: int = 0) -> LineageTree:
        """
        Extract a LineageTree from the index, with:
        - "id" at the root of the tree.
        - "derived" direction (i.e. datasets derived from id, datasets derived from
          datasets derived from id, etc.)
        - maximum depth as requested (default 0 = unlimited depth)
        Tree may be empty (i.e. just the root node) if no lineage for id is stored.
        :param id: the id of the dataset at the root of the returned tree
        :param max_depth: Maximum recursion depth. Default/Zero = unlimited depth
        :return: A derived-direction Lineage tree with id at the root.
        """

    @abstractmethod
    def get_source_tree(self, id: DSID, max_depth: int = 0) -> LineageTree:
        """
        Extract a LineageTree from the index, with:
        - "id" at the root of the tree.
        - "source" direction (i.e. datasets id was derived from, the dataset ids THEY were derived from, etc.)
        - maximum depth as requested (default 0 = unlimited depth)
        Tree may be empty (i.e. just the root node) if no lineage for id is stored.
        :param id: the id of the dataset at the root of the returned tree
        :param max_depth: Maximum recursion depth. Default/Zero = unlimited depth
        :return: A source-direction Lineage tree with id at the root.
        """

    @abstractmethod
    def merge(self, rels: LineageRelations, allow_updates: bool = False, validate_only: bool = False) -> None:
        """
        Merge an entire LineageRelations collection into the database.
        :param rels: The LineageRelations collection to merge.
        :param allow_updates: If False and the merging rels would require index updates,
                              then raise an InconsistentLineageException.
        :param validate_only: If True, do not actually merge the LineageRelations, just check for inconsistency.
                              allow_updates and validate_only cannot both be True
        """

    @abstractmethod
    def add(self, tree: LineageTree, max_depth: int = 0, allow_updates: bool = False) -> None:
        """
        Add or update a LineageTree into the Index.
        If the provided tree is inconsistent with lineage data already
        recorded in the database, by default an InconsistentLineageException is raised.
        If allow_updates is True, the provided tree is treated as authoritative
        and the database is updated to match.
        :param tree: The LineageTree to add to the index
        :param max_depth: Maximum recursion depth. Default/Zero = unlimited depth
        :param allow_updates: If False and the tree would require index updates to fully
                              add, then raise an InconsistentLineageException.
        """

    @abstractmethod
    def remove(self, id_: DSID, direction: LineageDirection, max_depth: int = 0) -> None:
        """
        Remove lineage information from the Index.
        Removes lineage relation data only. Home values not affected.
        :param id_: The Dataset ID to start removing lineage from.
        :param direction: The direction in which to remove lineage (from id_)
        :param max_depth: The maximum depth to which to remove lineage (0/default = no limit)
        """

    @abstractmethod
    def set_home(self, home: str, *args: DSID, allow_updates: bool = False) -> int:
        """
        Set the home for one or more dataset ids.
        :param home: The home string
        :param args: One or more dataset ids
        :param allow_updates: Allow datasets with existing homes to be updated.
        :returns: The number of records affected. Between zero and len(args).
        """

    @abstractmethod
    def clear_home(self, *args: DSID, home: Optional[str] = None) -> int:
        """
        Clear the home for one or more dataset ids, or all dataset ids that currently have
        a particular home value.
        :param args: One or more dataset ids
        :param home: The home string. Supply home or args - not both.
        :returns: The number of home records deleted. Usually len(args).
        """

    @abstractmethod
    def get_homes(self, *args: DSID) -> Mapping[UUID, str]:
        """
        Obtain a dictionary mapping UUIDs to home strings for the passed in DSIDs.
        If a passed in DSID does not have a home set in the database, it will not
        be included in the returned mapping. i.e. a database index with no homes
        recorded will always return an empty mapping.
        :param args: One or more dataset ids
        :return: Mapping of dataset ids to home strings.
        """
Current bulk read/write methods are flagged as being unstable (i.e. subject to further change).
I propose adding new bulk read/write methods for lineage data. These would operate with flat records (not full LineageTrees) and would be similar in format to the existing (unstable) bulk read/write methods. Details of these new bulk methods and stabilisation of the existing read/write and cloning API methods are deferred to a future EP.
Some existing CLI commands/options will have to be updated to reflect the API changes above, and new CLI commands for handling lineage will need to be added. In particular, a CLI command to index lineage information ONLY (i.e. don't index the dataset, just extract and save the lineage info.) would be desirable.
The detailed specifications for these are deferred to a future EP.
The above design can easily detect/fix:
- a dataset ID with a different home to what is recorded in the index.
- a lineage relationship between two dataset ids with a different classifier to what is recorded in the index.
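The classifier check can be sketched as a comparison of flattened relations. This is a simplified illustration of the idea behind relations_diff and not the implementation in the PR: the `flatten` and `diff` helpers, the `InconsistentLineage` exception and the (derived_id, source_id) key ordering are all assumptions made for this example, and cycle detection is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Mapping, Optional, Sequence, Tuple
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:  # same fields as the LineageTree proposed above
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


class InconsistentLineage(Exception):
    """Hypothetical stand-in for InconsistentLineageException."""


Pair = Tuple[UUID, UUID]  # (derived_id, source_id): ordering chosen for this sketch


def flatten(tree: LineageTree, rels: Optional[Dict[Pair, str]] = None) -> Dict[Pair, str]:
    """Collect one {(derived_id, source_id): classifier} entry per tree edge."""
    rels = {} if rels is None else rels
    for classifier, kids in (tree.children or {}).items():
        for kid in kids:
            if tree.direction is LineageDirection.SOURCES:
                pair = (tree.dataset_id, kid.dataset_id)  # the child is a source
            else:
                pair = (kid.dataset_id, tree.dataset_id)  # the child is derived
            if rels.setdefault(pair, classifier) != classifier:
                raise InconsistentLineage(f"two classifiers for {pair}")
            flatten(kid, rels)
    return rels


def diff(new: Mapping[Pair, str], existing: Mapping[Pair, str], allow_updates: bool = False):
    """Split new relations into records to add and records whose classifier differs."""
    to_add = {p: c for p, c in new.items() if p not in existing}
    to_update = {p: c for p, c in new.items() if p in existing and existing[p] != c}
    if to_update and not allow_updates:
        raise InconsistentLineage(f"{len(to_update)} relation(s) conflict with the index")
    return to_add, to_update


S = LineageDirection.SOURCES
ard_a, ard_b, geomedian = uuid4(), uuid4(), uuid4()
tree = LineageTree(S, geomedian, {"ard": [LineageTree(S, ard_a), LineageTree(S, ard_b)]})
new = flatten(tree)
existing = {(geomedian, ard_a): "ard1"}  # the index holds a different classifier

try:
    diff(new, existing)
    conflict_detected = False
except InconsistentLineage:
    conflict_detected = True

to_add, to_update = diff(new, existing, allow_updates=True)
```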
The above design as it stands cannot detect all cases of circular dependency when saving lineage information, although it provides plenty of tools to minimise the likelihood of them occurring.
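A hedged illustration of why this is hard: a cycle can only be found among the relations that are actually loaded for the check. The `has_cycle` helper below is a generic depth-first search, not the proposal's algorithm. It shows two sets of relations that are each acyclic but form a cycle once combined.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple
from uuid import UUID, uuid4


def has_cycle(edges: Iterable[Tuple[UUID, UUID]]) -> bool:
    """Depth-first search for a cycle in (derived_id, source_id) edges."""
    graph: Dict[UUID, Set[UUID]] = defaultdict(set)
    for derived, source in edges:
        graph[derived].add(source)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour: Dict[UUID, int] = defaultdict(int)

    def visit(node: UUID) -> bool:
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            if colour[nxt] == GREY:  # back to a node on the current path
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))


a, b, c = uuid4(), uuid4(), uuid4()
in_index = [(b, a), (c, b)]  # b derived from a, c derived from b
incoming = [(a, c)]          # a derived from c: fine when viewed alone

cycle_in_index = has_cycle(in_index)
cycle_in_incoming = has_cycle(incoming)
cycle_combined = has_cycle(in_index + incoming)
```

If the relations already in the index are not all loaded when the incoming batch is validated, the combined cycle goes unnoticed.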
Edit and add your comments here
Current EO3 metadata standard (i.e. what is recorded in the json stac or yaml dataset metadata document, as read at indexing time) supports:
- Source lineage only, only one level-deep.
- DOES support multiple source IDs for a single classifier, but this is currently overwritten and flattened by the ODC at index time.
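With the proposed model, several source ids under one classifier can be kept as they are. The sketch below reads such a mapping into a shallow LineageTree. The function name `shallow_source_tree`, the example ids and the home string are hypothetical, and the class method described earlier in this proposal may have a different signature.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional, Sequence
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:  # same fields as the class defined above
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


def shallow_source_tree(
    dataset_id: str,
    sources: Mapping[str, Sequence[str]],
    home: Optional[str] = None,
) -> LineageTree:
    """Build a one-level source tree from a classifier -> list-of-ids mapping."""
    S = LineageDirection.SOURCES
    return LineageTree(
        direction=S,
        dataset_id=UUID(dataset_id),
        children={
            # Each source keeps children=None: its own sources are unknown here.
            classifier: [LineageTree(S, UUID(i), home=home) for i in ids]
            for classifier, ids in sources.items()
        },
    )


doc = {
    "id": "00000000-0000-0000-0000-0000000000aa",
    "lineage": {
        "ard": [
            "00000000-0000-0000-0000-000000000001",
            "00000000-0000-0000-0000-000000000002",
            "00000000-0000-0000-0000-000000000003",
        ]
    },
}
tree = shallow_source_tree(doc["id"], doc["lineage"], home="upstream-index")
```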
Although the API enhancements in this EP can proceed with the current EO3 metadata format, this EP may be an appropriate place to consider adding (optional) extensions to the EO3 format to support:
- nested lineage; and
- derived as well as source lineage
If any extensions to the EO3 format are to be made, they should be made as part of a larger effort to draft a more formal definition of the EO3 format.
- Paul Haesler (@SpacemanPaul)