Skip to content

Detailed metadata for formalizing the information model of geospatial EO machine learning training data.

Notifications You must be signed in to change notification settings

openrsgis/trainingdml-ai-extension

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 

Repository files navigation

TrainingDML-AI Extension Specification

This document explains the fields of The Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Extension to the SpatioTemporal Asset Catalog (STAC) specification. Training data plays a fundamental role in Earth Observation (EO) Artificial Intelligence Machine Learning (AI/ML), especially Deep Learning (DL). The TrainingDML-AI Extension provides detailed metadata for formalizing the information model of geospatial machine learning training data. This includes but is not limited to the following aspects:

· How the training data is prepared, such as provenance or quality;

· How to specify different metadata used for different ML tasks such as scene/object/pixel levels;

· How to differentiate the high-level training data information model and extended information models specific to various ML applications;

· How to introduce external classification schemes and flexible means for representing ground truth labeling.

Collection Fields

The fields in the table below can be used in these parts of STAC documents:

  • Catalogs
  • Collections
  • Item Properties (incl. Summaries in Collections)
  • Assets (for both Collections and Items, incl. Item Asset Definitions in Collections)
  • Links
Field Name Type Description
tdml:amount_of_training_data number Required, Total number of training samples in the AI training dataset.
tdml:classification_schema string Classification schema for classes used in the AI training dataset.
tdml:metrics_in_LIT [MetricsInLIT Object] Results of performance metrics achieved by AI/ML algorithms in the peer-reviewed literature.
tdml:image_sizes [number] Size of the images used in the EO training dataset.
tdml:scope Scope Object Description of the scope of the training dataset.
tdml:quality Quality Object Quality description of training datasets.
tdml:provenance provenance Object Provenance information of the training data and training dataset.
tdml:data_sources [string] Citation of data sources.

In addition, fields from the following extensions must be imported in the item:

Item Fields

The fields in the table below can be used in these parts of STAC documents:

  • Catalogs
  • Collections
  • Item Properties (incl. Summaries in Collections)
  • Assets (for both Collections and Items, incl. Item Asset Definitions in Collections)
  • Links
Field Name Type Description
tdml:quality Quality Object Quality description of training datasets.
tdml:provenance Provenance Object Provenance information of the training data and training dataset.
tdml:data_sources [string] Citation of data sources.

In addition, fields from the following extensions must be imported in the item:

Additional Field Information

tdml:amount_of_training_data

Total number of training samples in the AI training dataset.

tdml:classification_schema

Time when the AI training dataset was created.

tdml:metrics_in_LIT

Results of performance metrics achieved by AI/ML algorithms in the peer-reviewed literature.

tdml:image_sizes

Size of the images used in the EO training dataset. The imageSize is recommended to be expressed in the form of "width*height". If the imageSize of the training data in dataset is not the same, you can use the imageSize of Smallest size image and the imageSize of largest image to express, such as "minWidth*minHeight~maxWidth*maxHeight".

tdml:scope

Description of the scope of the training dataset.

tdml:quality

Quality description of training datasets. Quality will be aligned with the DQ_DataQuality class in the ISO 19157:2013 spatial data quality model, and the quality assessment metrics for the sample dataset are described using the quality metric classes defined in ISO 19157:2013.

tdml:provenance

provenance includes the labeler and the labeling procedure, which can be mapped to the agent and activity respectively in W3C PROV model. The labeler identifies the agent that creates the training dataset or individual samples, and the labeling procedure represents the process for data generation.

tdml:data_sources

Citation of data sources.

MetricsInLIT Object

This is the introduction for the purpose and the content of the metricsInLIT Object used in field: tdml:metricsInLIT.

Field Name Type Description
doi string REQUIRED. Digital object identifier of the peer-reviewed literature.
algorithm string AI/ML algorithms used in the peer-reviewed literature.
metrics object REQUIRED. Metrics and results of AI/ML algorithms in the peer-reviewed literature.

An example of yolov5's MetricsInLIT on the DOTA-v1.5 dataset:

{
    "doi":"10.5281/zenodo.3983579",
    "algotithm": "YOLOV5",
    "metrics":[
        {
            "name": "AP50",
            "value": "66.1"
        },
        {
            "name": "AP50:95",
            "value": "41.5"
        },
        {
            "name": "AR1",
            "value": "39.4"
        },
        {
            "name": "AR10",
            "value": "54.9"
        },
        {
            "name": "AR100",
            "value": "58.4"
        }
	]
}

Quality Object

This is the introduction for the purpose and the content of the Quality Object used in filed: tdml:quality.

Field Name Type Description
scope [Scope Object] REQUIRED. the scope of quality information is specified.
report [QualityElement Object] Quality reports about the training dataset.

Provenance Object

This is the introduction for the purpose and the content of the Provenance Object used in filed: tdml:provenance.

Field Name Type Description
scope [Scope Object] REQUIRED. the scope of labeling information is specified.
labeling_methods [string] Methods used in the labeling procedure.
labeling_tools [string] Tools or software used in the labeling procedure.
labeler_names [string] Name of the labeler.

Scope Object

Field Name Type Description
level string REQUIRED. The applicable level of data.
level_description object REQUIRED. A more detailed description of the level to better understand the scope of application of the data.

QualityElement Object

This is the introduction for the purpose and the content of the qualityElement. Elements related to quality, or more specifically, bias that can be used to reduce the errors when using AI/ML. For example, any knowledge of the TD imbalance and mislabeling can be stored in TD quality.

Field Name Type Description
type string REQUIRED. Type of evaluation quality.
measure string REQUIRED. Reference to measure used.
evaluation_method string REQUIRED. Evaluation information.
result string Value obtained from applying a data quality measure..

Relation types

It is highly recommended to use the version-history as a rel type in the Link Object to record the changed training samples between two versions at the collection level as a changeset. The changeset is used to track updates made to a specific version of a sample dataset, identified by its "datasetId" and "version".

There are three types of updates for sample data units: "add" for adding new sample data units, "modify" for modifying existing sample data units, and "delete" for removing sample data units. "Modify" includes changes to metadata of sample data, changes to original data used in sample data, and additions, modifications, and deletions of all labeled objects in the sample data.

Best Practices

Core and other extensions fields

It is higly recommended to use the following fields to describe the training dataset:

Field name TrainingDML-AI usage
providers People or organizations who provide the AI training dataset.
label: overviews Statistics results of training samples in each class.
label: classes REQUIRED. Classes used in the AI training dataset.
label: tasks REQUIRED. Type description of the EO task.
label: methods Methods used in the labeling procedure.
ml-aoi: split Training type of the individual AI.
eo: bands Description of the image bands used in the EO training dataset.
sci:doi Digital object identifier of the AI training dataset.

Contributing

All contributions are subject to the STAC Specification Code of Conduct. For contributions, please follow the STAC specification contributing guide Instructions for running tests are copied here for convenience.

Running tests

The same checks that run as checks on PR's are part of the repository and can be run locally to verify that changes are valid. To run tests locally, you'll need npm, which is a standard part of any node.js installation.

First you'll need to install everything with npm once. Just navigate to the root of this repository and on your command line run:

npm install

Then to check markdown formatting and test the examples against the JSON schema, you can run:

npm test

This will spit out the same texts that you see online, and you can then go and fix your markdown or examples.

If the tests reveal formatting problems with the examples, you can fix them with:

npm run format-examples

About

Detailed metadata for formalizing the information model of geospatial EO machine learning training data.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published