
TG1 Framework Cheat Sheet

Paul J. Morris edited this page Sep 6, 2024 · 3 revisions

Fitness For Use Framework Summary of Mathematical Formalization

This is a summary of pp. 89-108 in: Veiga, A.K. 2016. A conceptual framework on biodiversity data quality. Tese (Doutorado) [Doctoral Thesis] Escola Politécnica da Universidade de São Paulo. Departamento de Engenharia de Computação e Sistemas Digitais. 156 p. With changes, as discussed in the TDWG data quality interest group: dcmitype:Dataset replaced with Multi-Record, Improvement Method changed to Enhancement Method, Improvement Policy changed to Enhancement Policy, and Data Quality Improvement changed to Data Quality Amendment.

For an updated version with concept names aligned to the bdqffdq vocabulary, see the bdqffdq summary in the draft BDQ Core standard.

Fundamental Concepts

  • U = Use Case
  • D = Dimension (e.g. precision)
  • IE = Information Element (e.g. coordinates)
  • M = Mechanism
  • C = Criterion (e.g. “in controlled vocabulary”)
  • E = Enhancement (description of a means by which data could be improved e.g. recommend replacement value from a controlled vocabulary).
  • S = Specification (specification of how a criterion is to be evaluated e.g. “Iterate records and calculate the proportion of records with scientific name different from null”)

Properties

  • US = Usages
  • ID = Persistent GUID
  • RT = Resource Type ~ dc:type { Single Record, Multi-Record}
  • sr = instance of Single Record
  • ds = instance of Multi-Record (formerly Dataset)
  • V = Data Resource Value
  • R = Assertion (result from a mechanism of a Validation, Measurement, or Amendment on a Resource)

Notation

  • X: Domain
  • x: instance
  • { } set
  • < > tuple
  • ⋃ union
  • ⋀ and (logical conjunction)
  • ∈ is a member of

Derived Concepts

General

Dimension in Context

CD = { cd | cd = < ie, d, rt >, ie ∈ IE, d ∈ D ⋀ rt ∈ RT }

cd1 = < ie1, d1, rt1 >
  • “coordinate precision of single records”

Criterion in Context

 CC = { cc | cc = < ie, c, rt >, ie ∈ IE, c ∈ C ⋀ rt ∈ RT }   

 cc1 = < ie1, c1, rt1 >
  • “The value of Basis of Record of single records must be in the controlled vocabulary”

Enhancement in Context

CE = { ce | ce = < ie, e, rt >, ie ∈ IE, e ∈ E ⋀ rt ∈ RT }

ce1 = < ie1, e1, rt1 >
  • “Recommend valid value for taxon name in single record”
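The three contextualized concepts share one shape: a tuple of an information element, a concept (dimension, criterion, or enhancement), and a resource type. The following is a minimal Python sketch of that shape, not part of the framework itself. The class names and the example domain values are hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import NamedTuple


class DimensionInContext(NamedTuple):
    """cd = < ie, d, rt >"""
    ie: str  # information element
    d: str   # dimension
    rt: str  # resource type


class CriterionInContext(NamedTuple):
    """cc = < ie, c, rt >"""
    ie: str
    c: str   # criterion
    rt: str


class EnhancementInContext(NamedTuple):
    """ce = < ie, e, rt >"""
    ie: str
    e: str   # enhancement
    rt: str


# Illustrative domains (hypothetical values, not taken from the source).
IE = {"coordinates", "basisOfRecord"}
D = {"precision", "completeness"}
RT = {"SingleRecord", "MultiRecord"}

# CD = { < ie, d, rt > | ie in IE, d in D and rt in RT }
CD = {DimensionInContext(ie, d, rt) for ie, d, rt in product(IE, D, RT)}

# "coordinate precision of single records"
cd1 = DimensionInContext("coordinates", "precision", "SingleRecord")
print(cd1 in CD)  # True
print(len(CD))    # 8
```

CC and CE can be built the same way by swapping D for the criterion or enhancement domain.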

Data Quality Needs

Use Case Coverage

 UC(u) = { us | u ∈ U ⋀ us ⊂ US}

 uc(u1) = {us1, us2}
  • “A Use Case for Niche Modeling covers MAXENT and GARP modeling”

Valuable Information Elements

 VIE(u) = {ie | ie ⊂ IE ⋀ u ∈ U}
  • For a Use Case, what information elements are valuable.

Acceptable Data Quality Measure

 AM(cd) = {cc | cd ∈ CD ⋀ cc ⊂ CC}

 am(cd1) = {cc1, cc2}
  • For the dimension in context coordinate completeness in a dataset, acceptable quality is met by all records having coordinates complete.

Enhancement Target

IT(ce) = {cd ⋃ cc | cd ∈ CD, cc ∈ CC ⋀ ce ∈ CE}

it(ce1) = {cd1, cc2}
  • Recommending coordinates based on textual locality improves the coordinate completeness of single records and may result in compliance with the criterion “data set must have all records with coordinates”.

Measurement Policy

MP(u) = {cd | cd ⊂ CD ⋀ u ∈ U}

mp(u1) = {cd1, cd2, cd3, cd4}
mp(u1) = {< ie1, d1, rt2 >, < ie1, d1, rt1 >, < ie2, d1, rt1 >, < ie2, d2, rt2 >}

Validation Policy

VP(u) = {cc | cc ⊂ CC ⋀ u ∈ U}

vp(u1) = {cc1, cc2}
vp(u1) = {< ie1, c1, rt1>, < ie2, c2, rt2> }

Enhancement Policy

 IP(u) = {ce | ce ⊂ CE ⋀ u ∈ U}

 ip(u1) = {ce1, ce2}

Data Quality Profile

  DQP(u) = {dqp | dqp = mp(u) ⋃ vp(u) ⋃ ip(u), mp ∈ MP, vp ∈ VP, ip ∈ IP ⋀ u ∈ U}

  dqp(u1) = {mp(u1), vp(u1), ip(u1)}
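As an illustration of how the three policies combine into a profile, the sketch below stores each policy as a mapping from a use case to a set of contextualized tuples and forms the profile as their union, following the definition of DQP(u). The dictionary names and the function `data_quality_profile` are hypothetical, and the identifiers (`ie1`, `d1`, and so on) are placeholders mirroring the examples above.

```python
# Contextualized concepts as plain tuples:
# (information element, concept, resource type).
cd1 = ("ie1", "d1", "rt1")
cd2 = ("ie1", "d1", "rt2")
cc1 = ("ie1", "c1", "rt1")
cc2 = ("ie2", "c2", "rt2")
ce1 = ("ie1", "e1", "rt1")

# Each policy maps a use case to a subset of CD, CC, or CE.
measurement_policy = {"u1": {cd1, cd2}}   # mp(u)
validation_policy = {"u1": {cc1, cc2}}    # vp(u)
enhancement_policy = {"u1": {ce1}}        # ip(u)


def data_quality_profile(u):
    """dqp(u) = mp(u) ∪ vp(u) ∪ ip(u), with missing policies treated as empty."""
    return (
        measurement_policy.get(u, set())
        | validation_policy.get(u, set())
        | enhancement_policy.get(u, set())
    )


dqp_u1 = data_quality_profile("u1")
print(len(dqp_u1))    # 5
print(cc1 in dqp_u1)  # True
```

Treating an absent policy as the empty set is a design choice of this sketch; the source does not say how an undefined policy should be handled.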

Data Quality Solutions

Measurement Method

MM(cd) = {s | s ⊂ S ⋀ cd ∈ CD}

Validation Method

VM(cc) = {s | s ⊂ S ⋀ cc ∈ CC}

Enhancement Method

IM(ce) = {s | s ⊂ S ⋀ ce ∈ CE}

Implementation

 I(s) = {m | m ⊂ M ⋀ s ∈ S}

 i(s1) = {m1, m2}

Mechanism Coverage

MC(m) = {s | s ⊂ S ⋀ m ∈ M}

mc(m1) = {s1, s2}
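Implementation and Mechanism Coverage can be read as two views of one relation between specifications and mechanisms. The sketch below assumes that reading (the source defines the two separately and does not state that one is derived from the other) and computes mc(m) by inverting i(s). The function name and identifiers are hypothetical.

```python
from collections import defaultdict

# i(s): the mechanisms that implement each specification.
implementation = {
    "s1": {"m1", "m2"},
    "s2": {"m1"},
}


def mechanism_coverage(implementation):
    """Invert i(s) to obtain mc(m): the specifications each mechanism covers."""
    coverage = defaultdict(set)
    for spec, mechanisms in implementation.items():
        for mechanism in mechanisms:
            coverage[mechanism].add(spec)
    return dict(coverage)


mc = mechanism_coverage(implementation)
print(sorted(mc["m1"]))  # ['s1', 's2']
print(sorted(mc["m2"]))  # ['s1']
```

With this input the result agrees with the examples i(s1) = {m1, m2} and mc(m1) = {s1, s2}.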

Data Quality Reports

Data Resource

DR = { dr | dr = < id, rt, v >, id ∈ ID, rt ∈ RT, (rt = sr ⋁ rt = ds) ⋀ v ∈ V }

dr1 = < id1, rt1, v1 >
  • “dr1 is a Data Resource which represents the Dataset "3cc6171e-8c52-4f65-ad7a-32c74e395f29" which contains 251,744 records”. Data resources are defined as having persistent GUIDs.

Data Quality Measure

 DQM(dr) = {dqm | dqm = < cd, s, m, r >, cd ∈ CD, s ∈ S, m ∈ M, r ∈ R ⋀ dr ∈ DR}
 
 dqm(dr1) = {< cd1, s1, m1, r1 >}
  • Coordinate numerical precision of the dataset 3cc6171e-8c52-4f65-ad7a-32c74e395f29 is 6.16 and this value was assigned by the software DwC-A Validator 2.0 which calculated the value by the average of significant digits of each record of the dataset.

Data Quality Validation

 DQV(dr) = {dqv | dqv = < cc, s, m, r >, cc ∈ CC, s ∈ S, m ∈ M, r ∈ R ⋀ dr ∈ DR}

 dqv(dr1) = {< cc1, s1, m1, r1 >}
  • A DQ Validation asserts that the Contextualized Criterion “Geodetic Datum must be supplied” is COMPLIANT for a specific species occurrence and this validation was performed by the software Darwin Test by checking if the field Geodetic Datum of the record was not empty.
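The validation described above can be sketched as a function that checks one record and returns the tuple < cc, s, m, r >. This is an illustrative sketch only: the record is assumed to be a dictionary keyed by the Darwin Core term name `geodeticDatum`, the mechanism name `example-validator` is hypothetical, and the result label `NOT_COMPLIANT` is an assumption, since the text names only COMPLIANT.

```python
from typing import NamedTuple


class Validation(NamedTuple):
    """dqv = < cc, s, m, r >"""
    cc: tuple  # criterion in context < ie, c, rt >
    s: str     # specification
    m: str     # mechanism
    r: str     # assertion result


def validate_datum_supplied(record: dict) -> Validation:
    """Assert whether the geodetic datum of a single record is non-empty."""
    value = record.get("geodeticDatum")
    supplied = value is not None and str(value).strip() != ""
    return Validation(
        cc=("geodeticDatum", "must be supplied", "SingleRecord"),
        s="check that the field geodeticDatum is not empty",
        m="example-validator",  # hypothetical mechanism name
        r="COMPLIANT" if supplied else "NOT_COMPLIANT",  # second label is assumed
    )


print(validate_datum_supplied({"geodeticDatum": "WGS84"}).r)  # COMPLIANT
print(validate_datum_supplied({"geodeticDatum": "  "}).r)     # NOT_COMPLIANT
```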

Data Quality Amendment

 DQI(dr) = {dqi | dqi = < ce, s, m, r >, ce ∈ CE, s ∈ S, m ∈ M, r ∈ R ⋀ dr ∈ DR}

 dqi(dr1) = {< ce1, s1, m1, r1 >}
  • An amendment is proposed to replace the current value of the scientific name by the value “Apis” because Apis is the most similar valid name based on the Levenshtein distance in the Catalog of Life database using the software DwC-A Validator 2.0.
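The amendment in this example picks the valid name with the smallest Levenshtein distance to the current value. The sketch below shows one way that selection could work. It is not the behaviour of any named software: the short list `valid_names` stands in for a lookup against an authority such as the Catalog of Life, and the function names and the alphabetical tie-break are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def propose_name(value, valid_names):
    """Return the valid name closest to value; ties are broken alphabetically."""
    return min(sorted(valid_names), key=lambda name: levenshtein(value, name))


valid_names = ["Apis", "Bombus", "Xylocopa"]  # stand-in for an authority list
print(levenshtein("Apsi", "Apis"))        # 2
print(propose_name("Apsi", valid_names))  # Apis
```

The proposed value would then be carried as the result r of the amendment tuple < ce, s, m, r >.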

Data Quality Assessment

 A(dr) = {dqm(dr) ⋃ dqv(dr) ⋃ dqi(dr) | dqm ∈ DQM, dqv ∈ DQV, dqi ∈ DQI ⋀ dr ∈ DR}

 a(dr1) = {dqm1, dqm2, dqm3, dqv1, dqi1}

Quality Control

 QC(dr) = {dqv(dr) ⋃ dqi(dr) | dqv ∈ DQV, dqi ∈ DQI ⋀ dr ∈ DR}

 qc(dr1) = {dqv1, dqi1}

Quality Assurance

 QA(dr) = {dqv(dr) | dqv ∈ DQV ⋀ dr ∈ DR}

 qa(dr1) = {dqv1, dqv2}
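Assessment, Quality Control, and Quality Assurance differ only in which kinds of assertion they include for a data resource. A minimal sketch, assuming each assertion carries a `kind` tag (a hypothetical field added here so one set can hold all three kinds) and using placeholder identifiers:

```python
from typing import NamedTuple


class Assertion(NamedTuple):
    kind: str     # "measure", "validation", or "amendment"
    context: str  # identifier of the cd, cc, or ce
    s: str        # specification
    m: str        # mechanism
    r: str        # result


# All assertions reported for one data resource (placeholder values).
assertions_dr1 = {
    Assertion("measure", "cd1", "s1", "m1", "6.16"),
    Assertion("validation", "cc1", "s2", "m1", "COMPLIANT"),
    Assertion("validation", "cc2", "s3", "m2", "COMPLIANT"),
    Assertion("amendment", "ce1", "s4", "m1", "Apis"),
}


def assessment(assertions):
    """a(dr): measures, validations, and amendments."""
    return set(assertions)


def quality_control(assertions):
    """qc(dr): validations and amendments."""
    return {a for a in assertions if a.kind in ("validation", "amendment")}


def quality_assurance(assertions):
    """qa(dr): validations only."""
    return {a for a in assertions if a.kind == "validation"}


print(len(assessment(assertions_dr1)))         # 4
print(len(quality_control(assertions_dr1)))    # 3
print(len(quality_assurance(assertions_dr1)))  # 2
```

In this sketch the three reports are nested: every Quality Assurance assertion is also in Quality Control, and every Quality Control assertion is also in the Assessment.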