# INTEGRATE YOUR DATA WITH OTHER DATA
- Why?
- Key Information
- Top 5 References
### **Topic:** Climate and Forecast
##### What is it:
The Climate and Forecast (CF) metadata conventions are designed to promote the processing and sharing of files created with the NetCDF (Network Common Data Form) [API (Application Programming Interface)](https://en.wikipedia.org/wiki/API). The conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. Said more plainly, the conventions explain the extra information (metadata) that clearly describes what each piece of data means and when and where it was collected.
##### Why?
This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities. The CF convention includes a standard name table, which defines strings that identify physical quantities. CF is well established, although not a perfect fit for biology. Even so, biological (and other) standards should consider which existing CF terms they can reuse before reinventing the wheel.
##### Key Information (How / What)
- [Version 1.0](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.0/build/cf-conventions.html) was released in October 2003; the current version is 1.12.
- CF is a convention built on top of the netCDF standard, and it generalizes and extends the netCDF [COARDS conventions](https://ferret.pmel.noaa.gov/noaa_coop/coop_cdf_profile.html).
- Just as DwC requires knowledge of the EML standard, CF requires knowledge of other standards. Because CF is a netCDF convention, it assumes the netCDF standard is being followed, and it relies on the UDUNITS system for specifying units (see [Units in CF (UDUNITS)](https://cfconventions.org/faq.html#units-in-cf-udunits)).
- A CF principle is to be self-contained. For example, the CF Standard Names aim to be as general and well defined as possible, so the reader does not have to consult outside sources to understand the terms.
- To find standard names that describe your data, open the latest [Standard Name table](https://cfconventions.org/vocabularies.html) (as HTML or XML) and search it for words typically used for your data; the sketch after this list shows how a standard name is attached to a variable.
- The NERC Vocabulary Server hosts CF and maintains mappings from CF to other vocabularies, decomposing them into more useful SKOS-level semantics. For example: <https://vocab.nerc.ac.uk/search_nvs/map/?vocab=P07>
- The NERC Vocabulary Server is a platform for searching and browsing vocabularies from a variety of domains. NERC hosts a collection of vocabularies, including the Climate and Forecast (CF) vocabulary, and provides an interface for searching for terms across all the hosted vocabularies.
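The sketch below shows how these pieces fit together in practice: a `standard_name` taken from the Standard Name table, a UDUNITS-compatible `units` string, and a global `Conventions` attribute declaring the CF version, written to a netCDF file with xarray. It is not part of the CF documentation; the variable names, coordinate values, and output filename are illustrative assumptions.

```python
# A minimal sketch, assuming xarray with a netCDF backend (e.g. netCDF4)
# is installed. Variable names, coordinates, and the filename are
# illustrative, not prescribed by CF.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "sst": (("time", "lat", "lon"), np.random.rand(2, 3, 4).astype("float32")),
    },
    coords={
        "time": np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[ns]"),
        "lat": [10.0, 20.0, 30.0],
        "lon": [-50.0, -40.0, -30.0, -20.0],
    },
)

# Variable-level CF attributes: standard_name comes from the CF Standard
# Name table; units must be a UDUNITS-compatible string.
ds["sst"].attrs = {
    "standard_name": "sea_water_temperature",
    "units": "degree_C",
    "long_name": "Sea surface temperature",
}
ds["lat"].attrs = {"standard_name": "latitude", "units": "degrees_north"}
ds["lon"].attrs = {"standard_name": "longitude", "units": "degrees_east"}

# Global attribute declaring which version of the conventions is followed.
ds.attrs["Conventions"] = "CF-1.12"
ds.to_netcdf("sst_example.nc")
```

Compliance tools from the [list of software for working with CF](https://cfconventions.org/software.html) can then be run against the resulting file to check that the attributes follow the conventions.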
##### Top 5 References
1. <https://cfconventions.org/>
2. [CF GitHub Discussions](https://github.com/orgs/cf-convention/discussions): announcements, a forum for community discussion, questions and answers
3. Current proposals for changing CF (CF GitHub issues): [vocabulary](https://github.com/cf-convention/vocabularies/issues) (including standard names), [conventions](https://github.com/cf-convention/cf-conventions/issues), this [website](https://github.com/cf-convention/cf-convention.github.io/issues) (including governance)
4. [CF GitHub organisation](https://github.com/cf-convention)
5. [CF FAQ](https://cfconventions.org/faq.html)
6. [List of software for working with CF](https://cfconventions.org/software.html)
7. [List of Projects and Activities that Use the CF Metadata Conventions](https://cfconventions.org/projects-activities.html)
8. [Paper](https://doi.org/10.5194/gmd-10-4619-2017) describing the CF data model and reference software
9. Overview of CF basics as a [presentation](https://cfconventions.org/Data/cf-documents/overview/viewgraphs.pdf) and [paper](https://cfconventions.org/Data/cf-documents/overview/article.pdf)
10. <https://www.ogc.org/standard/netcdf/>
11. <https://gdal.org/en/latest/drivers/vector/netcdf.html>
##### Repositories that use this data standard
1. [NOAA National Centers for Environmental Information (NCEI)](https://www.ncei.noaa.gov/)
2. [NASA Earthdata](https://www.earthdata.nasa.gov/)
### **Topic:** Darwin Core
##### What is it:
Darwin Core (DwC) is a data standard that offers a stable, straightforward, and flexible framework for compiling biodiversity data from varied and variable sources. It is an extension, for biodiversity informatics, of Dublin Core \[Link to Dublin Core section in the primer guidelines.\], a set of metadata terms used by libraries to describe physical and digital resources. The standard was originally developed by the Biodiversity Information Standards (TDWG, formerly known as the Taxonomic Databases Working Group) community, and it is currently maintained by the Darwin Core Maintenance Interest Group; feedback and participation in its development are open to the public. Darwin Core includes a glossary of terms intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions, improving data reuse in a variety of contexts. In short, it maps information from multiple sources and institutions in a cohesive way for the broader community. Darwin Core is primarily organized around taxa and their occurrence in nature, as documented by observations, specimens, samples, and related information. Taxonomic occurrence data can be standardized to Darwin Core irrespective of the observing method by which the data were collected (e.g., observational data, genomics, imaging, animal tracking). Through the use of common terminology and controlled vocabularies, downstream users can more easily discover, search, evaluate, integrate, and compare datasets.
- Darwin Core Archive (DwC-A): a set of interlinked tables packaged together
- A DwC-A includes three things: the data itself, an EML file, and a meta.xml file
- Some repositories only ask for EML (e.g., EDI)
##### Why?
Biodiversity data, be it in museum collections, environmental monitoring programs, research programs, or community science projects (e.g., iNaturalist), is collected and managed in many different systems and environments. Additionally, the data is often very heterogeneous within and across these systems, depending on research objectives. Formatting data to the Darwin Core standard can play a fundamental role in facilitating open-access biodiversity data sharing, use, and reuse.
Why use DwC in place of OGC or other standards, like CF? People collect biodiversity data in all sorts of ways (observations, specimens, genomic samples, imagery, animal tracking), but those data share consistent aspects: the what, where, and when, plus the details of each. Being able to put all of these disparate collection methods together is exactly what aligning your data allows. The fields in your dataset carry whatever names you gave them, but most have Darwin Core equivalents with specific, standard names; mapping your names to those terms is how you align your data, as in the sketch below.
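As a concrete illustration of that alignment, here is a minimal sketch (assuming a pandas workflow; the local column names are hypothetical) of renaming locally named fields to their Darwin Core equivalents:

```python
# A minimal sketch, not an authoritative workflow: renaming local column
# names to Darwin Core terms (https://dwc.tdwg.org/terms/) with pandas.
import pandas as pd

# Hypothetical field data with locally chosen column names.
field_survey = pd.DataFrame({
    "species": ["Gadus morhua", "Calanus finmarchicus"],
    "lat": [60.1, 59.8],
    "lon": [-4.2, -3.9],
    "date": ["2023-06-01", "2023-06-02"],
    "present": [True, True],
})

# Local name -> Darwin Core term.
dwc_mapping = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
}

occurrence = field_survey.rename(columns=dwc_mapping)
occurrence["occurrenceStatus"] = field_survey["present"].map(
    {True: "present", False: "absent"}
)
occurrence["basisOfRecord"] = "HumanObservation"  # DwC controlled-vocabulary value
occurrence = occurrence.drop(columns=["present"])
print(occurrence)
```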
##### Key Information (How / What)
In practice, adopting the Darwin Core standard revolves around a standard file format, the Darwin Core Archive (DwC-A). This is the biodiversity informatics data standard that uses the Darwin Core terms to produce data records for biodiversity data. Essentially, a DwC-A is a compressed (.zip) file that contains interconnected text (e.g., CSV or TSV) files and enables data publishers to share data using common terminology. These data files are logically arranged in a star-like manner, typically consisting of a core file with one or more extension files, connected through the use of primary and foreign keys.
Aside from the text data tables, a DwC-A contains .xml files that facilitate human and machine interpretation of the data. A descriptor file (meta.xml) describes the contents of the compressed file, as well as the relationships between the core and any extensions. An eml.xml file describes the datasets contained in the DwC-A.
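To make the archive structure concrete, here is a minimal sketch of packaging a DwC-A by hand with the Python standard library. The records, internal file names, and the eml.xml stub are illustrative assumptions; in practice archives are usually produced with dedicated tooling (for example, GBIF's Integrated Publishing Toolkit) rather than written by hand.

```python
# A minimal sketch of a Darwin Core Archive: a zip containing a core data
# table, a meta.xml descriptor, and an eml.xml metadata file.
import zipfile

# Core occurrence table (illustrative records).
occurrence_csv = (
    "occurrenceID,scientificName,eventDate,decimalLatitude,decimalLongitude\n"
    "occ-001,Gadus morhua,2023-06-01,60.1,-4.2\n"
)

# meta.xml maps each column of the core file to a Darwin Core term URI.
meta_xml = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>
"""

# Stub only: a real archive carries full EML metadata describing the dataset here.
eml_xml = "<!-- full EML metadata describing the dataset goes here -->"

with zipfile.ZipFile("my_dwca.zip", "w") as dwca:
    dwca.writestr("occurrence.csv", occurrence_csv)
    dwca.writestr("meta.xml", meta_xml)
    dwca.writestr("eml.xml", eml_xml)
```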
##### Advanced Darwin Core
The (Extended) MeasurementOrFact extension (eMoF) can help pull in other vocabularies or standards and cross-link between them: each measurement row can reference controlled-vocabulary URIs (for example, from the NERC Vocabulary Server) that define the measurement type, value, and unit.
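Below is a minimal sketch (assuming pandas; the specific vocabulary URIs are placeholders to be looked up on the NERC Vocabulary Server) of an eMoF table whose `measurementTypeID` and `measurementUnitID` columns cross-link free-text measurements to controlled-vocabulary concepts:

```python
# A minimal sketch of an extended MeasurementOrFact (eMoF) table. The
# NVS concept URIs below are placeholders, not real identifiers; look up
# the exact concepts on https://vocab.nerc.ac.uk/ before publishing.
import pandas as pd

emof = pd.DataFrame({
    "occurrenceID": ["occ-001", "occ-001"],
    "measurementType": ["water temperature", "body length"],
    # Cross-link the free-text type to a controlled-vocabulary concept URI.
    "measurementTypeID": [
        "http://vocab.nerc.ac.uk/collection/P01/current/...",  # placeholder
        "http://vocab.nerc.ac.uk/collection/P01/current/...",  # placeholder
    ],
    "measurementValue": [7.4, 85.0],
    "measurementUnit": ["degrees Celsius", "millimetres"],
    "measurementUnitID": [
        "http://vocab.nerc.ac.uk/collection/P06/current/...",  # placeholder
        "http://vocab.nerc.ac.uk/collection/P06/current/...",  # placeholder
    ],
})

emof.to_csv("extendedmeasurementorfact.csv", index=False)
```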
##### Top 5 References
- [Darwin Core Quick Reference](https://dwc.tdwg.org/terms/)
- Wieczorek et al. (2012) - Darwin Core: An Evolving Community-Developed Biodiversity Data Standard: <https://doi.org/10.1371/journal.pone.0029715>
- Ecological Metadata Language (\*maybe we can point to EML section in Guidelines?)
##### Repositories that use this data standard
- [Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/)
- [Ocean Biodiversity Information System (OBIS)](https://obis.org/)
- [The Atlas of Living Australia (ALA)](https://www.ala.org.au/)
And many more; see this list of [Key projects using Darwin Core](https://en.wikipedia.org/wiki/Darwin_Core).
##### Is there anything missing?