forked from ESIPFed/bds-primer-best-practices
-
Notifications
You must be signed in to change notification settings - Fork 0
/
SOFTWARE_READY.qmd
395 lines (295 loc) · 16.6 KB
/
SOFTWARE_READY.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
# Make Your Data Software Ready
## Use non-proprietary formats
### Why?
- Allows data to be useful in perpetuity by ensuring data readability and reusability across multiple platforms.
- To align better with the FAIR principles (findability, accessibility, interoperability, reusability)
- Makes data more socially equitable, supporting open science. Proprietary formats can depend on software that require licenses, which not everyone can afford/has access to.
### Key Information
- *Non-proprietary formats* are supported by more than one developer and can be accessed with different software systems. For example, comma separated values (CSV) format is becoming an increasingly popular non-proprietary format.
- A [*proprietary file format*](https://en.wikipedia.org/wiki/Proprietary_file_format) is a file format of a company, organization, or individual that contains data that is ordered and stored according to a particular encoding-scheme, designed by the company or organization to be secret or with restricted access, such that the decoding and interpretation of this stored data is easily accomplished only with particular software or hardware that the company itself has developed. There may also be costs associated with it and access may be limited. Examples include `Microsoft Excel (xlsx)` and `ESRI shapefiles (shp)`.
- Many applications (e.g. Microsoft Office) allow exporting in multiple formats.
### Top References
- Table of commonly used formats for common data types\
<https://guides.osu.edu/c.php?g=707751&p=5027409>
- A more detailed table that is specific to US Federal records management\
<https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html>
## Structure tabular data in tidy/long format
### Why?
*This is specifically intended for tabular data*
- There is a clear and easy to understand structure that can make your data more machine readable and easier to analyze/visualize
- Clear structure: one observation per row
- Data are as atomic as possible (e.g., don't mix types in field)
- In the biological data community, tidy formats are more likely to work with commonly-used software
- Easier to aggregate data across multiple files
### Key Information
Example of Wide Format
```{r}
#| label: tbl-tidyformat wide
library(dplyr)
library(flextable)
#library(gt)
example_df <- data.frame(species = c("Tilia americana",
"Pinus strobus"),
site_01 = sample(0:5, size = 2),
site_02 = sample(0:5, size = 2),
site_03 = sample(0:5, size = 2))
long_example <- tidyr::pivot_longer(data = example_df, cols = site_01:site_03, names_to = "site", values_to = "count")
# flextable is awesome, but captions aren't working yet: <https://github.com/quarto-dev/quarto-cli/issues/1556
# for now, just add print lines and maybe switch back in the future
flextable(example_df) %>%
set_caption(caption = "Wide Format") %>%
fontsize(size = 9) %>%
autofit() %>%
theme_zebra()
```
Example of Long Format
```{r}
#| label: tbl-tidyformat long
library(dplyr)
library(flextable)
#library(gt)
example_df <- data.frame(species = c("Tilia americana",
"Pinus strobus"),
site_01 = sample(0:5, size = 2),
site_02 = sample(0:5, size = 2),
site_03 = sample(0:5, size = 2))
long_example <- tidyr::pivot_longer(data = example_df, cols = site_01:site_03, names_to = "site", values_to = "count")
# flextable is awesome, but captions aren't working yet: <https://github.com/quarto-dev/quarto-cli/issues/1556
# for now, just add print lines and maybe switch back in the future
flextable(long_example) %>%
set_caption(caption = "Long Format") %>%
fontsize(size = 9) %>%
autofit() %>% theme_zebra()
# example_df %>%
# gt(caption = "Wide Format")
#
# long_example %>%
# gt(caption = "Long Format")
```
- Can be tricky working with multiple column datatypes
- Don't use colors or text formatting in tabular data, and only include column names as metadata. All other notes, definitions, etc. should be in an external metadata file (e.g. data dictionary)
### Top References
- Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1--23.\
<https://doi.org/10.18637/jss.v059.i10>
- Data Sharing and Management Snafu in 3 Short Acts (video)\
<https://www.youtube.com/watch?v=N2zK3s=Atr-4&t=7s>
- Tips for working with data in BASH\
<https://www.datafix.com.au/BASHing/2022-01-12.html>
- Data Organization in Spreadsheets for Ecologists\
<https://datacarpentry.org/spreadsheet-ecology-lesson/>
- Cleaning Data and Quality Control\
<https://edirepository.org/resources/cleaning-data-and-quality-control#data-table-structure>
## Follow ISO 8601 for dates
```{r xkcd1, echo = FALSE, out.width="40%", fig.align="center", fig.cap="https://imgs.xkcd.com/comics/iso_8601.png"}
knitr::include_graphics('https://imgs.xkcd.com/comics/iso_8601.png')
```
### Why?
- Internationally accepted format used across multiple schemas (e.g. `Darwin Core`, `EML`, `ISO 19115`)
- Removes ambiguity related to timezone, daylight savings time changes, and time of day
- Better software integration of time date/time elements
### Key Information
- UTC (AKA `Zulu` or `GMT`): Coordinated Universal Time (UTC) is the primary time standard by which the world regulates clocks and time. It is time relative to `0°` longitude and is not adjusted for daylight saving time. ([from Wikipedia](https://en.wikipedia.org/wiki/ISO_8601)).
- Conversion to UTC, or between time zones, may depend on daylight savings
*Examples: April 3, 2023 standardized to ISO 8601*
```{r}
#| label: tbl-datetime1
#| echo: FALSE
#| message: FALSE
#| warnings: FALSE
#| results: 'asis'}
tab1 <-
tibble(
Description = c(
"Date",
"Date and Time with timezone offset",
"Date and Time in UTC",
"Time Interval in UTC (April 3 - 5, 2023)"
),
"Written in ISO 8601" = c(
"2023-04-03",
"2023-04-03T18:29:38+00:00",
"2023-04-03T18:29:38Z",
"2023-04-03T18:29:38Z/2023-04-05T00:29:38Z"
)
)
flextable(tab1) %>%
#set_caption(caption = "") %>%
autofit() %>%
fontsize(size = 9) %>%
theme_zebra()
```
*Examples: different styles of timezone annotation*
```{r}
#| label: tbl-datetime2
#| echo: FALSE
#| message: FALSE
#| warnings: FALSE
#| results: 'asis'}
tab2 <-
tibble(
"Description" = c("Date and Time with timezone offset", "Date and Time in UTC"),
"Example" = c("2023-04-03T18:29:38+00:00", "2023-04-03T18:29:38Z"),
"Annotation" = c(
"YYYY-MM-DD[the letter capital T]HH:MM:SS+[Timezone offset]",
"YYYY-MM-DD[capital T]HH:MM:SS+[capital Z to indicate an offset of zero]"
)
)
flextable(tab1) %>%
#set_caption(caption = "") %>%
autofit() %>%
fontsize(size = 9) %>%
theme_zebra()
```
### Top References
- ISO 8601 wiki: <https://en.wikipedia.org/wiki/ISO_8601>
- R package lubridate, OlsonNames()
- Python go-to package, datetime <https://docs.python.org/3/library/datetime.html>
- Article on datetime uncertainty: <https://www.datafix.com.au/BASHing/2020-02-12.html>
- Map of offset from UTC: <https://www.timeanddate.com/time/map/>
- Nice time converter: <https://coastwatch.pfeg.noaa.gov/erddap/convert/time.html>
## Match scientific names to a taxonomic authority
### Why?
- To integrate or aggregate datasets, we need a common frame of reference for taxonomic name
- Provides an anchor for the taxonomy as scientific understanding evolves.
### Key Information
- Definition: As used here, a taxonomic authority is an online resource that maintains up-to-date species-level classification information and provides persistent identifiers for taxonomic classifications. Example: For the species *Balaenoptera borealis* (Lesson, 1828), the WoRMS taxonomic authority ID link is <https://www.marinespecies.org/aphia.php?p=taxdetails&id=137088> and the LSID is `urn:lsid:marinespecies.org:taxname:137088`.
- Use an existing taxonomic authority (e.g. [World Register of Marine Species](https://www.marinespecies.org) , [Integrated Taxonomic Information System](https://itis.gov/) , [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy)) and include the authority who manages said information in your metadata
- List of many authorities can be found here: <https://resolver.globalnames.org/data_sources>
- Make yourself aware of the structure, limits, and history of the authority you are using.
- Adopt standard binomial nomenclature, when possible
- When possible, reference the unique identifier in addition to the nomenclature.
- Always save and document the originally recorded name.
- Put notes about identification uncertainty in a separate column.
- Many authorities have APIs through which you can match names to identifiers.
### Top References
- R packages
- taxize is a taxonomic toolbelt for R. taxize wraps APIs for a large suite of taxonomic databases available on the web\
<https://cran.r-project.org/web/packages/taxize/index.html>
- worrms is an API client for [World Register of Marine Species](http://www.marinespecies.org/)\
<http://cran.nexr.com/web/packages/worrms/vignettes/worrms_vignette.html>
- worms: another API client for WoRMS\
<https://cran.r-project.org/web/packages/worms/index.html>
- Ritis: API client for ITIS \<https://cran.r-project.org/web/packages/ritis/\>
- Python packages
- WoRMS API client\
<https://github.com/iobis/pyworms>
- Global Names Resolver to compare taxonomic concepts across authorities\
<https://resolver.globalnames.org/>
- Article: Recommendations for the Standardisation of Open Taxonomic Nomenclature for Image-Based Identifications\
<https://doi.org/10.3389/fmars.2021.620702>
- TDWG 2022 Keynote: Richard Pyle, "An Introduction to the Scientific Names of Organisms and the Taxon Concepts they Represent"\
<https://www.youtube.com/watch?v=rmTvUUjBxrI>
## Record latitude and longitude in decimal degrees in WGS84
```{r xkcd2, echo = FALSE, out.width="40%", fig.align="center", fig.cap="https://imgs.xkcd.com/comics/coordinate_precision.png"}
knitr::include_graphics('https://imgs.xkcd.com/comics/coordinate_precision.png')
```
### Why?
- Users have to know where you collected this data, which requires a latitude, longitude, reference system and uncertainty.
- Decimal-degrees avoids special symbols (`°` or `‘`) which is preferable for machine readable formats
- `WGS84` is a reference coordinate system that is widely used and incorporated in many GPS units and tools, and recognized as a standard by many government agencies.
### Key Information
- If possible, encourage data providers to confirm, and record, the WGS84 datum prior to data collection.
- Understand and report the device/instrument uncertainty associated with your coordinates because it affects the usability of your data.
- Consider including the vertical component (altitude, depth, height off bottom, elevation, etc)
- Generally speaking, `degrees-minutes-seconds (DMS)` can be converted to `decimal-degrees (DD)` by:
- `DD = d + (min/60) + (sec/3600)`
- Watch out for mixed formats, like degrees, `decimal-minutes (DDM)`.
- Degrees West and South become negative in DD.
- Values for longitude range from `-180` to `180`, inclusive.
- Values for latitude range from `-90` to `90`, inclusive.
*Example Coordinates*
```{r}
#| label: tbl-example_coordinates
#| echo: FALSE
ds <- tibble(
Format = c(
"Decimal Degrees (DD)",
"Degrees Minutes Seconds (DMS)",
"Degrees Decimal Minutes (DM or DDM)"
),
Example = c(
30.50833333,
paste("30\u00B0", "15'", '10 N'),
paste("30\u00B0", "15.1667 N")
)
)
flextable(ds) %>%
#set_caption(caption = "") %>%
autofit() %>%
fontsize(size = 9) %>%
theme_zebra()
```
### Top References
- Existing R/python/ESRI packages/functions
- R - measurements <https://cran.r-project.org/web/packages/measurements/measurements.pdf>
- EML <https://eml.ecoinformatics.org/schema/index.html> (find "bounding Coordinates)
- CF <https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#latitude-coordinate>
- Getting lat/lon to decimal degrees\
<https://ioos.github.io/bio_mobilization_workshop/03-data-cleaning/index.html#getting-latlon-to-decimal-degrees>
- Some background on precision
- <https://www.trekview.org/blog/2021/reading-decimal-gps-coordinates-like-a-computer/#a-note-on-accuracy>
- <https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude>
- DMS to DD calculator\
<https://www.fcc.gov/media/radio/dms-decimal> -- The three most commonly used datums are WGS84, NAD83, and NAD27. A more complete list can be found here: <https://wiki.gis.com/wiki/index.php/Datum_(geodesy)#List_of_Datums)>
## Use persistent unique identifiers
### Why?
- It can be useful to have unique identifiers to unambiguously identify granules of information, e.g. dataset, collection, database, taxonomic concept, etc. This will allow users to precisely refer to the data and allow your data to remain identifiable when aggregated with other datasets.
- To be able to uniquely identify a record in your data system or across data systems. Useful to create relational databases or merge records.
- Although it increases workload, it safeguards against confusion and inefficiency in the future.
### Key Information
- There are good reasons to keep an identifier opaque, i.e. it does not indicate anything about the content of information it points to. However, there are also transparent, or semi-opaque identifiers in use that take advantage of semantics to guide humans as well as machines.
- One way to create a unique identifier is concatenation of sampling event, location, time, enumeration of unique observation or event. (e.g. `Station_95_Date_09JAN1997:14:35:00.000`)
- Some prefer using opaque identifiers. (e.g. `10FC9784-B30F-48ED-8DB5-FF65A2A9934E`)
- If there is an existing persistent unique identifier, it's usually a good idea to use it (i.e. when using a taxonomic authority like WoRMS and applying their LSID).
- It is important to manage any identifiers you create, if they are not managed by an authority (e.g. DOIs).
- Important that it be persistent (consider samples possibly moving between institutions)
*Examples of PIDs*
```{r}
#| label: tbl-PIDExamples
#| echo: FALSE
# Autowrap isn't working in pdfs for flextable when overridden by font size
# Hard returns seem to be the answer
ds <- tibble(
"Type of PID" = c(
"Digital Object Identifier (DOI)",
"International Geo Sample Number (IGSN)",
"Life Science Identifier (LSID)",
"Open Researcher and Contributor ID (ORCID)"
),
"Use Case" = c(
"Actionable persistent link for papers, data, and other digital objects",
"Persistent identifier for physical samples",
"Persistent structured method for biologically significant data",
"Persistent actionable link for individuals"
),
"Example" = c(
"https://doi.org/10.6084/m9.figshare.16806712.v2",
"http://igsn.org/AU1243>",
"urn:lsid:marinespecies.org:taxname:218214",
"https://orcid.org/0000-0002-4391-107X"
)
)
flextable(ds) %>%
#set_caption(caption = "") %>%
#set_table_properties(layout = "autofit") %>%
fontsize(size = 9) %>%
width(j = 1,
width = 1.5) %>%
width(j = 2,
width = 1.5) %>%
width(j = 3,
width = 3) %>%
theme_zebra()
```
### Top References
- Software and Packages to generate uuids:
- R - uuid <https://cran.r-project.org/web/packages/uuid/index.html>
- python - uuid <https://docs.python.org/3/library/uuid.html>
- <http://guid.one/>
- <https://guidgenerator.com/>
- Guidance on how to use GUIDs (Globally Unique Identifiers) to meet specific requirements of the biodiversity information community\
<http://bioimages.vanderbilt.edu/pages/guid-applicability-final-2011-01.pdf>
- Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens\
<https://bsapubs.onlinelibrary.wiley.com/doi/full/10.1002/aps3.1027>
- A Beginner's Guide to Persistent Identifiers\
<http://links.gbif.org/persistent_identifiers_guide_en_v1.pdf>