src/uvabits_to_movebank.Rmd

---
title: "Prepare Uva-BiTS (meta)data for Movebank upload"
author: "Peter Desmet"
date: "`r Sys.Date()`"
output:
  html_document
---

This script prepares [UvA-BiTS](http://www.uva-bits.nl/) bird tracking data for upload to [Movebank](https://www.movebank.org/). For a project, it downloads **reference, GPS and acceleration data** from UvA-BiTS and transforms these to the [Movebank data format](https://www.movebank.org/node/2381) using [SQL](sql). See the steps at the end of this document to check for outliers and to concatenate files before upload.

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

Load libraries:

```{r}
library(tidyverse)
library(DBI)
library(RPostgres)
library(config)
library(here)
library(glue)
# For RPostgres, see https://github.com/r-dbi/RPostgres/issues/181#issuecomment-438700465
```

## Connect to UvA-BiTS database

Get connection settings from `config.yml` (not included in this repo) and connect to database:

```{r connect_to_uvabits}
uvabits <- config::get("uvabits")
con = DBI::dbConnect(
  drv = RPostgres::Postgres(),
  dbname = uvabits$database,
  host = uvabits$server,
  port = uvabits$port,
  user = uvabits$username,
  password = uvabits$password
)

# The alternative is getting the connection defined in odbc.ini:
# con <- DBI::dbConnect(odbc::odbc(), "uvabits")
```

## Set user-defined values

Some values cannot be automatically assumed for the Movebank format and have to be defined by the user. See [here](https://www.movebank.org/node/2381#metadata) for information on all reference data terms.

Set values:

```{r}
project_id <- "O_WESTERSCHELDE"
shared <- FALSE
overwrite_files <- FALSE
animal_life_stage <- "adult"
manipulation_type <- "none"
bird_remarks_is_nickname <- TRUE
start_date <- "2019-01-01 00:00:00"
end_date <- "2019-12-31 24:00:00"
```

## Get reference data

For a project, get data on animals, tags and deployments in the [Movebank reference data format](https://www.movebank.org/node/27).

Query data:

```{r get_ref_data}
movebank_ref_sql <- glue::glue_sql(readr::read_file(here::here("sql", "uvabits_to_movebank_ref.sql")), .con = con)
movebank_ref_data <- DBI::dbGetQuery(con, movebank_ref_sql)
```

Save to CSV:

```{r}
readr::write_csv(movebank_ref_data, file = here::here("data", "processed", project_id, "movebank_ref_data.csv"), na = "")
```

### Note on track session remarks

Text in `track_session.remarks` is parsed to different Movebank fields using regular expressions and should follow a certain format:

```
Waterland-Oudeman | wing_length: 385 | tarsus_length: 71 | Tracker reused from L143467, assumed dead.

# Release location only
Waterland-Oudeman

# Measurements only
wing_length: 385 | tarsus_length: 71

# Remarks only
 | Tracker reused from L143467, assumed dead.
```

- `study-site` = `Waterland-Oudeman` (the release location). Write at start of `remarks`. Start with capital letter, don't use spaces.
- `deploy-on-measurements` = `wing_length: 385 | tarsus_length: 71`. Write as `key_name: value` pairs, where `key_name` is lower snake_case and pairs are separated by ` | `.
- `deployment-comments` = `Tracker reused from L143467, assumed dead.`. Write at end of `remarks` and end with `.`. Don't use `:`.
- `deployment-end-type` = `dead`. The words `dead`, `defect`, `malfunction` in `remarks` are used to map to specific types.

## Get ring numbers

Get individuals with track sessions starting before the `end_date` (avoids unnecessary downloads for individuals with later tracking sessions) by ring number (stored as `animal-id`). Note that ring number is a better identifier than `tag-id` as it is unique to an individual (and cannot be reused).

```{r}
ring_numbers <- movebank_ref_data %>%
  dplyr::filter(`alt-project-id` == project_id) %>%
  dplyr::filter(`deploy-on-timestamp` <= end_date) %>%
  dplyr::distinct(`animal-id`, .keep_all = TRUE) %>%
  dplyr::arrange(`animal-id`) %>%
  dplyr::pull(`animal-id`)
```

Load custom function to download data (will iterate over ring numbers):

```{r}
source(here::here("src", "functions", "download_data.R"))
```

## Get GPS data

Query and save data (one csv file per ringnumber):

```{r}
download_data(
  sql_file = here::here("sql", "uvabits_to_movebank_gps.sql"),
  download_directory = here::here("data", "processed", project_id, "gps"),
  download_filename = "movebank_gps",
  ring_numbers = ring_numbers,
  shared = shared,
  con = con,
  overwrite = overwrite_files
)
```

## Get acceleration data

Query and save data (one csv file per ringnumber):

```{r}
download_data(
  sql_file = here::here("sql", "uvabits_to_movebank_acc.sql"),
  download_directory = here::here("data", "processed", project_id, "acc"),
  download_filename = "movebank_acc",
  ring_numbers = ring_numbers,
  shared = shared,
  con = con,
  overwrite = overwrite_files
)
```

## Outlier detection

After download, run `outliers.Rmd` to mark outliers based on speed.

## Upload to Movebank

For smaller projects (e.g. MH_WATERLAND), files can be uploaded to Movebank as they are queried here: per individual (encompassing all years). To update the dataset on Movebank, just query all years again and replace the file on Movebank.

For larger projects (e.g. HG_OOSTENDE), it is better to upload per year, as data won't change for previous years. This can be done by using the `start_date`/`end_date` filters above and concatenating the resulting individual files. To update the dataset on Movebank, start from the last uploaded year (as it might contain new records), replace that one on Movebank, and add the new years as new files. The maximum upload limit per file is around 1.2 GB, so large files might need to be broken up (especially acceleration data).

To concatenate files, use:

```
awk 'FNR > 1' gps_outliers/movebank_gps_*.csv > movebank_gps_yyyy_nh.csv
```

After which the header needs to be added with:

```
cat movebank_gps_outliers_headers.csv movebank_gps_yyyy_nh.csv > movebank_gps_yyyy.csv
```