Finish 1st vignette version
katrinleinweber committed Mar 20, 2018
1 parent c198e1b commit c7d90e9
Showing 1 changed file with 140 additions and 1 deletion.
141 changes: 140 additions & 1 deletion vignettes/BacDiveR.Rmd
@@ -1,5 +1,5 @@
---
title: "How to become a BacDiveR"
title: "BacDive-ing in"
author: "Katrin Leinweber"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
@@ -17,3 +17,142 @@ knitr::opts_chunk$set(
comment = "#>"
)
```

# BacDiveR

This R package provides a programmatic interface for the [Bacterial Diversity
Metadatabase](https://bacdive.dsmz.de/) of the [DSMZ (German Collection of
Microorganisms and Cell Cultures)](https://www.dsmz.de/about-us.html). It helps
you download full datasets or just their IDs based on reproducible searches
against the BacDive Web Service.

## Reference

> Carola Söhngen, Adam Podstawka, Boyke Bunk, Dorothea Gleim, Anna Vetcininova,
> Lorenz Christian Reimer, Christian Ebeling, Cezar Pendarovski, Jörg Overmann;
> BacDive – The Bacterial Diversity Metadatabase in 2016,
> [Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, Pages D581–D585](https://academic.oup.com/nar/article/44/D1/D581/2503137),
> [doi:10.1093/nar/gkv983](https://doi.org/10.1093/nar/gkv983)

## Installation

1. Because the [BacDive API requires registration](https://bacdive.dsmz.de/api/bacdive/registration/register/), please register first and wait for your access to be granted.

2. Once you have your login credentials, install BacDiveR from GitHub with:

```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("katrinleinweber/BacDiveR")
```

3. After installing, run the following command to open the file in which your login credentials will be saved locally:

```{r edit, eval = FALSE}
file.edit(BacDiveR:::get_Renviron_path())
```

4. In that file, add your email and password directly after the `=` signs, save it, then restart R(Studio) or run:

```{r read}
readRenviron(BacDiveR:::get_Renviron_path())
```

In the following examples, data retrieval will only work if your login credentials are free of typos and were saved correctly. Console output like `"{\"detail\": \"Invalid username/password\"}"` or `Error: $ operator is invalid for atomic vectors` means that either the login credentials or the `.Renviron` file are incorrect. In that case, please repeat steps 2 to 4.
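
Before repeating those steps, it can help to check whether R sees the variables at all. The following is only a minimal sketch: `missing_env_vars()` is a hypothetical helper, not part of BacDiveR, and the two variable names are placeholders that you would replace with the names found in your own `.Renviron` file.

```{r check-credentials}
# Hypothetical helper: returns those of the given environment variables
# that are unset or empty in the current R session.
missing_env_vars <- function(var_names) {
  values <- Sys.getenv(var_names, unset = "")
  var_names[!nzchar(values)]
}

# Placeholder names (an assumption); use the names from your .Renviron file.
missing_env_vars(c("EXAMPLE_EMAIL", "EXAMPLE_PASSWORD"))
```

An empty result means that all listed variables are set. It does not tell you whether the values themselves are correct.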


## Example

The BacDive website lets you easily search for all their strains within a
given taxonomic unit. [BacDive.DSMZ.de/index.php?search=Bacillus](https://bacdive.dsmz.de/index.php?search=Bacillus), for example, returns a
paginated list of strains that you can then access, download and analyse
further. All of that is manual work, though. BacDiveR automates this workflow:

```{r taxon_Bac}
library(BacDiveR)
taxon_1 <- "Bacillus halodurans"
Bac_IDs <- retrieve_data(searchTerm = taxon_1)
head(Bac_IDs)
```

Calling `retrieve_data()` with just a `searchTerm` results in a vector of
numeric BacDive IDs. You can use such ID downloads for meta-analyses of different
bacterial taxa, such as comparisons of taxon sizes as they are represented in
the DSMZ's collection.
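
Such a size comparison could look roughly like the sketch below. The ID vectors are made-up placeholders standing in for the results of separate `retrieve_data()` searches, and the taxon names are hypothetical.

```{r taxon-sizes}
# Placeholder ID vectors (not real BacDive IDs), one per taxon searched
id_vectors <- list(
  "Taxon A" = c(101, 102, 103),
  "Taxon B" = c(201, 202)
)

# One row per taxon, with the number of IDs found for it
taxon_sizes <- data.frame(
  taxon = names(id_vectors),
  n_strains = unname(lengths(id_vectors))
)
taxon_sizes
```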

### Downloading datasets

In order to analyse the actual datasets, we now need to download them. Suppose
we want to compare the optimal growth temperatures of strains from the taxon
*`r taxon_1`* with those of another one. You can of course obtain that data by feeding
the ID vector obtained above into a self-made loop that calls `retrieve_data(…, searchType = "bacdive_id")`.
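
A self-made loop of that kind might be sketched as follows. `download_each()` and `fake_fetch()` are hypothetical helpers introduced only for this illustration. The commented-out call assumes that `retrieve_data()` accepts a single ID as its `searchTerm`, which you should verify with `?retrieve_data`.

```{r id-loop}
# Hypothetical helper: calls `fetch` once per ID and collects the results
# in a list named after the IDs.
download_each <- function(ids, fetch) {
  datasets <- lapply(ids, fetch)
  names(datasets) <- as.character(ids)
  datasets
}

# Real use (needs valid credentials and network access):
# Bac_data <- download_each(
#   Bac_IDs,
#   function(id) retrieve_data(searchTerm = id, searchType = "bacdive_id")
# )

# Offline stand-in for retrieve_data(), only to show the shape of the result
fake_fetch <- function(id) list(id = id)
str(download_each(c(1, 2), fake_fetch))
```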

However, you can save yourself some time and effort by activating the parameter
`force_taxon_download`. This will get you all taxon data in a single (albeit
large) list of dataframes. Feel free to take a break while the computers
do some work for you:

```{r taxon_At}
taxon_2 <- "Aneurinibacillus thermoaerophilus"
Bac_data <- retrieve_data(taxon_1, force_taxon_download = TRUE)
At_data <- retrieve_data(taxon_2, force_taxon_download = TRUE)
```


## Extracting data fields

We wanted the growth temperatures, right? As with any other database field, you
now need to determine its path within the list data structure that BacDiveR
returned to you. Use either

a) RStudio's `Environment > Data` viewer, or
a) `str(Bac_data)`, or
a) your web browser's JSON viewer on the dataset's URL: [BacDive.DSMZ.de/api/bacdive/taxon/`r gsub(" ", "/", taxon_1)`](https://bacdive.dsmz.de/api/bacdive/taxon/`r gsub(" ", "/", taxon_1)`),

to find the `$`-marked path to the field of your interest. In our example, it's `$culture_growth_condition$culture_temp$temp`, which we'll now use to extract that field from all entries in our downloaded datasets.
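
To see how such a path maps onto a nested list, consider the small sketch below. The `entry` object is a simplified, made-up stand-in for one downloaded dataset. Real entries contain many more fields, and the exact types inside them are an assumption here.

```{r path-demo}
# Placeholder entry that mimics only the branch used in this vignette;
# the temperature values are made up.
entry <- list(
  culture_growth_condition = list(
    culture_temp = list(temp = c("30", "37"))
  )
)

# str() prints the nesting, from which the `$` path can be read off
str(entry)

# Following the path returns the field's content
entry$culture_growth_condition$culture_temp$temp
```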

Multiple steps are necessary here, which could easily result in hard-to-read code if we used the regular assignment operator `<-`, intermediate variables and nested function calls. We will [avoid this with the pipe operator `%>%`](https://cran.r-project.org/package=magrittr). It indicates that

a) an object is passed into a function as its first argument, and that
a) the function's output is "piped" into the next function.
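
As a small illustration that is unrelated to BacDive data, the two expressions below compute the same result, once with nested calls and once with the pipe. The input values are arbitrary.

```{r pipe-demo}
library(magrittr)

values <- c(1, 4, 4)

# Nested calls are read from the inside out
nested <- sqrt(sum(values))

# Piped calls are read from left to right
piped <- values %>% sum() %>% sqrt()

identical(nested, piped)
```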

Note the ` ~ .x` prepended to the path `$culture_growth_condition$culture_temp$temp`! This is `map()`'s way of indicating that each element in the piped-in `dataset` will be accessed at that path.

```{r extract}
library(magrittr)
extract_temps <- function(dataset, taxon_name) {
  dataset %>%
    purrr::map(~ .x$culture_growth_condition$culture_temp$temp) %>%
    unlist() %>%
    as.numeric() %>%
    data.frame(temp_C = ., taxon = rep(taxon_name, length(.)))
}
temperature_Bac <- extract_temps(Bac_data, taxon_1)
temperature_At <- extract_temps(At_data, taxon_2)
```
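
If the `~ .x` shorthand is unfamiliar, the following sketch may help. It applies the same path to made-up placeholder datasets, once with the formula shorthand and once with an ordinary anonymous function. `mock_data` is an assumption about the rough shape of the downloaded list, not real BacDive output.

```{r map-demo}
# Placeholder datasets with made-up temperatures
mock_data <- list(
  strain_1 = list(culture_growth_condition = list(
    culture_temp = list(temp = "30")
  )),
  strain_2 = list(culture_growth_condition = list(
    culture_temp = list(temp = c("37", "45"))
  ))
)

# Formula shorthand: `.x` stands for the current list element
with_formula <- purrr::map(
  mock_data,
  ~ .x$culture_growth_condition$culture_temp$temp
)

# The same extraction with an explicit anonymous function
with_function <- lapply(
  mock_data,
  function(entry) entry$culture_growth_condition$culture_temp$temp
)

# Flatten the per-strain results and convert them to numbers
as.numeric(unlist(with_formula))
```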

Before visualising the data, we need to combine the two datasets into a single dataframe.


```{r ggplot}
library("ggplot2")
rbind(temperature_Bac, temperature_At) %>%
  ggplot(aes(x = taxon, y = temp_C)) +
  geom_boxplot(notch = TRUE, varwidth = TRUE) +
  geom_jitter(height = 0.05, alpha = 0.5) +
  theme(legend.position = "none")
```

And thus we find that *`r taxon_2`* contains strains with different growth optima (note the groups of data _points_), some even in the 50-something-°C range, as the `thermo` part of its name suggests. On the other hand, all *`r taxon_1`* strains known to BacDive were found to grow best at the lower temperature of `r mean(temperature_Bac$temp_C)`°C. Thanks to the notch in *`r taxon_2`*'s box, we can also say that the median growth temperatures of these two taxa probably differ significantly, even before digging into the numbers:

```{r}
summary(temperature_At$temp_C)
```
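
The notch comparison can also be checked numerically. The sketch below relies on base R's `boxplot.stats()`, whose `conf` element holds the notch limits (median ± 1.58 × hinge spread / √n). ggplot2 may compute its notches from slightly different quartiles, so treat this as an approximation. `notch_interval()` and `notches_overlap()` are hypothetical helpers, and the temperatures are made up.

```{r notch-check}
# Notch limits as computed by base R for a numeric vector
notch_interval <- function(x) {
  grDevices::boxplot.stats(x[!is.na(x)])$conf
}

# Two intervals overlap if each one starts before the other one ends
notches_overlap <- function(a, b) {
  ia <- notch_interval(a)
  ib <- notch_interval(b)
  ia[1] <= ib[2] && ib[1] <= ia[2]
}

# Made-up temperatures, only to exercise the helpers
cooler <- c(30, 30, 30, 37, 37)
warmer <- c(45, 50, 55, 55, 60)

notch_interval(cooler)
notch_interval(warmer)
notches_overlap(cooler, warmer)
```

With the vignette's data, you would pass `temperature_Bac$temp_C` and `temperature_At$temp_C` instead. Keep in mind that a group whose values are all identical has a notch of zero width.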

## Summary

BacDiveR helps you download BacDive data so that you can investigate it offline. Use `?retrieve_data` to learn more about its options.
