Finish 1st vignette version

TIBHannover · Mar 20, 2018 · c7d90e9
1 parent c198e1b
commit c7d90e9
Showing 1 changed file with 140 additions and 1 deletion.
diff --git a/vignettes/BacDiveR.Rmd b/vignettes/BacDiveR.Rmd
@@ -1,5 +1,5 @@
 ---
-title: "How to become a BacDiveR"
+title: "BacDive-ing in"
 author: "Katrin Leinweber"
 date: "`r Sys.Date()`"
 output: rmarkdown::html_vignette
@@ -17,3 +17,142 @@ knitr::opts_chunk$set(
   comment = "#>"
 )
 ```
+
+# BacDiveR
+
+This R package provides a programmatic interface for the [Bacterial Diversity
+Metadatabase](https://bacdive.dsmz.de/) of the [DSMZ (German Collection of 
+Microorganisms and Cell Cultures)](https://www.dsmz.de/about-us.html). It helps 
+you download full datasets or just their IDs based on reproducible searches 
+against the BacDive Web Service.
+
+## Reference
+
+> Carola Söhngen, Adam Podstawka, Boyke Bunk, Dorothea Gleim, Anna Vetcininova,
+> Lorenz Christian Reimer, Christian Ebeling, Cezar Pendarovski, Jörg Overmann;
+> BacDive – The Bacterial Diversity Metadatabase in 2016, 
+> [Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, Pages D581–D585](https://academic.oup.com/nar/article/44/D1/D581/2503137), 
+> [doi:10.1093/nar/gkv983](https://doi.org/10.1093/nar/gkv983)
+
+
+## Installation
+
+1. Because the [BacDive API requires registration](https://bacdive.dsmz.de/api/bacdive/registration/register/), please register first and wait for your access to be granted.
+
+2. Once you have your login credentials, install BacDiveR from GitHub with:
+
+```{r gh-installation, eval = FALSE}
+# install.packages("devtools")
+devtools::install_github("katrinleinweber/BacDiveR")
+```
+
+3. After installing, run the following commands to save your login credentials locally:
+
+```{r edit, eval = FALSE}
+file.edit(BacDiveR:::get_Renviron_path())
+```
+
+4. In that file, add your email and password directly after the `=` signs, save it, then restart R(Studio) or run:
+
+```{r read}
+readRenviron(BacDiveR:::get_Renviron_path())
+```
+
+In the following examples, data retrieval will only work if your login credentials are free of typos and were saved correctly. Console output like `"{\"detail\": \"Invalid username/password\"}"` or `Error: $ operator is invalid for atomic vectors` means that either the login credentials or the `.Renviron` file are incorrect. In that case, please repeat steps 2 to 4.
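+
+A quick way to rule out an unset or empty credential before querying the API is sketched below. The helper `missing_credentials()` is hypothetical, and the variable names `BacDive_email` and `BacDive_password` are assumptions; substitute whatever names your `.Renviron` file actually lists.
+
+```{r check_credentials, eval = FALSE}
+# Hypothetical helper: returns the names of environment variables that are
+# unset or empty (both cases are reported as "" by Sys.getenv()).
+missing_credentials <- function(vars) {
+  values <- Sys.getenv(vars, unset = "", names = TRUE)
+  names(values)[!nzchar(values)]
+}
+
+# Assumed variable names; check them against your .Renviron file.
+missing_credentials(c("BacDive_email", "BacDive_password"))
+```
+
+An empty result means both variables are set. It does not prove that the API accepts them.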
+
+
+## Example
+
+The BacDive website lets you easily search for all their strains within a 
+given taxonomic unit. [BacDive.DSMZ.de/index.php?search=Bacillus](https://bacdive.dsmz.de/index.php?search=Bacillus), for example, returns a 
+paginated list of strains that you can then access, download and analyse 
+further. All of that is manual, though. BacDiveR automates this workflow:
+
+```{r taxon_Bac}
+library(BacDiveR)
+taxon_1 <- "Bacillus halodurans"
+Bac_IDs <- retrieve_data(searchTerm = taxon_1) 
+head(Bac_IDs) 
+```
+
+Calling `retrieve_data()` with just a `searchTerm` returns a vector of 
+numeric BacDive IDs. You can use such ID downloads for meta-analyses of different 
+bacterial taxa, such as comparisons of taxon sizes as they are represented in 
+the DSMZ's collection.
+
+### Downloading datasets
+
+In order to analyse the actual datasets, we now need to download them. Suppose 
+we want to compare the optimal growth temperatures of strains from the taxon
+*`r taxon_1`* with those of another one. You can, of course, obtain that data by feeding 
+the ID vector obtained above into a self-made loop that calls `retrieve_data(…, searchType = "bacdive_id")`.
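+
+Such a loop might look like the following sketch. `download_by_ids()` and its `fetch` argument are hypothetical helpers, not part of BacDiveR, and passing each ID as `searchTerm` is an assumption about how the ID lookup is meant to be called.
+
+```{r id_loop, eval = FALSE}
+# Hypothetical helper: downloads one dataset per ID and names the results.
+# `fetch` can be swapped out, e.g. for a stub when no credentials are at hand.
+download_by_ids <- function(ids,
+                            fetch = function(id) {
+                              # Assumed call shape for a single-ID lookup
+                              BacDiveR::retrieve_data(searchTerm = id,
+                                                      searchType = "bacdive_id")
+                            }) {
+  datasets <- lapply(ids, fetch)
+  names(datasets) <- ids
+  datasets
+}
+
+# Usage, once credentials are set up:
+# first_three <- download_by_ids(head(Bac_IDs, 3))
+```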
+
+However, you can save yourself some time and effort by activating the parameter
+`force_taxon_download`. This will get you all taxon data in a single (albeit
+large) list of dataframes. Feel free to take a break while the computers
+do some work for you:
+
+```{r taxon_At}
+taxon_2 <- "Aneurinibacillus thermoaerophilus"
+Bac_data <- retrieve_data(taxon_1, force_taxon_download = TRUE)
+At_data <- retrieve_data(taxon_2, force_taxon_download = TRUE)
+```
+
+
+## Extracting data fields
+
+We wanted the growth temperatures, right? As with any other database field, you 
+now need to determine its path within the list data structure that BacDiveR 
+returned to you. Use either
+
+a) RStudio's `Environment > Data` viewer, or
+a) `str(Bac_data)`, or
+a) your web browser's JSON viewer on the dataset's URL: [BacDive.DSMZ.de/api/bacdive/taxon/`r gsub(" ", "/", taxon_1)`](https://bacdive.dsmz.de/api/bacdive/taxon/`r gsub(" ", "/", taxon_1)`),
+
+to find the `$`-marked path to the field of your interest. In our example, it's `$culture_growth_condition$culture_temp$temp`, which we'll now use to extract that field from all entries in our downloaded datasets.
+
+Multiple steps are necessary here, which could easily result in hard-to-read code if we used the regular assignment operator `<-`, intermediate variables and nested function calls. We will [avoid this with the pipe operator `%>%`](https://cran.r-project.org/package=magrittr). It indicates that 
+
+a) an object is passed into a function as its first argument, and that
+a) the function's output is "piped" into the next function.
+
+Note the ` ~ .x` prepended to the path `$culture_growth_condition$culture_temp$temp`! This is `map()`'s way of indicating that each element in the piped-in `dataset` will be accessed at that path.
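+
+To see both ideas in isolation, here is a minimal sketch on a made-up nested list. `toy` and its `growth$temp` path are invented for illustration and are not real BacDive fields.
+
+```{r pipe_demo}
+library(magrittr)
+
+# Made-up stand-in for a list of downloaded datasets (not real BacDive data)
+toy <- list(
+  a = list(growth = list(temp = "30")),
+  b = list(growth = list(temp = "37"))
+)
+
+# Nested calls have to be read from the inside out ...
+nested <- as.numeric(unlist(purrr::map(toy, function(x) x$growth$temp)))
+
+# ... while the piped version reads top to bottom.
+# `~ .x$growth$temp` is shorthand for `function(x) x$growth$temp`.
+piped <- toy %>%
+  purrr::map(~ .x$growth$temp) %>%
+  unlist() %>%
+  as.numeric()
+
+identical(nested, piped)
+```
+
+The same pattern is applied to the downloaded datasets next.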
+
+```{r extract}
+library(magrittr) 
+ 
+# Pull the growth temperature out of every dataset and label it with its taxon
+extract_temps <- function(dataset, taxon_name) {
+  dataset %>%
+    purrr::map(~ .x$culture_growth_condition$culture_temp$temp) %>%
+    unlist() %>%
+    as.numeric() %>%
+    data.frame(temp_C = ., taxon = rep(taxon_name, length(.)))
+}
+
+temperature_Bac <- extract_temps(Bac_data, taxon_1) 
+temperature_At <- extract_temps(At_data, taxon_2) 
+``` 
+
+Before visualising the data, we need to combine the two datasets into a single data frame.
+
+
+```{r ggplot}
+library("ggplot2")
+
+rbind(temperature_Bac, temperature_At) %>% 
+  ggplot(aes(x = taxon, y = temp_C)) +
+  geom_boxplot(notch = TRUE, varwidth = TRUE) +
+  geom_jitter(height = 0.05, alpha = 0.5) +
+  theme(legend.position = "none")
+```
+
+We thus find that *`r taxon_2`* contains strains with different growth optima (note the groups of data _points_), even up to the 50-something-°C range, as the `thermo` part of its name suggests. On the other hand, all *`r taxon_1`* strains known to BacDive were found to grow best at the lower temperature of `r mean(temperature_Bac$temp_C)`°C. Thanks to the notch in *`r taxon_2`*'s box, we can also say that the median growth temperatures of these two taxa probably differ significantly, even before digging into the numbers:
+
+```{r}
+summary(temperature_At$temp_C)
+```
+
+## Summary
+
+BacDiveR helps you download BacDive data for investigating it offline. Use `?retrieve_data` to learn more about its options.