New mashup vignette for #23

Idea is just to show a simple workflow, and some of the steps usually needed to join up tips (here as a `phylo4d` object). Need to work on the text still, but think this will be the basis of the vignette
ropensci · Jul 9, 2015 · 023a4ce · 023a4ce
1 parent 35da0f8
commit 023a4ce
Showing 3 changed files with 150 additions and 118 deletions.
diff --git a/NAMESPACE b/NAMESPACE
@@ -14,7 +14,9 @@ S3method(ott_id,match_names)
 S3method(ott_id,taxon_lica)
 S3method(ott_taxon_name,taxon_info)
 S3method(ott_taxon_name,taxon_lica)
+S3method(print,found_studies)
 S3method(print,gol)
+S3method(print,study_meta)
 S3method(print,tnrs_contexts)
 S3method(print,tol_summary)
 S3method(study_list,tol_summary)

diff --git a/vignettes/data_mashups.Rmd b/vignettes/data_mashups.Rmd
@@ -0,0 +1,148 @@
+---
+title: "Connecting data to Open Tree trees"
+author: "David Winter"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Connecting data to Open Tree trees}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+##Combining data from OToL and other sources. 
+
+One of the major goals of `rotl` is to help users combine data from other
+sources with the phylogenetic trees in the Open Tree database. This example
+demonstrates how a user might connect data they have collected to trees from
+Open Tree. 
+
+##Get Open Tree ids to match your data. 
+
+Let's say you have a dataset where each row represents a measurement taken from
+one species, and your goal is to put these measurements in some phylogenetic
+context. Here's a small example: the best available estimates of the
+mutation rate for a set of single-celled eukaryotes:
+
+
+```{r, data}
+csv_path <- system.file("extdata", "protist_mutation_rates.csv", package = "rotl")
+mu <- read.csv(csv_path, stringsAsFactors=FALSE)
+mu
+```
+
+To get started, we want to check that each of these species is known to the Open
+Tree project and find its unique ID. We can use the Taxonomic Name
+Resolution Service (`tnrs`) functions to do this. It can be useful to provide a
+taxonomic context for searches using the `tnrs` functions, so let's see if any of
+the available contexts apply to this group:
+
+```{r, context}
+tnrs_contexts()
+```
+
+
+None of them fits. If all of our species fell into one of those groups we could
+specify that group to avoid conflicts between different nomenclatural codes. As
+it is, we can search using the `All life` context and the function
+`tnrs_match_names`:
+
+```{r, match}
+taxon_search <- tnrs_match_names(mu$species, context_name="All life")
+knitr::kable(taxon_search)
+```
+
+So, all the species are known to Open Tree. Note, though, that one of the names
+is a synonym. _Saccharomyces pombe_ is an older name for what is now called
+_Schizosaccharomyces pombe_. As the name suggests, the Taxonomic Name
+Resolution Service is designed to deal with these problems (and similar ones
+like misspellings), but it is always a good idea to check the results of
+`tnrs_match_names` closely to ensure the results are what you expect.
+
+Let's keep our original data in line with the Open Tree names and IDs by adding
+them to the `data.frame`:
+
+```{r, munge}
+mu$ott_name <- taxon_search$unique_name
+mu$ott_id <- taxon_search$ott_id
+```
+
+##Find a tree with your taxa
+
+Now let's find a tree. There are two possible options here: we can search for
+published studies that include our taxa or we can use the synthetic tree from
+Open Tree. Let's start by searching for studies. Before we do that, we can
+list the names of the various properties of studies or trees that can be used
+for searching:
+
+###Published trees
+
+```{r, properties}
+studies_properties()
+```
+
+We have OTT IDs for our taxa, so let's search on the `ot:ottId` property with
+`studies_find_trees` to work out how many trees each of our taxa is represented
+in. Starting with our first species, _Tetrahymena thermophila_:
+
+```{r taxon_count}
+studies_find_trees(property="ot:ottId", value="180195")
+```
+Well... that's not very promising. We can repeat that process for all of the IDs
+to see if the other species are better represented.
+
+
+
+```{r, all_taxa_count}
+hits <- sapply(mu$ott_id, studies_find_trees, property="ot:ottId")
+sapply(hits, length)
+```
+
+###A part of the synthesis tree
+
+Most of our species are not in any of the published trees available from Open
+Tree. Thankfully, we can still use the complete Tree of Life made from the
+combined results of all of those published trees. The function
+`tol_induced_subtree` will fetch a tree relating a set of IDs:
+
+
+```{r subtree,  fig.width=7, fig.height=4}
+tr <- tol_induced_subtree(ott_ids=mu$ott_id)
+plot(tr)
+```
+
+###Connect your data to the tips of your tree
+
+The package `phylobase` provides classes for storing phylogenetic trees and 
+data together. In order to align the tips of our tree with the rows in our
+`data.frame` we need to make sure the names match exactly. They don't quite do
+that now:
+
+```{r, match_names}
+mu$ott_name
+tr$tip.label
+```
+
+Let's use `sub` to remove the `ottId` suffix and the underscores from the tip
+labels (check out `?regex` to see how these patterns work):
+
+```{r, sub}
+tr$tip.label <- sub("_ott\\d+", "", tr$tip.label)
+tr$tip.label <- sub("_", " ", tr$tip.label)
+tr$tip.label
+```
+
+OK, now that the tip labels match the names in our data, we can combine the tree
+and the data into a single `phylo4d` object:
+
+
+
+```{r phylobase}
+library(phylobase)
+rownames(mu) <- mu$ott_name
+tree_data <- phylo4d(tr, mu[,2:4])
+```
+Now we can plot the data:
+
+
+```{r,  fig.width=7, fig.height=4}
+plot(tree_data)
+```
+
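The vignette joins tips to data by making the names match exactly. A minimal base-R sketch of that step, with an explicit check before the two are combined, might look like the following. The tip labels, IDs, rates and the `dat` data frame are made up for illustration and are not taken from the package data. Using `gsub` instead of `sub` for the underscores is a substitution on my part, so that labels with more than one underscore would also be handled.

```r
# Hypothetical tip labels in the "Genus_species_ott<digits>" style that the
# vignette's `sub` calls assume; the names and ids here are invented.
tip_labels <- c("Tetrahymena_thermophila_ott111",
                "Schizosaccharomyces_pombe_ott222",
                "Paramecium_tetraurelia_ott333")

# A toy data.frame standing in for `mu`, deliberately in a different order.
dat <- data.frame(
  ott_name = c("Paramecium tetraurelia",
               "Tetrahymena thermophila",
               "Schizosaccharomyces pombe"),
  rate = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Strip the trailing "_ott<digits>" suffix, then turn every remaining
# underscore into a space.
clean <- gsub("_", " ", sub("_ott[0-9]+$", "", tip_labels))

# Names present on one side but not the other; both should be empty
# before the tree and the data are combined.
missing_from_data <- setdiff(clean, dat$ott_name)
missing_from_tree <- setdiff(dat$ott_name, clean)

# Reorder the rows so that row i describes tip i, and use the names as
# row names so they can be matched against tip labels.
aligned <- dat[match(clean, dat$ott_name), ]
rownames(aligned) <- aligned$ott_name
```

If either `missing_from_data` or `missing_from_tree` is non-empty, it is probably worth inspecting those names by hand before building the combined object.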
diff --git a/vignettes/how-to-use-rotl.Rmd b/vignettes/how-to-use-rotl.Rmd
@@ -210,122 +210,4 @@ Using `get_study("pg_2550")` would returns a `multiPhylo` object (default) with
 all the trees associated with this particular study, while
 `get_study_tree("pg_2550", "tree5513")` would return one of these trees.
 
-##Combining data from OToL and other sources. 
 
-One of the major goals of `rotl` is to help users combine data from other
-sources with the phylogenetic trees in OToL. As an example, let's see if we can
-combine data from [`fishbase`](http://www.fishbase.org/), using the ROpenSci
-package [`rfishbase`](https://github.com/ropensci/rfishbase). 
-
-
-### Find a tree for your focal taxon
-
-Since we are using fishbase in our example, let's focus on one of the most
-diverse fish clades, the cyprinids (carps, true minnows and their kin). We could 
-use `tol_induced_subtree` to get the synthetic tree for this group, or
-`studies_find_studies` to see if there are any published papers available for
-the group. Let's start with the second approach:
-
-```{r, find_study}
-cyp_studies <- studies_find_studies("ot:focalCladeOTTTaxonName", value="Cyprinidae")
-cyp_studies
-```
-
-So there is at least one study on the cyprinids, we can get some more
-information on it using `get_study_meta`:
-
-```{r, meta}
-meta <- get_study_meta("pg_1909")
-meta
-```
-
-`rotl` provides some helper functions to extract elements from this metadata, so
-we can the title and DOI for this study:
-
-```{r, pub}
-get_publication(meta)
-```
-
-OK, this look like a good study. Let's get the trees from this study and check
-them out:
-
-```{r, get_study_tr}
-get_tree_ids(meta)
-tr <- get_study_tree(study="pg_1909", tree_id=get_tree_ids(meta))
-tr
-```
-### Now find some data to attach to the tips in the tree
-
-The `rfishbase` package provides us with a local version of the fishbase
-database, and some tools to extract information from it. Let's load the data,
-and check it out
-
-```{r, fishbase_intro}
-library(rfishbase)
-data(fishbase)
-typeof(fish.data)
-length(fish.data)
-```
-So, `fish.data` is a big list. Each element of the list has data from one
-species, with each of the following sub-elements:
-
-```{r, data}
-names(fish.data[[1]])
-```
-
-We want to match up data in this list to the tips in our tree, unfortunately the
-names are formatted slightly differently:
-
-```{r, compare_labels}
-tr$tip.label[14]
-fish.data[[23210]]$ScientificName
-```
-
-So before we can match the tip labels to the data in the list we have to
-substitute the underscores in the tip labels for spaces. We can then extract the
-`SceintificName` from each `fishbase` entry to see if they match something in
-our tree:
-
-```{r, fishy_data}
-tr$tip.label <- sub("_", " ", tr$tip.label)
-fishbase_in_tree <- fish.data[sapply(fish.data, "[[", "ScientificName") %in% tr$tip.label]
-length(fishbase_in_tree)
-```
-
-###Red Fish Blue Fish
-
-OK, so we have a tree and we have data for 142 of the species in that tree.
-Let's drop the tips representing fish we have no data on:
-
-```{r, drop}
-to_drop <- tr$tip.label[!tr$tip.label %in% sapply(fishbase_in_tree, "[[", "ScientificName")]
-subtree <- drop.tip(tr, to_drop, rooted=TRUE)
-subtree
-```
-We now have a tree with 142 tips, and a datset with 142 entries, but each one of
-those datasets is in a different order. Let's line them up by extracting the
-names from the fishbase data and matching them against the tree:
-
-
-```{r, fbsorted}
-fb_names <- sapply(fishbase_in_tree, "[[", "ScientificName")
-fishbase_in_tree_sorted <- fishbase_in_tree[match(fb_names , subtree$tip.label)]
-```
-
-fascinating theory should we test? Let's follow up [Suess (1960)](https://en.wikipedia.org/wiki/One_Fish_Two_Fish_Red_Fish_Blue_Fish) and explore the distribution of coloration in 
-these fish. Specifically, we can use the `rfishbase` function `which_fish` to
-find red fish and blue fish in our phylogeny:
-
-```{r, redfishbluefish}
-bluefish<- which_fish("grey|green|blue", using="diagnostic", fish.data=fishbase_in_tree_sorted)
-redfish <- which_fish("red|orange", using="diagnostic", fish.data=fishbase_in_tree_sorted)
-```
-
-
-```{r, plot,  fig.width=7, fig.height=4}
-cols <- rep("#FFFFFF00", length(subtree$tip.label))
-cols[redfish] <- "red"
-cols[bluefish] <- "blue"
-plot(subtree, show.tip.label=F)
-tiplabels(pch=16, col=cols, adj=2)
-```