New mashup vignette for #23
Idea is just to show a simple workflow, and some of the
steps usually needed to join up tips (here as a `phylo4d`
object).

Need to work on the text still, but think this will be the basis of the
vignette
dwinter committed Jul 9, 2015
1 parent 35da0f8 commit 023a4ce
Showing 3 changed files with 150 additions and 118 deletions.
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -14,7 +14,9 @@ S3method(ott_id,match_names)
S3method(ott_id,taxon_lica)
S3method(ott_taxon_name,taxon_info)
S3method(ott_taxon_name,taxon_lica)
S3method(print,found_studies)
S3method(print,gol)
S3method(print,study_meta)
S3method(print,tnrs_contexts)
S3method(print,tol_summary)
S3method(study_list,tol_summary)
148 changes: 148 additions & 0 deletions vignettes/data_mashups.Rmd
@@ -0,0 +1,148 @@
---
title: "Connecting data to Open Tree trees"
author: "David Winter"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Connecting data to Open Tree trees}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

##Combining data from OToL and other sources.

One of the major goals of `rotl` is to help users combine data from other
sources with the phylogenetic trees in the Open Tree database. This example
demonstrates how a user might connect data they have collected to trees from
Open Tree.

##Get Open Tree ids to match your data.

Let's say you have a dataset where each row represents a measurement taken from
one species, and your goal is to put these measurements in some phylogenetic
context. Here's a small example, the best available estimates of the
mutation rate for a set of single-celled eukaryotes:


```{r, data}
csv_path <- system.file("extdata", "protist_mutation_rates.csv", package = "rotl")
mu <- read.csv(csv_path, stringsAsFactors=FALSE)
mu
```

To get started, we want to check that each of these species is known to the Open
Tree project and find its unique ID. We can use the Taxonomic Name
Resolution Service (`tnrs`) functions to do this. It can be useful to provide a
taxonomic context for searches using the `tnrs` functions, so let's see if any
of the available contexts apply to this group:

```{r, context}
tnrs_contexts()
```


It seems not. If all of our species fell into one of those groups we could specify
that group to avoid conflicts between different taxonomic codes. As it is, we can
search using the `All life` context and the function `tnrs_match_names`:

```{r, match}
taxon_search <- tnrs_match_names(mu$species, context_name="All life")
knitr::kable(taxon_search)
```

So, all the species are known to Open Tree. Note, though, that one of the names
is a synonym. _Saccharomyces pombe_ is an older name for what is now called
_Schizosaccharomyces pombe_. As the name suggests, the Taxonomic Name
Resolution Service is designed to deal with these problems (and similar ones
like misspellings), but it is always a good idea to check the results of
`tnrs_match_names` closely to ensure the results are what you expect.
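
One way to do that check is to list the rows where the returned name differs from the name that was searched for. The sketch below runs on a small hand-made stand-in for the search result, so the column names `search_string` and `unique_name` are assumptions about what `tnrs_match_names` returns rather than something confirmed here.

```r
# Toy stand-in for the result of tnrs_match_names(); the column names
# are assumed, and only two of the species are shown.
res <- data.frame(
  search_string = c("tetrahymena thermophila", "saccharomyces pombe"),
  unique_name   = c("Tetrahymena thermophila", "Schizosaccharomyces pombe"),
  stringsAsFactors = FALSE
)

# TRUE where the accepted name is not simply the name we searched for
# (ignoring case), e.g. because of a synonym or a corrected misspelling.
changed <- tolower(res$unique_name) != tolower(res$search_string)

# Show only the rows that deserve a closer look.
res[changed, ]
```

Any row that shows up here is worth comparing against the original data by eye before going further.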

Let's keep our original data in line with the Open Tree names and IDs by adding
them to the `data.frame`:

```{r, munge}
mu$ott_name <- taxon_search$unique_name
mu$ott_id <- taxon_search$ott_id
```
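
The assignment above relies on the search results being in the same row order as the data. A slightly more defensive sketch looks up each row by its search string instead. The two data frames below are toy stand-ins, the lower-cased `search_string` column is an assumption about how the service echoes names back, and the `-1` ID is a placeholder rather than a real Open Tree ID.

```r
# Toy stand-ins for `mu` and `taxon_search`, deliberately in different orders.
mu_demo <- data.frame(
  species = c("Tetrahymena thermophila", "Saccharomyces pombe"),
  stringsAsFactors = FALSE
)
search_demo <- data.frame(
  search_string = c("saccharomyces pombe", "tetrahymena thermophila"),
  unique_name   = c("Schizosaccharomyces pombe", "Tetrahymena thermophila"),
  ott_id        = c(-1L, 180195L),  # -1 is a placeholder, not a real ID
  stringsAsFactors = FALSE
)

# For each row of the data, find the matching row of the search results.
idx <- match(tolower(mu_demo$species), search_demo$search_string)
stopifnot(!anyNA(idx))  # fail early if a species was not matched

mu_demo$ott_name <- search_demo$unique_name[idx]
mu_demo$ott_id   <- search_demo$ott_id[idx]
mu_demo
```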

##Find a tree with your taxa

Now let's find a tree. There are two possible options here: we can search for
published studies that include our taxa or we can use the synthetic tree from
Open Tree.

###Published trees

Let's start by searching for studies. Before we do that, we can
find the names of the various properties of studies or trees that can be used
for searching:

```{r, properties}
studies_properties()
```

We have `ottIds` for our taxa, so let's use those IDs and `studies_find_trees`
to work out how many trees each of our taxa is represented in. Starting with our
first species, _Tetrahymena thermophila_:

```{r taxon_count}
studies_find_trees(property="ot:ottId", value="180195")
```
Well... that's not very promising. We can repeat that process for all of the IDs
to see if the other species are better represented.



```{r, all_taxa_count}
hits <- sapply(mu$ott_id, studies_find_trees, property="ot:ottId")
sapply(hits, length)
```
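
Counts are easier to read when they are labelled by species. The following is only a sketch: `hits_demo` is a made-up list standing in for `hits` (the real return structure of `studies_find_trees` may differ), and the species names are placeholders.

```r
# Made-up stand-in for `hits`: one element per species, holding whatever
# trees were found for it (here just placeholder strings).
hits_demo    <- list(c("tree_a", "tree_b"), character(0), character(0))
species_demo <- c("Species one", "Species two", "Species three")

# vapply() guarantees an integer vector, unlike sapply(), whose return
# type depends on its input.
n_trees <- vapply(hits_demo, length, integer(1))
names(n_trees) <- species_demo

# Species that do not appear in any tree.
names(n_trees)[n_trees == 0]
```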

###A part of the synthesis tree

Most of our species are not in any of the published trees available from Open
Tree. Thankfully, we can still use the complete Tree of Life made from the
combined results of all of those published trees. The function
`tol_induced_subtree` will fetch a tree relating a set of IDs:


```{r subtree, fig.width=7, fig.height=4}
tr <- tol_induced_subtree(ott_ids=mu$ott_id)
plot(tr)
```

###Connect your data to the tips of your tree

The package `phylobase` provides classes for storing phylogenetic trees and
data together. In order to align the tips of our tree with the rows in our
`data.frame` we need to make sure the names match exactly. They don't quite do
that now:

```{r, match_names}
mu$ott_name
tr$tip.label
```

Let's use `sub` to remove the underscores and the `ottId` suffix from the tip
labels of the tree (check out `?regex` to see how these patterns work):

```{r, sub}
tr$tip.label <- sub("_ott\\d+", "", tr$tip.label)
tr$tip.label <- sub("_", " ", tr$tip.label)
tr$tip.label
```
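
To see what these patterns do in isolation, here is a small sketch on made-up labels. The second label is hypothetical and is included only to show that `sub` replaces the first match in each string while `gsub` replaces all of them, which matters for labels that contain more than one underscore.

```r
# Made-up tip labels; the second one is hypothetical.
labels <- c("Tetrahymena_thermophila_ott180195", "Genus_species_subsp_ott1")

# Strip the trailing "_ott" + digits suffix.
no_id <- sub("_ott[0-9]+$", "", labels)

# sub() changes only the first underscore in each label ...
first_only <- sub("_", " ", no_id)

# ... while gsub() changes every underscore.
all_underscores <- gsub("_", " ", no_id)

first_only
all_underscores
```

For plain two-word species names the two approaches give the same result, so `sub` is enough for the data used here.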

OK, now that the tip labels match the names in our data, we can combine the
tree and the data in a single `phylo4d` object:



```{r phylobase}
library(phylobase)
rownames(mu) <- mu$ott_name
tree_data <- phylo4d(tr, mu[,2:4])
```
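
Because the data are lined up with the tree by name, it can be worth confirming that the two sets of names agree before building the combined object. A minimal sketch using base R, with two hand-written vectors standing in for `tr$tip.label` and `rownames(mu)`:

```r
# Stand-ins for tr$tip.label and rownames(mu); order does not matter here.
tip_labels <- c("Tetrahymena thermophila", "Schizosaccharomyces pombe")
data_names <- c("Schizosaccharomyces pombe", "Tetrahymena thermophila")

# Names present on one side but not the other.
missing_from_data <- setdiff(tip_labels, data_names)
missing_from_tree <- setdiff(data_names, tip_labels)

# Stop with an error if either side has unmatched names.
stopifnot(length(missing_from_data) == 0, length(missing_from_tree) == 0)
```

If either vector is non-empty, printing it shows exactly which names still need cleaning up.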
Now we can plot the data:


```{r, fig.width=7, fig.height=4}
plot(tree_data)
```

118 changes: 0 additions & 118 deletions vignettes/how-to-use-rotl.Rmd
@@ -210,122 +210,4 @@ Using `get_study("pg_2550")` would return a `multiPhylo` object (default) with
all the trees associated with this particular study, while
`get_study_tree("pg_2550", "tree5513")` would return one of these trees.

##Combining data from OToL and other sources.

One of the major goals of `rotl` is to help users combine data from other
sources with the phylogenetic trees in OToL. As an example, let's see if we can
combine data from [`fishbase`](http://www.fishbase.org/), using the ROpenSci
package [`rfishbase`](https://github.com/ropensci/rfishbase).


### Find a tree for your focal taxon

Since we are using fishbase in our example, let's focus on one of the most
diverse fish clades, the cyprinids (carps, true minnows and their kin). We could
use `tol_induced_subtree` to get the synthetic tree for this group, or
`studies_find_studies` to see if there are any published papers available for
the group. Let's start with the second approach:

```{r, find_study}
cyp_studies <- studies_find_studies("ot:focalCladeOTTTaxonName", value="Cyprinidae")
cyp_studies
```

So there is at least one study on the cyprinids, we can get some more
information on it using `get_study_meta`:

```{r, meta}
meta <- get_study_meta("pg_1909")
meta
```

`rotl` provides some helper functions to extract elements from this metadata, so
we can get the title and DOI for this study:

```{r, pub}
get_publication(meta)
```

OK, this looks like a good study. Let's get the trees from this study and check
them out:

```{r, get_study_tr}
get_tree_ids(meta)
tr <- get_study_tree(study="pg_1909", tree_id=get_tree_ids(meta))
tr
```
### Now find some data to attach to the tips in the tree

The `rfishbase` package provides us with a local version of the fishbase
database, and some tools to extract information from it. Let's load the data,
and check it out

```{r, fishbase_intro}
library(rfishbase)
data(fishbase)
typeof(fish.data)
length(fish.data)
```
So, `fish.data` is a big list. Each element of the list has data from one
species, with each of the following sub-elements:

```{r, data}
names(fish.data[[1]])
```

We want to match up data in this list to the tips in our tree, unfortunately the
names are formatted slightly differently:

```{r, compare_labels}
tr$tip.label[14]
fish.data[[23210]]$ScientificName
```

So before we can match the tip labels to the data in the list we have to
replace the underscores in the tip labels with spaces. We can then extract the
`ScientificName` from each `fishbase` entry to see if they match something in
our tree:

```{r, fishy_data}
tr$tip.label <- sub("_", " ", tr$tip.label)
fishbase_in_tree <- fish.data[sapply(fish.data, "[[", "ScientificName") %in% tr$tip.label]
length(fishbase_in_tree)
```

###Red Fish Blue Fish

OK, so we have a tree and we have data for 142 of the species in that tree.
Let's drop the tips representing fish we have no data on:

```{r, drop}
to_drop <- tr$tip.label[!tr$tip.label %in% sapply(fishbase_in_tree, "[[", "ScientificName")]
subtree <- drop.tip(tr, to_drop, rooted=TRUE)
subtree
```
We now have a tree with 142 tips, and a dataset with 142 entries, but the two
are in different orders. Let's line them up by extracting the
names from the fishbase data and matching them against the tree:


```{r, fbsorted}
fb_names <- sapply(fishbase_in_tree, "[[", "ScientificName")
fishbase_in_tree_sorted <- fishbase_in_tree[match(fb_names , subtree$tip.label)]
```

What fascinating theory should we test? Let's follow up [Seuss (1960)](https://en.wikipedia.org/wiki/One_Fish_Two_Fish_Red_Fish_Blue_Fish) and explore the distribution of coloration in
these fish. Specifically, we can use the `rfishbase` function `which_fish` to
find red fish and blue fish in our phylogeny:

```{r, redfishbluefish}
bluefish<- which_fish("grey|green|blue", using="diagnostic", fish.data=fishbase_in_tree_sorted)
redfish <- which_fish("red|orange", using="diagnostic", fish.data=fishbase_in_tree_sorted)
```


```{r, plot, fig.width=7, fig.height=4}
cols <- rep("#FFFFFF00", length(subtree$tip.label))
cols[redfish] <- "red"
cols[bluefish] <- "blue"
plot(subtree, show.tip.label=FALSE)
tiplabels(pch=16, col=cols, adj=2)
```
