-
Notifications
You must be signed in to change notification settings - Fork 6
/
README.Rmd
247 lines (181 loc) · 7.04 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
hook_output <- knitr::knit_hooks$get("output")
knitr::knit_hooks$set(output = function(x, options) {
lines <- options$output.lines
if (is.null(lines)) {
return(hook_output(x, options))
}
x <- unlist(strsplit(x, "\n"))
more <- "..."
if (length(lines) == 1) {
if (length(x) > lines) {
x <- c(head(x, lines), more)
}
} else {
x <- c(if (abs(lines[1]) > 1 | lines[1] < 0) more else NULL,
x[lines],
if (length(x) > lines[abs(length(lines))]) more else NULL
)
}
x <- paste(c(x, ""), collapse = "\n")
hook_output(x, options)
})
options(reutils.api.key = NULL)
options(reutils.rcurl.connecttimeout = 25)
library(reutils)
```
# reutils
[![Travis-CI Build Status](https://travis-ci.org/gschofl/reutils.svg?branch=master)](https://travis-ci.org/gschofl/reutils)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/gschofl/reutils?branch=master&svg=true)](https://ci.appveyor.com/project/gschofl/reutils)
[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/reutils)](http://cran.r-project.org/web/packages/reutils/index.html)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/reutils)](http://cran.r-project.org/package=reutils)
`reutils` is an R package for interfacing with NCBI databases such as PubMed,
Genbank, or GEO via the Entrez Programming Utilities
([EUtils](https://www.ncbi.nlm.nih.gov/books/NBK25501/)). It provides access to the
nine basic *eutils*: `einfo`, `esearch`, `esummary`, `epost`, `efetch`, `elink`,
`egquery`, `espell`, and `ecitmatch`.
Please check the relevant
[usage guidelines](https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen)
when using these services. Note that Entrez server requests are subject to frequency limits.
Consider obtaining an NCBI API key if are a heavy user of E-utilities.
## Installation
You can install the released version of reutils from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("reutils")
```
Install the development version from `github` using the `devtools` package.
```r
require("devtools")
install_github("gschofl/reutils")
```
Please post feature or support requests and bugs at the [issues tracker for the reutils package](https://github.com/gschofl/reutils/issues) on GitHub.
## Important functions ##
With nine E-Utilities, NCBI provides a programmatical interface to the Entrez query and database system for searching and retrieving requested data
Each of these tools corresponds to an `R` function in the reutils package described below.
#### `esearch` ####
`esearch`: search and retrieve a list of primary UIDs or the NCBI History
Server information (queryKey and webEnv). The objects returned by `esearch`
can be passed on directly to `epost`, `esummary`, `elink`, or `efetch`.
#### `efetch` ####
`efetch`: retrieve data records from NCBI in a specified retrieval type
and retrieval mode as given in this
[table](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly). Data are returned as XML or text documents.
#### `esummary` ####
`esummary`: retrieve Entrez database summaries (DocSums) from a list of primary UIDs (Provided as a character vector or as an `esearch` object)
#### `elink` ####
`elink`: retrieve a list of UIDs (and relevancy scores) from a target database
that are related to a set of UIDs provided by the user. The objects returned by
`elink` can be passed on directly to `epost`, `esummary`, or `efetch`.
#### `einfo` ####
`einfo`: provide field names, term counts, last update, and available updates
for each database.
#### `epost` ####
`epost`: upload primary UIDs to the users's Web Environment on the Entrez
history server for subsequent use with `esummary`, `elink`, or `efetch`.
## Examples
### `esearch`: Searching the Entrez databases ###
Let's search PubMed for articles with Chlamydia psittaci in the title that have been published in 2013 and retrieve a list of PubMed IDs (PMIDs).
```{r, eval=TRUE, echo=TRUE}
pmid <- esearch("Chlamydia psittaci[titl] and 2013[pdat]", "pubmed")
pmid
```
Alternatively we can collect the PMIDs on the history server.
```{r, eval=TRUE, echo=TRUE}
pmid2 <- esearch("Chlamydia psittaci[titl] and 2013[pdat]", "pubmed", usehistory = TRUE)
pmid2
```
We can also use `esearch` to search GenBank. Here we do a search for polymorphic membrane
proteins (PMPs) in Chlamydiaceae.
```{r, eval=TRUE, echo=TRUE}
cpaf <- esearch("Chlamydiaceae[orgn] and PMP[gene]", "nucleotide")
cpaf
```
Some accessors for `esearch` objects
```{r, eval=TRUE, echo=TRUE}
getUrl(cpaf)
```
```{r, eval=TRUE, echo=TRUE}
getError(cpaf)
```
```{r, eval=TRUE, echo=TRUE}
database(cpaf)
```
Extract a vector of GIs:
```{r, eval=TRUE, echo=TRUE}
uid(cpaf)
```
Get query key and web environment:
```{r, eval=TRUE, echo=TRUE}
querykey(pmid2)
```
```{r, eval=TRUE, echo=TRUE}
webenv(pmid2)
```
Extract the content of an EUtil request as XML.
```{r, eval=TRUE, echo=TRUE}
content(cpaf, "xml")
```
Or extract parts of the XML data using the reference class method `#xmlValue()` and
an XPath expression:
```{r, eval=TRUE, echo=TRUE}
cpaf$xmlValue("//Id")
```
### `esummary`: Retrieving summaries from primary IDs ###
`esummary` retrieves document summaries (*docsum*s) from a list of primary IDs.
Let's find out what the first entry for PMP is about:
```{r, eval=TRUE, echo=TRUE, output.lines=24}
esum <- esummary(cpaf[1])
esum
```
We can also parse *docsum*s into a `tibble`
```{r, eval=TRUE, echo=TRUE}
esum <- esummary(cpaf[1:4])
content(esum, "parsed")
```
### `efetch`: Downloading full records from Entrez ###
First we search the protein database for sequences of the **c**hlamydial **p**rotease
**a**ctivity **f**actor, [CPAF](http://dx.doi.org/10.1016/j.tim.2009.07.007)
```{r, eval=TRUE, echo=TRUE}
cpaf <- esearch("Chlamydia[orgn] and CPAF", "protein")
cpaf
```
Let's fetch the FASTA record for the first protein. To do that, we have to
set `rettype = "fasta"` and `retmode = "text"`:
```{r, eval=TRUE, echo=TRUE}
cpaff <- efetch(cpaf[1], rettype = "fasta", retmode = "text")
cpaff
```
Now we can write the sequence to a fasta file by first extracting the data from the
`efetch` object using `content()`:
```{r, eval=TRUE, echo=TRUE}
write(content(cpaff), file = "~/cpaf.fna")
```
```{r, eval=TRUE, echo=TRUE, output.lines=24}
cpafx <- efetch(cpaf, rettype = "fasta", retmode = "xml")
cpafx
```
```{r, eval=TRUE, echo=TRUE, output.lines=24}
aa <- cpafx$xmlValue("//TSeq_sequence")
aa
defline <- cpafx$xmlValue("//TSeq_defline")
defline
```
### `einfo`: Information about the Entrez databases ###
You can use `einfo` to obtain a list of all database names accessible through the Entrez utilities:
```{r, eval=TRUE, echo=TRUE}
einfo()
```
For each of these databases, we can use `einfo` again to obtain more information:
```{r, eval=TRUE, echo=TRUE}
einfo("taxonomy")
```