Skip to content

Commit

Permalink
Update vignettes
Browse files Browse the repository at this point in the history
Add section about acc_filter to tips and tricks
  • Loading branch information
joelnitta committed Aug 31, 2022
1 parent f1024fa commit 1e30498
Show file tree
Hide file tree
Showing 8 changed files with 183 additions and 138 deletions.
36 changes: 8 additions & 28 deletions vignettes/1_rodents.Rmd
Original file line number Diff line number Diff line change
@@ -1,16 +1,18 @@
---
title: "1. Build a database of all rodents"
date: "2022-07-29"
date: "2022-08-31"
output: rmarkdown::html_vignette
---



In this first tutorial we are going to build a database for all rodents. The rodents are a good test case for playing with `restez` as they are a relatively small domain in GenBank but still have charismatic organisms that people are familiar enough with to understand. To keep things extra fast, we will also limit the number of sequences in the database by limiting the sequence sizes between 100 and 1000.
In this first tutorial we are going to build a database for all rodents. The rodents are a good test case for playing with `restez` as they are a relatively small domain in GenBank but still have charismatic organisms that people are familiar enough with to understand. We will also limit the number of sequences in the database by limiting the sequence sizes between 100 and 1000.

The database you build here will be used again in later tutorials and you may wish to experiment with it yourself. Therefore it is best to locate a suitable place in your harddrive where you would like to store it for later reference. In this tutorial and in others, we will always refer to the rodents' `restez` path with the variable `rodents_path`.

Setting up the rodents database will likely take a long time. The exact time will depend on your internet speeds and machine specs. For reference, this vigenette was written on a MacBook Air (2013) via WiFi with a download speed of 13 MBPS. With this setup, downloading the database took 26 minutes and creating the database took 59 minutes.
Setting up the rodents database will likely take a long time. The exact time will depend on your internet speeds and machine specs. For reference, this vigenette was written on an iMac (2019) with a download speed of 56 MBPS. With this setup, downloading the database took 2.9 hr and creating the database took 5.3 hr.

Note GenBank domains (i.e., major parts of GenBank split by major taxonomic group) are specified by number. As of writing, rodents are number 10, but the numbering scheme may change between releases, so don't assume this will necessarily be the case in future releases. You can check the current number scheme by running `db_download()` without any input to `preselection`.

## Download

Expand All @@ -20,8 +22,8 @@ Setting up the rodents database will likely take a long time. The exact time wil
library(restez)
# set the restez path to a memorable location
restez_path_set(rodents_path)
# download for domain 15
db_download(preselection = '15')
# download for domain 10 (rodents as of GenBank release 251)
db_download(preselection = '10')
```

## Build
Expand All @@ -35,29 +37,7 @@ db_create(min_length = 100, max_length = 1000)
```r
restez_status()
#> Checking setup status at ...
#> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Restez path ...
#> ... Path '[RODENTS PATH]/restez'
#> ... Does path exist? 'Yes'
#> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Download ...
#> ... Path '[RODENTS PATH]/restez/downloads'
#> ... Does path exist? 'Yes'
#> ... N. files 58
#> ... N. GBs 6.77
#> ... GenBank division selections 'Primate'
#> ... GenBank Release 250
#> ... Last updated '2022-07-28 21:51:13'
#> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Database ...
#> ... Path '[RODENTS PATH]/restez/sql_db'
#> ... Does path exist? 'Yes'
#> ... N. GBs 0
#> ... Does the database have data? 'Yes'
#> ... Number of sequences 631991
#> ... Min. sequence length 100
#> ... Max. sequence length 1000
#> ... Last_updated '2022-07-29 00:50:00'
#> Error in restez_path_check(): Restez path not set. Use restez_path_set().FALSE
```

## Next up
Expand Down
16 changes: 12 additions & 4 deletions vignettes/1_rodents.Rmd.orig
Original file line number Diff line number Diff line change
Expand Up @@ -19,13 +19,21 @@ status_class2 <- function() {
res
}
assignInNamespace(x = 'status_class', value = status_class2, ns = 'restez')

# Get download and build times
# rodent_build_times.RDS generated by rodent_db.R
rodent_build_times <- readRDS(here::here("other/rodent_build_times.RDS"))
dl_time_hr <- round(rodent_build_times$dl_time[["elapsed"]] / 60 / 60, 1)
db_time_hr <- round(rodent_build_times$db_time[["elapsed"]] / 60 / 60, 1)
```

In this first tutorial we are going to build a database for all rodents. The rodents are a good test case for playing with `restez` as they are a relatively small domain in GenBank but still have charismatic organisms that people are familiar enough with to understand. To keep things extra fast, we will also limit the number of sequences in the database by limiting the sequence sizes between 100 and 1000.
In this first tutorial we are going to build a database for all rodents. The rodents are a good test case for playing with `restez` as they are a relatively small domain in GenBank but still have charismatic organisms that people are familiar enough with to understand. We will also limit the number of sequences in the database by limiting the sequence sizes between 100 and 1000.

The database you build here will be used again in later tutorials and you may wish to experiment with it yourself. Therefore it is best to locate a suitable place in your harddrive where you would like to store it for later reference. In this tutorial and in others, we will always refer to the rodents' `restez` path with the variable `rodents_path`.

Setting up the rodents database will likely take a long time. The exact time will depend on your internet speeds and machine specs. For reference, this vigenette was written on a MacBook Air (2013) via WiFi with a download speed of 13 MBPS. With this setup, downloading the database took 26 minutes and creating the database took 59 minutes.
Setting up the rodents database will likely take a long time. The exact time will depend on your internet speeds and machine specs. For reference, this vigenette was written on an iMac (2019) with a download speed of 56 MBPS. With this setup, downloading the database took `r dl_time_hr` hr and creating the database took `r db_time_hr` hr.

Note GenBank domains (i.e., major parts of GenBank split by major taxonomic group) are specified by number. As of writing, rodents are number 10, but the numbering scheme may change between releases, so don't assume this will necessarily be the case in future releases. You can check the current number scheme by running `db_download()` without any input to `preselection`.

## Download
```{r setpath, include=FALSE, eval=TRUE}
Expand All @@ -41,8 +49,8 @@ if (!dir.exists(rodents_path)) {
library(restez)
# set the restez path to a memorable location
restez_path_set(rodents_path)
# download for domain 15
db_download(preselection = '15')
# download for domain 10 (rodents as of GenBank release 251)
db_download(preselection = '10')
```

## Build
Expand Down
95 changes: 7 additions & 88 deletions vignettes/2_search_and_fetch.Rmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: "2. How to search for and fetch sequences"
date: "2022-08-29"
date: "2022-08-31"
output: rmarkdown::html_vignette
---

Expand Down Expand Up @@ -38,10 +38,10 @@ To fetch the sequences from the rodents database, we can use the `gb_fasta_get`
restez_path_set(rodents_path)
coi_sequences <- gb_fasta_get(id = accessions)
str(coi_sequences[[1]])
#> NULL
#> chr ">FJ808614.1 Glis glis isolate A1g02_E11-F cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\n"| __truncated__
# Are all accessions in results?
all(accessions %in% names(coi_sequences))
#> [1] FALSE
#> [1] TRUE
```

## Comparing to Entrez
Expand All @@ -55,7 +55,7 @@ system.time(expr = {
coi_sequences <- gb_fasta_get(id = accessions)
})
#> user system elapsed
#> 0.142 0.293 0.187
#> 0.202 0.099 0.255
# time via Entrez
system.time(expr = {
coi_sequences_p1 <- rentrez::entrez_fetch(db = 'nucleotide',
Expand All @@ -75,7 +75,7 @@ system.time(expr = {
rettype = 'fasta')
})
#> user system elapsed
#> 0.047 0.008 4.618
#> 0.049 0.008 5.588
```
<!-- Below is no longer relevant now that the size of sequences in the db has been limited.
## Missing
Expand All @@ -86,7 +86,7 @@ A user should know that if an ID cannot be found in the local database no error
```r
# Are all accessions in results?
all(accessions %in% names(coi_sequences))
#> [1] FALSE
#> [1] TRUE
# .... no
```
Expand All @@ -97,88 +97,7 @@ We can look up alternative IDs and test whether they are in our `accesssions` ve
```r
(accessions[!accessions %in% names(coi_sequences)])
#> [1] "MZ661159" "MZ364430" "HQ966965" "GU670702" "GU670701" "GU670700"
#> [7] "GU670699" "GU670462" "GU670451" "GU670450" "GU670447" "GU670446"
#> [13] "GU670445" "GU670444" "GU670442" "KY033226" "HQ980030" "MW478021"
#> [19] "MK256849" "MK256848" "MK256847" "MK256846" "MK256845" "MK256844"
#> [25] "MK256843" "MK256842" "MK256841" "MK256840" "MK256839" "MK256838"
#> [31] "MK256837" "MN326076" "MN326074" "MN326066" "MN326062" "MN326059"
#> [37] "MN326045" "HM380207" "KX859270" "KX859268" "KX859266" "KX859265"
#> [43] "KX859257" "KX859256" "KX859255" "KX859254" "JF457164" "JF457162"
#> [49] "JF457161" "JF457159" "JF457156" "JF457155" "JF457154" "JF457153"
#> [55] "JF457152" "JF457151" "JF457147" "JF457143" "JF457142" "JF457140"
#> [61] "JF457139" "JF457134" "JF457121" "JF457099" "JF456717" "JF456716"
#> [67] "JF456715" "JF456610" "JF456609" "JF456608" "JF456607" "JF456606"
#> [73] "JF456605" "JF456604" "JF456603" "JF456602" "JF456599" "JF499380"
#> [79] "KY754550" "KY754547" "KY754517" "KY754509" "KY605374" "KY605373"
#> [85] "KY605302" "KY605301" "KY605300" "KY315490" "LT630616" "LT630615"
#> [91] "LT630614" "LT630613" "LT630612" "LT630611" "LT630610" "LT630609"
#> [97] "LT630608" "LT630607" "JF457163" "JF457160" "JF457158" "JF457157"
#> [103] "JF456718" "JF456601" "JF456600" "JF456597" "JF343472" "JF499379"
#> [109] "JF499341" "JF499335" "JF499319" "JF499307" "JF444472" "JF444471"
#> [115] "JF444470" "JF444445" "JF444444" "JF444306" "GU931776" "GU931775"
#> [121] "HM102291" "FJ808617" "FJ808616" "FJ808615" "FJ808614" "KU527909"
#> [127] "KU527908" "KU527907" "KU527906" "KU527905" "KU527904" "LN899446"
#> [133] "LN899445" "LN899444" "LN899443" "LN899442" "LN899441" "LN899440"
#> [139] "LN899439" "LN899438" "LN899437" "LN899436" "LN899435" "LN899434"
#> [145] "LN899433" "LN899432" "LN899431" "LN899430" "LN899429" "LN899428"
#> [151] "LN899427" "LN899426" "LN899425" "KF152967" "KJ192997" "KJ192996"
#> [157] "KJ192995" "KJ192994" "KJ192831" "KJ192830" "KM537985" "KM537984"
#> [163] "KM537983" "KM537982" "KM537981" "KM537980" "KM537979" "KM537978"
#> [169] "KM537977" "KM537976" "KM537975" "KM537974" "KM537973" "KM537972"
#> [175] "KM537971" "KM537970" "KM537969" "KM537968" "KM537967" "KM537966"
#> [181] "KM537965" "KM537964" "KM537963" "KM537962" "KM537961" "KM537960"
#> [187] "KM537959" "KM537958" "KM537957" "KM537956" "KM537955" "KM537954"
#> [193] "KM537953" "KM537952" "KM537951" "KM537950" "KM537949" "KM537948"
#> [199] "KM537947" "KM537946" "KM537945" "KM537944" "KM537943" "KM537942"
#> [205] "KM537941" "KM537940" "KM537939" "KM537938" "KM537937" "KM537936"
#> [211] "KM537935" "KM537934" "KM537933" "KM537932" "KM537931" "KM537930"
#> [217] "KM537929" "KM537928" "KM537927" "KM537926" "KM537925" "KM537924"
#> [223] "KM537923" "KM537922" "KM537921" "KM537920" "KM537919" "KM537918"
#> [229] "KM537917" "KM537916" "KM537915" "KM537914" "KM537913" "KM537912"
#> [235] "KM537911" "KM537910" "KM537909" "KM537908" "KM537907" "KM537906"
#> [241] "KM537905" "KM537904" "KM537903" "KM537902" "KM537901" "KM537900"
#> [247] "KM537899" "KM537898" "KM537897" "KM537896" "KM537895" "KM537894"
#> [253] "KM537893" "KM537892" "KM537891" "KM537890" "KM537889" "KM537888"
#> [259] "KM537887" "KM537886" "KM537885" "KF856233" "KF856232" "KF856231"
#> [265] "KF856230" "KF856229" "KF856228" "KF856227" "KF856226" "KF856225"
#> [271] "KF856224" "KF856223" "KF856222" "KF856221" "KF856220" "KF856219"
#> [277] "KF856218" "KF856217" "KF856216" "KF856215" "KF856214" "KF856213"
#> [283] "KF856212" "KF856211" "KJ466862" "KJ466738" "KJ466737" "KJ466736"
#> [289] "KJ466735" "KJ466734" "KJ466733" "KJ466732" "KJ466731" "KJ466730"
#> [295] "KJ466729" "KJ466728" "KJ466727" "KJ466726" "KJ466725" "KJ466724"
#> [301] "KJ466723" "KJ466722" "KJ466721" "KJ466720" "KJ466719" "KJ466718"
#> [307] "KJ466717" "KJ466716" "KJ466715" "KJ466714" "KJ466713" "KJ466712"
#> [313] "KJ466711" "KJ466710" "KJ466709" "KJ466708" "KJ466707" "KJ466706"
#> [319] "KJ466705" "KJ466704" "KJ466703" "KJ466702" "KJ466701" "KJ466700"
#> [325] "KJ466699" "KJ466698" "KJ466697" "KJ466696" "KJ466695" "KJ466694"
#> [331] "KJ466693" "KJ466692" "KJ466691" "KJ466690" "KJ466689" "KJ466688"
#> [337] "KJ466687" "KJ466686" "KJ466685" "KJ466684" "KJ466683" "KJ466682"
#> [343] "KJ466681" "KJ466680" "KJ466679" "KJ466678" "KJ466677" "KJ466676"
#> [349] "KJ466675" "KJ466674" "KJ466673" "KJ466672" "KJ466671" "KJ466670"
#> [355] "KC861443" "JX962054" "JX962053" "KC193094" "KC193093" "KC193092"
#> [361] "KC193091" "KC193090" "KC193089" "KC193088" "KC193087" "KC193086"
#> [367] "KC193085" "KC193084" "KC193083" "KC189865" "HM031935" "HM031934"
#> [373] "HM031933" "HM031932" "JQ601651" "JQ601447" "JQ601445" "JQ601436"
#> [379] "JQ599849" "JQ599848" "JQ599847" "JQ599843" "JQ599841" "JQ599837"
#> [385] "JQ599827" "JQ599826" "JQ599825" "JQ599824" "JQ599822" "JQ599821"
#> [391] "JQ599820" "JQ599819" "JQ599818" "JQ599817" "JQ043551" "JQ043550"
#> [397] "JQ043549" "JQ043548" "JQ043547" "JQ350517" "JQ350516" "JQ350515"
#> [403] "JQ350514" "JQ350513" "JQ350512" "JQ350511" "JQ350510" "JQ350509"
#> [409] "JQ350508" "JQ350507" "JQ350506" "JQ350505" "JQ350504" "JQ350503"
#> [415] "JQ350502" "JQ350501" "JF457150" "JF457149" "JF457148" "JF457146"
#> [421] "JF457145" "JF457138" "JF457137" "JF457136" "JF457135" "JF457133"
#> [427] "JF457120" "JF457119" "JF457118" "JF457117" "JF457116" "JF457115"
#> [433] "JF457114" "JF457113" "JF457112" "JF444286" "JF457171" "JF457170"
#> [439] "JF457169" "JF457168" "JF457167" "JF457166" "JF457165" "JF457144"
#> [445] "JF457141" "JF457124" "JF457123" "JF457122" "JF457111" "JF457110"
#> [451] "JF457109" "JF457108" "JF457107" "JF457106" "JF457105" "JF457104"
#> [457] "JF457103" "JF457102" "JF457101" "JF457100" "JF456598" "JF456155"
#> [463] "JF456154" "JF459873" "JF459872" "JF459626" "JF459625" "JF459623"
#> [469] "JF459280" "JF459279" "JF459278" "JF459277" "JF459276" "JF459479"
#> [475] "JF444390" "JF444298" "JF444288" "JF444287" "JF443948" "JF443947"
#> [481] "JF443946" "JF443945" "JF443944" "JF443943" "JF443942" "JF443941"
#> [487] "JF444258" "JF444257" "JF444256" "JF444139" "JF444138" "GU377127"
#> character(0)
# NC* refers to RefSeq sequences and are not available through restez
# The sequence exists in GB under a different id which we can find like so
smmry <- rentrez::entrez_summary(db = 'nucleotide', id = 'NC_027278')
Expand Down
120 changes: 109 additions & 11 deletions vignettes/3_parsing.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -172,36 +172,134 @@ You can try out the above functions yourself on any sequence record by downloadi
library(restez)
restez_path_set(rodents_path)
(rand_id <- sample(suppressWarnings(list_db_ids()), 1))
#> [1] "AB005816"
#> [1] "AB000469"
record <- gb_record_get(rand_id)
(gb_extract(record = record, what = 'features'))
#> [[1]]
#> [[1]]$type
#> [1] "source"
#>
#> [[1]]$location
#> [1] "1..494"
#> [1] "1..510"
#>
#> [[1]]$organism
#> [1] "Homo sapiens"
#> [1] "Mus musculus"
#>
#> [[1]]$mol_type
#> [1] "genomic DNA"
#>
#> [[1]]$strain
#> [1] "C57BL6"
#>
#> [[1]]$db_xref
#> [1] "taxon:9606"
#> [1] "taxon:10090"
#>
#> [[1]]$clone
#> [1] "Ayu-1"
#>
#> [[1]]$chromosome
#> [1] "11"
#> [[1]]$cell_line
#> [1] "MS-1"
#>
#> [[1]]$map
#> [1] "11q13gene <1..>494"
#> [[1]]$cell_type
#> [1] "embryonic stem cells"
#>
#> [[1]]$gene
#> [1] "PPP1CA"
#> [[1]]$dev_stage
#> [1] "embryo"
#>
#> [[1]]$note
#> [1] "DNA fragment containing a part of CpG island of thehuman PPP1CA gene;Fragment RP8-11A"
#> [1] "vector-derived rabbit beta-globin sequence"
#>
#>
#> [[2]]
#> [[2]]$type
#> [1] "gene"
#>
#> [[2]]$location
#> [1] "1..491"
#>
#> [[2]]$gene
#> [1] "Ayu-1"
#>
#>
#> [[3]]
#> [[3]]$type
#> [1] "regulatory"
#>
#> [[3]]$location
#> [1] "1..491"
#>
#> [[3]]$regulatory_class
#> [1] "promoter"
#>
#> [[3]]$gene
#> [1] "Ayu-1"
#>
#>
#> [[4]]
#> [[4]]$type
#> [1] "misc_feature"
#>
#> [[4]]$location
#> [1] "85..92"
#>
#> [[4]]$gene
#> [1] "Ayu-1"
#>
#> [[4]]$note
#> [1] "octamer binding motif"
#>
#>
#> [[5]]
#> [[5]]$type
#> [1] "protein_bind"
#>
#> [[5]]$location
#> [1] "220..229"
#>
#> [[5]]$gene
#> [1] "Ayu-1"
#>
#> [[5]]$bound_moiety
#> [1] "En-type Oct-3"
#>
#>
#> [[6]]
#> [[6]]$type
#> [1] "misc_feature"
#>
#> [[6]]$location
#> [1] "269..276"
#>
#> [[6]]$gene
#> [1] "Ayu-1"
#>
#> [[6]]$note
#> [1] "octamer binding motif"
#>
#>
#> [[7]]
#> [[7]]$type
#> [1] "regulatory"
#>
#> [[7]]$location
#> [1] "397..402"
#>
#> [[7]]$regulatory_class
#> [1] "TATA_box"
#>
#> [[7]]$gene
#> [1] "Ayu-1"
#>
#>
#> [[8]]
#> [[8]]$type
#> [1] "mobile_element"
#>
#> [[8]]$location
#> [1] "492..510"
#>
#> [[8]]$mobile_element_type
#> [1] "insertion sequence"
```

## Next up
Expand Down
Loading

0 comments on commit 1e30498

Please sign in to comment.