PDB download and database functions (#483)

BioJulia · Aug 16, 2017 · 1c89218
1 parent e08f2a3
commit 1c89218

Showing 3 changed files with 649 additions and 29 deletions.
diff --git a/docs/src/man/structure.md b/docs/src/man/structure.md
@@ -15,18 +15,19 @@ end
 The `Bio.Structure` module provides functionality to manipulate macromolecular structures, and in particular to read and write [Protein Data Bank](http://www.rcsb.org/pdb/home/home.do) (PDB) files. It is designed to be used for standard structural analysis tasks, as well as acting as a platform on which others can build to create more specific tools. It compares favourably in terms of performance to other PDB parsers - see some [benchmarks](https://github.com/jgreener64/pdb-benchmarks).
 
 
-## Parsing PDB files
+## Basics
 
 To download a PDB file:
 
 ```julia
+# Stored in the current working directory by default
 downloadpdb("1EN2")
 ```
 
 To parse a PDB file into a Structure-Model-Chain-Residue-Atom framework:
 
 ```julia
-julia> struc = read(filepath_1EN2, PDB)
+julia> struc = read("/path/to/pdb/file.pdb", PDB)
 Bio.Structure.ProteinStructure
 Name                        -  1EN2.pdb
 Number of models            -  1
@@ -40,6 +41,8 @@ Number of hydrogens         -  0
 Number of disordered atoms  -  27
 ```
 
+**Note**: Refer to the [Downloading PDB files](#downloading-pdb-files) and [Reading PDB files](#reading-pdb-files) sections for more options.
+
 The elements of `struc` can be accessed as follows:
 
 | Command                     | Returns                                                                         | Return type       |
@@ -194,21 +197,6 @@ RCGSQGGGSTCPGLRCCSIWGWCGDSEPYCGRTCENKCWSGERSDHRCGAAVGNPPCGQDRCCSVHGWCGGGNDYCSGGN
 ```
 
 
-## Writing PDB files
-
-PDB format files can be written:
-
-```julia
-writepdb("1EN2_out.pdb", struc)
-```
-
-Any element type can be given as input to `writepdb`. Atom selectors can also be given as additional arguments:
-
-```julia
-writepdb("1EN2_out.pdb", struc, backboneselector)
-```
-
-
 ## Spatial calculations
 
 Various functions are provided to calculate spatial quantities for proteins:
@@ -244,6 +232,178 @@ julia> rad2deg(psiangle(struc['A'][50], struc['A'][51]))
 ```
 
 
+## Downloading PDB files
+
+To download a PDB file to a specified directory:
+
+```julia
+downloadpdb("1EN2", pdb_dir="path/to/pdb/directory/")
+```
+
+To download multiple PDB files to a specified directory:
+
+```julia
+downloadpdb(["1EN2","1ALW","1AKE"], pdb_dir="path/to/pdb/directory/")
+```
+
+To download a PDB file in PDB, XML, MMCIF or MMTF format:
+
+```julia
+# PDB file format
+downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=PDB)
+# XML file format
+downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=PDBXML)
+# MMCIF file format
+downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=MMCIF)
+# MMTF file format
+downloadpdb("1ALW", pdb_dir="path/to/pdb/directory/", file_format=MMTF)
+```
+
+Various options can be set through optional keyword arguments when downloading PDB files as follows:
+
+| Keyword Argument               | Description                                                                                                           |
+| :----------------------------- | :-------------------------------------------------------------------------------------------------------------------- |
+| `pdb_dir::AbstractString=pwd()`| The directory to which the PDB file is downloaded                                                                     |
+| `file_format::Type=PDB`        | The format of the PDB file. Options are PDB, PDBXML, MMCIF or MMTF                                                    |
+| `obsolete::Bool=false`         | If set `true`, the PDB file is downloaded into the auto-generated "obsolete" directory inside the specified `pdb_dir` |
+| `overwrite::Bool=false`        | If set `true`, overwrites the PDB file if it exists in `pdb_dir`; by default, existing files are skipped              |
+| `ba_number::Integer=0`         | If set > 0, downloads the respective biological assembly; by default downloads the PDB file                           |
+
+
+## Reading PDB files
+
+- To parse an existing PDB file into a Structure-Model-Chain-Residue-Atom framework:
+
+```julia
+julia> struc = read("/path/to/pdb/file.pdb", PDB)
+Bio.Structure.ProteinStructure
+Name                        -  1EN2.pdb
+Number of models            -  1
+Chain(s)                    -  A
+Number of residues          -  85
+Number of point mutations   -  5
+Number of other molecules   -  5
+Number of water molecules   -  76
+Number of atoms             -  614
+Number of hydrogens         -  0
+Number of disordered atoms  -  27
+```
+
+Various options can be set through optional keyword arguments when parsing a PDB file as follows:
+
+| Keyword Argument                             | Description                                                                     |
+| :------------------------------------------- | :------------------------------------------------------------------------------ |
+| `structure_name::AbstractString="$pdbid.pdb"`| The name of the PDB Structure read. Defaults to "< PDBID >.pdb"                 |
+| `remove_disorder::Bool=false`                | If set true, then disordered atoms are not parsed                                |
+| `read_std_atoms::Bool=true`                  | If set false, then standard ATOM records are not parsed                          |
+| `read_het_atoms::Bool=true`                  | If set false, then HETATM records are not parsed                                 |
+
+- To parse a PDB file by specifying the PDB ID and PDB directory into a Structure-Model-Chain-Residue-Atom framework (file name must be in upper case, e.g. "1EN2.pdb"):
+
+The function `readpdb` provides a uniform way to download and read PDB files. For example:
+
+```julia
+struc = readpdb("1EN2", pdb_dir="/path/to/pdb/directory")
+```
+
+`readpdb` takes the same keyword arguments as `read` above, plus `pdb_dir` and `ba_number`.
+
+- To download and parse a PDB file into a Structure-Model-Chain-Residue-Atom framework in a single line:
+
+```julia
+julia> struc = retrievepdb("1ALW", pdb_dir="path/to/pdb/directory")
+INFO: Downloading PDB : 1ALW
+INFO: Parsing the PDB file...
+Bio.Structure.ProteinStructure
+Name                        -  1ALW.pdb
+Number of models            -  1
+Chain(s)                    -  AB
+Number of residues          -  346
+Number of point mutations   -  0
+Number of other molecules   -  10
+Number of water molecules   -  104
+Number of atoms             -  2790
+Number of hydrogens         -  0
+Number of disordered atoms  -  0
+```
+
+Various options can be set through optional keyword arguments when downloading and parsing a PDB file as follows:
+
+| Keyword Argument                             | Description                                                                                                      |
+| :--------------------------------------------| :--------------------------------------------------------------------------------------------------------------- |
+| `pdb_dir::AbstractString=pwd()`              | The directory from which the PDB file is read                                                                    |
+| `obsolete::Bool=false`                       | If set `true`, the PDB file is downloaded into the auto-generated "obsolete" directory inside the specified `pdb_dir` |
+| `overwrite::Bool=false`                      | If set `true`, overwrites the PDB file if it exists in `pdb_dir`; by default, existing files are skipped         |
+| `ba_number::Integer=0`                       | If set > 0, reads the respective biological assembly; by default reads the PDB file                              |
+| `structure_name::AbstractString="$pdbid.pdb"`| The name of the PDB Structure read. Defaults to "< PDBID >.pdb"                                                  |
+| `remove_disorder::Bool=false`                | If set true, then disordered atoms are not parsed                                                                |
+| `read_std_atoms::Bool=true`                  | If set false, then standard ATOM records are not parsed                                                          |
+| `read_het_atoms::Bool=true`                  | If set false, then HETATM records are not parsed                                                                 |
+
+
+## Writing PDB files
+
+PDB format files can be written:
+
+```julia
+writepdb("1EN2_out.pdb", struc)
+```
+
+Any element type can be given as input to `writepdb`. Atom selectors can also be given as additional arguments:
+
+```julia
+writepdb("1EN2_out.pdb", struc, backboneselector)
+```
+
+
+## RCSB PDB Utility Functions
+
+- To download the entire RCSB PDB database in your preferred file format:
+
+```julia
+downloadentirepdb(pdb_dir="path/to/pdb/directory/", file_format=MMTF, overwrite=false)
+```
+
+The keyword arguments are described below:
+
+| Keyword Argument                 | Description                                                                                              |
+| :------------------------------- | :------------------------------------------------------------------------------------------------------- |
+| `pdb_dir::AbstractString=pwd()`  | The directory to which the PDB files are downloaded                                                      |
+| `file_format::Type=PDB`          | The format of the PDB file. Options are PDB, PDBXML, MMCIF or MMTF                                       |
+| `overwrite::Bool=false`          | If set `true`, overwrites the PDB file if it exists in `pdb_dir`; by default, existing files are skipped |
+
+- To update your local PDB directory based on the weekly status list of new, modified and obsolete PDB files from RCSB Server:
+
+```julia
+updatelocalpdb(pdb_dir="path/to/pdb/directory/", file_format=MMTF)
+```
+
+The `file_format` keyword specifies the format of the PDB files present in the local PDB directory. Obsolete PDB files are stored in the auto-generated `obsolete` directory inside the specified local PDB directory.
+
+- To download all obsolete PDB files from RCSB Server:
+
+```julia
+downloadallobsoletepdb(obsolete_dir="/path/to/obsolete/directory/", file_format=MMCIF, overwrite=false)
+```
+
+The `file_format` keyword specifies the format in which the PDB files are downloaded; options are PDB, PDBXML, MMCIF or MMTF.
+
+If `overwrite=true`, existing PDB files in the obsolete directory are overwritten by the newly downloaded ones.
+
+- To maintain a local copy of the entire RCSB PDB database:
+
+Run the `downloadentirepdb` function once to download all PDB files, then set up a cron job or similar to run the `updatelocalpdb` function once a week to keep the local PDB directory up to date with the RCSB server.
+
+There are a few more functions that may help.
+
+| Function           | Returns                                                                         | Return type                                              |
+| :----------------- | :------------------------------------------------------------------------------ | :------------------------------------------------------- |
+| `pdbentrylist`     | List of all PDB entries from RCSB Server                                        | `Array{String,1}`                                        |
+| `pdbstatuslist`    | List of PDB entries from specified RCSB weekly status list URL                  | `Array{String,1}`                                        |
+| `pdbrecentchanges` | Added, modified and obsolete PDB lists from the recent RCSB weekly status files | `Tuple{Array{String,1},Array{String,1},Array{String,1}}` |
+| `pdbobsoletelist`  | List of all obsolete PDB entries in the RCSB server                             | `Array{String,1}`                                        |
+
+
 ## Examples
 
 A few further examples of `Bio.Structure` usage are given below.