This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

[WIP] PDB file handling enhancements #483

Merged
merged 36 commits into from
Aug 16, 2017

Conversation

joelselvaraj
Member

@joelselvaraj joelselvaraj commented Aug 3, 2017

I have been working on a few enhancements to PDB file handling, based on the BioPython library.
Such as:

  • Downloading PDB file to a specific directory instead of only current directory
  • Skip or Overwrite if the PDB file already exists in the directory
  • Downloading Entire PDB database
  • Downloading multiple PDB files by passing a list of PDB IDs
  • Downloading all obsolete PDB entries from RCSB Server
  • Updating a PDB directory based on weekly status file from RCSB Server
  • Retrieve a PDB file (i.e. download the PDB file if it does not exist, then read it)
  • Obsolete PDB file handling (A directory named "obsolete" is automatically created inside the specified "pdb_dir" to store and handle them)
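
The skip-or-overwrite behaviour in the list above can be illustrated with a minimal sketch. It assumes Julia 1.x syntax, and both `downloadunlessexists` and the `fetcher` callback are hypothetical names used only for illustration; the PR's real download code is not reproduced here.

```julia
# Hypothetical sketch: `fetcher` stands in for the real download step and
# is called with the destination path only when a download is needed.
function downloadunlessexists(fetcher::Function, pdbid::AbstractString;
                              pdb_dir::AbstractString=pwd(),
                              overwrite::Bool=false)
    pdbpath = joinpath(pdb_dir, "$pdbid.pdb")
    if isfile(pdbpath) && !overwrite
        return pdbpath  # file already present: skip the download
    end
    fetcher(pdbpath)    # download (or overwrite) the file
    return pdbpath
end
```

Passing the download step in as a function keeps the sketch free of network access, so the skip logic can be exercised on its own.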

@joelselvaraj joelselvaraj changed the title PDB file handling enhancements [WIP] PDB file handling enhancements Aug 3, 2017
@jgreener64
Member

Great, thanks for making this PR. I will have a look at this tomorrow.

@joelselvaraj
Member Author

joelselvaraj commented Aug 3, 2017

We have a few things to discuss.

1. Is the out_filepath argument in downloadpdb() required?
Removing it would make the code more uniform and easier for users. It would be better to keep the PDB ID as the file name so that files are easy to handle. Users can specify a different structure_name when reading the file to distinguish it as they wish.

2. Should we write test cases?
The code has grown a little more complex than before. In particular, there are lots of options and different combinations, which may lead to unexpected bugs.

3. Should the getstatuslist() function be exported from the module?
As of now it is just a helper function for getrecentchanges().

function downloadpdb(pdbid::AbstractString,
out_filepath::AbstractString="$pdbid.pdb";
ba_number::Integer=0)
function getallpdbentries()
Member

In general we avoid get in function names. The name should describe what is returned, perhaps pdbentrylist?

Member Author

That should be fine. I will change it accordingly.

"""
Returns a list of pdb codes in the weekly pdb status file from the given URL.
"""
function getstatuslist(url::AbstractString)
Member

The function name should reference the PDB, perhaps pdbstatuslist?

Member

What URL is the user supposed to use here? If there is a certain one perhaps put it in the docstring?

Member

Actually see comment below.

Member Author

The URL is supposed to be the weekly status file from the RCSB server. I will add it to the docstring.

"""
Returns three lists of the newest weekly files (added,modified,obsolete) from RCSB PDB Server
"""
function getrecentchanges()
Member

Perhaps pdbrecentchanges?

Member Author

Yes. K

"""
Returns a list of all obsolete entries ever in the RCSB PDB server
"""
function getallobsolete()
Member

Perhaps pdbobsoletelist?

Member Author

Yes. K

if the keyword argument `obsolete` is set `true`, the PDB files are downloaded into the obsolete directory inside `pdb_dir`;
if the keyword argument `overwrite` is set `true`, then it will overwrite the PDB file if it exists in the `pdb_dir`;
"""
function downloadmultiplepdb(pdbidlist::AbstractArray{String,1}; pdb_dir::AbstractString=pwd(), obsolete::Bool=false, overwrite::Bool=false)
Member

This name seems a bit long. We could even make this another method of downloadpdb, so that function can take either a string or a list.

I think this should be pdbidlist::Array{String,1} for the first argument.

Member Author

Nice. I will add a downloadpdb method for downloading multiple PDB files. Regarding pdbidlist::Array{String,1}: I was getting an error with it, and only defining the argument as AbstractArray worked. I will look into it and update accordingly.
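
The idea of one function name with a method for a single ID and a method for a list can be sketched as follows. This is a minimal illustration assuming Julia 1.x; the function bodies are hypothetical stand-ins that only build the target path and do not download anything.

```julia
# Hypothetical stand-in for the real download: it only returns the path
# the file would be written to.
function downloadpdb(pdbid::AbstractString; pdb_dir::AbstractString=pwd())
    return joinpath(pdb_dir, "$pdbid.pdb")
end

# Second method with the same name: dispatch selects it when a list of
# PDB IDs is passed, and it calls the single-ID method for each entry.
function downloadpdb(pdbidlist::Array{String,1}; pdb_dir::AbstractString=pwd())
    return [downloadpdb(pdbid, pdb_dir=pdb_dir) for pdbid in pdbidlist]
end
```

Note that `Array{String,1}` matches only a vector whose element type is exactly `String`. A vector of `SubString{String}`, such as the result of `split`, would not dispatch to the second method and would need a looser signature.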

@@ -107,6 +321,17 @@ function Base.read(filepath::AbstractString,
end
end

# Read PDB file based on PDB ID and pdb_dir.
function Base.read(pdbid::AbstractString,
Member

@jgreener64 jgreener64 Aug 4, 2017

This method clashes with the above read method. I wonder whether it is necessary, but if it goes in then the directory should be a full argument (so args are pdbid, directory, PDB).

Member Author

Actually, I just added it so that it has the same format as the rest. If users want to download first and then read later, they will have a uniform way of calling the functions. We may change the arguments as you mentioned.



"""
Download a PDB file or biological assembly from the RCSB PDB server.
Member

Use line breaks in the doc strings to keep line length to 80.

Member Author

K. I had this doubt. I will keep the line length to 80 in docstrings.

pdbidlist = Array{String,1}()
info("Fetching list of all PDB Entries from RCSB PDB Server...")
download("ftp://ftp.wwpdb.org/pub/pdb/derived_data/index/entries.idx","entries.idx")
open("entries.idx") do input
Member

Given that we make then delete the file here, I wonder if we should use a temporary filepath? If the user has entries.idx in their directory here it is getting overwritten without warning. You can use tempname() to save a name to a variable, write to this then delete it (I think there are examples of this in the PDB tests).

Member Author

Nice. I didn't know about it. thank you.



"""
Returns a list of pdb codes in the weekly pdb status file from the given URL.
Member

Pedantry - PDB should be capital in docstrings.

Member Author

Yes. K.

if the keyword argument `obsolete` is set `true`, the PDB file is downloaded into the obsolete directory inside `pdb_dir`;
if the keyword argument `overwrite` is set `true`, then it will overwrite the PDB file if it exists in the `pdb_dir`;
"""
function retrievepdb(pdbid::AbstractString;
Member

This is a nice function to have, difficult to find a descriptive name for it though. A descriptive one would be downloadpdbandread but that is way too long. Maybe this name is okay.

Member Author

Yes, I was quite unsure. I finally decided to keep the name used in BioPython, so BioPython users may find it easy. As mentioned, downloadpdbandread would be too long. We will keep it as retrievepdb for now. Any other good function name is welcome.

Member

One possibility is readfrompdb.

@jgreener64
Member

So overall looks good, thanks for adding this useful feature. There are a couple more things but the above is most of my issues. For reference the Biopython implementation is here: https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBList.py

In answer to your specific questions:

@jgreener64
Member

  1. out_filepath in my initial implementation was generally meant for the user to specify a directory rather than filename. Since you are adding the extra directory argument you can remove out_filepath for uniformity of filenames.

  2. Yes, there should be some tests. These will rely on an internet connection but I think that is okay. Obviously we cannot download the whole PDB in the tests, but we can certainly test the other functions.

  3. Maybe don't export it, if we do then there should be information on its use in the docstring.

@jgreener64
Member

There is one more thing that could be added now you are looking at this code, namely download of mmCIF and MMTF files from the PDB. Since mmCIF is now the standard, this is an important feature (I am writing an mmCIF parser for Bio.Structure now too). Would you be okay to implement this as a file_format (or similar) keyword argument to downloadpdb?

@joelselvaraj
Member Author

@jgreener64 Thank you so much for your review. I will update the code accordingly in future commits. Adding a file_format option to downloadpdb will be useful. It is nice that you are working on an mmCIF parser. We can discuss further changes as the code grows.

Returns a list of pdb codes in the weekly pdb status file from the given URL.
"""
function getstatuslist(url::AbstractString)
statuslist = Array{String,1}()
Member

I think statuslist = String[] is better, it seems to allocate less (this applies to all similar ones below too).

Member Author

OK. I didn't know that.

push!(failedlist,pdbid)
end
end
warn(length(failedlist)," PDB file failed to download : ", failedlist)
Member

Only do this if length(failedlist) > 0.

Member Author

Yes. I missed it
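
The guard requested above could look roughly like this. It is a sketch assuming Julia 1.x, where the `@warn` macro replaces the `warn` function used in the diff; `reportfailures` is a hypothetical helper name.

```julia
# Hypothetical helper: emit the warning only when something actually failed,
# and return the number of failures so callers can act on it.
function reportfailures(failedlist::Array{String,1})
    if length(failedlist) > 0
        @warn "$(length(failedlist)) PDB file(s) failed to download: $(join(failedlist, ", "))"
    end
    return length(failedlist)
end
```

With an empty `failedlist` nothing is logged, which avoids the misleading "0 PDB file failed" message.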

@joelselvaraj
Member Author

  • PDB, XML, mmCIF files are now downloaded in compressed format to reduce internet usage.
  • PDB files can now be downloaded and updated in MMTF format also.

Member

@jgreener64 jgreener64 left a comment

Thanks for the great changes @joels94 ! In my view this is near completion. I have made some more comments. A few more things before merge:

Download a Protein Data Bank (PDB) file or biological assembly from the RCSB
PDB. By default downloads the PDB file; if `ba_number` is set the biological
assembly with that number will be downloaded.
Returns a list of all PDB entries from RCSB PDB server
Member

Pedantry - full stop at the end (and in some other docstrings).

end
linecount +=1
end
end
Member

You do need to explicitly remove the temp file I think - tempname() gets you an available name then you download to it.

Member

This applies elsewhere you have used temp files too.

from RCSB PDB Server
"""
function pdbrecentchanges()
addedlist = String[]
Member

These lines are not required as pdbstatuslist returns the array.

if the keyword argument `obsolete` is set `true`, the PDB file is downloaded
into the obsolete directory inside `pdb_dir`;
if the keyword argument `overwrite` is set `true`, then it will overwrite the
PDB file if it exists in the `pdb_dir`;
Member

End with full stop.

"""
function downloadpdb(pdbid::AbstractString; pdb_dir::AbstractString=pwd(), file_format::Type=PDB, obsolete::Bool=false, overwrite::Bool=false, ba_number::Integer=0)
# A Dict mapping the type to their file extensions
pdbextension = Dict{Type,String}( PDB => ".pdb", PDBXML => ".xml", mmCIF => ".cif", MMTF => ".mmtf")
Member

This Dict is defined in a couple of places so can be taken out of the functions and defined as const Dict near the top.
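
A sketch of that refactoring, assuming Julia 0.6 or later syntax. The four empty abstract types are hypothetical stand-ins for the format types used in the PR, and `pdbfilename` is an illustrative helper that is not part of the diff.

```julia
# Hypothetical stand-ins for the file format types used in the PR.
abstract type PDB end
abstract type PDBXML end
abstract type mmCIF end
abstract type MMTF end

# Defined once near the top of the file and shared by every function.
const pdbextension = Dict{Type,String}(PDB => ".pdb", PDBXML => ".xml",
                                       mmCIF => ".cif", MMTF => ".mmtf")

# Any function can now look up the extension without redefining the Dict.
pdbfilename(pdbid::AbstractString, file_format::Type=PDB) =
    "$(pdbid)$(pdbextension[file_format])"
```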

# check if PDB file is downloaded and extracted properly
if !ispath(pdbpath) || filesize(pdbpath)==0
# If the file size is 0, its deleted. force=true ensures error is not thrown when file does not exists
rm(pdbpath, force=true)
Member

I would avoid forcing deletion and check again if the file exists here.

if the keyword argument `overwrite` is set `true`, then it will overwrite the
PDB file if it exists in the `pdb_dir`;
"""
function downloadpdb(pdbidlist::AbstractArray{String,1}; pdb_dir::AbstractString=pwd(), file_format::Type=PDB, obsolete::Bool=false, overwrite::Bool=false)
Member

I think this should be pdbidlist::Array{String,1}, did you say this caused a problem?

pdb_dir::AbstractString=pwd(),
ba_number::Integer=0,
structure_name::AbstractString="$pdbid.pdb",
remove_disorder::Bool=false,
Member

These 3 arguments can be replaced by kwargs... and then pass that to the inner read function (this means the defaults for remove_disorder etc. are defined in one place only).
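
The kwargs forwarding suggested here can be sketched as below. `readstructure` and `readbyid` are hypothetical names standing in for the inner read method and the outer wrapper, and the inner function returns a tuple rather than parsing anything.

```julia
# Hypothetical inner reader: the only place where the defaults are defined.
function readstructure(filepath::AbstractString;
                       structure_name::AbstractString=basename(filepath),
                       remove_disorder::Bool=false,
                       read_het_atoms::Bool=true)
    return (filepath, structure_name, remove_disorder, read_het_atoms)
end

# Outer wrapper: handles its own keyword and forwards the rest untouched.
function readbyid(pdbid::AbstractString; pdb_dir::AbstractString=pwd(), kwargs...)
    return readstructure(joinpath(pdb_dir, "$pdbid.pdb"); kwargs...)
end
```

A call such as `readbyid("1EN2"; pdb_dir="pdbs", remove_disorder=true)` overrides one default and leaves the others as the inner function defines them.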

read_het_atoms::Bool=true)
filepath = joinpath(pdb_dir,"$pdbid.pdb")
pdbpath = ba_number == 0 ? filepath : filepath*"$ba_number"
open(pdbpath, "r") do input
Member

This open is not actually required as calling read with the filepath is already defined below.

filepath = joinpath(pdb_dir,"$pdbid.pdb")
end
pdbpath = ba_number == 0 ? filepath : filepath*"$ba_number"
open(pdbpath, "r") do input
Member

This open is not actually required as calling read with the filepath is already defined below.

@codecov-io

codecov-io commented Aug 7, 2017

Codecov Report

Merging #483 into master will increase coverage by 0.6%.
The diff coverage is 80%.

@@            Coverage Diff            @@
##           master     #483     +/-   ##
=========================================
+ Coverage   70.34%   70.94%   +0.6%     
=========================================
  Files          34       34             
  Lines        2421     2537    +116     
=========================================
+ Hits         1703     1800     +97     
- Misses        718      737     +19
| Impacted Files | Coverage Δ | |
| :------------- | :--------- | :- |
| src/structure/pdb.jl | 89.22% <80%> (-5.61%) | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e08f2a3...22231b3.

Member

@jgreener64 jgreener64 left a comment

I have tried out some of the functions and they appear okay. I have left a few more small comments. Thanks for addressing everything. I'll look again when tests and docs are up 👍

if the keyword argument `overwrite` is set `true`, then it will overwrite the
PDB file if it exists in the `pdb_dir`.
"""
function downloadpdb(pdbidlist::Array{String,1}; pdb_dir::AbstractString=pwd(), file_format::Type=PDB, obsolete::Bool=false, overwrite::Bool=false)
Member

The last 4 arguments here could just be kwargs... so their defaults are defined in one place above.

if ispath(archivefilepath) && filesize(archivefilepath) > 0 && file_format != MMTF
input = open(archivefilepath) |> ZlibInflateInputStream
open(pdbpath,"w") do output
for line in eachline(input)
Member

This prints each newline character twice so the files alternate line/blank line. Could use print rather than println.

Member Author

I think this is also a system-specific issue? When using println() the file is generated properly and I can read it. But if I use print(), all the lines are concatenated and I cannot read the file. I have attached the two files for your reference (I renamed the extensions to .txt as I was not able to upload them in .pdb format). Let me know what you find.

1ENT_print().txt
1ENT_println().txt

Member

Thanks for this. It seems to be a Julia 0.5/0.6 issue, see the first breaking change for Julia v0.6.0 here: https://github.com/JuliaLang/julia/blob/master/NEWS.md .

In 0.5 the line break is not removed by eachline, in 0.6 it is. In order to be compatible with both I think we can use eachline(..., chomp=false) and use print inside.

Member Author

Ok. Guess, I should keep an eye on the release notes as Julia changes a lot in each version. Thank you. I will update accordingly.

Member

Actually, I just tested that on 0.5 and eachline can't take the chomp argument. It should be

for line in eachline(input)
    println(output, chomp(line))
end

Sorry about that.
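
For reference, the loop above can be tried on in-memory buffers. This is a minimal sketch, not the PR's decompression code, and `copylines` is a hypothetical helper name. Because `chomp` strips a trailing newline only when one is present, the output is the same whether or not `eachline` keeps line endings (Julia 1.x also drops them by default).

```julia
# Copy text line by line, writing exactly one newline per input line.
function copylines(input::IO, output::IO)
    for line in eachline(input)
        println(output, chomp(line))
    end
end

input = IOBuffer("HEADER\nATOM\n")
output = IOBuffer()
copylines(input, output)
copied = String(take!(output))
```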

throw(ErrorException("Error downloading PDB : $pdbid"))
end
end
rm(archivefilepath)
Member

If the condition is satisfied on line 180 then this will error as the file is not created. Either check the file exists here or move it up in the logic to where you know it exists.

Member Author

I don't think so. The archivefilepath exists independently of the if condition on line 180. That is because archivefilepath=tempname() on line 172 actually creates an empty temporary file instead of just returning the path.

Member

Interesting, it doesn't do that on my machine. I wonder if that is system specific. For example if I attempt to overwrite when overwrite is false I get

INFO: PDB Exists : 1AKE
ERROR: unlink: no such file or directory (ENOENT)
 in unlink(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in #rm#7(::Bool, ::Bool, ::Function, ::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in #downloadpdb#1(::String, ::Type{T}, ::Bool, ::Bool, ::Int64, ::Function, ::String) at ./REPL[1]:61
 in (::#kw##downloadpdb)(::Array{Any,1}, ::#downloadpdb, ::String) at ./<missing>:0

Member Author

@joelselvaraj joelselvaraj Aug 8, 2017

OK. Can you check by trying ispath(tempname())? I am getting true on Windows 10.

Member Author

Julia 0.6

Member

I get false in Mac OSX, Julia 0.6. Searching this on Julia turns up JuliaLang/julia#9053 which addresses this.

I guess check it exists and if so remove, which will work either way.

Member Author

Yes. Ok.
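
The "check it exists and if so remove" approach agreed on here could be sketched like this; `removeifexists` is a hypothetical helper name. It behaves the same whether or not `tempname()` has already created the file on a given platform.

```julia
# Remove the file only if it is really there; report whether anything was removed.
function removeifexists(filepath::AbstractString)
    if isfile(filepath)
        rm(filepath)
        return true
    end
    return false
end
```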

else
# Will download error page if ba_number is too high
download("http://www.rcsb.org/pdb/files/$pdbid.pdb$ba_number", out_filepath)
pdbpath = joinpath(pdb_dir,"$pdbid"*pdbextension[file_format]*"$ba_number")
Member

Thinking a bit more about it, the biological number should not change the extension of the file. Ideally it would change the filename, e.g. 1ABC_ba1.pdb or something like that rather than 1ABC.pdb1.
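
One way to build such a name is sketched below. The `_ba` infix follows the example in the comment above, and `assemblyfilename` is a hypothetical helper, so the naming finally adopted in the PR may differ.

```julia
# The assembly number goes into the file name, so the extension stays intact.
function assemblyfilename(pdbid::AbstractString, ba_number::Integer,
                          extension::AbstractString=".pdb")
    ba_number == 0 && return "$(pdbid)$(extension)"
    return "$(pdbid)_ba$(ba_number)$(extension)"
end
```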

# The first 4 characters in the line is the PDB ID
pdbid = uppercase(line[1:4])
# Check PDB ID is 4 characters long and only consists of alphanumeric characters
if length(pdbid) != 4 || ismatch(r"[^a-zA-Z0-9]", pdbid)
Member

r"^[a-zA-Z0-9]{4}$" will check the length as well.
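
A short sketch of the single-regex check, assuming Julia 1.x, where `occursin` replaces the `ismatch` function used in the diff; `isvalidpdbid` is a hypothetical helper name.

```julia
# Anchoring with ^ and $ plus the {4} quantifier checks both the length
# and the allowed characters in one pattern.
isvalidpdbid(pdbid::AbstractString) = occursin(r"^[a-zA-Z0-9]{4}$", pdbid)
```

The original condition rejected an ID when `length(pdbid) != 4` or when a non-alphanumeric character was found; here an ID is rejected when `!isvalidpdbid(pdbid)`.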

linecount +=1
end
end
rm(tempfilepath)
Member

rm(tempfilepath, force=true) would be better because it works even when the file does not exist.

Member

Also, to make sure that the temporary file will be deleted, you need to use the try-catch-finally statement.

Member Author

Wouldn't try-finally be sufficient? That way we may be able to know what error occurs. Which might be helpful in the initial stages to debug the code.
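
A try-finally version of the cleanup might look like the sketch below, assuming Julia 1.x; `withtempfile` is a hypothetical helper, not code from the PR. Without a `catch` clause the original error still propagates to the caller, while the `finally` block removes the temporary file either way.

```julia
# Hypothetical helper: run `f` on a fresh temporary path and always clean up.
function withtempfile(f::Function)
    tempfilepath = tempname()
    try
        return f(tempfilepath)
    finally
        # force=true avoids a second error if the file was never created
        rm(tempfilepath, force=true)
    end
end

# Usage: the file exists only for the duration of the do-block.
nlines = withtempfile() do path
    write(path, "100D\n101D\n")
    countlines(path)
end
```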

# MMTF is downloaded in uncompressed form, thus directly stored in pdbpath
download("http://mmtf.rcsb.org/v1.0/full/$pdbid", pdbpath)
else
warn("Invalid PDB file format!")
Member

This should throw an ArgumentError exception.
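
As an illustration, invalid input can be rejected with an exception instead of a warning. The sketch uses symbols in place of the PR's format types purely to stay self-contained, and `formatextension` is a hypothetical name.

```julia
# Simplified sketch: symbols stand in for the format types used in the PR.
function formatextension(file_format::Symbol)
    extensions = Dict(:PDB => ".pdb", :PDBXML => ".xml",
                      :mmCIF => ".cif", :MMTF => ".mmtf")
    if !haskey(extensions, file_format)
        throw(ArgumentError("Invalid PDB file format: $file_format"))
    end
    return extensions[file_format]
end
```

Unlike a warning, the exception stops execution at the point of the bad argument, so later code never runs with an undefined file path.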

elseif file_format == mmCIF
download("http://files.rcsb.org/download/$pdbid-assembly$ba_number"*pdbextension[file_format]*".gz", archivefilepath)
else
warn("Biological Assembly is available only in PDB and mmCIF formats!")
Member

ArgumentError.



"""
Download a PDB file or biological assembly from the RCSB PDB server.
Member

The first line should be the signature of the function (see: https://docs.julialang.org/en/stable/manual/documentation).

end
end
# Verify if the compressed PDB file is downloaded properly and extract it. For MMTF no extraction is needed
if ispath(archivefilepath) && filesize(archivefilepath) > 0 && file_format != MMTF
Member

Use isfile, not ispath.
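
The difference between the two checks can be seen in a small sketch, assuming Julia 1.x; the directory and file are created only for the demonstration.

```julia
dir = mktempdir()
filepath = joinpath(dir, "1EN2.pdb")
write(filepath, "HEADER\n")

# A directory is a valid path but not a regular file.
dirchecks = (ispath(dir), isfile(dir))
# A regular file satisfies both checks.
filechecks = (ispath(filepath), isfile(filepath))
```

Since `ispath` is also true for directories, `isfile` is the stricter test when the code goes on to open the path as a file.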

function downloadentirepdb(;pdb_dir::AbstractString=pwd(), file_format::Type=PDB, overwrite::Bool=false)
# Get the list of all pdb entries from RCSB PDB Server using getallpdbentries() and downloads them
pdblist = pdbentrylist()
info("About to download "*string(length(pdblist))*" PDB files. Make sure to have enough disk space and time!")
Member

Why don't you use string interpolation? "... download $(length(pdblist)) PDB files..."

@joelselvaraj
Member Author

joelselvaraj commented Aug 12, 2017

I have written the test cases. I have also made a few changes to the code. Kindly take a look and let me know if any changes are required.

@joelselvaraj
Member Author

joelselvaraj commented Aug 13, 2017

TO DO

  • Update docstrings

  • Write test cases

  • Add documentation

Finally, I have completed writing the documentation. Let me know if any changes are required before merging the code.

Member

@jgreener64 jgreener64 left a comment

So I am happy with the code now, thanks for making those changes. The tests look good too.

I think the doc changes need a bit of re-ordering though. The content is good but I think that the first few things the user reads should be how to download a PDB file and read it in the manner consistent with the other BioJulia read interfaces, i.e. read("path.pdb", PDB).

So I suggest keeping the start of the docs the same, apart from maybe changing "To download a PDB file" to "To download a PDB file - see below for more options".

Then after the struc = read(filepath_1EN2, PDB) box have a line or box talking about retrievepdb as a shortcut. The other options for downloadpdb, maintaining a local PDB copy etc. can go in the "RCSB PDB Metadata" section at the bottom, which could be renamed to "RCSB PDB utility functions".

info("Downloading PDB : $pdbid")
if ba_number == 0
if file_format == PDB || file_format == PDBXML || file_format == mmCIF
download("http://files.rcsb.org/download/$pdbid"*pdbextension[file_format]*".gz", archivefilepath)
Member

May as well use string interpolation as "http://files.rcsb.org/download/$(pdbid)$(pdbextension[file_format]).gz" here (and throughout).
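
A small sketch of the interpolated form next to the concatenated one. `pdbdownloadurl` is a hypothetical helper, and the extension is passed in as a plain string rather than looked up from the PR's Dict.

```julia
# Interpolation with $(...) keeps the whole URL in one string literal.
function pdbdownloadurl(pdbid::AbstractString, extension::AbstractString)
    return "http://files.rcsb.org/download/$(pdbid)$(extension).gz"
end

# The concatenated form from the diff builds the same string.
concatenated = "http://files.rcsb.org/download/" * "1EN2" * ".pdb" * ".gz"
interpolated = pdbdownloadurl("1EN2", ".pdb")
```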

| Keyword Argument | Description |
| :----------------------------- | :-------------------------------------------------------------------------------------------------------------------- |
| `pdb_dir::AbstractString=pwd()`| The directory to which the PDB file is downloaded |
| `file_format::Type=PDB` | The format of the PDB file. Options <PDB, PDBXML, mmCIF, MMTF> |
Member

I think "Options are PDB, PDBXML, mmCIF or MMTF" is more readable.

If `overwrite=true`, the existing PDB files in obsolete directory will be overwritten by the newly downloaded ones.


## Maintaining a Local Copy of the entire RCSB PDB Database
Member

Pedantry - no capitals on local copy.

| Keyword Argument | Description |
| :----------------------------- | :------------------------------------------------------------------------------------------------------- |
| `pdb_dir::AbstractString=pwd()`| The directory to which the PDB files are downloaded |
| `file_format::Type=PDB` | The format of the PDB file. Options <PDB, PDBXML, mmCIF, MMTF> |
Member

Same as above.

@@ -244,6 +366,18 @@ julia> rad2deg(psiangle(struc['A'][50], struc['A'][51]))
```


## RCSB PDB Metadata

Few functions that may help fetching information about the RCSB PDB Database.
Member

Pedantry - "There are a few functions that may help" etc.

@jgreener64
Member

On further thought, should the mmCIF type be called MMCIF? I think the Julia convention of capital types might supersede the technical name, and also this would bring it in line with MMTF where the MMs mean the same thing.

@joelselvaraj
Member Author

joelselvaraj commented Aug 15, 2017

Updated docs and code as discussed. Let me know if further changes are required.

Member

@jgreener64 jgreener64 left a comment

Thanks for making the changes. I am happy with the code and tests and would suggest a few more small changes to the docs before merge. Hopefully they won't take long.

Sorry to spend a while on the docs but I think it's very important to give a concise and useful overview of the module.

struc = readpdb("1EN2", pdb_dir="/path/to/pdb/directory")
```

**Note:** This requires the PDB file name to be uppercase PDB ID. Example : "1EN2.pdb"
Member

Maybe remove this line and change the above line to 'To parse a PDB file by specifying the PDB ID and PDB directory into a Structure-Model-Chain-Residue-Atom framework (file name must be in upper case, e.g. "1EN2.pdb")'.

Number of disordered atoms - 27
```

Various options can be set through optional keyword arguments when parsing a PDB file as follows:
Member

This is indented - to be in line with the rest of the file I don't think any indentation is needed, even for code.

Member

This applies to other lines too.

@@ -40,6 +41,8 @@ Number of hydrogens - 0
Number of disordered atoms - 27
```

**Note** : Refer [Downloading PDB files](#downloading-pdb-files) and [Reading PDB files](#reading-pdb-files) sections for more options.
Member

"Refer to..."


**Note:** This requires the PDB file name to be uppercase PDB ID. Example : "1EN2.pdb"

The function `readpdb` provides an uniform way to download and read PDB files. For example:
Member

I'm not sure lines 309-337 are required. Perhaps we could just say "The function readpdb provides a uniform way to download and read PDB files, for example readpdb("1EN2",pdb_dir="/path/to/pdb/directory"). The same keyword arguments are taken as read above, plus pdb_dir and ba_number."

| `pdbobsoletelist` | List of all obsolete PDB entries in the RCSB server | `Array{String,1}` |


## Maintaining a local copy of the entire RCSB PDB Database
Member

This section title is quite long, and the section mainly refers to the above section. Perhaps remove this section and put the sentence "Run the downloadentirepdb function..." into the above section.

@@ -20,13 +20,14 @@ The `Bio.Structure` module provides functionality to manipulate macromolecular s
To download a PDB file:
Member

Perhaps change this section title from "Parsing PDB files" to "Basics" as this makes more sense in the new arrangement.

downloadpdb("1EN2")
```

To parse a PDB file into a Structure-Model-Chain-Residue-Atom framework:

```julia
julia> struc = read(filepath_1EN2, PDB)
julia> struc = read("/path/to/pdb/file.pdb", PDB)
Member

For reference, this was initially set as filepath_1EN2 to pass the doctests but I don't think they are being tested here so this can stay as "/path/to/pdb/file.pdb".

@joelselvaraj
Member Author

I have updated the docs. Let me know if any changes are required.

@jgreener64
Member

Great, I'm happy with this to go in. Thanks for all the work. I will wait until tomorrow in case anyone else has any comments, then I'll merge.

@jgreener64 jgreener64 merged commit 1c89218 into BioJulia:master Aug 16, 2017

4 participants