-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BinaryCIF parser #4707
Add BinaryCIF parser #4707
Conversation
.github/workflows/ci.yml
Outdated
@@ -105,7 +105,7 @@ jobs: | |||
|
|||
- name: Install Python packaging tools | |||
run: | | |||
python -m pip install --upgrade pip setuptools wheel | |||
python -m pip install --upgrade pip setuptools wheel numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy should be installed later as a declared dependency - you shouldn't have needed to change this here (same comment below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without NumPy, the python setup.py sdist --formats=gztar,zip
command fails because I use NumPy in the setup.py file to get the location of the C API header files (np.get_include()
). This location is numpy/core/include/numpy
. Maybe we could hardcode this, but I think NumPy would still need to be installed in order to compile the NumPy extension module(s).
@mdehoon could you look at this too please, especially from a compiled C code and numpy API point of view? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks very nice, thank you @Will-Tyler for another great contribution! The testing you made makes me confident that this works correctly. I made only a few minor comments.
# This resets the source if source is a file handle. | ||
source.seek(0) | ||
|
||
with ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd separate the choice of the open func from the with
statement. Makes it a little clearer to understand what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to do this:
if source.endswith(".gz"):
open_func = gzip.open
else:
open_func = open
with open_func(source, mode="rb") as file:
result = msgpack.unpack(file, use_list=True)
But mypy is complaining:
mypy.....................................................................Failed
- hook id: mypy
- exit code: 1
Bio/PDB/binary_cif.py: note: In member "get_structure" of class "BinaryCIFParser":
Bio/PDB/binary_cif.py:258:25: error: Incompatible types in assignment
(expression has type overloaded function, variable has type overloaded function)
[assignment]
open_func = open
^~~~
Found 1 error in 1 file (checked 1 source file)
I will leave as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue raised on https://stackoverflow.com/questions/16813267/python-gzip-refuses-to-read-uncompressed-file with a pointer to python/mypy#1026 but that and the linked issue seemed to have gone in the direction of Python 2/3 workarounds being a legacy corner case (irrelevant here), and judicous use of # type: ignore
being the only practical suggestion :(
Bio/PDB/binary_cif.py
Outdated
for index in range(len(serial_numbers)) | ||
] | ||
|
||
def get_structure(self, source: str) -> Structure: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As much as I don't like it, I'd keep the signature similar to the other parsers, e.g. add the id argument:
def get_structure(self, id: str, source: str) -> Structure:
....
Doc/Tutorial/chapter_pdb.rst
Outdated
|
||
.. code:: pycon | ||
|
||
>>> parser.get_structure("1gbt.bcif.gz") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the signature don't forget to update the example here.
I redid my testing as described in the PR description and everything looks good. I am ready for a new review. |
Why did your write this decoders in an extension module? I may have misunderstood the goal of the C code, or it maybe a performance bottleneck - but looking at |
I implemented the integer packing decoder in C. (Here is the integer packing description.) Ideally, I think we want the integer packing decoder to be able to work with NumPy arrays because most of the other decoders can be implemented efficiently and simply with the NumPy API. I couldn't find a way to implement the integer packing decoder using NumPy's Python API. Hence, I started thinking about writing custom C code. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have nothing further to add, @mdehoon ?
It's been a few weeks since the last response... Can we merge this unless there is anything else to address? |
Is a squash-and-merge OK with you Will? |
Yes, squash-and-merge is fine with me |
Merged, thank you Will 👍 |
Thanks all for reviewing! 🙏 |
Acknowledgements
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the
CONTRIBUTING.rst
file, have runpre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst
andCONTRIB.rst
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
Description
This pull request closes #4705. In this pull request, I add a BinaryCIF parser to the PDB package. The parser uses NumPy for a faster implementation, and I write one of the decoders in a NumPy extension module. The rest of the decoders use straightforward NumPy APIs.
I add NumPy as a packaging tool so that essentially NumPy is required to build Biopython—more under Discussion.
Testing
Code
I tested the BinaryCIF parser by parsing 983 PDB structures with the BinaryCIF parser and comparing them to the structures returned by the mmCIF parser using the
strictly_equals
method. Here is a sample testing script:I also added some unit tests, which are passing locally for me.
Documentation
To check the documentation changes, I built the documentation locally and manually inspected the changes to confirm that they were as expected.
Speed
I find that, on average, the BCIF parser is around 5 times faster than the mmCIF parser. The data below describe the time taken to parse the mmCIF file divided by the time taken to parse the BinaryCIF file for the PDB codes used in testing. Note that the BinaryCIF files were GZIP-compressed whereas the mmCIF files were not. The BinaryCIF parser likely would have been even faster if decompressed files were used.
Discussion
NumPy extension modules
To make a faster implementation, I add a NumPy extension module, called
_bcifhelper
, written using the C-API. To use Numpy's C-API, the build system needs to know where the NumPy header files are, which is given bynumpy.get_include()
. Thus, NumPy is required to build the extension module.In this pull request, I add NumPy to the packing tools so that NumPy is present for the build system while building Biopython. I believe that other parts of Biopython may benefit from NumPy extension modules. For example, the CE Align code creates a list of lists to represent the atomic coordinates of the PDB structure. It would be better to create a 2-dimensional NumPy array, which the C code portion of CE Align could accept and work with using the NumPy header definitions.
Further optimizations
I have a few ideas to further improve the speed of the BinaryCIF parser:
Lazy MessagePack Unpacking
The
msgpack
module decodes the entire file. Instead, it might be more efficient to decode one "layer" at a time. For example, if the object is a dictionary, decode the keys first, then only decode the values when the user requests the value associated with a key.String Array Decoder
The part of the string array decoder that takes the string data and the offsets array and produces the list of strings is currently implemented in Python. I didn't figure out a way to implement this entirely using the standard NumPy interface. A NumPy extension might be used to speed this up.
Controlled Garbage Collection
I found that disabling garbage collection and only collecting garbage after specific operations complete can increase the speed of the parser by roughly 20%. This is trivial to do in Python using the
gc
module.References