Add BinaryCIF parser #4707

Will-Tyler · 2024-04-23T22:28:59Z

Acknowledgements

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

Description

This pull request closes #4705. It adds a BinaryCIF parser to the PDB package. The parser uses NumPy for speed: one of the decoders is written as a NumPy extension module, and the rest of the decoders use straightforward NumPy APIs.

I also add NumPy to the packaging tools, which effectively makes NumPy a requirement for building Biopython (more under Discussion).

Testing

Code

I tested the BinaryCIF parser by parsing 983 PDB structures with the BinaryCIF parser and comparing them to the structures returned by the mmCIF parser using the strictly_equals method. Here is a sample testing script:

comparison_pdb_codes = ['2VFV', '8E55', '2L2P', '4HOV', '8IIT', '3G84', '5FHH', '7Y8W', '7B3T', '3VVF', '7Z1N', '5MA2', '3BL7', '4U5M', '2ND3', '3DPI', '5SJ1', '2XG6', '4XZ1', '7C7M', '7UTV', '3IO3', '5Y2J', '6C4P', '1OA6', '3WRB', '5GKO', '4O4J', '6HAX', '2QOC', '2P0I', '2O44', '4LQT', '1T9B', '2NYK', '3VL1', '4JJR', '5USS', '4BIF', '5U2L', '7W5X', '4PB9', '6VKD', '7CLZ', '5EGB', '7FHK', '2W9N', '1V4S', '6IVA', '1AF2', '5D58', '6ULA', '3VAE', '6CMG', '2DPM', '2K1Y', '1S2K', '4GSX', '6DOQ', '4YLC', '7MY3', '6WZL', '5IAE', '4YYW', '4XQM', '5ISJ', '1PUX', '3RJY', '2MN5', '4HUZ', '1FJO', '4WVX', '6QUS', '2UZ9', '3TZA', '3OLI', '4APN', '5Z6W', '6ETC', '2FB9', '5INI', '2D61', '5UDP', '7EO6', '4N4O', '3VGE', '2RGP', '4N6L', '5FEX', '8F6O', '1YQ5', '5DVZ', '2MJX', '2NNT', '7U53', '1DOG', '3HO4', '6DPZ', '1I1D', '5UEF', '7QUL', '7U37', '5Y8Q', '3BTG', '4V1I', '7LW1', '3B81', '5O09', '8CMJ', '6HVS', '3QWV', '2O81', '3IQD', '4GR1', '1KGT', '8G3P', '4W5N', '6IYN', '6NYG', '3NG9', '4J79', '6VOR', '6GCQ', '5T0S', '1B6Y', '3VX8', '1CW8', '4J7V', '4FMN', '6R43', '5EXP', '6GAD', '3RRF', '2ZPI', '7RDX', '4DAJ', '2VMQ', '5SDN', '1XB3', '5LLX', '2LBC', '7WOB', '5MZ7', '7RAX', '6NIM', '5ULC', '4D6B', '5NVD', '2FMS', '4ADU', '7K6S', '8CYL', '7E2E', '7NTK', '8EV4', '6NBW', '4QW3', '8KBW', '1QQ3', '8GCL', '3PCU', '3PGV', '5Y5N', '5CMI', '2Y7J', '4EB0', '2GG3', '5VC2', '3LWJ', '5FUV', '7XWV', '5AZJ', '5N1N', '4CH6', '6EOM', '4E5J', '1W3V', '6RPY', '6VXZ', '2IYL', '1PF9', '5B7Y', '5SKO', '4ZYS', '7CJP', '2A38', '7LL7', '5P12', '6TZC', '3DPS', '3JBK', '3P16', '1C1L', '1NAY', '1BGG', '3IP9', '3DQG', '6KJK', '4ADQ', '4NDW', '6WMM', '7ZX8', '4N9C', '7YP9', '6HJS', '3G7P', '3QBR', '7Q8M', '2AFA', '2HRI', '3WQS', '5AUP', '3K4G', '3AM6', '6B0U', '2IZH', '5US7', '3WOK', '2CLC', '5AQV', '4FWF', '5FFF', '6PZN', '1O8A', '4JYH', '4UTZ', '3TNU', '3HHQ', '5IZW', '1GTS', '8DFI', '3PPX', '7QWV', '4BAA', '3U5L', '2R44', '2NLN', '4D0Z', '3J78', '1USI', '7ZAG', '5DIF', '6HDZ', '2V1Y', '5OWT', '6W2H', '3EQ2', 
'3UA3', '4BD8', '1CLM', '6TDS', '1G6O', '7KAA', '7ETI', '4XKR', '3DSJ', '2JQH', '2DSO', '5ISN', '3WO4', '7NAU', '6D0C', '6A5T', '6WEL', '2Q9E', '3CLU', '2MA6', '1JLR', '5FQN', '1MQ2', '8DKO', '2O4L', '1E62', '7SGI', '2Q80', '3L3Q', '6IB7', '3MK1', '1HCT', '4UP5', '3LN9', '2NTM', '3EVD', '2OQX', '4RMD', '8HW0', '5NTN', '7FA3', '8AY4', '1HIJ', '2ER8', '2FJ6', '7ERN', '5LLO', '7JLL', '2VFL', '4V5V', '6K7Z', '2MAW', '7MR1', '5DGU', '5UG6', '6W3C', '3C1Q', '6KXD', '3T2B', '4ELG', '2BPR', '7RGP', '1FMD', '5IV8', '2CKM', '1R3J', '7QLV', '4MBG', '7LXU', '7DUQ', '6WY8', '5LN0', '2WOU', '3SLS', '2QC9', '5H3Q', '4RD1', '7PE6', '7WD6', '7E3J', '3NKE', '6BGF', '1IKP', '3WL5', '3HEQ', '5ERU', '8EQD', '7ORP', '1HWQ', '1YNC', '4HHP', '2GF5', '3BLR', '2FGF', '1NTA', '5T9R', '2MYC', '5KZS', '1TBD', '7CDO', '3N6W', '17GS', '7O3T', '1HJN', '7WSF', '7U88', '1DQB', '3I52', '7OEP', '1AKM', '7O0U', '3LT1', '7KV7', '6L0X', '5LRF', '5OPI', '5NKU', '4JE7', '1ZLG', '1OOT', '5OOE', '8UYE', '2NLQ', '7GHM', '1E02', '4QNY', '4P4R', '7ZG7', '3RG8', '2P2O', '3GPV', '5EX6', '7OGK', '7OIE', '6O5T', '6WG6', '7MC3', '1Q2P', '3WNT', '5AB9', '5VVJ', '5Q14', '2E7G', '1VPN', '2LDI', '3PRP', '8QEL', '5GQT', '2MV0', '3QCK', '2Y0G', '6MSO', '4U1N', '6NHI', '7RU8', '6KUW', '6L5Z', '4LQV', '2LGI', '2CPG', '3AT7', '6G0S', '4WJY', '4QN5', '8APW', '8G4U', '2XDM', '3NUC', '1S89', '5EHA', '3WB2', '6PJ9', '1IYU', '5GLQ', '6FQ8', '1VTT', '5R2A', '7DXU', '7VER', '7F9O', '4MH8', '8G31', '8ASE', '3OZS', '1P2G', '1YXM', '6BDV', '6TG5', '3ISF', '8CK1', '3KVZ', '1XXH', '7S14', '3JV2', '4P6D', '1QFC', '4NCH', '2I1Q', '1WRQ', '1UNW', '2VI6', '6G1V', '6BYZ', '8EF8', '4YLQ', '8SNX', '7E61', '1GNB', '1OBF', '2EQI', '2XLI', '6QDU', '6LCI', '3QS0', '6KQQ', '7C3H', '6V5D', '7KCB', '5PUH', '8FWB', '6LGA', '1IEF', '5PJN', '8FVZ', '6TXW', '5OS2', '3I4F', '7TKE', '5J8T', '1V33', '6C00', '4R03', '2IX0', '3AQJ', '5D8S', '7SEF', '1KHL', '7LIY', '4CEV', '8BHW', '4KQK', '3NMN', '4EC4', '6QFI', '1V6I', '6HVR', '3IIW', '2QWR', '4XFS', '7OXF', 
'3FV9', '4V36', '5IRC', '3SEK', '6MI3', '6NKT', '3OAW', '1NFG', '1TEL', '3OO1', '6S05', '8AIS', '8DPR', '1EGG', '6VGE', '3B0W', '5VNZ', '5E2Q', '4ZWG', '7KAU', '5X0E', '7L9L', '8OGC', '3C0C', '5MA3', '7Y5H', '4H4Z', '5D2U', '5QC5', '2ZT5', '2FH8', '5FAV', '6WJS', '5PGO', '4ACY', '6ENF', '4FLW', '7ZZB', '8C3W', '6RXH', '1KCN', '5JIX', '3H0V', '6Z6W', '2ZEO', '4RZD', '1EUP', '7OKQ', '5UH9', '6RPJ', '6O3M', '3JW6', '6WXJ', '1TIL', '7Q7A', '4QMK', '2QJP', '6WAF', '2K8U', '5I63', '8BYU', '7VY4', '5GN5', '7NSY', '7C6P', '6P47', '2IMH', '1AP1', '2QVI', '1CG4', '4X0N', '2KJP', '5S82', '3RG0', '1SDU', '5G0N', '2MMH', '2YJH', '2CAZ', '7KZN', '1R2G', '7U0E', '5XLJ', '1T7D', '7RAO', '4GIF', '4PWO', '6Q4V', '5DW0', '6DN1', '6NS2', '4UI5', '1WFY', '6UJ5', '4O0W', '3DXT', '5Z4U', '8T12', '6SJP', '6GSA', '8C6O', '8AH4', '1Z3S', '6GB8', '6GXU', '5DIO', '7K5E', '2BYX', '7FD7', '5F82', '5N5H', '5NB9', '2O4U', '8JCV', '1LXG', '4XZB', '2GBT', '3PH0', '7TIP', '5T12', '6V06', '3LKF', '1W7D', '2PW5', '7CCM', '5MBB', '8WAZ', '6YWK', '6RGT', '5TLV', '4YM3', '2NU5', '3APM', '5MZQ', '3IW7', '5OF3', '7GDX', '6TYF', '2EG5', '5PN0', '1UHM', '8STZ', '5U3J', '6DA1', '8B34', '5CLF', '3N76', '1JTH', '2JGX', '6JDO', '4PPH', '5DKI', '1JTI', '2KP4', '1ZLP', '5SPC', '5E5X', '3TOG', '1JJ2', '7N32', '6GSZ', '6V0M', '1YYN', '4KB4', '4ANG', '1H3N', '4BEW', '7AOY', '7RXW', '4IMT', '3DJA', '2CVV', '2XUJ', '2KMQ', '4LXB', '3L6P', '3H94', '6VNL', '2UYR', '8AV0', '7DIX', '3KG0', '2A8Q', '8FRC', '5YLS', '5P44', '4YS4', '6E7H', '5R9I', '3AOH', '4RP8', '6TNS', '7XJX', '3D7R', '6NV9', '2L39', '3RMN', '4DCX', '4V8Q', '2RGF', '4OHE', '6RK5', '4V10', '4G1M', '5CGK', '4CSU', '8EKC', '7A0S', '3RJV', '1IFI', '1S0U', '7D6F', '5ZYI', '3KPX', '7FIP', '2P6I', '6QHC', '2JO4', '6GUS', '1NCO', '3MBC', '2XZ9', '1AAF', '5QXP', '7E69', '8ESU', '4G7F', '2DKB', '2H5F', '4IB0', '7VIR', '6B6E', '6EVL', '5HLW', '8FUA', '7REY', '7R1C', '3VX1', '1BFB', '1TGG', '7CEH', '6OKX', '4CAA', '8D6B', '5U7M', '7FNV', '8J79', '6AVT', '6WFJ', '8D7N', 
'7NX5', '7W7A', '1GR7', '5LT0', '2VMJ', '1BZ9', '6EQJ', '5GNK', '5YAK', '5PVW', '5MIY', '6SII', '6AFE', '2ZZX', '4IUE', '5X7Z', '7LBU', '4PAO', '3RUZ', '6N9O', '1QZA', '7ZY1', '1DL7', '7KXG', '1AFA', '1FUT', '6F75', '2ORB', '3US2', '1YQH', '2EYL', '1BEV', '1X98', '6JGZ', '7AV1', '4UGL', '3IBV', '8A94', '2JR0', '3M7W', '1QHT', '4UTU', '1HUK', '4O5K', '1BVL', '1EF4', '4C7Y', '4KRZ', '6HHO', '6LPK', '4LPH', '8OUW', '3RF7', '2EOO', '8J5D', '1V66', '6M5N', '5E1L', '3CTH', '3M3J', '3JC1', '1B2L', '4JO2', '3UVP', '3CHW', '4W6C', '6PJR', '4KF6', '7OXD', '7C7L', '6N0J', '5VBK', '4ZO4', '4K3G', '1E9T', '4BJ5', '7DFC', '6NLO', '5SFH', '2CZ5', '3ZWI', '2QMH', '7QBW', '7JS5', '6EL2', '5QUS', '6EGO', '7OLV', '1REY', '7ALX', '4DOB', '3OCT', '8PQJ', '5LS3', '1UWC', '8E3F', '8AM2', '5CLW', '6CKA', '3Q6O', '2H9U', '5GMQ', '5ADY', '6FDZ', '7VML', '7TJ3', '1SK4', '4K7J', '7DU7', '6MKW', '164L', '7YF4', '8AH7', '1HDZ', '2RDW', '3SK2', '1QFF', '1GJY', '5A4T', '5BMF', '5DPS', '3TW9', '6K8A', '3HU6', '5US4', '3B1Y', '8B2R', '4R67', '5YDM', '3TJE', '1MLK', '4OA4', '5KBG', '6DVV', '6ZN4', '6N1X', '2F00', '7YSI', '5S9E', '3PFE', '2DNJ', '1TO2', '2E33', '8PDV', '5R19', '3EJL', '1GMY', '3I2T', '3NOW', '1RSZ', '7LY4', '1O6V', '7ATV', '8PFX', '3J9J', '6VFU', '6WR3', '1UNA', '6IEI', '4QSM', '1WV0', '2AMD', '1CI5', '2IJI', '1B5B', '2N4G', '2YJC', '2ANA', '5RCJ', '2Q0R', '6M2O', '6GKI', '5IPH', '2EDV', '4G6K', '5LYU', '8B8S', '8BZO', '6QIX', '3V74', '6DFH', '5M4A', '6WM8', '5RP6', '8J6G', '7OKZ', '1IDT', '4CKX', '6JXH', '1RG5', '3GLL', '7F27', '5OQG', '6I11', '8V15', '6ED2', '7PI3', '7PMB', '3FVH', '1CZF', '4M14', '5XAI', '4KH3', '1IQE', '6XIV', '8GYG', '1CQK', '1Q5Q', '2QTU', '2HQE', '4IIN', '2BWK', '7W9W', '3OM8', '1GBT', '6WT2', '6D2P', '5W3U', '6RKW', '4DV0', '6W5V', '2M2M', '3ZZ4', '4B6U', '2LW1', '1ZUG', '4XJ2', '6G2K', '5P06', '5YXH', '1E5T', '7YTV', '2Y3N', '2VS1', '7YDT', '1CY9']

from Bio.PDB import MMCIFParser, BinaryCIFParser

mmcif_parser = MMCIFParser(auth_chains=False)
bcif_parser = BinaryCIFParser()

def compare_structures(pdb_code: str):
    mmcif_path = f"data/mmcif/{pdb_code}.cif"
    bcif_path = f"data/bcif/{pdb_code}.bcif.gz"

    mmcif_structure = mmcif_parser.get_structure(pdb_code, mmcif_path)
    bcif_structure = bcif_parser.get_structure(bcif_path)

    if mmcif_structure.strictly_equals(bcif_structure, compare_coordinates=True):
        print(f"Compared structures for {pdb_code}...")
    else:
        print(f"Comparison failed for {pdb_code}...")
        assert False


for pdb_code in comparison_pdb_codes:
    compare_structures(pdb_code)

I also added some unit tests, which are passing locally for me.

Documentation

To check the documentation changes, I built the documentation locally and manually inspected the changes to confirm that they were as expected.

Speed

I find that, on average, the BinaryCIF parser is around 5 times faster than the mmCIF parser (mean ratio 4.76, median 4.40). The data below summarize, for the PDB codes used in testing, the time taken to parse the mmCIF file divided by the time taken to parse the BinaryCIF file. Note that the BinaryCIF files were gzip-compressed whereas the mmCIF files were not, so the BinaryCIF parser would likely have been even faster on uncompressed files.

count    983.000000
mean       4.758603
std        3.478190
min        0.108241
25%        2.892743
50%        4.396097
75%        6.147312
max       60.518840
dtype: float64

Discussion

NumPy extension modules

To make the implementation faster, I add a NumPy extension module, called _bcifhelper, written using the NumPy C API. To use NumPy's C API, the build system needs to know where the NumPy header files are, which is given by numpy.get_include(). Thus, NumPy is required to build the extension module.

In this pull request, I add NumPy to the packaging tools so that NumPy is present for the build system while building Biopython. I believe that other parts of Biopython may benefit from NumPy extension modules. For example, the CE Align code creates a list of lists to represent the atomic coordinates of the PDB structure. It would be better to create a 2-dimensional NumPy array, which the C code portion of CE Align could accept and work with using the NumPy header definitions.

Further optimizations

I have a few ideas to further improve the speed of the BinaryCIF parser:

lazy MessagePack unpacking,
NumPy extension for the string array decoder,
controlled garbage collection.

Lazy MessagePack Unpacking

The msgpack module decodes the entire file. Instead, it might be more efficient to decode one "layer" at a time. For example, if the object is a dictionary, decode the keys first, then only decode the values when the user requests the value associated with a key.
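The idea can be illustrated independently of MessagePack. The sketch below uses a hypothetical LazyDict class (not part of this pull request or of the msgpack package) that keeps each value in its still-encoded form and decodes it only on first access. json.loads stands in for the real value decoder purely so that the example runs with the standard library; the keys and payloads are made up.

```python
import json
from collections.abc import Mapping
from typing import Any, Callable, Dict, Iterator, List


class LazyDict(Mapping):
    """Mapping whose values stay encoded until they are first requested."""

    def __init__(self, encoded: Dict[str, bytes], decode: Callable[[bytes], Any]):
        self._encoded = encoded  # key -> raw, still-encoded value
        self._decode = decode  # stand-in for a MessagePack value decoder
        self._cache: Dict[str, Any] = {}  # values decoded so far

    def __getitem__(self, key: str) -> Any:
        if key not in self._cache:
            # Unknown keys raise KeyError here, as with a normal dict.
            self._cache[key] = self._decode(self._encoded[key])
        return self._cache[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._encoded)

    def __len__(self) -> int:
        return len(self._encoded)


calls: List[bytes] = []


def counting_decode(raw: bytes) -> Any:
    """Decode one value and record that a decode actually happened."""
    calls.append(raw)
    return json.loads(raw)


lazy = LazyDict({"atom_site": b"[1, 2, 3]", "cell": b'{"a": 10.0}'}, counting_decode)
```

Iterating over lazy lists the keys without decoding anything, and each value is decoded at most once. Whether this pays off for BinaryCIF depends on how cheaply the undecoded byte ranges can be located in the file, which this sketch does not address.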

String Array Decoder

The part of the string array decoder that takes the string data and the offsets array and produces the list of strings is currently implemented in Python. I didn't figure out a way to implement this entirely using the standard NumPy interface. A NumPy extension might be used to speed this up.
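For reference, the step described above can be sketched in pure Python as follows. This illustrates the string array layout in the BinaryCIF specification (the unique strings concatenated into one string, an offsets array marking their boundaries, and an indices array selecting one string per row); it is not the code in this pull request. The function and parameter names are hypothetical, and treating an index of -1 as the empty string is an assumption modelled on the reference decoders.

```python
from typing import List, Sequence


def decode_string_array(
    string_data: str, offsets: Sequence[int], indices: Sequence[int]
) -> List[str]:
    """Rebuild one string per row from concatenated data, offsets and indices."""
    # Slice the concatenated data into the table of unique strings.
    unique = [string_data[start:end] for start, end in zip(offsets, offsets[1:])]
    # Look up one string per row; -1 is assumed to mean "no value".
    return [unique[i] if i >= 0 else "" for i in indices]


decoded = decode_string_array("ALAGLYCA", [0, 3, 6, 8], [0, 1, 0, 2, -1])
```

Both list comprehensions run in the Python interpreter, once per unique string and once per row, which is the part a NumPy extension could move into C.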

Controlled Garbage Collection

I found that disabling garbage collection and only collecting garbage after specific operations complete can increase the speed of the parser by roughly 20%. This is trivial to do in Python using the gc module.
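A minimal sketch of this pattern, assuming a hypothetical gc_paused context manager and a stand-in workload (neither is part of this pull request):

```python
import gc
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def gc_paused() -> Iterator[None]:
    """Disable the cyclic garbage collector for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state, then collect once explicitly.
        if was_enabled:
            gc.enable()
        gc.collect()


def build_records(count: int) -> List[Dict[str, object]]:
    """Stand-in for an allocation-heavy parsing step."""
    return [{"serial": i, "name": f"atom{i}"} for i in range(count)]


with gc_paused():
    paused_state = gc.isenabled()  # False while inside the block
    records = build_records(1000)
```

Reference counting still frees most objects while the collector is disabled; only the cycle detector is paused, so the main cost is that reference cycles created during the block live until the explicit gc.collect() call.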

References

BinaryCIF Specification
Python BinaryCIF parser
- This is an example BinaryCIF parser implemented in Python.
PDBML Parser pull request
- I adopted a similar testing strategy as this pull request.

Bio/PDB/__init__.py

Bio/PDB/binary_cif.py

Tests/test_PDB_binary_cif.py

setup.py

peterjc · 2024-04-24T14:10:13Z

.github/workflows/ci.yml

@@ -105,7 +105,7 @@ jobs:

    - name: Install Python packaging tools
      run: |
-        python -m pip install --upgrade pip setuptools wheel
+        python -m pip install --upgrade pip setuptools wheel numpy


NumPy should be installed later as a declared dependency - you shouldn't have needed to change this here (same comment below).

Without NumPy, the python setup.py sdist --formats=gztar,zip command fails because I use NumPy in the setup.py file to get the location of the C API header files (np.get_include()). This location is numpy/core/include/numpy. Maybe we could hardcode this, but I think NumPy would still need to be installed in order to compile the NumPy extension module(s).

Bio/PDB/Atom.py

Doc/Tutorial/chapter_pdb.rst

peterjc · 2024-04-26T11:14:29Z

@mdehoon could you look at this too please, especially from a compiled C code and numpy API point of view?

JoaoRodrigues

Overall, this looks very nice, thank you @Will-Tyler for another great contribution! The testing you did makes me confident that this works correctly. I made only a few minor comments.

Bio/PDB/Atom.py

Bio/PDB/binary_cif.py

JoaoRodrigues · 2024-04-28T19:59:44Z

Bio/PDB/binary_cif.py

+            # This resets the source if source is a file handle.
+            source.seek(0)
+
+        with (


I'd separate the choice of the open func from the with statement. Makes it a little clearer to understand what's going on.

I am trying to do this:

if source.endswith(".gz"):
    open_func = gzip.open
else:
    open_func = open

with open_func(source, mode="rb") as file:
    result = msgpack.unpack(file, use_list=True)

But mypy is complaining:

mypy.....................................................................Failed
- hook id: mypy
- exit code: 1

Bio/PDB/binary_cif.py: note: In member "get_structure" of class "BinaryCIFParser":
Bio/PDB/binary_cif.py:258:25: error: Incompatible types in assignment (expression has type overloaded function, variable has type overloaded function)  [assignment]
            open_func = open
                        ^~~~
Found 1 error in 1 file (checked 1 source file)

I will leave as is for now.

The same issue was raised on https://stackoverflow.com/questions/16813267/python-gzip-refuses-to-read-uncompressed-file with a pointer to python/mypy#1026, but that and the linked issue seem to have gone in the direction of Python 2/3 workarounds being a legacy corner case (irrelevant here), with judicious use of # type: ignore being the only practical suggestion :(
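One possible alternative, sketched here under assumptions, is to move the choice into a small helper that returns the opened file, so that no single variable is ever assigned two different overloaded functions. The helper name open_binary is hypothetical, the return annotation reflects the type stubs as I understand them, and the sketch has not been run through this project's mypy configuration.

```python
import gzip
import io
from typing import Union


def open_binary(path: str) -> Union[gzip.GzipFile, io.BufferedReader]:
    """Open path for binary reading, using gzip for names ending in ".gz"."""
    if path.endswith(".gz"):
        return gzip.open(path, mode="rb")
    return open(path, mode="rb")
```

The call site would then read `with open_binary(source) as file:`, which keeps the with statement free of the conditional.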

JoaoRodrigues · 2024-04-28T20:01:32Z

Bio/PDB/binary_cif.py

+            for index in range(len(serial_numbers))
+        ]
+
+    def get_structure(self, source: str) -> Structure:


As much as I don't like it, I'd keep the signature similar to the other parsers, e.g. add the id argument:

def get_structure(self, id: str, source: str) -> Structure: ....

JoaoRodrigues · 2024-04-28T20:01:56Z

Doc/Tutorial/chapter_pdb.rst

+
+.. code:: pycon
+
+   >>> parser.get_structure("1gbt.bcif.gz")


If you change the signature don't forget to update the example here.

NEWS.rst

Will-Tyler · 2024-05-06T23:04:15Z

I redid my testing as described in the PR description and everything looks good. I am ready for a new review.

Bio/PDB/__init__.py

peterjc · 2024-05-07T10:31:23Z

Why did you write this decoder in an extension module? I may have misunderstood the goal of the C code, or it may be a performance bottleneck, but looking at Bio/PDB/bcifhelpermodule.c it seems to be "just" moving data about, something which might be possible in pure Python with the struct module (see e.g. https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py for an example).

Will-Tyler · 2024-05-07T14:00:53Z

I implemented the integer packing decoder in C. (Here is the integer packing description.) Ideally, I think we want the integer packing decoder to be able to work with NumPy arrays because most of the other decoders can be implemented efficiently and simply with the NumPy API. I couldn't find a way to implement the integer packing decoder using NumPy's Python API. Hence, I started thinking about writing custom C code.

The struct module appears to return Python types. I think iterating in Python and using Python types would slow the parser down a lot. I originally implemented the parser in pure Python, following the examples I was looking at. I then found that I was able to speed it up a lot by using NumPy. (I should have recorded the performance improvements more carefully, but if I remember correctly, the NumPy implementation is 50 to 100 percent faster.)
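To make the trade-off concrete, here is a pure-Python sketch of integer packing decoding built on the struct module. It is not the C code from this pull request: the function name and signature are hypothetical, little-endian byte order is assumed, and the rule follows my reading of the BinaryCIF specification (a packed value equal to the type's limit is accumulated into the next value rather than emitted on its own).

```python
import struct
from typing import List

# struct format codes keyed by (bytes per packed value, unsigned?).
_FORMAT_CODES = {(1, False): "b", (1, True): "B", (2, False): "h", (2, True): "H"}


def decode_integer_packing(raw: bytes, byte_count: int, is_unsigned: bool) -> List[int]:
    """Expand 8- or 16-bit packed values back into full-width integers."""
    code = _FORMAT_CODES[(byte_count, is_unsigned)]
    packed = struct.unpack(f"<{len(raw) // byte_count}{code}", raw)

    bits = 8 * byte_count
    if is_unsigned:
        limits = {(1 << bits) - 1}  # only an upper limit exists
    else:
        upper = (1 << (bits - 1)) - 1
        limits = {upper, -upper - 1}  # upper and lower limits

    output: List[int] = []
    total = 0
    for value in packed:  # one interpreter-level iteration per packed value
        total += value
        if value not in limits:
            # A non-limit value ends the current run, so emit the sum.
            output.append(total)
            total = 0
    return output


# The example from the specification: [1, 2, -3, 128] packed into int8.
unpacked = decode_integer_packing(struct.pack("<5b", 1, 2, -3, 127, 1), 1, False)
```

struct.unpack returns a tuple of Python int objects and the loop runs once per packed value, which is the per-element overhead described above.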

peterjc

I have nothing further to add. @mdehoon?

Will-Tyler · 2024-06-09T00:04:28Z

It's been a few weeks since the last response... Can we merge this unless there is anything else to address?

peterjc · 2024-06-09T02:31:32Z

Is a squash-and-merge OK with you Will?

Will-Tyler · 2024-06-11T00:25:33Z

Yes, squash-and-merge is fine with me

peterjc · 2024-06-11T03:36:10Z

Merged, thank you Will 👍

Will-Tyler · 2024-06-11T13:45:56Z

Thanks all for reviewing! 🙏

Will-Tyler added 7 commits April 19, 2024 02:17

Fix structure comparison bug

80c066d

Add BinaryCIF parser

ab2d297

Update tutorial

7c8fd14

Add NEWS entry

9f0cdba

Import BinaryCIF parser into PDB module

a499fcb

Fix pre-commit issues

9cb7699

Fix msgpack import error

a485649

Will-Tyler marked this pull request as ready for review April 24, 2024 01:53

Will-Tyler requested a review from JoaoRodrigues as a code owner April 24, 2024 01:53