
Conversation

@marinegor
Contributor

@marinegor marinegor commented Nov 6, 2025

Fixes #5141 and perhaps fixes #5089

Changes made in this Pull Request:
Adds an option to read PDB files directly with gemmi by passing format='pdb_gemmi' when creating a Universe:

import MDAnalysis as mda
u = mda.Universe(mmcif_filename, format="pdb_gemmi")

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?
  • Is your name in package/AUTHORS? (If it is not, add it!)

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5142.org.readthedocs.build/en/5142/

@marinegor marinegor linked an issue Nov 6, 2025 that may be closed by this pull request
@codecov

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.75%. Comparing base (5c7c480) to head (9c3884e).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5142      +/-   ##
===========================================
+ Coverage    92.68%   92.75%   +0.07%     
===========================================
  Files          180      182       +2     
  Lines        22452    22520      +68     
  Branches      3186     3190       +4     
===========================================
+ Hits         20809    20888      +79     
- Misses        1169     1176       +7     
+ Partials       474      456      -18     


@marinegor
Contributor Author

converted to draft since #4712 is not yet merged

@marinegor marinegor mentioned this pull request Nov 13, 2025
4 tasks
@marinegor
Contributor Author

An update after an amazing discussion with the gemmi developers: it seems that the gemmi reader can actually be faster than the legacy one if we use the FlatStructure representation!

Benchmarks on 1FFK (>64k atoms), PDB format:

# BEFORE flatstructure
Measured what='coordinates' reader='pdb' gzipped=True n_cycles=20
    mean=272 ms	std= 9
Measured what='coordinates' reader='cif' gzipped=True n_cycles=20
    mean=63 ms	std= 6
Measured what='topology' reader='pdb' gzipped=True n_cycles=20
    mean=165 ms	std= 3
Measured what='topology' reader='cif' gzipped=True n_cycles=20
    mean=206 ms	std= 2
Measured what='universe' reader='pdb' gzipped=True n_cycles=20
    mean=274 ms	std= 9
Measured what='universe' reader='cif' gzipped=True n_cycles=20
    mean=273 ms	std= 8

Measured what='coordinates' reader='pdb' gzipped=False n_cycles=20
    mean=182 ms	std= 8
Measured what='coordinates' reader='cif' gzipped=False n_cycles=20
    mean=44 ms	std= 6
Measured what='topology' reader='pdb' gzipped=False n_cycles=20
    mean=121 ms	std= 2
Measured what='topology' reader='cif' gzipped=False n_cycles=20
    mean=186 ms	std= 2
Measured what='universe' reader='pdb' gzipped=False n_cycles=20
    mean=187 ms	std=10
Measured what='universe' reader='cif' gzipped=False n_cycles=20
    mean=235 ms	std= 7


# AFTER flatstructure
Measured what='coordinates' reader='pdb' gzipped=True n_cycles=20
    mean=274 ms	std= 9
Measured what='coordinates' reader='cif' gzipped=True n_cycles=20
    mean=29 ms	std= 0
Measured what='topology' reader='pdb' gzipped=True n_cycles=20
    mean=165 ms	std= 3
Measured what='topology' reader='cif' gzipped=True n_cycles=20
    mean=145 ms	std= 1
Measured what='universe' reader='pdb' gzipped=True n_cycles=20
    mean=276 ms	std= 9
Measured what='universe' reader='cif' gzipped=True n_cycles=20
    mean=173 ms	std= 1

Measured what='coordinates' reader='pdb' gzipped=False n_cycles=20
    mean=180 ms	std= 8
Measured what='coordinates' reader='cif' gzipped=False n_cycles=20
    mean=11 ms	std= 0
Measured what='topology' reader='pdb' gzipped=False n_cycles=20
    mean=120 ms	std= 2
Measured what='topology' reader='cif' gzipped=False n_cycles=20
    mean=125 ms	std= 2
Measured what='universe' reader='pdb' gzipped=False n_cycles=20
    mean=189 ms	std=11
Measured what='universe' reader='cif' gzipped=False n_cycles=20
    mean=137 ms	std= 1

(same script as above)
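
For readers without access to that script, here is a minimal sketch of a timing harness that produces output in the same shape. It is not the script used for the numbers above: the function names `measure` and `report` are hypothetical, and the workload in the usage line is a stand-in for building a Universe.

```python
import statistics
import time


def measure(func, n_cycles=20):
    """Call ``func`` n_cycles times; return (mean, std) of wall time in ms."""
    timings_ms = []
    for _ in range(n_cycles):
        start = time.perf_counter()
        func()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # Population standard deviation over the recorded cycles.
    return statistics.mean(timings_ms), statistics.pstdev(timings_ms)


def report(what, reader, gzipped, func, n_cycles=20):
    """Format one measurement as two lines, mirroring the layout above."""
    mean, std = measure(func, n_cycles)
    return (
        f"Measured {what=} {reader=} {gzipped=} {n_cycles=}\n"
        f"    mean={mean:.0f} ms\tstd={std:2.0f}"
    )


if __name__ == "__main__":
    # Stand-in workload; a real run would create a Universe here instead.
    print(report("universe", "pdb", False, lambda: sum(range(10_000))))
```

Using `time.perf_counter` keeps the measurement on a monotonic, high-resolution clock; whether the original script did the same is an assumption.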

I think we can safely say that it's not slower (in these benchmarks, Universe creation is about 27% faster uncompressed and about 37% faster gzipped), and it's possible to default to gemmi reader when respective package is installed?
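
One way such a default could be expressed on the user side is sketched below. This is only an illustration, not what the PR implements: `default_pdb_format` is a hypothetical helper, and the only names taken from this thread are the `pdb_gemmi` format string and the optional `gemmi` package.

```python
import importlib.util


def default_pdb_format(gemmi_format="pdb_gemmi", legacy_format="PDB"):
    """Return the format string to pass to Universe(..., format=...)."""
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec("gemmi") is not None:
        return gemmi_format
    return legacy_format


# Hypothetical usage (requires MDAnalysis and a PDB file):
#   u = mda.Universe(pdb_filename, format=default_pdb_format())
```

Checking with `find_spec` avoids importing gemmi just to test for its presence.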

@MDAnalysis/coredevs and @ljwoods2 do you have any opinion about it?

@p-j-smith
Member

Nice find @marinegor!

and it's possible to default to gemmi reader when respective package is installed?

If the behaviour is the same, then yeah definitely (i.e. if the universe created is identical). What tests fail if you make gemmi the default and run the testsuite?
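
A rough sketch of what "identical" could mean in practice: compare per-atom attributes from the two readers, with a tolerance for floats. The helper below is hypothetical and works on plain dictionaries so that it runs without MDAnalysis; in a real check the values would come from the two Universes, and the data shown is illustrative.

```python
import math


def differing_attributes(reference, candidate, rel_tol=1e-6):
    """Return sorted names of attributes whose values differ."""
    differing = []
    for name in sorted(set(reference) | set(candidate)):
        ref_values = reference.get(name)
        new_values = candidate.get(name)
        # An attribute missing on one side, or of different length, differs.
        if ref_values is None or new_values is None:
            differing.append(name)
            continue
        if len(ref_values) != len(new_values):
            differing.append(name)
            continue
        for ref, new in zip(ref_values, new_values):
            both_float = isinstance(ref, float) and isinstance(new, float)
            if both_float:
                same = math.isclose(ref, new, rel_tol=rel_tol)
            else:
                same = ref == new
            if not same:
                differing.append(name)
                break
    return differing


# Illustrative data only: two atoms, one mass differs between readers.
legacy = {"names": ["FE", "CU"], "masses": [55.847, 63.546]}
candidate = {"names": ["FE", "CU"], "masses": [55.845, 63.546]}
print(differing_attributes(legacy, candidate))
```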

@marinegor
Contributor Author

Nice find @marinegor!

If the behaviour is the same, then yeah definitely (i.e. if the universe created is identical). What tests fail if you make gemmi the default and run the testsuite?

If I just replace the default format (leaving ent for PDB/PDBParser and adding PDB to MMCIF/MMCIFParser), the following tests fail:

======================================================================================= short test summary info ========================================================================================
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBReader::test_coordinates - AssertionError:
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBReader::test_distances - IndexError: index -1 is out of bounds for axis 0 with size 0
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBReader::test_uses_PDBReader - AssertionError: failed to choose PDBReader
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBMetadata::test_HEADER - AttributeError: 'MMCIFReader' object has no attribute 'header'
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBMetadata::test_TITLE - AssertionError: Reader does not have a 'title' attribute.
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBMetadata::test_COMPND - AssertionError: Reader does not have a 'compound' attribute.
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBMetadata::test_REMARK - AssertionError: Reader does not have a 'remarks' attribute.
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBWriter::test_writer_no_altlocs - AssertionError:
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBWriter::test_write_nodims[universe_and_expected_dims0] - IndexError: invalid index to scalar variable.
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBWriter::test_write_nodims[universe_and_expected_dims1] - AssertionError:
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBWriter::test_chainid_validated[@] - IndexError: invalid index to scalar variable.
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBWriter::test_chainid_validated[] - IndexError: invalid index to scalar variable.
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBWriter::test_chainid_validated[AA] - IndexError: invalid index to scalar variable.
FAILED testsuite/MDAnalysisTests/coordinates/test_pdb.py::TestPDBWriter::test_hetatm_written - IndexError: invalid index to scalar variable.
---------
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::TestPDBParser::test_guessed_attributes - AssertionError
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::TestPDBParserSegids::test_guessed_attributes - AssertionError
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_PDB_no_resid - assert np.int64(-2147483648) == 1
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_PDB_metals - assert np.float64(55.845) == 55.847 ± 5.6e-05
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_missing_elements_noattribute - Failed: DID NOT WARN. No warnings of type (<class 'UserWarning'>,) were emitted.
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_wrong_elements_warnings - Failed: DID NOT WARN. No warnings of type (<class 'UserWarning'>,) matching the regex were emitted.
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_guessed_masses_and_types_values - AssertionError:
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_PDB_bad_charges[REMARK Invalid charge assignment - no sign for MG2+\nHETATM    1 CU    CU A   1      03.000  00.000  00.000  1.00 00.00          CU2+\nHETATM    2 FE    FE A   2      00.000  03.000  00.000  1.00 00.00          Fe2+\nHETATM    3 Mg    Mg A   3      03.000  03.000  03.000  1.00 00.00          MG2\nEND\n-2] - Failed: DID NOT WARN. No warnings of type (<class 'UserWarning'>,) matching the regex were emitted.
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_PDB_bad_charges[REMARK Invalid charge format for MG2+\nHETATM    1 CU    CU A   1      03.000  00.000  00.000  1.00 00.00          CU2+\nHETATM    2 FE    FE A   2      00.000  03.000  00.000  1.00 00.00          Fe2+\nHETATM    3 Mg    Mg A   3      03.000  03.000  03.000  1.00 00.00          MG+2\nEND\n-\\+2] - Failed: DID NOT WARN. No warnings of type (<class 'UserWarning'>,) matching the regex were emitted.
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_force_chainids_to_segids - assert 1 == 4

Most of these, I believe, fall into either the guessed-attributes category or PDB metadata parsing.
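
To check that kind of grouping, a small sketch like the following can tally a pytest short summary by test file and by leading error type. The function name `summarize_failures` is hypothetical, and splitting on the last " - " is a heuristic, since parametrized test ids can themselves contain " - ".

```python
from collections import Counter


def summarize_failures(summary_text):
    """Count FAILED lines per test file and per leading error type."""
    per_file, per_error = Counter(), Counter()
    for line in summary_text.splitlines():
        line = line.strip()
        if not line.startswith("FAILED "):
            continue
        node_and_reason = line[len("FAILED "):]
        # Treat the text after the last " - " as the failure reason.
        node, sep, reason = node_and_reason.rpartition(" - ")
        if not sep:
            node, reason = node_and_reason, ""
        # The file path is everything before the first "::".
        per_file[node.split("::")[0]] += 1
        # First word of the reason, e.g. "IndexError", "Failed", "assert".
        words = reason.replace(":", " ").split()
        per_error[words[0] if words else "unknown"] += 1
    return per_file, per_error
```

Feeding it the summary above separates the coordinates and topology failures and shows how many are `DID NOT WARN` cases versus plain assertion errors.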


Development

Successfully merging this pull request may close these issues.

Add option for gemmi backend for PDB reading
Parse hexadecimal resid from OpenMM in PDBParser

3 participants