PDBx/mmCIF Reader/Topology Reader #2367

JoaoRodrigues · 2019-10-15T18:21:58Z

Hi all,

I've been using OpenMM to run simulations for the past couple of years. Unlike other tools, OpenMM does not write a 'topology' file (e.g. .gro, .prmtop) but instead creates a Topology object on the fly when loading structures. Because simulation systems tend to be quite large, I've take to use PDBx/mmCIF files as my default file format for writing structures that I use as topologies.

I'd like to start using MDAnalysis but right now this involves jumping through a bunch of hoops to get my topologies in a format that is parseable. It'd be much easier if I could just load a PDBx/mmCIF file as a topology, specially since it's now the default file format for structures in the PDB.

To this end, I've started working on writing a simple PDBxParser class. I wouldn't mind extending it to a PDBxReader/Writer class but that is not necessarily a priority (specially the writer). Would this be an interesting feature to add in your opinion? See the code here, I modeled the class after PDBParser.

Thanks for the great work with the library so far!

EDIT: Since I cannot label issues, I'm just editing the title of the issue for now to make it clear!

The text was updated successfully, but these errors were encountered:

orbeckst · 2019-10-17T16:21:55Z

Hi @JoaoRodrigues , new formats are a great addition. Can you just create a PR and then we can comment directly on the code, run tests, etc?

JoaoRodrigues · 2019-10-17T16:30:04Z

Sounds good. Will do. Thanks!

orbeckst · 2020-07-30T00:03:32Z

Pure python implementation: https://github.com/Electrostatics/mmcif_pdbx

orbeckst · 2020-07-30T22:34:03Z

Also, chemfiles reads mmCIF and can be used inside MDAnalysis, see Reading trajectories with chemfiles.

orbeckst · 2021-07-08T02:15:17Z

Note on trying to use chemfiles:

With the 2.0.0b,

u = mda.Universe("1ake.cif.gz", topology_format="chemfiles")

fails with

'CHEMFILES' isn't a valid topology format, nor a coordinate format

I can't just read it with

u = mda.Universe("1ake.cif.gz", format="chemfiles")

either as it gives the same ValueError.

Just trying something stupid with the topology as a PDB also fails

u = mda.Universe("1ake.pdb.gz", "1ake.cif.gz", format="chemfiles")

with

ChemfilesError: can not read file '1ake.cif.gz' at step 0, it does not contain any step.

Conclusion: I couldn't get it to work with chemfiles. Maybe @Luthaf has some ideas but a key problem seems to be that our (EDIT) chemfiles converter do not work for topologies —— ??

Luthaf · 2021-07-08T08:25:54Z

I think that topology reading is not registered with MDA, since the chemfiles adapter is implemented as a coordinate reader/writer. Although the conversion from chemfiles to MDA topologies is already implemented, so it should be mostly a question of adding a new ChemfilesParser class. I'll have a look at this next week!

ChemfilesError: can not read file '1ake.cif.gz' at step 0, it does not contain any step.

This is a bit strange and probably a bug, is this 1ake from the wwwPDB?

Luthaf · 2021-07-08T10:25:03Z

This is a bit strange and probably a bug, is this 1ake from the wwwPDB?

Ok, I understand this part. The core of the issue is that to relatively different formats want to use the same extension: mmCIF and crystallography CIF. While they both use the same STAR format, they specify data in different ways. Unfortunately, chemfiles is associating the .cif extension with crystallography CIF files, and uses .mmcif for mmCIF.

There is a simple workaround though, since you can specify the format to use manually, with something like

u = mda.Universe("1ake.pdb.gz", "1ake.cif.gz", format="chemfiles", chemfiles_format="mmCIF / GZ")

Unfortunately, this still fails in the case of 1AKE (but should work for other files). I'll fix the 1AKE issue, it should be working in the next patch release.

I would also like to introduce a better format guessing functionality, to decide between mmCIF and crystallography CIF on the fly instead of having the use specify it manually.

orbeckst · 2021-07-08T15:27:27Z

Yes, I downloaded the file directly from the PDB: sorry I didn’t try any others.

…

Am 7/8/21 um 01:26 schrieb Guillaume Fraux ***@***.***>: This is a bit strange and probably a bug, is this 1ake from the wwwPDB?

flautodipan · 2021-07-16T11:04:42Z

Hi everyone,
I am trying to construct a MDAnalysis Universe for PDBx/mmCIF files, since for several structures I am studying there is no available .pdb file.

Under read suggestions, I've successfully created a Universe with the following lines, in the conda env 'mda-workshop2021' with 2.0.0b version.

(https://urldefense.com/v3/__https://github.com/MDAnalysis/WorkshopPrace2021/blob/main/INSTALL.md__;!!JFdNOqOXpB6UZW0!-TNyVmB_fl6DRCq3XPyNYXwWr45Lg3p4afVdE_ZAclXGf3FhXNXP0diPIN-M7KP6ugyYIRzpPA$)

from simtk.openmm.app import pdbxfile
from MDAnalysis.converters.OpenMMParser import OpenMMTopologyParser

pdbx = pdbxfile.PDBxFile('<path/to/stored/.cif/file'>)

openmmtopol =OpenMMTopologyParser(pdbx.topology)

mdatopol = ommtopol._mda_topology_from_omm_topology(pdbx.topology)

U = mda.Universe(mdatopol)

With these lines I can see atoms and chains and other attributes. But I've just realised that the attribute dimensions had not been created.

When I try to run the command

U.select_atoms('around 2.2 resname ')

I'm getting this AttributeError:
'AtomGroup has no attribute dimensions'

Glad if you can help!

orbeckst · 2021-07-16T21:37:27Z

@flautodipan does your mmCIF file contain a box? If I do the following with 1ake.cif

import MDAnalysis as mda
from simtk.openmm.app import pdbxfile
structure = pdbxfile.PDBxFile("1ake.cif")
u = mda.Universe(structure)

(EDIT: Note that you normally do not need to explicitly import converters/parsers, i.e., you don't need from MDAnalysis.converters.OpenMMParser import OpenMMTopologyParser: just throw the resulting OpenMM object into a Universe and let MDAnalysis figure out the rest.)

I can do a selection (note that I added a name for resname)

>>> u.select_atoms("around 2.2 resname ARG")
<AtomGroup with 52 atoms>

without issues. Check that you have the dimensions with

>>> u.dimensions
array([73.200005, 79.8     , 85.      , 90.      , 90.      , 90.      ],      dtype=float32)

and verify that you have the corresponding elements in the mmCIF file

_cell.length_a           73.200
_cell.length_b           79.800
_cell.length_c           85.000
_cell.angle_alpha        90.00
_cell.angle_beta         90.00
_cell.angle_gamma        90.00

flautodipan · 2021-07-19T09:17:31Z

Ok now effectively it is working and it is much simpler.
I don't know why I get a little bit confused, thanks for helping.

So to resume:

using the OpenMM topology generated by the PDBxfile module in simtk.openmm.pdbxfile do create a MDA universe from mmCIF files.

orbeckst · 2021-07-19T17:55:20Z

We have to improve our documentation to make clearer that users almost never need to do anything directly with the Readers and Converters.

@lilyminium @IAlibay @fiona-naughton this might be something to keep in mind going forward. Maybe start with a Interoperability in practice blog post (collecting some of @jbarnoud 's examples from the workshop, too) and then see how we can make a User Guide entry from that?

joaomcteixeira · 2021-11-22T20:47:33Z

The solution proposed by @orbeckst using pdbxfile does work. However, I would like to see @JoaoRodrigues initiative pushed forward. I think his implementation is fully compatible with version 2. I am talking with Joao to have it Pull Requested.

orbeckst · 2022-05-17T07:25:31Z

The mmCIF/PDBx format would also be needed for alphafold #3377 .

razvanmarinescu · 2022-08-16T05:56:06Z

does anyone have a solution for writing an MDAnalysis universe as PDBx/mmCIF?

joaomcteixeira · 2022-08-16T08:27:34Z

you could use pdb_tocif tool from pdb-tools applied to all PDB files from a trajectory and then pdb_mkensemble to merge them.

razvanmarinescu · 2022-08-16T16:48:25Z

I just tried pdb_tocif, but it screws up the segids. It should have segid of B0, or B1, ... It only keeps the chain (B). See the question marks below:

ATOM   374778 H  HA     ?     B   ? 116   ?    352.061     37.526    138.818      1.000      0.000 ? 116       B HA   1
ATOM   374779 C  CB     ?     B   ? 116   ?    353.637     37.061    140.192      1.000      0.000 ? 116       B CB   1
ATOM   374780 H  HB2    ?     B   ? 116   ?    353.967     36.114    140.621      1.000      0.000 ? 116       B HB2  1
ATOM   374781 H  HB3    ?     B   ? 116   ?    354.280     37.265    139.336      1.000      0.000 ? 116       B HB3  1
ATOM   374782 C  CG     ?     B   ? 116   ?    353.874     38.157    141.231      1.000      0.000 ? 116       B CG   1

JoaoRodrigues · 2022-08-16T18:37:28Z

Sorry to cross-post on a different project, but could you share an input file?pdb_tocif is not designed to handle multi-character chain identifiers. Feel free to open an issue in the pdb-tools repo.

marinegor · 2024-05-15T10:51:22Z

I wonder if it'd be ok to use gemmi library (link) as a dependency to read mmcif and other cif-like formats consistently?

IMO it's one of the best supported crystallography-related libraries, is maintained by ccp4 and globalphasing, and allows very detailed cif parsing and writing.

For example, reading all atom ID and coordinates could be done as simple as:

# read only ATOM groups of chain 

model = gemmi.read_structure('6rz6.cif')[0]
residues = [res for res in model['A'] if res.het_flag == 'A'] # reading non-heteroatom atoms for simplicity
arr = np.array([[at.serial, *at.pos.tolist()] for res in residues for at in res])

ids, xyzs = arr[:, 0].astype(int), arr[:, 1:]

the whole discussion with devs here

hmacdope · 2024-05-17T09:59:41Z

@marinegor I think @richardjgowers has some ideas in this area

marinegor · 2024-05-20T15:44:27Z

@richardjgowers could you share them (here, if it's appropriate place, or somewhere on discord)?

richardjgowers · 2024-05-23T07:47:46Z

@marinegor I'd done this at a hackathon: #4303

The problem with this approach is that it doesn't do the "table join" on conect records that mmcif relies on, so you won't get some data (bonds).

I've since done this: https://github.com/OpenFreeEnergy/pdbinf which does do the "table join" to get bonds. It's into rdkit format, but conversion is trivial...

JoaoRodrigues changed the title ~~PDBx/mmCIF Reader/Topology Reader~~ [Feature][Improvement] PDBx/mmCIF Reader/Topology Reader Oct 15, 2019

JoaoRodrigues changed the title ~~[Feature][Improvement] PDBx/mmCIF Reader/Topology Reader~~ [Feature][Contrib] PDBx/mmCIF Reader/Topology Reader Oct 15, 2019

orbeckst added Component-Readers Component-Topology labels Oct 17, 2019

orbeckst added enhancement Format-PDB labels Oct 17, 2019

JoaoRodrigues changed the title ~~[Feature][Contrib] PDBx/mmCIF Reader/Topology Reader~~ PDBx/mmCIF Reader/Topology Reader Oct 17, 2019

ahy3nz mentioned this issue Aug 25, 2020

Converter for OpenMM #2917

Merged

4 tasks

Luthaf mentioned this issue Jul 9, 2021

CIF/mmCIF improvements chemfiles/chemfiles#407

Merged

orbeckst mentioned this issue Nov 4, 2021

PDB writer does not reorder atoms as required by PDB standard #3452

Open

orbeckst mentioned this issue Nov 22, 2021

conda-forge package Electrostatics/mmcif_pdbx#60

Open

joaomcteixeira mentioned this issue Nov 23, 2021

Add mmcif compatibility to mdanalysis library joaomcteixeira/taurenmd#57

Merged

IAlibay mentioned this issue Sep 25, 2023

[MDAKit / Hackathon] New file formats MDAnalysis/UGM2023#27

Open

Michael-C-Strobel mentioned this issue Apr 6, 2024

CIF format tiwarylab/GrASP#5

Closed

marinegor linked a pull request Sep 20, 2024 that will close this issue

Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDBx/mmCIF Reader/Topology Reader #2367

PDBx/mmCIF Reader/Topology Reader #2367

JoaoRodrigues commented Oct 15, 2019 •

edited

Loading

orbeckst commented Oct 17, 2019

JoaoRodrigues commented Oct 17, 2019

orbeckst commented Jul 30, 2020

orbeckst commented Jul 30, 2020

orbeckst commented Jul 8, 2021 •

edited

Loading

Luthaf commented Jul 8, 2021

Luthaf commented Jul 8, 2021

orbeckst commented Jul 8, 2021 via email

flautodipan commented Jul 16, 2021

orbeckst commented Jul 16, 2021 •

edited

Loading

flautodipan commented Jul 19, 2021

orbeckst commented Jul 19, 2021

joaomcteixeira commented Nov 22, 2021

orbeckst commented May 17, 2022

razvanmarinescu commented Aug 16, 2022

joaomcteixeira commented Aug 16, 2022

razvanmarinescu commented Aug 16, 2022 •

edited

Loading

JoaoRodrigues commented Aug 16, 2022

marinegor commented May 15, 2024 •

edited

Loading

hmacdope commented May 17, 2024

marinegor commented May 20, 2024

richardjgowers commented May 23, 2024

PDBx/mmCIF Reader/Topology Reader #2367

PDBx/mmCIF Reader/Topology Reader #2367

Comments

JoaoRodrigues commented Oct 15, 2019 • edited Loading

orbeckst commented Oct 17, 2019

JoaoRodrigues commented Oct 17, 2019

orbeckst commented Jul 30, 2020

orbeckst commented Jul 30, 2020

orbeckst commented Jul 8, 2021 • edited Loading

Luthaf commented Jul 8, 2021

Luthaf commented Jul 8, 2021

orbeckst commented Jul 8, 2021 via email

flautodipan commented Jul 16, 2021

orbeckst commented Jul 16, 2021 • edited Loading

flautodipan commented Jul 19, 2021

orbeckst commented Jul 19, 2021

joaomcteixeira commented Nov 22, 2021

orbeckst commented May 17, 2022

razvanmarinescu commented Aug 16, 2022

joaomcteixeira commented Aug 16, 2022

razvanmarinescu commented Aug 16, 2022 • edited Loading

JoaoRodrigues commented Aug 16, 2022

marinegor commented May 15, 2024 • edited Loading

hmacdope commented May 17, 2024

marinegor commented May 20, 2024

richardjgowers commented May 23, 2024

JoaoRodrigues commented Oct 15, 2019 •

edited

Loading

orbeckst commented Jul 8, 2021 •

edited

Loading

orbeckst commented Jul 16, 2021 •

edited

Loading

razvanmarinescu commented Aug 16, 2022 •

edited

Loading

marinegor commented May 15, 2024 •

edited

Loading