Error fix: Specify "standard_names=False" when loading pdb into mdtraj #552

jaketanderson · 2024-04-10T06:41:27Z

Description

This change fixes an error where, for certain host PDBs, pAPRika can lose CONECT bond information during the release_apply_parameters windows of APR calculations. The error is given here (some line numbers will differ):

Traceback (most recent call last):
  File [...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 1208, in _execute_protocol
    protocol.execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 695, in execute
    self._execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/protocols/paprika/forcefield.py", line 93, in _execute
    build_protocol.execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 695, in execute
    self._execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/protocols/forcefield.py", line 683, in _execute
    topology = Topology.from_openmm(
               ^^^^^^^^^^^^^^^^^^^^^
  File "[...]/openff/utilities/utilities.py", line 80, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/openff-toolkit/openff/toolkit/topology/topology.py", line 1495, in from_openmm
    raise ValueError(msg)
ValueError: No match found for molecule C. This would be a very unusual molecule to try and parameterize, and it is likely that the data source it was read from does not contain connectivity information. If this molecule is coming from PDB, please ensure that the file contains CONECT records. The PDB format documentation (https://www.wwpdb.org/documentation/file-format-content/format33/sect10.html) states "CONECT records are mandatory for HET groups (excluding water) and for other bonds not specified in the standard residue connectivity table."

Todos

Speciy that MDTraj should not change naming when loading PDB files in protocols/paprika/coordinates.py, to prevent an error.

Note

The file protocols/paprika/coordinates.py contains one other instance where an MDTraj trajectory is created using mdtraj.load_pdb(), located inside _atom_indices_by_role(), but I haven't encountered any errors related to that instance.

for more information, see https://pre-commit.ci

codecov · 2024-04-10T06:46:16Z

Codecov Report

Merging #552 (343f3be) into main (c4fbf16) will decrease coverage by 2.60%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files

mattwthompson · 2024-04-10T15:35:29Z

pAPRika can lose CONECT bond information during the release_apply_parameters windows of APR calculations

Could you elaborate on this a little on what's going on here (what triggers it, how often it happens, etc.)? I'm a little apprehensive about changing the default behavior of one of the main PDB loading pathways

jaketanderson · 2024-04-11T03:56:48Z

@mattwthompson I can appreciate your apprehension. I haven't encountered any unintended consequences of making the change this PR suggests, but that doesn't mean there are provably none. This issue has only started to occur for me with PDBs describing host-guest systems unique to the "steroids" branch of the TAPROOM dataset (meaning not found in the main branch). @jeff231li mentioned the error given seemed familiar, so he may have encountered it before, but it seems to be an overall rare issue. Here is a little script showing two example PDBs, one that does not have this issue and one that does. The PDB that does have this issue, "hse_x02", can be corrected when standard_names=False is specified in mdtraj.load_pdb() (line 18).

import mdtraj
import requests

urls = [
    "https://raw.githubusercontent.com/slochower/host-guest-benchmarks/steroids/taproom/systems/bcd/hex/b-hex-p.pdb",
    "https://raw.githubusercontent.com/slochower/host-guest-benchmarks/steroids/taproom/systems/hse/x02/b-x02-p.pdb",
]

system_names = ["bcd_hex", "hse_x02"]


def has_CONECT_issue(url, system_name):

    raw_pdb = requests.get(url)
    with open(f"{system_name}.pdb", "wb") as f:
        f.write(raw_pdb.content)

    mdtraj_system = mdtraj.load_pdb(f"{system_name}.pdb")  # Problem line

    # Here the mdtraj trajectories are sliced to only contain the atoms of the host molecule, identified by a list of indices.
    # The list of indices determined by pAPRika is 0..146 for bcd and 0..209 for hse
    if system_name == "bcd_hex":
        mdtraj_system_sliced = mdtraj_system.atom_slice(range(0, 147))
    elif system_name == "hse_x02":
        mdtraj_system_sliced = mdtraj_system.atom_slice(range(0, 210))

    mdtraj_system_sliced.save(f"{system_name}-mdtraj.pdb")

    with open(f"{system_name}-mdtraj.pdb", "r") as f:
        for line in f.readlines():
            if "CONECT" in line:
                return False
        return True


for i, url in enumerate(urls):
    print(
        f"System {system_names[i]} has the CONECT issue: {has_CONECT_issue(url, system_names[i])}\n"
    )

I ran this script with mdtraj 1.9.9 (py311h90fe790_1). Looking at the PDB files produced by running the script will allow you to directly see the difference between raw and mdtraj-processed CONECT records. I think the script highlights the issue sufficiently, but if it's unclear please let me know.

jaketanderson · 2024-04-11T04:15:25Z

Here is a list of the major differences between the bcd_hex PDB file and the hse_x02 PDB file:

hse_x02 has no CRYST line at the top
hse_x02 has no TER line after the host
hse_x02 has hydrogens after heavy atoms, instead of in-between heavy atoms
hse_x02 has 1 residue for the entire complex, with two chains (one for host, one for guest)
bcd_hex has no specified residues, with 1 chain per host monomer and 1 chain for the guest (7+1=8 total)

My guess to why the default mdtraj loading works for bcd systems and not hse systems is that it's due to the 4th and 5th difference, but I haven't checked this rigorously. The thing that makes me think this should be less of a "fix your inputs PDBs" solution and more of a "fix the way PDBs are handled" solution is that, to my understanding, the hse PDB is valid and therefore ought to be accepted.

mattwthompson · 2024-04-26T19:20:46Z

Sorry, I'm still a little confused - MDTraj can load those files in just fine, but gets confused when writing the CONECT records of the HSE host?

Any of the issues you see in the HSE file seem like data problems to me - why, for example, should one have a TER separating the host and guest and the others not?

Does pAPRika do alchemical transformations that muck with the bond graph?

I'm iffy on this as-is since these seem like input data issues, but if there was a try-except (try loading it as-is, and if that fails fall back to using this flag) I'd be more amenable to it.

jaketanderson and others added 2 commits April 9, 2024 23:29

Specify "standard_names=False" when loading pdb into mdtraj

7f6e654

[pre-commit.ci] auto fixes from pre-commit.com hooks

343f3be

for more information, see https://pre-commit.ci

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error fix: Specify "standard_names=False" when loading pdb into mdtraj #552

Error fix: Specify "standard_names=False" when loading pdb into mdtraj #552

jaketanderson commented Apr 10, 2024

codecov bot commented Apr 10, 2024

mattwthompson commented Apr 10, 2024

jaketanderson commented Apr 11, 2024 •

edited

Loading

jaketanderson commented Apr 11, 2024

mattwthompson commented Apr 26, 2024

Error fix: Specify "standard_names=False" when loading pdb into mdtraj #552

Are you sure you want to change the base?

Error fix: Specify "standard_names=False" when loading pdb into mdtraj #552

Conversation

jaketanderson commented Apr 10, 2024

Description

Todos

Note

codecov bot commented Apr 10, 2024

Codecov Report

mattwthompson commented Apr 10, 2024

jaketanderson commented Apr 11, 2024 • edited Loading

jaketanderson commented Apr 11, 2024

mattwthompson commented Apr 26, 2024

jaketanderson commented Apr 11, 2024 •

edited

Loading