Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error fix: Specify "standard_names=False" when loading pdb into mdtraj #552

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jaketanderson
Copy link
Contributor

Description

This change fixes an error where, for certain host PDBs, pAPRika can lose CONECT bond information during the release_apply_parameters windows of APR calculations. The error is given here (some line numbers will differ):

Traceback (most recent call last):
  File [...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 1208, in _execute_protocol
    protocol.execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 695, in execute
    self._execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/protocols/paprika/forcefield.py", line 93, in _execute
    build_protocol.execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 695, in execute
    self._execute(directory, available_resources)
  File "[...]/openff-evaluator/openff/evaluator/protocols/forcefield.py", line 683, in _execute
    topology = Topology.from_openmm(
               ^^^^^^^^^^^^^^^^^^^^^
  File "[...]/openff/utilities/utilities.py", line 80, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/openff-toolkit/openff/toolkit/topology/topology.py", line 1495, in from_openmm
    raise ValueError(msg)
ValueError: No match found for molecule C. This would be a very unusual molecule to try and parameterize, and it is likely that the data source it was read from does not contain connectivity information. If this molecule is coming from PDB, please ensure that the file contains CONECT records. The PDB format documentation (https://www.wwpdb.org/documentation/file-format-content/format33/sect10.html) states "CONECT records are mandatory for HET groups (excluding water) and for other bonds not specified in the standard residue connectivity table."

Todos

  • Speciy that MDTraj should not change naming when loading PDB files in protocols/paprika/coordinates.py, to prevent an error.

Note

The file protocols/paprika/coordinates.py contains one other instance where an MDTraj trajectory is created using mdtraj.load_pdb(), located inside _atom_indices_by_role(), but I haven't encountered any errors related to that instance.

Copy link

codecov bot commented Apr 10, 2024

Codecov Report

Merging #552 (343f3be) into main (c4fbf16) will decrease coverage by 2.60%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files

@mattwthompson
Copy link
Member

pAPRika can lose CONECT bond information during the release_apply_parameters windows of APR calculations

Could you elaborate on this a little on what's going on here (what triggers it, how often it happens, etc.)? I'm a little apprehensive about changing the default behavior of one of the main PDB loading pathways

@jaketanderson
Copy link
Contributor Author

jaketanderson commented Apr 11, 2024

@mattwthompson I can appreciate your apprehension. I haven't encountered any unintended consequences of making the change this PR suggests, but that doesn't mean there are provably none. This issue has only started to occur for me with PDBs describing host-guest systems unique to the "steroids" branch of the TAPROOM dataset (meaning not found in the main branch). @jeff231li mentioned the error given seemed familiar, so he may have encountered it before, but it seems to be an overall rare issue. Here is a little script showing two example PDBs, one that does not have this issue and one that does. The PDB that does have this issue, "hse_x02", can be corrected when standard_names=False is specified in mdtraj.load_pdb() (line 18).

import mdtraj
import requests

urls = [
    "https://raw.githubusercontent.com/slochower/host-guest-benchmarks/steroids/taproom/systems/bcd/hex/b-hex-p.pdb",
    "https://raw.githubusercontent.com/slochower/host-guest-benchmarks/steroids/taproom/systems/hse/x02/b-x02-p.pdb",
]

system_names = ["bcd_hex", "hse_x02"]


def has_CONECT_issue(url, system_name):

    raw_pdb = requests.get(url)
    with open(f"{system_name}.pdb", "wb") as f:
        f.write(raw_pdb.content)

    mdtraj_system = mdtraj.load_pdb(f"{system_name}.pdb")  # Problem line

    # Here the mdtraj trajectories are sliced to only contain the atoms of the host molecule, identified by a list of indices.
    # The list of indices determined by pAPRika is 0..146 for bcd and 0..209 for hse
    if system_name == "bcd_hex":
        mdtraj_system_sliced = mdtraj_system.atom_slice(range(0, 147))
    elif system_name == "hse_x02":
        mdtraj_system_sliced = mdtraj_system.atom_slice(range(0, 210))

    mdtraj_system_sliced.save(f"{system_name}-mdtraj.pdb")

    with open(f"{system_name}-mdtraj.pdb", "r") as f:
        for line in f.readlines():
            if "CONECT" in line:
                return False
        return True


for i, url in enumerate(urls):
    print(
        f"System {system_names[i]} has the CONECT issue: {has_CONECT_issue(url, system_names[i])}\n"
    )

I ran this script with mdtraj 1.9.9 (py311h90fe790_1). Looking at the PDB files produced by running the script will allow you to directly see the difference between raw and mdtraj-processed CONECT records. I think the script highlights the issue sufficiently, but if it's unclear please let me know.

@jaketanderson
Copy link
Contributor Author

Here is a list of the major differences between the bcd_hex PDB file and the hse_x02 PDB file:

  1. hse_x02 has no CRYST line at the top
  2. hse_x02 has no TER line after the host
  3. hse_x02 has hydrogens after heavy atoms, instead of in-between heavy atoms
  4. hse_x02 has 1 residue for the entire complex, with two chains (one for host, one for guest)
  5. bcd_hex has no specified residues, with 1 chain per host monomer and 1 chain for the guest (7+1=8 total)

My guess to why the default mdtraj loading works for bcd systems and not hse systems is that it's due to the 4th and 5th difference, but I haven't checked this rigorously. The thing that makes me think this should be less of a "fix your inputs PDBs" solution and more of a "fix the way PDBs are handled" solution is that, to my understanding, the hse PDB is valid and therefore ought to be accepted.

@mattwthompson
Copy link
Member

Sorry, I'm still a little confused - MDTraj can load those files in just fine, but gets confused when writing the CONECT records of the HSE host?

Any of the issues you see in the HSE file seem like data problems to me - why, for example, should one have a TER separating the host and guest and the others not?

Does pAPRika do alchemical transformations that muck with the bond graph?

I'm iffy on this as-is since these seem like input data issues, but if there was a try-except (try loading it as-is, and if that fails fall back to using this flag) I'd be more amenable to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants