Verify that altloc labels (in PDB) are supported across XCA, loader, & F/E #1512

Open
4 tasks
mwinokan opened this issue Sep 12, 2024 · 22 comments
Labels
2024-06-14 mint Data dissemination 2 2024-09-17 olive data curation big items XChemAlign

Comments

@mwinokan
Collaborator

E.g. A71EV2A crystal 0515 has several ligands modelled at residue index 201. In the crystallographic PDB files they have the alternative site codes A and B to differentiate between the two models.

  • 1. @ConorFWild Are both instances of these ligands aligned separately into separate observations?
  • 2. @ConorFWild does the alternative site code make its way into the XCA output YAML files?
  • 3. @kaliif do you capture the alt site codes in the loader?
  • 4. @mwinokan to test the F/E once the above has been verified
@mwinokan mwinokan added XChemAlign 2024-06-14 mint Data dissemination 2 labels Sep 12, 2024
@mwinokan mwinokan moved this from Backlog to XChemAlign in Fragalysis Sep 12, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Fragalysis Sep 12, 2024
@kaliif
Collaborator

kaliif commented Sep 12, 2024

@mwinokan do you have the tarball?

@mwinokan
Collaborator Author

mwinokan commented Sep 19, 2024

@ConorFWild says that the alt conf is currently not identified as a separate ligand instance.

The bottom line is that ligand models/observations are missing from the f/e. The question is whether XCA should split the two models or whether we suggest a different SoP for modelling.

Conor is concerned that this would require every ligand instance to account for alternative sites/conformations. Conor says it is trivial to split the ligand models, but difficult to account for them across the whole pipeline.

Currently, Jasmin is concerned that 2343 for CHIKV_Mac shows the less desirable ligand model. Conor says that the first conformation is used and the altloc letter is ignored, and that only the residue numbers, not the name/altconf, are used to split the ligands.

One workaround is to model it as a separate residue with partial occupancy. A script that runs before XCA could do this, or it could be a function of the collator.
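
A minimal sketch of what such a pre-XCA script might look like, assuming fixed-width PDB columns and ligands named LIG. The function name `altlocs_to_separate_residues` and the choice of "next unused residue number in the chain" are assumptions for illustration, not an agreed SoP. Occupancy columns are left untouched, so each new residue keeps its partial occupancy.

```python
def is_atom(line):
    return line.startswith(("ATOM", "HETATM"))


def altlocs_to_separate_residues(lines, lig_names=("LIG",)):
    """Rewrite every ligand altloc as its own residue (hypothetical helper)."""
    # Next unused residue number per chain, so new residues never collide.
    next_free = {}
    for line in lines:
        if is_atom(line):
            chain = line[21]
            next_free[chain] = max(next_free.get(chain, 1), int(line[22:26]) + 1)

    assigned = {}  # (chain, residue number, altloc) -> residue number to write
    out = []
    for line in lines:
        is_lig_alt = is_atom(line) and line[17:20] in lig_names and line[16] != " "
        if not is_lig_alt:
            out.append(line)
            continue
        chain, resnum, altloc = line[21], int(line[22:26]), line[16]
        key = (chain, resnum, altloc)
        if key not in assigned:
            if any(k[:2] == (chain, resnum) for k in assigned):
                # A later altloc of the same residue: give it a fresh number.
                assigned[key] = next_free[chain]
                next_free[chain] += 1
            else:
                # The first altloc seen keeps the original residue number.
                assigned[key] = resnum
        # Blank the altloc (column 17) and rewrite the residue number (columns 23-26).
        out.append(f"{line[:16]} {line[17:22]}{assigned[key]:>4}{line[26:]}")
    return out
```

For a ligand modelled at one residue number with altlocs A and B, the A atoms would keep that number and the B atoms would move to the next unused residue number in the same chain, both with a blank altloc.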

@tdudgeon says that if the ligand has different names it will require #1476, but we can assume here that all ligands are named LIG

@tdudgeon to talk to @mwinokan to iron out details

@mwinokan
Collaborator Author

@tdudgeon has implemented the validation in the collator that raises an error if there are multiple models for the same ligand (same residue number)

@mwinokan mwinokan moved this from In Progress (DEV) to Dev Done - Do review (DEV) in Fragalysis Sep 24, 2024
@mwinokan
Collaborator Author

mwinokan commented Oct 1, 2024

@tdudgeon and @mwinokan to discuss again how the collator can add validation logic to throw an error if:

Any ATOM/HETATM lines with the same chain letter AND residue number have multiple different alt site letters

@mwinokan
Collaborator Author

mwinokan commented Oct 1, 2024

@tdudgeon this python snippet should be all you need for the collator:

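from rich import print

# Residue names treated as ligands, and the crystallographic PDB file to check.
EXPECTED_LIG_NAMES = ["LIG"]
FILE = "A0515a.pdb"

# (chain, residue_number) -> set of altloc codes seen for that residue
lookup_dict = {}

with open(FILE) as f:
    for line in f:
        # Only coordinate records carry altloc codes.
        if not line.startswith(("ATOM", "HETATM")):
            continue

        # Fixed-width PDB columns: altloc 17, residue name 18-20,
        # chain 22, residue number 23-26.
        residue_name = line[17:20].strip()
        if residue_name not in EXPECTED_LIG_NAMES:
            continue

        residue_number = int(line[22:26])
        chain = line[21:22].strip()
        alt_code = line[16:17].strip() or None  # None means no altloc

        lookup_dict.setdefault((chain, residue_number), set()).add(alt_code)

# More than one altloc code on the same (chain, residue) means multiple models.
for key, alt_codes in lookup_dict.items():
    if len(alt_codes) > 1:
        print(
            f"[red bold]Error: Ligand in (chain, residue_number)={key} has multiple models"
        )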

@phraenquex phraenquex changed the title Verify that alternative ligand sites are supported across XCA, loader, & F/E Verify that altloc labels (in PDB) are supported across XCA, loader, & F/E Oct 10, 2024
@mwinokan
Collaborator Author

@tdudgeon has previously said that the combi-soak data was a prerequisite, as the ligand names would otherwise be unknown (in the metadata).

@tdudgeon now thinks that there is no way for alternative conformations to be valid as there should never be two ligand models within the same residue number. @phraenquex says this may be necessary to correctly tell crystallographic software that there can be multiple residue models in the same place

@mwinokan
Collaborator Author

@phraenquex says that the altlocs and residue numbers need to be carried through to all the aligned files as they are in the crystallographic files, but observations must be split by the collator.

@phraenquex says to aggregate all residues with no alt site and "A" together as one PDB and the same for "B", etc. if present in the crystallographic file
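
The aggregation rule above could be sketched roughly as follows. This is only an illustration under the assumption that altlocs are consistently lettered across the whole file; `split_by_altloc` is a hypothetical name, and the altloc letters are deliberately left in the output lines so they are carried through as in the crystallographic file.

```python
def is_atom(line):
    return line.startswith(("ATOM", "HETATM"))


def split_by_altloc(lines):
    """Return {altloc: lines}, one entry per altloc letter found in the file.

    Atoms with a blank altloc (column 17) and all non-coordinate records
    are copied into every entry.
    """
    altlocs = sorted({l[16] for l in lines if is_atom(l) and l[16] != " "})
    if not altlocs:
        # Nothing to split: return the file unchanged under an empty key.
        return {"": list(lines)}
    return {
        alt: [l for l in lines if not is_atom(l) or l[16] in (" ", alt)]
        for alt in altlocs
    }
```

Each value can then be written out as its own PDB file, for example with `"\n".join(files["A"])`.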

This may need some brainstorming between @tdudgeon, @ConorFWild, and @mwinokan to find the most elegant way to do this.

@tdudgeon
Collaborator

I'm not sure that XCA has to deal with altlocs for ligands as it seems that NGLViewer can already handle this.
In 0515 example for A71EV2A that is mentioned above this is how that ligand appears in NGL by default:
Image
Clearly 2 different locations for the same ligand.

Using NGL's selection language the A or the B conformation can be displayed. Here it is using 201:A%A:
Image
Clearly just the A conformation. The same can be done to display the B.

The way XCA handles this at present is to generate an aligned PDB file for that observation that contains the ligand like this

HETATM 1098  N  ALIG A 201       5.450  10.239  26.602  0.36 23.03           N  
HETATM 1099  C  ALIG A 201       9.993  11.982  23.113  0.36 28.58           C  
HETATM 1100  O  ALIG A 201       9.625  11.851  24.507  0.36 27.19           O  
HETATM 1101  C1 ALIG A 201       8.596  12.588  24.916  0.36 26.12           C  
HETATM 1102  C2 ALIG A 201       7.491  11.761  25.467  0.36 24.46           C  
HETATM 1103  C3 ALIG A 201       6.577  12.343  26.343  0.36 23.76           C  
HETATM 1104  C4 ALIG A 201       5.591  11.538  26.881  0.36 23.60           C  
HETATM 1105  C5 ALIG A 201       6.318   9.695  25.748  0.36 23.24           C  
HETATM 1106  C6 ALIG A 201       7.355  10.387  25.158  0.36 23.73           C  
HETATM 1107  N1 ALIG A 201       8.228   9.709  24.321  0.36 23.02           N  
HETATM 1108  O1 ALIG A 201       8.578  13.793  24.886  0.36 26.27           O  
HETATM 1109  N  BLIG A 201       6.128  12.869  26.362  0.35 19.90           N  
HETATM 1110  C  BLIG A 201       9.663   8.717  22.997  0.35 16.64           C  
HETATM 1111  O  BLIG A 201       8.816   9.439  23.923  0.35 17.57           O  
HETATM 1112  C1 BLIG A 201       8.968  10.756  23.924  0.35 18.94           C  
HETATM 1113  C2 BLIG A 201       7.997  11.466  24.809  0.35 19.60           C  
HETATM 1114  C3 BLIG A 201       8.022  12.860  24.881  0.35 20.24           C  
HETATM 1115  C4 BLIG A 201       7.072  13.500  25.657  0.35 19.98           C  
HETATM 1116  C5 BLIG A 201       6.112  11.531  26.315  0.35 20.73           C  
HETATM 1117  C6 BLIG A 201       7.019  10.773  25.555  0.35 20.41           C  
HETATM 1118  N1 BLIG A 201       6.924   9.401  25.568  0.35 21.16           N  
HETATM 1119  O1 BLIG A 201       9.807  11.311  23.272  0.35 20.73           O  
TER    1120      LIG A 201    

i.e. both conformations are carried through (also note what appears to be an incorrect use of the TER record, which has been noted before). When displayed in NGL that aligned PDB file also shows both conformations.

The problem for Fragalysis is that it is the extracted Molfile that is used to display the ligand, and that extracted Molfile only contains a single representation (strangely the B one) as the Molfile format does not support multiple conformations.

The solution might be either:

  1. Use the extracted PDB format for displaying the ligand, which does include the altlocs (though that could result in problems with bond orders as CONECT records are not present).
  2. Extract both conformations out to an SDfile (each to a separate record) and send that to the Fragalysis FE, which is configured to display all molecules in the SDF, not just the first.

The better, but more complex, solution would be to have Fragalysis display the complete PDB file, not the APO one and use NGL selection mechanisms to show/hide the ligand (this would also properly address the covalent ligand problem).

@mwinokan
Collaborator Author

Thanks for your investigation efforts @tdudgeon. At the end of the day we want each conformation of the ligand to be a separate site-observation in the database, and separate observation in the f/e.

Hence it should be split by the collator and aligned separately, right?

@tdudgeon
Collaborator

@mwinokan

At the end of the day we want each conformation of the ligand to be a separate site-observation in the database, and separate observation in the f/e.

If that is indeed the case then there is a lot of work that's needed in XCA to handle this.
But what I was getting at is isn't this the tail wagging the dog? The original supposition was that NGL could not handle altconfs, but we now know it can.
So rather than re-engineer the process to handle each alt conf as its own observation, have it handled as one, but say that there is more than one obvious interpretation of the same observation.

2 molecules, same site => 2 observations
1 molecule, 2 sites => 2 observations
1 molecule (with altconf), one site => 1 observation, 2 interpretations (actually, 3, either both are valid, or the density isn't good enough to decide)

@mwinokan
Collaborator Author

@phraenquex confirms that each alt conf of a ligand should be its own observation.

@tdudgeon says this will require significant changes to the collator but not aligner.

It is still unclear what changes will be needed for the loader, but there will likely be a need to carry through more metadata to trace the root ligand in the crystallographic data.

Additionally, Daren pointed out that there might not be categorical correlation between altloc letters across the structure, but since we are working locally @phraenquex says that we should assume that all the protein A locs go with the ligand A, for example.

@phraenquex says to ideally aggregate all nearby protein alt locs near the ligand with the same letter for alignment, but practically speaking it may not be necessary as LNA only looks at the nearby

@mwinokan mwinokan assigned tdudgeon and unassigned kaliif and ConorFWild Oct 17, 2024
@mwinokan
Collaborator Author

mwinokan commented Oct 22, 2024

@tdudgeon has progressed this and says that there may be no changes needed to the collator/aligner.

  • We can run the collator and aligner as normal, except for the last step, which extracts components from the different PDB files.

  • The aligner now generates a PDB with all the altlocs included; a new final step is needed to split this single file into one for each altloc.

But @phraenquex says that this may affect the alignment, as in extreme cases the protein may have shifted dramatically, and that it is indeed the collator that will need to split the PDBs into separate files to be aligned by the aligner.

There may be a need to review whether the aligner can accept these split files with @ConorFWild

@tdudgeon asks if we should correlate all the altlocs across the whole structure together, or if the splitting should occur within some radius of the ligand. @phraenquex says that it is easier to treat them all together, and that it should be implemented this way for now.

For altloc codes that are not present for the ligand, e.g. if the ligand only has A,B but the protein has A,B,C,D:

  • Split the ligand and protein by the altlocs A,B into two separate PDBs as if consistently correlated across the whole structure
  • Each of those two PDB files should also contain all the protein models C,D
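
A rough sketch of this ligand-driven rule, assuming ligands are named LIG and standard fixed-width PDB columns. The function name `split_by_ligand_altloc` is hypothetical; the only difference from a plain per-letter split is that altloc letters the ligand does not use (C and D in the example) are kept in every output.

```python
def is_atom(line):
    return line.startswith(("ATOM", "HETATM"))


def split_by_ligand_altloc(lines, lig_names=("LIG",)):
    """Return {ligand altloc: lines} following the two rules listed above."""
    atoms = [l for l in lines if is_atom(l)]
    # Only the altloc letters that the ligand itself uses drive the split.
    lig_altlocs = sorted(
        {l[16] for l in atoms if l[17:20] in lig_names and l[16] != " "}
    )
    if not lig_altlocs:
        return {"": list(lines)}
    files = {}
    for alt in lig_altlocs:
        files[alt] = [
            l
            for l in lines
            # Keep non-coordinate records, atoms without an altloc, atoms with
            # this altloc, and atoms whose altloc the ligand never uses.
            if not is_atom(l) or l[16] in (" ", alt) or l[16] not in lig_altlocs
        ]
    return files
```

With a ligand modelled as A,B and a protein modelled as A,B,C,D, this yields two entries: one with the A atoms plus all C and D atoms, and one with the B atoms plus all C and D atoms.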

@tdudgeon
Collaborator

I have initial code that is able to split out the altlocs into separate PDB files, but doing so hit a conceptual problem.
Currently aligner handles the identification of the ligands, but that behaviour will be slightly different now. A case to consider is the x0515 structure mentioned above. The ligands are defined like this:

HETATM 1654  N   LIG A 147      20.444  -5.815  13.647  0.62 27.90           N
HETATM 1655  C   LIG A 147      18.902  -4.139  19.134  0.62 27.44           C
HETATM 1656  O   LIG A 147      18.989  -5.360  18.360  0.62 27.86           O
HETATM 1657  C1  LIG A 147      18.208  -5.432  17.288  0.62 27.88           C
HETATM 1658  C2  LIG A 147      18.982  -5.565  16.023  0.62 28.50           C
HETATM 1659  C3  LIG A 147      20.273  -5.045  15.923  0.62 26.68           C
HETATM 1660  C4  LIG A 147      20.946  -5.194  14.724  0.62 27.72           C
HETATM 1661  C5  LIG A 147      19.206  -6.314  13.740  0.62 27.32           C
HETATM 1662  C6  LIG A 147      18.427  -6.213  14.893  0.62 28.11           C
HETATM 1663  N1  LIG A 147      17.146  -6.728  14.883  0.62 28.61           N
HETATM 1664  O1  LIG A 147      17.006  -5.398  17.343  0.62 31.32           O
HETATM 1676  N  ALIG A 201       6.458  11.643  26.136  0.36 23.03           N
HETATM 1677  C  ALIG A 201      10.968  13.087  22.471  0.36 28.58           C
HETATM 1678  O  ALIG A 201      10.628  13.033  23.878  0.36 27.19           O
HETATM 1679  C1 ALIG A 201       9.629  13.818  24.273  0.36 26.12           C
HETATM 1680  C2 ALIG A 201       8.514  13.050  24.886  0.36 24.46           C
HETATM 1681  C3 ALIG A 201       7.637  13.699  25.752  0.36 23.76           C
HETATM 1682  C4 ALIG A 201       6.641  12.949  26.349  0.36 23.60           C
HETATM 1683  C5 ALIG A 201       7.291  11.034  25.291  0.36 23.24           C
HETATM 1684  C6 ALIG A 201       8.333  11.668  24.646  0.36 23.73           C
HETATM 1685  N1 ALIG A 201       9.168  10.926  23.824  0.36 23.02           N
HETATM 1686  O1 ALIG A 201       9.644  15.020  24.185  0.36 26.27           O
HETATM 1687  N  BLIG A 201       7.203  14.238  25.755  0.35 19.90           N
HETATM 1688  C  BLIG A 201      10.545   9.831  22.519  0.35 16.64           C
HETATM 1689  O  BLIG A 201       9.739  10.620  23.427  0.35 17.57           O
HETATM 1690  C1 BLIG A 201       9.928  11.931  23.362  0.35 18.94           C
HETATM 1691  C2 BLIG A 201       8.997  12.710  24.232  0.35 19.60           C
HETATM 1692  C3 BLIG A 201       9.062  14.104  24.237  0.35 20.24           C
HETATM 1693  C4 BLIG A 201       8.148  14.808  25.001  0.35 19.98           C
HETATM 1694  C5 BLIG A 201       7.149  12.900  25.773  0.35 20.73           C
HETATM 1695  C6 BLIG A 201       8.017  12.081  25.031  0.35 20.41           C
HETATM 1696  N1 BLIG A 201       7.885  10.715  25.112  0.35 21.16           N
HETATM 1697  O1 BLIG A 201      10.767  12.430  22.666  0.35 20.73           O

Here residues 147 and 201 are the same molecule at 2 different locations, one of which (201) has an altloc.
Using the simplistic approach defined above we end up with 2 generated PDB files, one for the A altloc, the other for the B altloc.
But both of these also contain the 147 ligand as it does not have an altloc, so that would result in this observation being present twice. It should be fairly simple to have that ligand treated as a pseudo-altloc so that 3 PDB files were generated (for LIG, ALIG and BLIG), but my concern is that aligner is already doing some of this analysis and so it might be better to handle this in aligner.
@ConorFWild do you think it's the right approach for collator to split out the altlocs into separate PDBs so that aligner handles each one separately?
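
To make the pseudo-altloc idea concrete, here is a small sketch that lists ligand instances keyed by chain, residue number and altloc, with `None` standing in for "no altloc". The function name `ligand_instances` is hypothetical, and this says nothing about how the aligner currently identifies ligands.

```python
def ligand_instances(lines, lig_names=("LIG",)):
    """List unique (chain, residue number, altloc) ligand keys in file order."""
    seen = []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if line[17:20] not in lig_names:
            continue
        # A blank altloc column is treated as its own pseudo-altloc (None).
        key = (line[21], int(line[22:26]), line[16].strip() or None)
        if key not in seen:
            seen.append(key)
    return seen
```

For the records shown above this would give three instances, `("A", 147, None)`, `("A", 201, "A")` and `("A", 201, "B")`, matching the three PDB files (LIG, ALIG and BLIG).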

@mwinokan
Collaborator Author

mwinokan commented Oct 24, 2024

@phraenquex agrees that this is more elegantly handled in LNA. @tdudgeon please see if you can make sense of how this is handled in LNA and see what kind of changes would be necessary, for either yourself or Conor to implement

@ConorFWild says there are multiple places/ways to implement this

  • The XCA mechanism that identifies which ligands are present: separate LigandBindingEvent objects can be created for each altconf. N.B. this will need new metadata to differentiate the two events, as the longcode generation as currently implemented would be identical for both events, i.e. there would be various duplicate keys across the code. Conor also adds that the startup energy to familiarise themselves with this code base is large.
  • Alternatively, to save Conor time: Conor could reconfigure LNA to handle altconfs from the input ligand keys. Tim would then handle the XCA changes necessary
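
The duplicate-key concern can be illustrated with a small sketch. `LigandKey`, its fields and its `code()` format are hypothetical and are not the actual `LigandBindingEvent` class or longcode format in XCA; the point is only that including the altloc in the identifier keeps the two events distinct.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LigandKey:
    """Hypothetical identifier for one ligand model in one dataset."""

    dataset: str
    chain: str
    residue: int
    altloc: Optional[str] = None  # None when the ligand has no altloc

    def code(self) -> str:
        # Without the altloc suffix, both models at the same residue
        # would produce the same code and collide as dictionary keys.
        base = f"{self.dataset}/{self.chain}/{self.residue}"
        return f"{base}/{self.altloc}" if self.altloc else base


# Two models at the same chain and residue stay separate once altloc is included.
events = {
    LigandKey("x0515", "A", 201, "A"): "first model",
    LigandKey("x0515", "A", 201, "B"): "second model",
}
```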

Conor says that dealing with protein altconfs will need much more serious work in both LNA and XCA.

actions:

@tdudgeon please prepare a minimal dataset for Conor and begin modifying LigandBindingEvent to include altconfs.
@ConorFWild estimates 20 hours of LNA work minimum
@tdudgeon also please revisit creation of an SDF with all altconf molecules (new ticket #1556)

@tdudgeon
Collaborator

The molecule extraction has been modified to handle altlocs for ligands. The following are now generated for each observation:

  1. PDB file with just the corresponding ligand (including altlocs)
  2. molfile with all the alternative structures
  3. sdfile with a record for each alternative structure (this file is new)

As the FE already displays the molfile, it should now display multiple altlocs without the need for any changes.
The only further work needed would be if we needed any specific handling of the new sdfile.

This change is rolled out to the XCA staging environment.

@Waztom
Collaborator

Waztom commented Oct 29, 2024

@mwinokan to help with datasets to test this. Also need to confirm if the .sdf file generated is the download source vs. the mol file that is converted into the .sdf files downloaded.

@mwinokan
Collaborator Author

@tdudgeon's changes are in the XCA staging deployment

@kaliif: Small change needed in the target loader to use the SDF from XCA instead of generating an SDF from the .mol file

Eventually there may need to be f/e features to select which altlocs to display.

@phraenquex
Collaborator

phraenquex commented Nov 5, 2024

@tdudgeon has observed some issues in staging - something about how altloc is displayed (f/e), and something being switched. Could be an API issue.

@mwinokan will spec out for the F/E.

(Stays in "in progress")

@Waztom
Collaborator

Waztom commented Nov 5, 2024

Noticed centroid residue information is missing from 2A example. @kaliif says the centroid info is in the API endpoint. @matej-vavrek can you please have a look at this and confirm what is happening with the centroid API call?

@tdudgeon
Collaborator

tdudgeon commented Nov 5, 2024

I looked at the x0515 data again and it does seem to be displayed correctly in fragalysis, except that the multiple ligand conformations are not shown, which is probably expected.
@boriskovar-m2ms The frontend needs to do one of these two things:

  1. use the molfile for display of the ligand (the molfile will contain all the alternate conformations)
  2. use the SDF, but configure NGL to display all the records in the file (it probably defaults to displaying just the first).

The second is probably better as it leaves the possibility of showing/hiding individual conformations in the future.

@kaliif may still need to make a small change to the backend to ensure that the SDF in the upload is served up rather than one he generates from the molfile (which no longer needs to be done).
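
As a quick way to check that the SDF being served carries one record per conformation, the records can be counted without any chemistry toolkit, since each SDF record ends with a `$$$$` line. This is a minimal sketch; `split_sdf_records` is a hypothetical helper and not part of the loader or the frontend.

```python
def split_sdf_records(text):
    """Split SDF text into a list of records, one per '$$$$' terminator."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            # End of a record: store what was collected and start a new one.
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records
```

For a ligand with altlocs A and B, `len(split_sdf_records(sdf_text))` would be expected to be 2 if the XCA-generated SDF is the one being served.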

@mwinokan
Collaborator Author

mwinokan commented Nov 7, 2024

This will need work from @boriskovar-m2ms as he is familiar with the NGL implementation, after he is done with #1483

@tdudgeon please share in this ticket a screenshot (annotated?) of the NGL UI to help Boris find exactly which settings need changing

@tdudgeon
Collaborator

tdudgeon commented Nov 7, 2024

SDF (2 records) with default settings showing only the first record
Image

Setting changed to display all records:
Image

Projects
Status: In Progress (DEV)

6 participants