Refactor matching code #14

Yoshanuikabundi · 2025-01-23T06:43:16Z

We should do this post-MVP. It'll make adding features and maintaining the code easier.

The basic abstract process of matching is:

1. Loop over all residues and find name-based matches from the database
2. Loop over name-based matches and filter for connectivity
3. Check any unmatched residues against unknown-molecules

At the moment, this is done like so:

get_all_matches
    loop over residue_indices
        Find name-based matches
    loop over name-based matches
        Find connectivity matches
    loop over connectivity matches
        find and yield crosslink matches
 
topology_from_pdb
    <parse PDB file>
    loop over get_all_matches
        check unmatched residues against unknown-molecules
        raise if residue is unmatched
        <build up topology with _add_to_molecule>
    <finalize and return>

This makes get_all_matches really big, and means the code jumps around a lot. Instead, we should do something like:

get_name_based_matches
    loop over residue_indices
        loop over residue_definitions
            yield subset_matches_residue

filter_on_connectivity
    loop over input matches with neighbours
        yield connectivity matches

filter_on_crosslinks
    loop over input matches (or CONECT records?)
        yield crosslink matches

match_residues
    get_name_based_matches
    filter_on_connectivity
    filter_on_crosslinks
    If no matches remain:
        match_unknown_molecules
    <Other fallbacks to be implemented in future>


build_up_topology
    loop over input matched residues
        <build up topology with _add_to_molecule>

topology_from_pdb
    <parse PDB file>
    loop over match_residues
        collect match failures or degeneracies
    raise if any residues were unmatched or multiply matched
    [we now have a complete description of the PDB]
    build_up_topology
    <finalize and return>
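
The proposed structure above could be sketched in Python as a chain of generators. This is only a rough illustration under toy assumptions, not the actual implementation: residues are plain name strings, `definitions` maps a definition name to the residue names it accepts, the connectivity and crosslink checks are caller-supplied predicates, and `Match`, `filter_matches` and `collect_matches` are hypothetical names invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Toy stand-in for a residue match (not the real ResidueMatch)."""

    residue_index: int
    definition: str


def get_name_based_matches(residue_names, definitions):
    """Yield a candidate for every (residue, definition) pair whose names agree."""
    for index, name in enumerate(residue_names):
        for definition, accepted_names in definitions.items():
            if name in accepted_names:
                yield Match(index, definition)


def filter_matches(matches, keep):
    """Generic lazy filter stage; connectivity and crosslinks each use one."""
    return (match for match in matches if keep(match))


def match_residues(residue_names, definitions, connectivity_ok, crosslinks_ok,
                   unknown_molecules):
    """Yield (residue_index, surviving_matches) for every residue."""
    candidates = get_name_based_matches(residue_names, definitions)
    candidates = filter_matches(candidates, connectivity_ok)
    candidates = filter_matches(candidates, crosslinks_ok)

    by_residue = defaultdict(list)
    for match in candidates:
        by_residue[match.residue_index].append(match)

    for index, name in enumerate(residue_names):
        matches = by_residue.get(index, [])
        # Fallback: only consulted when nothing else matched this residue.
        if not matches and name in unknown_molecules:
            matches = [Match(index, f"unknown:{name}")]
        yield index, matches


def collect_matches(residue_names, definitions, connectivity_ok, crosslinks_ok,
                    unknown_molecules):
    """Gather all failures and degeneracies first, then raise once."""
    chosen, problems = {}, {}
    for index, matches in match_residues(
        residue_names, definitions, connectivity_ok, crosslinks_ok,
        unknown_molecules,
    ):
        if len(matches) == 1:
            chosen[index] = matches[0]
        else:
            problems[index] = matches
    if problems:
        raise ValueError(
            f"unmatched or multiply matched residues: {sorted(problems)}"
        )
    return chosen
```

Because every stage is a generator, each one can be tested in isolation with a short list of hand-written matches, and a new fallback only needs to touch match_residues. Collecting problems before raising also means a single error can report every bad residue at once.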

That refactor will make maintaining, testing, and extending the code much easier. To support it, we could define a protocol/ABC ("trait" in RustSpeak) for residue matches. The existing ResidueMatch class would implement this protocol, and we would add a new class:

- MoleculeMatch for unique_molecules

Further classes could be added in future for additional substructures, bond inference, etc. _add_to_molecule would take the base class as input.


Yoshanuikabundi self-assigned this Jan 28, 2025

Yoshanuikabundi added Pablo Stage 2 and removed Pablo Stage 2 labels Jan 28, 2025

Yoshanuikabundi mentioned this issue Feb 12, 2025

Support original additional_substructures argument (requiring conectivity rather than atom names) #63

Open
