@Nora-Khalil
Contributor

@Nora-Khalil Nora-Khalil commented Nov 13, 2025

Motivation or Problem

I noticed that some nodes in certain families are not being regularized. For example, in groups.py of the Retroene family, the entry below specifies that the *5 atom can be a fluorine, even though the family's training set contains no fluorinated reactions.

entry(
    index = 93,
    label = "Root_N-4R!H->C",
    group = 
"""
1 *3 R!H                       u0 r0 {2,S} {3,S}
2 *4 R!H                       u0 r0 {1,S} {4,S}
3 *2 C                         u0 {1,S} {5,[D,T,B]}
4 *5 [I,P,Br,Cl,N,Si,S,F,Li,O] u0 r0 {2,S} {6,S}
5 *1 R!H                       u0 {3,[D,T,B]}
6 *6 [H,Li]                    u0 r0 {4,S}
""",
    kinetics = None,
)

This indicates that node regularization was not carried out on this node. My understanding is that regularization limits how "general" a node is and ensures that the node fits its reaction data "tightly." Successful regularization would mean that this node does not explicitly list atoms like I, P, Br, Cl, Si, S, F, or Li if they do not appear in the training reaction set. Numerous other nodes in this family are not regularized, and the same is occurring in other families. This is a problem because a fluorinated reaction descending this tree would fall to a deeper node with fewer training reactions than the Root, and those fewer training reactions are not at all representative of the fluorinated reaction we want to estimate a rate for. I could see this being an issue for other chemistries involving Br, Cl, Si, S, Li, etc.
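As a rough illustration of what "fitting the data tightly" means here, the sketch below keeps only the atomtypes of a node that actually occur in its training reactions. It is a simplified stand-in, not RMG-Py's regularization routine: the function name `tighten_atomtypes` and the training elements are hypothetical.

```python
def tighten_atomtypes(node_atomtypes, training_atomtypes):
    """Keep only the node's atomtypes that occur in its training data.

    Hypothetical helper for illustration; RMG-Py works from stored
    regularization dimensions rather than a plain list like this.
    """
    observed = set(training_atomtypes)
    kept = [a for a in node_atomtypes if a in observed]
    # If nothing overlaps, leave the node unchanged rather than emptying it.
    return kept if kept else list(node_atomtypes)


# Atomtype list of *5 in the entry above.
node = ["I", "P", "Br", "Cl", "N", "Si", "S", "F", "Li", "O"]
# Made-up elements seen at *5 in the training reactions (assumption).
training = ["O", "O", "N", "O"]

tightened = tighten_atomtypes(node, training)
print(tightened)
```

With these made-up inputs, F and the other unobserved elements drop out of the list, so a fluorinated reaction would no longer match this node.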

After some digging, I think regularization is being skipped because:

  1. Complement nodes are missing the regularization dimensions that describe how the node should be regularized (this information is only passed to the extension node and not its complement). I fix this by passing the info to the complement. However, even when the complement receives the regularization info, if that complement is split again via an atomExt extension as the tree is extended further, you run into issue 2.
  2. When a parent node is split via an atomExt extension (which splits the parent node into an extension node and a complement node), the regularization information of the parent node is overwritten with that of the new extension node (which resets it to an empty list).

Description of Changes

This PR takes a different approach to passing the regularization information to each node. After an extension is selected, if that extension is of type atomExt (i.e. it changes an atom's atomtype, e.g. R -> C) and splits the training reactions of the parent node into an extension node and a complement node, then regularization info is passed to the extension node AND the complement node. The complement's regularization info is found by analyzing all of the reactions that fit the complement node and determining which atomtypes appear in those reactions. Once this information is found, regularization info is also passed to the parent node (so it is no longer an empty list).
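The idea can be sketched in plain Python, under the assumption that each training reaction is reduced to the element found at the labeled atom being extended. The function name `split_regularization_info` and the input data are hypothetical; the actual change operates on RMG-Py group objects and their regularization dimensions.

```python
def split_regularization_info(training_symbols, extension_atomtype):
    """Collect the atomtypes seen on each side of an atomExt split.

    training_symbols: element at the labeled atom, one per training reaction.
    extension_atomtype: the atomtype chosen for the new extension node.
    Hypothetical helper for illustration only.
    """
    extension = sorted({s for s in training_symbols if s == extension_atomtype})
    complement = sorted({s for s in training_symbols if s != extension_atomtype})
    parent = sorted(set(training_symbols))
    return {"extension": extension, "complement": complement, "parent": parent}


# Made-up *5 elements for a handful of training reactions (assumption).
symbols_at_star5 = ["O", "N", "C", "C", "N"]
info = split_regularization_info(symbols_at_star5, "O")
print(info)
```

The point of the sketch is that all three nodes end up with a non-empty list, so none of them is left without information to regularize against.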

Testing

Simplest testing is with Retroene family.

  1. load instance of database
  2. clean Retroene tree
  3. generate tree
  4. Identify nodes that are not regularized (will have atomtype [Si,Li,S,N,P,F,Br,I,Cl,C])
  5. regularize family
  6. re-check identified nodes from step 4. If things work, these nodes should have more specific atomtypes than before.
  7. check entire tree for isomorphism.
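Step 4 can be partly automated by scanning the text of each group for unusually long atomtype lists. The sketch below is a heuristic under stated assumptions: it works on adjacency-list strings like the ones shown in this thread, the function name `find_broad_atoms` is hypothetical, and the `min_types` threshold of 8 is an arbitrary choice, not something defined by RMG-Py.

```python
def find_broad_atoms(adjlist, min_types=8):
    """Return (label, atomtypes) for atoms with a long atomtype list.

    Heuristic only: a long explicit list suggests the atom was never
    tightened against its training reactions.
    """
    flagged = []
    for line in adjlist.strip().splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        if tokens[1].startswith("*"):
            if len(tokens) < 3:
                continue
            label, atomtype = tokens[1], tokens[2]
        else:
            # Unlabeled atom: the atomtype directly follows the index.
            label, atomtype = "", tokens[1]
        if atomtype.startswith("[") and atomtype.endswith("]"):
            types = atomtype[1:-1].split(",")
            if len(types) >= min_types:
                flagged.append((label, types))
    return flagged


adjlist = """
1 *3 R!H                       u0 {2,S} {3,[S,D]}
2 *4 R!H                       u0 {1,S} {4,[S,D]}
3 *2 R!H                       u0 {1,[S,D]} {5,[D,T,B]}
4 *5 [Si,Li,S,N,P,F,Br,I,Cl,C] u0 {2,[S,D]} {6,S}
5 *1 R!H                       u0 {3,[D,T,B]}
6 *6 [H,Li]                    u0 {4,S}
"""
flagged = find_broad_atoms(adjlist)
print(flagged)
```

Running the same scan before and after step 5 gives a quick list of nodes to compare in step 6.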

@Nora-Khalil
Contributor Author

Use testing_regularization.ipynb for testing.

@Nora-Khalil Nora-Khalil marked this pull request as ready for review November 14, 2025 14:36
@Nora-Khalil
Contributor Author

Nora-Khalil commented Nov 14, 2025

As a sanity check to see if this would work on a family other than Retroene, I ran this fix on 1+2_Cycloaddition family. Tree is generated and regularized without errors and passes final isomorphism check. Here's a file that shares the nodes that were unregularized, and the resulting group after regularization with my fix: 1+2_Cycloaddition.txt

…tead of using their atom symbol to make an atomtype
@mjohnson541
Contributor

New groups and their complements are regularized at the same time in the same way:

grpc.atoms[indcr[0]].reg_dim_atm = list(reg_val)
So it would be strange if the complement were missing regularization information while the new group has it.

I was able to identify an issue with leaf nodes though (and Root_N-4R!H->C is a leaf node). Since regularization dimensions are computed when we extend a group, leaf nodes that we never needed to extend seem to have only the regularization dimensions copied from their parent. I added a step in PySIDT regularization that runs extension expansion on each leaf node before regularizing it (without the recursive extension expansion: the split exits before collecting groups to be expanded, which is quite quick), and that seemed to fix things.

@Nora-Khalil
Contributor Author

Nora-Khalil commented Nov 17, 2025

Thanks for the input! I'll try that.

During my debugging, I've been regenerating the Retroene family. For some reason, my regenerated tree doesn't split the root node to Root_N-4R!H->C (even before any of my changes). Instead, the first split is below:

Root_4R!H->O (new group)
1 *3 R!H    u0 {2,S} {3,[S,D]}
2 *4 R!H    u0 {1,S} {4,[S,D]}
3 *2 R!H    u0 {1,[S,D]} {5,[D,T,B]}
4 *5 O      u0 {2,[S,D]} {6,S}
5 *1 R!H    u0 {3,[D,T,B]}
6 *6 [H,Li] u0 {4,S}

Root_N-4R!H->O (complement)
1 *3 R!H                       u0 {2,S} {3,[S,D]}
2 *4 R!H                       u0 {1,S} {4,[S,D]}
3 *2 R!H                       u0 {1,[S,D]} {5,[D,T,B]}
4 *5 [Si,Li,S,N,P,F,Br,I,Cl,C] u0 {2,[S,D]} {6,S}
5 *1 R!H                       u0 {3,[D,T,B]}
6 *6 [H,Li]                    u0 {4,S}

where the complement is unregularized. So I've been focusing on this complement and why it wasn't regularized. I don't think this is a leaf node (I'm assuming a leaf node means there are no extensions beyond this node?). It looks like the Root node had 68 reactions and this complement node has 67 reactions. The templateRxnMap shows many different nodes whose labels start with a similar node label, which I think means that there were extensions branching from this node (please correct me if I am wrong).

[Screenshot 2025-11-17 at 8 51 08 AM]

I think I'm seeing this issue in non-leaf nodes then.
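The label-prefix reasoning above can be written as a small check. This is a sketch that assumes child labels are formed by appending "_" plus a suffix to the parent label, which is how the labels in this thread appear to be built; the function name `has_descendants` and the child labels in the example are made up.

```python
def has_descendants(label, all_labels):
    """Guess whether a node has children from the label naming pattern.

    Assumes a child's label is the parent's label followed by "_...".
    This is a naming heuristic, not a traversal of the actual tree.
    """
    prefix = label + "_"
    return any(other.startswith(prefix) for other in all_labels)


# Keys as they might appear in a templateRxnMap; the two child labels
# are hypothetical placeholders.
labels = [
    "Root",
    "Root_4R!H->O",
    "Root_N-4R!H->O",
    "Root_N-4R!H->O_child_a",
    "Root_N-4R!H->O_child_b",
]

print(has_descendants("Root_N-4R!H->O", labels))
print(has_descendants("Root_4R!H->O", labels))
```

A more reliable check would inspect the node's children on the tree entry itself, since prefix matching only reflects the naming convention.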

After running family.regularize with this proposed fix, I can get the complement to regularize to:

Root_N-4R!H->O
1 *3 R!H    u0 {2,S} {3,[S,D]}
2 *4 R!H    u0 {1,S} {4,[S,D]}
3 *2 C      u0 {1,[S,D]} {5,[D,T,B]}
4 *5 [N,C]  u0 {2,[S,D]} {6,S}
5 *1 R!H    u0 {3,[D,T,B]}
6 *6 [H,Li] u0 {4,S}

whereas before, running regularize did not change the node at all.

From debugging, it looks like the regularization info in reg_val (the second list in the tuple) is an empty list in the cases where regularization is missing in the complement. Apologies for my wording in this PR: I believe the complement is getting passed the regularization dictionary, as you showed in line 3005, but in the cases where the complement isn't regularized, this regularization dictionary just had an empty list (and essentially no regularization info). Was that intended? I think this is specifically an issue with nodes that are both 1) a complement and 2) then further extended via an atomExt, since the reg_val then always has an empty list for the regularization info.

I'm curious to know what you think about this new Root_N-4R!H->O node. In addition, I'm also encountering the issue you mentioned with leaf nodes, so thanks for suggesting a fix for that. Thanks so much for the input so far!

@Nora-Khalil
Contributor Author

Nora-Khalil commented Nov 18, 2025

Another comment, mostly to keep track of my thinking. There is definitely a more sophisticated way to fix this issue that doesn't involve reading in atomtypes directly from the training reactions of that node. It's probably better to keep with the architecture that Matt already has.

Similar to the fix with the leaf nodes Matt mentioned, I see now that the problem should be solved if the problematic node then acts as a parent node and generates a further extension that is non-splitting on the atom of interest (even if that isn't the extension that is ultimately selected to extend the tree). This would update the reg_dict to include regularization information on the atom of interest in the parent (our problematic node).

i.e. our problematic node

Root_N-4R!H->O 
1 *3 R!H                       u0 {2,S} {3,[S,D]}
2 *4 R!H                       u0 {1,S} {4,[S,D]}
3 *2 R!H                       u0 {1,[S,D]} {5,[D,T,B]}
4 *5 [Si,Li,S,N,P,F,Br,I,Cl,C] u0 {2,[S,D]} {6,S}
5 *1 R!H                       u0 {3,[D,T,B]}
6 *6 [H,Li]                    u0 {4,S}

can be passed regularization info for *5 if a non-splitting extension could be found on that atom. However, some debugging is showing me that attempts to extend this node don't propose any non-splitting extensions on *5 (only on *2 and *6). That is why there's no regularization information for that atom in the group, and why it isn't regularized. I'm still figuring out how exactly get_extension_edge works, so there may be something I'm still missing...

Looks like all the training reactions at this node have their *5 atom as N or C, so I would think there should be a proposed extension on *5.

Going to focus now on figuring out 1) whether or not there should be a non-splitting extension on *5 and 2) if there shouldn't be, how can we still pass regularization information to *5?

@Nora-Khalil
Contributor Author

*5 does have a splitting extension that looks like this, where *5 is either N or C.
[Screenshot 2025-11-19 at 9 49 46 AM]

But I believe our problem would be solved if there was a non-splitting extension proposed that looked like this:

0  *3 R!H
1  *4 R!H
2  *2 R!H
3  *5 N, C
4  *1 R!H
5  *6 H, Li 

where *5 would be extended from atomtypes [Si,Li,S,N,P,F,Br,I,Cl,C] to [N,C].
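The distinction between a splitting and a non-splitting extension can be made concrete with a toy classifier. It assumes each training reaction is summarized by the element at *5, and that an extension "splits" when it matches some but not all of the node's training reactions. The function name `classify_extension` and the data are hypothetical, and this is not how get_extension_edge is implemented.

```python
def classify_extension(candidate_types, training_symbols):
    """Classify a candidate atomtype set against a node's training data.

    Returns "non-splitting" if every training reaction matches,
    "splitting" if only some match, and "no match" if none do.
    """
    candidate = set(candidate_types)
    matched = [s for s in training_symbols if s in candidate]
    if not matched:
        return "no match"
    if len(matched) == len(training_symbols):
        return "non-splitting"
    return "splitting"


# Made-up *5 elements: every reaction at the node has N or C there.
symbols_at_star5 = ["N", "C", "C", "N", "C"]

print(classify_extension(["N"], symbols_at_star5))
print(classify_extension(["C"], symbols_at_star5))
print(classify_extension(["N", "C"], symbols_at_star5))
```

In this toy setup only the combined [N,C] candidate matches every reaction, which is the kind of extension that would carry regularization information for *5 without dividing the node.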

Nora Khalil added 3 commits November 19, 2025 12:33
…ows problematic nodes to generate atomExt extensions that aren't node splitting if the optimization dimension of the regularization dictionary is more specific than the atomtype at the atom of interest being extended. For example, if the atomtype of an atom labeled *5 is [Si, F, Li, N, C, P, S] and the regularization dictionary has an optimization dimension that narrows down these atomtypes (i.e. reg_dim_atm[0] = <N,C>), then we can allow for atomExt extensions that change *5's atomtype to be [N,C] (rather than just [N] or just [C]). This way, we have an extension that narrows down *5 to <N,C> from [Si, F, Li, N, C, P, S] but also matches all of the training reactions at the node, so the regularization information (reg_dim_atm[1]) is passed to the group.
… regularization info is passed (but not actually extending them).