-
Notifications
You must be signed in to change notification settings - Fork 249
Modify node regularization in kinetic families. #2856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Modify node regularization in kinetic families. #2856
Conversation
|
Use testing_regularization.ipynb for testing. |
|
As a sanity check to see if this would work on a family other than Retroene, I ran this fix on 1+2_Cycloaddition family. Tree is generated and regularized without errors and passes final isomorphism check. Here's a file that shares the nodes that were unregularized, and the resulting group after regularization with my fix: 1+2_Cycloaddition.txt |
…tead of using their atom symbol to make an atomtype
|
New groups and their complements are regularized at the same time in the same way: RMG-Py/rmgpy/data/kinetics/family.py Line 3005 in d65c5f3
I was able to identify an issue with leaf nodes though (and Root_N-4R!H->C is a leaf node). Since regularization dimensions are computed when we extend a group, leaf nodes where we didn't need to extend seem to only have the copied regularization dimensions of their parent. I added a step in PySIDT regularization where it runs extension expansion (without recursive extension expansion for the split exiting before collecting groups to be expanded, which is quite quick) on each leaf node before regularizing it and that seemed to fix things. |
|
Another comment, mostly to keep track of my thinking. There is definitely a more sophisticated way to fix this issue that doesn't involve reading in atomtypes directly from the training reactions of that node. It's probably better to keep with the architecture that Matt already has. Similar to the fix with the leaf nodes Matt mentioned, I see now that the problem should be solved if the problematic node then acts as a parent node and generates a further extension that is non-splitting on the atom of interest (even if that isn't the extension that is ultimately selected to extend the tree). This would update the reg_dict to include regularization information on the atom of interest in the parent (our problematic node). i.e. our problematic node can be passed regularization info for *5 if a non-splitting extension could be found on that atom. However, some debugging is showing me that attempts to extend this node doesn't propose any non-splitting extensions on *5 (only on *2 and *6). Hence why there's no regularization information for that atom in the group, and why it isn't regularized. I'm still figuring out how exactly Looks like all the training reactions at this node have their *5 atom as N or C, so I would think there should be a proposed extension on *5. Going to focus now on figuring out 1) whether or not there should be a non-splitting extension on *5 and 2) if there shouldn't be, how can we still pass regularization information to *5? |
…ows problematic nodes to generate atomExt extensions that aren't node splitting if the optimization dimension of the regularization dictionary is more specific than the atomtype at the atom of interest being extended. For example, if the atomtype of an atom labeled *5 is [Si, F, Li, N, C, P, S] and the regulatization dictionary has an optimization dimension that narrows down these atomtypes (i.e. reg_dim_atm[0] = <N,C>), then we can allow for atomExt extensions that change *5's atomtype to be [N,C] (rather than just [N] or just [C]). This way, we have an extension that narrows down *5 to <N,C> from [Si, F, Li, N, C, P, S] but also matches all of the training reactions at the node, so the regularization information (reg_dim_atm{1]) is passed to the group.
… regularization info is passed (but not actually extending them).


Motivation or Problem
Noticed that the nodes of some families were not being regularized. For example, in groups.py of the Retroene family, the following entry (below) specifies that the *5 atom can be a fluorine, but the entire training set of the family includes zero fluorinated reactions.
This indicates that node regularization was not carried out on this node. My understanding of node regularization is that it is performed to limit how "general" a node is and ensures that the node fits it's reaction data "tightly." Successful regularization on this node would mean that this node should not explicitly specify atoms like I, P, Br, Cl, Si, S, F, or Li if they are not included in the training reaction set. There are numerous other nodes in this family that are not regularized, and this is also occurring in other families. This is a problem because, if you were descending a fluorinated reaction down this tree, it would fall to a deeper depth down the tree to a node with fewer training reactions compared to the Root, but those fewer training reactions are not at all representative of the fluorinated reaction that we want to estimate a rate for. I could see this being an issue for other chemistries involving Br, Cl, Si, S, Li, etc.
After some digging, I think regularization is being skipped because:
Description of Changes
Different approach to passing the regularization information to each node. After an extension is selected, if that extension is of the type
atomExt(i.e. changing an atom's atomtype R -> C) that splits the training reactions of the parent node into an extension node and complement node, then regularization info is passed to the extension node AND the compliment node. Regularization info of compliment is found by analyzing all of the reactions that fit the complimentary node and determining which atomtypes are in those reactions. After this information is found, regularization info is also passed to the parent node (and this information is no longer an empty list).Testing
Simplest testing is with Retroene family.
[Si,Li,S,N,P,F,Br,I,Cl,C])