Describe the desired feature
Sometimes a structure transformation in the evo-search algorithm will fail repeatedly and cause a worker to get stuck for at least several minutes. We should alter how transformations are monitored to avoid clogging up workers.
Additional context
Currently, for each transformation, the evo-search selects parent structure(s) and then tries to perform the transformation up to 10,000 times. If it still fails after that many attempts, the transformation is removed from the search. In my testing, this issue has come up consistently, at least with smaller structures, and each occurrence takes a good chunk of time. @jacksund suggested two fixes in #715:
The quick and fallback fix is to put a timer in the transformation function and check it after each attempt, exiting when it takes too long. While the code won't be pretty, it's probably important to have this feature exist somewhere. The timer limit should probably be proportional to the complexity/size of the system.
The more robust fix is to find the cases where it does get stuck and either catch them or adjust the number of transformation attempts ahead of time. It'd be nice to have a reproducible test case to work with. You can try adding a bunch of logs to a search where it happens often, and when a structure gets stuck, go in and pull out those parent structures plus the transformation it was using. This is preferred but might take more time to track down.
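For discussion, here is a minimal sketch of what the timer version could look like. It is not the current Simmate code: `attempt_with_time_limit`, `seconds_per_site`, and `min_seconds` are hypothetical names and values, and `attempt` stands in for one try of a transformation that returns `None` on failure.

```python
import time


def attempt_with_time_limit(
    attempt,
    n_sites,
    seconds_per_site=2.0,
    min_seconds=10.0,
    clock=time.monotonic,
):
    """Retry `attempt` until it succeeds or a size-scaled time limit passes.

    All names and default values here are illustrative assumptions.
    `attempt` is any zero-argument callable that returns a new structure,
    or None when the transformation failed.
    """
    # Scale the limit with system size, but never go below a fixed floor
    time_limit = max(min_seconds, seconds_per_site * n_sites)
    start = clock()
    n_attempts = 0

    # The timer is checked before each attempt, so one very slow attempt
    # can still overshoot the limit; it only stops further retries.
    while clock() - start < time_limit:
        n_attempts += 1
        result = attempt()
        if result is not None:
            return result, n_attempts

    # Timed out: the caller decides whether to pick new parents or give up
    return None, n_attempts
```

The `clock` argument is only there so the loop can be tested without real waiting. Returning the attempt count alongside the result also gives us something to log when a transformation times out.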
I'm more inclined to add the timer for now and investigate the root cause later, especially if we plan to rework structure generation and transformation in the future. In terms of a reproducible test case, this issue happens every time I run a search on the Na-Cl system. I always see it when a search is running on Na, Cl, and NaCl, and I think on some of the larger systems as well (e.g. Na2Cl, Na2Cl2, etc.). The issue is more obvious when running with one worker and a SQLite backend, like I do when running quick tests. In that case, since there's only one worker, the whole search stalls until the transformation attempt finishes. With more workers, what will likely happen is that each submission of the transformation clogs up a worker. If one does succeed, it'll be because it happened to select a structure that is easier to transform. That's effectively the same as selecting a different structure on each attempt, and it will bias the results toward structures that are easy to transform.
So for now, I vote that we add a timer to get things ready for benchmarking as quickly as possible. We can troubleshoot more extensively later.
To-do items
Switch to a timer system rather than a fixed number of while-loop iterations
Troubleshoot situations where this problem occurs using logs
Add a more robust fix addressing the root of the problem (whatever it may be)
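For the logging item, one possible shape is sketched below, using only the standard library. The logger name, the function `record_stuck_case`, and the record fields are all assumptions for illustration; the idea is just to capture which parents and which transformation were involved so the case can be replayed later.

```python
import json
import logging

# Hypothetical logger name, not an existing Simmate logger
logger = logging.getLogger("evo_search.stuck_transformations")


def record_stuck_case(transformation_name, parent_ids, n_attempts, elapsed_seconds):
    """Log one stuck transformation as a single JSON line and return the record.

    Every field name here is an illustrative assumption. `parent_ids` would be
    whatever identifies the parent structures in the search database.
    """
    record = {
        "transformation": transformation_name,
        "parent_ids": list(parent_ids),
        "n_attempts": n_attempts,
        "elapsed_seconds": round(elapsed_seconds, 2),
    }
    # One JSON object per line keeps the log easy to grep and to load later
    logger.warning("stuck transformation: %s", json.dumps(record, sort_keys=True))
    return record
```

Calling this whenever a transformation hits the time limit should give us a list of parent structures to pull out and turn into a reproducible test case.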
@jacksund I have to give group meeting this week so I might not be able to work on this until Thursday afternoon. I'll open a PR once I start working on it to give you a heads up.
Also, I talked to Scott more about the benchmarks and for now we are planning to use fewer resources (50 workers, 4 cores each) so that I can start soon despite the heavy activity on WarWulf. I'm hoping that we can add the quick fix for this issue by the end of this week, and I'll start the benchmarks as soon as it's merged.
Sorry for the late response, I've been sick (still am really) and had some family stuff come up. Seems like an easy enough solution. I should be able to get a PR with something similar up today or tomorrow.