Fix Stuck Evolutionary Search Transformations #716

SWeav02 · 2025-02-04T17:19:46Z

Describe the desired feature

Sometimes a structure transformation in the evo-search algorithm will fail repeatedly and cause a worker to get stuck for at least several minutes. We should alter how transformations are monitored to avoid clogging up workers.

Additional context

Currently for each transformation the evo-search will select parent structure(s) then try to perform the transformation up to 10,000 times. If it fails after this amount of time, then the transformation is removed from the search. In my testing, this issue has come up consistently, at least with smaller structures, and takes a good chunk of time. @jacksund suggested two fixes in #715:

The quick and fallback fix is to put a timer in the transformation function and to check it after each attempt -- exiting when it takes too long. While the code won't be pretty, it's probably important to have this feature exist somewhere. The timer limit should probably be proportional to the complexity/size of the system

The more robust fix is to find the cases where it does get stuck and either catch it or adjust the number of transformation attempts ahead of time. It'd be nice to have a reproducible test case to work with -- you can try adding a bunch of logs to a search where it happens often, and when a structure gets stuck, you go in an pull out those parent structures + the transformation it was using. This is preferred but might take more time to track down.

I'm more inclined to add the timer for now and investigate the root cause later, especially if we plan to rework structure generation and transformation in the future. In terms of a reproducible test case, this issue happens every time I run a search on the Na-Cl system. I always see it when a search is running on Na, Cl, and NaCl and I think some of the larger systems as well (e.g. Na2Cl, Na2Cl2, etc.). The issue is more obvious when running with one worker and a sqlite backend like I do when running quick tests. In this case, since there's only one worker, the whole search shuts down until the transformation attempt finishes. With more workers what will likely happen is that each submission of the transformation will clog up a worker. If one does succeed it'll be because it selected a structure that is easier to transform. That's no different than selecting a different structure each attempt and will bias the results.

So for now, I vote that we add a timer to get things ready for benchmarking as quickly as possible. We can troubleshoot more extensively later.

To-do items

Switch to a timer system rather than set numbers of while loops
Troubleshoot situations where this problem occurs using logs
Add a more robust fix addressing the root of the problem (whatever it may be)

SWeav02 · 2025-02-04T17:24:57Z

@jacksund I have to give group meeting this week so I might not be able to work on this until Thursday afternoon. I'll open a PR once I start working on it to give you a heads up.

Also, I talked to Scott more about the benchmarks and for now we are planning to use fewer resources (50 workers, 4 cores each) so that I can start soon despite the heavy activity on WarWulf. I'm hoping that we can add the quick fix for this issue by the end of this week, and I'll start the benchmarks as soon as it's merged.

jacksund · 2025-02-06T18:27:23Z

I wouldn't overcomplicate the timeout with threading or anything like that. I would just check a simple timer after each attempt. Similar to what's suggested here:
https://stackoverflow.com/questions/13293269/how-would-i-stop-a-while-loop-after-n-amount-of-time

SWeav02 · 2025-02-10T18:03:50Z

I wouldn't overcomplicate the timeout with threading or anything like that. I would just check a simple timer after each attempt. Similar to what's suggested here: https://stackoverflow.com/questions/13293269/how-would-i-stop-a-while-loop-after-n-amount-of-time

Sorry for the late response, I've been sick (still am really) and had some family stuff come up. Seems like an easy enough solution. I should be able to get a PR with something similar up today or tomorrow.

SWeav02 added the enhancement New feature or request label Feb 4, 2025

SWeav02 mentioned this issue Feb 4, 2025

Evo-search Custom Fitness Field #715

Draft

SWeav02 mentioned this issue Feb 11, 2025

Add Timer to Structure Transformations #717

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Stuck Evolutionary Search Transformations #716

Fix Stuck Evolutionary Search Transformations #716

SWeav02 commented Feb 4, 2025

SWeav02 commented Feb 4, 2025

jacksund commented Feb 6, 2025

SWeav02 commented Feb 10, 2025

Fix Stuck Evolutionary Search Transformations #716

Fix Stuck Evolutionary Search Transformations #716

Comments

SWeav02 commented Feb 4, 2025

Describe the desired feature

Additional context

To-do items

SWeav02 commented Feb 4, 2025

jacksund commented Feb 6, 2025

SWeav02 commented Feb 10, 2025