Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Stuck Evolutionary Search Transformations #716

Open
3 tasks
SWeav02 opened this issue Feb 4, 2025 · 3 comments
Open
3 tasks

Fix Stuck Evolutionary Search Transformations #716

SWeav02 opened this issue Feb 4, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@SWeav02
Copy link
Contributor

SWeav02 commented Feb 4, 2025

Describe the desired feature

Sometimes a structure transformation in the evo-search algorithm will fail repeatedly and cause a worker to get stuck for at least several minutes. We should alter how transformations are monitored to avoid clogging up workers.

Additional context

Currently for each transformation the evo-search will select parent structure(s) then try to perform the transformation up to 10,000 times. If it fails after this amount of time, then the transformation is removed from the search. In my testing, this issue has come up consistently, at least with smaller structures, and takes a good chunk of time. @jacksund suggested two fixes in #715:

The quick and fallback fix is to put a timer in the transformation function and to check it after each attempt -- exiting when it takes too long. While the code won't be pretty, it's probably important to have this feature exist somewhere. The timer limit should probably be proportional to the complexity/size of the system

The more robust fix is to find the cases where it does get stuck and either catch it or adjust the number of transformation attempts ahead of time. It'd be nice to have a reproducible test case to work with -- you can try adding a bunch of logs to a search where it happens often, and when a structure gets stuck, you go in an pull out those parent structures + the transformation it was using. This is preferred but might take more time to track down.

I'm more inclined to add the timer for now and investigate the root cause later, especially if we plan to rework structure generation and transformation in the future. In terms of a reproducible test case, this issue happens every time I run a search on the Na-Cl system. I always see it when a search is running on Na, Cl, and NaCl and I think some of the larger systems as well (e.g. Na2Cl, Na2Cl2, etc.). The issue is more obvious when running with one worker and a sqlite backend like I do when running quick tests. In this case, since there's only one worker, the whole search shuts down until the transformation attempt finishes. With more workers what will likely happen is that each submission of the transformation will clog up a worker. If one does succeed it'll be because it selected a structure that is easier to transform. That's no different than selecting a different structure each attempt and will bias the results.

So for now, I vote that we add a timer to get things ready for benchmarking as quickly as possible. We can troubleshoot more extensively later.

To-do items

  • Switch to a timer system rather than set numbers of while loops
  • Troubleshoot situations where this problem occurs using logs
  • Add a more robust fix addressing the root of the problem (whatever it may be)
@SWeav02 SWeav02 added the enhancement New feature or request label Feb 4, 2025
@SWeav02
Copy link
Contributor Author

SWeav02 commented Feb 4, 2025

@jacksund I have to give group meeting this week so I might not be able to work on this until Thursday afternoon. I'll open a PR once I start working on it to give you a heads up.

Also, I talked to Scott more about the benchmarks and for now we are planning to use fewer resources (50 workers, 4 cores each) so that I can start soon despite the heavy activity on WarWulf. I'm hoping that we can add the quick fix for this issue by the end of this week, and I'll start the benchmarks as soon as it's merged.

@jacksund
Copy link
Owner

jacksund commented Feb 6, 2025

I wouldn't overcomplicate the timeout with threading or anything like that. I would just check a simple timer after each attempt. Similar to what's suggested here:
https://stackoverflow.com/questions/13293269/how-would-i-stop-a-while-loop-after-n-amount-of-time

@SWeav02
Copy link
Contributor Author

SWeav02 commented Feb 10, 2025

I wouldn't overcomplicate the timeout with threading or anything like that. I would just check a simple timer after each attempt. Similar to what's suggested here: https://stackoverflow.com/questions/13293269/how-would-i-stop-a-while-loop-after-n-amount-of-time

Sorry for the late response, I've been sick (still am really) and had some family stuff come up. Seems like an easy enough solution. I should be able to get a PR with something similar up today or tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants