Allow for Custom Fitness Field #700
Conversation
Added functionality for fitness fields other than minimizing energy. The field can be anything that exists as a result of the subworkflow, and the fitness function can be minimum, maximum, or target_value. Additional work is needed to make it easier to use custom fitness fields not already in simmate (i.e. to easily make new models).
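As a rough illustration of the three fitness modes described above, here is a minimal sketch. The function name `select_best` and its signature are hypothetical, not simmate's actual API:

```python
def select_best(values, fitness_function="minimum", target_value=None):
    """Return the index of the 'best' value under the chosen criterion.

    Hypothetical sketch: 'minimum' and 'maximum' pick the extreme value,
    while 'target_value' picks the value closest to a desired target.
    """
    if fitness_function == "minimum":
        return min(range(len(values)), key=lambda i: values[i])
    if fitness_function == "maximum":
        return max(range(len(values)), key=lambda i: values[i])
    if fitness_function == "target_value":
        if target_value is None:
            raise ValueError("target_value mode requires a target_value")
        # closest to the target wins, regardless of direction
        return min(range(len(values)), key=lambda i: abs(values[i] - target_value))
    raise ValueError(f"unknown fitness function: {fitness_function}")
```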
To make things easier on users, I've added a StagedCalculation model that should be used by any staged workflows. The model inherits the Calculation model and adds columns for subworkflow names and ids. It also has convenience methods to obtain the subworkflow results from their respective tables. In addition to this, I've made it so that the fixed_composition evo search checks whether the selected workflow is a staged calculation. If it is, it uses the database of the last workflow of the staged calculation when filtering for finished calculations. This way, the only things the user needs to do for a custom calculation are to define subworkflows and make sure the final workflow returns the desired fitness field. For custom fitness fields, they will need to make a custom model, whose only requirement is that it inherits the Structure model.
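To illustrate the shape of the record described above, here is a plain-Python sketch (not simmate's actual Django model; the class and method names are illustrative) of what a staged-calculation row stores and how a convenience lookup could work:

```python
from dataclasses import dataclass, field


@dataclass
class StagedCalculationSketch:
    """Illustrative stand-in for a StagedCalculation row: it records which
    subworkflows ran and the row ids of their results in other tables."""

    subworkflow_names: list = field(default_factory=list)
    subworkflow_ids: list = field(default_factory=list)

    def last_subworkflow(self):
        """Name/id of the final subworkflow -- the table whose results hold
        the fitness field used when filtering finished calculations."""
        return self.subworkflow_names[-1], self.subworkflow_ids[-1]

    def subworkflow_results(self, lookup):
        """Fetch each subworkflow's result row; `lookup(name, id)` stands in
        for a database query against that subworkflow's table."""
        return [lookup(n, i) for n, i in zip(self.subworkflow_names, self.subworkflow_ids)]
```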
The staged workflow and corresponding model now check if the subworkflows fail and, if they do, record which one. Additionally, necessary arguments for subworkflows can be passed through the run argument as a dictionary, "subworkflow_kwargs". So the only requirement now is that a Structure object is passed. The Warren lab and badelf apps were updated to use this new staged workflow setup.
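A minimal sketch of the kwargs-routing idea described above. The function `run_staged` and the convention of keying the dictionary by subworkflow name are assumptions for illustration, not simmate's actual implementation:

```python
def run_staged(structure, subworkflows, subworkflow_kwargs=None):
    """Run each subworkflow in sequence, feeding each one's output to the next.

    Illustrative sketch: `subworkflow_kwargs` maps a subworkflow's name to
    the extra keyword arguments that only it should receive.
    """
    subworkflow_kwargs = subworkflow_kwargs or {}
    result = structure
    for workflow in subworkflows:
        # each subworkflow receives only the kwargs registered under its name
        kwargs = subworkflow_kwargs.get(workflow.__name__, {})
        result = workflow(result, **kwargs)
    return result
```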
Added some additional columns and methods to the SteadystateSource table. Still need to add methods that get information such as the average field value or the average difference from the source structure.
Previously, each steady-state source was submitted up to a fixed allowed integer. This is problematic: if one source tends to produce structures that are easier to relax than another, it will submit more than its ideal proportion of structures. The submission logic now looks at all submissions and submits from the source that is furthest below its desired proportion. This will help with the next step of dynamically updating sources based on various criteria. There are still some bugs to work on.
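The selection rule described above can be sketched as follows. This is a simplified stand-in for the real submission logic, and the function name `next_source` is hypothetical:

```python
def next_source(target_proportions, submitted_counts):
    """Pick the steady-state source furthest below its target share.

    Illustrative sketch: instead of capping each source at a fixed count,
    compare each source's actual share of all submissions to its target
    proportion and submit from the one with the largest deficit.
    """
    total = sum(submitted_counts.values()) or 1  # avoid dividing by zero

    def deficit(source):
        actual_share = submitted_counts.get(source, 0) / total
        return target_proportions[source] - actual_share

    return max(target_proportions, key=deficit)
```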
The steady-state sources are now automatically adjusted based on their average improvement over the parent structure. Only transformations are included, and there may be a better way to adjust the rates.
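One simple way the adjustment above could work is to re-weight each transformation by its measured average improvement, with a floor so that a currently unproductive source is never shut off entirely. This is a sketch of one possible scheme, not the implementation in the PR:

```python
def adjust_proportions(avg_improvements, floor=0.05):
    """Re-weight transformation sources by average improvement over the parent.

    Illustrative sketch: larger average improvement -> larger share of
    submissions. Negative or zero improvements are clipped to `floor` so
    every source keeps a small chance of being selected.
    """
    weights = {source: max(value, floor) for source, value in avg_improvements.items()}
    total = sum(weights.values())
    # normalize so the proportions sum to 1
    return {source: w / total for source, w in weights.items()}
```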
Updated the chemical system model and workflow to use alternate fitness fields. It is not yet updated to properly print out information for the best structures. Added a very basic streamlit app.
@jacksund Talking with Scott, it sounds like the next steps before adding much else are to get everything to a good state, benchmark, and publish. One of the main things he remembered being an issue is structure generation as we get to larger systems. I'm planning to look into getting the structure generation faster, but I wasn't sure if there was a good system to use as a test case. I tried using a larger Na-Cl system but it didn't seem to have any issue creating a structure quickly. Do you remember running into the slowdown in a particular situation? Also, do you have any suggestions for other stuff we should work on/refine before publishing? I've mostly worked through the list you suggested in #695. The main outstanding things are the implementation of pgvector and a more involved streamlit dashboard. I figured I'd wait for you to get your pgvector work out, and for streamlit I think it might be useful to create a more involved dashboard for simmate as a whole rather than just for the evo app. That might make more sense for me to work on after/during my time at Corteva since I'll get more experience with it there. In the meantime I added a very basic dashboard to the app.
Nice! We should clearly define an upper limit for searches in the initial publication, and I'd suggest we target up to 25 atoms and ternary searches -- anything beyond our decided cutoffs should be discouraged via docs + warnings until some follow-up paper or release. Sounds like you and Scott were imagining a larger cutoff in the initial paper, like 50 atoms...? And maybe quaternary searches? If you want to shoot for those, there will be significantly more work required on optimizing transformations, fingerprinting, etc. And if we do aim this big, we should also try to set goal times -- e.g. is matching USPEX's 50-atom search time good enough to publish, or do we need to beat it like we do in <20-atom searches?
It was both structure creation AND transformations that struggled at >20 atom counts. I never got to fine-tuning transformations and fingerprinting, which is why I had "benchmarking transformations and adding adaptive steady-states" on that list in #695. I don't have the numbers to back it up, but I would guess the "% of transformations that lead to lower-energy structures" is much worse for large systems compared to the % in other evo software packages.
I'm excited to hear more about what you're imagining for a general dashboard. But yeah, I would wait until I can teach you what I've learned with user interfaces. I have a lot of fun stuff in the works for streamlit and other UI features that aren't yet documented. For the pgvector work, I'll find some time for it. Just give me a heads up on your timeline so I can plan for it.
Let me know once you have an upper limit set with Scott, and then we can prioritize things. Aiming for 25 vs 50 atoms as our cutoff will change which things to clean up / spend the most time on.
Sounds good! Talking more with Scott, we decided that we can focus just on improved speed at less than about 25 atoms and ternary materials. We can then try to improve it further for >25 atoms and quaternary materials in a follow-up paper, which would also ideally include other improvements such as ML-FFs. In terms of timeline, I'll probably have a better sense after the holidays. We're still finishing up the BadELF follow-up paper, so I haven't started looking into the additional benchmarking we'll likely need for the evo search before publishing it. Once I start working on that a bit more, I'll try to give you a more concrete idea of when we're aiming to publish.
Added a model to store ELF ionic radii calculated during a BadELF calculation. Also added a column in the badelf model to store ELF values at atom/electride sites.
from simmate.database.base_data_types import Calculation, Structure, table_column

class StagedCalculation(Structure, Calculation):
I don't think StagedCalculation should be a standalone table, but a mixin.
For total beginners that run a staged workflow, there's a big difference between giving a table that says... "here is where you can go find the results" vs. "here are the final results, and there are some extra columns that say where we got them from".
The staged relaxation and population analysis tables that I currently have in main are more of the latter, while your PR is pushing the former. Your approach is pretty common in other software (because it saves on disk space), but it's just a pain for new users. So my pushback is more of a personal preference + design philosophy thing.
So just as a heads up, once you're ready for review+edits, I'll be modifying some of your tables to clean things up
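For contrast, the mixin design being suggested could look roughly like this. All class names here are hypothetical stand-ins, not simmate's actual tables: the point is that the provenance columns live on the results table itself, so a new user queries one table for both final results and where they came from.

```python
class StagedMixin:
    """Hypothetical mixin: adds 'where did this come from' columns
    to whatever results table includes it."""
    subworkflow_names: list = None
    subworkflow_ids: list = None


class RelaxationResults:
    """Stand-in for an existing final-results table (e.g. energies)."""
    energy: float = None


class StagedRelaxation(StagedMixin, RelaxationResults):
    """One table with both the final results and their provenance,
    instead of a separate standalone StagedCalculation table."""
    pass
```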
The goal of this PR is to add functionality for fitness fields other than minimizing energy.
I've broken down the goal into these steps: