-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ConfigSpace Technology Integration for Enhanced GAMA Configuration and Management 🥇 #210
base: master
Are you sure you want to change the base?
Conversation
a21119c
to
1b5721e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I will have a closer look later, but just had a superficial glance now. I can not give you a timeline on the review, unfortunately. I have a hard deadline on some changes for an in-person workshop in January, and I am also being transferred to a new project. This PR is large and affects every part of GAMA, so it's something I really want to take time for, which unfortunately means it's likely I won't get to it until later in January.
It is all fine, @PGijsbers! I will wait for the time needed, continue with my Ph.D implementations on the side, and believe that this current PR is great and promising; I am sure we will reach an agreement and minor-quality (!= minor-quantity, possibly major-quantity though) corrections will be requested, so I will keep doing so on my end 👌 Meanwhile, I wish you the best of luck with both the workshop and the new project! I hope GAMA, on the other hand, does not become "deprecated" in your lab! There are great things that could come out of it, and I will recommend GAMA for any new AutoML project around my faculty, for sure ! It would be greatly appreciated if you could also label the PR here so that no one gets confused! (By the way, I will make one more commit so the tests can run!) 💪 Wishing you a lovely winter time, |
Thank you! For full transparency, I would like to share the following with you. Recently, the AutoML Toolkit has come to our attention. Perhaps you are already familiar with it, but if not it's worth a closer look. Its aims align with GAMA's in many ways, so we are planning to look into whether it makes sense to collaborate with the project. It is even possible that, if there is enough overlap, we might prefer to "absorb" GAMA into AMLTK. For example, by re-implementing GAMA with AMLTK components, extending AMLTK as necessary to do so. The benefits are that AMLTK currently is under active development and has an outlook for continued support. This significantly reduces the maintenance work for GAMA and it could automatically benefit from new research in the AMLTK ecosystem. By focusing community efforts behind one project, I think everyone would come out ahead, and it is more likely we can provide support for the (hopefully few) GAMA-specific components. Of course, this all depends on a closer look at AMLTK - does it really achieve what it sets out to do? does it do that in a way that aligns with our goals? These are questions we need to answer before making any decision on this matter. And even if we would decide to go that route and you disagree with this decision, we could look into how to structure the future of GAMA. For example, perhaps you could maintain a fork which we could refer to, or we could consider adding you as a maintainer of this project and set up a second project for the "amltk-based GAMA". I hope to provide more clarity on this in January. In the mean time, I would also really value your input in the matter. As someone now intimately familiar with GAMA, if you find the time to have a look at AMLTK I would love to hear your thoughts on the project - and what you think the right path forward is. I also want to make clear that regardless of our decision in this matter, I am committed to continuing to provide support for the work that is already planned and discussed to the best of my ability. Cheers ps. Feel free to reach out over e-mail on this matter (you can find my e-mail on this page). |
+ Custom config space is now utilising ConfigSpace + Classification.py is now divided in distinct Classifiers and Preprocesors for better understand and managmeent + Thorough documentation for users to add/modify anything
e5d2f82
to
07b96fd
Compare
Sorry for the lengthy response ! I tried to be as thorough as possible in order to avoid being vague or confused! I do not believe that proper answers to any questions should be provided soon; we could continue in January! So here we go for my responses
Thank you for being very transparent, @PGijsbers; I predicted that somehow before opening this PR, and it actually makes sense to me. Yes, I am aware of AMLTK; I wish I could have attended the autumn school last week or so to participate in the hands-on activities with them. Meanwhile, I remain convinced that GAMA has something to offer aside. The flexible way to provide your new system, while still being easy to understand overall, or at the very least, the learning curve not being too steep, allows us AutoML researchers to do stuff with it, plus, in my case, I need a flexible system because I am going to apply it not to conventional classification tasks but in slightly different ways, so I will need a fork and do my changes. AMLTK is a fantastic toolkit, I have no doubts about that, but while it appears to be an easy and flexible API, as well as being highly powerful, I am not sure if it is possible to supply your own variable work, if this makes any sense, s.t varying as in varying in comparison to standard machine learning tasks. The learning curve for their API may be steeper compared to GAMA if the existing system (AMLTK) does not allow the researcher to perform what he/she is searching for, while GAMA is still simple to understand if they wish to pursue this route. However, for your lab, I believe it is a great significant step to look into anyway, which is a little sad because I would have hoped that more than one person would actively contribute to GAMA, but in the meantime, I do believe that both of these systems have something to offer everyone, and while AMLTK may be the first great system to look into for researchers, I also believe GAMA may be one for other, allowing the "easy" system to be modifiable by the researchers in case AMLTK would not have allowed them to do so. Nonetheless, I have not yet tried AMLTK, so I could be completely wrong, and possibly GAMA should be deleted from GItHub, but for the time being I am going to stick with it because I have put too much effort into designing my system around it. See further below.
Pieter, I appreciate much all this :) For the time being, I do not have the time to investigate AMLTK because I have my professor on my back to start experiments, and I have not finished implementing the system yet.. so I reckon I will have to do some try-of-AMLTK in the same time as you in January, I believe 👌 Nonetheless, I believe that opening a new Github Issue and tagging me in it when you begin the investigation would be worthwhile - I will give my feedbacks/visions/views, count on it! However, it may also encourage others to investigate. So, for the time being, I will stick with GAMA's current stable version and do my PRs around it, all for the sake of GAMA and to support my Ph.D. If AMLTK-based GAMA proves to be worthwhile, count me in for next year! If I do not see, on the other hand, a benefit in it for my present endeavours, and that keeping alive the old GAMA for specific purposes, you can count on me to be maintaining it, but we will have to find a few justifications as to why keeping it would be valuable for newcomers compared to AMLTK-based GAMA, otherwise I will keep the fork local. I am confident that we will find answers to all of the current unanswered questions, we just need time ⏳
Yes, indeed! ConfigSpace will assist GAMA in any form, old or new. Pieter, I am glad to hear all of it. I will send an email after finishing my present task of building my system around GAMA to discuss possible AMLTK-based GAMA with all of you, most probably not in 2023 though.. but as I mentioned before, starting an issue for that would be excellent! In any case, I will save the email because it is a nice one to have in my pocket for summer internships 🇳🇱 Wishing you a great Christmas time to you and the entire lab ! 🎄 Cheers ––––––––– PS: I have no clues why the workflows are still erratic. Let us look into that after your review in January, shall we? |
Thanks Simon, and happy holidays to you!
seems reasonable to me :) |
07b96fd
to
ee3f6e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Simon,
I started reviewing the code. There are several suggested changes to simplify the code and make it more readable. I haven't dug into the changes for genetic programming yet.
from .utilities.config_space import get_estimator_by_name | ||
|
||
# Avoid stopit from logging warnings every time a pipeline evaluation times out | ||
logging.getLogger("stopit").setLevel(logging.ERROR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? It is already in gama.py
so I am not sure why it would need to be added.
if ( | ||
"estimators" not in search_space.meta | ||
or ( | ||
search_space.meta["estimators"] | ||
not in search_space.get_hyperparameters_dict() | ||
) | ||
or not isinstance( | ||
search_space.get_hyperparameter(search_space.meta["estimators"]), | ||
cs.CategoricalHyperparameter, | ||
) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ( | |
"estimators" not in search_space.meta | |
or ( | |
search_space.meta["estimators"] | |
not in search_space.get_hyperparameters_dict() | |
) | |
or not isinstance( | |
search_space.get_hyperparameter(search_space.meta["estimators"]), | |
cs.CategoricalHyperparameter, | |
) | |
): | |
estimator_hp = search_space.meta.get("estimators") | |
estimator_hp_is_categorical = isinstance(dict(search_space).get(estimator_hp), cs.CategoricalHyperparameter) | |
if not estimator_hp_is_categorical: |
(the dict()
call is required for .get
to work: automl/ConfigSpace#348)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently there was a large refactor to make ConfigSpace
object behave more like a python dict
out of the box. It would be great for readability if we could make of those features instead of functions like get_hyperparameter
, get_hyperparameters_dict
.
} | ||
|
||
def setup_classifiers(self): | ||
classifiers_choices = list(self.classifiers_setup_map.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to store this as a separate list. By default in Python, iterating over a dictionary (as done below) only yields the keys and if it is empty then if not self.classifier_setup_map
will also be true.
for classifier_name in classifiers_choices: | ||
if setup_func := self.classifiers_setup_map.get(classifier_name): | ||
setup_func(classifiers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can iterate over a dictionary with keys and values together using:
for key, value in dictionary.items():
...
in this case, it looks like you are even only using values, so you could also simplify that to
for setup_func in self.classifiers_setup_map.values():
setup_func(classifiers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that the value is always a function. If some may be none, then you can add a if setup_func:
check.
def _add_hyperparameters_and_equals_conditions( | ||
self, local_vars: dict, estimator_name: str | ||
): | ||
if "classifiers" not in local_vars or not isinstance( | ||
local_vars["classifiers"], csh.CategoricalHyperparameter | ||
): | ||
raise ValueError( | ||
"Expected 'classifiers' key with a CategoricalHyperparameter in local" | ||
"vars" | ||
) | ||
|
||
hyperparameters_to_add = [ | ||
hyperparameter | ||
for hyperparameter in local_vars.values() | ||
if isinstance(hyperparameter, csh.Hyperparameter) | ||
and hyperparameter != local_vars["classifiers"] | ||
] | ||
|
||
conditions_to_add = [ | ||
cs.EqualsCondition( | ||
hyperparameter, local_vars["classifiers"], estimator_name | ||
) | ||
for hyperparameter in hyperparameters_to_add | ||
] | ||
|
||
self.config_space.add_hyperparameters(hyperparameters_to_add) | ||
self.config_space.add_conditions(conditions_to_add) | ||
|
||
def setup_bernoulliNB(self, classifiers: csh.CategoricalHyperparameter): | ||
alpha_NB = csh.CategoricalHyperparameter( | ||
"alpha__BernoulliNB", self.shared_hyperparameters["alpha"] | ||
) | ||
fit_prior = csh.CategoricalHyperparameter( | ||
"fit_prior__BernoulliNB", self.shared_hyperparameters["fit_prior"] | ||
) | ||
self._add_hyperparameters_and_equals_conditions(locals(), "BernoulliNB") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid passing around things with locals()
, it makes it very hard to read what is going on.
As far as I can tell, refactoring setup_classifiers
and setup_CLASSIFIER
methods, you can completely remove add_hyperparameters_and_equals_condition
. I used ...
to refer to some already existing code, hopefully the context is enough for you to understand what I mean. If not, please ask.
def setup_classifiers(...):
...
for classifier, classifier_hyperparameters in self.classifiers_setup_map.items():
hps, forbidden = classifier_hyperparameters()
self.config_space.add_hyperparameters(hps)
self.config_space.add_conditions([
cs.EqualsCondition(hp, classifiers, classifier)
for hp in hps
])
self.config_space.add_forbidden_clauses(forbidden)
def setup_multinomialNB(self) -> list[csh.Hyperparameter]:
return [
csh.CategoricalHyperparameter(...alpha...),
csh.CategoricalHyperparameter(...fit_prior...),
], [] # second list is forbidden clauses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could then refactor one step further and again remove the separate methods. By introducing a named tuple it would be easy to provide for each algorithm a list/tuple of Hyperparameters and a list/tuple of Forbidden clauses. You could then easily map from the classifier names to this tuple for each classifier.
if ( | ||
"KNeighborsClassifier" | ||
in self.search_space.get_hyperparameter( | ||
self.search_space.meta["estimators"] | ||
).choices | ||
): | ||
self.search_space.add_forbidden_clause( | ||
ForbiddenEqualsClause( | ||
self.search_space.get_hyperparameter( | ||
self.search_space.meta["estimators"] | ||
), | ||
"KNeighborsClassifier", | ||
) | ||
) | ||
if ( | ||
"KNeighborsRegressor" | ||
in self.search_space.get_hyperparameter( | ||
self.search_space.meta["estimators"] | ||
).choices | ||
): | ||
self.search_space.add_forbidden_clause( | ||
ForbiddenEqualsClause( | ||
self.search_space.get_hyperparameter( | ||
self.search_space.meta["regressors"] | ||
), | ||
"KNeighborsRegressor", | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why one is "estimators" and the other is "regressors"? Would it be bad if both classification and regression used "estimators"? If it really is necessary, it would still be nicer if you preface the statement with a estimator_alias = "estimators" <if classifier else> "regressors"
where the <if classifier else>
checks if it is a gamaclassifier object (not sure what the best way to do this is without circular references, but a hack like hasattr(self, "predict_proba")
should work. though I prefer something cleaner).
else: | ||
term_config_space_name = f"{term_type}__{primitive_string}" | ||
|
||
if isinstance(value, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this always a string?
except (ValueError, SyntaxError): | ||
value = str(value) | ||
|
||
if term_config_space_name in config_space.get_hyperparameter_names(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here too it seems clearer to assign a single variable with a ternary operation (a = x if TRUE else y
) and then use that variable. Reduces code duplication.
# Avoid stopit from logging warnings every time a pipeline evaluation times out | ||
logging.getLogger("stopit").setLevel(logging.ERROR) | ||
log = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be needed in every module.
while True: | ||
new_terminal_value = config_space.get_hyperparameter( | ||
old.config_space_name | ||
).sample(np.random.RandomState(42)) | ||
if new_terminal_value != old.value: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want an infinite loop here, it's just too easy to get stuck without warning. Something like a max_tries
should be used to avoid this.
Also, it looks like the sample happens with exactly the same random state each time, wouldn't that result in the same terminal value each time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PGijsbers, mistakenly opened the review again and saw this comment. Indeed, the 42 is a mistake here, I already, in one of my current implementations, got rid off it. Will look into all that further later as promised, especially the while true, and will consider a workaround with a max_tries as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's actually fair to consider the distinct cases:
- the hyperparameter has a continuous range, then just sampling a new number should "always" be different
- the hyperparameter has a discrete range, then you can follow the old logic (make a choice from the remaining choices)
that would also completely negate the need for any loops
Amazing ! cheers @PGijsbers 🎉 and Happy new year ! I will look it up later because I am about to have a system ready for my experiments, and then as soon as the experiments are started – I will cover all of this up. Feel free to ask me questions if they are urgent, otherwise wait for me to cover the reviews. Cheers, |
Thanks, Simon. Happy new year to you too. There's not going to be anything urgent here. Though I have a busy schedule until mid-February (part of which is vacation 🎉), so delays here might result in me not being able to process the whole backlog before then. On my end, that is not an issue. |
), | ||
): | ||
# For Float and Integer, check if the upper and lower bounds are different | ||
return hyperparameter.upper > hyperparameter.lower |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide an example where this isn't the case? And if this isn't the case, shouldn't be just force it to be defined as a Constant? If not, why not?
config_space.get_hyperparameter( | ||
config_space.meta[p._primitive.output] | ||
if p._primitive.output in get_internal_output_types() | ||
else p._primitive.output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm missing part of the picture here, can you give me an example of output for the p._primitive.output
that is not one of the get_internal_output_types()
?
""" | ||
if meta_key not in config_meta: | ||
raise ValueError(f"Meta key {meta_key} not in config_meta") | ||
if type(cond) not in [cs.conditions.EqualsCondition, cs.conditions.AndConjunction]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if type(cond) not in [cs.conditions.EqualsCondition, cs.conditions.AndConjunction]: | |
if not isinstance(cond, (cs.conditions.EqualsCondition, cs.conditions.AndConjunction)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or even better: since you check them separately below, just use an else
clause that raises this error
if ( | ||
"estimators" not in config_meta | ||
or config_meta["estimators"] not in config.get_dictionary() | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of checks all over the place on these kind of memberships for config meta and so on. With a check up front, can't we get rid of all of these? If there does happen to be a mistake somehow, it should throw a hard error regardless, right?
Finished my first look, and my first impression is very positive :) while there are definitely some things that I'd like to see changed to improve readability and maintainability, overall it seems like it's a very focused change and only parts that I expected to be affected are - which is a good sign! However, it's still a lot to grok at once, especially since I hadn't used config space before. I want to do another round after the changes are processed, because I hope that also helps me better understand the changes to genetic programming classes and operators. |
Wonderful, @PGijsbers! I appreciate your time 🙏 I will definitely answer everything thoroughly and make the necessary changes in consequences. I will put them in FIXUP mode before rebasing them back to their origins as soon as you accept them. Furthermore, I would like to back up your wish towards a few rounds of reviews for this PR so that you can better understand it overall (i.e., especially ConfigSpace), in order to help you out for the next few PRs! I wish you a wonderful February holiday and hope I am able to finish by the end of it. Thank you for the positive feedback:) Cheers |
I am not sure how github deals with fixup. Feel free to use it, but you can also just use regular commits as we use a |
BRIEF UDPATE: I intend to start working back on all this from the end of this month onwards. I have an important deadline for the 20th and had no time to look into those but will definitely afterward. Cheers for your patience! Simon |
Don't worry about it! I am not really working on GAMA right now, just making sure the broken installs are fixed. |
Hi @PGijsbers
I am excited to announce that I have made progress on the #208 feature proposal 🎉 The custom configuration space setup improvement, i.e., by utilising ConfigSpace.
I am happy to say that GAMA has been successfully updated to utilise ConfigSpace's technology. While most of the discussions can be found in issue #208, I will list some points that need your attention below ⬇️
In the meantime, it took me a considerable amount of time to create this pull request as I was not familiar with the complete internal system architecture beforehand. Please bear with me for any modifications, but I believe it holds promise. I am open to discussing any aspects, including the reasons and methods behind the changes, as well as any further adjustments that may be necessary 👌
It is important to consider the consequences of introducing a new technology. Therefore, I would like to bring your attention to a few key changes and requests, as follow:
The configuration files, including the classification and regression py files, have been modified to utilise a ConfigSpace object. This object is responsible for accessing the appropriate ConfigSpace configuration for classifiers and preprocessors in the case of classification, or regressors and preprocessors in the case of regression. I have made the decision to divide them (estimators/preprocessors) for the purpose of enhancing comprehensibility and manageability. This is particularly important now that it has become easier to remove the preprocessors of a specific configuration (classification || regression). However, I am in full agreement that we currently possess more comprehensive files. However, the comprehensibility of the content and the quality of the documentation make it effortless for users to grasp concepts such as adding, modifying, and more. I hope you will be in agreement with this, happy to discuss worthiness of this addition 🫡. Next, I would like to bring your attention to the following: Considering that ConfigSpace excels at handling constraints such as forbidden clauses, it offers significant advantages for GAMA in the short term (e.g. handling Hyperparameter check #188 again). However, I am uncertain about the correctness of the previous custom configuration parameter checks. Therefore, I would like to draw your attention in this direction, hence, letting me know about any additional enhancements to specific algorithms that require constrained hyperparameter values, such as LinearSVC to start with. @HadiKutabi feel free to give us recomendation as to what should be constrained. A starting point could be there.
During the process of adapting the internal system, I removed a selection of components while incorporating others to align with the ConfigSpace technology. What frightened me the most is the evolutionary programming aspect. Henceforth, I kindly request your attention regarding the following matter: please ensure that the mutations align with your intended design while executing them on EA. Even if I did include numerous print statements throughout the codebase to ensure that each mutation was doing what it was supposed to do, as well as that the testing worked. This in-depth examination of you will allow us to confirm that the mutations are working as intended. I would like to avoid any potential misunderstanding from my part, here.
Lastly, I wish to discuss the subject of testing. Once more, I implemented modifications and verified that every test executed smoothly without any issues. Nevertheless, I wish to direct your focus towards the test configuration content itself. In order to facilitate testing, a simpler search space was demonstrated. However, in comparison to the initial classification or regression search spaces, the sole noteworthy modification pertains to the
number of estimators
utilised e.g, by ensemble-based classifiers such as Random Forest. It has been mentioned that the configuration space has been "simplified" for testing purposes in order to increase its lightweightness. Nonetheless, its dimensions appear to be quite substantial, or at the very least, similar to the originals (classification/regression). Would not it be better to decrease the quantity of classifiers and preprocessors to, say, three for each? I would greatly appreciate any insights in this direction.What contributions have been made
I would greatly appreciate knowing when you would have the time to address this, I must say, highly significant pull request, particularly considering the recent contribution of Auto-Sklearn in creating a "toolkit" for developing an Auto-ML system. While others are also trying, I genuinely believe that GAMA could greatly benefit from this pull request, both to keep being in the "competition" but also for the community that already is using it ☀️
For those who are new to this PR and are not familiar with any other discussion (#191, #198, #201, #202) that we had via the the Github issues section of GAMA with @PGijsbers. Furthermore, among the points discussed in #208, this PR's enhancement will also result in the seamless integration of SMAC for Bayesian optimisation (as discussed further in #202), which will introduce an additional search algorithm to GAMA and greatly increase its flexibility, ideally, increase its community's involvement 💪
Cheers,