Fix/merge back v2 #474

Merged
merged 32 commits into master from fix/merge-back-v2 on Jul 12, 2022
Conversation

sebhrusen (Collaborator) commented Jul 12, 2022

The previous attempt was squash-merged, apparently deleting the common ancestor between stable-v2 and master.

PGijsbers and others added 30 commits September 17, 2021 10:58
* Add a workflow to tag latest `v*` release as `stable` (#399)

Currently limited to alphabetical ordering, which means that no single number in the version can exceed one digit.
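A minimal Python sketch of why purely alphabetical tag ordering breaks once a version component reaches two digits (illustrative only, not the workflow's actual code):

```python
# Lexicographic comparison misorders multi-digit version components:
tags = ["v2.9.0", "v2.10.0"]
print(max(tags))  # -> "v2.9.0", because the character "9" sorts after "1"

# Sorting on the numeric components gives the intended order:
numeric = lambda t: tuple(int(p) for p in t.lstrip("v").split("."))
print(max(tags, key=numeric))  # -> "v2.10.0"
```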

* Bump auto-sklearn to 0.14.0 (#400)
* Add the version tag to the image name if present

* Fix casing for MLNet framework definition
* Add volume meta data to aws meta info
* Add constraints for v2 benchmark

For ease of reproducibility, we want to include our experimental setup
in the constraints file. For our experiments we increase the volume size
to 100 GB and require gp3 volumes (general-purpose SSD).
* let the job runner handle the rescheduling logic to ensure that a job can no longer be acted upon by the current worker after it has been rescheduled

* remove commented code
Made the previous version abstract to avoid accidentally running the
wrong version of GAMA for the benchmark.
* Unsparsify target variables for (Tuned)RF

Sparse targets are not supported in scikit-learn 0.24.2, and are used
with tasks 360932 and 360933 (QSAR) in the benchmark.
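A hedged sketch of the workaround (the helper name is illustrative): densify sparse targets before handing them to scikit-learn 0.24.2.

```python
import numpy as np
from scipy import sparse

def unsparsify(y):
    """Convert a sparse target matrix to a dense array (illustrative helper)."""
    # scikit-learn 0.24.2 rejects sparse y, so densify before fitting (Tuned)RF.
    return np.asarray(y.todense()).squeeze() if sparse.issparse(y) else y
```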

* cosmetic change to make de/serialization easier to debug

Co-authored-by: Sebastien Poirier <sebastien@h2o.ai>
It's entirely possible that the processes were already terminating, but only completed termination between the process.children call and the proc.terminate/kill calls.
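A sketch of the guarded termination, assuming psutil is the process library in play (the commit text's `process.children` and `proc.terminate/kill` suggest it); names are illustrative:

```python
import psutil

def terminate_child_processes(pid):
    try:
        children = psutil.Process(pid).children(recursive=True)
    except psutil.NoSuchProcess:
        return  # parent already gone
    for proc in children:
        try:
            proc.terminate()  # or proc.kill() after a grace period
        except psutil.NoSuchProcess:
            # The process finished between children() and terminate():
            # exactly the race this commit guards against.
            pass
```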
* fixes #432: add precision to runtimes in results.csv

* Update amlb/results.py

Co-authored-by: seb. <sebastien@h2o.ai>

Co-authored-by: seb. <sebastien@h2o.ai>
* Iteratively build the forest to honor constraints

In particular, depending on the dataset size, either memory or time
constraints can become a problem, which makes it unreliable as a
baseline. Gradually growing the forest sidesteps both issues.

* Make iterative fit default, parameterize execution

* Step_size as script parameter, safer check if done

When final_forest_size is not an exact multiple of step_size, the
random forest should still terminate. Additionally, step_size is escaped
with an underscore as it is not a RandomForestEstimator hyperparameter.
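A sketch of the iterative fit under the stated assumptions; parameter names mirror the description above, but the actual script differs:

```python
import time
from sklearn.ensemble import RandomForestClassifier

def fit_forest_iteratively(X, y, final_forest_size=2000, _step_size=100,
                           deadline=None):
    # _step_size is underscore-prefixed because it is a script parameter,
    # not a RandomForestEstimator hyperparameter.
    rf = RandomForestClassifier(n_estimators=0, warm_start=True)
    while rf.n_estimators < final_forest_size:
        if deadline is not None and time.time() >= deadline:
            break  # constraint hit: keep the forest grown so far
        # The last increment may be smaller when final_forest_size is not
        # an exact multiple of _step_size.
        rf.n_estimators = min(rf.n_estimators + _step_size, final_forest_size)
        rf.fit(X, y)  # warm_start reuses the trees already grown
    return rf
```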
…ts (#441)

* Iterative fit to meet memory and time constraints

Specifically, for each value of `max_features` to try, an equal time
budget is allotted, with one additional budget reserved for the final
fit. This does mean that different `max_features` values can lead to
different numbers of trees, but it keeps things simple.

* Abort tuning when close to total time budget

The first fit of each iterative fit for a `max_features` value was not
guarded, which can lead to exceeding the total time budget. This adds a
check before the first fit to estimate whether the budget will be
exceeded and, if so, aborts further tuning and continues with the final
fit.
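A sketch of the guard; the `fit_one` callable and all names here are hypothetical:

```python
import time

def tune_max_features(candidates, fit_one, total_budget):
    """Try max_features values, aborting before a fit that would overshoot."""
    start, longest_fit, results = time.time(), 0.0, {}
    for max_features in candidates:
        # Estimate the cost of the next (previously unguarded) first fit
        # from the slowest fit observed so far.
        if (time.time() - start) + longest_fit > total_budget:
            break  # abort tuning; proceed to the final fit instead
        t0 = time.time()
        results[max_features] = fit_one(max_features)
        longest_fit = max(longest_fit, time.time() - t0)
    return results
```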

* Make k_folds configurable

* Add scikit-learn code with explanation

* Modify cross_validate, allow 1 estimator per split

This is useful when we maintain a warm-started model for each individual
split.

* Use custom cv function to allow warm-start

By default estimators are cloned in any scikit-learn cross_validate
function (which stops warm-start) and it is not possible to specify a
specific estimator-object per fold (which stops warm-start). The added
custom_validate module makes changes to the scikit-learn code to allow
warm-starting to work in conjunction with the cross-validate
functionality. For more info see scikit-learn#22044 and
scikit-learn#22087.
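A condensed sketch of the idea, assuming an ensemble created with `warm_start=True` (e.g. `RandomForestClassifier(n_estimators=0, warm_start=True)`); the real `custom_validate` module adapts scikit-learn's own cross-validation code rather than replacing it like this:

```python
from sklearn.base import clone
from sklearn.model_selection import KFold

def warm_start_cv(estimator, X, y, k_folds=5, rounds=4, step=50):
    splits = list(KFold(n_splits=k_folds).split(X))
    # Clone once up front, then reuse: cross_validate would re-clone on
    # every call, which discards the warm-started state.
    fold_models = [clone(estimator) for _ in splits]
    scores = []
    for _ in range(rounds):
        round_scores = []
        for model, (train, test) in zip(fold_models, splits):
            model.n_estimators += step  # grow this fold's forest in place
            model.fit(X[train], y[train])
            round_scores.append(model.score(X[test], y[test]))
        scores.append(round_scores)
    return fold_models, scores
```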

* Add parameter to set tune time, rest is for fit

In the previous iteration, the final fit was treated as a budget step
equivalent to any other optimization step, which sometimes left too
little time to train the final forest, in particular when the last fit
took longer than expected. This would often lead to very small forests
for the final model. The new system reserves roughly 10% of the budget
for the final forest, ensuring a better final fit.
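Illustrative arithmetic for the split; the exact fraction in the script may differ:

```python
total_budget = 3600                            # e.g. one hour for the task
tune_budget = 0.9 * total_budget               # shared across max_features values
final_fit_budget = total_budget - tune_budget  # ~10% reserved for the final forest
```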
In a previous iteration the probabilities were encoded as a numpy file,
but now they are serialized to JSON, which means that
results.probabilities is simply a string if imputation is required.
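A small sketch of the round trip (illustrative values):

```python
import json
import numpy as np

probabilities = np.array([[0.9, 0.1], [0.25, 0.75]])
encoded = json.dumps(probabilities.tolist())  # results.probabilities is this string
restored = np.asarray(json.loads(encoded))    # decode when imputation is required
```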
Technically, monkeypatch the xmltodict function used by openml when reading the features XML (see the sketch below).
Was supposed to be included with #443
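A sketch of the monkeypatch pattern for the xmltodict fix mentioned above; the patched element name and the use of `force_list` are assumptions, since the PR text does not show the actual transformation:

```python
import xmltodict

_original_parse = xmltodict.parse

def _patched_parse(xml_input, **kwargs):
    # Assumption: xmltodict collapses a single repeated element into a dict
    # instead of a one-element list; force_list keeps the features node
    # list-shaped. The element name below is a guess.
    kwargs.setdefault("force_list", ("oml:feature",))
    return _original_parse(xml_input, **kwargs)

xmltodict.parse = _patched_parse  # openml's calls now go through the patch
```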
seb added 2 commits July 12, 2022 14:23
…er (#468)

* change the workflow to correctly update the app version on releases and when force-merging a version back to master

* protect main branch from accidental releases
sebhrusen merged commit bcd2a28 into master on Jul 12, 2022
sebhrusen deleted the fix/merge-back-v2 branch on July 12, 2022 14:52