Re-implementation of GP-GOMEA (Python scikit-learn-compatible interface, C++ backend). Unlike the previous repo, this version of the code features only GP-GOMEA and no other algorithms, and focuses on symbolic regression alone. It also uses dependencies that are easier and less finicky to install (see environment.yml).
Installation requires git and conda. Run the following bash commands from a folder of your choice:
git clone https://github.com/marcovirgolin/gpg.git
cd gpg
conda env create -f environment.yml
conda activate gpg
make
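To confirm that the build and the conda environment are usable before running anything heavier, a quick importability check can help. This is only a sketch: the helper name `is_importable` is hypothetical, and the package name `pygpg` is taken from the imports used in the snippet in this README.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located in the active environment."""
    return importlib.util.find_spec(module_name) is not None


# Packages imported by the example in this README (top-level names only).
for name in ("numpy", "sklearn", "pygpg"):
    print(name, "OK" if is_importable(name) else "MISSING")
```

If `pygpg` is reported as missing, the most likely causes are that the `gpg` environment is not active or that `make` did not complete.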
You can try `gpg` out with the following code snippet (or simply run `try.py` if you like):
import numpy as np
from pygpg.sk import GPGRegressor
from pygpg.complexity import compute_complexity
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
X = np.random.randn(128, 3)*10
def grav_law(X: np.ndarray) -> np.ndarray:
    """Ground-truth function for the gravity law."""
    return 6.67 * X[:, 0] * X[:, 1] / np.square(X[:, 2]) + np.random.randn(X.shape[0]) * 0.1  # some noise
y = grav_law(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=RANDOM_SEED)
gpg = GPGRegressor(
    e=50_000,                  # 50,000 evaluations limit
    t=-1,                      # no time limit
    g=-1,                      # no generation limit
    d=3,                       # maximum tree depth
    verbose=True,              # print progress
    random_state=RANDOM_SEED,  # for reproducibility
)
gpg.fit(X_train,y_train)
print(
    gpg.model,
    "(complexity: {})".format(compute_complexity(gpg.model, complexity_metric="node_count")))
print("Train\t\tR2: {}\t\tMSE: {}".format(
    np.round(r2_score(y_train, gpg.predict(X_train)), 3),
    np.round(mean_squared_error(y_train, gpg.predict(X_train)), 3),
))
print("Test\t\tR2: {}\t\tMSE: {}".format(
    np.round(r2_score(y_test, gpg.predict(X_test)), 3),
    np.round(mean_squared_error(y_test, gpg.predict(X_test)), 3),
))
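Because the models returned from C++ are converted to sympy, the result can in principle be inspected and reused with ordinary sympy tools. The sketch below does not call the regressor; it uses a hand-written stand-in expression, and the symbol names `x_0`, `x_1`, `x_2` are an assumption rather than something taken from the library.

```python
import numpy as np
import sympy

# Hypothetical stand-in for a fitted model; symbol names are assumed.
x_0, x_1, x_2 = sympy.symbols("x_0 x_1 x_2")
model = 6.67 * x_0 * x_1 / x_2**2

# One possible size measure: number of nodes in the sympy expression tree.
# (This is not necessarily what compute_complexity reports.)
n_nodes = sum(1 for _ in sympy.preorder_traversal(model))

# Symbolic post-processing, e.g. the sensitivity with respect to x_2.
d_model_dx2 = sympy.diff(model, x_2)

# Compile the expression into a vectorised NumPy function for fast evaluation.
predict = sympy.lambdify((x_0, x_1, x_2), model, modules="numpy")

X_demo = np.array([[1.0, 2.0, 1.0], [3.0, 4.0, 2.0]])
y_hat = predict(X_demo[:, 0], X_demo[:, 1], X_demo[:, 2])

print(model, "| nodes:", n_nodes)
print("d/dx_2:", d_model_dx2)
print("predictions:", y_hat)
```

With an actual fitted estimator, `gpg.model` would take the place of the stand-in `model`, provided its symbols are matched to the columns of `X`.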
This version has some differences compared to the code in the previous repo. Here's a list:
- Protected operators are not used here (expressions that evaluate to NaN for some training points are assigned a worst-case fitness `INF`)
- Functions/variables/constants can be sampled with custom probabilities (by default, uniform with binary operators twice as likely as unary operators)
- Tournament selection can be used to speed up convergence within GOM.
- Models returned from the C++ code are simplified and (optionally) fine-tuned in Python
- Elites at multiple levels of complexity (expression size) are stored and returned to Python (a "best one" is selected using the `rci` parameter)
- If the IMS is disabled and the population converges before the budget is exhausted, then a new population is started which includes a random elite from those found before
- A simple feature selection mechanism is included (if desired)
- Models obtained from C++ are converted to `sympy` and can be further processed as such
- The scikit-learn interface includes imputation in case of incomplete data
- The scikit-learn interface includes coefficient fine-tuning with `sympy-torch` and L-BFGS
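The first item in the list can be illustrated with a small NumPy sketch. It is not the C++ implementation: the function name `mse_fitness` is hypothetical, mean squared error is assumed as the fitness (lower is better), and the "protected" square root is shown only for contrast with what this version avoids.

```python
import numpy as np


def mse_fitness(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error, or +inf if any prediction is NaN (lower is better)."""
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(np.isnan(y_pred)):
        return float("inf")  # worst-case fitness for an invalid expression
    return float(np.mean((np.asarray(y_true, dtype=float) - y_pred) ** 2))


x = np.array([-1.0, 1.0, 4.0])
y = np.array([0.0, 1.0, 2.0])

# Unprotected operator: sqrt of a negative input yields NaN.
with np.errstate(invalid="ignore"):
    unprotected = np.sqrt(x)

# Protected variant (for contrast only): silently changes the function's meaning.
protected = np.sqrt(np.abs(x))

fit_unprotected = mse_fitness(y, unprotected)
fit_protected = mse_fitness(y, protected)
print("unprotected:", fit_unprotected)
print("protected:  ", fit_protected)
```

Assigning the worst-case fitness means such expressions are simply outcompeted, so the expressions that survive have the plain mathematical meaning of their operators on the training data.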
Running this version on SRBench (`gpg`) leads to expressions that are as compact as, but more accurate than, those of the original GP-GOMEA, in much less time!
If you use our code for academic purposes, please support our research by citing:
@article{virgolin2021improving,
title={Improving model-based genetic programming for symbolic regression of small expressions},
author={Virgolin, Marco and Alderliesten, Tanja and Witteveen, Cees and Bosman, Peter A. N.},
journal={Evolutionary Computation},
volume={29},
number={2},
pages={211--237},
year={2021},
publisher={MIT Press}
}
- `swig` and `pybind` are the same, except that the former uses SWIG and the latter uses pybind to realize the Python interface. `pybind` is now the default and `swig` will probably no longer be supported or updated.
- `vector_repr` represents an expression as a vector of strings instead of a tree of nodes. This version may be slightly faster (this matters only when the number of observations in the data set is relatively small). However, it needs to be fixed.