The HROCH parameters
- $pop_{size}$: Number of individuals in the population (default value 64).
- $pop_{sel}$: Size of the tournament selection in each iteration (default value 4).
- $code_{min\,size}, code_{max\,size}$: Minimum/maximum allowed size of an individual.
- $const_{size}$: Maximum number of constants allowed in a symbolic model.
- $predefined_{const\,prob}$: Probability of selecting one of the predefined constants during the equation search.
- $predefined_{const\,set}$: Predefined constants used during the equation search.
- $problem$: Set of mathematical functions used in the searched equations. Each mathematical function can have a defined weight with which it is selected in the mutation process. By default, common mathematical operations such as multiplication and addition have a higher weight than trigonometric functions or $pow$ and $\exp$. This is a natural way to eliminate $\sin\left(\sin\left(\exp(x)\right)\right)$-type equations, which may have high precision and low complexity but are usually inappropriate and difficult to interpret.
- $feature_{probs}$: Probability that the mutation process selects a given feature. This parameter allows using feature importances provided by a black-box regressor as an input parameter for symbolic regression, speeding up the search by concentrating on the important features.
- $metric$: Metric used to evaluate the quality of solutions during the search. Choose from MSE, MAE, MSLE, and LogLoss.
- $transformation$: Final transformation applied to the computed value. Choose from the logistic function for classification tasks, no transformation for regression problems, and ordinal (rounding) for ordinal regression.
- $sample_{weight}$: Array of weights assigned to individual samples.
- $class_{weight}$: Weights associated with classes, for classification tasks with an imbalanced class distribution. A configuration sketch using these parameters follows this list.
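The sketch below wires the parameters above into HROCH's scikit-learn-style `SymbolicRegressor`. The keyword spellings `pop_size`, `pop_sel`, and `const_size` simply mirror the paper's symbols and are assumptions, not a verbatim copy of the HROCH API; consult the package documentation for the exact names.

```python
# Configuration sketch; keyword names mirror the symbols above and are
# assumptions -- check the HROCH documentation for the exact spelling.
import numpy as np
from HROCH import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 0]   # simple ground-truth formula

reg = SymbolicRegressor(
    pop_size=64,          # individuals in the population
    pop_sel=4,            # tournament size in each iteration
    const_size=8,         # maximum constants in a model
    problem='math',       # set of mathematical functions
    metric='MSE',         # fitness metric
    time_limit=5.0,       # stopping criterion: wall-clock seconds
    iter_limit=0,         # stopping criterion: 0 = no iteration limit
)
reg.fit(X, y)
print(reg.predict(X[:5]))
```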
Stopping criteria
- $time_{limit}$: The time limit has been reached.
- $iter_{limit}$: The number of iterations has been exceeded.
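The two criteria amount to a simple disjunction checked once per iteration of the main loop; a minimal sketch (treating a zero limit as "disabled" is an assumption made here so one function covers both modes):

```python
import time

def stopping_criteria_met(start_time: float, iteration: int,
                          time_limit: float, iter_limit: int) -> bool:
    """True when either stopping criterion is satisfied."""
    if time_limit > 0 and time.monotonic() - start_time >= time_limit:
        return True   # time limit is reached
    if iter_limit > 0 and iteration >= iter_limit:
        return True   # number of iterations has been exceeded
    return False
```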
Fitness function The fitness function can be controlled by the $metric$, $sample_{weight}$, and $class_{weight}$ parameters. For a regression task, a sample-weighted mean squared error is used,

$$MSE = \frac{1}{N}\sum_{i=1}^{N} w_i \left(y_i - \hat{y}_i\right)^2,$$

and for a classification task, a weighted log loss,

$$LogLoss = -\frac{1}{N}\sum_{i=1}^{N} c[y_i]\, w_i \left(y_i \log p_i + \left(1 - y_i\right) \log\left(1 - p_i\right)\right),$$

where:
- $N$: Number of examples in the dataset
- $y_i$: Ground truth (true value) for the i-th example
- $\hat{y}_i$: Predicted value for the i-th example
- $p_i$: Predicted probability of the positive class for the i-th example
- $w_i$: Sample weight
- $c[y_i]$: Class weight for a given class $y_i$
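These two losses translate directly into NumPy. The snippet below is a sketch that follows the formulas as written above (with `class_weight` as a mapping from label to $c[y_i]$), not HROCH's internal implementation:

```python
import numpy as np

def weighted_mse(y, y_hat, w):
    """Sample-weighted MSE: mean of w_i * (y_i - y_hat_i)^2."""
    return np.mean(w * (y - y_hat) ** 2)

def weighted_log_loss(y, p, w, class_weight):
    """Weighted binary log loss; class_weight maps label -> c[y_i]."""
    eps = 1e-12                      # clip so log() stays finite
    p = np.clip(p, eps, 1.0 - eps)
    c = np.where(y == 1, class_weight[1], class_weight[0])
    return -np.mean(c * w * (y * np.log(p) + (1 - y) * np.log(1 - p)))
```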
Tournament selection Among the current population of solutions (models), $pop_{sel}$ individuals are drawn at random, and the one with the best fitness is selected to receive the next portion of hill-climbing time.
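A minimal sketch of this step, assuming lower scores are better (consistent with the error metrics above):

```python
import random

def tournament_select(population, scores, pop_sel):
    """Draw pop_sel random individuals; return the index of the best one."""
    candidates = random.sample(range(len(population)), pop_sel)
    return min(candidates, key=lambda i: scores[i])  # lower score = better
```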
Equations representation Searched equations are represented as a fixed-length computer program encoded as three-address code instructions.
If the
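To make the representation concrete, here is a minimal sketch of a fixed-length three-address program and its evaluation. The layout of each instruction (operation, destination register, two operand sources) is the standard three-address form; the concrete field names and the register-seeding convention are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Instruction:
    op: str       # 'add', 'mul', 'sin', ...
    dst: int      # index of the destination register
    src1: int     # first operand: register index
    src2: int     # second operand (ignored by unary ops)

def evaluate(program, x, n_regs):
    """Run a fixed-length three-address program on one sample x."""
    regs = np.zeros(n_regs)
    regs[:len(x)] = x                      # registers seeded with features
    for ins in program:
        a, b = regs[ins.src1], regs[ins.src2]
        if ins.op == 'add':
            regs[ins.dst] = a + b
        elif ins.op == 'mul':
            regs[ins.dst] = a * b
        elif ins.op == 'sin':
            regs[ins.dst] = np.sin(a)
        # ... remaining operations from the $problem$ set
    return regs[program[-1].dst]           # result of the last instruction

# Example: f(x) = sin(x0 * x1)
prog = [Instruction('mul', 2, 0, 1), Instruction('sin', 3, 2, 2)]
print(evaluate(prog, np.array([0.5, 2.0]), n_regs=4))  # sin(1.0)
```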
Mutation Each random neighbor generation procedure consists of one code mutation and one constant mutation. The code mutation selects one random instruction from the instructions used and randomly changes its mathematical operation, its operand sources, or both. If the
where:
- $\xi$: A random variable in the interval (0, 1)
- $\epsilon$: A very small constant ($10^{-6}$)
If a
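Building on the `Instruction` type from the representation sketch above, the following sketch shows one plausible shape of the mutation operator. The code-mutation part follows the description directly; the constant-mutation formula is not reproduced in this section, so the multiplicative perturbation below, built from $\xi$ and $\epsilon$, is purely an assumption:

```python
import random

OPS = ['add', 'mul', 'sin']      # operations drawn from the $problem$ set
N_REGS = 4                       # register count, an illustrative choice

def mutate(program, constants):
    """One code mutation plus one constant mutation, as described above."""
    program, constants = list(program), list(constants)

    # Code mutation: change the operation, the operand sources, or both.
    i = random.randrange(len(program))
    ins = program[i]
    if random.random() < 0.5:    # change the mathematical operation
        ins = Instruction(random.choice(OPS), ins.dst, ins.src1, ins.src2)
    if random.random() < 0.5:    # change the operand sources
        ins = Instruction(ins.op, ins.dst,
                          random.randrange(N_REGS), random.randrange(N_REGS))
    program[i] = ins

    # Constant mutation: an ASSUMED perturbation using xi in (0, 1) and
    # epsilon = 1e-6; the actual HROCH formula is not given in this section.
    if constants:
        j = random.randrange(len(constants))
        xi, eps = random.random(), 1e-6
        constants[j] = constants[j] * (0.5 + xi) + (xi - 0.5) * eps
    return program, constants
```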
Basic HROCH scheme. The algorithm is based on the concept of hill climbing and is also suited to run in parallel mode. Basic hill climbing is a simple heuristic search algorithm belonging to the class of local search–based algorithms: it starts with an initial solution and then iteratively makes small changes to improve it, usually terminating when none of the small changes to the current best solution yields an improvement. Note that the HROCH algorithm, unlike basic hill climbing, works with a population of independent solutions that compete for the time allotted to their evolution (tournament selection). Like Tabu search, it improves the performance of local search by relaxing its basic rule: at each step, the best candidate replaces the previous solution unless there is a significant worsening of the score. Implementing a Tabu list for the symbolic regression problem can be complicated because the problem consists of a discrete part (finding a suitable equation) and a continuous part (fine-tuning the constants used). Instead, two contradictory ideas are combined. Choosing the best solution from n generated neighbors pushes the search toward solutions with better scores, while not insisting that a strictly better solution must always be found avoids getting stuck in a local minimum.
Input: training dataset $(X, y)$
Control parameters: $pop_{size}$, $pop_{sel}$, stopping criteria ($time_{limit}$, $iter_{limit}$)
Output: best symbolic formula solution
procedure Fit(X, y)
    for i = 1 to $pop_{size}$ do
        population[i] ← random initial solution
    while stopping criteria is not met do
        sol ← tournament selection of $pop_{sel}$ random individuals from population
        bestNeighbour ← none
        for j = 1 to n do
            neighbour ← random neighbor of sol (mutation)
            if bestNeighbour is none or score(neighbour) is better than score(bestNeighbour) then
                bestNeighbour ← neighbour
        if score(bestNeighbour) is not significantly worse than score(sol) then
            sol ← bestNeighbour (replace the selected individual in population)
    best ← population[1]
    for i = 2 to $pop_{size}$ do
        if score(population[i]) is better than score(best) then
            best ← population[i]
    return best
end procedure
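For readers who prefer running code, below is a compact Python rendition of the whole scheme on a toy one-constant model. It follows the pseudocode's structure (random initial population, tournament selection, best of n mutated neighbours, tolerant acceptance), but every concrete choice, from the neighbour count to the acceptance tolerance, is an illustrative assumption:

```python
import random, time, math

def fit(score, random_solution, mutate,
        pop_size=64, pop_sel=4, n_neighbours=8,
        tolerance=1.05, time_limit=5.0):
    """Population-based hill climbing in the spirit of the pseudocode above.

    score(s) -> float (lower is better); random_solution() -> s;
    mutate(s) -> s. All numeric defaults are illustrative assumptions.
    """
    population = [random_solution() for _ in range(pop_size)]
    start = time.monotonic()
    while time.monotonic() - start < time_limit:        # stopping criterion
        # Tournament selection: pop_sel random individuals compete for time.
        idx = min(random.sample(range(pop_size), pop_sel),
                  key=lambda i: score(population[i]))
        sol = population[idx]
        # Best of n random neighbours.
        best_n = min((mutate(sol) for _ in range(n_neighbours)), key=score)
        # Accept unless significantly worse (relaxed hill-climbing rule).
        if score(best_n) <= tolerance * score(sol):
            population[idx] = best_n
    return min(population, key=score)

# Toy usage: recover c in f(x) = sin(c * x) from sampled data.
xs = [i / 10 for i in range(50)]
ys = [math.sin(1.7 * x) for x in xs]
mse = lambda c: sum((y - math.sin(c * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
best_c = fit(mse, lambda: random.uniform(0, 3),
             lambda c: c * (0.5 + random.random()), time_limit=1.0)
print(round(best_c, 3))   # typically close to 1.7 (the run is stochastic)
```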