Modern C++ framework for Symbolic Regression

Operon is a modern C++ framework for symbolic regression that uses genetic programming to explore a hypothesis space of possible mathematical expressions in order to find the best-fitting model for a given regression target. Its main purpose is to help develop accurate and interpretable white-box models in the area of system identification. More in-depth documentation is available at https://operongp.readthedocs.io/.

How does it work?

Broadly speaking, genetic programming (GP) evolves a population of "computer programs" (AST-like structures encoding behavior for a given problem domain) following the principles of natural selection. It repeatedly combines random program parts and keeps only the best results, the "fittest". Here, the biological concept of fitness is defined as a measure of a program's ability to solve a certain task.

In symbolic regression, the programs represent mathematical expressions, typically encoded as expression trees. Fitness is usually defined as the goodness of fit between the dependent variable and the prediction of a tree-encoded model. Iterative selection of the best-scoring models followed by random recombination leads naturally to a self-improving process that is able to uncover patterns in the data.
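The select-and-recombine loop described above can be sketched in a few lines of Python. This is a toy illustration only and not Operon's implementation: the tuple-based tree encoding, the truncation selection, the mean squared error fitness and all names (`random_tree`, `crossover`, `evolve`, ...) are assumptions chosen for brevity.

```python
import random

# Binary operators available to the toy expression trees.
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}

def random_tree(rng, depth):
    """Grow a random tree: ("x",), ("const", c) or (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.uniform(-2.0, 2.0))
    return (rng.choice(sorted(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    """Evaluate the expression tree at a single input value x."""
    if tree[0] == "x":
        return x
    if tree[0] == "const":
        return tree[1]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def mse(tree, xs, ys):
    """Fitness: mean squared error between prediction and target (lower is better)."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def nodes(tree, path=()):
    """Yield (path, subtree) pairs for every node in the tree."""
    yield path, tree
    if tree[0] in OPS:
        yield from nodes(tree[1], path + (1,))
        yield from nodes(tree[2], path + (2,))

def replace_at(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return tuple(parts)

def crossover(rng, a, b):
    """Subtree crossover: graft a random subtree of b into a random position of a."""
    path, _ = rng.choice(list(nodes(a)))
    _, donor = rng.choice(list(nodes(b)))
    return replace_at(a, path, donor)

def evolve(xs, ys, pop_size=100, generations=20, max_nodes=40, seed=0):
    """Keep the best quarter of the population and refill it by recombination."""
    rng = random.Random(seed)
    pop = [random_tree(rng, 4) for _ in range(pop_size)]
    history = []  # best error at the start of each generation
    for _ in range(generations):
        pop.sort(key=lambda t: mse(t, xs, ys))
        history.append(mse(pop[0], xs, ys))
        parents = pop[: pop_size // 4]  # survivors are carried over unchanged
        children = []
        while len(children) < pop_size - len(parents):
            child = crossover(rng, rng.choice(parents), rng.choice(parents))
            if sum(1 for _ in nodes(child)) <= max_nodes:  # limit tree growth
                children.append(child)
        pop = parents + children
    best = min(pop, key=lambda t: mse(t, xs, ys))
    return best, history

# Target function x^2 + x sampled on [-1, 1].
xs = [i / 5 - 1 for i in range(11)]
ys = [x * x + x for x in xs]
best, history = evolve(xs, ys)
print("best error per generation:", history[0], "->", history[-1])
```

Because the surviving parents are copied into the next generation unchanged, the best error in this sketch can never get worse from one generation to the next. A full framework adds mutation, better selection schemes and local optimization of the numeric coefficients.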

Build instructions

The project requires CMake and a C++17-compliant compiler. On Windows, we recommend building with MinGW or inside a WSL distribution. We also recommend using the latest versions of Eigen and Ceres.

Required dependencies

Optional dependencies

Ceres: required for the fully featured solvers for bounds-constrained, robustified non-linear least squares problems.
cxxopts: required for the CLI app.
doctest: required for the unit tests.
Python and pybind11: required to build the Python bindings.

These libraries are well-known and should be available in your distribution's package repository. They can also be easily managed using conda or vcpkg.

Additionally, CMake will download the following libraries during the build generation phase:

Build options

The following options can be passed to CMake:

Option	Description
`-DCERES_TINY_SOLVER=ON`	Use the very small and self-contained tiny solver from the Ceres suite for solving non-linear least squares problems.
`-DUSE_SINGLE_PRECISION=ON`	Perform model evaluation using floats (single precision) instead of doubles. This reduces runtime but might not be appropriate for all purposes.
`-DUSE_OPENLIBM=ON`	Link against Julia's openlibm, a high-performance mathematical library (recommended to improve consistency across compilers and operating systems).
`-DBUILD_TESTS=ON`	Build the unit tests.
`-DBUILD_PYBIND=ON`	Build the Python bindings.
`-DUSE_JEMALLOC=ON`	Link against jemalloc, a general-purpose `malloc(3)` implementation that emphasizes fragmentation avoidance and scalable concurrency support (mutually exclusive with `tcmalloc`).
`-DUSE_TCMALLOC=ON`	Link against tcmalloc (thread-caching malloc), a `malloc(3)` implementation that reduces lock contention for multi-threaded programs (mutually exclusive with `jemalloc`).
`-DUSE_MIMALLOC=ON`	Link against mimalloc, a compact general-purpose `malloc(3)` implementation with excellent performance (mutually exclusive with `jemalloc` or `tcmalloc`).

Windows / VCPKG

Install vcpkg following the instructions from https://github.com/Microsoft/vcpkg
Install the required dependencies: `vcpkg install <deps>`
`cd <path/to/operon>`
`mkdir build && cd build`
`cmake .. -G"Your Visual Studio Version" -DCMAKE_TOOLCHAIN_FILE=[vcpkg root]\scripts\buildsystems\vcpkg.cmake`
`cmake --build . --config Release`

GNU/Linux

Install the required dependencies
`mkdir build && cd build`
`cmake .. -DCMAKE_BUILD_TYPE=Release`. Use `Debug` for a debug build, or set `CC=clang CXX=clang++` to build with a different compiler.
`make`. Add `VERBOSE=1` to get the full compilation output or `-j` for parallel compilation.

Usage

Run `operon-gp --help` to see the usage of the console client. This is the easiest way to start modeling some data. The program expects a CSV input file and assumes that the file has a header.
The Python script provided under scripts wraps the `operon-gp` binary and can be used to run bigger experiments. Data can be provided as CSV files or as JSON files containing metadata (see the data folder for examples). The script runs a grid search over a parameter space defined by the user.
Several examples (C++ and Python) are available in the examples folder.
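As a minimal sketch of the input format, the snippet below writes a small CSV dataset whose first row is a header, using only the Python standard library. The column names `x1`, `x2` and `y`, the helper `write_dataset` and the synthetic target are hypothetical; how the target column is selected on the command line is not shown here, so check `operon-gp --help` for that.

```python
import csv
import io
import math

def write_dataset(handle, n_rows=50):
    """Write a header row followed by n_rows of synthetic samples."""
    writer = csv.writer(handle)
    writer.writerow(["x1", "x2", "y"])  # header row with hypothetical column names
    for i in range(n_rows):
        x1 = i / n_rows
        x2 = math.sin(i)
        writer.writerow([x1, x2, x1 * x2 + x1])  # synthetic target y = x1 * x2 + x1

# Write to an in-memory buffer here; to create a file instead, use
# open("data.csv", "w", newline="") as the handle.
buffer = io.StringIO()
write_dataset(buffer, n_rows=5)
lines = buffer.getvalue().splitlines()
print(lines[0])
```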

Installing the Python bindings

Operon comes with Python bindings as well as a scikit-learn estimator. To build the bindings, the option `-DBUILD_PYBIND=ON` must be passed to CMake. The desired install path can be specified using the `CMAKE_INSTALL_PREFIX` variable (for example, `-DCMAKE_INSTALL_PREFIX=/usr/local/lib/python3.8/site-packages`). If an install prefix is not provided, CMake will try to detect the default path as reported by Python.

Then, the Python module and package can be installed with `cmake --install .` or `make install` (with `sudo` if needed).
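To see which directory your interpreter reports for installed packages, for example before setting the install prefix by hand, a short check with the standard `sysconfig` module can help. This is only a sketch: the helper name `default_site_packages` is hypothetical, and whether CMake's detection queries exactly these values is an assumption.

```python
import sysconfig

def default_site_packages():
    """Return the interpreter's install directories for pure-Python and compiled packages."""
    paths = sysconfig.get_paths()
    return {"purelib": paths["purelib"], "platlib": paths["platlib"]}

for kind, path in default_site_packages().items():
    print(kind, path)
```

Compiled extension modules normally belong in the `platlib` directory; on many systems both entries point to the same `site-packages` folder.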

Usage

Sklearn estimator

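import numpy as np
from operon.sklearn import SymbolicRegressor

# example data: X has shape (n_samples, n_features), y has shape (n_samples,)
X = np.random.rand(100, 2)
y = X[:, 0] * X[:, 1]

reg = SymbolicRegressor()

# standard scikit-learn estimator workflow
reg.fit(X, y)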

Operon library

from operon import Dataset, RSquared  # import further classes as needed

Publications

If you find Operon useful you can cite our work as:

@inproceedings{10.1145/3377929.3398099,
    author = {Burlacu, Bogdan and Kronberger, Gabriel and Kommenda, Michael},
    title = {Operon C++: An Efficient Genetic Programming Framework for Symbolic Regression},
    year = {2020},
    isbn = {9781450371278},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3377929.3398099},
    doi = {10.1145/3377929.3398099},
    booktitle = {Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion},
    pages = {1562–1570},
    numpages = {9},
    keywords = {symbolic regression, genetic programming, C++},
    location = {Canc\'{u}n, Mexico},
    series = {GECCO '20}
}

Operon was also featured in a recent survey of symbolic regression methods, where it showed good results:

@misc{lacava2021contemporary,
      title={Contemporary Symbolic Regression Methods and their Relative Performance}, 
      author={William La Cava and Patryk Orzechowski and Bogdan Burlacu and Fabrício Olivetti de França and Marco Virgolin and Ying Jin and Michael Kommenda and Jason H. Moore},
      year={2021},
      eprint={2107.14351},
      archivePrefix={arXiv},
      primaryClass={cs.NE}
}
