Interpreting Microbiome Relative Abundance Data Using Symbolic Regression

Code for the preprint "Interpreting Microbiome Relative Abundance Data Using Symbolic Regression".


Table of Contents
  1. Initial setup
  2. Repository structure
  3. Reproducing results of experiments
  4. Minimal use
  5. References

Initial setup

Install all dependencies by running pip install -e . from the repository root. We recommend creating a separate conda environment (with Python >=3.10) for the installation.

Repository structure

The main library (srmb) consists of three .py files. special_functions.py implements the special functions that can be added to the function set of the SymbolicClassifier estimator from the gplearn library. fitness_functions.py implements custom fitness functions, which tell the estimator explicitly to optimize the expression for accuracy, F1, or any other classification performance metric. utils.py provides utilities to visualize the learned symbolic expression tree as networkx graphs. The directory tree at the end of this section shows the full layout.
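To illustrate the idea of special functions and custom fitness functions, here is a minimal sketch. It is not the srmb implementation: the names presence_sketch and accuracy_sketch are hypothetical, and treating "presence" as "relative abundance above zero" is an assumption made for this example. The sketch also assumes that the fitness function receives probabilities (thresholded here at 0.5), which is our understanding of how gplearn's classifier calls it. The wrapping step uses gplearn's make_function and make_fitness helpers and is skipped if gplearn is not installed.

```python
import numpy as np


def presence_sketch(x):
    """Hypothetical indicator: 1.0 where the abundance is above zero, else 0.0."""
    return np.where(x > 0.0, 1.0, 0.0)


def accuracy_sketch(y, y_pred, w):
    """Weighted accuracy; y_pred is assumed to hold probabilities."""
    predicted_labels = (y_pred >= 0.5).astype(float)
    return float(np.average(predicted_labels == y, weights=w))


try:
    # Optional step: wrap the plain functions so gplearn can use them.
    from gplearn.fitness import make_fitness
    from gplearn.functions import make_function

    presence_fn = make_function(function=presence_sketch,
                                name="presence_sketch", arity=1)
    accuracy_fitness = make_fitness(function=accuracy_sketch,
                                    greater_is_better=True)
except ImportError:
    # gplearn is not available; the plain NumPy functions still work.
    presence_fn = accuracy_fitness = None
```

If the wrapping succeeds, presence_fn could be appended to a function_set list and accuracy_fitness passed as the metric argument of SymbolicClassifier, in the same way the Minimal use section passes the library's own objects.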

For the purposes of reproducing the results in the paper, we also supply all the SR and SRf estimators (for the 20 random runs) as pickle objects in the results_srmb/ directory.
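The pickled estimators can be read back with the standard library. The sketch below is a generic helper and not part of srmb (which exposes its own load_sr_models in srmb.utils). The function name load_pickled_estimators and the *.pkl file pattern are assumptions, since the file names inside results_srmb/ are not listed here.

```python
import pickle
from pathlib import Path


def load_pickled_estimators(directory, pattern="*.pkl"):
    """Load every pickle file matching `pattern` in `directory`, keyed by file stem."""
    estimators = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open("rb") as handle:
            estimators[path.stem] = pickle.load(handle)
    return estimators
```

Unpickling a fitted estimator requires gplearn and srmb to be importable in the current environment, and pickle files should only be loaded from sources you trust.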

.
├── LICENSE
├── README.md
├── data
│   └── data_diet_filtered.csv
├── notebooks
│   ├── data_visualization.ipynb
│   ├── experiment01_baseline_models_accuracy.ipynb
│   ├── experiment02_xgboost_surrogate.ipynb
│   └── sr_model_graph_visualization.ipynb
├── results_srmb
│   ├── sr_special_models/
│   └── sr_vanilla_models/
├── setup.py
└── srmb
   ├── fitness_functions.py
   ├── special_functions.py
   └── utils.py

Reproducing results of experiments

Minimal use

from srmb.fitness_functions import customacc
from srmb.special_functions import (
    presence, absence, add3, add10, ifelse, ifelseless,
    presence2, absence2,
    presence3
)
from gplearn.genetic import SymbolicClassifier


# SR with special functions
special_functions = [presence, absence, presence2, absence2, ifelse]  # optionally also add3, add10
function_set = ['add', 'sub', 'mul', 'div', 'neg', 'max', 'min'] + special_functions

# ... prepare the training data X_train (a pandas DataFrame) and y_train
est = SymbolicClassifier(population_size=6000,
                         generations=20,
                         tournament_size=25,

                         init_depth=(2, 6),
                         const_range=(0., 100.),
                         # init_method="full",
                         parsimony_coefficient=0.001,
                         function_set=function_set,

                         stopping_criteria=1.0, metric=customacc,  # use custom accuracy as fitness

                         feature_names=X_train.columns.to_list(),
                         # verbose=True,
                         random_state=42)

est.fit(X_train, y_train)

# Visualize the learned expression tree
from IPython.display import display, Image
from srmb.utils import load_sr_models, create_graph, graph_to_jpg

# Saves the graph as a JPG file and, when run in a notebook cell, also displays it
G = create_graph(est)
display(Image(graph_to_jpg(G, path="../results_srmb/viz.jpg"), width=500, unconfined=True))
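The snippet above leaves data preparation open. As a rough sketch of what X_train and y_train could look like, the following builds a synthetic relative-abundance table and splits it. Everything here is made up for illustration: the helper names, the taxon_i column names, the toy label, and the 75/25 split are assumptions and do not reflect data/data_diet_filtered.csv or the preprocessing used in the paper.

```python
import numpy as np
import pandas as pd


def make_toy_abundance_table(n_samples=60, n_taxa=8, seed=0):
    """Build a synthetic relative-abundance table whose rows sum to 1."""
    rng = np.random.default_rng(seed)
    abundances = rng.dirichlet(alpha=np.ones(n_taxa), size=n_samples)
    columns = [f"taxon_{i}" for i in range(n_taxa)]  # hypothetical feature names
    X = pd.DataFrame(abundances, columns=columns)
    # Toy binary label: is taxon_0 above its median abundance across samples?
    y = (X["taxon_0"] > X["taxon_0"].median()).astype(int)
    return X, y


def shuffled_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle row positions and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


X, y = make_toy_abundance_table()
X_train, X_test, y_train, y_test = shuffled_split(X, y)
feature_names = X_train.columns.to_list()
```

With data in this shape, est.fit(X_train, y_train) can be followed by est.predict(X_test) and a comparison against y_test to estimate held-out accuracy.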

References

The code in this repository is heavily based on the gplearn repository.

TODO