v2.1.21.0

@cezary986 cezary986 released this 12 Nov 14:03
124e486

What's new in RuleKit version 2.1.21.0?

1. Ability to use user-defined quality measures during rule induction, pruning, and voting phases.

Users can now define custom quality measure functions and use them for growing, pruning, and voting. Defining a quality measure function is straightforward; see the example below.

from rulekit.classification import RuleClassifier

def my_induction_measure(p: float, n: float, P: float, N: float) -> float:
    # do anything you want here and return a single float...
    return (p + n) / (P + N)

def my_pruning_measure(p: float, n: float, P: float, N: float) -> float:
    return p - n

def my_voting_measure(p: float, n: float, P: float, N: float) -> float:
    return (p + 1) / (p + n + 2)

python_clf = RuleClassifier(
    induction_measure=my_induction_measure,
    pruning_measure=my_pruning_measure,
    voting_measure=my_voting_measure,
)

This feature has long been available in the original Java library, but technical problems prevented its implementation in this Python package. With the release of RuleKit v2, it is finally available.
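Because custom measures are plain Python callables, they can be checked in isolation before being passed to a model. The sketch below assumes the usual rule-induction convention, which these notes do not spell out: p and n are the positive and negative examples covered by a rule, and P and N are the totals of positive and negative examples in the training set. The example counts are made up for illustration.

```python
def my_induction_measure(p: float, n: float, P: float, N: float) -> float:
    # Fraction of all training examples covered by the rule.
    return (p + n) / (P + N)


def my_voting_measure(p: float, n: float, P: float, N: float) -> float:
    # Laplace-style estimate of rule precision.
    return (p + 1) / (p + n + 2)


# Hypothetical rule (illustrative numbers only): it covers 40 of 50
# positive examples and 5 of 100 negative examples.
p, n, P, N = 40.0, 5.0, 50.0, 100.0

coverage = my_induction_measure(p, n, P, N)  # 45 / 150 = 0.3
vote = my_voting_measure(p, n, P, N)         # 41 / 47
print(f'coverage={coverage:.3f}, vote={vote:.3f}')
```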

⚠️ Using this feature comes at a price. The built-in quality measures from rulekit.params.Measures are implemented in Java and are optimized and much faster. A custom Python function will slow down model learning. For example, learning rules on the Iris dataset with the FullCoverage measure took 10.9 seconds with a Python implementation of the measure, compared with 1.8 seconds with the built-in one (roughly six times slower).
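To check how large the slowdown is on your own data, you can time the fit call for both variants. The helper below is a minimal sketch. The name timed_fit is hypothetical and not part of RuleKit, and it works with any object that exposes a fit(X, y) method.

```python
import time


def timed_fit(model, X, y) -> float:
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Possible usage (assumes rulekit is installed and X, y are already loaded;
# Measures.FullCoverage is assumed to be the built-in measure named above):
#
#   from rulekit.classification import RuleClassifier
#   from rulekit.params import Measures
#
#   java_time = timed_fit(
#       RuleClassifier(induction_measure=Measures.FullCoverage), X, y)
#   python_time = timed_fit(
#       RuleClassifier(induction_measure=my_induction_measure), X, y)
#   print(f'slowdown: {python_time / java_time:.1f}x')
```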

2. Reading ARFF files from a URL via HTTP/HTTPS.

The previous version of the package added a function for reading ARFF files, which accepted either a file path or a file-like object. As of this version, the function also accepts URLs, so an ARFF dataset can be read directly from a server via HTTP/HTTPS.

import pandas as pd
from rulekit.arff import read_arff

df: pd.DataFrame = read_arff(
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/refs/heads/master/data/seismic-bumps/'
    'seismic-bumps.arff'
)
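If you would rather not hit the network on every run, one option is to download the file once and pass the local path to read_arff, which has accepted file paths since the previous version. The helper below is only a sketch built on the standard library. The name download_once and the cache location are assumptions, not part of RuleKit.

```python
import urllib.request
from pathlib import Path


def download_once(url: str, target: Path) -> Path:
    """Download url to target unless target already exists; return target."""
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    return target


# Possible usage (assumes rulekit is installed; the cache path is arbitrary):
#
#   from rulekit.arff import read_arff
#   local_path = download_once(DATASET_URL, Path('data/seismic-bumps.arff'))
#   df = read_arff(str(local_path))
```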

3. Improved rules API

Access to some basic rule information was often quite cumbersome in earlier versions of this package. For example, there was no easy way to access information about the decision class of a classification rule.

In this version, rule classes and rule sets have been refactored and improved. Below is a list of some operations that are now much easier.

3.1 For classification rules

You can now access a rule's decision class via the rulekit.rules.ClassificationRule.decision_class field. Example below:

import pandas as pd
from rulekit.arff import read_arff
from rulekit.classification import RuleClassifier
from rulekit.rules import RuleSet, ClassificationRule

DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/refs/heads/master/data/seismic-bumps/'
    'seismic-bumps.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('class', axis=1), df['class']

clf: RuleClassifier = RuleClassifier()
clf.fit(X, y)

# the RuleSet class is now generic
ruleset: RuleSet[ClassificationRule] = clf.model
rule: ClassificationRule = ruleset.rules[0]
print('Decision class of the first rule: ', rule.decision_class)
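With the decision class exposed as a field, summaries over a whole rule set become one-liners. The sketch below counts rules per decision class. The helper name rules_per_class is hypothetical, and it relies only on the rules list and the decision_class field shown above.

```python
from collections import Counter
from typing import Iterable


def rules_per_class(rules: Iterable) -> Counter:
    """Count how many rules point to each decision class."""
    return Counter(rule.decision_class for rule in rules)


# Possible usage with the fitted classifier from the example above:
#
#   for decision_class, count in rules_per_class(ruleset.rules).items():
#       print(f'{decision_class}: {count} rule(s)')
```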

3.2 For regression rules

You can now access the decision attribute value of a rule via the rulekit.rules.RegressionRule.conclusion_value field. Example below:

import pandas as pd
from rulekit.arff import read_arff
from rulekit.regression import RuleRegressor
from rulekit.rules import RuleSet, RegressionRule

DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/master/data/methane/'
    'methane-train.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('MM116_pred', axis=1), df['MM116_pred']

reg = RuleRegressor()
reg.fit(X, y)

ruleset: RuleSet[RegressionRule] = reg.model
rule: RegressionRule = ruleset.rules[0]
print('Decision value of the first rule: ', rule.conclusion_value)

3.3 For survival rules

More changes have been made for survival rules.

First, there is a new class, rulekit.kaplan_meier.KaplanMeierEstimator, which represents the Kaplan-Meier estimator of a rule. In the future, predictions for survival problems will probably change from arrays of dictionaries to arrays of such objects, but unfortunately this would be a breaking change.

In addition, one can now easily access the Kaplan-Meier curve of the entire training dataset using the rulekit.survival.SurvivalRules.get_train_set_kaplan_meier method.

Such curves can be easily plotted using the charting package of your choice.

import pandas as pd
import matplotlib.pyplot as plt
from rulekit.arff import read_arff
from rulekit.survival import SurvivalRules
from rulekit.rules import RuleSet, SurvivalRule
from rulekit.kaplan_meier import KaplanMeierEstimator # this is a new class

DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/master/data/bmt/'
    'bmt.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('survival_status', axis=1), df['survival_status']

surv = SurvivalRules(survival_time_attr='survival_time')
surv.fit(X, y)

ruleset: RuleSet[SurvivalRule] = surv.model
rule: SurvivalRule = ruleset.rules[0]

# you can now easily access Kaplan-Meier estimator of the rules
rule_estimator: KaplanMeierEstimator = rule.kaplan_meier_estimator
plt.step(
    rule_estimator.times, 
    rule_estimator.probabilities,
    label='First rule'
)
# you can also access training dataset Kaplan-Meier estimator easily
train_dataset_estimator: KaplanMeierEstimator = surv.get_train_set_kaplan_meier()
plt.step(
    train_dataset_estimator.times, 
    train_dataset_estimator.probabilities,
    label='Training dataset'
)
plt.legend(title='Kaplan-Meier curves:')
plt.show()  # needed when running as a script rather than in a notebook
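For readers unfamiliar with the estimator, the times and probabilities plotted above form a step function of survival probability over time. The sketch below computes the standard product-limit (Kaplan-Meier) estimate in pure Python. It is independent of RuleKit and is not the library's implementation; the function name and the toy data are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(
    times: Sequence[float], events: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Product-limit survival estimate for right-censored data.

    times  - observed time for each subject
    events - 1 if the event occurred at that time, 0 if censored
    Returns the distinct event times and the survival probability
    just after each of them.
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # subjects leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    out_times: List[float] = []
    out_probs: List[float] = []
    for t in sorted(leaving):
        d = event_counts.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - d / at_risk
            out_times.append(t)
            out_probs.append(survival)
        at_risk -= leaving[t]
    return out_times, out_probs


# Toy data: five subjects, two of them censored (event flag 0).
km_times, km_probs = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(km_times, [round(prob, 3) for prob in km_probs])
```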

4. Changes in expert rules induction for regression and survival ❗BREAKING CHANGES

Note that these changes will likely be reverted in the next version. They are caused by a known bug in the original RuleKit library; fixing it is beyond the scope of this package, which is merely a wrapper for that library.

Starting with this version, the way expert rules and conditions are specified for regression and survival problems has changed. All you have to do is remove the conclusion part of those rules (everything after THEN).

Expert rules before:

expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN survival_status = {NaN}'
    )
]

expert_preferred_conditions = [
    (
        'attr-preferred-0',
        'inf: IF [CD34kgx10d6 = Any] THEN survival_status = {NaN}'
    )
]


expert_forbidden_conditions = [
    ('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN survival_status = {NaN}')
]

And now:

expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN'
    )
]

expert_preferred_conditions = [
    (
        'attr-preferred-0',
        'inf: IF [CD34kgx10d6 = Any] THEN'
    )
]


expert_forbidden_conditions = [
    ('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN')
]
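If you have many expert rules in the old format, the migration can be scripted. The helper below is a minimal sketch (the name strip_conclusion is hypothetical): it drops everything after the first THEN keyword and leaves strings without THEN untouched. It assumes that ' THEN' does not occur inside an attribute name or value.

```python
from typing import List, Tuple


def strip_conclusion(rule: str) -> str:
    """Remove the conclusion part (everything after THEN) from a rule."""
    head, keyword, _conclusion = rule.partition(' THEN')
    return head + keyword if keyword else rule


def migrate(rules: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Apply strip_conclusion to every (name, rule) pair."""
    return [(name, strip_conclusion(rule)) for name, rule in rules]


old_forbidden_conditions = [
    ('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN survival_status = {NaN}')
]
new_forbidden_conditions = migrate(old_forbidden_conditions)
print(new_forbidden_conditions)
```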

Other changes

  • Fixed expert rules parsing.
  • Conditions are now printed in the order in which they were added to the rule.
  • Fixed a bug when using the sklearn.base.clone function with RuleKit model classes.
  • Updated tutorials in the documentation.