
Releases: adaa-polsl/RuleKit-python

v2.1.24.1

07 Jan 11:24

What's new in RuleKit version 2.1.24.1?

New Contributors

Full Changelog: v2.1.24.0...v2.1.24.1

v2.1.24.0

28 Nov 13:12

What's new in RuleKit version 2.1.24.0?

1. Revert breaking changes in expert rules induction for regression and survival

Version 2.1.21.0 introduced breaking changes to expert rules induction, which you can read more about in that version's release notes. Rules and expert conditions can now be defined in both the old and the new format; see the example below.

# both variants will work the same
expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN survival_status = {NaN}'
    ),
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN'
    ),
]

2. Upgrade to a new version of RuleKit

In the new version of the Java RuleKit library, many bugs in expert rules induction have been fixed.

Other changes

  • Improve flake8 score.
  • Add more unit tests.

v2.1.21.0

12 Nov 14:03

What's new in RuleKit version 2.1.21.0?

1. Ability to use user-defined quality measures during rule induction, pruning, and voting phases.

Users can now define custom quality measure functions and use them for growing, pruning, and voting. Defining a quality measure function is straightforward; see the example below.

from rulekit.classification import RuleClassifier

def my_induction_measure(p: float, n: float, P: float, N: float) -> float:
    # do anything you want here and return a single float...
    return (p + n) / (P + N)

def my_pruning_measure(p: float, n: float, P: float, N: float) -> float:
    return p - n

def my_voting_measure(p: float, n: float, P: float, N: float) -> float:
    return (p + 1) / (p + n + 2)

python_clf = RuleClassifier(
    induction_measure=my_induction_measure,
    pruning_measure=my_pruning_measure,
    voting_measure=my_voting_measure,
)

This functionality has long been available in the original Java library, but technical problems prevented implementing it in this package. Now, with the release of RuleKit v2, it is finally available.

⚠️ Using this feature comes at a price. The original set of quality measures from rulekit.params.Measures uses an optimized, much faster implementation of these functions in Java; a custom Python function will certainly slow down model learning. For example, learning rules on the Iris dataset with the FullCoverage measure went from 1.8 seconds to 10.9 seconds after switching to the Python implementation of the same measure.
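
For comparison, here is a minimal sketch of configuring the built-in measures; it assumes only the rulekit.params.Measures enum and its FullCoverage member mentioned above:

from rulekit.classification import RuleClassifier
from rulekit.params import Measures

# built-in measures are evaluated on the Java side and are much faster
java_clf = RuleClassifier(
    induction_measure=Measures.FullCoverage,
    pruning_measure=Measures.FullCoverage,
    voting_measure=Measures.FullCoverage,
)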

2. Reading ARFF files from URLs via HTTP/HTTPS.

The previous version of the package added a function for reading ARFF files that accepted a file path or a file-like object as an argument. As of this version, the function also accepts URLs, so an ARFF dataset can be read directly from a server via HTTP/HTTPS.

import pandas as pd
from rulekit.arff import read_arff

df: pd.DataFrame = read_arff(
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/refs/heads/master/data/seismic-bumps/'
    'seismic-bumps.arff'
)

3. Improved rules API

Access to some basic rule information was often quite cumbersome in earlier versions of this package. For example, there was no easy way to access information about the decision class of a classification rule.

In this version, rule classes and rule sets have been refactored and improved. Below is a list of some operations that are now much easier.

3.1 For classification rules

You can now access a rule's decision class via the rulekit.rules.ClassificationRule.decision_class field. Example below:

import pandas as pd
from rulekit.arff import read_arff
from rulekit.classification import RuleClassifier
from rulekit.rules import RuleSet, ClassificationRule

DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/refs/heads/master/data/seismic-bumps/'
    'seismic-bumps.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('class', axis=1), df['class']

clf: RuleClassifier = RuleClassifier()
clf.fit(X, y)

# the RuleSet class is now generic
ruleset: RuleSet[ClassificationRule] = clf.model
rule: ClassificationRule = ruleset.rules[0]
print('Decision class of the first rule: ', rule.decision_class)

3.2 For regression rules

You can now access a rule's decision attribute value via the rulekit.rules.RegressionRule.conclusion_value field. Example below:

import pandas as pd
from rulekit.arff import read_arff
from rulekit.regression import RuleRegressor
from rulekit.rules import RuleSet, RegressionRule

DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/master/data/methane/'
    'methane-train.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('MM116_pred', axis=1), df['MM116_pred']

reg = RuleRegressor()
reg.fit(X, y)

ruleset: RuleSet[RegressionRule] = reg.model
rule: RegressionRule = ruleset.rules[0]
print('Decision value of the first rule: ', rule.conclusion_value)

3.3 For survival rules

More changes have been made for survival rules.

First, there is a new class, rulekit.kaplan_meier.KaplanMeierEstimator, which represents the Kaplan-Meier estimator of a rule. In the future, predictions for survival problems will probably be changed from arrays of dictionaries to arrays of such objects, but that would unfortunately be a breaking change.

In addition, one can now easily access the Kaplan-Meier curve of the entire training dataset using the rulekit.survival.SurvivalRules.get_train_set_kaplan_meier method.

Such curves can be easily plotted using the charting package of your choice.

import pandas as pd
import matplotlib.pyplot as plt
from rulekit.arff import read_arff
from rulekit.survival import SurvivalRules
from rulekit.rules import RuleSet, SurvivalRule
from rulekit.kaplan_meier import KaplanMeierEstimator # this is a new class

DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/master/data/bmt/'
    'bmt.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('survival_status', axis=1), df['survival_status']

surv = SurvivalRules(survival_time_attr='survival_time')
surv.fit(X, y)

ruleset: RuleSet[SurvivalRule] = surv.model
rule: SurvivalRule = ruleset.rules[0]

# you can now easily access the Kaplan-Meier estimator of a rule
rule_estimator: KaplanMeierEstimator = rule.kaplan_meier_estimator
plt.step(
    rule_estimator.times, 
    rule_estimator.probabilities,
    label='First rule'
)
# you can also access the training dataset's Kaplan-Meier estimator
train_dataset_estimator: KaplanMeierEstimator = surv.get_train_set_kaplan_meier()
plt.step(
    train_dataset_estimator.times,
    train_dataset_estimator.probabilities,
    label='Training dataset'
)
plt.legend(title='Kaplan-Meier curves:')
plt.show()

4. Changes in expert rules induction for regression and survival ❗BREAKING CHANGES

Note that these changes will likely be reverted in the next version; they are caused by a known bug in the original RuleKit library. Fixing it is beyond the scope of this package, which is merely a wrapper for it.

This version changes the way expert rules and conditions for regression and survival problems are defined. All you have to do is remove the conclusion part of those rules (everything after THEN).

Expert rules before:

expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN survival_status = {NaN}'
    )
]

expert_preferred_conditions = [
    (
        'attr-preferred-0',
        'inf: IF [CD34kgx10d6 = Any] THEN survival_status = {NaN}'
    )
]


expert_forbidden_conditions = [
    ('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN survival_status = {NaN}')
]

And now:

expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN'
    )
]

expert_preferred_conditions = [
    (
        'attr-preferred-0',
        'inf: IF [CD34kgx10d6 = Any] THEN'
    )
]


expert_forbidden_conditions = [
    ('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN')
]

Other changes

  • Fix expert rules parsing.
  • Print conditions in the order they were added to the rule.
  • Fix a bug when using the sklearn.base.clone function with RuleKit model classes.
  • Update tutorials in the documentation.

v2.1.18.0

09 Sep 10:12

What's new in RuleKit version 2.1.18.0?

This release mainly focuses on fixing various inconsistencies between this package and the original Java RuleKit v2 library.

1. Add utility function for reading .arff files.

The ARFF file format was originally created by the Machine Learning Project at the University of Waikato's Department of Computer Science for use with the Weka machine learning software. This format, once popular, has now become rather niche. However, some older but popular public benchmark datasets are still available as ARFF files.

Modern Python, however, lacks a good package for reading such files. Most existing examples on the internet use the scipy.io.arff package. However, this package has some drawbacks that can be problematic (they certainly were in our own experiments). First of all, it does not read the data as a pandas DataFrame. Although the returned data can easily be converted into a DataFrame, string columns are not properly decoded and are left as bytes. We also encountered problems parsing empty values, especially in numeric columns.

After encountering all these problems and drinking considerable amounts of coffee ☕ to fix all sorts of strange bugs they caused, we decided to add a custom function for reading ARFF files to this package. It is not a completely new implementation; it uses scipy.io.arff under the hood but fixes the previously mentioned problems and returns a ready-to-use pandas DataFrame compatible with the models available in this package. Example below.

import pandas as pd
from rulekit.arff import read_arff

df: pd.DataFrame = read_arff('./tests/additional_resources/cholesterol.arff')

2. Add ability to write verbose rule induction process logs to a file.

The original RuleKit provides detailed logs of the entire rule induction process. Such logs may not be of interest to the average user, but may be of value to others. They can also be helpful in the debugging process (they certainly were for us).

To configure such logs, use the RuleKit class:

from rulekit import RuleKit

RuleKit.configure_java_logger(
    log_file_path='./java.logs',
    verbosity_level=1
)
# train your model later

3. Add validation of the model parameters configuration.

This package acts as a wrapper for the original RuleKit library written in Java, offering an analogous but more Pythonic API. However, this architecture has led to many bugs in the past. Most of them were due to differences between the parameter values of models configured in Python and the values actually set in Java. In this version, we have added automatic validation, which compares the parameter values configured by the user with those configured in Java and raises a rulekit.exceptions.RuleKitMisconfigurationException exception on mismatch. This exception should not occur during normal use of this package; it was added mainly to make debugging easier and prevent such bugs in the future.
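
For illustration, a minimal sketch of catching this exception; it assumes nothing beyond the exception class named above:

from sklearn.datasets import load_iris
from rulekit.classification import RuleClassifier
from rulekit.exceptions import RuleKitMisconfigurationException

X, y = load_iris(return_X_y=True)

try:
    clf = RuleClassifier()
    clf.fit(X, y)  # parameters are validated against their Java counterparts
except RuleKitMisconfigurationException as error:
    # should not happen during normal use; worth reporting as a bug
    print(error)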

Fixed issues

  • Inconsistent results of induction for survival #22
  • Fixed numerous inconsistencies between this package and the original Java RuleKit v2 library.

v2.1.16.0

30 Jul 11:29

What's new in RuleKit version 2.1.16.0?

1. RuleKit and RapidMiner part ways 💔

RuleKit had been using the RapidMiner Java API for various tasks, such as loading data and measuring model performance, since its beginning. With major version 2, RuleKit finally parted ways with RapidMiner. This is mainly due to the recent work of our contributors: Wojciech Górka and Mateusz Kalisch.

This change brings many benefits, such as:

  • a huge reduction in the size of the RuleKit Java package jar file (from 131MB to 40.9MB);
  • the jar file is now small enough to fit into the Python package distribution, which means there is no longer a need to download it in an extra step.

Although the license has remained the same (GNU AGPL-3.0 license), for commercial projects that require the ability to distribute RuleKit code as part of a program that cannot be distributed under the AGPL, it may be possible to obtain an appropriate license from the authors. Feel free to contact us!

2. ⚠️ BREAKING CHANGE min_rule_covered algorithm parameter was removed

Up to this version, this parameter was marked as deprecated and using it only produced a warning. It has now been completely removed, which might be a breaking change.
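
As a migration hint, a minimal sketch; minsupp_new (see issue #23 below) is assumed to be the replacement parameter:

from rulekit.classification import RuleClassifier

# before (the parameter is removed in this version and no longer accepted):
# clf = RuleClassifier(min_rule_covered=5)

# after: minsupp_new, now a float (see issue #23)
clf = RuleClassifier(minsupp_new=5.0)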

3. ⚠️ BREAKING CHANGE The classification metric negative_voting_conflicts is no longer available

As of this version, the metrics returned from the RuleClassifier.predict method with return_metrics=True no longer include the negative_voting_conflicts metric.

In fact, there was no way to calculate this metric without access to the true values of the labels. The predict method does not take labels as an argument, so previous results for this metric were unfortunately incorrect.
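
For reference, a minimal sketch of the return_metrics usage described above; the metrics dictionary simply no longer contains this key:

from sklearn.datasets import load_iris
from rulekit.classification import RuleClassifier

X, y = load_iris(return_X_y=True)
clf = RuleClassifier()
clf.fit(X, y)

# 'negative_voting_conflicts' is no longer among the returned metrics
prediction, metrics = clf.predict(X, return_metrics=True)
print(metrics)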

If you really need to calculate this specific metric, you still can, but it requires more effort. Here is an example of how to achieve it using the currently available API:

import re
from collections import defaultdict
import numpy as np
from sklearn.datasets import load_iris

from rulekit.classification import RuleClassifier

X, y = load_iris(return_X_y=True)

clf = RuleClassifier()
clf.fit(X, y)

prediction: np.ndarray = clf.predict(X)

# 1. Group rules by decision class based on their conclusions
rule_decision_class_regex = re.compile("^.+THEN .+ = {(.+)}$")

grouped_rules: dict[str, list[int]] = defaultdict(lambda: [])
for i, rule in enumerate(clf.model.rules):
    rule_decision_class: str = rule_decision_class_regex.search(
        str(rule)).group(1)
    grouped_rules[rule_decision_class].append(i)

# 2. Get rules covering each example
coverage_matrix: np.ndarray = clf.get_coverage_matrix(X)

# 3. Group coverages of the rules with the same decision class
grouped_coverage_matrix: np.ndarray = np.zeros(
    shape=(coverage_matrix.shape[0], len(grouped_rules.keys()))
)
for i, rule_indices in enumerate(grouped_rules.values()):
    grouped_coverage_matrix[:, i] = np.sum(
        coverage_matrix[:, rule_indices], axis=1
    )
grouped_coverage_matrix[grouped_coverage_matrix > 0] = 1

# 4. Find examples with voting conflicts
# (a conflict means the example is covered by rules of more than one class)
voting_conflicts_mask: np.ndarray = np.sum(grouped_coverage_matrix, axis=1) > 1

# 5. Find examples with negative voting conflicts (where predicted class
# is not equal to actual class)
negative_conflicts_mask: np.ndarray = voting_conflicts_mask[
    y != prediction
]
negative_conflicts: int = np.sum(negative_conflicts_mask)
print('Number of negative voting conflicts: ', negative_conflicts)

Not so simple, right?

Perhaps in the future we will add an API to calculate this indicator in a more user-friendly way.

4. 🕰️ DEPRECATION download_jar command is now deprecated

Due to the removal of RapidMiner's dependencies from the RuleKit Java package, its jar file size has decreased significantly. Now it's small enough to fit into the Python package distribution. There is no need to download it in an extra step using this command as before:

python -m rulekit download_jar

This command now does nothing except generate a warning. It will be completely removed in major version 3.

Fixed issues:

  • minsupp_new should be a float parameter #23
  • Inconsistent results of expert induction for regression and survival #19

v.1.7.14.0

25 Jul 07:26

1. New version of the Java RuleKit backend.

A migration to the latest version of the Java RuleKit backend (v1.7.14) has been performed. For more details, see the original RuleKit release notes.

2. New versioning scheme

From now on, this Python package will be versioned consistently with the main RuleKit Java package. The versioning scheme is as follows:

{JAVA_RULEKIT_PACKAGE_VERSION}.{PYTHON_PACKAGE_VERSION}.

e.g.

1.7.14.0

Here JAVA_RULEKIT_PACKAGE_VERSION is the version of the Java package used by this particular version of the Python package, and PYTHON_PACKAGE_VERSION is a single number distinguishing versions of the Python package that use the same Java package version but differ in Python code.
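
To make the scheme concrete, an illustrative snippet (not part of the package API) splitting such a version string:

# illustrative only: split a combined version into its two parts
version = '1.7.14.0'
*java_parts, python_revision = version.split('.')
print('.'.join(java_parts))   # 1.7.14 - the bundled Java RuleKit version
print(python_revision)        # 0 - the Python package revision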

Yes, I know it's quite complicated, but now it's at least clear which version of the Java package is being used under the hood.

v.1.7.6

27 Feb 09:09
  • Use the latest RuleKit release v1.7.5
  • Change the min_supp_new (alias min_rule_covered) parameter type from integer to float, the same as in the original RuleKit package

v1.7.5

15 Feb 06:39
  • Migrate to the latest RuleKit jar file v1.7.4
  • Fix issue #20

v1.7.4

05 Feb 09:30

Fixed issues:

  • Issue #17: java.lang.NullPointerException during expert rule induction
  • Issue #19: java.lang.RuntimeException when inducing rules on a DataFrame with boolean columns

v.1.7.3

30 Jan 11:34
  • Move test dependencies to the ./tests/requirements.txt file.
  • Loosen dependency version requirements.