Releases: adaa-polsl/RuleKit-python
v2.1.24.1
What's new in RuleKit version 2.1.24.1?
- correct max rule count in survival problems by @adamgrzelak in #37
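For context, a minimal sketch of where this fix applies; it assumes the max_rule_count parameter (the setting the fix concerns) caps the size of the induced rule set:
from rulekit.survival import SurvivalRules

# hypothetical configuration: cap the rule set for a survival problem
# (max_rule_count is assumed from the fix description above)
surv = SurvivalRules(
    survival_time_attr='survival_time',
    max_rule_count=5,
)
# surv.fit(X, y)  # the cap is now respected for survival problems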
New Contributors
- @adamgrzelak made their first contribution in #37 🎉
Full Changelog: v2.1.24.0...v2.1.24.1
v2.1.24.0
What's new in RuleKit version 2.1.24.0?
1. Revert breaking changes in expert rules induction for regression and survival
The previous version, 2.1.21.0, introduced some breaking changes, which you can read more about in that release's notes.
Now rules and expert conditions can be defined in both the old and the new format; see the example below.
# both variants will work the same
expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN survival_status = {NaN}'
    ),
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN'
    ),
]
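Either way, the rules are passed to the expert variant of the model at fit time. A minimal sketch, assuming the ExpertSurvivalRules expert-induction API and a training set X, y with a survival_time attribute:
from rulekit.survival import ExpertSurvivalRules

# X, y are assumed to be the training features and survival status labels
surv = ExpertSurvivalRules(survival_time_attr='survival_time')
surv.fit(
    X, y,
    expert_rules=expert_rules,  # the list defined above, in either format
)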
2. Upgrade to new version of RuleKit
In the new version of the Java RuleKit library, many bugs regarding expert induction have been corrected.
Other changes
- Improve flake8 score
- Add more unit tests.
v2.1.21.0
What's new in RuleKit version 2.1.21.0?
1. Ability to use user-defined quality measures during rule induction, pruning, and voting phases.
Users can now define custom quality measure functions and use them for growing, pruning, and voting. Defining a quality measure function is easy and straightforward; see the example below.
from rulekit.classification import RuleClassifier

def my_induction_measure(p: float, n: float, P: float, N: float) -> float:
    # do anything you want here and return a single float...
    return (p + n) / (P + N)

def my_pruning_measure(p: float, n: float, P: float, N: float) -> float:
    return p - n

def my_voting_measure(p: float, n: float, P: float, N: float) -> float:
    return (p + 1) / (p + n + 2)

python_clf = RuleClassifier(
    induction_measure=my_induction_measure,
    pruning_measure=my_pruning_measure,
    voting_measure=my_voting_measure,
)
This feature was available long ago in the original Java library, but technical problems prevented its implementation in this package. Now, with the release of RuleKit v2, it is finally available.
⚠️ Using this feature comes at a price. The original set of quality measures from rulekit.params.Measures provides an optimized and much faster implementation of these quality functions in Java. Using a custom Python function will certainly slow down the model learning process. For example, learning rules on the Iris dataset using the FullCoverage measure went from 1.8 seconds to 10.9 seconds after switching to the Python implementation of the same measure.
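For comparison, an equivalent configuration using the built-in measures (here the FullCoverage measure mentioned above) looks like this:
from rulekit.classification import RuleClassifier
from rulekit.params import Measures

# built-in measures are executed entirely in Java, avoiding the
# Python call overhead described above
java_clf = RuleClassifier(
    induction_measure=Measures.FullCoverage,
    pruning_measure=Measures.FullCoverage,
    voting_measure=Measures.FullCoverage,
)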
2. Reading arff files from URLs via HTTP/HTTPS.
The previous version of the package added a new function for reading arff files, which accepted a file path or a file-like object as an argument. As of this version, the function also accepts URLs, making it possible to read an arff dataset directly from a server via HTTP/HTTPS.
import pandas as pd
from rulekit.arff import read_arff
df: pd.DataFrame = read_arff(
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/refs/heads/master/data/seismic-bumps/'
    'seismic-bumps.arff'
)
3. Improved rules API
Access to some basic rule information was often quite cumbersome in earlier versions of this package. For example, there was no easy way to access information about the decision class of a classification rule.
In this version, rule classes and rule sets have been refactored and improved. Below is a list of some operations that are now much easier.
3.1 For classification rules
You can now access a rule's decision class via the rulekit.rules.ClassificationRule.decision_class field. Example below:
import pandas as pd
from rulekit.arff import read_arff
from rulekit.classification import RuleClassifier
from rulekit.rules import RuleSet, ClassificationRule
DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/refs/heads/master/data/seismic-bumps/'
    'seismic-bumps.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('class', axis=1), df['class']
clf: RuleClassifier = RuleClassifier()
clf.fit(X, y)
# RuleSet class became generic now
ruleset: RuleSet[ClassificationRule] = clf.model
rule: ClassificationRule = ruleset.rules[0]
print('Decision class of the first rule: ', rule.decision_class)
3.2 For regression rules
You can now access a rule's decision attribute value via the rulekit.rules.RegressionRule.conclusion_value field. Example below:
import pandas as pd
from rulekit.arff import read_arff
from rulekit.regression import RuleRegressor
from rulekit.rules import RuleSet, RegressionRule
DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/master/data/methane/'
    'methane-train.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('MM116_pred', axis=1), df['MM116_pred']
reg = RuleRegressor()
reg.fit(X, y)
ruleset: RuleSet[RegressionRule] = reg.model
rule: RegressionRule = ruleset.rules[0]
print('Decision value of the first rule: ', rule.conclusion_value)
3.3 For survival rules
More changes have been made for survival rules.
First, there is a new class, rulekit.kaplan_meier.KaplanMeierEstimator, which represents a Kaplan-Meier estimator. In the future, prediction arrays for survival problems will probably be changed from arrays of dictionaries to arrays of such objects, but that would unfortunately be a breaking change.
In addition, one can now easily access the Kaplan-Meier curve of the entire training dataset using the rulekit.survival.SurvivalRules.get_train_set_kaplan_meier method.
Such curves can be easily plotted using the charting package of your choice.
import pandas as pd
import matplotlib.pyplot as plt
from rulekit.arff import read_arff
from rulekit.survival import SurvivalRules
from rulekit.rules import RuleSet, SurvivalRule
from rulekit.kaplan_meier import KaplanMeierEstimator # this is a new class
DATASET_URL: str = (
    'https://raw.githubusercontent.com/'
    'adaa-polsl/RuleKit/master/data/bmt/'
    'bmt.arff'
)
df: pd.DataFrame = read_arff(DATASET_URL)
X, y = df.drop('survival_status', axis=1), df['survival_status']
surv = SurvivalRules(survival_time_attr='survival_time')
surv.fit(X, y)
ruleset: RuleSet[SurvivalRule] = surv.model
rule: SurvivalRule = ruleset.rules[0]
# you can now easily access the Kaplan-Meier estimator of a rule
rule_estimator: KaplanMeierEstimator = rule.kaplan_meier_estimator
plt.step(
    rule_estimator.times,
    rule_estimator.probabilities,
    label='First rule'
)
# you can also easily access the training dataset's Kaplan-Meier estimator
train_dataset_estimator: KaplanMeierEstimator = surv.get_train_set_kaplan_meier()
plt.step(
    train_dataset_estimator.times,
    train_dataset_estimator.probabilities,
    label='Training dataset'
)
plt.legend(title='Kaplan-Meier curves:')
4. Changes in expert rules induction for regression and survival ❗BREAKING CHANGES
Note that these changes will likely be reverted in the next version; they are caused by a known bug in the original RuleKit library, and fixing it is beyond the scope of this package, which is merely a wrapper for it.
Since this version, the way expert rules and conditions for regression and survival problems are defined has changed. All you have to do is remove the conclusion part of those rules (everything after THEN).
Expert rules before:
expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN survival_status = {NaN}'
    )
]
expert_preferred_conditions = [
    (
        'attr-preferred-0',
        'inf: IF [CD34kgx10d6 = Any] THEN survival_status = {NaN}'
    )
]
expert_forbidden_conditions = [
    ('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN survival_status = {NaN}')
]
And now:
expert_rules = [
    (
        'rule-0',
        'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN'
    )
]
expert_preferred_conditions = [
    (
        'attr-preferred-0',
        'inf: IF [CD34kgx10d6 = Any] THEN'
    )
]
expert_forbidden_conditions = [
    ('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN')
]
Other changes
- Fix expert rules parsing.
- Print conditions in the order they were added to the rule.
- Fix a bug when using the sklearn.base.clone function with RuleKit model classes.
- Update tutorials in the documentation.
v2.1.18.0
What's new in RuleKit version 2.1.18.0?
This release mainly focuses on fixing various inconsistencies between this package and the original Java RuleKit v2 library.
1. Add utility function for reading .arff files.
The ARFF file format was originally created by the Machine Learning Project at the University of Waikato's Department of Computer Science for use with Weka machine learning software. This format, once popular, has now become rather niche. However, some older but popular public benchmark datasets are still available as arff files.
Modern Python, however, lacks a good package for reading such files. Most existing examples on the internet use the scipy.io.arff module. However, it has some drawbacks that can be problematic (they certainly were in our own experiments). First of all, it does not read the data as pandas DataFrames. Although the returned data can be easily converted into a DataFrame, it still fails to properly decode string columns, leaving them as bytes. We also encountered problems parsing empty values, especially in numeric columns.
After encountering all these problems and drinking considerable amounts of coffee ☕ to fix all sorts of strange bugs they caused, we decided to add a custom function for reading arff files to this package. It is not a completely new implementation; it uses scipy.io.arff under the hood, fixes the previously mentioned problems, and returns a ready-to-use pandas DataFrame compatible with the models available in this package. Example below.
import pandas as pd
from rulekit.arff import read_arff
df: pd.DataFrame = read_arff('./tests/additional_resources/cholesterol.arff')
2. Add ability to write verbose rule induction process logs to a file.
The original RuleKit provides detailed logs of the entire rule induction process. Such logs may not be of interest to the average user, but may be of value to others. They can also be helpful in the debugging process (they certainly were for us).
To configure such logs, you can use the RuleKit class:
from rulekit import RuleKit
RuleKit.configure_java_logger(
    log_file_path='./java.logs',
    verbosity_level=1
)
# train your model later
3. Add validation of model parameter configuration.
This package acts as a wrapper for the original RuleKit library written in Java, offering an analogous but more Pythonic API. However, this architecture has led to many bugs in the past, most of them due to differences between the parameter values of models configured in Python and their values set in Java. In this version, we have added automatic validation, which compares the parameter values configured by the user with those configured in Java and raises a rulekit.exceptions.RuleKitMisconfigurationException exception when they diverge. However, this exception should not occur during normal use of this package; it was added mainly to make debugging easier and prevent such bugs in the future.
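If it ever does occur, it can be caught like any other exception; a minimal sketch, assuming only the exception class named above:
from rulekit.classification import RuleClassifier
from rulekit.exceptions import RuleKitMisconfigurationException

try:
    clf = RuleClassifier()
    clf.fit(X, y)  # X, y: your training data
except RuleKitMisconfigurationException as error:
    # parameter values diverged between Python and Java - worth reporting
    print(f'RuleKit misconfiguration: {error}')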
Fixed issues
- Inconsistent results of induction for survival #22
- Fixed numerous inconsistencies between this package and the original Java RuleKit v2 library.
v2.1.16.0
What's new in RuleKit version 2.1.16.0?
1. RuleKit and RapidMiner part ways 💔
Since its beginning, RuleKit has used the RapidMiner Java API for various tasks, such as loading data and measuring model performance. With major version 2, RuleKit has finally parted ways with RapidMiner. This is mainly due to the recent work of our contributors: Wojciech Górka and Mateusz Kalisch.
This change brings many benefits and other changes such as:
- a huge reduction in the jar file size of the RuleKit Java package (from 131 MB to 40.9 MB).
- the jar file is now small enough to fit into the Python package distribution, which means there is no longer a need to download it in an extra step.
Although the license has remained the same (GNU AGPL-3.0 license), for commercial projects that require the ability to distribute RuleKit code as part of a program that cannot be distributed under the AGPL, it may be possible to obtain an appropriate license from the authors. Feel free to contact us!
2. ⚠️ BREAKING CHANGE: the min_rule_covered algorithm parameter was removed
Up to this version, this parameter was marked as deprecated and its usage only resulted in a warning. Now it has been completely removed, which might be a breaking change.
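If your code still passes this parameter, it must be updated; a minimal sketch, assuming minsupp_new is the parameter that superseded it:
from rulekit.classification import RuleClassifier

# before (no longer works - min_rule_covered was removed):
# clf = RuleClassifier(min_rule_covered=5)

# after (minsupp_new is assumed here to be the successor parameter):
clf = RuleClassifier(minsupp_new=5)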
3. ⚠️ BREAKING CHANGE: the classification metric negative_voting_conflicts is no longer available
As of this version, the metrics returned from the RuleClassifier.predict method with return_metrics=True no longer include the negative_voting_conflicts metric.
In fact, there was no way to calculate this metric without access to the true values of the labels. The predict method does not take labels as an argument, so previous results for this metric were unfortunately incorrect.
If you really need to calculate this specific metric, you still can, but it requires more effort. Here is an example of how to achieve it using the currently available API:
import re
from collections import defaultdict
import numpy as np
from sklearn.datasets import load_iris
from rulekit.classification import RuleClassifier
X, y = load_iris(return_X_y=True)
clf = RuleClassifier()
clf.fit(X, y)
prediction: np.ndarray = clf.predict(X)
# 1. Group rules by decision class based on their conclusions
rule_decision_class_regex = re.compile("^.+THEN .+ = {(.+)}$")
grouped_rules: dict[str, list[int]] = defaultdict(list)
for i, rule in enumerate(clf.model.rules):
    rule_decision_class: str = rule_decision_class_regex.search(
        str(rule)
    ).group(1)
    grouped_rules[rule_decision_class].append(i)
# 2. Get rules covering each example
coverage_matrix: np.ndarray = clf.get_coverage_matrix(X)
# 3. Group coverages of the rules with the same decision class
grouped_coverage_matrix: np.ndarray = np.zeros(
    shape=(coverage_matrix.shape[0], len(grouped_rules.keys()))
)
for i, rule_indices in enumerate(grouped_rules.values()):
    grouped_coverage_matrix[:, i] = np.sum(
        coverage_matrix[:, rule_indices], axis=1
    )
grouped_coverage_matrix[grouped_coverage_matrix > 0] = 1
# 4. Find examples with voting conflicts (covered by rules of more
# than one decision class)
voting_conflicts_mask: np.ndarray = np.sum(grouped_coverage_matrix, axis=1) > 1
# 5. Find examples with negative voting conflicts (where predicted class
# is not equal to actual class)
negative_conflicts_mask: np.ndarray = voting_conflicts_mask[
    y != prediction
]
negative_conflicts: int = np.sum(negative_conflicts_mask)
print('Number of negative voting conflicts: ', negative_conflicts)
Not so simple, right?
Perhaps in the future we will add an API to calculate this indicator in a more user-friendly way.
4. 🕰️ DEPRECATION: the download_jar command is now deprecated
Due to the removal of the RapidMiner dependencies from the RuleKit Java package, its jar file size has decreased significantly. It is now small enough to fit into the Python package distribution, so there is no longer any need to download it in an extra step using this command as before:
python -m rulekit download_jar
This command now does nothing and generates a warning. It will be completely removed in the next major version, 3.
Fixed issues:
v1.7.14.0
1. New version of the Java RuleKit backend.
A migration to the latest version of the Java RuleKit backend (v1.7.14) has been performed. For more details, see the original RuleKit release notes.
2. New versioning scheme
From now on, this Python package will be versioned consistently with the main RuleKit Java package. The versioning scheme is as follows:
{JAVA_RULEKIT_PACKAGE_VERSION}.{PYTHON_PACKAGE_VERSION}, e.g. 1.7.14.0
Here JAVA_RULEKIT_PACKAGE_VERSION is the version of the Java package used by this particular version of the Python package, and PYTHON_PACKAGE_VERSION is a single number distinguishing Python package versions that use the same Java package version but differ in Python code.
Yes, I know it's quite complicated, but now it's at least clear which version of the Java package is being used under the hood.