Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Onelearn implementation: AMF & Mondrian Tree Classifiers #1131

Merged
merged 53 commits into from
Jan 17, 2023

Conversation

AlexandreChaussard
Copy link
Contributor

@AlexandreChaussard AlexandreChaussard commented Dec 30, 2022

Description of the PR

Hi ! 👋

This is a first version of Onelearn's library (classifiers only) implementation in River.
It contains:

  • Mondrian Tree (Base and Classifier)
  • Aggregated Mondrian Forest (Base and Classifier)

The original repository with proof of working implementation can be found here (see script.py).

image

Notes on the utils

Currently I placed two functions in the utils section:

  • sample_discrete
  • log_sum_2_exp

It might seem overkill to place them as utils right now looking at where they're used in the code, but I'll need them for the regressors too when the times come. Maybe there's a better place for them keeping the regressors in mind though.

@MaxHalford
Copy link
Member

Hey hey :)

I don't have much time right now to check the code, but maybe @smastelini is available? I'll answer some of your questions though:

Management of labels: labels must be positive integers at the moment, since it just makes my life easier. Please tell me how you would prefer this to be implemented in River's framework (I've seen labels presented as dictionary, but I'm not sure what these dictionaries contains: int? string?). This might be a simple trick for me to change how labels are managed I think, just need to know your preferences.

Trees are usually multi-class classifiers. We expact multi-class classifiers to support any label of type bool, int or str (see base/typing.py)

Examples of usage: MondrianTreeClassifier and AMFClassifier would need examples of implementations for users. I actually implemented one on my repo here already, but I didn't manage to compile River so I couldn't get the scores 😢

What do you mean you didn't manage to compile? Yeah some doctests would be nice, at the minimum.

river/utils/random.py Outdated Show resolved Hide resolved
river/tree/mondrian_tree.py Outdated Show resolved Hide resolved
river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
- Adding a "mondrian" folder in the "tree" folder for better file structure
- Using "random.choices" instead of the "sample_discrete" functions in "utils.py", and removing "sample_discrete" from the "utils.py"
@smastelini
Copy link
Member

Hi @AlexandreChaussard and @MaxHalford 😃

I can take care of this review. It's going to be a good opportunity to learn more about AMF.

I'm currently enjoying the holidays with my family, but I will start reviewing next week!

If needed, besides leaving comments, I can create a PR to @AlexandreChaussard's fork with some suggestions.

- Removing the "__repr__" method of AMF
- Removing the @Setter and @Getter
- Removing the "loss" parameter of the classifiers since only the "log-loss" is being used in the end
@MaxHalford
Copy link
Member

Yes, let's take the time to review this properly, there's no rush. This has the potential to be one of our best algorithms, so it's worth spending time on it.

- Making `learn_one` and `predict_proba_one` accepting all kinds of supported labels for `y` as input
- `predict_proba_one` outputs a dictionary of scores with matching labels
@AlexandreChaussard
Copy link
Contributor Author

Hey ! 👋

I've been fixing all the previously stated points, besides the example docstring that I keep for when the model will be completely reviewed and ready to go 😄

Happy new years team ❤️

Copy link
Member

@smastelini smastelini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @AlexandreChaussard, I started reviewing the code and am impressed by the effort you put into implementing AMF.

I barely scratched the surface yet and left some initial comments, mostly on style. One thing popped up though, I see you're keeping track of class numbers and expects a predefined number of classes.

I suggest keeping defining a set of observed classes, as the Hoeffding Tree Classifier does. This way, new classes are allowed to appear and you can access the properties that are necessary to make predictions and perform other tasks.

If this is possible, the classes become any hashable object, as it is the case in the Hoeffding Trees. What do you think?

I'll keep reviewing your code

river/tree/__init__.py Outdated Show resolved Hide resolved
river/tree/mondrian/__init__.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree.py Show resolved Hide resolved
river/tree/mondrian/mondrian_tree.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
AlexandreChaussard and others added 4 commits January 2, 2023 20:32
Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>
Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>
Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>
Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>
@AlexandreChaussard
Copy link
Contributor Author

Hi @AlexandreChaussard, I started reviewing the code and am impressed by the effort you put into implementing AMF.

Thanks ❤️

I suggest keeping defining a set of observed classes, as the Hoeffding Tree Classifier does. This way, new classes are allowed to appear and you can access the properties that are necessary to make predictions and perform other tasks.

Code-wise, I don't think this would be any problem to implement, there's already a dictionary collecting the observed classes that is called _classes in the classifiers. It is only used at the moment as a way to translate classes object into positive integers for easy computation later on.
However, it seems that the number of classes must be known before hand to compute the Jeffrey's prior (see description of AMF). If the number of classes could change over time, the current implementation would be incorrect (at first we would have 0 classes, so the dirichlet parameter would be discarded at the bottom and it'd be possible to have probabilities > 1 in the distribution using Jeffrey's prior).

If I'm making an error here, please let me know I'm no expert of AMF 😄

If this is possible, the classes become any hashable object, as it is the case in the Hoeffding Trees. What do you think?

The classes can be any string, int or boolean at the moment as @MaxHalford suggested (typing.ClfTarget). I'm not sure what you mean by hashable object instead (like something to put in an HashMap? It's kinda what does the dictionnary _classes right?)

I'm busy with exams at the moment, but I'll be working on all these on Thursday for sure!
Cheers 🎈

@smastelini
Copy link
Member

Yeap! You got it :)

All these things (ints, etc.) can go in a set. Do worry about changing things now. Take your time with the exams. It's actually better this way, so that I can take some time to delve deep in the code until the end of the week. Cheers!

AlexandreChaussard and others added 3 commits January 5, 2023 18:37
- Leaving `__all__` in alphabetical order for the classifiers
- Removing type parameters in the description of `log_2_sum` of math utils
- Replacing java-like getters and setters by python-like properties and setter
- Replacing Overflow from infinity to maximum possible float (so it makes computations still possible)
@AlexandreChaussard
Copy link
Contributor Author

Hey 👋

I've fixed/answered all the previous reviews, and I've added support for the random state !

Copy link
Member

@smastelini smastelini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is one more wave of comments. Once they are fixed, I'll create a PR for your fork with code-style suggestions. Keep up the excellent work!

river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
river/ensemble/aggregated_mondrian_forest.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_classifier.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_classifier.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_classifier.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_classifier.py Outdated Show resolved Hide resolved
AlexandreChaussard and others added 5 commits January 6, 2023 22:44
Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>
- Fixing import order in __init__ file of ensemble
- Using LaTeX formulation in AMFClassifier description
- Making all nodes related methods private (it shouldn't be used outside)
- Docstring syntax update and fixes
- Importing river.base instead of typing module for better readability
- Adding a short description to the MondrianTreeClassifier
- Renaming MondrianTreeLeaf into MondrianLeaf
- Reordering functions in MondrianTreeClassifier for better readability
river/tree/mondrian/mondrian_tree_classifier.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_classifier.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_classifier.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
river/tree/mondrian/mondrian_tree_nodes.py Outdated Show resolved Hide resolved
AlexandreChaussard and others added 21 commits January 14, 2023 16:04
- Adding suggestions from Mastelini on keys usage
- Removing useless initialization of scores in the MondrianTreeClassifier
…to suggestions2

Critical issue to fix in AMFClassifier before merging into main
Style suggestions merge from Mastelini
- Fixing scoring bug (no propagation of counts)
- Removing unused parameters in docs
- Replacing type union of Python 3.10 in 3.9 annotations
- Adding little description for MondrianBranch
fix remaining tests and remove duplicated method call
- Adding examples for AMF & Mondrian Tree Classifiers
- Reordering __init__ in alphabetical order
- Cleaning the comments
- Adding string representation for nodes
Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>
@AlexandreChaussard
Copy link
Contributor Author

Hey (again) @MaxHalford !
@smastelini and I have worked hard on the classifiers, and we think it's ready to go!

Any chance you could have a look as well so maybe we can go for the merge :3 ?

Cheers 🥳

Copy link
Member

@smastelini smastelini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, @AlexandreChaussard, for all the work on this PR!

You have addressed all the main problems identified in the code and paved the way for the upcoming AMF regressor.

It remains to decide whether or not we can get rid of the n_classes parameter and track incoming classes automatically. We will wait for the answer from AMF's original authors and proceed to make changes if needed afterward.

@MaxHalford MaxHalford merged commit 41410a8 into online-ml:main Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants