Add fittable #140
Conversation
Looks good! Some minor comments and suggestions.
Looks super useful. Left some comments. Also, perhaps we can add some reference to multi-label usage somewhere?
"""Save the model to a folder.""" | ||
save_pipeline(self, path) | ||
|
||
def push_to_hub(self, repo_id: str, token: str | None = None, private: bool = False) -> None: |
I would add a model card and perhaps tags or a library reference; this helps a lot with visibility, usability and findability.
https://huggingface.co/docs/hub/model-cards#specifying-a-library
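For illustration only, a minimal sketch of what attaching card metadata could look like with `huggingface_hub` (the helper name `_push_model_card` and the tag values are assumptions, not code from this PR):

```python
from huggingface_hub import ModelCard, ModelCardData


def _push_model_card(repo_id: str, token: str | None = None) -> None:
    """Push a minimal model card with library and task metadata (hypothetical helper)."""
    # Metadata lets the Hub list the model under the right library/task filters.
    card_data = ModelCardData(
        library_name="model2vec",
        tags=["model2vec", "classification"],
        pipeline_tag="text-classification",
    )
    content = f"---\n{card_data.to_yaml()}\n---\n\n# Model card\n"
    ModelCard(content).push_to_hub(repo_id, token=token)
```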
This actually already happens because we push the underlying static model to the Hub, which has a model card. The model card template is specified in the root of the codebase.
```python
self.head = head

@classmethod
def from_pretrained(
```
Can't we load it from the Hub? Perhaps we should align the arguments a bit with the transformers naming, given you've also adopted `from_pretrained`? For example, using `pretrained_model_name_or_path`: https://huggingface.co/docs/transformers/v4.48.0/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained
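For concreteness, a sketch of what the suggested transformers-style naming could look like (purely illustrative; not the PR's actual signature):

```python
from __future__ import annotations

from os import PathLike


class StaticModelPipeline:
    @classmethod
    def from_pretrained(
        cls: type["StaticModelPipeline"],
        pretrained_model_name_or_path: str | PathLike,
        token: str | None = None,
    ) -> "StaticModelPipeline":
        """Load a pipeline from a local folder or a Hub repo id."""
        ...
```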
`from_pretrained` loads from the Hub. The arguments mimic the ones from `StaticModel` and, although they don't match transformers exactly, we're wary of introducing breaking changes.
A couple of comments and suggestions.
```markdown
# Usage

Let's assume you're using our `potion-edu classifier`.
```
Suggested change:

```diff
-Let's assume you're using our `potion-edu classifier`.
+Let's assume you're using our [potion-edu classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier).
```

Also, a small todo: don't forget to make this model public.
```python
from model2vec.inference import StaticModelPipeline

s = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
```
Suggested change:

```diff
-s = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
+classifier = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
```
```python
from model2vec.inference import StaticModelPipeline

s = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
label = s.predict("Attitudes towards cattle in the Alps: a study in letting go.")
```
Suggested change:

```diff
-label = s.predict("Attitudes towards cattle in the Alps: a study in letting go.")
+label = classifier.predict("Attitudes towards cattle in the Alps: a study in letting go.")
```
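Taken together, the two renaming suggestions would make the README example read as follows (just the combined result of the suggestions above):

```python
from model2vec.inference import StaticModelPipeline

classifier = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
label = classifier.predict("Attitudes towards cattle in the Alps: a study in letting go.")
```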
```python
@classmethod
def from_pretrained(
    cls: type[StaticModelPipeline], path: PathLike, token: str | None = None
```
Should also accept `trust_remote_code` and pass it to `_load_pipeline`, I think.
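A rough sketch of how that could be threaded through (the `_load_pipeline` signature and return value are assumptions about the PR's internals; only the keyword forwarding is the point here):

```python
@classmethod
def from_pretrained(
    cls: type[StaticModelPipeline],
    path: PathLike,
    token: str | None = None,
    trust_remote_code: bool = False,
) -> StaticModelPipeline:
    """Load a pipeline, forwarding trust_remote_code to the internal loader."""
    # Assumption: _load_pipeline accepts the extra keyword and returns the
    # components the pipeline is built from; the real internals may differ.
    model, head = _load_pipeline(path, token=token, trust_remote_code=trust_remote_code)
    return cls(model, head)
```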
```python
def _predict_and_coerce_to_2d(self, X: list[str] | str) -> np.ndarray:
    """Predict the labels of the input and coerce the output to a matrix."""
    encoded = self.model.encode(X)
```
There's no control over `encode` since this is in a private function. I'm not sure if we want to give any control here, e.g. batch size, multiprocessing, etc.?
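If control is wanted, one lightweight option would be forwarding keyword arguments down to `encode`; a sketch (the forwarded option names, such as `batch_size`, are assumptions about what `encode` accepts):

```python
def _predict_and_coerce_to_2d(self, X: list[str] | str, **encode_kwargs) -> np.ndarray:
    """Predict the labels of the input and coerce the output to a matrix."""
    # Public predict methods could pass options such as batch_size through
    # here; the rest of the body stays as in the PR and is elided.
    encoded = self.model.encode(X, **encode_kwargs)
    ...
```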
```python
def configure_optimizers(self) -> OptimizerLRScheduler:
    """Simple Adam optimizer."""
    optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)
```
I think the learning rate is now ignored; this should be:
```diff
-    optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)
+    optimizer = torch.optim.Adam(self.model.parameters(), lr=self.learning_rate)
```
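For context, a minimal sketch of the wiring this fix assumes, i.e. the module storing the configured learning rate at construction time (the class and argument names are illustrative, not the PR's exact code):

```python
import pytorch_lightning as pl
import torch


class ClassifierModule(pl.LightningModule):
    """Illustrative module: stores the learning rate and uses it in the optimizer."""

    def __init__(self, model: torch.nn.Module, learning_rate: float = 1e-3) -> None:
        super().__init__()
        self.model = model
        self.learning_rate = learning_rate

    def configure_optimizers(self) -> torch.optim.Optimizer:
        # Honour the configured learning rate instead of a hard-coded 1e-3.
        return torch.optim.Adam(self.model.parameters(), lr=self.learning_rate)
```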
```diff
@@ -0,0 +1,248 @@
+{
```
The prints in the notebook are not saved/pushed, I think. This is a bit weird; for example, when you have "Pretty good! We outperform the tf-idf pipeline by a wide margin.", it doesn't actually show the output in the notebook (see https://github.com/MinishLab/model2vec/blob/add-fittable/tutorials/train_classifier.ipynb).