
Feat/pipeline simpler fitting #36

Merged — 30 commits merged from feat/pipeline-simpler-fitting into dev on Nov 12, 2024
Conversation

voorhs (Collaborator) commented Nov 5, 2024

A concise example of working with the new API for pipeline optimization:

# load data
from autointent.context.data_handler import Dataset
from autointent.context.utils import load_data

train_dataset = load_data("./data/train_data.json")
val_dataset = load_data("./data/test_data.json")

# define search space
from autointent.pipeline.optimization import PipelineOptimizer

config = {
    "nodes": [
        {
            "node_type": "scoring",
            "metric": "scoring_roc_auc",
            "search_space": [
                {"module_type": "knn", "k": [5, 10], "weights": ["uniform", "distance", "closest"], "model_name": ["avsolatorio/GIST-small-Embedding-v0"]},
                {"module_type": "linear", "model_name": ["avsolatorio/GIST-small-Embedding-v0"]},
            ],
        },
        {
            "node_type": "prediction",
            "metric": "prediction_accuracy",
            "search_space": [
                {"module_type": "threshold", "thresh": [0.5]},
                {"module_type": "tunable"},
            ],
        },
    ]
}

pipeline_optimizer = PipelineOptimizer.from_dict_config(config)

# optionally, configure your run
from autointent.configs.optimization_cli import LoggingConfig, VectorIndexConfig, EmbedderConfig
from pathlib import Path

pipeline_optimizer.set_config(LoggingConfig(run_name="sweet_cucumber", dirpath=Path(".").resolve(), dump_modules=False))
pipeline_optimizer.set_config(VectorIndexConfig(db_dir=Path("./my_vector_db").resolve(), device="cuda"))
pipeline_optimizer.set_config(EmbedderConfig(batch_size=16, max_length=32))

# run optimization
context = pipeline_optimizer.optimize_from_dataset(train_dataset, val_dataset)

# dump logs
context.dump()

Other features:

  • initializing Context is no longer as cumbersome
  • modules can be left undumped by setting logs.dump_modules=False in the config

TODO:

  • an option to clear modules from RAM (i.e. remove gc.collect() etc. at the user's request)
  • clearing db_dir at the user's request
  • fix unintended runs directory creation

@voorhs voorhs requested a review from Samoed November 5, 2024 15:51
def get_max_length(self) -> int | None:
return self.vector_index_client.embedder_max_length

def get_dump_dir(self) -> Path | None:
Collaborator:

Make the get... methods simply properties.

Collaborator:

+1

context.config_logs(self.logging_config)
context.config_vector_index(self.vector_index_config, self.embedder_config)

self.optimize(context)
Collaborator:

Maybe it would be better to make this an init for the optimizer, so that it can then run the optimization on its own?

Collaborator Author (voorhs):

Are you suggesting creating yet another class with its own init that would either create a context or accept an existing one?

Collaborator:

I suggest keeping such a method only on the context itself.

@voorhs voorhs marked this pull request as draft November 6, 2024 10:19
voorhs and others added 6 commits November 6, 2024 13:34
# Conflicts:
#	autointent/context/optimization_info/data_models.py
#	autointent/context/optimization_info/optimization_info.py
#	autointent/pipeline/inference/inference_pipeline.py
#	autointent/pipeline/optimization/pipeline_optimizer.py
@voorhs voorhs marked this pull request as ready for review November 8, 2024 20:13
@voorhs voorhs requested a review from Samoed November 8, 2024 20:18
Darinochka and others added 4 commits November 9, 2024 10:47
* test: added inference_test

* test: added inference pipeline cli

* test: fixed device

* test: added optimization tests

* fix `inference_config.yaml` not found error

---------

Co-authored-by: voorhs <ilya_alekseev_2016@list.ru>
voorhs (Collaborator Author) commented Nov 9, 2024

The PR is ready to merge; I'm just waiting for a review from someone.

from .data_handler import Dataset


class NumpyEncoder(json.JSONEncoder):
Collaborator:

It looks like this class is no longer needed.

Collaborator Author (voorhs):

It is still used in Context.dump() and inference.cli_endpoint.main().



@dataclass
class ModulesList:
Collaborator:

What if we used Pydantic everywhere? Are there any downsides?

Collaborator Author (voorhs):

In this particular spot Pydantic couldn't be used because of a complicated typing scheme and a circular import problem. There was an error saying that the Module object was not yet defined. As soon as I switched to a dataclass, the error went away.
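
The dataclass workaround can be sketched as follows (class and import names are hypothetical): a string forward reference in the annotation is not evaluated at runtime, so the dataclass can be created even though Module is not yet importable, whereas a Pydantic model would try to resolve the type eagerly to build validators:

```python
from dataclasses import dataclass, field
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # imported only for static type checkers,
    # avoiding the circular import at runtime
    from autointent.modules import Module  # hypothetical import path


@dataclass
class ModulesList:
    # the quoted "Module" annotation is never evaluated at runtime,
    # so this works even though Module is not defined here
    modules: list["Module"] = field(default_factory=list)
```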

def get_max_length(self) -> int | None:
return self.vector_index_client.embedder_max_length

def get_dump_dir(self) -> Path | None:
Collaborator:

+1

@@ -52,3 +52,6 @@ def predict(self, *args: list[str] | npt.NDArray[Any], **kwargs: dict[str, Any])
@abstractmethod
def from_context(cls, context: Context, **kwargs: dict[str, Any]) -> Self:
pass

def get_embedder_name(self) -> str | None:
Collaborator:

Then all overrides of this method could be removed:

    def get_embedder_name(self) -> str | None:
        if hasattr(self, "embedder_name"):
            return getattr(self, "embedder_name", None)
        return None

Collaborator Author (voorhs):

I want to live with this version for a while. I'm just worried that one of the subclasses might someday need to change the name embedder_name. If it turns out not to be needed even in the next releases, we'll remove it.

I'll mark this function in the base Module as experimental.

context.vector_index_client.delete_db()

def optimize_from_dataset(
self, train_data: Dataset, val_data: Dataset | None = None, force_multilabel: bool = False
Collaborator:

train_dataset and test_dataset would be better names, in my opinion.

Collaborator Author (voorhs):

Maybe we'll fix it later, closer to the release. There are a lot of naming problems.

self.vector_index_config = VectorIndexConfig()
self.embedder_config = EmbedderConfig()

def set_config(self, config: LoggingConfig | VectorIndexConfig | EmbedderConfig) -> None:
Collaborator:

For elegance, you could write it like this. Just move the error message into a separate variable, otherwise ruff fails:

    def set_config(self, config: LoggingConfig | VectorIndexConfig | EmbedderConfig) -> None:
        match config:
            case LoggingConfig():
                self.logging_config = config
            case VectorIndexConfig():
                self.vector_index_config = config
            case EmbedderConfig():
                self.embedder_config = config
            case _:
                msg = "unknown config type"
                raise TypeError(msg)

Collaborator Author (voorhs):

Wow, I didn't know Python had its own switch-case...

augmenter=augmenter,
)

def set_datasets(
Collaborator:

The method is called set_datasets, but its parameters are named ..._data.

Collaborator Author (voorhs):

I don't quite follow.

self.seed = seed
self._logger = logging.getLogger(__name__)

def config_logs(self, config: LoggingConfig) -> None:
Collaborator:

I would name methods like this configure_logging or setup_logging.

cfg.embedder.max_length,
)
context = Context(cfg.seed)
context.config_logs(cfg.logs)
Collaborator:

logs -> logging_config, and so on.

Especially since that's how it's done in the PipelineOptimizer class.


def predict(self, utterances: list[str]) -> list[LabelType]:
scores = self.nodes[NodeType.scoring].module.predict(utterances)
return self.nodes[NodeType.prediction].module.predict(scores) # type: ignore[return-value]

def fit(self, utterances: list[str], labels: list[LabelType]) -> None:
pass

@classmethod
def from_context(cls, context: Context) -> "InferencePipeline":
Collaborator:

Roma writes -> Self for the return annotation. We should agree on one convention.

@voorhs voorhs mentioned this pull request Nov 11, 2024
@voorhs voorhs merged commit ad097e8 into dev Nov 12, 2024
20 checks passed
@voorhs voorhs deleted the feat/pipeline-simpler-fitting branch November 12, 2024 09:11