Hydra help and documentation update for file-based execution configuration #31

Merged
merged 4 commits into from
Nov 5, 2024
139 changes: 86 additions & 53 deletions README.md
Collaborator

In the next PR, remember to update the default values listed in the "All options" section of the README.

(note to self)

Collaborator

I would also add a line to the README along the lines of "Example configuration files can be found in this folder".

@@ -48,69 +48,102 @@ autointent data.train_path=default-multiclass \
seed=42
```

All options (by group):
All options as yaml (default values shown):
```yaml
data:
  # Path to a json file with training data. Set to "default" to use banking77 data stored within the
  # autointent package.
  train_path: ???

  # Path to a json file with test records. Skip this option if you want to use a random subset of the
  # training sample as test data.
  test_path: null

  # Set to true if your data is multiclass but you want to train the multilabel classifier.
  force_multilabel: false

task:
  # Path to a yaml configuration file that defines the optimization search space.
  # Omit this to use the default configuration.
  search_space_path: null
logs:
  # Name of the run prepended to optimization assets dirname (generated randomly if omitted)
  run_name: "awful_hippo_10-30-2024_19-42-12"

  # Location where to save optimization logs that will be saved as `<logs_dir>/<run_name>_<cur_datetime>/logs.json`.
  # Omit to use current working directory. <-- on Windows it is not correct
  dirpath: "/home/user/AutoIntent/awful_hippo_10-30-2024_19-42-12"

  dump_dir: "/home/user/AutoIntent/runs/awful_hippo_10-30-2024_19-42-12/modules_dumps"

vector_index:
  # Location where to save faiss database file. Omit to use your system's default cache directory.
  db_dir: null

  # Specify device in torch notation
  device: cpu

augmentation:
  # Number of shots per intent to sample from regular expressions. This option extends sample utterances
  # within multiclass intent records.
  regex_sampling: 0

  # Config string like "[20, 40, 20, 10]" means 20 one-label examples, 40 two-label examples, 20 three-label examples,
  # 10 four-label examples. This option extends multilabel utterance records.
  multilabel_generation_config: null

embedder:
  # Batch size for embedding computation.
  batch_size: 1
  # Sentence length limit for embedding computation.
  max_length: null

# Affects the randomness
seed: 0

# String from {DEBUG,INFO,WARNING,ERROR,CRITICAL}. Omit to use ERROR by default.
hydra.job_logging.root.level: "ERROR"
```
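
The `logs` comments above describe logs being saved as `<logs_dir>/<run_name>_<cur_datetime>/logs.json`. A minimal Python sketch of how such a path could be assembled is given here for illustration. The helper name `build_logs_path` is hypothetical, and the timestamp format is an assumption inferred from the sample value `awful_hippo_10-30-2024_19-42-12`; it is not taken from the package's code.

```python
from datetime import datetime
from pathlib import Path


def build_logs_path(logs_dir: Path, run_name: str, now: datetime) -> Path:
    """Hypothetical helper: compose <logs_dir>/<run_name>_<cur_datetime>/logs.json."""
    # Timestamp format is an assumption based on the sample run name in the README.
    stamp = now.strftime("%m-%d-%Y_%H-%M-%S")
    return logs_dir / f"{run_name}_{stamp}" / "logs.json"


example = build_logs_path(Path("runs"), "awful_hippo", datetime(2024, 10, 30, 19, 42, 12))
print(example.as_posix())
```

Building the path with `pathlib` rather than string concatenation keeps the separators platform-neutral, which may be relevant to the Windows remark in the comment above.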
seed Affects the randomness

== task ==

search_space_path Path to a yaml configuration file that defines the
optimization search space. Omit this to use the
default configuration.

== data ==

train_path Path to a json file with training data. Set to
"default" to use banking77 data stored within the
autointent package.

test_path Path to a json file with test records. Skip this
option if you want to use a random subset of the
training sample as test data.

force_multilabel Set to true if your data is multiclass but you want to
train the multilabel classifier.

== logs ==

dirpath Location where to save optimization logs that will be
saved as `<logs_dir>/<run_name>_<cur_datetime>/logs.json`.
Omit to use current working directory.

run_name Name of the run prepended to optimization assets dirname

log_level String from {DEBUG,INFO,WARNING,ERROR,CRITICAL}.
Omit to use ERROR by default.

== vector_index ==

db_dir Location where to save faiss database file. Omit to
use your system's default cache directory.

device Specify device in torch notation
### How to set configuration options
* Option 1: on the command line as key=value. Example:
```bash
autointent embedder.batch_size=32
```

== augmentation ==
* Option 2: in a yaml configuration file.
In a separate folder, create a yaml file **my_config.yaml** with the following structure:
```yaml
defaults:
  - optimization_config
  - _self_
  - override hydra/job_logging: custom

# Put the configuration options you want to override here. The full structure is presented above.
# This example sets the same option as the command line variant above.
embedder:
  batch_size: 32
```
Run AutoIntent:
```bash
autointent --config-path=/path/to/config/directory --config-name=my_config
```

regex_sampling Number of shots per intent to sample from regular
expressions. This option extends sample utterances
within multiclass intent records.
Important:
* specify the full path in the config-path option.
* do not use tabs in the yaml file.
* preferably give the file a name other than optimization_config.yaml to avoid warnings from Hydra.
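
The three notes above can be turned into a small pre-flight check. The sketch below is only an illustration: the function `find_config_problems` and its messages are hypothetical and are not part of AutoIntent or Hydra.

```python
from pathlib import Path


def find_config_problems(config_dir: str, config_name: str, yaml_text: str) -> list[str]:
    """Hypothetical pre-flight check mirroring the notes above; returns found problems."""
    problems = []
    if not Path(config_dir).is_absolute():
        problems.append("config-path is not a full path")
    if config_name == "optimization_config":
        problems.append("config name clashes with optimization_config")
    # Conservative: flags any tab character, including ones inside quoted strings.
    if "\t" in yaml_text:
        problems.append("yaml text contains a tab character")
    return problems


print(find_config_problems("configs", "my_config", "embedder:\n\tbatch_size: 32\n"))
```

An empty list means none of the three conditions was triggered; it does not guarantee that Hydra will accept the file.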

seed Affects the data partitioning
You can combine Options 1 and 2. Options given on the command line take the highest priority.
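
The precedence described here (defaults, then the yaml file, then the command line) can be pictured as a layered merge. The following is a plain-Python illustration of that ordering, not Hydra's implementation: the helpers `deep_merge` and `parse_overrides` are hypothetical, and unlike Hydra they keep command-line values as strings.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `base` with `override` applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def parse_overrides(args: list[str]) -> dict[str, Any]:
    """Turn ["a.b=1"] into {"a": {"b": "1"}}; values stay strings in this sketch."""
    result: dict[str, Any] = {}
    for arg in args:
        dotted_key, _, value = arg.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


defaults = {"embedder": {"batch_size": 1, "max_length": None}, "seed": 0}
from_file = {"embedder": {"batch_size": 32}}  # what my_config.yaml would contribute
from_cli = parse_overrides(["embedder.batch_size=64", "seed=42"])

# Later layers win: defaults < yaml file < command line.
final = deep_merge(deep_merge(defaults, from_file), from_cli)
print(final)
```

Keys that no layer overrides, such as `embedder.max_length`, keep their default value.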

hydra.job_logging.root.level
String from {DEBUG,INFO,WARNING,ERROR,CRITICAL}.
Omit to use ERROR by default.

multilabel_generation_config
Config string like "[20, 40, 20, 10]" means 20 one-
label examples, 40 two-label examples, 20 three-label
examples, 10 four-label examples. This option extends
multilabel utterance records.
```

The package ships with a default config and default data (5-shot banking77 / 20-shot dstc3).

Example input data is in the `data/intent_records` directory.
Examples:
- example input data: [data](./data)
- example configs: [example_configs](./example_configs)

### Inference

30 changes: 27 additions & 3 deletions autointent/configs/optimization_cli.py
@@ -6,7 +6,6 @@
from hydra.core.config_store import ConfigStore
from omegaconf import MISSING

from autointent.custom_types import LogLevel
from autointent.pipeline.optimization.utils import generate_name


@@ -28,7 +27,6 @@ class TaskConfig:
class LoggingConfig:
    run_name: str | None = None
    dirpath: Path | None = None
    level: LogLevel = LogLevel.ERROR
    dump_dir: Path | None = None

    def __post_init__(self) -> None:
@@ -84,7 +82,11 @@ class OptimizationConfig:
    embedder: EmbedderConfig = field(default_factory=EmbedderConfig)

    defaults: list[Any] = field(
        default_factory=lambda: ["_self_", {"override hydra/job_logging": "autointent_standard_job_logger"}]
        default_factory=lambda: [
            "_self_",
            {"override hydra/job_logging": "autointent_standard_job_logger"},
            {"override hydra/help": "autointent_help"},
        ]
    )


@@ -107,7 +109,29 @@ class OptimizationConfig:
    "disable_existing_loggers": "false",
}

help_config = {
"app_name": "AutoIntent",
"header": "== ${hydra.help.app_name} ==",
"footer": """
Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help""",
"template": """
${hydra.help.header}

This is ${hydra.help.app_name}!
== Config ==
This is the config generated for this run.
You can override everything, for example:
python my_app.py db.user=foo db.pass=bar
-------
$CONFIG
-------

${hydra.help.footer}""",
}


cs = ConfigStore.instance()
cs.store(name="optimization_config", node=OptimizationConfig)
cs.store(name="autointent_standard_job_logger", group="hydra/job_logging", node=logger_config)
cs.store(name="autointent_help", group="hydra/help", node=help_config)
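
To make the structure of `help_config` easier to follow, here is a toy renderer showing how its fields compose: `header` refers to `app_name`, `template` refers to `header`, and `$CONFIG` is a placeholder for the run's config. This is a simplified stand-in written for illustration. In Hydra the `${hydra.help.*}` references are resolved by its own interpolation machinery, and the function `render_help` and the shortened template below are assumptions.

```python
import re
from string import Template

help_config = {
    "app_name": "AutoIntent",
    "header": "== ${hydra.help.app_name} ==",
    "template": "${hydra.help.header}\n\n== Config ==\n$CONFIG\n",
}


def render_help(help_cfg: dict[str, str], config_yaml: str) -> str:
    """Toy renderer: resolve ${hydra.help.*} references, then fill in $CONFIG."""
    pattern = re.compile(r"\$\{hydra\.help\.(\w+)\}")

    def resolve(text: str) -> str:
        # Repeat until no references remain, since header itself refers to app_name.
        while pattern.search(text):
            text = pattern.sub(lambda m: help_cfg[m.group(1)], text)
        return text

    return Template(resolve(help_cfg["template"])).safe_substitute(CONFIG=config_yaml)


print(render_help(help_config, "seed: 0"))
```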
11 changes: 11 additions & 0 deletions example_configs/example_1.yaml
@@ -0,0 +1,11 @@
defaults:
  - optimization_config
  - _self_

data:
  train_path: "default-multilabel"

hydra:
  job_logging:
    root:
      level: "INFO"
15 changes: 15 additions & 0 deletions example_configs/example_2.yaml
@@ -0,0 +1,15 @@
defaults:
  - optimization_config
  - _self_

data:
  train_path: "data/intent_records/ac_robotic_new.json"
  force_multilabel: true

logs:
  dirpath: "experiments/multiclass_as_multilabel/"
  run_name: "robotics_new_testing"

augmentation:
  regex_sampling: 10
  multilabel_generation_config: "[0, 4000, 1000]"
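
According to the README text in this PR, a `multilabel_generation_config` string lists how many examples to generate for each label count: the first number is for one-label examples, the second for two-label examples, and so on. The sketch below decodes such a string under that reading. The function `parse_generation_config` is hypothetical and is not AutoIntent's own parser.

```python
import json


def parse_generation_config(config: str) -> dict[int, int]:
    """Map number of labels per example -> how many such examples to generate."""
    counts = json.loads(config)
    if not isinstance(counts, list) or not all(isinstance(c, int) and c >= 0 for c in counts):
        raise ValueError(f"expected a list of non-negative integers, got {config!r}")
    # Position i (0-based) holds the count of examples with i + 1 labels.
    return {n_labels: count for n_labels, count in enumerate(counts, start=1)}


print(parse_generation_config("[0, 4000, 1000]"))
```

Under this reading, the value used in `example_2.yaml` requests no one-label examples, 4000 two-label examples and 1000 three-label examples.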
11 changes: 11 additions & 0 deletions example_configs/example_3.yaml
@@ -0,0 +1,11 @@
defaults:
  - optimization_config
  - _self_

data:
  train_path: "data/intent_records/ac_robotic_new.json"
  test_path: "data/intent_records/ac_robotic_val.json"
  force_multilabel: true

augmentation:
  regex_sampling: 20
9 changes: 9 additions & 0 deletions example_configs/example_4.yaml
@@ -0,0 +1,9 @@
defaults:
  - optimization_config
  - _self_

data:
  train_path: "default-multiclass"
  test_path: "data/intent_records/banking77_test.json"

seed: 42