This is the official repository for *Anomaly Detection via Gumbel Noise Score Matching*. If you use this repository for research purposes, please cite our work using the citation at the end of this document.
This section explains how to train the model on a new dataset. You'll need to modify two main files: `dataconfigs.py` and `dataloader.py`. Follow these steps to add your new dataset:
1. Add a new `ConfigDict` for your dataset in `dataconfigs.py`, using the existing configs as a template, and fill in the following fields:
   - `dataset`: The name of your dataset (string)
   - `categories`: A list of integers giving the number of classes for each categorical feature
   - `numerical_features`: The number of numerical features in your dataset (integer)
   - `label_column`: The name of the column containing the class labels (string)
   - `anomaly_label`: The label used for anomalies in your dataset (string)
Example:

```python
new_dataset = ml_collections.ConfigDict()
new_dataset.dataset = "new_dataset"
new_dataset.categories = [3, 4, 2, 5]  # Example: 4 categorical features with 3, 4, 2, and 5 classes respectively
new_dataset.numerical_features = 2
new_dataset.label_column = "class"
new_dataset.anomaly_label = "anomaly"

# Add your new dataset to the main config
config.new_dataset = new_dataset
```
2. Add your dataset to the `tabular_datasets` dictionary in `dataloader.py`:

```python
tabular_datasets = {
    # ... existing datasets ...
    "new_dataset": "new_dataset.csv",
}
```
3. (Optional) If your dataset requires special handling, modify the `load_dataset` function. Look for the section that handles specific datasets (e.g., "cars", "mushrooms", "nursery") and add your logic there.
Example:

```python
if name == "new_dataset":
    df = pd.read_csv(f"data/{tabular_datasets[name]}")
    # Add any necessary preprocessing steps here. For example:
    # df = df.drop(columns=['unnecessary_column'])
    # df[label_name] = df[label_name].map({'normal': '0', 'anomaly': '1'})
```
4. Prepare your data file:
   - Ensure your dataset file (e.g., "new_dataset.csv") is in the correct format (CSV for new datasets).
   - Place the file in the appropriate directory (usually the `data/` folder).
   - If your dataset requires any preprocessing, consider doing it beforehand to simplify the loading process, as sketched below.
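For example, a minimal one-off preprocessing script might look like the following sketch (the file and column names here are illustrative, not part of this repo):

```python
import pandas as pd

# Hypothetical raw file and column names; adapt to your data.
df = pd.read_csv("data/new_dataset_raw.csv")

# Drop columns the model should not see and normalize the label values.
df = df.drop(columns=["unnecessary_column"])
df["class"] = df["class"].map({"normal": "0", "anomaly": "1"})

# Write the cleaned file where the loader expects it.
df.to_csv("data/new_dataset.csv", index=False)
```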
To train the model with your new dataset, use the following command:

```bash
python main.py --config=configs/your_config.py --mode=train --workdir=/path/to/your/workdir
```

Make sure to update `your_config.py` with `config.data = get_data_config("new_dataset")` and include the necessary parameters for your new dataset.
- Ensure that the data types in your CSV file match the loader's expectations: categorical data should be strings, and numerical data should be integers or floats.
- If your dataset has a different file format (e.g., ARFF), you may need to modify the `load_dataset` function in `dataloader.py` to handle it correctly, or convert the file once, as sketched below.
- Always test your changes with a small subset of your data before running a full training session.
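If your data arrives as ARFF, one option is to convert it to CSV once rather than extending `load_dataset`. A sketch, assuming scipy is available (file names are illustrative):

```python
import pandas as pd
from scipy.io import arff

# loadarff returns a numpy structured array plus metadata.
data, meta = arff.loadarff("data/new_dataset.arff")
df = pd.DataFrame(data)

# Nominal attributes come back as bytes; decode them to strings so
# categorical columns match the loader's expectations.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.decode("utf-8")

df.to_csv("data/new_dataset.csv", index=False)
```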
This section explains the purpose and structure of the model training configuration files, and how to create a new configuration for your experiments.
Configuration files in this project serve several important purposes:
- They centralize all hyperparameters and settings for an experiment.
- They allow for easy reproducibility of experiments.
- They facilitate hyperparameter tuning and ablation studies.
- They provide a clear overview of the experimental setup.
The configuration system is built using `ml_collections.ConfigDict`, which allows for nested configurations and easy access to parameters.
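For instance, nested sections and attribute-style access work as follows:

```python
import ml_collections

config = ml_collections.ConfigDict()
config.training = ml_collections.ConfigDict()
config.training.batch_size = 128

# Nested parameters are read and written as attributes.
print(config.training.batch_size)  # 128
config.training.batch_size = 256
```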
There are two main types of configuration files:

- Base configuration (`base_config.py`): contains default settings for all experiments.
- Dataset-specific configurations (e.g., `cars_config.py`): inherit from the base configuration and specify settings for a particular dataset or experiment.
The `training` section includes parameters such as the following (all frequencies are measured in training steps):

- `batch_size`: Size of training batches
- `n_steps`: Total number of training steps
- `log_freq`: Frequency of logging metrics (to TensorBoard or wandb)
- `eval_freq`: Frequency of evaluation
- `checkpoint_freq`: Frequency of saving checkpoints
- `snapshot_freq`: Frequency of saving snapshots
- `resume`: Whether to resume training from a checkpoint
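For orientation, a `training` section might look like the sketch below; the values are illustrative, and the actual defaults live in `base_config.py`:

```python
config.training = ml_collections.ConfigDict()
config.training.batch_size = 128
config.training.n_steps = 500_000
config.training.log_freq = 100           # every 100 training steps
config.training.eval_freq = 5_000
config.training.checkpoint_freq = 10_000
config.training.snapshot_freq = 10_000
config.training.resume = False
```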
The `data` section is typically loaded from a separate data configuration file (e.g., `dataconfigs.py`) and includes dataset-specific information; see the previous section for more information.
The `model` section specifies the architecture and hyperparameters of the model, including:

- `name`: Model architecture (e.g., "tab-transformer", "tab-resnet")
- `ndims`: Number of dimensions in the model
- `layers`: Number of layers
- `dropout`: Dropout rate
- `attention_heads` and `attention_dim_head`: For transformer-based models
- Other architecture-specific parameters
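A sketch of a `model` section (values are illustrative):

```python
config.model = ml_collections.ConfigDict()
config.model.name = "tab-transformer"
config.model.ndims = 256
config.model.layers = 6
config.model.dropout = 0.1
config.model.attention_heads = 8       # used by transformer-based models
config.model.attention_dim_head = 32
```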
The `optim` section includes optimization-related settings such as:

- `optimizer`: Optimizer type (e.g., "AdamW")
- `lr`: Learning rate
- `weight_decay`: L2 regularization strength
- `beta1` and `beta2`: Adam optimizer parameters
- `grad_clip`: Gradient clipping value
- `scheduler`: Learning rate scheduler type
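A sketch of an `optim` section (values and scheduler name are illustrative):

```python
config.optim = ml_collections.ConfigDict()
config.optim.optimizer = "AdamW"
config.optim.lr = 2e-4
config.optim.weight_decay = 1e-5
config.optim.beta1 = 0.9
config.optim.beta2 = 0.999
config.optim.grad_clip = 1.0
config.optim.scheduler = "cosine"
```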
The `eval` section specifies evaluation-related parameters, such as the batch size for evaluation.
The `sweep` section defines the configuration for hyperparameter tuning, including:
- Parameters to sweep over
- Sweep method (e.g., "bayes" for Bayesian optimization)
- Metric to optimize
- Early termination criteria
This project makes use of the Weights and Biases (`wandb`) library to perform and log hyperparameter sweeps.
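A sweep definition in wandb's standard schema looks roughly like the sketch below; the metric name, parameter paths, and ranges are illustrative, and the exact mapping onto this repo's config is defined by its `sweep` section:

```python
sweep = {
    "method": "bayes",                                # Bayesian optimization
    "metric": {"name": "eval_loss", "goal": "minimize"},
    "parameters": {
        "optim.lr": {"min": 1e-5, "max": 1e-3},
        "model.dropout": {"values": [0.0, 0.1, 0.3]},
    },
    # Stop unpromising runs early.
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}
```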
To create a new configuration file for your experiment:

1. Create a new Python file (e.g., `my_experiment_config.py`).
2. Import the necessary modules and the base configuration:

```python
import ml_collections
from configs.base_config import get_config as get_base_config
from configs.dataconfigs import get_config as get_data_config
```

3. Define a `get_config()` function that returns your custom configuration:

```python
def get_config():
    config = get_base_config()

    # Modify or add configuration parameters
    config.training.batch_size = 256
    config.training.n_steps = 1000000

    # Set the data configuration. The dataconfigs.py file should have
    # already been updated by following the previous section.
    config.data = get_data_config("my_dataset")

    # Modify model configuration
    config.model.name = "my_custom_model"
    config.model.ndims = 512
    config.model.layers = 8

    # Modify optimization configuration
    config.optim.lr = 1e-4
    config.optim.weight_decay = 1e-5

    return config
```

4. Customize the configuration as needed for your experiment, overriding default values from the base configuration.
The configuration is used in the `runner.py` file to set up the training process. The main steps are:

- The configuration is loaded using the command-line argument `--config`.
- The appropriate model is instantiated based on `config.model.name`.
- The dataset is loaded using parameters from `config.data`.
- The optimizer and learning rate scheduler are set up using `config.optim`.
- The training loop uses various parameters from the configuration, such as `config.training.n_steps` and `config.training.eval_freq`.
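In outline, the flow resembles the following pseudocode; the helper names (`build_model`, `build_optimizer`, and so on) are hypothetical, not the actual `runner.py` API:

```python
def train(config, workdir):
    model = build_model(config.model)                    # e.g., tab-transformer
    train_ds, eval_ds = load_dataset(config.data)
    optimizer, scheduler = build_optimizer(config.optim, model)

    for step in range(config.training.n_steps):
        loss = train_step(model, optimizer, next(train_ds))
        if step % config.training.log_freq == 0:
            log_metrics(step, loss)                      # tensorboard / wandb
        if step % config.training.eval_freq == 0:
            evaluate(model, eval_ds, config.eval)
        if step % config.training.checkpoint_freq == 0:
            save_checkpoint(workdir, model, optimizer, step)
```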
To run training with your new configuration:

```bash
python main.py --config=configs/my_experiment_config.py --mode=train --workdir=/path/to/your/workdir
```
- Start with a copy of an existing configuration file and modify it for your needs.
- Use the `sweep` section to define hyperparameter searches for your experiment.
- Keep track of different configurations by using clear, descriptive filenames.
- Comment your configuration files, especially when using non-standard settings.
- When running experiments, save the configuration along with the results for future reference.
- Use the `devtest` flag in the configuration for quick testing of your setup before running full experiments (see the sketch below).
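For example, a config might gate a shortened smoke-test run on that flag; the exact behavior of `devtest` is defined by the codebase, so treat this as a sketch:

```python
config.devtest = True
if config.devtest:
    config.training.n_steps = 1_000   # shorten the run for a quick sanity check
    config.training.eval_freq = 100
```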
Remember that changes to the configuration structure might require corresponding updates in the `runner.py` file to ensure all new parameters are properly utilized during training.
```bibtex
@article{10.3389/frai.2024.1441205,
  author  = {Mahmood, Ahsan and Oliva, Junier and Styner, Martin A.},
  title   = {Anomaly Detection via Gumbel Noise Score Matching},
  journal = {Frontiers in Artificial Intelligence},
  volume  = {7},
  year    = {2024},
  url     = {https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2024.1441205},
  doi     = {10.3389/frai.2024.1441205},
  issn    = {2624-8212}
}
```