This is the official repository for *Anomaly Detection via Gumbel Noise Score Matching*. If you use this repository for research purposes, please cite our work using the citation at the end of this document.
This section explains how to train the model on a new dataset. You'll need to modify two main files: `dataconfigs.py` and `dataloader.py`. Follow these steps to add your new dataset:
1. Add a new `ConfigDict` for your dataset in `dataconfigs.py`, using the existing configs as a template, and fill in the following fields:
   - `dataset`: The name of your dataset (string)
   - `categories`: A list of integers giving the number of classes for each categorical feature
   - `numerical_features`: The number of numerical features in your dataset (integer)
   - `label_column`: The name of the column containing the class labels (string)
   - `anomaly_label`: The label used for anomalies in your dataset (string)
Example:

```python
new_dataset = ml_collections.ConfigDict()
new_dataset.dataset = "new_dataset"
new_dataset.categories = [3, 4, 2, 5]  # Example: 4 categorical features with 3, 4, 2, and 5 classes respectively
new_dataset.numerical_features = 2
new_dataset.label_column = "class"
new_dataset.anomaly_label = "anomaly"

# Add your new dataset to the main config
config.new_dataset = new_dataset
```
2. Add your dataset to the `tabular_datasets` dictionary in `dataloader.py`:

```python
tabular_datasets = {
    # ... existing datasets ...
    "new_dataset": "new_dataset.csv",
}
```
3. (Optional) If your dataset requires special handling, modify the `load_dataset` function. Look for the section that handles specific datasets (e.g., "cars", "mushrooms", "nursery") and add your logic there.
Example:

```python
if name == "new_dataset":
    df = pd.read_csv(f"data/{tabular_datasets[name]}")
    # Add any necessary preprocessing steps here. For example:
    # df = df.drop(columns=['unnecessary_column'])
    # df[label_name] = df[label_name].map({'normal': '0', 'anomaly': '1'})
```
4. Prepare your data file:
   - Ensure your dataset file (e.g., "new_dataset.csv") is in the correct format (CSV for new datasets).
   - Place the file in the appropriate directory (usually the `data/` folder).
   - If your dataset requires any preprocessing, consider doing it beforehand to simplify the loading process, as sketched below.
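For example, a minimal one-off preprocessing script might look like the following sketch (the file and column names here are illustrative, not part of this repo):

```python
import pandas as pd

# Hypothetical raw file and column names; adapt to your data.
df = pd.read_csv("data/new_dataset_raw.csv")

# Drop columns the model should not see and normalize the label values.
df = df.drop(columns=["unnecessary_column"])
df["class"] = df["class"].map({"normal": "0", "anomaly": "1"})

# Write the cleaned file where the loader expects it.
df.to_csv("data/new_dataset.csv", index=False)
```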
To train the model with your new dataset, use the following command:

```bash
python main.py --config=configs/your_config.py --mode=train --workdir=/path/to/your/workdir
```

Make sure to update `your_config.py` with `config.data = get_data_config("new_dataset")` and include the necessary parameters for your new dataset.
- Ensure that the data types in your CSV file match the loader's expectations: categorical data should be strings, and numerical data should be integers or floats.
- If your dataset has a different file format (e.g., ARFF), you may need to modify the `load_dataset` function in `dataloader.py` to handle it correctly, or convert the file once, as sketched below.
- Always test your changes with a small subset of your data before running a full training session.
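If your data arrives as ARFF, one option is to convert it to CSV once rather than extending `load_dataset`. A sketch, assuming scipy is available (file names are illustrative):

```python
import pandas as pd
from scipy.io import arff

# loadarff returns a numpy structured array plus metadata.
data, meta = arff.loadarff("data/new_dataset.arff")
df = pd.DataFrame(data)

# Nominal attributes come back as bytes; decode them to strings so
# categorical columns match the loader's expectations.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.decode("utf-8")

df.to_csv("data/new_dataset.csv", index=False)
```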
This section explains the purpose and structure of the model training configuration files, and how to create a new configuration for your experiments.
Configuration files in this project serve several important purposes:
- They centralize all hyperparameters and settings for an experiment.
- They allow for easy reproducibility of experiments.
- They facilitate hyperparameter tuning and ablation studies.
- They provide a clear overview of the experimental setup.
The configuration system is built using `ml_collections.ConfigDict`, which allows for nested configurations and easy access to parameters.
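For instance, nested sections and attribute-style access work as follows:

```python
import ml_collections

config = ml_collections.ConfigDict()
config.training = ml_collections.ConfigDict()
config.training.batch_size = 128

# Nested parameters are read and written as attributes.
print(config.training.batch_size)  # 128
config.training.batch_size = 256
```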
There are two main types of configuration files:

- Base configuration (`base_config.py`): contains default settings for all experiments.
- Dataset-specific configurations (e.g., `cars_config.py`): inherit from the base configuration and specify settings for a particular dataset or experiment.
The `training` section includes parameters such as the following (all frequencies are measured in training steps):

- `batch_size`: Size of training batches
- `n_steps`: Total number of training steps
- `log_freq`: Frequency of logging metrics (to TensorBoard or wandb)
- `eval_freq`: Frequency of evaluation
- `checkpoint_freq`: Frequency of saving checkpoints
- `snapshot_freq`: Frequency of saving snapshots
- `resume`: Whether to resume training from a checkpoint
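For orientation, a `training` section might look like the sketch below; the values are illustrative, and the actual defaults live in `base_config.py`:

```python
config.training = ml_collections.ConfigDict()
config.training.batch_size = 128
config.training.n_steps = 500_000
config.training.log_freq = 100           # every 100 training steps
config.training.eval_freq = 5_000
config.training.checkpoint_freq = 10_000
config.training.snapshot_freq = 10_000
config.training.resume = False
```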
The `data` section is typically loaded from a separate data configuration file (e.g., `dataconfigs.py`) and includes dataset-specific information; see the previous section for more information.
The `model` section specifies the architecture and hyperparameters of the model, including:

- `name`: Model architecture (e.g., "tab-transformer", "tab-resnet")
- `ndims`: Number of dimensions in the model
- `layers`: Number of layers
- `dropout`: Dropout rate
- `attention_heads` and `attention_dim_head`: For transformer-based models
- Other architecture-specific parameters
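A sketch of a `model` section (values are illustrative):

```python
config.model = ml_collections.ConfigDict()
config.model.name = "tab-transformer"
config.model.ndims = 256
config.model.layers = 6
config.model.dropout = 0.1
config.model.attention_heads = 8       # used by transformer-based models
config.model.attention_dim_head = 32
```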
The `optim` section includes optimization-related settings such as:

- `optimizer`: Optimizer type (e.g., "AdamW")
- `lr`: Learning rate
- `weight_decay`: L2 regularization strength
- `beta1` and `beta2`: Adam optimizer parameters
- `grad_clip`: Gradient clipping value
- `scheduler`: Learning rate scheduler type
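A sketch of an `optim` section (values and scheduler name are illustrative):

```python
config.optim = ml_collections.ConfigDict()
config.optim.optimizer = "AdamW"
config.optim.lr = 2e-4
config.optim.weight_decay = 1e-5
config.optim.beta1 = 0.9
config.optim.beta2 = 0.999
config.optim.grad_clip = 1.0
config.optim.scheduler = "cosine"
```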
The `eval` section specifies evaluation-related parameters, such as the batch size for evaluation.
The `sweep` section defines the configuration for hyperparameter tuning, including:
- Parameters to sweep over
- Sweep method (e.g., "bayes" for Bayesian optimization)
- Metric to optimize
- Early termination criteria
This project makes use of the Weights and Biases (`wandb`) library to perform and log hyperparameter sweeps.
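A sweep definition in wandb's standard schema looks roughly like the sketch below; the metric name, parameter paths, and ranges are illustrative, and the exact mapping onto this repo's config is defined by its `sweep` section:

```python
sweep = {
    "method": "bayes",                                # Bayesian optimization
    "metric": {"name": "eval_loss", "goal": "minimize"},
    "parameters": {
        "optim.lr": {"min": 1e-5, "max": 1e-3},
        "model.dropout": {"values": [0.0, 0.1, 0.3]},
    },
    # Stop unpromising runs early.
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}
```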
To create a new configuration file for your experiment:

1. Create a new Python file (e.g., `my_experiment_config.py`).
2. Import the necessary modules and the base configuration:

```python
import ml_collections
from configs.base_config import get_config as get_base_config
from configs.dataconfigs import get_config as get_data_config
```

3. Define a `get_config()` function that returns your custom configuration:

```python
def get_config():
    config = get_base_config()

    # Modify or add configuration parameters
    config.training.batch_size = 256
    config.training.n_steps = 1000000

    # Set the data configuration. The dataconfigs.py file should have
    # already been updated by following the previous section.
    config.data = get_data_config("my_dataset")

    # Modify model configuration
    config.model.name = "my_custom_model"
    config.model.ndims = 512
    config.model.layers = 8

    # Modify optimization configuration
    config.optim.lr = 1e-4
    config.optim.weight_decay = 1e-5

    return config
```

4. Customize the configuration as needed for your experiment, overriding default values from the base configuration.
The configuration is used in the `runner.py` file to set up the training process. The main steps are:

- The configuration is loaded using the command-line argument `--config`.
- The appropriate model is instantiated based on `config.model.name`.
- The dataset is loaded using parameters from `config.data`.
- The optimizer and learning rate scheduler are set up using `config.optim`.
- The training loop uses various parameters from the configuration, such as `config.training.n_steps` and `config.training.eval_freq`.
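In outline, the flow resembles the following pseudocode; the helper names (`build_model`, `build_optimizer`, and so on) are hypothetical, not the actual `runner.py` API:

```python
def train(config, workdir):
    model = build_model(config.model)                    # e.g., tab-transformer
    train_ds, eval_ds = load_dataset(config.data)
    optimizer, scheduler = build_optimizer(config.optim, model)

    for step in range(config.training.n_steps):
        loss = train_step(model, optimizer, next(train_ds))
        if step % config.training.log_freq == 0:
            log_metrics(step, loss)                      # tensorboard / wandb
        if step % config.training.eval_freq == 0:
            evaluate(model, eval_ds, config.eval)
        if step % config.training.checkpoint_freq == 0:
            save_checkpoint(workdir, model, optimizer, step)
```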
To run training with your new configuration:

```bash
python main.py --config=configs/my_experiment_config.py --mode=train --workdir=/path/to/your/workdir
```
- Start with a copy of an existing configuration file and modify it for your needs.
- Use the `sweep` section to define hyperparameter searches for your experiment.
- Keep track of different configurations by using clear, descriptive filenames.
- Comment your configuration files, especially when using non-standard settings.
- When running experiments, save the configuration along with the results for future reference.
- Use the `devtest` flag in the configuration for quick testing of your setup before running full experiments (see the sketch below).
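For example, a config might gate a shortened smoke-test run on that flag; the exact behavior of `devtest` is defined by the codebase, so treat this as a sketch:

```python
config.devtest = True
if config.devtest:
    config.training.n_steps = 1_000   # shorten the run for a quick sanity check
    config.training.eval_freq = 100
```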
Remember that changes to the configuration structure might require corresponding updates in the `runner.py` file to ensure all new parameters are properly utilized during training.
```bibtex
@article{10.3389/frai.2024.1441205,
  author  = {Mahmood, Ahsan and Oliva, Junier and Styner, Martin A.},
  title   = {Anomaly Detection via Gumbel Noise Score Matching},
  journal = {Frontiers in Artificial Intelligence},
  volume  = {7},
  year    = {2024},
  url     = {https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2024.1441205},
  doi     = {10.3389/frai.2024.1441205},
  issn    = {2624-8212}
}
```