B-Clean is a library built to support the Bring Your Own Data (BYOD) project. It detects outliers in tabular datasets and suggests possible transformations to clean the data.
- Statistical outlier detection
- Few-shot outlier detection
  - Baseline model (HoloDetect)
  - Data-driven model (LSTM)
    - Improve performance
    - Decrease number of examples
- Active learning outlier detection
  - Automatic suggestion based on statistical model
  - Policy-based active learning model
- Data transformation
We define and detect three different types of outliers as follows:
- Global outliers: values that rarely appear in the real-world data.
- Local outliers: values that are different from other values in the same attribute.
- Null outliers: values that have no meaning.
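
The three definitions above can be illustrated with a minimal sketch. This is not B-Clean's implementation: the function names, the `NULL_TOKENS` list, the pattern-frequency heuristic for local outliers, the `min_share` and `min_count` thresholds, and the `reference` counts standing in for "real-world data" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical placeholder strings treated as "no meaning"; not B-Clean's list.
NULL_TOKENS = {"", "n/a", "na", "null", "none", "nan", "-", "?"}


def null_outliers(column):
    """Flag missing values and placeholder strings that carry no meaning."""
    return [v is None or str(v).strip().lower() in NULL_TOKENS for v in column]


def to_pattern(value):
    """Collapse letter runs to 'A' and digit runs to '9'; keep other characters."""
    return re.sub(r"\d+", "9", re.sub(r"[A-Za-z]+", "A", str(value)))


def local_outliers(column, min_share=0.2):
    """Flag values whose character pattern is rare within the same column."""
    patterns = [to_pattern(v) for v in column]
    counts = Counter(patterns)
    return [counts[p] / len(patterns) < min_share for p in patterns]


def global_outliers(column, reference_counts, min_count=1):
    """Flag values seen fewer than min_count times in an external reference."""
    return [
        reference_counts.get(str(v).strip().lower(), 0) < min_count
        for v in column
    ]


city = ["Berlin", "Paris", "Brelin", "N/A", "Rome", "12345"]
reference = {"berlin": 120, "paris": 95, "rome": 80}  # made-up counts

for value, is_null, is_local, is_global in zip(
    city, null_outliers(city), local_outliers(city), global_outliers(city, reference)
):
    print(f"{value!r:10} null={is_null} local={is_local} global={is_global}")
```

In this toy column, `"Brelin"` looks like every other city name, so only the reference lookup flags it, while `"12345"` stands out from the rest of the column by shape alone. The categories can overlap: `"N/A"` is flagged by all three checks.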
- Create and activate the conda environment:
conda env create -f environment.yml
conda activate byod
- To evaluate on the demo dataset, run:
PYTHONPATH=.:$PYTHONPATH python kbclean/experiments/error_detection.py evaluate --data_path demo/data --method lstm -i -k 2 -e 5