infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information
This repository provides the datasets, a demo, and the code for the following paper:
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information
Jaehyung Kim, Yekyung Kim, Karin de Langis, Jinwoo Shin, Dongyeop Kang
ACL 2023 (main track, long paper)
The following command installs all necessary packages:
pip install -r requirements.txt
The project was tested using Python 3.7.
To construct infoVerse, first train the vanilla classifiers. Then use the trained classifiers to extract the pre-defined meta-information (defined in ./src/scores_src). We release the constructed infoVerse on Google Drive. Please see run.sh.
- Train the classifiers used for gathering meta-information
python train.py --train_type 0000_base --save_ckpt --epochs 10 --dataset sst2 --seed 1234 --backbone roberta_large
- Construction of infoVerse
python construct_infoverse.py --train_type 0000_base --seed_list "1234 2345 3456" --epochs 10 --dataset sst2 --seed 1234 --backbone roberta_large
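The construction command above takes --seed_list "1234 2345 3456", while the training command shows a single seed. The following is a minimal sketch of how the two steps could be chained, assuming (this is not confirmed by the README) that one classifier must be trained per seed in the list before construction. The helper build_commands is hypothetical and not part of the repository; it only assembles the command lines already shown above and prints them.

```python
import shlex


def build_commands(seeds, dataset="sst2", backbone="roberta_large",
                   train_type="0000_base", epochs=10):
    """Assemble the train/construct command lines shown in this README.

    Hypothetical helper (not part of the repository). It assumes that one
    classifier is trained per seed before infoVerse is constructed.
    """
    # Flags shared by train.py and construct_infoverse.py in the README.
    common = ["--train_type", train_type, "--epochs", str(epochs),
              "--dataset", dataset, "--backbone", backbone]

    # One training run per seed, saving checkpoints as in the README command.
    train_cmds = [
        ["python", "train.py", *common, "--save_ckpt", "--seed", str(seed)]
        for seed in seeds
    ]

    # A single construction run over all seeds; --seed keeps the first seed,
    # mirroring the README command.
    construct_cmd = ["python", "construct_infoverse.py", *common,
                     "--seed_list", " ".join(str(s) for s in seeds),
                     "--seed", str(seeds[0])]
    return train_cmds + [construct_cmd]


commands = build_commands([1234, 2345, 3456])
for cmd in commands:
    # Quote each argument so the space-separated seed list stays one argument.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To execute the commands rather than print them, each list could be passed to subprocess.run(cmd, check=True) from the repository root. Keeping the arguments as lists avoids shell-quoting problems with the space-separated seed list.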
In addition, one can visualize the constructed infoVerse and use it to analyze the given dataset with visualize.ipynb. For example, we provide code to generate an interactive HTML file, as shown in the figure below. Pre-constructed t-SNE and HTML files can be downloaded from Google Drive.
Please see the directory ./data_pruning.
Please see the directory ./active_learning.
Please see the directory ./data_annotation.
If you find this work useful for your research, please cite our paper:
@article{kim2023infoverse,
title={infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information},
author={Kim, Jaehyung and Kim, Yekyung and de Langis, Karin and Shin, Jinwoo and Kang, Dongyeop},
journal={The 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2023}
}