@InProceedings{Saini21,
author = {Saini, D. and Jain, A.K. and Dave, K. and Jiao, J. and Singh, A. and Zhang, R. and Varma, M.},
title = {GalaXC: Graph Neural Networks with Labelwise Attention for Extreme Classification},
booktitle = {Proceedings of The Web Conference},
month = "April",
year = "2021",
}
git clone https://github.com/Extreme-classification/GalaXC.git
conda env create -f GalaXC/environment.yml
conda activate galaxc
pip install hnswlib
git clone https://github.com/kunaldahiya/pyxclib.git
cd pyxclib
python setup.py install
cd ../GalaXC
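Once the steps above have finished, a quick check can confirm that the two extra packages are visible to the active environment. This is a minimal sketch, not part of GalaXC: `missing_modules` is a hypothetical helper, and it assumes pyxclib is importable under the module name `xclib`.

```python
import importlib.util


def missing_modules(names=("hnswlib", "xclib")):
    """Return the module names that cannot be found in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("hnswlib and xclib are importable")
```

Run it inside the `galaxc` conda environment; an empty result means both installs succeeded.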
Your dataset should have the following structure:
DatasetName (e.g. LF-AmazonTitles-131K)
│   trn_X.txt (text for trn documents, one text per line)
│   tst_X.txt (text for tst documents, one text per line)
│   Y.txt (text for labels, one text per line)
│   trn_X_Y.txt (trn labels in spmat format)
│   tst_X_Y.txt (tst labels in spmat format)
│   filter_labels_test.txt (filter labels where the label and the test document are the same)
│
└───XXCondensedData (embeddings for tst, trn documents and labels; for benchmark datasets, XX=DX [Astec])
    │   trn_point_embs.npy (2D NumPy matrix of trn document embeddings)
    │   tst_point_embs.npy (2D NumPy matrix of tst document embeddings)
    │   label_embs.npy (2D NumPy matrix of label embeddings)
We provide the DX embeddings (from Module 1 of Astec) for the public benchmark datasets for ease of use. Got better (higher-recall) embeddings from somewhere? Just plug in the new ones and GalaXC will perform better, with no need to change any code! These files for LF-AmazonTitles-131K, LF-WikiSeeAlsoTitles-320K and LF-AmazonTitles-1.3M can be found here. Apart from the files in XXCondensedData, all other files are copies of the datasets from The Extreme Classification Repository.
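The layout described above can be checked before starting a long training run. The following is a minimal sketch, not part of GalaXC: `missing_files`, `TOP_LEVEL` and `EMBEDDINGS` are hypothetical names, the embedding folder is assumed to be `<embedding type>CondensedData` (so `DXCondensedData` for the provided embeddings), and the test text file is assumed to be named `tst_X.txt` in line with `trn_X.txt`.

```python
from pathlib import Path

# File names taken from the dataset structure listed above.
# Assumption: the test text file is tst_X.txt, mirroring trn_X.txt.
TOP_LEVEL = [
    "trn_X.txt",
    "tst_X.txt",
    "Y.txt",
    "trn_X_Y.txt",
    "tst_X_Y.txt",
    "filter_labels_test.txt",
]
EMBEDDINGS = ["trn_point_embs.npy", "tst_point_embs.npy", "label_embs.npy"]


def missing_files(dataset_dir, embedding_type="DX"):
    """Return the expected files (relative to dataset_dir) that are absent."""
    root = Path(dataset_dir)
    emb_dir = root / f"{embedding_type}CondensedData"
    expected = [root / name for name in TOP_LEVEL]
    expected += [emb_dir / name for name in EMBEDDINGS]
    return [str(path.relative_to(root)) for path in expected if not path.is_file()]
```

For example, `missing_files("/your/path/to/data/LF-AmazonTitles-131K")` returns an empty list when every file is in place. The check only looks at file names; it does not open the files or validate their contents.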
To reproduce the numbers reported in the paper on the public benchmark datasets, use the following sample runs:
LF-AmazonTitles-131K
python -u -W ignore train_main.py --dataset /your/path/to/data/LF-AmazonTitles-131K --save-model 0 --devices cuda:0 --num-epochs 30 --num-HN-epochs 0 --batch-size 256 --lr 0.001 --attention-lr 0.001 --adjust-lr 5,10,15,20,25,28 --dlr-factor 0.5 --mpt 0 --restrict-edges-num -1 --restrict-edges-head-threshold 20 --num-random-samples 30000 --random-shuffle-nbrs 0 --fanouts 4,3,2 --num-HN-shortlist 500 --embedding-type DX --run-type NR --num-validation 25000 --validation-freq -1 --num-shortlist 500 --predict-ova 0 --A 0.6 --B 2.6
LF-WikiSeeAlsoTitles-320K
python -u -W ignore train_main.py --dataset /your/path/to/data/LF-WikiSeeAlsoTitles-320K --save-model 0 --devices cuda:0 --num-epochs 30 --num-HN-epochs 0 --batch-size 256 --lr 0.001 --attention-lr 0.05 --adjust-lr 5,10,15,20,25,28 --dlr-factor 0.5 --mpt 0 --restrict-edges-num -1 --restrict-edges-head-threshold 20 --num-random-samples 32000 --random-shuffle-nbrs 0 --fanouts 4,3,2 --num-HN-shortlist 500 --repo 1 --embedding-type DX --run-type NR --num-validation 25000 --validation-freq -1 --num-shortlist 500 --predict-ova 0 --A 0.55 --B 1.5
LF-AmazonTitles-1.3M
python -u -W ignore train_main.py --dataset /your/path/to/data/LF-AmazonTitles-1.3M --save-model 0 --devices cuda:0 --num-epochs 24 --num-HN-epochs 15 --batch-size 512 --lr 0.001 --attention-lr 0.05 --adjust-lr 4,8,12,16,18,20,22 --dlr-factor 0.5 --mpt 0 --restrict-edges-num 5 --restrict-edges-head-threshold 20 --num-random-samples 100000 --random-shuffle-nbrs 1 --fanouts 3,3,3 --num-HN-shortlist 500 --embedding-type DX --run-type NR --num-validation 25000 --validation-freq -1 --num-shortlist 500 --predict-ova 0 --A 0.6 --B 2.6
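The three commands above share most of their flags and differ only in a handful of values. The sketch below makes that difference explicit by rebuilding each command from a shared set plus per-dataset overrides. It is an illustration, not part of GalaXC: `COMMON`, `PER_DATASET` and `build_command` are hypothetical names, all flag names and values are copied from the sample runs, and it assumes `train_main.py` accepts its flags in any order.

```python
import shlex

# Flags with identical values in all three sample runs.
COMMON = {
    "save-model": "0", "devices": "cuda:0", "lr": "0.001", "dlr-factor": "0.5",
    "mpt": "0", "restrict-edges-head-threshold": "20", "num-HN-shortlist": "500",
    "embedding-type": "DX", "run-type": "NR", "num-validation": "25000",
    "validation-freq": "-1", "num-shortlist": "500", "predict-ova": "0",
}

# Flags whose values differ between the sample runs.
PER_DATASET = {
    "LF-AmazonTitles-131K": {
        "num-epochs": "30", "num-HN-epochs": "0", "batch-size": "256",
        "attention-lr": "0.001", "adjust-lr": "5,10,15,20,25,28",
        "restrict-edges-num": "-1", "num-random-samples": "30000",
        "random-shuffle-nbrs": "0", "fanouts": "4,3,2", "A": "0.6", "B": "2.6",
    },
    "LF-WikiSeeAlsoTitles-320K": {
        "num-epochs": "30", "num-HN-epochs": "0", "batch-size": "256",
        "attention-lr": "0.05", "adjust-lr": "5,10,15,20,25,28",
        "restrict-edges-num": "-1", "num-random-samples": "32000",
        "random-shuffle-nbrs": "0", "fanouts": "4,3,2", "repo": "1",
        "A": "0.55", "B": "1.5",
    },
    "LF-AmazonTitles-1.3M": {
        "num-epochs": "24", "num-HN-epochs": "15", "batch-size": "512",
        "attention-lr": "0.05", "adjust-lr": "4,8,12,16,18,20,22",
        "restrict-edges-num": "5", "num-random-samples": "100000",
        "random-shuffle-nbrs": "1", "fanouts": "3,3,3", "A": "0.6", "B": "2.6",
    },
}


def build_command(data_root, dataset):
    """Return the argv list for one of the sample runs."""
    flags = {"dataset": f"{data_root}/{dataset}", **COMMON, **PER_DATASET[dataset]}
    argv = ["python", "-u", "-W", "ignore", "train_main.py"]
    for name, value in flags.items():
        argv += [f"--{name}", value]
    return argv


if __name__ == "__main__":
    print(shlex.join(build_command("/your/path/to/data", "LF-AmazonTitles-131K")))
```

The returned list can be passed to `subprocess.run` directly, or printed with `shlex.join` and pasted into a shell. Note that only the LF-WikiSeeAlsoTitles-320K run carries `--repo 1`, as in the sample command above.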