WAC: Wasserstein distance-based news Article Clustering

This project contains the implementation of the Wasserstein distance-based news Article Clustering (WAC) algorithm, an unsupervised two-step online clustering algorithm that uses the Wasserstein distance (and distances similar to it). The two steps are (1) monolingual clustering of news articles into events and (2) merging of those events into multilingual clusters.
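To illustrate what "online" means here, the sketch below assigns each incoming article embedding to the most similar existing cluster, or opens a new cluster when no similarity reaches a threshold. It is a simplified illustration, not the repository's implementation: it uses cosine similarity to a running centroid in place of the Wasserstein distance, and the function name `online_cluster` is hypothetical. The `rank_th` argument only mirrors the name of the threshold used in the experiment scripts.

```python
import numpy as np


def online_cluster(embeddings, rank_th=0.5):
    """Greedy single-pass clustering (illustrative, cosine-based stand-in)."""
    centroids = []  # running mean embedding of each cluster
    counts = []     # number of members in each cluster
    labels = []     # cluster index assigned to each input, in order

    for vec in embeddings:
        vec = np.asarray(vec, dtype=float)
        vec = vec / np.linalg.norm(vec)

        # Find the most similar existing cluster.
        best, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = float(vec @ centroid / np.linalg.norm(centroid))
            if sim > best_sim:
                best, best_sim = idx, sim

        if best >= 0 and best_sim >= rank_th:
            # Join the cluster and update its running mean.
            counts[best] += 1
            centroids[best] += (vec - centroids[best]) / counts[best]
            labels.append(best)
        else:
            # No cluster is similar enough: open a new one.
            centroids.append(vec.copy())
            counts.append(1)
            labels.append(len(centroids) - 1)

    return labels


labels = online_cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], rank_th=0.5)
```

Because each article is processed once and in arrival order, the result depends on the order of the stream, which is a general property of this kind of single-pass clustering.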

The articles and events are represented using an SBERT language model, which is fine-tuned for clustering tasks.
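As a rough illustration of comparing two sets of embeddings with a Wasserstein-style distance, the sketch below computes an entropy-regularized (Sinkhorn) approximation with NumPy. Several things here are assumptions rather than facts about this repository: the function name `sinkhorn_distance` is hypothetical, the sets are given uniform weights, the ground cost is Euclidean, and the `reg` and `n_iter` arguments are only guessed to play the role of the `--w_reg` and `--w_nit` options passed to the event clustering script.

```python
import numpy as np


def sinkhorn_distance(x, y, reg=0.1, n_iter=10):
    """Entropy-regularized optimal transport cost between two point sets."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Pairwise Euclidean ground cost between the two sets of embeddings.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

    # Uniform mass on every point of each set.
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))

    # Sinkhorn iterations: alternately rescale rows and columns of the kernel.
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)

    # Transport plan implied by the scalings, and its total cost.
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost))


points = [[0.0, 0.0], [1.0, 0.0]]
same = sinkhorn_distance(points, points)
apart = sinkhorn_distance([[0.0, 0.0]], [[3.0, 4.0]])
```

A smaller `reg` brings the result closer to the exact Wasserstein distance, but makes `exp(-cost / reg)` prone to underflow; production code typically works in the log domain or relies on a dedicated optimal transport library.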

The remainder of this README contains the instructions for running the experiments.

📚 Papers

In case you use any of the components for your research, please refer to (and cite) the papers:

TODO

☑️ Requirements

Before starting the project, make sure these requirements are available:

python (version 3.8 or higher). For setting up your research environment and python dependencies.
git. For versioning your code.

🛠️ Setup

Create a python environment

First create the virtual environment where all the modules will be stored.

Using venv

Using the venv command, run the following commands:

# create a new virtual environment
python -m venv venv

# activate the environment (UNIX)
source ./venv/bin/activate

# activate the environment (WINDOWS)
./venv/Scripts/activate

# deactivate the environment (UNIX & WINDOWS)
deactivate

Install

To install the requirements run:

pip install -e .

🗃️ Data

The data used in the experiments is a curated set of news articles retrieved from the Event Registry and prepared for the scientific paper¹.

To download the data run:

bash scripts/00_download_data.sh

This will download the data files and store them in the data/raw folder.

⚗️ Experiments

To run the experiments, run the following command:

# run the experiments
bash scripts/run_exp_pipeline.sh

The command above will perform a series of experiments by executing the following steps (the names of the files are listed in the scripts/run_exp_pipeline.sh file):

# prepare the data examples for the experiment
python scripts/01_prepare_data.py \
    --input_file ./data/raw/dataset.test.json \
    --output_file ./data/processed/dataset.test.csv

# cluster articles into events
python scripts/02_article_clustering.py \
    --input_file ./data/processed/dataset.test.csv \
    --output_file ./data/processed/article_clusters/dataset.test.csv \
    --rank_th 0.5 \
    --time_std 3 \
    --multilingual \
    --ents_th 0.0 \
    -gpu

# cluster events based on their similarity
python scripts/03_event_clustering.py \
    --input_file ./data/processed/article_clusters/dataset.test.csv \
    --output_file ./data/processed/event_clusters/dataset.test.csv \
    --rank_th 0.7 \
    --time_std 3 \
    --w_reg 0.1 \
    --w_nit 10 \
    -gpu

# evaluate the clusters
python scripts/04_evaluate.py \
    --label_file_path ./data/processed/dataset.test.csv \
    --pred_file_dir ./data/processed/event_clusters \
    --output_file ./results/dataset.test.csv

The results will be stored in the results folder.

Results

The hyper-parameters were selected by evaluating the performance of the clustering algorithm on the dev set. We performed a grid search across the following hyper-parameters:

Clustering	Parameter	Grid Search	Description
article	rank_th	[0.4, 0.5, 0.6, 0.7]	Threshold for deciding if an article should be added to the cluster.
article	ents_th	[0.2, 0.3, 0.4, 0.5]	Threshold for deciding if an article should be added to the cluster (considering the entities).
article	time_std	[1, 2, 3, 5]	The standard deviation for the temporal similarity between the article and the event.
article	multilingual	[True, False]	Whether to use multilingual (True) or monolingual (False) clustering.
event	rank_th	[0.6, 0.7, 0.8, 0.9]	Threshold for deciding if events should be merged.
event	time_std	[1, 2, 3]	The standard deviation for the temporal similarity between events.
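To give a sense of the size of this search space, the sketch below expands the grids from the table with `itertools.product`. It only counts configurations. Whether the experiments evaluated the full cross product of article and event settings is not stated here, so the combined total is an upper bound under that assumption. The helper name `expand` is hypothetical.

```python
from itertools import product

# Grids copied from the table above.
article_grid = {
    "rank_th": [0.4, 0.5, 0.6, 0.7],
    "ents_th": [0.2, 0.3, 0.4, 0.5],
    "time_std": [1, 2, 3, 5],
    "multilingual": [True, False],
}
event_grid = {
    "rank_th": [0.6, 0.7, 0.8, 0.9],
    "time_std": [1, 2, 3],
}


def expand(grid):
    """Return one dict per combination of the grid's values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


article_configs = expand(article_grid)
event_configs = expand(event_grid)

# prints: 128 12 1536
print(len(article_configs), len(event_configs),
      len(article_configs) * len(event_configs))
```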

Performance results

The best performance is obtained with the following parameters:

Variant name	Article rank_th	Article ents_th	Article time_std	Merging rank_th	Merging time_std	Standard F1	Standard P	Standard R	BCubed F1	BCubed P	BCubed R	Clusters
WAC_MONO	0.5	-	3	0.7	3	87.00	98.45	77.95	85.42	93.04	78.95	1066
WAC_MONO	0.6	-	3	0.7	3	69.50	98.71	53.63	81.08	94.14	71.20	1108
WAC_MONO+NER	0.5	0.2	3	0.7	3	85.02	98.52	74.77	84.78	93.51	77.54	1089
WAC_MONO+NER	0.6	0.2	3	0.7	3	67.23	98.12	51.14	79.72	93.80	69.32	1109
WAC_MULTI	0.5	-	3	0.7	3	92.20	98.55	86.62	86.67	92.94	81.20	1074
WAC_MULTI	0.6	-	3	0.7	3	74.43	98.81	59.70	81.98	94.00	72.68	1112

Cluster merging assessment analysis

To evaluate the impact of the cluster merging process on the algorithm's performance, we compare the WAC algorithm variants to those where the cluster merging phase was not performed (marked /MERGE in the table). Note that we compare only the WAC_MULTI variant, as it already generates multilingual clusters during the article clustering phase.

Variant name	Article rank_th	Article ents_th	Article time_std	Merging rank_th	Merging time_std	Standard F1	Standard P	Standard R	BCubed F1	BCubed P	BCubed R	Clusters
WAC_MULTI	0.5	-	3	0.7	3	92.20	98.55	86.62	86.67	92.94	81.20	1074
WAC_MULTI/MERGE	0.5	-	3	-	-	56.04	98.71	39.12	71.14	96.98	56.17	2339
WAC_MULTI	0.6	-	3	0.7	3	74.43	98.81	59.70	81.98	94.00	72.68	1112
WAC_MULTI/MERGE	0.6	-	3	-	-	24.28	99.40	13.83	47.10	99.04	31.59	4675

📣 Acknowledgments

This work was developed by the Department of Artificial Intelligence at the Jozef Stefan Institute.

This work was supported by the Slovenian Research Agency, and the European Union's Horizon 2020 project Humane AI Net [H2020-ICT-952026].

¹ S. Miranda, A. Znotiņš, S. B. Cohen, and G. Barzdins, “Multilingual clustering of streaming news,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 4535–4544.
