Wójcik, F. (2024). An Analysis of Novel Money Laundering Data Using Heterogeneous Graph Isomorphism Networks. FinCEN Files Case Study. Econometrics. Ekonometria. Advances in Applied Data Analysis, 28(2), 32-49.
This project accompanies the above-mentioned publication and focuses on developing and applying the novel HexGIN (Heterogeneous extension for Graph Isomorphism Network) model to the FinCEN Files case data. The primary goal is to compare HexGIN's performance with existing solutions such as the SAGE-based graph neural network and Multi-Layer Perceptron (MLP), demonstrating its potential advantages in anti-money laundering (AML) systems.
The dataset in the data/01_raw folder contains the original files made publicly available by the International Consortium of Investigative Journalists (ICIJ) as part of the FinCEN Files investigation. The data, together with the full case description, can be found at the original data source.
The data processing pipeline consists of several stages:
- Data Collection and Cleaning:
- Load raw transaction data from the FinCEN Files.
- Clean the data to handle missing values, remove duplicates, and correct inconsistencies.
- Feature Engineering:
- Transform transaction data into a graph structure.
- Extract relevant features such as node attributes and edge attributes.
- Graph Construction:
- Construct a heterogeneous graph representing various entities (e.g., individuals, accounts) and their relationships (e.g., transactions).
- Data Splitting:
- Split the graph data into training, validation, and test sets ensuring no data leakage between sets.
- Normalization and Scaling:
- Apply normalization and scaling techniques to ensure the data is suitable for model training.
- Preparation of Training Data:
- Format the data into a suitable structure for input into the different models (HexGIN, Graph SAGE, MLP).
- Model Training:
- Train the HexGIN model on the training data.
- Also train baseline models (Graph SAGE and MLP) for comparison.
- Model Evaluation:
- Evaluate the models using cross-validation on the training set.
- Use metrics such as F1 score, precision, and ROC AUC for performance comparison.
- Testing:
- Apply the trained models to the test set and compare their performance.
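The graph-construction step above can be sketched in plain Python. This is a minimal illustration only, not the project's actual pipeline code: the record fields (`originator_bank`, `beneficiary_bank`, `amount`) are assumed names, and the real pipeline produces typed node and edge tensors for the GNN models.

```python
def build_hetero_graph(transactions):
    """Sketch of turning flat transaction records into a graph:
    map each distinct bank to an integer node id and collect
    (src, dst) edges with the amount as an edge attribute."""
    node_ids = {}    # entity name -> node index
    edges = []       # (src_id, dst_id) pairs
    edge_attrs = []  # one attribute vector per edge

    def get_id(name):
        # Assign the next free integer id on first sight of an entity.
        if name not in node_ids:
            node_ids[name] = len(node_ids)
        return node_ids[name]

    for tx in transactions:
        src = get_id(tx["originator_bank"])
        dst = get_id(tx["beneficiary_bank"])
        edges.append((src, dst))
        edge_attrs.append([tx["amount"]])

    return node_ids, edges, edge_attrs

# Toy run on two made-up transactions:
node_ids, edges, edge_attrs = build_hetero_graph([
    {"originator_bank": "Bank A", "beneficiary_bank": "Bank B", "amount": 1000.0},
    {"originator_bank": "Bank B", "beneficiary_bank": "Bank C", "amount": 250.0},
])
# node_ids -> {"Bank A": 0, "Bank B": 1, "Bank C": 2}
# edges    -> [(0, 1), (1, 2)]
```

In the actual project, the resulting structure additionally distinguishes node and edge types, which is what makes the graph heterogeneous.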
The picture below presents a detailed overview of the processing pipeline and the dependencies between its steps.
- HexGIN: A novel extension of Graph Isomorphism Networks capable of handling heterogeneous data.
- Graph SAGE: A well-established graph neural network model used for inductive node embedding.
- MLP (Multi-Layer Perceptron): A traditional neural network model that operates on flattened tabular data.
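As a rough illustration of what GIN-family models compute, the core node update is h'_v = MLP((1 + eps) * h_v + sum of neighbour features). The sketch below is a simplified, dependency-free version of that update, not the actual HexGIN implementation: the real model learns eps and the MLP weights, and applies separate transformations per node and edge type to handle heterogeneity.

```python
def gin_update(features, neighbours, eps, mlp):
    """One GIN message-passing step over all nodes.

    features   -- list of per-node feature vectors (lists of floats)
    neighbours -- neighbours[v] is the list of nodes with an edge into v
    eps        -- a learnable scalar in the real model, a constant here
    mlp        -- callable applied to each aggregated vector
    """
    dim = len(features[0])
    updated = []
    for v, h_v in enumerate(features):
        # Weight the node's own features by (1 + eps)...
        agg = [(1 + eps) * h_v[d] for d in range(dim)]
        # ...then sum in the features of its neighbours.
        for u in neighbours[v]:
            for d in range(dim):
                agg[d] += features[u][d]
        updated.append(mlp(agg))
    return updated

# Toy run: three nodes on a path 0 - 1 - 2, identity MLP, eps = 0.
feats = [[1.0], [2.0], [3.0]]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
print(gin_update(feats, nbrs, 0.0, lambda x: x))  # [[3.0], [6.0], [5.0]]
```

The sum aggregator is what gives GIN its discriminative power over mean- or max-based schemes such as GraphSAGE's default aggregators.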
Dependency resolution in this project is handled by Poetry; environment management is handled with Conda.
1. Clone the repository:

   ```shell
   git clone <repository-url>
   cd <repository-directory>
   ```

2. Create a Conda environment:

   ```shell
   conda env create -f environment.yml
   conda activate hexgin
   ```

3. Install dependencies using Poetry:

   ```shell
   pip install poetry
   poetry install
   ```

4. Run the experiments:

   ```shell
   kedro run
   ```

For your convenience, steps 1-3 can be automated by running the following command:

```shell
sh ./setup_project.sh
```

You will still need to activate the environment afterwards via:

```shell
conda activate hexgin
```
- Compare results: the `compare_results.ipynb` notebook provides a detailed comparison of the models' performance, presenting the differences between HexGIN, Graph SAGE, and MLP.
- Models presentation: the `models_presentation.ipynb` notebook provides a detailed overview of the HexGIN, Graph SAGE, and MLP models, including their architectures and training process.
To run the entire pipeline, use the following command:
kedro run
To visualize the pipeline, use:
kedro viz