hcga: Highly comparative graph analysis

This is the official repository of hcga, a highly comparative graph analysis toolbox. It performs a massive feature extraction from a set of graphs, and applies supervised classification methods.

Networks are widely used as mathematical models of complex systems across many scientific disciplines, not only in biology and medicine but also in the social sciences, physics, computing and engineering. Decades of work have produced a vast corpus of research characterising the topological, combinatorial, statistical and spectral properties of graphs. Each graph property can be thought of as a feature that captures important (and sometimes overlapping) characteristics of a network. In the analysis of real-world graphs, it is crucial to systematically integrate a large number of diverse graph features in order to characterise and classify networks, as well as to aid network-based scientific discovery. Here, we introduce hcga, a framework for highly comparative analysis of graph data sets that computes several thousand graph features from any given network. hcga also offers a suite of statistical learning and data analysis tools for automated identification and selection of important and interpretable features underpinning the characterisation of graph data sets.
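To make the idea of a graph property as a feature concrete, here is a minimal sketch in plain Python. It is not hcga's implementation or API: the function name basic_graph_features and the five features are illustrative assumptions, and the sketch only sees nodes that appear in at least one edge.

```python
from collections import defaultdict


def basic_graph_features(edges):
    """Map an undirected simple graph, given as (u, v) pairs, to named features.

    Illustrative only: this is not hcga's API.
    """
    # Build an adjacency map, ignoring self-loops and duplicate edges.
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue
        neighbours[u].add(v)
        neighbours[v].add(u)

    degrees = [len(nbrs) for nbrs in neighbours.values()]
    n_nodes = len(degrees)
    n_edges = sum(degrees) // 2  # each edge is counted at both endpoints

    # Density: fraction of possible node pairs that are connected.
    max_edges = n_nodes * (n_nodes - 1) / 2
    return {
        "n_nodes": n_nodes,
        "n_edges": n_edges,
        "density": n_edges / max_edges if max_edges else 0.0,
        "mean_degree": sum(degrees) / n_nodes if n_nodes else 0.0,
        "max_degree": max(degrees, default=0),
    }


# Two small graphs become two rows of a feature matrix.
triangle = [(0, 1), (1, 2), (2, 0)]
path = [(0, 1), (1, 2), (2, 3)]
feature_rows = [basic_graph_features(g) for g in (triangle, path)]
```

Each dictionary is one row of a feature matrix, which is the kind of input a supervised classifier expects. The triangle and the path have the same number of edges but differ in density and mean degree, so even these simple features separate them.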

Installation

For users who are not familiar with Python, we apologise that this code isn't available in other languages. To help you get set up, the steps required to install Python and the necessary dependencies are described at the bottom of this page.

hcga can easily be installed via PyPI:

pip install hcga

Alternatively, clone the repository, navigate to the main folder and install using:

pip install .

Main Work Flow

1. Create a dataset

Benchmark datasets from Graphkernel (https://ls11-www.cs.tu-dortmund.de/people/morris/graphkerneldatasets) can be loaded directly with:

$ hcga get_data DATASET

where, for example, DATASET can be one of:

  • ENZYMES
  • DD
  • COLLAB
  • PROTEINS
  • REDDIT-MULTI-12K

To create a custom dataset, please follow one of the examples in the examples/ directory.

2. Extract features

Once a dataset is created, features can be extracted using, for example:

$ hcga -v extract_features dataset.pkl --mode fast --timeout 10 --n-workers 4 

We refer to the hcga app documentation for more details, but the main options here include:

  • --mode fast: only extract simple features (other options include medium/slow)
  • --timeout 10: stop feature computation after 10 seconds (this prevents some features from getting stuck)
  • --n-workers 4: set the number of workers in multiprocessing
  • --runtime: this option runs a small set of graphs and outputs estimated times for each feature
  • -v: verbose mode, to have more information on the state of the run
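When feature extraction is launched from a Python script rather than typed by hand, the options above can be assembled programmatically. The sketch below is one possible wrapper, not part of hcga: the names build_extract_command and extract_features are hypothetical, and it assumes that --runtime is a bare flag that takes no value and can be combined with the other options. The hcga app documentation has the authoritative option list.

```python
import subprocess

# Modes named in the option list above.
MODES = ("fast", "medium", "slow")


def build_extract_command(dataset, mode="fast", timeout=10, n_workers=4,
                          runtime=False, verbose=True):
    """Return the argument list for `hcga extract_features` (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    command = ["hcga"]
    if verbose:
        command.append("-v")  # -v precedes the subcommand, as in the example
    command += [
        "extract_features", str(dataset),
        "--mode", mode,
        "--timeout", str(timeout),
        "--n-workers", str(n_workers),
    ]
    if runtime:
        command.append("--runtime")  # assumption: a flag without a value
    return command


def extract_features(dataset, **options):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_extract_command(dataset, **options), check=True)
```

Calling build_extract_command("dataset.pkl") reproduces the example command line as a list of arguments. Passing a list to subprocess.run, rather than a single shell string, avoids quoting problems when the dataset path contains spaces.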

3. Classify graphs

Finally, to use the extracted features to classify graphs with respect to their labels, one uses:

$ hcga feature_analysis dataset --interpretability 1

where dataset is the name of the dataset, and --interpretability 1 selects features of every interpretability level. Choices range from 1 to 5, where 5 uses only the most interpretable features.
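One way to read the interpretability option is as a threshold on a per-feature score. The sketch below illustrates that reading only: it assumes each feature carries an integer score from 1 to 5 and is kept when its score is at least the chosen level. The function name, the feature names and their scores are all hypothetical, and hcga's actual selection logic may differ.

```python
def select_by_interpretability(scores, level):
    """Keep features whose interpretability score is at least `level`.

    `scores` maps feature name -> integer score in 1..5 (5 = most interpretable).
    Assumed semantics, not taken from hcga's source code.
    """
    if level not in range(1, 6):
        raise ValueError("level must be an integer from 1 to 5")
    return sorted(name for name, score in scores.items() if score >= level)


# Hypothetical feature names and scores, for illustration only.
example_scores = {
    "n_nodes": 5,
    "density": 5,
    "mean_clustering": 4,
    "spectral_gap": 2,
    "embedding_norm": 1,
}

all_features = select_by_interpretability(example_scores, 1)
most_interpretable = select_by_interpretability(example_scores, 5)
```

Under this reading, level 1 keeps all five example features, while level 5 keeps only the two with the top score.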

Documentation

Head over to our documentation to find out more about installation, data handling, creation of datasets and a full list of implemented features, transforms, and datasets. For a quick start, check out our examples in the examples/ directory.

Contributors

  • Robert Peach, GitHub: peach-lucien (https://github.com/peach-lucien)
  • Alexis Arnaudon, GitHub: arnaudon (https://github.com/arnaudon)
  • Henry Palasciano, GitHub: henrypalasciano (https://github.com/henrypalasciano)
  • Nathan Bernier, GitHub: nrbernier (https://github.com/nrbernier)
  • Julia Schmidt, GitHub: misterblonde (https://github.com/misterblonde)

We are always on the lookout for individuals who are interested in contributing to this open-source project. Even if you are just using hcga and have made some minor updates, we would be interested in your input.

To contribute you just need to follow some simple steps:

  1. Create a GitHub account and sign in.
  2. Fork the hcga repository to your own GitHub account. You can do this by clicking on the Fork link in the upper right.
  3. Clone the forked repository to your local machine e.g. git clone https://github.com/your_user_name/hcga.git
  4. Navigate to your local repository in the command terminal.
  5. Add the original hcga repository as your upstream e.g. git remote add upstream https://github.com/barahona-research-group/hcga.git
  6. Pull the latest changes by typing git pull upstream master
  7. Create a new branch with some name by typing git checkout -b BRANCH_NAME
  8. Make the changes to the code that you wanted to make.
  9. Stage your changes with git add -A
  10. Commit your changes with git commit -m "DESCRIPTION OF CHANGES"
  11. Push your changes by typing git push origin BRANCH_NAME
  12. Go back to the forked repository on the GitHub website and begin the pull request by clicking the green compare and pull request button.

Thanks to anyone and everyone who chooses to contribute to this project.

Cite

Please cite our paper if you use this code in your own work:

Robert L. Peach, Alexis Arnaudon, Julia A. Schmidt, Henry A. Palasciano, Nathan R. Bernier, Kim E. Jelfs, Sophia N. Yaliraki, Mauricio Barahona, HCGA: Highly comparative graph analysis for network phenotyping, Patterns 2 (4), 100227 (2021) ISSN 2666-3899, https://doi.org/10.1016/j.patter.2021.100227.

Originally appeared as a bioRxiv preprint: bioRxiv 2020.09.25.312926; doi: https://doi.org/10.1101/2020.09.25.312926


The bibtex reference:

@article{PEACH2021,
title = {HCGA: Highly comparative graph analysis for network phenotyping}, 
journal = {Patterns},
volume = {2},
number = {4},
pages = {100227},
year = {2021},
issn = {2666-3899},
doi = {https://doi.org/10.1016/j.patter.2021.100227},
url = {https://www.sciencedirect.com/science/article/pii/S2666389921000416},
author = {Robert L. Peach and Alexis Arnaudon and Julia A. Schmidt and Henry A. Palasciano and Nathan R. Bernier and Kim E. Jelfs and Sophia N. Yaliraki and Mauricio Barahona}
}

Run example

In the example folder, the script run_example.sh can be used to run the benchmark examples in the paper:

./run_example.sh DATASET

where DATASET is one of

  • ENZYMES
  • DD
  • COLLAB
  • PROTEINS
  • REDDIT-MULTI-5K

Other examples can be found as Jupyter notebooks in the examples/ directory. We have included seven examples:

  • Example 1: Classification on synthetic data
  • Example 2: Regression on synthetic data
  • Example 3: Large Molecule dataset and regression
  • Example 4: Training on labelled data, saving the fitted model, and predicting on unseen unlabelled data.
  • Example 5: Pairwise classification. Exploring the similarity of classes.
  • Example 6: Loading data in different ways.
  • Example 7: Using different classifiers.

Python, Anaconda and hcga installation

The simplest setup is to install Anaconda. Anaconda is a package manager and contains useful IDEs for writing and viewing Python scripts and notebooks. Choose one of the links below depending on your operating system:

  • Windows users. Simply download the installer and make sure to register Anaconda3 as the default Python.
  • Mac users. Perform the standard installation.
  • Linux users. Linux often requires additional dependencies that vary by distribution; these are described in the link.

Please update to the most recent version of Python (>3.7). If you are using Anaconda, type conda update python.

Once Anaconda is installed, you can download the hcga project. There are two ways to do this:

  1. Manually by clicking on the green Code button and downloading the zip file. Then unzip into your directory of choice.
  2. If you have git then clone the code. Go to your terminal and type git clone https://github.com/barahona-research-group/hcga.git

Once hcga is on your local machine you can open your command terminal (in any operating system), navigate into the hcga folder and simply type:

pip install .

If you are running on Windows and receive an 'access denied' error, either run your command terminal as administrator or try the command:

pip install --user . 

The hcga package should now be installed directly into your Anaconda packages alongside other dependencies.

If you want to run the example scripts then you need to open jupyter-notebook. Alternatively, you can run the example python scripts directly from the command line (see Run Example above). Thankfully jupyter-notebook is automatically installed with Anaconda. To open jupyter-notebook open a command terminal and type:

jupyter-notebook

You can then navigate to the examples folder and open the notebook of your choosing.

Our other available packages

If you are interested in trying our other packages, see the list below:

  • GDR : Graph diffusion reclassification. A methodology for node classification using graph semi-supervised learning.
  • hcga : Highly comparative graph analysis. A graph analysis toolbox that performs massive feature extraction from a set of graphs, and applies supervised classification methods.
  • MSC : MultiScale Centrality: A scale dependent metric of node centrality.
  • DynGDim : Dynamic Graph Dimension: Computing the relative, local and global dimension of complex networks.
  • PyGenStability : Markov Stability: Computing the Markov Stability graph community detection algorithm in Python.
  • RMST : Relaxed Minimum Spanning Tree: Computing the relaxed minimum spanning tree to sparsify networks whilst retaining dynamic structure.
  • StEP : Spatial-temporal Epidemiological Proximity: Characterising contact in disease outbreaks via a network model of spatial-temporal proximity.