Background Info on the Community Detection Algorithms
A central step throughout the hCoCena analysis is the detection of modules of similarly expressed genes, which are represented by densely connected regions in the co-expression network. The modules are also referred to as communities. There is no single, uniform definition of a community; rather, the notion is that vertices within a community are more strongly connected, or communicate more strongly, with each other than with vertices outside of it (Raghavan et al., 2007). Since the subsequent contextual module analysis (with regard to shared functionality, group-specific expression and cross-module targeting of genes by transcription factors) strongly depends on how these modules are defined, an introduction to the basic concepts of the offered community detection algorithms is important. Which algorithm is best suited strongly depends on the data at hand, especially with respect to its size, the overall connectivity of the network and the nature of the edges. Not all of the presented algorithms are necessarily optimal for pure co-expression networks, but with regard to future versions of the tool, which will also allow the integration of omics types other than transcriptome data, these algorithms will gain in importance and are therefore already included. The overall understanding of the underlying theory, together with the data-based selection steps offered by several hCoCena functions, is meant to provide a guideline as to which community detection algorithm the user should choose.
As described in Blondel et al., 2008, the Louvain community detection algorithm is a greedy approximation algorithm that aims to find a local maximum of the modularity. Modularity quantifies the amount of intra-community links compared to inter-community links. More precisely, it measures how the proportion of edges within a community, relative to all edges of the network, compares to the proportion expected in a random network. The Louvain algorithm operates in two phases. It begins by assigning each node in the network to its own community, resulting in a number of communities equal to the number of nodes. For every node i in the network, the algorithm then evaluates the change in modularity that would result from removing i from its community and merging it into the community of any of its neighbours j. Eventually, i is placed in the community that provides the highest gain in modularity; if no gain can be achieved with any of its neighbours, i remains in its original community. This process is repeated iteratively for all nodes until the modularity converges, which concludes the first phase. Here it already becomes apparent that the algorithm stops as soon as a local maximum is encountered and thus does not guarantee an optimal solution. Also, the outcome and the runtime greatly depend on the order in which the nodes are considered in each iteration. The second phase of the Louvain algorithm consists of creating a new network in which nodes represent the communities found before. Edges between these community nodes are weighted by summing all edge weights connecting nodes of the two communities in the previous network. Each community node also acquires a self-loop, the weight of which equals the sum of all edge weights within that community. The two phases are repeated in iterative cycles until the nodes can no longer be clustered with a gain in modularity. Thus, the algorithm builds a hierarchy of community networks in an agglomerative fashion. The main advantages of this community detection algorithm are its speed and its ability to handle very large networks.
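For reference, the modularity Q that Louvain optimises is commonly written, for a weighted network, as

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

where $A_{ij}$ is the weight of the edge between nodes $i$ and $j$, $k_i$ is the weighted degree of node $i$, $m$ is the total edge weight in the network and $\delta(c_i, c_j)$ equals 1 if $i$ and $j$ are in the same community and 0 otherwise. The following sketch is purely illustrative and not taken from hCoCena's own code: it shows how a Louvain partition of a small toy graph (Zachary's karate club with uniform weights standing in for a co-expression network) could be obtained with python-igraph, where the algorithm is exposed as `community_multilevel`.

```python
import igraph as ig

# Illustrative toy graph: Zachary's karate club standing in for a
# co-expression network, with arbitrary uniform edge weights.
g = ig.Graph.Famous("Zachary")
g.es["weight"] = [1.0] * g.ecount()

# Louvain is exposed in igraph as community_multilevel(); it returns a
# VertexClustering, i.e. one community index per node.
louvain = g.community_multilevel(weights="weight")
print("communities:", len(louvain))
print("modularity :", round(louvain.modularity, 3))
print("membership :", louvain.membership)
```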
In their 2019 paper, Traag et al. pointed out that the Louvain algorithm may produce arbitrarily bad partitions and, in the worst case, may even result in disconnected communities. The authors therefore enhanced the Louvain algorithm by introducing a refined partitioning phase as well as a local moving procedure for the nodes. Their Leiden algorithm can make much stronger guarantees than the Louvain algorithm, among them that all found communities are well connected. For details, please refer to their paper. We generally recommend using the Leiden algorithm as a starting point for your analysis.
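As a purely illustrative sketch (again using python-igraph on a toy graph rather than hCoCena's own implementation), the Leiden algorithm is exposed as `community_leiden`; note that igraph optimises the Constant Potts Model by default, so `objective_function="modularity"` has to be requested explicitly to make the result comparable to Louvain.

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")           # illustrative toy graph only
g.es["weight"] = [1.0] * g.ecount()

# Request the modularity objective; igraph's default objective for
# community_leiden() is the Constant Potts Model (CPM).
leiden = g.community_leiden(objective_function="modularity", weights="weight")
print("communities:", len(leiden))
print("modularity :", round(leiden.modularity, 3))
```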
Unlike the above-described Louvain clustering algorithm, label propagation does not aim to optimize an objective function. It is a greedy approach, described by Raghavan et al., 2007, that follows a simple set of rules:
- Initially, assign a unique label to each node in the network.
- Iteratively consider each node and assign it the label that most of its neighbours have. If there is a tie, assign any of those labels randomly.
- Repeat this process until every node carries a label that is shared by at least as many of its neighbours as any other label, i.e. until no node would change its label anymore.
Since the stopping criterion does not optimize any value, many different solutions are possible, depending on the order in which the nodes are processed. A problem that can occur, though rarely, is that the final solution contains two or more communities that are not connected to each other but carry the same label. This happens if, at some point during the process, they started out as one community, but labels propagated from other communities took over the nodes that connected them. With regard to runtime, the time complexity of each iteration is O(m), with m being the number of edges in the network. The number of iterations needed to reach a solution cannot be determined beforehand, although the authors report that convergence is usually reached after approximately five iterations.
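Because label propagation optimises no objective and breaks ties at random, repeated runs on the same graph may return different partitions. The illustrative python-igraph sketch below (a toy graph, not an hCoCena call) makes this visible by running the algorithm several times.

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")           # illustrative toy graph only

# Label propagation is non-deterministic: several runs on the same graph
# can end up with different numbers of communities.
for run in range(3):
    lp = g.community_label_propagation()
    print(f"run {run}: {len(lp)} communities, modularity = {lp.modularity:.3f}")
```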
Much like the Louvain algorithm, the fast and greedy algorithm presented by Clauset et al., 2004 and [Newman, 2004](https://arxiv.org/ct?url=https%3A%2F%2Fdx.doi.org%2F10.1103%2FPhysRevE.69.066133&v=4d1dacba) is an agglomerative hierarchical clustering method. In fact, the fast and greedy algorithm can be seen as a predecessor of the Louvain algorithm. It uses the same modularity function as its greedy criterion but operates in a slightly different way. Like the Louvain algorithm, it initializes each node in the network as its own community. In each round it then merges the two communities that lead to the highest gain in modularity ΔQ or, if no gain is possible, the pair that induces the minimal loss. It repeats this process in a hierarchical manner until all communities have been merged into one. The resulting dendrogram can then be cut at the level with the greatest value of Q. The worst-case runtime of this algorithm is O(m · d · log(n)), with n and m being the number of nodes and edges in the graph, respectively, and d being the depth of the dendrogram. The algorithm therefore operates quite fast and is also applicable to larger networks with a few hundred thousand nodes and a couple of million edges. One prominent restriction is the merging step: unlike in the Louvain clustering, where single nodes are inspected and moved among communities in order to maximize Q, here communities are merged as a whole. This implies that once a node has been associated with a community, it will never leave this community again, except by being merged into a larger one together with all other nodes that are already part of its community.
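Since the fast and greedy method builds a complete merge hierarchy, igraph returns a dendrogram that still has to be cut; by default, `as_clustering()` cuts it at the level with the highest modularity. The sketch below, again purely illustrative and based on python-igraph's `community_fastgreedy` on a toy graph, shows this two-step usage.

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")           # illustrative toy graph only
g.es["weight"] = [1.0] * g.ecount()

# community_fastgreedy() returns the full agglomerative dendrogram;
# as_clustering() cuts it at the level with the highest modularity Q.
dendrogram = g.community_fastgreedy(weights="weight")
clusters = dendrogram.as_clustering()
print("communities:", len(clusters))
print("modularity :", round(clusters.modularity, 3))
```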
Unlike the algorithms described above, the infomap algorithm does not evaluate the community structure based on the modularity score but instead quantifies a community using its information flow (Rosvall and Bergstrom, 2008). A well-connected module is expected to have a fast and easy flow of information between its nodes. The key challenge of this idea is to find an efficient way to measure and encode information flow. The basis for measuring this information flow in the infomap algorithm is a random walk. Within a densely connected module, a random walker is expected to stay most of the time, and transitions between modules are rare. An advantage of this method is that random walks depend solely on the information inherent in the network structure and do not require prior knowledge such as the number of communities. In order to define the communities, the random walk needs to be encoded efficiently. Here, the authors use a two-level Huffman encoding, meaning that nodes that are visited by the random walk with very high frequency are assigned short code words, whereas nodes that are rarely visited are assigned long code words. This alone would result in an encoding that holds a unique identifier for every node in the network. To increase the method's efficiency, the authors introduce a second organisation level by assigning unique, short code words to communities as a whole and longer, non-unique code words to the nodes within the communities. Thus, a code word representing a node in one community can be re-used to describe a node in another community while the nodes remain uniquely identifiable through the community code, much like a zip code system with street names that are unique within a zip code but repeated across zip codes. This approach allows for a more condensed and efficient encoding and clearly associates nodes with unique community codes, thereby assigning them to modules. Concisely, finding community structure in a network can be formulated as a coding problem: the best separation of the network into modules is the one that optimizes the encoding such that the description of random walks on the network is as short as possible. Since finding an optimal solution scales exponentially with the number of nodes, a deterministic greedy method is applied to the visiting frequencies of the nodes by a random walker, and the result is then further refined using heat-bath simulated annealing. The main difference of this flow-based approach compared to the topological approaches above is that it leads to a better partitioning on directed networks that represent inherent flow, whereas in networks in which edges symbolise mere associations the modularity-based approaches usually excel. Practically, this means that this algorithm is less suitable for co-expression networks but is likely to provide an information gain in, e.g., metabolic networks representing the flow of metabolites.
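A minimal, illustrative python-igraph sketch of infomap is shown below; `community_infomap` repeats the stochastic optimisation several times (controlled by `trials`) and returns the resulting node-to-module assignment. As before, the toy graph and uniform weights are stand-ins and not part of hCoCena itself.

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")           # illustrative toy graph only
g.es["weight"] = [1.0] * g.ecount()

# infomap minimises the description length of a random walk on the network;
# several trials are run because the optimisation is stochastic.
infomap = g.community_infomap(edge_weights="weight", trials=10)
print("communities:", len(infomap))
print("membership :", infomap.membership)
```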
Like the infomap algorithm, the walktrap algorithm, described in Pons and Latapy, 2005, uses random walks to evaluate which nodes belong to the same community. Three basic concepts underlie the idea of defining a community via random walks. The first is that if two vertices u and v belong to the same community, the probability of a random walker making a step from u to v should be high. The second is that the probability of the random walker making a step from u to v depends on the degree of v, since high-degree nodes are more likely to be visited in general. Lastly, if two nodes u and v belong to the same community, the probability of walking from u to any other node k should be approximately the same as the probability of walking from v to k. The distances between each pair of nodes are calculated based on sufficiently long random walks. The algorithm then merges communities in an agglomerative fashion. The merging process follows two rules: communities must be connected in order to be merged, so that only connected communities are created, and the algorithm chooses the pair of clusters whose distance profiles to all other nodes differ least, in other words, whose random walk profiles are most similar. The algorithm has a worst-case runtime of O(m · n²), with m = |E| being the number of edges and n = |V| being the number of nodes in the graph, and is therefore more costly than most of the methods mentioned above. This, together with its considerable memory consumption, makes it rather unsuitable for networks with many thousands of nodes. The vertex distance measure introduced by the authors has so far only been proven for undirected graphs, so the algorithm is not yet suitable for directed graphs.
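Walktrap, like fast and greedy, yields a merge dendrogram in igraph; the length of the random walks is controlled by the `steps` parameter (4 being igraph's default). The sketch below is again purely illustrative and uses python-igraph on a toy graph rather than hCoCena's own code.

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")           # illustrative toy graph only
g.es["weight"] = [1.0] * g.ecount()

# Walktrap measures node distances via short random walks (here 4 steps)
# and merges communities agglomeratively; as_clustering() then cuts the
# resulting dendrogram at the modularity optimum.
dendrogram = g.community_walktrap(weights="weight", steps=4)
clusters = dendrogram.as_clustering()
print("communities:", len(clusters))
print("modularity :", round(clusters.modularity, 3))
```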