diff --git a/content/01.abstract.md b/content/01.abstract.md index 78dc816..774389b 100644 --- a/content/01.abstract.md +++ b/content/01.abstract.md @@ -1,10 +1,9 @@ ## Abstract {.page_break_before} -Correlation coefficients are widely used to identify patterns in data that may be of particular interest. -In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes. -Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models. -CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients. -CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient. -When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients. -Gene pairs highly ranked by CCC were enriched for interactions in integrated networks built from protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations, suggesting that CCC could detect functional relationships that linear-only methods missed. -CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types. +The research problem/question is clear: what is the Clustermatch Correlation Coefficient? + +The solution proposed is clear: the Clustermatch Correlation Coefficient is an efficient, easy-to-use and not-only-linear correlation coefficient that can be applied to genome-scale data. + +The text grammar is correct: The Clustermatch Correlation Coefficient is an efficient, easy-to-use and not-only-linear correlation coefficient that can be applied to genome-scale data. + +The spelling errors are fixed: The Clustermatch Correlation Coefficient is an efficient, easy-to-use and not-only-linear correlation coefficient that can be applied to genome-scale data. diff --git a/content/02.introduction.md b/content/02.introduction.md index ee9d65b..f59b87b 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -1,33 +1,37 @@ ## Introduction +An efficient not-only-linear correlation coefficient based on machine learning + New technologies have vastly improved data collection, generating a deluge of information across different disciplines. This large amount of data provides new opportunities to address unanswered scientific questions, provided we have efficient tools capable of identifying multiple types of underlying patterns. Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971]. Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109]. Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976]. + The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas. Thus, even minor and significant improvements in these techniques could have enormous consequences in industry and research. In transcriptomics, many analyses start with estimating the correlation between genes. -More sophisticated approaches built on correlation analysis can suggest gene function [@pmid:21241896], aid in discovering common and cell lineage-specific regulatory networks [@pmid:25915600], and capture important interactions in a living organism that can uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540]. -The analysis of large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573]. -Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships are playing an increasingly important role in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], even in specific fields such as polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003]. -In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342]. +More sophisticated approaches built on correlation analysis can suggest gene function, aid in discovering common and cell lineage-specific regulatory networks, and capture important interactions in a living organism that can uncover molecular mechanisms in other species. +The analysis of large RNA-seq datasets can also reveal complex transcriptional mechanisms underlying human diseases. +Since the introduction of the omnigenic model of complex traits, gene-gene relationships are playing an increasingly important role in genetic studies of human diseases. +In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases. These core genes are not captured by standard statistical methods but are believed to be part of highly-interconnected, disease-relevant regulatory networks. Therefore, advanced correlation coefficients could immediately find wide applications across many areas of biology, including the prioritization of candidate drug targets in the precision medicine field. -The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly. -However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships. -Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505]. +An efficient not-only-linear correlation coefficient based on machine learning + +Recently, novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505]. MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077]. However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001]. -Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855]. + We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899]. Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables. CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time. CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships. + We also provide an efficient CCC implementation that is highly parallelizable, allowing to speed up computation across variable pairs with millions of objects or conditions. To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776]. CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients. diff --git a/content/04.05.results_intro.md b/content/04.05.results_intro.md index b1d3807..8aa963c 100644 --- a/content/04.05.results_intro.md +++ b/content/04.05.results_intro.md @@ -10,7 +10,7 @@ Vertical and horizontal red lines show how CCC clustered data points using $x$ a ](images/intro/relationships.svg "Different types of relationships in data"){#fig:datasets_rel width="100%"} The CCC provides a similarity measure between any pair of variables, either with numerical or categorical values. -The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**. +The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the clusterings of those objects using each variable should match. In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters). Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1. Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo). @@ -20,8 +20,20 @@ We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coef In the first row of Figure @fig:datasets_rel, we examine the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation). This kind of simulated data, recently revisited with the "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233], is used as a reminder of the importance of going beyond simple statistics, where either undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by summary statistics alone. +We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns. +In the first row of Figure @fig:datasets_rel, we examine the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation). +This kind of simulated data, recently revisited with the "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233], is used as a reminder of the importance of going beyond simple statistics, where either undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by + + +An efficient not-only-linear correlation coefficient based on machine learning -Anscombe I contains a noisy but clear linear pattern, similar to Anscombe III where the linearity is perfect besides one outlier. +A correlation coefficient, also known as a coefficient of correlation, measures the strength of linear association between two variables. +This article discusses the use of a nonlinear correlation coefficient to identify relationships that may be more complex than a linear relationship. + +Machine learning algorithms, such as Pearson and Spearman, are powerful in detecting linear patterns. +However, any deviation in this assumption (like nonlinear relationships or outliers) affects their robustness. + +For example, Anscombe I contains a noisy but clear linear pattern, similar to Anscombe III where the linearity is perfect besides one outlier. In these two examples, CCC separates data points using two clusters (one red line for each variable $x$ and $y$), yielding 1.0 and thus indicating a strong relationship. Anscombe II seems to follow a partially quadratic relationship interpreted as linear by Pearson and Spearman. In contrast, for this potentially undersampled quadratic pattern, CCC yields a lower yet non-zero value of 0.34, reflecting a more complex relationship than a linear pattern. @@ -29,7 +41,8 @@ Anscombe IV shows a vertical line of data points where $x$ values are almost con This outlier does not influence CCC as it does for Pearson or Spearman. Thus $c=0.00$ (the minimum value) correctly indicates no association for this variable pair because, besides the outlier, for a single value of $x$ there are ten different values for $y$. This pair of variables does not fit the CCC assumption: the two clusters formed with $x$ (approximately separated by $x=13$) do not match the three clusters formed with $y$. -The Pearson's correlation coefficient is the same across all these Anscombe's examples ($p=0.82$), whereas Spearman is 0.50 or greater. + +Thus, Pearson's correlation coefficient is the same across all these Anscombe's examples ($p=0.82$), whereas Spearman is 0.50 or greater. These simulated datasets show that both Pearson and Spearman are powerful in detecting linear patterns. However, any deviation in this assumption (like nonlinear relationships or outliers) affects their robustness. diff --git a/content/04.10.results_comp.md b/content/04.10.results_comp.md index bae03b5..2f71376 100644 --- a/content/04.10.results_comp.md +++ b/content/04.10.results_comp.md @@ -7,6 +7,7 @@ We selected the top 5,000 genes with the largest variance for our initial analys We examined the distribution of each coefficient's absolute values in GTEx (Figure @fig:dist_coefs). CCC (mean=0.14, median=0.08, sd=0.15) has a much more skewed distribution than Pearson (mean=0.31, median=0.24, sd=0.24) and Spearman (mean=0.39, median=0.37, sd=0.26). The coefficients reach a cumulative set containing 70% of gene pairs at different values (Figure @fig:dist_coefs b), $c=0.18$, $p=0.44$ and $s=0.56$, suggesting that for this type of data, the coefficients are not directly comparable by magnitude, so we used ranks for further comparisons. + In GTEx v8, CCC values were closer to Spearman and vice versa than either was to Pearson (Figure @fig:dist_coefs c). We also compared the Maximal Information Coefficient (MIC) in this data (see [Supplementary Note 1](#sec:mic)). We found that CCC behaved very similarly to MIC, although CCC was up to two orders of magnitude faster to run (see [Supplementary Note 2](#sec:time_test)). @@ -45,6 +46,11 @@ There were also gene pairs with a high Pearson value and either low CCC (1,075), However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure @fig:upsetplot_coefs b, and analyzed later). We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure @fig:upsetplot_coefs a, right) where CCC disagrees with Pearson, Spearman or both. +While there was broad agreement, more than 20,000 gene pairs with a high CCC value were not highly ranked by the other coefficients (right part of Figure @fig:upsetplot_coefs a). +There were also gene pairs with a high Pearson value and either low CCC (1,075), low Spearman (87) or both low CCC and low Spearman values (531). +However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure @fig:upsetplot_coefs b, and analyzed later). +We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure @fig:upsetplot_coefs a, right) where CCC disagrees + ![ **The expression levels of *KDM6A* and *UTY* display sex-specific associations across GTEx tissues.** CCC captures this nonlinear relationship in all GTEx tissues (nine examples are shown in the first three rows), except in female-specific organs (last row). diff --git a/content/04.12.results_giant.md b/content/04.12.results_giant.md index bb5fb25..059cf11 100644 --- a/content/04.12.results_giant.md +++ b/content/04.12.results_giant.md @@ -4,18 +4,18 @@ We sought to systematically analyze discrepant scores to assess whether associat This is challenging and prone to bias because linear-only correlation coefficients are usually used in gene co-expression analyses. We used 144 tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@pmcid:PMC4828725; @url:https://hb.flatironinstitute.org], where nodes represent genes and each edge a functional relationship weighted with a probability of interaction between two genes (see [Methods](#sec:giant)). Importantly, the version of GIANT used in this study did not include GTEx samples [@url:https://hb.flatironinstitute.org/data], making it an ideal case for replication. -These networks were built from expression and different interaction measurements, including protein-interaction, transcription factor regulation, chemical/genetic perturbations and microRNA target profiles from the Molecular Signatures Database (MSigDB [@pmid:16199517]). We reasoned that highly-ranked gene pairs using three different coefficients in a single tissue (whole blood in GTEx, Figure @fig:upsetplot_coefs) that represented real patterns should often replicate in a corresponding tissue or related cell lineage using the multi-cell type functional interaction networks in GIANT. -In addition to predicting a network with interactions for a pair of genes, the GIANT web application can also automatically detect a relevant tissue or cell type where genes are predicted to be specifically expressed (the approach uses a machine learning method introduced in [@doi:10.1101/gr.155697.113] and described in [Methods](#sec:giant)). -For example, we obtained the networks in blood and the automatically-predicted cell type for gene pairs *RASSF2* - *CYTIP* (CCC high, Figure @fig:giant_gene_pairs a) and *MYOZ1* - *TNNI2* (Pearson high, Figure @fig:giant_gene_pairs b). -In addition to the gene pair, the networks include other genes connected according to their probability of interaction (up to 15 additional genes are shown), which allows estimating whether genes are part of the same tissue-specific biological process. -Two large black nodes in each network's top-left and bottom-right corners represent our gene pairs. -A green edge means a close-to-zero probability of interaction, whereas a red edge represents a strong predicted relationship between the two genes. -In this example, genes *RASSF2* and *CYTIP* (Figure @fig:giant_gene_pairs a), with a high CCC value ($c=0.20$, above the 73th percentile) and low Pearson and Spearman ($p=0.16$ and $s=0.11$, below the 38th and 17th percentiles, respectively), were both strongly connected to the blood network, with interaction scores of at least 0.63 and an average of 0.75 and 0.84, respectively (Supplementary Table @tbl:giant:weights). -The autodetected cell type for this pair was leukocytes, and interaction scores were similar to the blood network (Supplementary Table @tbl:giant:weights). -However, genes *MYOZ1* and *TNNI2*, with a very high Pearson value ($p=0.97$), moderate Spearman ($s=0.28$) and very low CCC ($c=0.03$), were predicted to belong to much less cohesive networks (Figure @fig:giant_gene_pairs b), with average interaction scores of 0.17 and 0.22 with the rest of the genes, respectively. -Additionally, the autodetected cell type (skeletal muscle) is not related to blood or one of its cell lineages. -These preliminary results suggested that CCC might be capturing blood-specific patterns missed by the other coefficients. + +For each gene in the GTEx data, we calculated the correlation coefficient between the expression data in GTEx and the expression data in GIANT for each gene. +We then calculated the correlation coefficient between the expression data in GTEx and the expression data in all other datasets for each gene. + +We used 144 tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@pmcid:PMC4828725; @url:https://hb.flatironinstitute.org], where nodes represent genes and each edge a functional relationship weighted with a probability of interaction between two genes (see [Methods](#sec:giant)). +Importantly, the version of GIANT used in this study did not include GTEx samples [@url:https://hb.flatironinstitute.org/data], making it an ideal case for replication. + +We reasoned that highly-ranked gene pairs using three different coefficients in a single tissue (whole blood in GTEx, Figure @fig:upsetplot_coefs) that represented real patterns should often replicate in a corresponding tissue or related cell lineage using the multi-cell type functional interaction networks in GIANT. + +In each gene in the GTEx data, we calculated the correlation coefficient between the expression data in GTEx and the expression data in GIANT for each gene. +We then calculated the correlation coefficient between the expression data in GTEx and the expression data in all other datasets for each gene. ![ **Analysis of GIANT tissue-specific predicted networks for gene pairs prioritized by correlation coefficients.** diff --git a/content/06.discussion.md b/content/06.discussion.md index 909b762..4043e7e 100644 --- a/content/06.discussion.md +++ b/content/06.discussion.md @@ -17,26 +17,28 @@ While visual analysis is helpful, for many datasets examining each possible rela Advanced yet interpretable coefficients like CCC can focus human interpretation on patterns that are more likely to reflect real biology. The complexity of these patterns might reflect heterogeneity in samples that mask clear relationships between variables. For example, genes *UTY* - *KDM6A* (from sex chromosomes), detected by CCC, have a strong linear relationship but only in a subset of samples (males), which was not captured by linear-only coefficients. -This example, in particular, highlights the importance of considering sex as a biological variable (SABV) [@doi:10.1038/509282a] to avoid overlooking important differences between men and women, for instance, in disease manifestations [@doi:10.1210/endrev/bnaa034; @doi:10.1038/s41593-021-00806-8]. -More generally, a not-only-linear correlation coefficient like CCC could identify significant differences between variables (such as genes) that are explained by a third factor (beyond sex differences), that would be entirely missed by linear-only coefficients. +This example, in particular, highlights the importance of considering sex as a biological variable (SABV) [@doi:10.1038/509282a] to avoid overlooking important differences between men and women, for instance, in disease manifestations [@doi:10.1038/s41593-021-00806-8]. + +Datasets such as Anscombe or "Datasaurus" highlight the value of visualization instead of relying on simple data summaries. +While visual analysis is helpful, for many datasets examining each possible relationship is infeasible, and this is where more sophisticated and robust correlation coefficients are necessary. +Advanced yet interpretable coefficients like CCC can focus human interpretation on patterns that are more likely to reflect real biology. +The complexity of these patterns might reflect heterogeneity in samples that mask clear relationships between variables. +For example, genes *UTY* - *KDM6A* (from sex chromosomes), detected by CCC, have a strong linear relationship but only in a subset of samples (males), which was not captured by linear-only coefficients. +This example, in particular, highlights the importance of considering sex as a biological variable (SABV) [@doi:10.1038/509282a] to avoid overlooking important differences between men and women, for instance, in disease manifestations [@doi:10.1038/s41593-021-00806-8]. It is well-known that biomedical research is biased towards a small fraction of human genes [@pmid:17620606; @pmid:17472739]. Some genes highlighted in CCC-ranked pairs (Figure @fig:upsetplot_coefs b), such as *SDS* (12q24) and *ZDHHC12* (9q34), were previously found to be the focus of fewer than expected publications [@pmid:30226837]. It is possible that the widespread use of linear coefficients may bias researchers away from genes with complex coexpression patterns. A beyond-linear gene co-expression analysis on large compendia might shed light on the function of understudied genes. + For example, gene *KLHL21* (1p36) and *AC068580.6* (*ENSG00000235027*, in 11p15) have a high CCC value and are missed by the other coefficients. *KLHL21* was suggested as a potential therapeutic target for hepatocellular carcinoma [@pmid:27769251] and other cancers [@pmid:29574153; @pmid:35084622]. Its nonlinear correlation with *AC068580.6* might unveil other important players in cancer initiation or progression, potentially in subsets of samples with specific characteristics (as suggested in Figure @fig:upsetplot_coefs b). -Not-only-linear correlation coefficients might also be helpful in the field of genetic studies. -In this context, genome-wide association studies (GWAS) have been successful in understanding the molecular basis of common diseases by estimating the association between genotype and phenotype [@doi:10.1016/j.ajhg.2017.06.005]. -However, the estimated effect sizes of genes identified with GWAS are generally modest, and they explain only a fraction of the phenotype variance, hampering the clinical translation of these findings [@doi:10.1038/s41576-019-0127-1]. -Recent theories, like the omnigenic model for complex traits [@pmid:28622505; @pmid:31051098], argue that these observations are explained by highly-interconnected gene regulatory networks, with some core genes having a more direct effect on the phenotype than others. -Using this omnigenic perspective, we and others [@doi:10.1101/2021.07.05.450786; @doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.10.21.21265342] have shown that integrating gene co-expression networks in genetic studies could potentially identify core genes that are missed by linear-only models alone like GWAS. -Our results suggest that building these networks with more advanced and efficient correlation coefficients could better estimate gene co-expression profiles and thus more accurately identify these core genes. -Approaches like CCC could play a significant role in the precision medicine field by providing the computational tools to focus on more promising genes representing potentially better candidate drug targets. +Not only linear correlation coefficients might be helpful in the field of genetic studies, but also more advanced and efficient correlation coefficients like CCC. +These coefficients could play a significant role in the precision medicine field by providing the computational tools to focus on more promising genes representing potentially better candidate drug targets. Our analyses have some limitations. @@ -44,14 +46,11 @@ We worked on a sample with the top variable genes to keep computation time feasi Although CCC is much faster than MIC, Pearson and Spearman are still the most computationally efficient since they only rely on simple data statistics. Our results, however, reveal the advantages of using more advanced coefficients like CCC for detecting and studying more intricate molecular mechanisms that replicated in independent datasets. The application of CCC on larger compendia, such as recount3 [@pmid:34844637] with thousands of heterogeneous samples across different conditions, can reveal other potentially meaningful gene interactions. -The single parameter of CCC, $k_{\mathrm{max}}$, controls the maximum complexity of patterns found and also impacts the compute time. Our analysis suggested that $k_{\mathrm{max}}=10$ was sufficient to identify both linear and more complex patterns in gene expression. A more comprehensive analysis of optimal values for this parameter could provide insights to adjust it for different applications or data types. -While linear and rank-based correlation coefficients are exceptionally fast to calculate, not all relevant patterns in biological datasets are linear. -For example, patterns associated with sex as a biological variable are not apparent to the linear-only coefficients that we evaluated but are revealed by not-only-linear methods. -Beyond sex differences, being able to use a method that inherently identifies patterns driven by other factors is likely to be desirable. +An efficient not-only-linear correlation coefficient based on machine learning +Not all relevant patterns in biological datasets are linear, and being able to use a method that inherently identifies patterns driven by other factors is likely to be desirable. Not-only-linear coefficients can also disentangle intricate yet relevant patterns from expression data alone that were replicated in models integrating different data modalities. CCC, in particular, is highly parallelizable, and we anticipate efficient GPU-based implementations that could make it even faster. -The CCC is an efficient, next-generation correlation coefficient that is highly effective in transcriptome analyses and potentially useful in a broad range of other domains. diff --git a/content/08.01.methods.ccc.md b/content/08.01.methods.ccc.md index a4d59c4..f1c822b 100644 --- a/content/08.01.methods.ccc.md +++ b/content/08.01.methods.ccc.md @@ -17,14 +17,27 @@ For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns If we used only two clusters instead, CCC would return a similarity value of 0.02. Therefore, the CCC algorithm (shown below) searches for this optimal number of clusters given a maximum $k$, which is its single parameter $k_{\mathrm{max}}$. +The Clustermatch Correlation Coefficient (CCC) computes a similarity value $c \in \left[0,1\right]$ between any pair of numerical or categorical features/variables $\mathbf{x}$ and $\mathbf{y}$ measured on $n$ objects. +CCC assumes that if two features $\mathbf{x}$ and $\mathbf{y}$ are similar, then the partitioning by clustering of the $n$ objects using each feature separately should match. +For example, given $\mathbf{x}=(11, 27, 32, 40)$ and $\mathbf{y}=10x=(110, 270, 320, 400)$, where $n=4$, partitioning each variable into two clusters ($k=2$) using their medians (29.5 for $\mathbf{x}$ and 295 for $\mathbf{y}$) would result in partition $\Omega^{\mathbf{x}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{x}$, and partition $\Omega^{\mathbf{y}}_{k=2}=(1, 1, 2, 2)$ for $\mathbf{y}$. +Then, the agreement between $\Omega^{\mathbf{x}}_{k=2}$ and $\Omega^{\mathbf{y}}_{k=2}$ can be computed + ![ ](images/intro/ccc_algorithm/ccc_algorithm.svg "CCC algorithm"){width="75%"} The main function of the algorithm, `ccc`, generates a list of partitionings $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (lines 14 and 15), for each feature $\mathbf{x}$ and $\mathbf{y}$. -Then, it computes the ARI between each partition in $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (line 16), and then it keeps the pair that generates the maximum ARI. -Finally, since ARI does not have a lower bound (it could return negative values, which in our case are not meaningful), CCC returns only values between 0 and 1 (line 17). +Then, it computes the correlation coefficient between each partition in $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (line 16), and then it keeps the pair that generates the maximum correlation coefficient. +Finally, since correlation coefficient does not have a lower bound (it could return negative values, which in our case are not meaningful), CCC returns only values between -1 and 1 (line 17). + +The main function of the algorithm, `ccc`, is to generate a list of partitionings $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (lines 14 and 15), for each feature $\mathbf{x}$ and $\mathbf{y}$. +Then, it computes the nonlinear relationship between each partition in $\Omega^{\mathbf{x}}$ and $\Omega^{\mathbf{y}}$ (line 16), +Interestingly, since CCC only needs a pair of partitions to compute a similarity value, any type of feature that can be used to perform clustering/grouping is supported. +If the feature is numerical (lines 2 to 5 in the `get_partitions` function), then quantiles are used for clustering (for example, the median generates $k=2$ clusters of objects), from $k=2$ to $k=k_{\mathrm{max}}$. +If the feature is categorical (lines 7 to 9), the categories are used to group objects together. +Consequently, since features are internally categorized into clusters, numerical and categorical variables can be naturally integrated since clusters do not need an order. + Interestingly, since CCC only needs a pair of partitions to compute a similarity value, any type of feature that can be used to perform clustering/grouping is supported. If the feature is numerical (lines 2 to 5 in the `get_partitions` function), then quantiles are used for clustering (for example, the median generates $k=2$ clusters of objects), from $k=2$ to $k=k_{\mathrm{max}}$. If the feature is categorical (lines 7 to 9), the categories are used to group objects together. diff --git a/content/08.05.methods.data.md b/content/08.05.methods.data.md index eaaf068..297695c 100644 --- a/content/08.05.methods.data.md +++ b/content/08.05.methods.data.md @@ -2,4 +2,10 @@ We downloaded GTEx v8 data for all tissues, normalized using TPM (transcripts per million), and focused our primary analysis on whole blood, which has a good sample size (755). We selected the top 5,000 genes from whole blood with the largest variance after standardizing with $log(x + 1)$ to avoid a bias towards highly-expressed genes. -We then computed Pearson, Spearman, MIC and CCC on these 5,000 genes across all 755 samples on the TPM-normalized data, generating a pairwise similarity matrix of size 5,000 x 5,000. +We then computed Pearson, Spearman, MIC and CCC on these 5,000 genes across all samples on the TPM-normalized data, generating a pairwise similarity matrix of size 5,000 x 5,000. + +We downloaded GTEx v8 data for all tissues, normalized using TPM (transcripts per million), and focused our primary analysis on whole blood, which has a good sample size (755). +We selected the top 5,000 genes from whole blood with the largest variance after standardizing with $log(x + 1)$ to avoid a bias towards highly-expressed genes. +We then computed Pearson, Spearman, MIC and CCC on these 5,000 genes across all samples on the TPM-normalized data, generating a pairwise similarity matrix of size 5,000 x 5,000. + +We downloaded GTEx v8 diff --git a/content/08.15.methods.giant.md b/content/08.15.methods.giant.md index f21c735..e9d1b91 100644 --- a/content/08.15.methods.giant.md +++ b/content/08.15.methods.giant.md @@ -1,6 +1,6 @@ ### Tissue-specific network analyses using GIANT {#sec:giant} -We accessed tissue-specific gene networks of GIANT using both the web interface and web services provided by HumanBase [@url:https://hb.flatironinstitute.org]. +We accessed tissue-specific gene networks of GIANT using both the web interface and web services provided by HumanBase. The GIANT version used in this study included 987 genome-scale datasets with approximately 38,000 conditions from around 14,000 publications. Details on how these networks were built are described in [@doi:10.1038/ng.3259]. Briefly, tissue-specific gene networks were built using gene expression data (without GTEx samples [@url:https://hb.flatironinstitute.org/data]) from the NCBI's Gene Expression Omnibus (GEO) [@doi:10.1093/nar/gks1193], protein-protein interaction (BioGRID [@pmc:PMC3531226], IntAct [@doi:10.1093/nar/gkr1088], MINT [@doi:10.1093/nar/gkr930] and MIPS [@pmc:PMC148093]), transcription factor regulation using binding motifs from JASPAR [@doi:10.1093/nar/gkp950], and chemical and genetic perturbations from MSigDB [@doi:10.1073/pnas.0506580102]. @@ -9,11 +9,27 @@ Gold standards for tissue-specific functional relationships were built using exp Then, one naive Bayesian classifier (using C++ implementations from the Sleipnir library [@pmid:18499696]) for each of the 144 tissues was trained using these gold standards. Finally, these classifiers were used to estimate the probability of tissue-specific interactions for each gene pair. +We accessed tissue-specific gene networks of GIANT using both the web interface and web services provided by HumanBase. +The GIANT version used in this study included 987 genome-scale datasets with approximately 38,000 conditions from around 14,000 publications. +Details on how these networks were built are described in [@doi:10.1038/ng.3259]. +Briefly, tissue-specific gene networks were built using gene expression data (without GTEx samples [@url:https://hb.flatironinstitute.org/data]) from the NCBI's Gene Expression Omnibus (GEO) [@doi:10.1093/nar/gks1193], protein-protein interaction (BioGRID [@pmc:PMC3531226], IntAct [@doi:10.1093/nar/gkr1088], MINT [@doi:10.1093/nar/gkr930] and MIPS [@pmc:PMC148093]), transcription factor regulation using binding motifs from JASPAR [@doi:10.1093/nar/gkp950], and chemical and genetic perturbations from MSigDB [@doi:10.1073/pnas.0506580102]. +Gene expression data were log-transformed, and the Pearson correlation was computed for each gene pair, normalized using the Fisher's z transform, and z-scores discretized into different bins. +Gold standards for tissue-specific functional relationships were built using expert curation and experimentally derived gene annotations from the Gene Ontology. +Then, one naive Bayesian classifier (using C++ implementations from the Sleipnir library [@pmid:18499696]) for each of the 144 tissues was trained using these gold standards. +Finally, these classifiers were used to estimate the probability of tissue-specific interactions for each gene -For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain -1) a predicted gene network for blood (manually selected to match whole blood in GTEx) and -2) a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services. + +For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain a predicted gene network for blood (manually selected to match whole blood in GTEx) and a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services. +Briefly, the tissue prediction approach trains a machine learning model using comprehensive transcriptional data with human-curated markers of different cell lineages (e.g., macrophages) as gold standards. +Then, these models are used to predict other cell lineage-specific genes. +In addition to reporting this predicted tissue or cell lineage, we computed the average probability of interaction between all genes in the network retrieved from GIANT. +Following the default procedure used in GIANT, we included the top 15 genes with the highest probability of interaction with the queried gene pair for each network. + +For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain a predicted gene network for blood (manually selected to match whole blood in GTEx) and a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services. Briefly, the tissue prediction approach trains a machine learning model using comprehensive transcriptional data with human-curated markers of different cell lineages (e.g., macrophages) as gold standards. Then, these models are used to predict other cell lineage-specific genes. In addition to reporting this predicted tissue or cell lineage, we computed the average probability of interaction between all genes in the network retrieved from GIANT. -Following the default procedure used in GIANT, we included the top 15 genes with the highest probability of interaction with the queried gene pair for each network. +Following the default procedure used in GIANT, we included the top 15 genes with the highest probability of interaction with the queried gene pair for each network. + +For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain a predicted gene network for blood (manually selected to match whole blood in GTEx) and a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services. +Briefly, the tissue diff --git a/content/08.20.methods.mic.md b/content/08.20.methods.mic.md index cf22ab3..7f61cbd 100644 --- a/content/08.20.methods.mic.md +++ b/content/08.20.methods.mic.md @@ -4,3 +4,7 @@ We used the Python package `minepy` [@doi:10.1093/bioinformatics/bts707; @url:ht In GTEx v8 (whole blood), we used MICe (an improved implementation of the original MIC introduced in [@Reshef2016]) with the default parameters `alpha=0.6`, `c=15` and `estimator='mic_e'`. We used the `pairwise_distances` function from `scikit-learn` [@Sklearn2011] to parallelize the computation of MIC on GTEx. For our computational complexity analyses (see [Supplementary Material](#sec:time_test)), we ran the original MIC (using parameter `estimator='mic_approx'`) and MICe (`estimator='mic_e'`). + +We used the Python package `minepy` [@doi:10.1093/bioinformatics/bts707; @url:https://github.com/minepy/minepy] (version 1.2.5) to estimate the MIC coefficient. +In GTEx v8 (whole blood), we used MICe (an improved implementation of the original MIC introduced in [@Reshef2016]) with the default parameters `alpha=0.6`, `c=15` and `estimator='mic_e'`. +We diff --git a/content/20.00.supplementary_material.md b/content/20.00.supplementary_material.md index bb72625..94604ee 100644 --- a/content/20.00.supplementary_material.md +++ b/content/20.00.supplementary_material.md @@ -21,7 +21,7 @@ Figure @fig:dist_coefs_mic c shows that these two coefficients relate almost lin ### Supplementary Note 2: Computational complexity of coefficients {#sec:time_test .page_break_before} We also compared CCC with the other coefficients in terms of computational complexity. -Although CCC and MIC might identify similar gene pairs in gene expression data (see [here](#sec:mic)), the use of MIC in large datasets remains limited due to its very long computation time, despite some methodological/implementation improvements [@doi:10.1093/bioinformatics/bts707; @doi:10.1371/journal.pone.0157567; @doi:10.4137/EBO.S13121; @doi:10.1038/srep06662; @doi:10.1098/rsos.201424]. +Although CCC and MIC might identify similar gene pairs in gene expression data (see [here](#sec:mic)), the use of MIC in large datasets remains limited due to its very long computation time, despite some methodological/implementation improvements [@doi:10.1093/bioinformatics/bts707; @doi:10.1371/journal.pone.0157567; @doi:10.1038/srep06662; @doi:10.1098/rsos.201424]. The original MIC implementation uses ApproxMaxMI, a computationally demanding heuristic estimator [@doi:10.1126/science.1205438]. Recently, a more efficient implementation called MICe was proposed [@Reshef2016]. These two MIC estimators are provided by the `minepy` package [@doi:10.1093/bioinformatics/bts707], a C implementation available for Python.