Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in preparePathwaysAndStats #149

Closed
TX-Xiao opened this issue Feb 19, 2024 · 5 comments
Closed

Error in preparePathwaysAndStats #149

TX-Xiao opened this issue Feb 19, 2024 · 5 comments

Comments

@TX-Xiao
Copy link

TX-Xiao commented Feb 19, 2024

I am trying to perform GSEA on GSE48301 results and successfully ran fgsea to extract a list of enriched pathways. However, when I run the collapsePathway function, I receive this error message:

Error in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
Not all stats values are finite numbers

I checked the gene_list I fed into stats using is.finite() and was sure that it only contains finite numbers. I saw a post (https://www.biostars.org/p/9557480/#9587941) reflecting the same issue 11 months ago and seemed to be unsolved.

@assaron
Copy link
Member

assaron commented Feb 19, 2024

Can you please provide a reproducible example?

@TX-Xiao
Copy link
Author

TX-Xiao commented Feb 20, 2024

#General Bioconductor packages
library(Biobase)
library(oligoClasses)

#Annotation and data import packages
library(ArrayExpress)
library(pd.hugene.1.0.st.v1)
library(hugene10sttranscriptcluster.db)

#Quality control and pre-processing packages
library(oligo)
library(arrayQualityMetrics)

#Analysis and statistics packages
library(limma)
library(topGO)
library(ReactomePA)
library(clusterProfiler)

#Plotting and color options packages
library(gplots)
library(ggplot2)
library(geneplotter)
library(RColorBrewer)
library(pheatmap)
library(enrichplot)

#Formatting/documentation packages
#library(rmarkdown)
#library(BiocStyle)
library(dplyr)
library(tidyr)

#Helpers:
library(stringr)
library(matrixStats)
library(genefilter)
library(openxlsx)
#library(devtools)

library(GEOquery)
library(umap)

load series and platform data from GEO

gset <- getGEO("GSE48301", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL6244", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

ex <- exprs(gset)

Translate array IDs to gene symbols

anno_gset <- AnnotationDbi::select(hugene10sttranscriptcluster.db,
keys = (featureNames(gset)),
columns = c("SYMBOL", "GENENAME"),
keytype = "PROBEID")

anno_gset <- subset(anno_gset, !is.na(SYMBOL))
anno_grouped <- group_by(anno_gset, PROBEID)
anno_summarized <-
dplyr::summarize(anno_grouped, no_of_matches = n_distinct(SYMBOL))

head(anno_summarized)
anno_filtered <- filter(anno_summarized, no_of_matches > 1)

head(anno_filtered)
probe_stats <- anno_filtered

nrow(probe_stats)
ids_to_exlude <- (featureNames(gset) %in% probe_stats$PROBEID)

table(ids_to_exlude)
gset_final<-subset(gset,!ids_to_exlude)
validObject(gset_final)
fData(gset_final)$PROBEID <- rownames(fData(gset_final))
fData(gset_final) <- left_join(fData(gset_final), anno_gset)
rownames(fData(gset_final)) <- fData(gset_final)$PROBEID

validObject(gset_final)

GSEA plot

GSEA = function(gene_list, GO_file, pval) {
set.seed(54321)
library(dplyr)
library(fgsea)

if ( any( duplicated(names(gene_list)) ) ) {
warning("Duplicates in gene names")
gene_list = gene_list[!duplicated(names(gene_list))]
}
if ( !all( order(gene_list, decreasing = TRUE) == 1:length(gene_list)) ){
warning("Gene list not sorted")
gene_list = sort(gene_list, decreasing = TRUE)
}
myGO = fgsea::gmtPathways(GO_file)

fgRes <- fgsea::fgsea(pathways = myGO,
stats = gene_list,
minSize=15, ## minimum gene set size
maxSize=400, ## maximum gene set size
nperm=10000) %>%
as.data.frame() %>%
dplyr::filter(padj < !!pval) %>%
arrange(desc(NES))
message(paste("Number of signficant gene sets =", nrow(fgRes)))

message("Collapsing Pathways -----")
concise_pathways = collapsePathways(data.table::as.data.table(fgRes),
pathways = myGO,
stats = gene_list)
fgRes = fgRes[fgRes$pathway %in% concise_pathways$mainPathways, ]
message(paste("Number of gene sets after collapsing =", nrow(fgRes)))

fgRes$Enrichment = ifelse(fgRes$NES > 0, "Up-regulated", "Down-regulated")
filtRes = rbind(head(fgRes, n = 10),
tail(fgRes, n = 10 ))

total_up = sum(fgRes$Enrichment == "Up-regulated")
total_down = sum(fgRes$Enrichment == "Down-regulated")
header = paste0("Top 10 (Total pathways: Up=", total_up,", Down=", total_down, ")")

colos = setNames(c("firebrick2", "dodgerblue2"),
c("Up-regulated", "Down-regulated"))

g1= ggplot(filtRes, aes(reorder(pathway, NES), NES)) +
geom_point( aes(fill = Enrichment, size = size), shape=21) +
scale_fill_manual(values = colos ) +
scale_size_continuous(range = c(2,10)) +
geom_hline(yintercept = 0) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title=header)

output = list("Results" = fgRes, "Plot" = g1)
return(output)
}
gene_list <- exprs(gset_final)[,c("GSM1174427")]-exprs(gset_final)[,c("GSM1174416")]
names(gene_list) =fData(gset_final)$SYMBOL
gene_list <- gene_list[!duplicated(names(gene_list)) ]
gene_list = sort(gene_list, decreasing = TRUE)
head(gene_list)

Need to download gene symbols GO file from gsea website

GO_file = "/Users/txiao/Documents/YuLab/PublicDatabase/c2.all.v2023.2.Hs.symbols.gmt.txt"
res = GSEA(gene_list, GO_file, pval = 0.05)

error occurs

dim(res$Results)
res$Plot

@assaron
Copy link
Member

assaron commented Feb 20, 2024

Thanks for the example. Indeed there is a bug in incorrect handling of NA in gene names.

As a workaround you can apply gene_list <- gene_list[!is.na(names(gene_list))] before running fgsea.

@bug1303
Copy link

bug1303 commented Aug 5, 2024

A colleague got the same error, we checked the known suspects

stopifnot(all(is.finite(ranked_genes)))
stopifnot(!any(is.na(names(ranked_genes))))

Then I found that there was one gene without name/empty string "", and that this causes the same issue.
Leaving this note here in case someone else has this issue.

So, additionally one should check for: any(names(ranked_genes) == "") to be FALSE

Or to follow the previous post, run this before running fgsea:
gene_list <- gene_list[!(is.na(names(gene_list)) | names(gene_list) == "") ]

( ... although one should probably rather take care of why gene names turn up being "" or NA in the first place... )

@assaron
Copy link
Member

assaron commented Aug 5, 2024

@bug1303 thanks for the report, unexpectedly empty strings break look up in named vectors for some reason. I've added the additional check into fgsea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants