
Discrepancies in 'k' Output: Command Line vs. Python Environment #75

Open
mluciarr opened this issue Oct 26, 2023 · 4 comments

Comments


mluciarr commented Oct 26, 2023

Hi Dylan!

Yesterday I ran the cNMF in the terminal (Mac M1) and everything went smoothly until the final step, where I encountered an unusual error:

(py37) Mac-Studio-de-Maria:tables marialuciaromerorivero$ cnmf k_selection_plot --output-dir ./1_cNFM_trials --name 1_cNFM_trials

(py37) Mac-Studio-de-Maria:tables marialuciaromerorivero$ cnmf consensus --output-dir ./1_cNFM_trials --name 1_cNFM_trials --components 11 --local-density-threshold 2 --show-clustering

Traceback (most recent call last):
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/bin/cnmf", line 8, in <module>
    sys.exit(main())
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 1050, in main
    close_clustergram_fig=True)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 765, in consensus
    spectra_tpm = self.refit_spectra(tpm.X, norm_usages.astype(tpm.X.dtype))
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 680, in refit_spectra
    return(self.refit_usage(X.T, usage.T).T)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 658, in refit_usage
    _, rf_usages = self._nmf(X, nmf_kwargs=refit_nmf_kwargs)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 551, in _nmf
    (usages, spectra, niter) = non_negative_factorization(X, **nmf_kwargs)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 1107, in non_negative_factorization
    W, H, n_iter = est._fit_transform(X, W=W, H=H, update_H=update_H)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 1597, in _fit_transform
    W, H = self._check_w_h(X, W, H, update_H)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 1470, in _check_w_h
    _check_init(H, (self._n_components, n_features), "NMF (input H)")
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 58, in _check_init
    % (whom, shape, np.shape(A))
ValueError: Array with the wrong shape passed to NMF (input H). Expected (11, 224758), but got (11, 224759)
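The error says the consensus spectra carry one more gene column than the TPM matrix, so the two files were presumably produced from inputs with different gene counts. A minimal sketch of the shape invariant that sklearn's NMF enforces during the refit step (the function name and checks here are illustrative, not cNMF's actual API):

```python
import numpy as np

def check_refit_shapes(X, H):
    """X: cells x genes matrix; H: k x genes fixed spectra.
    sklearn's NMF requires the gene dimensions to match exactly."""
    k, n_genes_h = H.shape
    n_cells, n_genes_x = X.shape
    if n_genes_h != n_genes_x:
        raise ValueError(
            f"Array with the wrong shape passed to NMF (input H). "
            f"Expected ({k}, {n_genes_x}), but got ({k}, {n_genes_h})")
    return True

# reproduce the mismatch from the traceback (contents don't matter, only shapes)
X = np.empty((5, 224758))   # cells x genes TPM
H = np.empty((11, 224759))  # one extra gene column
try:
    check_refit_shapes(X, H)
except ValueError as e:
    print(e)
```

Comparing `tpm.X.shape[1]` against the number of genes in the consensus spectra before rerunning `cnmf consensus` would localize which input is stale.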

Because of that, I ran it in a Python environment instead, which worked, except that the k_selection_plot used to select the optimal 'k' is completely different from the one I obtained in the terminal. I used exactly the same parameters in both runs, yet the plots look completely different. Here they are:

Terminal k_selection_plot: shows the optimal 'k' as 11
[image: 1_cNFM_trials k_selection]

Python environment k_selection_plot: shows the optimal 'k' as 7
[image: 1_cNFM_trials_comp7_py k_selection]

  1. Why is there such a significant difference in the results?
  2. Which of them should I trust?

Thank you very much in advance!
Looking forward to your reply :)

Lucia

@LiuCanidk

@mluciarr Sorry to interrupt, but could you tell me how you determine the optimal value from the plot? Is it the first peak in the stability curve?
[image]
Considering the trade-off between error and stability, should I choose 5, 6, or 7 as the optimal value? The plot above is from a bulk RNA dataset of mine.

@mluciarr (Author)

Hi @LiuCanidk ,

Well, looking at your results, I wouldn't be confident in any of the three options you mention. If I were you, I would run it at all three resolutions and see which one best fits what you expect, or makes the most sense for your data. I would start with 5, because its stability is slightly higher, and then see what happens as you increase the number; I bet the results won't change significantly, since the three values are consecutive.
I may not be helping you much, but in your case it is quite tricky to confirm the optimal number of factors.
Let me know if you try it and see many differences. I'm intrigued.

Regards.

Lucia

@LiuCanidk

Hi @mluciarr
Thanks for your reply. I didn't try many K values; I picked 7 as the final one and found that it works fine.
Although I can't show you those results, I found other differences in another single-cell project of mine using cNMF. As the selection plot shows, error steadily decreases as k grows, but the stability trend is not monotonic. So I think the key is to select for high stability while maintaining low error.

[image]
The selection plot above shows that k=3 has the highest stability and also lower error than k=2; k=5 is the first point where stability increases again, also with lower error; and k=7 and k=10 each show an increase in stability relative to the previous value. So I believe the key is to look for those local increases (since the curve is non-monotonic). I then picked 3, 5, 7, and 10 to compare their clustering plots.
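The heuristic described above (flag each k where stability rises relative to k-1 while error keeps falling) can be sketched in a few lines; the stability and error numbers below are made up for illustration, not read off the plots:

```python
def candidate_ks(ks, stability, error):
    """Return the k values where stability increases versus the previous k
    while reconstruction error still decreases (illustrative heuristic)."""
    picks = []
    for i in range(1, len(ks)):
        if stability[i] > stability[i - 1] and error[i] < error[i - 1]:
            picks.append(ks[i])
    return picks

# toy curves loosely mimicking a cNMF k_selection_plot
ks        = [2, 3, 4, 5, 6, 7, 8, 9, 10]
stability = [0.60, 0.80, 0.55, 0.70, 0.50, 0.65, 0.45, 0.40, 0.55]
error     = [1.00, 0.90, 0.82, 0.76, 0.71, 0.67, 0.64, 0.62, 0.60]
print(candidate_ks(ks, stability, error))  # → [3, 5, 7, 10]
```

Any such shortlist still needs the clustergram and downstream annotation checks described here to pick a final k.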

k=3
[image]
k=5
[image]
k=7
[image]
k=10
[image]
As you can see, k=3 formed the most robust clusters. k=5 also worked fine (after setting the filtering-threshold parameter). k=7 shows slight discrepancies across iterations/spectra (n.iter=100; although 19% were filtered here, the tutorial mentions a default of filtering 30%), and k=10 has the worst clustering result, which is clearly unacceptable (I did change the local-density-threshold parameter, but saw no change in this discrepancy pattern, though it can appear in different clusters). So I chose k=5 and k=7 and performed GSEA to annotate these gene expression programs and see whether they are sensible.

Downstream results are not shown here because of the non-ideal GSEA results and the program usage distribution across cells, which I suspect stems from the single-cell data quality. Hope it helps.

Thank you again for your advice.

Owner

dylkot commented May 7, 2024

Yes, I'll just add that choosing K is hard, and I recommend looking at the results for a few values of K (as you would with clustering). Usually only one or two GEPs change at the margin while the majority remain fairly stable, so I recommend exploring which GEPs change across the different values of K. I also think GSEA can only help to some extent, because the gene sets available for analyzing the programs often don't actually tell us what the programs are. So I also recommend looking at the top weighted genes in the gene_spectra_score output.
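Pulling the top weighted genes per program from a gene_spectra_score-style table (programs as rows, genes as columns) can be done with pandas; the DataFrame below is a toy stand-in, and a real output file would typically be loaded with something like `pd.read_csv(path, sep='\t', index_col=0)`:

```python
import pandas as pd

def top_genes(spectra_scores, n=3):
    """spectra_scores: DataFrame of programs (rows) x genes (columns).
    Returns the n highest-scoring genes for each program."""
    return {prog: row.nlargest(n).index.tolist()
            for prog, row in spectra_scores.iterrows()}

# toy scores standing in for a gene_spectra_score table
scores = pd.DataFrame(
    [[3.0, 0.1, 2.5, 0.2],
     [0.2, 4.1, 0.3, 1.9]],
    index=["GEP1", "GEP2"],
    columns=["MKI67", "CD3E", "TOP2A", "IL7R"])
print(top_genes(scores, n=2))
# → {'GEP1': ['MKI67', 'TOP2A'], 'GEP2': ['CD3E', 'IL7R']}
```

Scanning those ranked gene lists per program is often more informative than GSEA alone when the available gene sets don't match the biology.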
