check_markers function using each cell as a different cell type #68

Open
hd4git opened this issue Aug 29, 2024 · 3 comments

hd4git commented Aug 29, 2024

Hi,

I am trying to check the markers in a marker list, and it seems the cell type information is not being used.
total_nominated is close to the total cell count in the cds rather than the cell count for that particular cell_type.
Could someone please let me know how I can fix this?

Here is what my cds looks like:

 class: cell_data_set 
dim: 24672 68342 
metadata(1): cds_version
assays(1): counts
rownames(24672): AL627309.1 AL627309.5 ... LHFPL1 CSAG1
rowData names(1): gene_short_name
colnames(68342): T0_1_AAACCCAAGCCGTTAT-1 T0_1_AAACCCAAGGTGCAGT-1 ...
  AIaw4_TTTGTTGTCTGAGCAT-1 AIaw4_TTTGTTGTCTTCCTAA-1
colData names(2): cell_type Size_Factor
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Part of my marker list looks like this:

>cluster0
expressed: NEAT1, MALAT1, TFF3, TRPS1, ESR1, SOX4
not expressed: TUBB, TUBA1B, TUBB4B, STMN1, JPT1, PPP1R14B, TUBA1C, RHOD, TXNRD1

>cluster1
expressed: TMSB4X, GPRC5A, ITGB1, TUBB3, MYL12A, TUBA1A, MDK, CD24, MYL12B, MARCKSL1, LMO7, RASD1, PMEPA1, ANXA2, TNFRSF12A, SPTSSB, SYTL2, MIDN, MEX3A, TES, BTG1, SQSTM1, TNFRSF11B, BASP1, ITGA2, TXNRD1, PHLDA2, ID3
not expressed: HMGB2, BLVRB, LDHA, ARMT1, HMGN2, MT2A, PCLAF

>cluster2
expressed: GAS5, BCAS3, BASP1, ANXA2, SNHG29, C19orf33, CITED2, ZFAS1, CLU, FXYD3, ADIRF, S100A11, TACSTD2, CST3, S100A10, CD24, GSTM3, CEBPD, TACC1, ZFP36L2, LGALS3, GRN, BAG1, MSX2, SH3YL1, CD47, YPEL3, CSTB, PCSK1N, SNHG7, CLIC3, CD55, NPC2, EEF1A2, EIF3F, CAST, ITM2B
not expressed: TFF1, ARMT1, HMGB1, SPDEF, PTMS, MT2A, XBP1, TFF3, SCD, PCLAF, HNRNPAB, TRPS1, ESR1, HMGN2, CCND1, LAPTM4B, FASN, CENPX, LDHA, RMND1, SKA2, SYNGR2, HMGB2, ENO1, BZW1

marker_check <- check_markers(ai_cds, "data/markerList3.txt",
                              db=org.Hs.eg.db,
                              cds_gene_id_type = "SYMBOL",
                              marker_file_gene_id_type = "SYMBOL", 
                              propogate_markers = TRUE,
                              use_tf_idf = TRUE)


Please let me know if more information is required.

Thanks !

hpliner (Collaborator) commented Sep 3, 2024

Hi, total_nominated can sometimes be very high if a ubiquitously expressed gene is present in the cell type definition. Could this be the case? Please share an example of the problematic check_markers output for further troubleshooting.
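
To illustrate the effect described above, here is a toy sketch in base R. It assumes that the total counts every cell nominated by at least one gene in the definition. The vectors and cell counts are made up for illustration and are not Garnett internals.

```r
# Toy data (hypothetical): 1000 cells, three marker genes for one cell type.
# Each logical vector is TRUE where that gene "nominates" a cell.
n_cells <- 1000
cell_index <- seq_len(n_cells)

specific_a <- cell_index <= 100                    # nominates cells 1..100
specific_b <- cell_index > 50 & cell_index <= 200  # nominates cells 51..200
ubiquitous <- cell_index <= 950                    # nominates cells 1..950

# Assumption: the total is the number of cells nominated by ANY gene (a union).
total_specific  <- sum(specific_a | specific_b)
total_with_ubiq <- sum(specific_a | specific_b | ubiquitous)

print(c(specific_only = total_specific, with_ubiquitous = total_with_ubiq))
```

With only the two specific genes the total is 200 cells. Adding the single broadly expressed gene raises it to 950, even though the specific genes have not changed.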

hd4git (Author) commented Sep 3, 2024

Hi,
Thanks for your reply. I would expect a few genes to be ubiquitously expressed, but every value of total_nominated is close to the total number of cells in the cds.
These are the cell counts for each cluster:

>table(ai_cds@colData$cell_type)

 cluster0 cells  cluster1 cells cluster10 cells cluster11 cells cluster12 cells 
          12534           10645            1034             564             493 
 cluster2 cells  cluster3 cells  cluster4 cells  cluster5 cells  cluster6 cells 
          10543            9066            8647            5226            4926 
 cluster7 cells  cluster8 cells  cluster9 cells 
           2407            1204            1053 

Here is the range of values for total_nominated:

> range(marker_check_exp$total_nominated)
[1] 66789 68342

And here is what the marker check output looks like. I have removed the genes marked not_expressed from the marker_check output:

> head(marker_check_exp[order(marker_check_exp$cell_type),], n=20L)
     marker_gene         gene_id parent      cell_type in_cds nominates
158         ESR1 ENSG00000091831   root cluster0 cells   TRUE     25286
262        TRPS1 ENSG00000104447   root cluster0 cells   TRUE     34738
561         SOX4 ENSG00000124766   root cluster0 cells   TRUE     50941
919         TFF3 ENSG00000160180   root cluster0 cells   TRUE     26116
1462       NEAT1 ENSG00000245532   root cluster0 cells   TRUE     57275
1468      MALAT1 ENSG00000251562   root cluster0 cells   TRUE     59423
13     TNFRSF12A ENSG00000006327   root cluster1 cells   TRUE     47474
31        GPRC5A ENSG00000013588   root cluster1 cells   TRUE     31156
226       MYL12A ENSG00000101608   root cluster1 cells   TRUE     60491
345        RASD1 ENSG00000108551   root cluster1 cells   TRUE     23564
370          MDK ENSG00000110492   root cluster1 cells   TRUE     44941
458          ID3 ENSG00000117318   root cluster1 cells   TRUE     28851
477       MYL12B ENSG00000118680   root cluster1 cells   TRUE     65432
552       PMEPA1 ENSG00000124225   root cluster1 cells   TRUE     38762
680         BTG1 ENSG00000133639   root cluster1 cells   TRUE     43128
701          TES ENSG00000135269   root cluster1 cells   TRUE     32870
711         LMO7 ENSG00000136153   root cluster1 cells   TRUE     28040
742        SYTL2 ENSG00000137501   root cluster1 cells   TRUE     40576
863        ITGB1 ENSG00000150093   root cluster1 cells   TRUE     51096
938       SQSTM1 ENSG00000161011   root cluster1 cells   TRUE     57484
     total_nominated exclusion_dismisses inclusion_ambiguates   most_overlap
158            66789                 228                  228 cluster1 cells
262            66789                 418                  418 cluster1 cells
561            66789                1822                 1822 cluster1 cells
919            66789                 216                  216 cluster1 cells
1462           66789                 682                  682 cluster2 cells
1468           66789                 826                  826 cluster2 cells
13             68324                   3                    3 cluster2 cells
31             68324                   0                    0           <NA>
226            68324                   3                    3 cluster2 cells
345            68324                   1                    1 cluster0 cells
370            68324                   1                    1 cluster0 cells
458            68324                   0                    0           <NA>
477            68324                   8                    8 cluster3 cells
552            68324                   0                    0           <NA>
680            68324                   1                    1 cluster0 cells
701            68324                   1                    1 cluster0 cells
711            68324                   0                    0           <NA>
742            68324                   4                    4 cluster0 cells
863            68324                   2                    2 cluster0 cells
938            68324                   5                    5 cluster2 cells
        ambiguity marker_score summary
158  9.016847e-03     19.90841      Ok
262  1.203293e-02     23.60628      Ok
561  3.576687e-02     16.66523      Ok
919  8.270792e-03     21.40151      Ok
1462 1.190746e-02     39.14426      Ok
1468 1.390034e-02     37.22593      Ok
13   6.319248e-05     69.04731      Ok
31   0.000000e+00     45.60037      Ok
226  4.959415e-05     88.09859      Ok
345  4.243762e-05     34.34287      Ok
370  2.225140e-05     65.63026      Ok
458  0.000000e+00     42.22674      Ok
477  1.222643e-04     94.61048      Ok
552  0.000000e+00     56.73263      Ok
680  2.318679e-05     62.97675      Ok
701  3.042288e-05     47.96309      Ok
711  0.000000e+00     41.03975      Ok
742  9.858044e-05     58.80789      Ok
863  3.914201e-05     74.49327      Ok
938  8.698073e-05     83.40892      Ok

I would really appreciate it if you could help me figure out the mistake and make it work. Please let me know if you need more information.

Thanks !

hpliner (Collaborator) commented Sep 14, 2024

Hi, I think the column you want to look at is "nominates" rather than "total_nominated". "total_nominated" shows the total number of cells nominated by all genes in the definition, which means that if there is even one ubiquitously expressed gene in the definition, total_nominated will be very high. I recommend looking at the ambiguity scores (the plot_markers function makes them easy to inspect) and removing the genes that nominate ubiquitously. Hope this helps.
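
As a rough sketch of that triage, the per-gene "nominates" values can be compared with the total number of cells. The data frame below re-types the cluster0 rows posted earlier in this thread so the snippet runs on its own. The frac_nominated column and the 0.5 cutoff are assumptions made for illustration, not Garnett defaults.

```r
# Total cells in the cds (68342 columns, from the cds summary above).
n_cells <- 68342

# cluster0 rows copied from the check_markers output posted above.
marker_check_exp <- data.frame(
  marker_gene = c("ESR1", "TRPS1", "SOX4", "TFF3", "NEAT1", "MALAT1"),
  cell_type = "cluster0 cells",
  nominates = c(25286, 34738, 50941, 26116, 57275, 59423),
  total_nominated = 66789,
  stringsAsFactors = FALSE
)

# Fraction of all cells that each gene nominates on its own.
marker_check_exp$frac_nominated <- marker_check_exp$nominates / n_cells

# Hypothetical cutoff: flag genes that nominate more than half of all cells.
cutoff <- 0.5
broad <- marker_check_exp[marker_check_exp$frac_nominated > cutoff, ]
broad <- broad[order(-broad$frac_nominated),
               c("marker_gene", "nominates", "frac_nominated")]
print(broad)

# With the garnett package loaded, plot_markers(marker_check) gives the
# graphical view of the same check_markers output.
```

On these rows the cutoff flags MALAT1, NEAT1, SOX4 and TRPS1, each of which nominates more than half of the 68342 cells by itself. Those are the first candidates to drop from the cluster0 definition before re-running check_markers.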
