-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible to set reference condition for two_sample tests? #205
Comments
Hi @adkinsrs thanks for the issue. This is currently not supported as grouping formats are lost during parsing, the current state may also be desirable because this makes the test output invariable wrt data permutation. But I see your point here, enabling this would however require and extension of the code, specifically:
I will leave this open until addressed and would invite code contributions from interested parties via PR on dev branch! |
@davidsebfischer, thank you for the response. From my testing and tinkering it seems the issue is using numpy's diffxpy/diffxpy/testing/utils.py Line 114 in b8c6ae0
The I was thinking I could create a PR for this, but being it was a one-line change, I would want to know if this change is feasible for the bigger picture. Also should this be expanded as an option for the t-test/rank-test functions, or does that not matter so much since some people may not care which condition is the baseline/reference? |
Thanks a lot for looking into this @adkinsrs! The issue with using pandas functions here is that the argument diffxpy/diffxpy/testing/utils.py Line 109 in b8c6ae0
pandas.Series directly to t_test::grouping instead of going via the string?
We could however extend your fix, encoding the error in the category order of the diffxpy/diffxpy/testing/utils.py Line 105 in b8c6ae0
We could leave this option out for |
Actually I passed a string to an adata.obs column to the "grouping" parameter. I will paste my transformation code from the REPL Python 3.7.3 (default, Apr 19 2021, 15:07:07)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import anndata, diffxpy.api as de
/opt/Python-3.7.3/lib/python3.7/site-packages/pandas/compat/__init__.py:117: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
>>> adata = anndata.read("./df726e89-b7ac-d798-83bf-2bd69d7f3b52.h5ad")
>>> adata.obs
cluster replicate louvain
index
Ute_P1_Neg_0 TEC 5 TEC
Ute_P1_Neg_5 TEC 3 TEC
Ute_P1_GFP_3 SC (i) 8 SC (i)
Ute_P1_GFP_0 SC (ii) 23 SC (ii)
Ute_P1_tdTom_3 HC (ii) 15 HC (ii)
... ... ... ...
Ute_P1_Neg_31 TEC 25 TEC
Ute_P1_GFP_43 SC (ii) 32 SC (ii)
Ute_P1_GFP_45 SC (ii) 39 SC (ii)
Ute_P1_GFP_49 SC (ii) 30 SC (ii)
Ute_P1_tdTom_68 HC (iii-iv) 40 HC (iii-iv)
[158 rows x 3 columns]
>>> de_filter1 = adata.obs.cluster.isin(["HC (i)"])
>>> de_filter2 = adata.obs.cluster.isin(["SC (i)"])
>>> selected1 = adata[de_filter1, :]
>>> selected2 = adata[de_filter2, :]
>>> de_selected = selected1.concatenate(selected2) # Concatenate to sort in the order I want
>>> de_selected.obs
cluster replicate louvain batch
index
Ute_P1_tdTom_9-0 HC (i) 6 HC (i) 0
Ute_P1_tdTom_0-0 HC (i) 13 HC (i) 0
Ute_P1_Neg_26-0 HC (i) 1 HC (i) 0
Ute_P1_tdTom_19-0 HC (i) 12 HC (i) 0
Ute_P1_tdTom_37-0 HC (i) 9 HC (i) 0
Ute_P1_tdTom_20-0 HC (i) 14 HC (i) 0
Ute_P1_tdTom_25-0 HC (i) 11 HC (i) 0
Ute_P1_tdTom_47-0 HC (i) 4 HC (i) 0
Ute_P1_tdTom_57-0 HC (i) 10 HC (i) 0
Ute_P1_tdTom_52-0 HC (i) 8 HC (i) 0
Ute_P1_tdTom_66-0 HC (i) 2 HC (i) 0
Ute_P1_tdTom_63-0 HC (i) 3 HC (i) 0
Ute_P1_tdTom_60-0 HC (i) 5 HC (i) 0
Ute_P1_tdTom_64-0 HC (i) 7 HC (i) 0
Ute_P1_GFP_3-1 SC (i) 8 SC (i) 1
Ute_P1_GFP_1-1 SC (i) 9 SC (i) 1
Ute_P1_Neg_3-1 SC (i) 4 SC (i) 1
Ute_P1_GFP_2-1 SC (i) 11 SC (i) 1
Ute_P1_GFP_10-1 SC (i) 10 SC (i) 1
Ute_P1_Neg_15-1 SC (i) 2 SC (i) 1
Ute_P1_Neg_20-1 SC (i) 1 SC (i) 1
Ute_P1_Neg_19-1 SC (i) 3 SC (i) 1
Ute_P1_GFP_9-1 SC (i) 6 SC (i) 1
Ute_P1_GFP_26-1 SC (i) 7 SC (i) 1
Ute_P1_GFP_34-1 SC (i) 14 SC (i) 1
Ute_P1_GFP_37-1 SC (i) 13 SC (i) 1
Ute_P1_GFP_35-1 SC (i) 15 SC (i) 1
Ute_P1_Neg_34-1 SC (i) 5 SC (i) 1
Ute_P1_GFP_48-1 SC (i) 18 SC (i) 1
Ute_P1_GFP_39-1 SC (i) 17 SC (i) 1
Ute_P1_GFP_40-1 SC (i) 16 SC (i) 1
Ute_P1_GFP_53-1 SC (i) 12 SC (i) 1
>>> de_selected.obs.cluster.unique() # HC is encountered before SC
array(['HC (i)', 'SC (i)'], dtype=object)
>>> de_selected_opposite = selected2.concatenate(selected1)
>>> de_selected_opposite.obs.cluster.unique() # SC is encountered before HC. The "np.unique(de_selected_opposite.obs.cluster)" call gives HC first, due to the sorting of uniq vals
array(['SC (i)', 'HC (i)'], dtype=object)
>>> de.test.t_test(de_selected, grouping="cluster", gene_names=de_selected.var["gene_symbol"], is_logged=True).summary()
gene pval qval log2fc mean zero_mean zero_variance
0 0610005C13Rik NaN NaN 0.000000 0.000000 True True
1 0610009B22Rik 0.242574 0.584453 -1.766539 5.018637 False False
2 0610009E02Rik 0.528240 0.753421 0.893326 2.728064 False False
3 0610009L18Rik 0.724149 0.872133 0.249389 0.570136 False False
4 0610010F05Rik 0.473403 0.716048 -1.073669 1.446289 False False
... ... ... ... ... ... ... ...
22807 Zyg11a NaN NaN 0.000000 0.000000 True True
22808 Zyg11b 0.272259 0.584453 -1.095174 0.903192 False False
22809 Zyx 0.488327 0.727949 1.132955 6.172250 False False
22810 Zzef1 0.952420 0.980600 -0.075449 0.599622 False False
22811 Zzz3 0.972575 0.989865 0.046684 2.544517 False False
[22812 rows x 7 columns]
>>> de.test.t_test(de_selected_opposite, grouping="cluster", gene_names=de_selected_opposite.var["gene_symbol"], is_logged=True).summary()
gene pval qval log2fc mean zero_mean zero_variance
0 0610005C13Rik NaN NaN 0.000000 0.000000 True True
1 0610009B22Rik 0.242574 0.584453 1.766539 5.018637 False False
2 0610009E02Rik 0.528240 0.753421 -0.893326 2.728064 False False
3 0610009L18Rik 0.724149 0.872133 -0.249389 0.570136 False False
4 0610010F05Rik 0.473403 0.716048 1.073669 1.446289 False False
... ... ... ... ... ... ... ...
22807 Zyg11a NaN NaN 0.000000 0.000000 True True
22808 Zyg11b 0.272259 0.584453 1.095174 0.903192 False False
22809 Zyx 0.488327 0.727949 -1.132955 6.172250 False False
22810 Zzef1 0.952420 0.980600 0.075449 0.599622 False False
22811 Zzz3 0.972575 0.989865 -0.046684 2.544517 False False
[22812 rows x 7 columns]
# For each gene between the two summary dataframes, the pval/qval is the same, but the log2fc is of the opposite sign due to switched order of HC (i) and SC (i) |
We are currently using @adkinsrs version in production to address issues we had. |
I'd like to mention that on version |
I was originally trying to use the
rank_genes_groups()
function in scanpy to get p-values and logfoldchanges between a single condition and a reference condition. My end-goal was to take the DE results and make a volcano plot. After reading some of the scanpy Github issue tickets (scverse/scanpy#397), I learned of diffxpy and how it seems to be recommended for DEG over scanpy in general situations.I like how there is less need to transform the
diffxpy.api.test.t_test
output when compared to the output ofscanpy.tl.rank_genes_groups
to get the data into a format to create a volcano plot for my code. However, it seems that no matter which order I supply the two conditions for the t-test, the resulting output is the same. This results in some volcano plots that, when compared to a plot made usingscanpy.tl.rank_genes_groups
output, has about the same p-val but a log-fold change with the opposite sign.Is there some easy way of forcing one of the conditions to be the reference condition for the diffxpy DEG tests? I apologize if I'm missing some inherent understanding or limitation of the tool.
Also, I think this question relates to #188 and to #184.
The text was updated successfully, but these errors were encountered: