Proposal
Add a new argument, combinatorial, so that all combinations of the rows of the input ndarray are compared to one another:
stats.ttest_ind(ndarray, combinatorial=True, ...)
If the ndarray contains 5 vectors (np.array([array1, array2, ..., array5])), then the function would calculate 1 vs 2, 1 vs 3, 1 vs 4, ..., 2 vs 3, 2 vs 4, ..., 4 vs 5.
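To make the intended semantics concrete, here is a minimal sketch of how the proposed behaviour could be emulated today with the existing API; pairwise_ttest_ind is a hypothetical helper name and the dict-of-pairs return format is purely illustrative, not part of the proposal:
import itertools as it
from scipy import stats

def pairwise_ttest_ind(groups, **kwargs):
    # Hypothetical helper: run ttest_ind over every pair of rows in `groups`,
    # mimicking what combinatorial=True would do in a single call.
    return {
        (i, j): stats.ttest_ind(groups[i], groups[j], **kwargs)
        for i, j in it.combinations(range(len(groups)), 2)
    }
With the built-in flag, this looping would instead happen in vectorised form inside scipy (see the broadcasting sketch further down).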
Current Limitations
Currently the ttest_ind function requires that you pass in two array-like variables, a and b, and compares a against b:
>>> import numpy as np
>>> from scipy import stats
>>> array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> array2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> stats.ttest_ind(array1, array2)
Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345)
or as an ndarray
>>> stats.ttest_ind(np.array([array1, array2]).T, np.array([array2, array1]).T)
Ttest_indResult(statistic=array([ 4.38178046, -4.38178046]), pvalue=array([0.00046449, 0.00046449]))
Due to the structure of the arguments, only one comparison can be made per pair of inputs, so comparing multiple groups (i.e. 1 vs 2, 2 vs 3 and 1 vs 3) requires building the combinations externally, for example with the itertools library:
>>> import itertools as it
>>> array3 = np.array([7, 7, 7, 7, 7, 7, 7, 7, 7])
>>> ndarray = np.array([array1, array2, array3])
>>> [ stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :]) for x in it.combinations(range(0, ndarray.shape[0]), 2)]
[Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345),
Ttest_indResult(statistic=-2.1908902300206647, pvalue=0.04360954625650814),
Ttest_indResult(statistic=-inf, pvalue=0.0)]
or this way:
>>> stats.ttest_ind(np.array([array1, array2, array1]).T, np.array([array2, array3, array3]).T)
Ttest_indResult(statistic=array([ 4.38178046, -inf, -2.19089023]), pvalue=array([0.00046449, 0. , 0.04360955]))
However, it's not immediately obvious how to construct the input variables when doing combinations, and as the number of groups grows, very large input ndarrays would have to be constructed, containing large amounts of duplicated data. It therefore seems sensible for the arguments to support a general combinatorial case.
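To spell out that construction (my own sketch, continuing the three-group example above): each group's row must be copied into the stacked inputs once per pair it participates in, i.e. k-1 times for k groups, so for 100 groups there would be 4950 column pairs and 99 copies of every row:
>>> pairs = list(it.combinations(range(ndarray.shape[0]), 2))
>>> a = np.array([ndarray[i] for i, j in pairs]).T
>>> b = np.array([ndarray[j] for i, j in pairs]).T
>>> stats.ttest_ind(a, b)
Ttest_indResult(statistic=array([ 4.38178046, -2.19089023, -inf]), pvalue=array([0.00046449, 0.04360955, 0.        ]))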
Internally I wrote a series of functions for the different t-tests that do support this input using matrix broadcasting, removing the need for Python looping external to the function or for constructing large input ndarrays:
>>> p_values, score = t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> p_values
array([[1.00000000e+00, 4.64490252e-04, 4.36095463e-02],
[4.64490252e-04, nan, 0.00000000e+00],
[4.36095463e-02, 0.00000000e+00, nan]])
>>> score
array([[ 0. , 4.38178046, -2.19089023],
[-4.38178046, nan, -inf],
[ 2.19089023, inf, nan]])
Internal function:
import numpy as np
from scipy import stats

def t_test_(v, weights, opts):
    # v: 2-D array with one group per row; NaNs mark missing observations.
    # weights is currently unused; opts selects pooled vs. Welch variance.
    base = np.sum(~np.isnan(v), axis=1)   # per-group sample sizes
    mean = np.nanmean(v, axis=1)          # per-group means
    t_nom = mean[:, np.newaxis] - mean    # matrix of pairwise mean differences
    s = np.nanvar(v, axis=1, ddof=1)      # per-group sample variances
    if opts['equal_variance']:
        # Student's t-test: pooled variance, dof = n_i + n_j - 2
        sp = ((base - 1) * s + (base[:, np.newaxis] - 1) * s[:, np.newaxis]) / (base + base[:, np.newaxis] - 2)
        score = t_nom / (np.sqrt(sp) * np.sqrt(1 / base + 1 / base[:, np.newaxis]))
        dof = base + base[:, np.newaxis] - 2
    else:
        # Welch's t-test with Welch-Satterthwaite degrees of freedom
        s_n = s / base
        score = t_nom / np.sqrt(s_n + s_n[:, np.newaxis])
        dof_denom = (s_n**2 / (base - 1)) + (s_n[:, np.newaxis]**2 / (base[:, np.newaxis] - 1))
        dof = (s_n + s_n[:, np.newaxis])**2 / dof_denom
    # Two-sided p-values alongside the full matrix of t statistics
    return stats.t.sf(np.abs(score), dof) * 2, score
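As a quick sanity check (my addition, not part of the original write-up): the entry at row i, column j of each returned matrix corresponds to group i tested against group j, so the (0, 1) entries reproduce the very first stats.ttest_ind(array1, array2) result above; the diagonal is degenerate because a group is being tested against itself:
>>> score[0, 1], p_values[0, 1]
(4.3817804600413295, 0.00046449025177634345)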
Benchmark
>>> array4 = np.array([11, 3, 2, 1, 5, 2, 1, 3, 6])
>>> array5 = np.array([11, 3, 2, 1, 5, 2, 1, 6, 6])
>>> ndarray = np.array([array1, array2, array3, array4, array5])
>>> def test_current():
...     return [stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :], nan_policy='omit') for x in it.combinations(range(0, ndarray.shape[0]), 2)]
>>> def test_new():
...     return t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> import timeit
>>> print(timeit.timeit("test_current()", "from __main__ import test_current", number=100000))
202.31119344299987
>>> print(timeit.timeit("test_new()", "from __main__ import test_new", number=100000))
20.02507360100026
This demonstrates that the broadcast-based function is around 10 times faster than the Python-looping approach when comparing 5 groups with one another.
Summary
What are the implications of adding a feature such as this? Statistically, running multiple tests without correction inflates the overall false-positive rate and can facilitate p-hacking/mining; however, I'd argue that that decision rests with the statistician, not the statistical library.
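If a correction for multiple comparisons is desired, nothing stops the caller from applying one directly to the returned p-value matrix; a minimal Bonferroni sketch (my own illustration, using the matrices from the earlier three-group example):
>>> n_groups = p_values.shape[0]
>>> n_comparisons = n_groups * (n_groups - 1) // 2
>>> p_adjusted = np.minimum(p_values * n_comparisons, 1.0)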
I'd like to know people's thoughts on a feature such as this, or something similar.