Proposal
Add a new argument, combinatorial, so that all combinations of the rows of the input ndarray are compared to one another:
stats.ttest_ind(ndarray, combinatorial=True, ...)
If the ndarray contains 5 vectors (np.array([array1, array2, ..., array5])), then the function would calculate 1 vs 2, 1 vs 3, 1 vs 4, ..., 2 vs 3, 2 vs 4, ..., 4 vs 5.
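To make the intended semantics concrete, here is a minimal sketch of how the proposed behaviour could be emulated today with the existing API; pairwise_ttest_ind is a hypothetical helper name and the dict-of-pairs return format is purely illustrative, not part of the proposal:
import itertools as it
from scipy import stats

def pairwise_ttest_ind(groups, **kwargs):
    # Hypothetical helper: run ttest_ind over every pair of rows in `groups`,
    # mimicking what combinatorial=True would do in a single call.
    return {
        (i, j): stats.ttest_ind(groups[i], groups[j], **kwargs)
        for i, j in it.combinations(range(len(groups)), 2)
    }
With the built-in flag, this looping would instead happen in vectorised form inside scipy (see the broadcasting sketch further down).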
Current Limitations
Currently the ttest_ind function requires that you pass in two array-like variables, a and b, and compares a against b:
>>> import numpy as np
>>> from scipy import stats
>>> array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> array2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> stats.ttest_ind(array1, array2)
Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345)
or as an ndarray
>>> stats.ttest_ind(np.array([array1, array2]).T, np.array([array2, array1]).T)
Ttest_indResult(statistic=array([ 4.38178046, -4.38178046]), pvalue=array([0.00046449, 0.00046449]))
Due to the structure of the arguments, only one comparison can be made per pair of inputs, so comparing multiple groups (i.e. 1 vs 2, 2 vs 3 and 1 vs 3) requires building the combinations externally, for example with the itertools library:
>>> import itertools as it
>>> array3 = np.array([7, 7, 7, 7, 7, 7, 7, 7, 7])
>>> ndarray = np.array([array1, array2, array3])
>>> [ stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :]) for x in it.combinations(range(0, ndarray.shape[0]), 2)]
[Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345),
Ttest_indResult(statistic=-2.1908902300206647, pvalue=0.04360954625650814),
Ttest_indResult(statistic=-inf, pvalue=0.0)]
or this way:
>>> stats.ttest_ind(np.array([array1, array2, array1]).T, np.array([array2, array3, array3]).T)
Ttest_indResult(statistic=array([ 4.38178046, -inf, -2.19089023]), pvalue=array([0.00046449, 0. , 0.04360955]))
However, it's not immediately obvious how to construct the input variables when doing combinations, and as the number of groups grows, very large input ndarrays would have to be constructed, containing large amounts of duplicated data. It therefore seems sensible for the arguments to support a general combinatorial case.
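To spell out that construction (my own sketch, continuing the three-group example above): each group's row must be copied into the stacked inputs once per pair it participates in, i.e. k-1 times for k groups, so for 100 groups there would be 4950 column pairs and 99 copies of every row:
>>> pairs = list(it.combinations(range(ndarray.shape[0]), 2))
>>> a = np.array([ndarray[i] for i, j in pairs]).T
>>> b = np.array([ndarray[j] for i, j in pairs]).T
>>> stats.ttest_ind(a, b)
Ttest_indResult(statistic=array([ 4.38178046, -2.19089023, -inf]), pvalue=array([0.00046449, 0.04360955, 0.        ]))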
Internally I wrote a series of functions for the different t-tests that do support this input using matrix broadcasting, removing the need for Python looping external to the function or for constructing large input ndarrays:
>>> p_values, score = t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> p_values
array([[1.00000000e+00, 4.64490252e-04, 4.36095463e-02],
[4.64490252e-04, nan, 0.00000000e+00],
[4.36095463e-02, 0.00000000e+00, nan]])
>>> score
array([[ 0. , 4.38178046, -2.19089023],
[-4.38178046, nan, -inf],
[ 2.19089023, inf, nan]])
Internal function:
import numpy as np
from scipy import stats

def t_test_(v, weights, opts):
    # v: 2-D array with one group per row; NaNs mark missing observations.
    # weights is currently unused; opts selects pooled vs. Welch variance.
    base = np.sum(~np.isnan(v), axis=1)   # per-group sample sizes
    mean = np.nanmean(v, axis=1)          # per-group means
    t_nom = mean[:, np.newaxis] - mean    # matrix of pairwise mean differences
    s = np.nanvar(v, axis=1, ddof=1)      # per-group sample variances
    if opts['equal_variance']:
        # Student's t-test: pooled variance, dof = n_i + n_j - 2
        sp = ((base - 1) * s + (base[:, np.newaxis] - 1) * s[:, np.newaxis]) / (base + base[:, np.newaxis] - 2)
        score = t_nom / (np.sqrt(sp) * np.sqrt(1 / base + 1 / base[:, np.newaxis]))
        dof = base + base[:, np.newaxis] - 2
    else:
        # Welch's t-test with Welch-Satterthwaite degrees of freedom
        s_n = s / base
        score = t_nom / np.sqrt(s_n + s_n[:, np.newaxis])
        dof_denom = (s_n**2 / (base - 1)) + (s_n[:, np.newaxis]**2 / (base[:, np.newaxis] - 1))
        dof = (s_n + s_n[:, np.newaxis])**2 / dof_denom
    # Two-sided p-values alongside the full matrix of t statistics
    return stats.t.sf(np.abs(score), dof) * 2, score
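As a quick sanity check (my addition, not part of the original write-up): the entry at row i, column j of each returned matrix corresponds to group i tested against group j, so the (0, 1) entries reproduce the very first stats.ttest_ind(array1, array2) result above; the diagonal is degenerate because a group is being tested against itself:
>>> score[0, 1], p_values[0, 1]
(4.3817804600413295, 0.00046449025177634345)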
Benchmark
>>> array4 = np.array([11, 3, 2, 1, 5, 2, 1, 3, 6])
>>> array5 = np.array([11, 3, 2, 1, 5, 2, 1, 6, 6])
>>> ndarray = np.array([array1, array2, array3, array4, array5])
>>> def test_current():
...     return [stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :], nan_policy='omit') for x in it.combinations(range(0, ndarray.shape[0]), 2)]
>>> def test_new():
...     return t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> import timeit
>>> print(timeit.timeit("test_current()", "from __main__ import test_current", number=100000))
202.31119344299987
>>> print(timeit.timeit("test_new()", "from __main__ import test_new", number=100000))
20.02507360100026
This demonstrates that the broadcast-based function is around 10 times faster than the Python-looping approach when comparing 5 groups with one another.
Summary
What are the implications of adding a feature such as this? Statistically, running multiple tests without correction inflates the overall false-positive rate and can facilitate p-hacking/mining; however, I'd argue that that decision rests with the statistician, not the statistical library.
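If a correction for multiple comparisons is desired, nothing stops the caller from applying one directly to the returned p-value matrix; a minimal Bonferroni sketch (my own illustration, using the matrices from the earlier three-group example):
>>> n_groups = p_values.shape[0]
>>> n_comparisons = n_groups * (n_groups - 1) // 2
>>> p_adjusted = np.minimum(p_values * n_comparisons, 1.0)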
I'd like to know people's thoughts on a feature such as this, or something similar.