
Proposal: Include 'combinatorial' argument to t-test functions #9860

@Rambatino

Proposal

Add a new argument, combinatorial, so that all pairwise combinations of the vectors in the input ndarray are compared to one another:

stats.ttest_ind(ndarray, combinatorial=True,...)

If the ndarray contains 5 vectors (np.array([array1, array2, ..., array5])) then the function would calculate 1 vs 2, 1 vs 3, 1 vs 4, ..., 2 vs 3, 2 vs 4, ..., 4 vs 5.
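For k vectors that is k*(k-1)/2 pairwise comparisons, e.g.:

>>> from math import comb
>>> comb(5, 2)  # 5 vectors yield 10 comparisons
10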

Current Limitations

Currently the ttest_ind function requires that you pass in two array-like variables, a and b, and compares the one sample against the other:

>>> import numpy as np
>>> from scipy import stats
>>> array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> array2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> stats.ttest_ind(array1, array2)
Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345)

or, with 2-D ndarrays, column by column:

>>> stats.ttest_ind(np.array([array1, array2]).T, np.array([array2, array1]).T)
Ttest_indResult(statistic=array([ 4.38178046, -4.38178046]), pvalue=array([0.00046449, 0.00046449]))

Due to the structure of the arguments, a single call only compares a with b. Comparing multiple groups (i.e. 1 vs 2, 2 vs 3 & 1 vs 3) therefore requires generating the combinations yourself, for example with the itertools library:

>>> import itertools as it
>>> array3 = np.array([7, 7, 7, 7, 7, 7, 7, 7, 7])
>>> ndarray = np.array([array1, array2, array3])
>>> [ stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :]) for x in it.combinations(range(0, ndarray.shape[0]), 2)]
[Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345),
Ttest_indResult(statistic=-2.1908902300206647, pvalue=0.04360954625650814),
Ttest_indResult(statistic=-inf, pvalue=0.0)]

or by stacking the relevant columns up front:

>>> stats.ttest_ind(np.array([array1, array2, array1]).T, np.array([array2, array3, array3]).T)
Ttest_indResult(statistic=array([ 4.38178046,        -inf, -2.19089023]), pvalue=array([0.00046449, 0.        , 0.04360955]))

However, it's not immediately obvious how to construct the input variables when doing combinations, and as the number of groups grows, very large input ndarrays have to be constructed, containing large amounts of duplicated data (a sketch of that construction follows). It therefore seems sensible for the arguments to support a general combinatorial case.
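For reference, building those stacked inputs generically for k groups might look like this (a sketch; every group's data ends up duplicated k - 1 times in total across the two arrays):

>>> pairs = list(it.combinations(range(ndarray.shape[0]), 2))
>>> a = np.array([ndarray[i] for i, _ in pairs]).T
>>> b = np.array([ndarray[j] for _, j in pairs]).T
>>> stats.ttest_ind(a, b)
Ttest_indResult(statistic=array([ 4.38178046, -2.19089023,        -inf]), pvalue=array([0.00046449, 0.04360955, 0.        ]))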

Internally I wrote a series of functions for the different t-tests that do support this input using matrix broadcasting, removing the need both for Python looping external to the function and for constructing large input ndarrays:

>>> p_values, score = t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> p_values
array([[1.00000000e+00, 4.64490252e-04, 4.36095463e-02],
       [4.64490252e-04,            nan, 0.00000000e+00],
       [4.36095463e-02, 0.00000000e+00,            nan]])
>>> score
array([[ 0.        ,  4.38178046, -2.19089023],
       [-4.38178046,         nan,        -inf],
       [ 2.19089023,         inf,         nan]])

Internal function:

def t_test_(v, weights, opts):
    # v: 2-D array with one group per row; NaN entries are ignored.
    # weights is unused here; opts is only read for 'equal_variance'.
    base = (~np.isnan(v)).sum(axis=1)   # per-group sample sizes
    mean = np.nanmean(v, axis=1)        # per-group means
    t_nom = mean[:, np.newaxis] - mean  # k x k grid of mean differences
    s = np.nanvar(v, axis=1, ddof=1)    # per-group sample variances
    if opts['equal_variance']:
        # Student's t-test: pooled variance, n_i + n_j - 2 degrees of freedom
        sp = ((base - 1) * s + (base[:, np.newaxis] - 1) * s[:, np.newaxis]) / (base + base[:, np.newaxis] - 2)
        score = t_nom / (np.sqrt(sp) * np.sqrt(1 / base + 1 / base[:, np.newaxis]))
        dof = base + base[:, np.newaxis] - 2
    else:
        # Welch's t-test with Welch-Satterthwaite degrees of freedom
        s_n = s / base
        score = t_nom / np.sqrt(s_n + s_n[:, np.newaxis])
        dof_denom = (s_n**2 / (base - 1)) + (s_n[:, np.newaxis]**2 / (base[:, np.newaxis] - 1))
        dof = (s_n + s_n[:, np.newaxis])**2 / dof_denom
    # two-sided p-values via the survival function, plus the t statistics
    return stats.t.sf(np.abs(score), dof) * 2, score
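As a sanity check, the broadcast grid can be verified entry-wise against pairwise stats.ttest_ind calls (score[i, j] corresponds to comparing row i against row j):

>>> p_values, score = t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> for i, j in it.combinations(range(ndarray.shape[0]), 2):
...     ref = stats.ttest_ind(ndarray[i], ndarray[j])
...     assert np.isclose(score[i, j], ref.statistic)
...     assert np.isclose(p_values[i, j], ref.pvalue)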

Benchmark

>>> array4 = np.array([11, 3, 2, 1, 5, 2, 1, 3, 6])
>>> array5 = np.array([11, 3, 2, 1, 5, 2, 1, 6, 6])
>>> ndarray = np.array([array1, array2, array3, array4, array5])
>>> def test_current():
...     return [ stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :], nan_policy='omit') for x in it.combinations(range(0, ndarray.shape[0]), 2)]
>>> def test_new():
...     return t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> print(timeit.timeit("test_current()", "from __main__ import test_current", number=100000))
202.31119344299987
>>> print(timeit.timeit("test_new()", "from __main__ import test_new", number=100000))
20.02507360100026

This demonstrates that the broadcast function is around 10 times faster than the Python-loop approach when comparing 5 groups with one another.

Summary

What are the implications of adding a feature such as this? Statistically, running multiple tests without correcting the p-values inflates the family-wise error rate and would permit p-hacking/mining - however I'd argue that that decision rests with the statistician, not the statistical library.
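If such a correction is left to the caller, even a simple Bonferroni adjustment is a one-liner over the returned grid (shown purely as an illustration, not as part of the proposed API):

>>> p_values, score = t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> n_comparisons = ndarray.shape[0] * (ndarray.shape[0] - 1) // 2
>>> adjusted = np.minimum(p_values * n_comparisons, 1.0)  # Bonferroni: p * m, capped at 1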

I'd like to know people's thoughts on a feature such as - or similar to - this.
