Add new function for pairwise T-tests between columns of a dataframe (pingouin.ptests) #291

Merged
merged 10 commits into master from rcorr_ptests on Aug 27, 2022

raphaelvallat
Owner

As discussed in #290, this PR adds the ptests (pairwise_ttest) method to pandas.DataFrame to calculate pairwise T-tests between the columns of a pandas DataFrame. It can be used as an alternative to the pingouin.pairwise_tests function when the data is in wide format instead of long format. Unlike pairwise_tests, the ptests function only returns the T-values (lower triangle) and p-values (upper triangle). Please see the examples below:

I'm looking for one reviewer to review the PR. Thanks!

>>> import numpy as np
>>> import pandas as pd
>>> import pingouin as pg
>>> # Load an example dataset of personality dimensions
>>> df = pg.read_dataset('pairwise_corr').iloc[:30, 1:]
>>> df.columns = ["N", "E", "O", 'A', "C"]
>>> # Add some missing values
>>> df.iloc[[2, 5, 20], 2] = np.nan
>>> df.iloc[[1, 4, 10], 3] = np.nan
>>> df.head().round(2)
    N     E     O     A     C
0  2.48  4.21  3.94  3.96  3.46
1  2.60  3.19  3.96   NaN  3.23
2  2.81  2.90   NaN  2.75  3.50
3  2.90  3.56  3.52  3.17  2.79
4  3.02  3.33  4.02   NaN  2.85

# Independent pairwise T-tests

>>> df.ptests()
      N       E      O      A    C
N       -     ***    ***    ***  ***
E  -8.397       -                ***
O  -8.332  -0.596      -         ***
A  -8.804    0.12   0.72      -  ***
C  -4.759   3.753  4.074  3.787    -

# Let's compare with SciPy

>>> from scipy.stats import ttest_ind
>>> np.round(ttest_ind(df["N"], df["E"]), 3)
array([-8.397,  0.   ])

# Passing custom parameters to the lower-level :py:func:`scipy.stats.ttest_ind` function

>>> df.ptests(alternative="greater", equal_var=True)
      N       E      O      A    C
N       -
E  -8.397       -                ***
O  -8.332  -0.596      -         ***
A  -8.804    0.12   0.72      -  ***
C  -4.759   3.753  4.074  3.787    -

# Paired T-test, showing the actual p-values instead of stars

>>> df.ptests(paired=True, stars=False, decimals=4)
      N        E       O       A       C
N        -   0.0000  0.0000  0.0000  0.0002
E  -7.0773        -  0.8776  0.7522  0.0012
O  -8.0568  -0.1555       -  0.8137  0.0008
A  -8.3994   0.3191  0.2383       -  0.0009
C  -4.2511   3.5953  3.7849  3.7652       -

# Adjusting for multiple comparisons using the Holm-Bonferroni method

>>> df.ptests(paired=True, stars=False, padjust="holm")
      N       E      O      A      C
N       -   0.000  0.000  0.000  0.001
E  -7.077       -     1.     1.  0.005
O  -8.057  -0.155      -     1.  0.005
A  -8.399   0.319  0.238      -  0.005
C  -4.251   3.595  3.785  3.765      -
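
For readers who want to see what the Holm-Bonferroni step does to the upper triangle, here is a minimal pure-Python sketch. It is not Pingouin's implementation; the helper name `holm_adjust` is hypothetical, and the inputs are the rounded p-values printed in the paired example above, so the three smallest values appear as exactly 0.

```python
def holm_adjust(pvals):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = 0..m-1) is multiplied by (m - k)
        candidate = (m - rank) * pvals[idx]
        # Enforce monotonicity and cap at 1
        running_max = max(running_max, candidate)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Rounded p-values from the paired example, read row by row from the
# upper triangle: N-E, N-O, N-A, N-C, E-O, E-A, E-C, O-A, O-C, A-C
pvals = [0.0000, 0.0000, 0.0000, 0.0002, 0.8776,
         0.7522, 0.0012, 0.8137, 0.0008, 0.0009]
adjusted = holm_adjust(pvals)
print([round(p, 3) for p in adjusted])
```

Rounded to three decimals, this gives 0.001 for N-C, 0.005 for E-C, O-C and A-C, and 1.0 for the three non-significant pairs, which agrees with the `padjust="holm"` table above.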

@raphaelvallat raphaelvallat added the feature request 🚧 New feature or request label Jul 15, 2022
@raphaelvallat raphaelvallat self-assigned this Jul 15, 2022
codecov bot commented Jul 15, 2022

Codecov Report

Merging #291 (4a11e6b) into master (dce908b) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 4a11e6b differs from pull request most recent head 65c4da5. Consider uploading reports for the commit 65c4da5 to get more accurate results

@@            Coverage Diff             @@
##           master     #291      +/-   ##
==========================================
+ Coverage   98.75%   98.76%   +0.01%     
==========================================
  Files          19       19              
  Lines        3298     3332      +34     
  Branches      529      536       +7     
==========================================
+ Hits         3257     3291      +34     
  Misses         24       24              
  Partials       17       17              
Impacted Files Coverage Δ
pingouin/pairwise.py 99.46% <100.00%> (+0.05%) ⬆️

@raphaelvallat raphaelvallat mentioned this pull request Jul 17, 2022
11 tasks
Contributor

@remrama remrama left a comment

Looks great @raphaelvallat , zero problems here.

I was working on something yesterday and realized that I could really use this new feature! I wasn't sure if it was implemented yet, took a look and saw you were still waiting for a review. I hope you don't mind I jumped in 👋

Sidenote, I was hoping there was a bit more flexibility in the upper triangle. I'm sure you already considered that so you probably landed on the current structure for good reason. But just fyi, I was thinking the stars parameter could be replaced with something like upper="pvals" (or "stars", "effsize", ...). I don't wanna get too crazy, but offering a similar flexibility in the lower triangle too (an analogous lower parameter) would allow easy access to non-parametrics etc.
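
To make the suggestion concrete, the idea of an `upper` option could look roughly like the sketch below. Everything in it is hypothetical: `format_upper` and the `upper` argument do not exist in Pingouin, and the star thresholds (0.001, 0.01, 0.05) are the conventional ones, assumed here rather than taken from the PR.

```python
def format_upper(pval, upper="pvals", decimals=3):
    """Hypothetical formatter for a single upper-triangle cell."""
    if upper == "pvals":
        # Show the p-value itself, rounded to the requested precision
        return f"{pval:.{decimals}f}"
    if upper == "stars":
        # Assumed thresholds; the first one the p-value falls below wins
        for threshold, symbol in ((0.001, "***"), (0.01, "**"), (0.05, "*")):
            if pval < threshold:
                return symbol
        return ""
    raise ValueError(f"Unsupported value for upper: {upper!r}")


print(format_upper(0.0012, upper="stars"))
print(format_upper(0.8776, upper="pvals"))
```

An "effsize" branch would need extra information (sample sizes or the raw data), which is the cost discussed further down in the thread.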

Passing custom parameters to the lower-level :py:func:`scipy.stats.ttest_ind` function

>>> df.ptests(alternative="greater", equal_var=True)
Contributor

Heads up, I got an error thrown from ttest_ind when I initially ran this in an environment with scipy version 1.7.3:

ValueError: nan-containing/masked inputs with nan_policy='omit' are currently not supported by permutation tests, one-sided asymptotic tests, or trimmed tests.

I updated straight to 1.9.0 and it worked fine 👍

Owner Author

Thanks for letting me know! I think I'll keep the requirement of scipy>=1.7 for now, and we'll bump it to 1.9 in a future Pingouin release.
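
Until the requirement is bumped, a caller could guard one-sided calls on data with missing values using a version check along these lines. This is only a sketch: the helper names are hypothetical, and the 1.9.0 cutoff is simply the version reported to work in this thread. The thread does not say which SciPy release actually lifted the restriction, so versions between 1.7.3 and 1.9.0 are untested here.

```python
import re

# Assumption: 1.9.0 is the version the reviewer reported as working
MIN_ONE_SIDED_OMIT = (1, 9, 0)


def version_tuple(version):
    """Turn a string such as '1.7.3' or '1.9.0rc1' into (major, minor, patch)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad '1.9' to (1, 9, 0)
        parts.append(0)
    return tuple(parts)


def one_sided_omit_ok(scipy_version):
    """True if this version is at least the one reported to work."""
    return version_tuple(scipy_version) >= MIN_ONE_SIDED_OMIT


print(one_sided_omit_ok("1.7.3"), one_sided_omit_ok("1.9.0"))
```

In practice the argument would be `scipy.__version__`, and a False result could trigger a clearer error message or a fallback that drops missing values before calling the test.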

@raphaelvallat
Owner Author

Thanks so much @remrama, I was desperately waiting for a reviewer :-)

About being more flexible in the output: I agree that this could be a nice addition in a future PR. My worry, and the reason I did not implement it, is that for increased speed we are using the lower-level scipy functions here and not a call to pg.ttest:

if paired:
    func = ttest_rel
else:
    func = ttest_ind
t, p = func(self[a], self[b], **kwargs, nan_policy="omit")

Unfortunately, scipy only returns the T-values and p-values, so we'd have to either recalculate the effect size and degrees of freedom manually or, simpler but probably much slower, call pg.ttest instead.

That said, I'll have to run some benchmarks to see how much slower this is going to be. I tend to be obsessed with code speed, but most of the time the differences are barely visible to users on real-world data...
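
As a rough illustration of the "recalculate manually" option, the standard textbook relations between a T-value, the sample sizes, and a standardized effect size can be written in a few lines. This is a sketch, not code from the PR: the function names are hypothetical, the independent case assumes the equal-variance (Student) test, and the paired effect size is Cohen's d_z, which is not necessarily the definition that pg.ttest reports. With `nan_policy="omit"`, the sample sizes must be the counts of non-missing observations (or complete pairs).

```python
import math


def paired_effect_from_t(t, n_pairs):
    """Cohen's d_z and degrees of freedom for a paired T-test."""
    # t = mean(diff) / (sd(diff) / sqrt(n))  =>  d_z = t / sqrt(n)
    return t / math.sqrt(n_pairs), n_pairs - 1


def independent_effect_from_t(t, n1, n2):
    """Cohen's d (pooled SD) and degrees of freedom, equal-variance T-test."""
    # t = (m1 - m2) / (sp * sqrt(1/n1 + 1/n2))  =>  d = t * sqrt(1/n1 + 1/n2)
    return t * math.sqrt(1 / n1 + 1 / n2), n1 + n2 - 2


# T-values for N vs E from the examples in the PR description
# (30 rows, no missing values in either column)
d_ind, dof_ind = independent_effect_from_t(-8.397, 30, 30)
d_z, dof_paired = paired_effect_from_t(-7.0773, 30)
print(round(d_ind, 3), dof_ind, round(d_z, 3), dof_paired)
```

The arithmetic itself is cheap; the extra work would be tracking the per-pair sample sizes once missing values are omitted.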

Thanks!

@raphaelvallat raphaelvallat merged commit adf0718 into master Aug 27, 2022
@raphaelvallat raphaelvallat deleted the rcorr_ptests branch August 27, 2022 14:52