Regarding pd.DataFrame.std() function #12230

Closed
capissimo opened this issue Feb 4, 2016 · 5 comments

@capissimo

It looks like it calculates an adjusted value of std, i.e. the sum of squared deviations from the mean, (x_i - x_bar) ** 2, is divided by n-1 instead of n before the square root is taken.
This is a reasonable measure when n <= 50, since it avoids underestimating std. But when n > 50 it makes practically no difference whether n or n-1 is used in the denominator. I think this should be taken into account.
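To put numbers on the size of that difference, here is a minimal sketch using only the Python standard library. The two values share the same sum of squared deviations, so their ratio is sqrt(n / (n - 1)) whatever the data: roughly 1.7% at n = 30 and 1% at n = 50. The data and the helper name `std_ratio` are made up for illustration.

```python
import math
import statistics


def std_ratio(n):
    """Ratio of the n-1 standard deviation to the n standard deviation.

    Both share the same sum of squared deviations, so the ratio depends
    only on the sample size n (n >= 2), not on the data.
    """
    return math.sqrt(n / (n - 1))


# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

std_n1 = statistics.stdev(data)   # divides by n - 1
std_n = statistics.pstdev(data)   # divides by n

print(std_n1, std_n, std_n1 / std_n)

# Relative gap between the two conventions for a few sample sizes.
for n in (5, 10, 30, 50, 1000):
    print(n, round((std_ratio(n) - 1) * 100, 2), "% larger with n - 1")
```

The gap shrinks steadily as n grows; there is no sharp cutoff at 30 or 50, so whether it is negligible depends on the precision needed.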

Many thanks for your excellent piece of work.

B.R., Andre Logunov, Russia
https://plus.google.com/u/1/

@kawochen
Contributor

kawochen commented Feb 4, 2016

What do you mean? They are both biased estimators of standard deviation.

@capissimo
Author

Hi!

Here's one of the links to the material in question:
http://studopedia.org/1-31408.html, unfortunately in Russian. The passage
below explains everything, and I translated it:

Both proposed estimators, the sample variance and the corrected (unbiased)
sample variance, are consistent estimators of the population variance, and
the difference between them is noticeable only when the number of
observations n is small. For n > 30 the sample variance can perfectly well
be used as the estimate of the population variance... To estimate the
population standard deviation, the corrected standard deviation is used,
which is equal to the square root of the corrected variance...

Where am I wrong?

@TomAugspurger
Contributor

In the docstring it says:

Return unbiased standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

So it is working as intended. If you want the n version, use df.std(ddof=0).
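A minimal sketch of the two calls side by side, assuming pandas and NumPy are installed; the column name and values are made up for illustration. Note that `numpy.std` defaults to `ddof=0`, the opposite of pandas, which is a common source of mismatched results.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame; the values are illustrative only.
df = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

sample_std = df.std()              # default ddof=1: divides by N - 1
population_std = df.std(ddof=0)    # divides by N

# NumPy uses ddof=0 by default, so it matches df.std(ddof=0).
numpy_std = np.std(df["x"].to_numpy())

print(sample_std["x"], population_std["x"], numpy_std)
```

Both pandas results are Series indexed by column name, so a single column's value is read with `sample_std["x"]`.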

@capissimo
Author

Thanks for the lead, I see now.))

@kawochen
Contributor

kawochen commented Feb 5, 2016

@TomAugspurger Then the doc string is either wrong or misleading. If you treat the data as a sample of some distribution, then the doc string is wrong regarding bias. The unbiased estimator of standard deviation has no closed form without further knowledge of the distribution. You can think of the data as the population, with each data point having the same probability, but then there should be no bias to speak of (and N should be used).
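The bias point can be checked with a small simulation. This is only a sketch under stated assumptions: normally distributed data with true standard deviation 1, samples of size 5, and a fixed seed; the function name `simulate` and all parameters are made up for illustration. The N-1 variance averages close to the true variance, but its square root still averages below the true standard deviation, and the N version is lower again.

```python
import random


def simulate(n=5, trials=20000, sigma=1.0, seed=0):
    """Average the N-1 variance, the N-1 std, and the N std over many samples."""
    rng = random.Random(seed)
    sum_var1 = sum_std1 = sum_std0 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
        sum_var1 += ss / (n - 1)                   # unbiased for the variance
        sum_std1 += (ss / (n - 1)) ** 0.5          # square root of the above
        sum_std0 += (ss / n) ** 0.5                # the N (ddof=0) version
    return sum_var1 / trials, sum_std1 / trials, sum_std0 / trials


mean_var1, mean_std1, mean_std0 = simulate()
print(mean_var1, mean_std1, mean_std0)
```

The square root is a concave function, so taking the root of an unbiased variance estimate pulls the average down. The size of that shortfall depends on the distribution, which is why no single correction factor works in general.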

jreback pushed a commit that referenced this issue Feb 6, 2016
xref #12230

Author: Ka Wo Chen <kawoc@tepper.cmu.edu>

Closes #12234 from kawochen/DOC-std and squashes the following commits:

d224abb [Ka Wo Chen] DOC: Improve doc string for std