Pandas: why pandas.Series.std() is quite different from numpy.std() #10489

Closed
infozyzhang opened this issue Jul 2, 2015 · 1 comment

@infozyzhang

I ran the following two snippets.

numpy.std([766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346])

0

pd.Series([766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346, 766897346]).std(ddof=0)

10.119288512538814

The two lists are identical, but the results are quite different. I think the pandas result must be wrong. I am on the latest version, 0.16.2, with Python 3.4.

May I ask why? Is it a bug?

@shoyer
Member

shoyer commented Jul 2, 2015

Closing this as a duplicate of #10242

Pandas uses a correct formula for the standard deviation. However, the formula we use is not as numerically stable as the formula used by numpy. Pull requests to fix this would definitely be welcome!
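
To illustrate what "not as numerically stable" can mean here, below is a minimal sketch in plain Python. It assumes the unstable formula is the textbook single-pass "sum of squares" one; the thread does not say which formula pandas 0.16.2 actually uses, and `naive_std` and `two_pass_std` are hypothetical helpers, not pandas or numpy functions. With values near 7.7e8, the sum of squares is near 5.9e18, where adjacent float64 values are 1024 apart, so subtracting two nearly equal totals can leave pure rounding error. The reported 10.119288512538814 is the square root of 102.4, which is consistent with an error of 1024 divided by n = 10.

```python
import math

def naive_std(values, ddof=0):
    """Single-pass 'sum of squares' formula (hypothetical helper).

    Algebraically correct, but it subtracts two huge, nearly equal
    numbers, so float64 rounding error can dominate the result.
    """
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for v in values:
        v = float(v)
        total += v
        total_sq += v * v
    var = (total_sq - total * total / n) / (n - ddof)
    # Rounding can push the variance slightly below zero; clamp it.
    return math.sqrt(max(var, 0.0))

def two_pass_std(values, ddof=0):
    """Two-pass formula (hypothetical helper): compute the mean first,
    then sum squared deviations from it. The deviations are small, so
    no catastrophic cancellation occurs."""
    n = len(values)
    mean = math.fsum(float(v) for v in values) / n
    sq_dev = math.fsum((float(v) - mean) ** 2 for v in values)
    return math.sqrt(sq_dev / (n - ddof))

data = [766897346] * 10
print("single-pass:", naive_std(data))   # may be nonzero due to rounding
print("two-pass:   ", two_pass_std(data))
```

The two-pass version returns exactly zero for this input because the mean is exactly representable and every deviation is zero. The single-pass result depends on how the rounding falls, so treat its printed value as illustrative only.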

@shoyer shoyer closed this as completed Jul 2, 2015