Bug in pd.Series.mean() #6915
In some cases, the mean is computed incorrectly. Numpy, however, does the correct calculation. There is no problem with the standard deviation calculation. The following is an example.
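A minimal sketch of the comparison being described, using hypothetical data; the wrong mean shows up only on affected 32-bit builds of the pandas version in question:

```python
import numpy as np
import pandas as pd

# Hypothetical int32 series whose true sum (5e9) exceeds 2**31 - 1.
s = pd.Series(np.full(100000, 50000, dtype=np.int32))

print(s.mean())           # wrong on affected 32-bit builds: the sum wraps
print(np.mean(s.values))  # numpy upcasts to a float64 accumulator: 50000.0
print(s.std())            # the standard deviation comes out correctly
```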
Are you able to narrow it down to an example dataframe that shows the issue and that you can post here? |
I have the dataframe that I found the problem with -- it is quite large but I can delete all but the two variables where the calculation goes awry and post the dataset somewhere. |
If you can, please do, that would be interesting. Do you have NaNs? And what is the result of |
After some playing around, one hypothesis is that the bug has something to do with int32 vs int64 dtypes. So initially, I exported it to csv and tried it on another computer and I got the right answer. I then saved it as an hdf5 file and I got the wrong answer. Looking at the dtypes:
and
The two files are at:
In answer to your questions:
|
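A sketch of the round-trip check being described here; paths, key, and column layout are placeholders:

```python
import pandas as pd

# CSV round-trip: integer columns are re-parsed, typically as int64.
df_csv = pd.read_csv('data.csv')          # placeholder path
print(df_csv.dtypes)

# HDF5 round-trip: the stored dtype is preserved, so a column written
# as int32 comes back as int32.
df_h5 = pd.read_hdf('data.h5', 'df')      # placeholder path and key
print(df_h5.dtypes)
```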
Could you post output of describe() both pre- and post-load? |
I'm not sure what you mean by pre-load. |
what @jtratner means is: show EXACTLY what you are doing with those files, every command in an ipython session, so it can simply be copy-pasted and reproduced |
You mean something like:
|
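Presumably a session along these lines, assuming the HDF file and key shared later in the thread:

```python
import pandas as pd

df = pd.read_hdf('pandas_mean_error.h5', 'stateemp')
print(df.area.describe())   # the mean differs between 32-bit and 64-bit runs
```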
That said, in addition to int32 vs int64, it seems to be architecture specific -- I've also done the same on a 64 bit machine where the mean gets computed correctly. I was doing the construction of the datasets using a 32 bit machine. I'm running one of the flavors of Manjaro linux on all my machines. |
ok; that shows the problem (but the reason we ask for ALL the info is that you didn't mention it was in an hdf file). can you show what you are writing? (e.g. you are reading the file, but show the code to generate the frame that you are then writing). it should be completely self-contained and copy-pastable. |
I did in my comment from 4 days ago. As I also mentioned in that comment, it appeared to be a difference between int32 and int64 so it's not obvious to me why hdf5 matters -- I observed the problem before saving anything out to hdf5. |
ok, then show how you created them. you are showing a symptom, but not how you created the data. So it is still impossible to even guess where the problem lies. |
As you can see, I haven't done anything with area but read in another datafile and merge it into the larger dataset. |
w/o the original files it's still impossible to look. that said, don't do furthermore, better to use |
Ok, thanks for the workaround, but that doesn't change the fact that series means are being computed incorrectly for certain dtypes on certain architectures. Looking at the code again, the only dataset needed is https://copy.com/722BDiVaQ7BL, and the code I posted can be truncated to the part following:
|
can you show:
|
it's possible that there is a bug in the merging, but I can't reproduce it; that's why I am asking :) |
As I pointed out in the prior comment, the merge command is not necessary (i.e., it is my astype(int) that is making it int32) but nevertheless:
|
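The platform dependence of astype(int) can be made explicit; a minimal sketch, independent of the dataset:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# `int` maps to numpy's platform-native integer: int32 on a 32-bit build,
# int64 on typical 64-bit Linux builds, so the same code yields different
# dtypes on different machines.
print(s.astype(int).dtype)

# Spelling the width out removes the platform dependence.
print(s.astype('int64').dtype)   # int64 everywhere
```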
ok, how about a sample of |
All of stateemp is here: https://copy.com/722BDiVaQ7BL in 'stateemp'. I.e., read_hdf('pandas_mean_error.h5','stateemp'). The code was posted before but again, could be further truncated to reproduce the problem. |
how did you write the hdf file? you have an odd filter in there |
I can't be absolutely sure which complib I used, though, since I was doing it interactively to produce datasets for you folks to play with. |
ok...can you save w/o bzip... i don't have it installed (and it's generally not that efficient anyhow; use 'blosc') |
Here is one that has been bloscified. |
ok. still doesn't show anything, this looks fine.
|
Maybe it is very architecture specific -- the work was initially done on one of the early atom cpus (embarrassed grin) and then I also tested it on a P4. The atom's cpu flags are:
I can get the P4 flags if that would be useful. |
weird.... ok... use the methods I described above for conversions and generally keep to 64-bit dtypes. closing... reopen if you need |
So why is int32 problematic? Presumably it takes less space. If I'm not wrong then there are cases where I'd even prefer to use int8, especially on RAM constrained machines. |
u CAN use |
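A sketch of that trade-off with a hypothetical series: narrow dtypes are fine for storage as long as reductions use a wide accumulator.

```python
import numpy as np
import pandas as pd

# int8 storage: one byte per value instead of eight.
s = pd.Series(np.random.randint(0, 100, size=1000000, dtype=np.int8))
print(s.memory_usage())               # roughly 1 MB of data vs ~8 MB as int64

# Safe reductions force a 64-bit accumulator explicitly.
print(s.values.sum(dtype=np.int64))
print(s.values.mean())                # np.mean accumulates in float64 anyway
```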
This bug is really bothering me so I installed a 32 bit linux (also Manjaro) on a spare partition of my 64 bit machine and same problem -- df.area.describe() gives me a mean of 58.9... So on three separate machines with 32 bit OSes, I'm observing this issue. I don't think this is an isolated case due to special hardware. I'm happy to try some alternate 32 bit flavor of linux (Ubuntu, Debian, what have you). Alternatively, if int32 is really not an important dtype, perhaps it should simply be dropped. |
@zoof it certainly could be a bug on 32-bit. But need a simple reproducible test. |
You have the dataset. I read it in on 32 bit OSes, ask it to df.area.describe() and I get 58.927762 for the mean. When I do the same on a 64 bit linux, I get a mean of 23785.447812. Do I need to do exhaustive tests on every conceivable hardware and versions of linux? BTW, I have now confirmed the problem on a live session of 32 bit Ubuntu 14.04 from which I am now typing. |
you are missing the point. without a simple test i can't even begin to figure out where the problem is. in order for this to move forward you need to make a simple test. you can even read in a data set, but it has to be short, preferably from a string. it could be a numpy, python or pandas bug |
It does not seem to be a numpy bug:
If it is a Python bug, it works in a way that does not affect numpy. It is not clear to me why it has to be short. Is it standard operating procedure that if the precise bug cannot be pinned down, the issue is closed, despite the fact that it is clearly a bug? I'll say again, you have the dataset. You could produce the erroneous result if you tried it with a 32 bit linux distro. |
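The distinction comes down to how numpy accumulates; a sketch with hypothetical values:

```python
import numpy as np

a = np.full(100000, 50000, dtype=np.int32)   # true sum 5e9 > 2**31 - 1

# np.mean always accumulates integer input in float64, so it is correct
# on every platform.
print(a.mean())                 # 50000.0

# np.sum defaults its accumulator to the platform integer (int32 on
# 32-bit builds), so the same data can silently wrap there.
print(a.sum())                  # wrong on 32-bit platforms
print(a.sum(dtype=np.int64))    # explicit wide accumulator: 5000000000
```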
it probably is a bug, but pandas has a test suite of almost 5000 tests; in order to patch anything, it has to be a test. how can we include this massive data set in the test suite? it's simply not possible. you found an issue, great - but u have to narrow it down by producing a much smaller example that can be put directly in the code. someone has to run the test and see that it fails on a particular platform, and then test a fix that works and does not break anything else. what u have provided is an indication of a bug. if you do narrow it down pls reopen the issue |
The best I can do is cut the series down to about 90,000 observations. Taking every 58th observation gives:
Using every 59th fails to produce an error. |
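A sketch of that downsampling, assuming the series comes from the uploaded HDF file:

```python
import pandas as pd

s = pd.read_hdf('pandas_mean_error.h5', 'stateemp')

small = s.iloc[::58]   # every 58th observation, roughly 90,000 rows
print(small.mean())    # still wrong on the affected 32-bit setups

ok = s.iloc[::59]      # every 59th observation
print(ok.mean())       # no longer triggers the error
```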
pls show |
Should I install and try the calculation again? |
sorry meant in any event, this is a numpy bug, see numpy/numpy#4638 |
installing |
Many thanks! Will do so. |
thanks for reporting. sometimes bugs are hard to find! |
@zoof ok...going to fix on the pandas side, as this is broken both in bottleneck (for float32; surprisingly int32 works) and in numpy (for both). they 'won't fix', as it's the user's responsibility to do the upcasting. which is odd, because on 64-bit it works: the return variable is already 64-bit, whilst on 32-bit it is not. weird. |
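A rough sketch of the upcast-before-reduce idea (not the actual pandas patch): force a 64-bit accumulator regardless of platform.

```python
import numpy as np

def mean_with_wide_accumulator(values):
    # Force a 64-bit accumulator so the intermediate sum cannot wrap
    # on platforms where the default integer is 32-bit.
    values = np.asarray(values)
    if np.issubdtype(values.dtype, np.integer):
        return values.sum(dtype=np.int64) / len(values)
    return values.mean(dtype=np.float64)

# Usage on the problematic dtype:
a = np.full(100000, 50000, dtype=np.int32)
print(mean_with_wide_accumulator(a))   # 50000.0 on every platform
```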
That is weird. Seems to me that if upcasting is the user's responsibility, it should at least throw a warning; otherwise, how is the user to know that there is a problem? I guess use only int64 and float64. |
there is a way to intercept it. you see the importance of narrowing down the problem; have to be able to debug it. |