pd.testing.assert_frame_equal doesn't do precision according to the doc #25068
Comments
Thanks for the report! This does look strange - investigation and PRs would certainly be welcome
pandas-dev/pandas#25068 (comment) which quite often fails on CI. Once it's resolved, the setting can be changed back to check_less_precise=True (or better, =3); until then =2 is used because it works, although that check is less strict.
What I have a hard time grasping is the way this function is designed. Unless I don't understand the documentation, how can it help me to compare these two numbers:
The approach of comparing only n decimals is so strange. These two numbers are almost identical, and no matter how many digits you set, this function will still assert failure if the 9's go on for quite a few more digits. For example, math.isclose has a relative and an absolute tolerance, which makes total sense. So in the example above, I can ask for, say, 0.1% tolerance and those 2 numbers will be close.
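To make the contrast concrete, here is a minimal sketch comparing a digit-by-digit check with a relative-tolerance check. The two values and the helper `same_leading_decimals` are hypothetical, chosen only for illustration; they are not the numbers from the original comment and the helper is not a pandas function.

```python
import math

# Hypothetical pair: nearly identical values that differ in every decimal.
a = 1.0
b = 0.9999999


def same_leading_decimals(x: float, y: float, digits: int) -> bool:
    """Illustrative helper: truncate both values to `digits` decimals and compare."""
    scale = 10 ** digits
    return math.trunc(x * scale) == math.trunc(y * scale)


# Truncating to a fixed number of decimals reports a mismatch,
# whether 3 or 6 decimals are requested.
print(same_leading_decimals(a, b, 3))
print(same_leading_decimals(a, b, 6))

# A 0.1% relative tolerance treats the two values as close.
print(math.isclose(a, b, rel_tol=1e-3))
```

The tolerance-based check depends only on the size of the difference, which is why it is not affected by how long the run of 9's is.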
I think the comparison is done in this function pandas/pandas/_libs/testing.pyx Lines 42 to 46 in f75a220
The code in the comment, however, does not use the (more strict) 0.5 function. In NumPy that function uses 1.5. There is also a comment there now to use NumPy's And Instead, perhaps, it would be easier to replace the function by either something like the new function in NumPy, or perhaps some other function? Cheers
Thank you for digging up the code, @kinow! So the description of the functionality needs to be improved; numpy's version is indeed much better explained. What it does is compare the size of the difference between the 2 numbers against a 0.000x threshold, rather than look at a fixed number of decimals of each number. And then there is the 1/2 factor... Let's rewrite:
to:
so it's easier to understand. So 2 digits gives us:
so 2 digits gives us an absolute tolerance range of [0, 0.005), i.e. [0, 0.5*1e-2), and 3:
so 3 digits gives us an absolute tolerance range of [0, 0.0005), i.e. [0, 0.5*1e-3), and so n digits gives us an absolute tolerance range of [0, 0.5*1e-n). So the description should probably use code instead of words:
I hope I didn't miss a zero somewhere. Except it doesn't seem to be the right function, since if I now apply this same logic to the original failing test to emulate
I get:
none of which is >0.5, i.e. it shouldn't assert. It should assert with
So it's not a question of 0.5 vs 1.5, but 1 vs 10000. Finally, a sanity check of the same numbers with numpy:
doesn't fail, with
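The n-digit rule worked out in this comment, an absolute tolerance range of [0, 0.5*1e-n), can be sketched as follows. The function name and the sample inputs are hypothetical, and, as the comment itself points out, this is not necessarily what pandas applied in the failing test.

```python
def within_decimal_tolerance(a: float, b: float, n: int) -> bool:
    """Return True when |a - b| lies in [0, 0.5 * 10**-n)."""
    return abs(a - b) < 0.5 * 10 ** -n


# Hypothetical inputs chosen only to exercise the rule.
print(within_decimal_tolerance(1.000, 1.004, 2))  # difference 0.004 is below 0.005
print(within_decimal_tolerance(1.000, 1.006, 2))  # difference 0.006 is not below 0.005
print(within_decimal_tolerance(1.000, 1.004, 3))  # difference 0.004 is not below 0.0005
```

Under this reading, the digit count only sets the size of the allowed absolute difference; no individual digits are inspected.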
I just hit this problem. Unsure if this is still on anyone's radar, but it was pretty surprising for me. I also used numpy functions ( If there is interest in updating this parameter, it seems to me like @kinow's suggestion of using these numpy functions is a good path forward, though I'm far from an expert on this.
equal_tsv calls equal_dataframes. equal_dataframes compares non-float columns for exact equality. equal_dataframes converts float columns to numpy arrays and compares for equality within a given tolerance using numpy.allclose. This is used instead of pandas.testing.assert_frame_equal as there is an issue with how that function handles precision (see [pandas.testing.assert_frame_equal doesn't do precision according to the doc #25068](pandas-dev/pandas#25068)). "NAN" values in float columns are considered to be equal.
Dear everybody, any update on this? I'm trying to compare only 2 decimals but it seems it still checks 3...
I just ran into this error with some code I am writing. I have the
Hi @stas00
+1
I'm also starting to think that that function may not be the best for what is documented in
import pandas as pd
df1 = pd.DataFrame([0.15])
df2 = pd.DataFrame([0.16])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)
Or
import pandas as pd
df1 = pd.DataFrame([0.099999])
df2 = pd.DataFrame([0.09])  # 0.099 will pass
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)
The function I mentioned before is not actually called with these values.
pandas/pandas/_libs/testing.pyx Lines 206 to 215 in 0ffee8b
So for
In this case, the ratio is not close enough. So the function is failing. However, the callee function was supposed to compare based on the digits after the decimal. So if If instead of the ratio, we use the function directly with However, if we use the other example pair
Still fails. Looks like Not super confident that that is the proper solution though, so happy if others chime in with their suggestions.
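The ratio-versus-difference point in this comment can be illustrated with a small sketch. It assumes the ratio check has the form |1 - b/a| < 0.5 * 10**-decimal; that formula and both helper names are assumptions made for illustration, not code copied from pandas/_libs/testing.pyx.

```python
def ratio_almost_equal(a: float, b: float, decimal: int) -> bool:
    """Assumed ratio-based check: compare b / a against 1 (requires a != 0)."""
    return abs(1.0 - b / a) < 0.5 * 10 ** -decimal


def diff_almost_equal(a: float, b: float, decimal: int) -> bool:
    """Difference-based check on the raw values, using the same threshold."""
    return abs(a - b) < 0.5 * 10 ** -decimal


# Pairs taken from the two snippets in this comment, with decimal=1.
print(ratio_almost_equal(0.15, 0.16, 1))      # ratio is about 1.067, off by more than 0.05
print(ratio_almost_equal(0.099999, 0.09, 1))  # ratio is about 0.90, off by more than 0.05
print(ratio_almost_equal(0.099999, 0.099, 1)) # ratio is about 0.99, within 0.05

# Comparing the raw difference instead gives another answer for the first pair.
print(diff_almost_equal(0.15, 0.16, 1))       # difference 0.01 is below 0.05
```

If the check really works on the ratio, the digit count behaves like a relative tolerance rather than a count of decimals, which would explain why small values that agree to one decimal are still rejected.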
Hmm, maybe I spoke too fast. This commit has a unit test with the examples discussed here: kinow@f45be0e
The test passes, but several other tests fail. For example,
# test_timeseries.test_pct_change_shift_over_nas
def test_pct_change_shift_over_nas(self):
    s = Series([1.0, 1.5, np.nan, 2.5, 3.0])
    chg = s.pct_change()
    expected = Series([np.nan, 0.5, 0.0, 2.5 / 1.5 - 1, 0.2])
    tm.assert_series_equal(chg, expected)
Fails with
The values that fail are
This part is the most confusing for me: "digits (...) after decimal points are compared". If we have 5 digits, and
Any updates? It's been several updates, but the problem seems to persist.
Any updates?
Looks fixed by #30562. Example from OP now does not raise (and also
Thank you for tracking that, @mzeitlin11! I verified that with the current ver==1.1.5 if I replace check_less_precise with rtol=3 it works as expected. Awesome!
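One caveat on the last comment: rtol is a relative tolerance, not a digit count, so rtol=3 allows differences of up to 300% and will accept almost any pair. The sketch below assumes that "n digits" corresponds roughly to rtol=10**-n; that mapping and the helper `rtol_for_digits` are assumptions for illustration, not a documented pandas equivalence. math.isclose stands in for the pandas comparison so the example runs without pandas.

```python
import math


def rtol_for_digits(digits: int) -> float:
    """Hypothetical helper: map a digit count to a relative tolerance."""
    return 10.0 ** -digits


a, b = 0.15, 0.16  # pair taken from an earlier comment in this thread

print(math.isclose(a, b, rel_tol=rtol_for_digits(1)))  # 10% tolerance accepts the pair
print(math.isclose(a, b, rel_tol=rtol_for_digits(3)))  # 0.1% tolerance rejects the pair
print(math.isclose(a, b, rel_tol=3))                   # 300% tolerance accepts nearly anything

# With pandas >= 1.1 the corresponding call would look like, for example:
# pd.testing.assert_frame_equal(df1, df2, check_exact=False, rtol=1e-3)
```

When a fixed number of decimals is what matters, an absolute tolerance (the atol parameter) may be the closer match to the old wording of the docs.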
Code Sample, a copy-pastable example if possible
Problem description
This asserts, despite all columns being identical in the first 3 digits after the decimal point.
It doesn't assert if check_less_precise=2 is used instead. So something is not right here. Is there some kind of rounding issue here?
Doc:
I understand the doc says check_less_precise defines how many digits after the decimal point are compared.
Unrelated: The doc should probably say "decimal point" (singular) as there is only one, no? Also, "specify the digits to compare" is vague; perhaps "If int, then specify how many digits after the decimal point to compare"?
Here is a proposed updated doc entry:
Specify comparison precision. Only used when check_exact is False. int: How many digits after the decimal point to compare, False: 5 digits, True: 3 digits.
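The mapping in the proposed doc entry could be expressed in code along these lines. This is a sketch of the documented behaviour only; the function name `digits_to_compare` is hypothetical and this is not how pandas resolves the parameter internally.

```python
from typing import Union


def digits_to_compare(check_less_precise: Union[bool, int]) -> int:
    """Resolve check_less_precise to a digit count, per the proposed doc entry."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(check_less_precise, bool):
        return 3 if check_less_precise else 5
    return check_less_precise


print(digits_to_compare(False))  # 5 digits
print(digits_to_compare(True))   # 3 digits
print(digits_to_compare(2))      # 2 digits
```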
Expected Output
no assert for up to check_less_precise=4; in this example, the numbers start to diverge at digit 5.
It is also still unclear whether rounding is performed or not.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_CA.UTF-8
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
pandas: 0.24.0
pytest: 4.0.2
pip: 19.0.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None