
BUG: ValueError: Cannot convert non-finite values (NA or inf) to integer only when DF exceed certain size #35227

Closed
3 tasks done
ben-arnao opened this issue Jul 11, 2020 · 9 comments · Fixed by #46534
Labels: Bug, good first issue, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Needs Tests (unit test(s) needed to prevent regressions)

Comments

@ben-arnao

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Here is the code in question:

    print(df.shape)
    print(df.memory_usage().sum())
    df.dropna(inplace=True)  # drop rows with not enough lookback (they will be nan)

The dropna line is what throws the error.

When I run my code adding 500 features/columns, the output is the following:

(2177432, 503)
4398412768

However, when I run the exact same code with 1000 features,

(2177432, 1003)
8753276768

I get an error:

  File "C:\Users\Ben\PycharmProjects\tradingbot\tradingbot\trainer\sample_maker.py", line 17, in get_signal_features
    df.dropna(inplace=True)  # drop rows with not enough lookback (they will be nan)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\pandas\core\frame.py", line 4751, in dropna
    count = agg_obj.count(axis=agg_axis)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\pandas\core\frame.py", line 7807, in count
    return result.astype("int64")
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\pandas\core\generic.py", line 5698, in astype
    new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\pandas\core\internals\managers.py", line 582, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\pandas\core\internals\managers.py", line 442, in apply
    applied = getattr(b, f)(**kwargs)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\pandas\core\internals\blocks.py", line 625, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\pandas\core\dtypes\cast.py", line 868, in astype_nansafe
    raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

The version I am running is 1.0.5.

@ben-arnao ben-arnao added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 11, 2020
@jorisvandenbossche
Member

@ben-arnao could you provide a reproducible example? (see eg https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports)

@ben-arnao
Author

ben-arnao commented Jul 11, 2020

@jorisvandenbossche

Please see below

import numpy as np
import pandas as pd


def test(lookback, columns):
    df = pd.DataFrame({'value': np.random.random_sample((columns,))})

    df['value'] = df['value'].astype(np.float32)

    for x in range(1, lookback + 1):
        df['_relhist_' + str(x)] = df['value'].shift(periods=x, fill_value=np.nan) / df['value'] - 1
    df.dropna(inplace=True)
    print(df.shape)


test(100, 300000)  # does not throw error

test(1000, 3000000)  # throws error

@ben-arnao
Author

ben-arnao commented Jul 22, 2020

I think I've been able to narrow it down to the sum() function failing with larger DataFrames. In the pandas.core.frame module there is a count(self, axis=0, level=None, numeric_only=False) function around line 7700.

Execution enters the block of code where result = notna(frame).sum(axis=axis) is performed.

If we add a few print statements after this bit of code

        print(notna(frame).isnull().values.any())
        print(notna(frame))
        print(notna(frame).sum(axis=axis))

we can see that cells are correctly marked True or False for not-null/null.

notna(frame).isnull().values.any() shows that, in both cases, the indicator frame itself contains no null values (which is correct), so the frame that notna(frame) returns is valid.

However, when we call sum(axis=axis), the smaller frame counts the non-nulls without problems, while for the larger frame every cell of the result is simply NaN. This leads me to believe that something in the sum() function fails when operating on larger DataFrames.
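A quick sanity check (a small sketch, not the pandas internals) confirms that a boolean indicator frame sums cleanly at small sizes, which supports the idea that the True/False marking step is fine and the failure lies further down in the sum machinery:

```python
import numpy as np
import pandas as pd

# Build a small all-finite frame and its not-null indicator mask.
frame = pd.DataFrame(np.ones((4, 3)))
mask = frame.notna()

# Summing the boolean mask gives clean integer per-column counts.
print(mask.sum(axis=0))
```
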

@ben-arnao
Author

The issue is this:

np.prod((2244367, 1253)) returns -1482775445

Obviously this is a NumPy issue, as there is some sort of overflow going on here. But we can pretty easily deal with the problem for now in pandas by just checking that the mask size is positive, since it should never be negative.

At line 1313 of nanops, in check_below_min_count,

    non_nulls = np.prod(shape)

incorrectly produces a negative value, which causes the next check

        if non_nulls < min_count:
            return True

to return True and then set the result to NaN:

        if check_below_min_count(shape, mask, min_count):
            result = np.nan

If we add a condition to the check above requiring that the computed size also be positive, this should resolve the issue.
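The wraparound described above can be reproduced directly by forcing a 32-bit accumulator; in this sketch, np.int32 stands in for the default platform integer on the affected Windows builds:

```python
import numpy as np

shape = (2244367, 1253)  # rows x columns of the failing frame

# With a 32-bit accumulator the element count silently wraps negative:
wrapped = int(np.prod(shape, dtype=np.int32))   # -1482775445

# A 64-bit accumulator gives the true element count:
correct = int(np.prod(shape, dtype=np.int64))   # 2812191851

print(wrapped, correct)
```

Because this product feeds the min_count comparison, the negative value makes a large-enough frame look "below min_count", so the reduction result becomes NaN.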

@simonjayhawkins
Member

> The issue is this:
>
> np.prod((2244367, 1253)) returns -1482775445
>
> Obviously this is a NumPy issue, as there is some sort of overflow going on here. But we can pretty easily deal with the problem for now in pandas by just checking that the mask size is positive, since it should never be negative.

or specifying dtype would give the correct answer

>>> np.prod((2244367, 1253), dtype="int64")
2812191851
>>>

see also #34827

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2020
@jreback jreback added this to the 1.2 milestone Oct 10, 2020
@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 19, 2020
@fonnesbeck

Is there a plan to fix this, or a workaround in the meantime?

@jreback
Contributor

jreback commented Sep 1, 2021

Community PRs are always welcome, @fonnesbeck. pandas is all volunteer, so direct prioritization of issues is not possible.

@fonnesbeck

Actually, this may no longer be an issue with newer versions of NumPy: np.prod now appears to return int64, so the failure demonstrated by @simonjayhawkins no longer occurs.
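One way to check this on a given installation (hedged: the bare call's result dtype follows the platform default integer, so whether it overflows depends on the OS and NumPy build; passing dtype explicitly is safe everywhere):

```python
import numpy as np

# May or may not overflow, depending on the platform default integer:
bare = np.prod((2244367, 1253))
print(bare, bare.dtype)

# An explicit 64-bit accumulator is correct on every platform:
assert np.prod((2244367, 1253), dtype="int64") == 2812191851
```
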

@jreback jreback added Needs Tests Unit test(s) needed to prevent regressions good first issue labels Sep 1, 2021
@FactorizeD
Contributor

take

@jreback jreback modified the milestones: Contributions Welcome, 1.5 Mar 28, 2022