PandasDataset slow at creating when many large DataFrames are given #2147

Closed
lostella opened this issue Jul 10, 2022 · 4 comments · Fixed by #2148
Labels
bug Something isn't working

Comments

@lostella
Contributor

lostella commented Jul 10, 2022

Description

The PandasDataset class is slow to construct when several large DataFrames are given. This check appears to be the culprit.

To Reproduce

The following snippet takes about 14 seconds to run on my machine:

import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# wide DataFrame: 200 columns, each a series of 5000 values on a shared PeriodIndex
df = pd.DataFrame(
    {k: [1.0] * 5000 for k in range(200)},
    index=pd.period_range("2005-01-01", periods=5000, freq="2H"),
)

# dict(df) yields one Series per column; each becomes a dataset entry
dataset = PandasDataset(dict(df))

What I tried

Changing the definition of is_uniform to

def is_uniform(index: pd.PeriodIndex) -> bool:
    ts_index = index.to_timestamp()
    return (ts_index[1:] - ts_index[:-1] == index.freq).all()

drastically reduces the runtime. However, this doesn't work with irregular offsets like MonthEnd (in fact, a test using 3M frequency fails): turning MonthEnd periods into timestamps makes their differences irregular in terms of days:

import pandas as pd
pi = pd.period_range("2012-01", periods=3, freq="M")
print(pi[1:] - pi[:-1])  # Index([<MonthEnd>, <MonthEnd>], dtype='object')
dti = pi.to_timestamp()
print(dti[1:] - dti[:-1])  # TimedeltaIndex(['31 days', '29 days'], dtype='timedelta64[ns]', freq=None)
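
One vectorized alternative that avoids the timestamp round-trip altogether (a sketch only, not necessarily what #2148 ends up doing) is to compare the index element-wise against a period_range regenerated from its own start, length, and frequency; this stays in period space and therefore handles irregular offsets like MonthEnd:

import pandas as pd

def is_uniform(index: pd.PeriodIndex) -> bool:
    # Regenerate an index with the same start, length, and frequency; a
    # uniform index matches it element-wise. No conversion to timestamps,
    # so MonthEnd-style offsets compare as equal periods, not day counts.
    expected = pd.period_range(index[0], periods=len(index), freq=index.freq)
    return bool((index == expected).all())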
lostella added the bug label on Jul 10, 2022
@kashif
Contributor

kashif commented Jul 10, 2022

My suggestion is to just use the index as given, whether it is regular or irregular, without converting it to periods and then back to time ranges... as far as I can tell, with a pandas DataFrame one already has the index for each time point, so you can use the irregular time series approach...

@lostella
Contributor Author

@kashif that would require #1973, right?

@lostella
Contributor Author

lostella commented Jul 10, 2022

A workaround for my example above would be a constructor option that disables the check, plus an alternative constructor from_wide_dataframe that does the check just once.

My example above is really about constructing a PandasDataset from a wide DataFrame: the solution relying on pd.melt (turning it into a long DataFrame and then invoking from_long_dataframe) appears very slow when a lot of data is involved.
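
For concreteness, the melt-based route looks roughly like this (the column names and the exact from_long_dataframe arguments here are illustrative, not a verbatim recipe):

import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# same wide frame as in the snippet above
df = pd.DataFrame(
    {k: [1.0] * 5000 for k in range(200)},
    index=pd.period_range("2005-01-01", periods=5000, freq="2H"),
)

# melt into long format: one row per (timestamp, item_id, target),
# i.e. 200 * 5000 = 1,000,000 rows for this example
long_df = (
    df.rename_axis("timestamp")
    .reset_index()
    .melt(id_vars="timestamp", var_name="item_id", value_name="target")
)

dataset = PandasDataset.from_long_dataframe(
    long_df, item_id="item_id", timestamp="timestamp", freq="2H"
)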

cc @rsnirwan

@kashif
Contributor

kashif commented Jul 10, 2022

@lostella yes, I believe so... so just use the index directly and, as you can see, it all works without going back to period and date ranges again...
