PandasDataset slow at creating when many large DataFrames are given #2147

Closed
lostella opened this issue Jul 10, 2022 · 4 comments · Fixed by #2148
Labels
bug Something isn't working

Comments

@lostella
Contributor

lostella commented Jul 10, 2022

Description

The PandasDataset class is slow to construct when several large DataFrames are given. This check appears to be the culprit.

To Reproduce

The following snippet takes about 14 seconds to run on my machine:

import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# wide DataFrame: 200 columns, each a series of 5000 values on a shared PeriodIndex
df = pd.DataFrame(
    {k: [1.0] * 5000 for k in range(200)},
    index=pd.period_range("2005-01-01", periods=5000, freq="2H"),
)

# dict(df) yields one Series per column; each becomes a dataset entry
dataset = PandasDataset(dict(df))

What I tried

Changing the definition of is_uniform to

def is_uniform(index: pd.PeriodIndex) -> bool:
    ts_index = index.to_timestamp()
    return (ts_index[1:] - ts_index[:-1] == index.freq).all()

drastically reduces the runtime. However, this doesn't work with irregular offsets like MonthEnd (in fact, a test using 3M frequency fails): turning MonthEnd periods into timestamps makes their differences irregular in terms of days:

import pandas as pd
pi = pd.period_range("2012-01", periods=3, freq="M")
print(pi[1:] - pi[:-1])  # Index([<MonthEnd>, <MonthEnd>], dtype='object')
dti = pi.to_timestamp()
print(dti[1:] - dti[:-1])  # TimedeltaIndex(['31 days', '29 days'], dtype='timedelta64[ns]', freq=None)
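
One vectorized alternative that avoids the timestamp round-trip altogether (a sketch only, not necessarily what #2148 ends up doing) is to compare the index element-wise against a period_range regenerated from its own start, length, and frequency; this stays in period space and therefore handles irregular offsets like MonthEnd:

import pandas as pd

def is_uniform(index: pd.PeriodIndex) -> bool:
    # Regenerate an index with the same start, length, and frequency; a
    # uniform index matches it element-wise. No conversion to timestamps,
    # so MonthEnd-style offsets compare as equal periods, not day counts.
    expected = pd.period_range(index[0], periods=len(index), freq=index.freq)
    return bool((index == expected).all())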
lostella added the bug label on Jul 10, 2022
@kashif
Contributor

kashif commented Jul 10, 2022

My suggestion is to just use the index as given, whether it is regular or irregular, without converting it to periods and then back to time ranges... as far as I can tell, with a pandas DataFrame one already has the index for each time point, so you can use the irregular time series approach...

@lostella
Contributor Author

@kashif that would require #1973, right?

@lostella
Contributor Author

lostella commented Jul 10, 2022

A workaround for my example above would be a constructor option that disables the check, plus an alternative constructor from_wide_dataframe that does the check just once.

My example above is really about constructing a PandasDataset from a wide DataFrame: the solution relying on pd.melt (turning it into a long DataFrame and then invoking from_long_dataframe) appears very slow when a lot of data is involved.
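
For concreteness, the melt-based route looks roughly like this (the column names and the exact from_long_dataframe arguments here are illustrative, not a verbatim recipe):

import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# same wide frame as in the snippet above
df = pd.DataFrame(
    {k: [1.0] * 5000 for k in range(200)},
    index=pd.period_range("2005-01-01", periods=5000, freq="2H"),
)

# melt into long format: one row per (timestamp, item_id, target),
# i.e. 200 * 5000 = 1,000,000 rows for this example
long_df = (
    df.rename_axis("timestamp")
    .reset_index()
    .melt(id_vars="timestamp", var_name="item_id", value_name="target")
)

dataset = PandasDataset.from_long_dataframe(
    long_df, item_id="item_id", timestamp="timestamp", freq="2H"
)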

cc @rsnirwan

@kashif
Contributor

kashif commented Jul 10, 2022

@lostella yes, I believe so... so just use the index directly and, as you can see, it all works without going back to period and date ranges again...
