PARSynthesizer model won't fit if sequence_index is missing #1972

srinify · 2024-04-30T15:59:10Z

Environment Details

Happening in Mac & Colab in both SDV 1.11 and 1.12 (haven't tried other versions)

Error Description

When using PARSynthesizer, supplying a sequence_key but not a sequence_index seems to be throwing an error. Both of the following examples can be fixed by adding the sequence_index column back in, updating the metadata, and then running fit()

Steps to reproduce

Example 1

Full Code:

import pandas as pd
from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata
from sdv.datasets.demo import get_available_demos, download_demo

demo_data, metadata = download_demo(dataset_name='AtrialFibrillation', modality='sequential')
# Removed column that would normally be the sequence_index
demo2 = demo2.drop('s_index', axis=1)

demo2_metadata = SingleTableMetadata()
# Re-building metadata (ofc I could remove s_index column too from existing metadata)
demo2_metadata.detect_from_dataframe(demo2)

demo2_metadata.update_column(column_name='e_id', sdtype='id')
demo2_metadata.set_sequence_key(column_name='e_id')

synthesizer_demo = PARSynthesizer(demo2_metadata)
synthesizer_demo.fit(demo2)

Throws this error:

/usr/local/lib/python3.10/dist-packages/sdv/single_table/base.py:80: UserWarning: We strongly recommend saving the metadata using 'save_to_json' for replicability in future SDV versions.
  warnings.warn(
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-25-e429dedddd24> in <cell line: 2>()
      1 synthesizer_demo = PARSynthesizer(demo2_metadata)
----> 2 synthesizer_demo.fit(demo2)

7 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in _raise_if_missing(self, key, indexer, axis_name)
   5936                 if use_interval_msg:
   5937                     key = list(key)
-> 5938                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5939 
   5940             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())

KeyError: "None of [Index(['74899b63-1f49-4701-8cdc-e9aeda8426cf'], dtype='object')] are in the [columns]"

Example 2

Generating from scratch:

import numpy as np
import pandas as pd

ids = np.arange(0, 50_000, 1)
ids = np.repeat(ids, 45)

obs = np.concatenate(
    [np.random.normal(loc=5, scale=1, size=1) for i in ids]
)

df = pd.DataFrame(
    {
        "id": ids,
        "obs": obs
    }
)

from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)
metadata.update_column(column_name='id', sdtype='id')
metadata.set_sequence_key(column_name='id')

synthesizer = PARSynthesizer(metadata, verbose=True)
synthesizer.fit(df)

Returns the same error as above but with a different ID:

...
KeyError: "None of [Index(['fb9aa2a7-3694-47f0-8145-2434b9196bbb'], dtype='object')] are in the [columns]"

Workaround

Create a simple incrementing integer column (e.g. from 1 to n rows per sequence) that can be used to index each row that's linked to the same sequence_key.

s_key | s_id | dim1
------------------
A        | 1        | 9.1 
A        | 2        | 8.1
B        | 1        | 4.1
B        | 2        | 5.1
...

Then update your metadata. Here's a code snippet that does both:

# Replace "seq_key" with your column you're using for the sequence_key
s_index = demo2.groupby('seq_key').cumcount() + 1
df['s_index'] = s_index

metadata.set_sequence_key(column_name='s_index')

# Now this should work!
synthesizer.fit(df)

The text was updated successfully, but these errors were encountered:

srinify added bug Something isn't working new Automatic label applied to new issues data:sequential Related to timeseries datasets and removed new Automatic label applied to new issues labels Apr 30, 2024

srinify changed the title ~~PARSynthesizer model can't be trained if sequence_index is missing~~ PARSynthesizer model won't fit if sequence_index is missing Apr 30, 2024

This was referenced May 17, 2024

Handle PARSynthesizer model if sequence_index is missing sdv-dev/DeepEcho#114

Closed

Add the dummy context column to metadata and not to extra_context_column #2019

Merged

lajohn4747 closed this as completed in #2019 May 21, 2024

amontanez24 assigned lajohn4747 May 31, 2024

amontanez24 added this to the 1.13.2 milestone May 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PARSynthesizer model won't fit if sequence_index is missing #1972

PARSynthesizer model won't fit if sequence_index is missing #1972

srinify commented Apr 30, 2024 •

edited

Loading

PARSynthesizer model won't fit if sequence_index is missing #1972

PARSynthesizer model won't fit if sequence_index is missing #1972

Comments

srinify commented Apr 30, 2024 • edited Loading

Environment Details

Error Description

Steps to reproduce

Workaround

srinify commented Apr 30, 2024 •

edited

Loading