
InvalidDataError when fitting datetime columns as context columns in PARSynthesizer #2115

Closed
ardulat opened this issue Jul 10, 2024 · 6 comments
Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets), resolution:duplicate (This issue or pull request already exists)

Comments

ardulat commented Jul 10, 2024

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.14.0
  • Python version: Python 3.12.4
  • Operating System: macOS Sonoma 14.5

Error Description

I am encountering an InvalidDataError similar to issue #1772. I am passing the participant__date_of_birth column as a context column to the PARSynthesizer, which fails during the call to the .fit function. Here is the full error message:

sdv.errors.InvalidDataError: The provided data does not match the metadata:
Invalid values found for datetime column 'participant__date_of_birth': [-1.1264832e+18, -1.1267424e+18, -1.1268288e+18, '+ 5116 more'].

I tried debugging the .fit method of the PARSynthesizer and found a call to the .preprocess function that alters my data. Here is an example:

Input data:

0        1918-01-13
1        1918-01-13
2        1918-01-13
3        1918-01-13
4        1918-01-13
            ...
134358   1930-05-17
134359   1930-05-17
134360   1930-05-17
134361   1930-05-17
134362   1930-05-17
Name: participant__date_of_birth, Length: 134363, dtype: datetime64[ns]

The input data after a call to https://github.com/sdv-dev/SDV/blob/main/sdv/single_table/base.py#L471:

primary_key
1        -1.639958e+18
2        -1.639958e+18
3        -1.639958e+18
4        -1.639958e+18
5        -1.639958e+18
              ...
134359   -1.250554e+18
134360   -1.250554e+18
134361   -1.250554e+18
134362   -1.250554e+18
134363   -1.250554e+18
Name: participant__date_of_birth, Length: 134363, dtype: float64
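This matches how pandas represents datetimes internally: as int64 nanoseconds since the Unix epoch, so dates before 1970 become large negative numbers. A minimal sketch of the same conversion (an editorial illustration, not from the SDV source; the column name is taken from the example above):

```python
import pandas as pd

# Datetimes before the Unix epoch (1970-01-01) map to negative
# nanosecond counts when viewed as int64.
dates = pd.Series(
    pd.to_datetime(["1918-01-13", "1930-05-17"]),
    name="participant__date_of_birth",
)

as_ns = dates.astype("int64")          # nanoseconds since the epoch
as_float = as_ns.astype("float64")     # what the preprocessed data looks like

print(as_float.iloc[0])  # -1.6399584e+18, matching the output above
```

Note that -1.639958e+18 nanoseconds is exactly the offset of 1918-01-13 from 1970-01-01, which is why the "invalid" values in the error message are all large negative floats.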

Can you please suggest a good way to handle datetime columns in the context?

ardulat added the bug and new labels on Jul 10, 2024
npatki (Contributor) commented Jul 10, 2024

Hi @ardulat nice to meet you.

There is currently a known issue #1485 specifically for context columns that are datetimes. Fortunately, there is a workaround you can use in the meantime which I've presented in this comment.

Let me know if my understanding is off or if the workaround does not do the trick. It will always be helpful if you can share your code (where you instantiate and use PARSynthesizer) as well as your metadata. Thanks.

npatki added the data:sequential and under discussion labels and removed the new label on Jul 10, 2024
ardulat (Author) commented Jul 11, 2024

Hi, @npatki likewise. Thank you for your quick reply!

The workaround from the comment helps, and the synthesizer trains and generates without any issues. The generated distribution is similar to the original data. However, I don't think it's entirely correct to train the synthesizer on timestamp data for birthdates, since it then generates dates precise to the nanosecond, which is unlikely to occur in real data. I hope this can be solved in future versions. Thank you again!

npatki (Contributor) commented Jul 11, 2024

Hi @ardulat, you're welcome. The workaround simply converts the Unix timestamps back to datetimes at the highest possible precision (nanoseconds). This is done intentionally so that you can round them off later to the nearest second, day, or whatever other precision level you desire.

For example, if you need precision at the second (instead of nanoseconds) you can do this:

synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1s')

Or for a day:

synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1d')

Hope that helps.
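A quick sanity check of this round-trip (an editorial sketch assuming pandas; the nanosecond value corresponds to 1918-01-13 from the example above, with some artificial sub-second noise added to mimic a sampled value):

```python
import pandas as pd

# A Unix-nanosecond value just past midnight on 1918-01-13
ns_value = -1639958400000000000 + 12_345  # slight nanosecond noise

# Convert back to datetime, then round away the unwanted precision
as_datetime = pd.to_datetime(pd.Series([ns_value]), unit="ns")
rounded_day = as_datetime.dt.round("1d")

print(rounded_day.iloc[0])  # 1918-01-13 00:00:00
```

Rounding to '1d' snaps the value back to the nearest midnight, which addresses the nanosecond-precision concern raised above.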

npatki (Contributor) commented Jul 29, 2024

Hi @ardulat, I'm going to close this issue out since we have a workaround. I'll defer to the original issue, #1485, for a fix. (FYI: at the time of this writing, #1485 is closed, meaning the fix should be available in the upcoming SDV release.)

npatki closed this as completed on Jul 29, 2024
npatki added the resolution:duplicate label and removed the under discussion label on Jul 29, 2024
ardulat (Author) commented Sep 27, 2024

Hi @npatki!

As you suggested, I converted dates in the context column to timestamps as follows:

# Spark code
for col_name in self.context_columns:
    if self.context_df.schema[col_name].dataType == T.DateType():
        self.converted_date_columns.add(col_name)  # Save for further usage
        processed_context_df = processed_context_df.withColumn(
            col_name, F.to_timestamp(F.col(col_name), "yyyy-MM-dd")
        )

# Pandas code
for col_name in self.context_columns:
    if output_df[col_name].dtype == "datetime64[ns]":
        # View the datetimes as int64 nanoseconds since the Unix epoch
        output_df[col_name] = output_df[col_name].astype("int64")

Then, I convert the sampled timestamps back to dates:

# Convert timestamp columns in the context back to datetime
for col_name in self.converted_date_columns:
    data[col_name] = pd.to_datetime(data[col_name], unit="ns").dt.date

As a result, I am getting implausible dates ranging from the year 1677 to the year 2253:
[Screenshot: distribution of the generated dates, taken 2024-09-23]

What can I do to produce relevant dates with distribution similar to the training data?
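(An editorial note, not from the thread: the observed range is very close to the representable bounds of pandas' datetime64[ns] dtype, which stores int64 nanoseconds since the epoch. Sampled values that fall outside those bounds cannot survive the round-trip, which would explain dates piling up near 1677 and the 2200s. A minimal check of the bounds:)

```python
import pandas as pd

# datetime64[ns] stores int64 nanoseconds since the Unix epoch, which
# bounds representable timestamps to roughly 1677-09-21 .. 2262-04-11.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```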

npatki (Contributor) commented Sep 27, 2024

Hi @ardulat, since you first filed this bug, the underlying issue (#1485) has already been resolved. You no longer need to apply a workaround of converting the datetime columns to numerical. I would recommend upgrading your SDV version to the latest and re-trying the synthesis without any workarounds.

If you are continuing to have problems with it, please file a new issue so that we can take a look. Thanks.
