
InvalidDataError when fitting datetime columns as context columns in PARSynthesizer #2115

Closed
ardulat opened this issue Jul 10, 2024 · 6 comments
Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets), resolution:duplicate (This issue or pull request already exists)

Comments

ardulat commented Jul 10, 2024

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.14.0
  • Python version: Python 3.12.4
  • Operating System: macOS Sonoma 14.5

Error Description

I am encountering an InvalidDataError similar to issue #1772. I am passing the participant__date_of_birth column as a context column to the PARSynthesizer, which fails during the call to the .fit function. Here is the full error message:

sdv.errors.InvalidDataError: The provided data does not match the metadata:
Invalid values found for datetime column 'participant__date_of_birth': [-1.1264832e+18, -1.1267424e+18, -1.1268288e+18, '+ 5116 more'].

I tried debugging the .fit method of the PARSynthesizer and found a call to the .preprocess function that alters my data. Here is an example:

Input data:

0        1918-01-13
1        1918-01-13
2        1918-01-13
3        1918-01-13
4        1918-01-13
            ...
134358   1930-05-17
134359   1930-05-17
134360   1930-05-17
134361   1930-05-17
134362   1930-05-17
Name: participant__date_of_birth, Length: 134363, dtype: datetime64[ns]

The input data after a call to https://github.com/sdv-dev/SDV/blob/main/sdv/single_table/base.py#L471:

primary_key
1        -1.639958e+18
2        -1.639958e+18
3        -1.639958e+18
4        -1.639958e+18
5        -1.639958e+18
              ...
134359   -1.250554e+18
134360   -1.250554e+18
134361   -1.250554e+18
134362   -1.250554e+18
134363   -1.250554e+18
Name: participant__date_of_birth, Length: 134363, dtype: float64
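This matches how pandas represents datetimes internally: as int64 nanoseconds since the Unix epoch, so dates before 1970 become large negative numbers. A minimal sketch of the same conversion (an editorial illustration, not from the SDV source; the column name is taken from the example above):

```python
import pandas as pd

# Datetimes before the Unix epoch (1970-01-01) map to negative
# nanosecond counts when viewed as int64.
dates = pd.Series(
    pd.to_datetime(["1918-01-13", "1930-05-17"]),
    name="participant__date_of_birth",
)

as_ns = dates.astype("int64")          # nanoseconds since the epoch
as_float = as_ns.astype("float64")     # what the preprocessed data looks like

print(as_float.iloc[0])  # -1.6399584e+18, matching the output above
```

Note that -1.639958e+18 nanoseconds is exactly the offset of 1918-01-13 from 1970-01-01, which is why the "invalid" values in the error message are all large negative floats.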

Can you please suggest a good way to handle datetime columns in the context?

ardulat added the bug and new labels on Jul 10, 2024
npatki (Contributor) commented Jul 10, 2024

Hi @ardulat nice to meet you.

There is currently a known issue #1485 specifically for context columns that are datetimes. Fortunately, there is a workaround you can use in the meantime which I've presented in this comment.

Let me know if my understanding is off or if the workaround does not do the trick. It will always be helpful if you can share your code (where you instantiate and use PARSynthesizer) as well as your metadata. Thanks.

npatki added the data:sequential and under discussion labels and removed the new label on Jul 10, 2024
ardulat (Author) commented Jul 11, 2024

Hi, @npatki likewise. Thank you for your quick reply!

The workaround from the comment helps, and the synthesizer trains and generates without any issues. The generated distribution is similar to the original data. However, I don't think it's entirely correct to train the synthesizer on timestamp data for birthdates, since it then generates dates precise to the nanosecond, which is unlikely to occur in real data. I hope this can be solved in future versions. Thank you again!

npatki (Contributor) commented Jul 11, 2024

Hi @ardulat, you're welcome. The workaround simply converts the Unix timestamps back to datetimes at the highest possible precision (nanoseconds). This is done intentionally so that you can round them off later to the nearest second, day, or whatever other precision level you desire.

For example, if you need precision at the second (instead of nanoseconds) you can do this:

synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1s')

Or for a day:

synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1d')

Hope that helps.
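A quick sanity check of this round-trip (an editorial sketch assuming pandas; the nanosecond value corresponds to 1918-01-13 from the example above, with some artificial sub-second noise added to mimic a sampled value):

```python
import pandas as pd

# A Unix-nanosecond value just past midnight on 1918-01-13
ns_value = -1639958400000000000 + 12_345  # slight nanosecond noise

# Convert back to datetime, then round away the unwanted precision
as_datetime = pd.to_datetime(pd.Series([ns_value]), unit="ns")
rounded_day = as_datetime.dt.round("1d")

print(rounded_day.iloc[0])  # 1918-01-13 00:00:00
```

Rounding to '1d' snaps the value back to the nearest midnight, which addresses the nanosecond-precision concern raised above.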

npatki (Contributor) commented Jul 29, 2024

Hi @ardulat, I'm going to close this issue out since we have a workaround. I'll defer to the original issue, #1485, for a fix. (FYI: at the time of this writing, #1485 is closed, meaning the fix should be available in the upcoming SDV release.)

npatki closed this as completed on Jul 29, 2024
npatki added the resolution:duplicate label and removed the under discussion label on Jul 29, 2024
ardulat (Author) commented Sep 27, 2024

Hi @npatki!

As you suggested, I converted dates in the context column to timestamps as follows:

# Spark code
for col_name in self.context_columns:
    if self.context_df.schema[col_name].dataType == T.DateType():
        self.converted_date_columns.add(col_name)  # Save for further usage
        processed_context_df = processed_context_df.withColumn(
            col_name, F.to_timestamp(F.col(col_name), "yyyy-MM-dd")
        )

# Pandas code
for col_name in self.context_columns:
    if output_df[col_name].dtype == "datetime64[ns]":
        # View the datetimes as int64 nanoseconds since the Unix epoch
        output_df[col_name] = output_df[col_name].astype("int64")

Then, I convert the sampled timestamps back to dates:

# Convert timestamp columns in the context back to datetime
for col_name in self.converted_date_columns:
    data[col_name] = pd.to_datetime(data[col_name], unit="ns").dt.date

As a result, I am getting implausible dates ranging from the year 1677 to the year 2253:
[Screenshot: distribution of the generated dates, taken 2024-09-23]

What can I do to produce relevant dates with distribution similar to the training data?
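(An editorial note, not from the thread: the observed range is very close to the representable bounds of pandas' datetime64[ns] dtype, which stores int64 nanoseconds since the epoch. Sampled values that fall outside those bounds cannot survive the round-trip, which would explain dates piling up near 1677 and the 2200s. A minimal check of the bounds:)

```python
import pandas as pd

# datetime64[ns] stores int64 nanoseconds since the Unix epoch, which
# bounds representable timestamps to roughly 1677-09-21 .. 2262-04-11.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```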

npatki (Contributor) commented Sep 27, 2024

Hi @ardulat, since you first filed this bug, the underlying issue (#1485) has already been resolved. You no longer need to apply a workaround of converting the datetime columns to numerical. I would recommend upgrading your SDV version to the latest and re-trying the synthesis without any workarounds.

If you are continuing to have problems with it, please file a new issue so that we can take a look. Thanks.
