InvalidDataError when fitting datetime columns as context columns in PARSynthesizer #2115
Comments
Hi @ardulat, nice to meet you. There is currently a known issue, #1485, specifically for context columns that are datetimes. Fortunately, there is a workaround you can use in the meantime, which I've presented in this comment. Let me know if my understanding is off or if the workaround does not do the trick. It is always helpful if you can share your code (where you instantiate and use PARSynthesizer) as well as your metadata. Thanks.
Hi @npatki, likewise. Thank you for your quick reply! The workaround from the comment helps, and the synthesizer trains and generates without any issues. The generated distribution is similar to the original data. However, I believe it's not fully correct to train the synthesizer on timestamp data for the birthdates, since it then generates dates precise to nanoseconds, which is unlikely to occur in real-life data. I hope this can be solved in future versions. Thank you again!
Hi @ardulat, you're welcome. The workaround I have just converts the Unix timestamps back to a datetime at the highest possible precision (nanoseconds). This is done intentionally so that you can round it off later to the nearest second, day, or whatever other precision level you desire. For example, if you need precision at the second (instead of nanoseconds) you can do this:
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1s')
Or for a day:
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1D')
Hope that helps.
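To make the rounding step concrete, here is a minimal standard-library sketch of what rounding a nanosecond Unix timestamp to the nearest second or day amounts to. It is not SDV or pandas code: the helper names round_ns and ns_to_datetime are hypothetical, ties are rounded up as a simplifying assumption, and Python's datetime keeps only microseconds.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
NS_PER_SECOND = 1_000_000_000
NS_PER_DAY = 86_400 * NS_PER_SECOND


def round_ns(timestamp_ns: int, unit_ns: int) -> int:
    """Round a nanosecond timestamp to the nearest multiple of unit_ns (ties round up)."""
    return (timestamp_ns + unit_ns // 2) // unit_ns * unit_ns


def ns_to_datetime(timestamp_ns: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to a naive UTC datetime (microsecond precision)."""
    return EPOCH + timedelta(microseconds=timestamp_ns // 1_000)


# Hypothetical sampled value: 1990-05-17 13:45:10 plus spurious sub-second noise.
base = datetime(1990, 5, 17, 13, 45, 10)
sampled_ns = (base - EPOCH) // timedelta(seconds=1) * NS_PER_SECOND + 123_456_789

to_second = ns_to_datetime(round_ns(sampled_ns, NS_PER_SECOND))
to_day = ns_to_datetime(round_ns(sampled_ns, NS_PER_DAY))

print(to_second)  # 1990-05-17 13:45:10
print(to_day)     # 1990-05-18 00:00:00 (13:45 is past noon, so it rounds up)
```

Note that rounding to the nearest day can move an afternoon timestamp to the following date. If the calendar date must be preserved, flooring is the safer choice; in pandas the corresponding Series methods are .dt.round and .dt.floor.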
Hi @npatki! As you suggested, I converted dates in the context column to timestamps as follows:

# Spark code
for col_name in self.context_columns:
    if self.context_df.schema[col_name].dataType == T.DateType():
        self.converted_date_columns.add(col_name)  # Save for further usage
        processed_context_df = processed_context_df.withColumn(
            col_name, F.to_timestamp(F.col(col_name), "yyyy-MM-dd")
        )

# Pandas code
for col_name in self.context_columns:
    if output_df[col_name].dtype == "datetime64[ns]":
        output_df[col_name] = pd.to_datetime(
            output_df[col_name], format="%Y-%m-%d"
        ).astype(int)

Then, I convert the sampled timestamps back to dates:

# Convert timestamp columns in the context back to datetime
for col_name in self.converted_date_columns:
    data[col_name] = pd.to_datetime(data[col_name], unit="ns").dt.date

As a result, I am getting irrelevant dates in the range from year 1677 to year 2253. What can I do to produce relevant dates with a distribution similar to the training data?
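One observation that may help when reading the reported range: a signed 64-bit count of nanoseconds since 1970-01-01 can only represent dates between roughly 1677 and 2262. Sampled values that spread from 1677 to 2253 would therefore cover almost the whole representable interval rather than the span of the training birthdates. This is an inference from the numbers in the comment, not a confirmed diagnosis. The standard-library sketch below computes those bounds and round-trips one date; the helper names date_to_ns and ns_to_datetime are hypothetical.

```python
from datetime import date, datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def ns_to_datetime(timestamp_ns: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to a naive UTC datetime (microsecond precision)."""
    return EPOCH + timedelta(microseconds=timestamp_ns // 1_000)


def date_to_ns(day: date) -> int:
    """Convert a calendar date (midnight UTC) to nanoseconds since the Unix epoch."""
    midnight = datetime(day.year, day.month, day.day)
    return (midnight - EPOCH) // timedelta(microseconds=1) * 1_000


# Bounds of what an int64 nanosecond timestamp can represent.
earliest = ns_to_datetime(INT64_MIN)
latest = ns_to_datetime(INT64_MAX)
print(earliest)  # 1677-09-21 00:12:43.145224
print(latest)    # 2262-04-11 23:47:16.854775

# A realistic birthdate survives the round trip unchanged.
birthdate = date(1990, 5, 17)
birthdate_ns = date_to_ns(birthdate)
print(ns_to_datetime(birthdate_ns).date())  # 1990-05-17
```

A quick sanity check along these lines is to compare the minimum and maximum of the sampled integer column with the minimum and maximum of the training column before converting back to dates.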
Hi @ardulat, since you first filed this bug, the underlying issue (#1485) has been resolved. You no longer need the workaround of converting the datetime columns to numerical values. I would recommend upgrading to the latest SDV version and retrying the synthesis without any workarounds. If you continue to have problems, please file a new issue so that we can take a look. Thanks.
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
I am encountering an InvalidDataError similar to the one in issue #1772. I am passing the participant__date_of_birth column as a context column to the PARSynthesizer, which fails during a call to the .fit function. Here is the full error message:
I tried debugging the .fit method of the PARSynthesizer and found that there is a call to the .preprocess function, which alters my data. Here is an example:
Input data:
The input data after a call to https://github.com/sdv-dev/SDV/blob/main/sdv/single_table/base.py#L471:
Can you please suggest a good way to handle datetime columns in the context?