Datetime columns as context_columns in PARSynthesizer #1772
Comments
Thanks for filing this issue @Ng-ms. I'm a bit confused at the scenario you are describing. The
I am confused because this sentence implies that you already have synthetic data. How are you able to get synthetic data if the synthesizer crashed (with the It would be helpful if you could share the Python code that you are using to load data, modify it, create metadata, create the synthesizer, sample from it, etc. And also if you could indicate where the crash is happening. |
Sorry if my earlier messages were a bit unclear. Here is more information to explain. I have tried two approaches:
1. Using datetime columns as context without alteration: this leads to an InvalidDataError due to a mismatch between the data and the metadata, even though the metadata specifies these columns as datetime. This prevents fitting the PARSynthesizer.
2. Converting datetime to numerical for synthesis: this results in synthetic data with unrealistic dates (e.g., 09-08-1768), indicating a problem in handling these numerical values or in converting them back to datetime.
|
Hello @npatki, do you have any ideas on how to solve this ? |
Hi @Ng-ms, thanks for confirming. The errors indicate mismatches between how you convert the data from datetime to numerical and how you convert it back from numerical to datetime. If you do any conversions, you also need to update your metadata, because the sdtype is no longer datetime but numerical. Here is a code snippet that may help:
import pandas as pd
from sdv.sequential import PARSynthesizer  # needed for the synthesizer created below
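# convert datetime columns to numerical (nanoseconds since the Unix epoch)
# 'int64' is explicit so the result does not depend on the platform's default int size
data[COLUMN_NAME] = pd.to_datetime(data[COLUMN_NAME], format='%d/%m/%Y').astype('int64')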
# update these columns to be sdtype 'numerical' in the metadata, as they are no longer datetime!
metadata.update_column(column_name=COLUMN_NAME, sdtype='numerical')
# save this version of metadata!
metadata.save_to_json(filepath='metadata_converted_context.json')
# now you can fit and sample
synthesizer = PARSynthesizer(metadata, epochs=150, context_columns=[COLUMN_NAME])
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_sequences=100)
# synthetic data will have numerical values. convert them to datetime
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.date |
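A minimal sketch of the same round trip in plain pandas, assuming the column has no missing dates. The helper names `datetime_to_ns` and `ns_to_datetime` are hypothetical. Clipping sampled values to the observed range is a workaround added here for illustration; it is not something SDV does, and it is not part of the snippet above.

```python
import pandas as pd


def datetime_to_ns(series, fmt=None):
    """Convert a date column to int64 nanoseconds and record its observed range."""
    parsed = pd.to_datetime(series, format=fmt)
    if parsed.isna().any():
        # A missing date would become a huge negative integer and skew the model.
        raise ValueError("column contains missing dates; handle them before converting")
    # Force nanosecond resolution first so the integers always mean nanoseconds.
    ns = parsed.astype("datetime64[ns]").astype("int64")
    return ns, int(ns.min()), int(ns.max())


def ns_to_datetime(values, lower, upper):
    """Clip sampled numbers to [lower, upper], then decode them as nanoseconds."""
    numeric = pd.to_numeric(pd.Series(values))
    clipped = numeric.clip(lower=lower, upper=upper).round().astype("int64")
    return pd.to_datetime(clipped, unit="ns")


# Illustrative data only: three real dates and three made-up sampled values.
real = pd.Series(["26/01/2012", "15/02/2012", "02/03/2012"])
real_ns, lower, upper = datetime_to_ns(real, fmt="%d/%m/%Y")
sampled = pd.Series([-6.355e18, 1.328e18, 9.0e18])
restored = ns_to_datetime(sampled, lower, upper)
```

Clipping hides out-of-range samples rather than explaining them, so it is worth checking how many values were clipped before relying on the result.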
thank you @npatki |
Hi @Ng-ms, that is unfortunate to hear. As I mentioned in my previous message, you may want to double check how you are doing your conversion from datetime --> numerical, and back from numerical --> datetime. Good practice will be to inspect your data every step of the way. What does the input data look like? What are the min/max values in the input data for
Unfortunately, there is only so much I can do with these screenshots. If you are able to provide access to your real data or metadata, as well as the full and complete code that you have currently in SDV, that will be helpful. If we are not able to replicate your issue, it is unlikely we will be able to provide any kinds of useful information. Please provide any other information you think will be helpful. Thanks. |
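One way to run the min/max check suggested above is sketched below. The function name `range_report` is hypothetical; it assumes both inputs hold the numerical (converted) values of a single column, and it accepts any iterable of numbers, such as a list or a pandas Series.

```python
def range_report(real_values, synthetic_values):
    """Compare the numeric range of a real column with its synthetic counterpart."""
    real = [float(v) for v in real_values]
    synth = [float(v) for v in synthetic_values]
    # Assumes the real column has no NaN values; min/max are unreliable otherwise.
    lower, upper = min(real), max(real)
    # NaN fails both comparisons, so it is counted as outside the range.
    outside = [v for v in synth if not (lower <= v <= upper)]
    return {
        "real_min": lower,
        "real_max": upper,
        "synthetic_min": min(synth),
        "synthetic_max": max(synth),
        "n_outside": len(outside),
    }


# Illustrative values only: two real values and three made-up sampled values.
report = range_report(
    [1.327536e18, 1.3306464e18],
    [1.328e18, -6.355e18, 0.0],
)
```

Running this once on the converted input and once on the sampled output shows whether the out-of-range values already exist before the conversion back to datetime.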
Hi @Ng-ms are you still encountering this problem? Since this issue has been inactive for a while, I'm closing it off. But please feel free to reply with any more info. We can always reopen the issue to continue investigation. |
Hi @npatki, yes, unfortunately I am still having this problem. Even though my data conversion is correct, I am still getting illogical dates (outside the min and max of the input data). |
@npatki Hello, I am still having the same error. Sometimes just one column gives unrealistic dates, and sometimes (if I train the model longer) more than one date column has unrealistic dates (e.g., 1700, 1898). |
Hello @npatki! I'm experiencing a similar issue while dealing with my data. The issue I am getting is Input data:
The input data after a call to https://github.com/sdv-dev/SDV/blob/main/sdv/single_table/base.py#L471:
Can you please suggest a good way to handle datetime columns in the context? |
Description
I'm encountering an InvalidDataError when trying to fit the PAR model for sequential data when I use datetime columns as context_columns.
This is the error:
/python3.8/site-packages/sdv/single_table/base.py", line 385, in fit
self.fit_processed_data(processed_data)
b/python3.8/site-packages/sdv/single_table/base.py", line 368, in fit_processed_data
self._fit(processed_data)
python3.8/site-packages/sdv/sequential/par.py", line 265, in _fit
self._fit_context_model(processed_data)
/python3.8/site-packages/sdv/sequential/par.py", line 203, in _fit_context_model
self._context_synthesizer.fit(context)
/python3.8/site-packages/sdv/single_table/base.py", line 384, in fit
processed_data = self._preprocess(data)
/python3.8/site-packages/sdv/single_table/base.py", line 328, in _preprocess
self.validate(data)
python3.8/site-packages/sdv/single_table/base.py", line 154, in validate
raise InvalidDataError(errors)
`sdv.errors.InvalidDataError: The provided data does not match the metadata:
Invalid values found for datetime column 'date_DID': [1.327536e+18, 1.329264e+18, 1.3306464e+18, '+ 96 more'].
Invalid values found for datetime column 'date_DCM': [1.0274688e+18, 1.1066976e+18, 1.1317536e+18, '+ 39 more'].
Invalid values found for datetime column 'date_HIP': [1.1605248e+18, 1.2881376e+18, 1.296432e+18, '+ 6 more'].
Invalid values found for datetime column 'DATE_DIS': [1.1605248e+18, 1.270944e+18, 1.3627008e+18, '+ 33 more'].
Invalid values found for datetime column 'data_DM': [0.0].`
I tried to convert those columns to numerical as suggested in #1485, but converting the synthetic data back to datetime gives dates like 09-08-1768 and 10-03-1644.
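The invalid values in the error message appear to be nanoseconds since the Unix epoch, which is what pandas produces when a datetime column is cast to an integer. The sketch below decodes a few such values with the standard library only. The helper `ns_to_date` is hypothetical, and the negative value is made up to show how a sampled number far below the real range decodes into the 1700s.

```python
from datetime import date, datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def ns_to_date(ns):
    """Decode nanoseconds since 1970-01-01 into a calendar date."""
    # timedelta has microsecond resolution, so drop the sub-microsecond part.
    microseconds = int(ns) // 1000
    return (EPOCH + timedelta(microseconds=microseconds)).date()


# 1.327536e18 and 0.0 come from the error message; -6.355e18 is a made-up example.
decoded = {value: ns_to_date(value) for value in (1.327536e18, 0.0, -6.355e18)}
```

Here 1.327536e18 decodes to 2012-01-26, 0.0 decodes to 1970-01-01, and -6.355e18 lands in 1768. The single 0.0 reported for 'data_DM' may point to missing values filled with 0, although the thread does not confirm this.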