Datatime columns as context_columns in PARsynthizer #1772

Closed
Ng-ms opened this issue Feb 2, 2024 · 10 comments
Labels
data:sequential (Related to timeseries datasets); question (General question about the software); resolution:cannot replicate (The problem cannot be replicated)

Comments

@Ng-ms

Ng-ms commented Feb 2, 2024

Description
I'm encountering an InvalidDataError when trying to fit the PAR model for sequential data when I introduce datetime type data as context_columns.
This is the error:

/python3.8/site-packages/sdv/single_table/base.py", line 385, in fit
self.fit_processed_data(processed_data)
b/python3.8/site-packages/sdv/single_table/base.py", line 368, in fit_processed_data
self._fit(processed_data)
python3.8/site-packages/sdv/sequential/par.py", line 265, in _fit
self._fit_context_model(processed_data)
/python3.8/site-packages/sdv/sequential/par.py", line 203, in _fit_context_model
self._context_synthesizer.fit(context)
/python3.8/site-packages/sdv/single_table/base.py", line 384, in fit
processed_data = self._preprocess(data)
/python3.8/site-packages/sdv/single_table/base.py", line 328, in _preprocess
self.validate(data)
python3.8/site-packages/sdv/single_table/base.py", line 154, in validate
raise InvalidDataError(errors)

`sdv.errors.InvalidDataError: The provided data does not match the metadata:
Invalid values found for datetime column 'date_DID': [1.327536e+18, 1.329264e+18, 1.3306464e+18, '+ 96 more'].

Invalid values found for datetime column 'date_DCM': [1.0274688e+18, 1.1066976e+18, 1.1317536e+18, '+ 39 more'].

Invalid values found for datetime column 'date_HIP': [1.1605248e+18, 1.2881376e+18, 1.296432e+18, '+ 6 more'].

Invalid values found for datetime column 'DATE_DIS': [1.1605248e+18, 1.270944e+18, 1.3627008e+18, '+ 33 more'].

Invalid values found for datetime column 'data_DM': [0.0].`

I tried to convert those columns to numerical as suggested in #1485, but when converting the data back to datetime in the synthetic data, it gives a range of dates like 09-08-1768 and 10-03-1644.
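
For readers trying to interpret the error message, the "invalid" values appear to be the original dates expressed as nanoseconds since the Unix epoch and stored as floats. This is an assumption based on their magnitude and on the `unit='ns'` conversion used later in this thread; it is not confirmed by the maintainers. A minimal sketch that decodes the first three values reported for 'date_DID':

```python
import pandas as pd

# Values copied from the InvalidDataError message for column 'date_DID'
reported = pd.Series([1.327536e+18, 1.329264e+18, 1.3306464e+18])

# Assumption: the floats are nanoseconds since 1970-01-01
decoded = pd.to_datetime(reported.astype('int64'), unit='ns')

print(decoded.dt.strftime('%d/%m/%Y').tolist())
# ['26/01/2012', '15/02/2012', '02/03/2012']
```

Under that assumption, the values decode to ordinary calendar dates, which suggests the validation is seeing a numeric representation of the dates rather than corrupted data.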

Ng-ms added the new and question labels on Feb 2, 2024
@npatki
Contributor

npatki commented Feb 2, 2024

Thanks for filing this issue @Ng-ms. I'm a bit confused at the scenario you are describing.

The InvalidDataError indicates that there is a mismatch between the data and metadata. If you convert the data column to numerical, you would also need to update the metadata for that column to be numerical. The InvalidDataError means that your synthesizer has crashed, so you are unable to fit PARSynthesizer and sample from it.

when converting the data back to datetime in the synthetic data, it gives a range of dates like 09-08-1768 and 10-03-1644.

I am confused because this sentence implies that you already have synthetic data. How are you able to get synthetic data if the synthesizer crashed (with the InvalidDataError)? Something doesn't seem to add up.

It would be helpful if you could share the Python code that you are using to load data, modify it, create metadata, create the synthesizer, sample from it, etc. And also if you could indicate where the crash is happening.

npatki added the data:sequential and under discussion labels and removed the new label on Feb 2, 2024
@Ng-ms
Author

Ng-ms commented Feb 5, 2024

Sorry if my earlier messages were a bit unclear. Here is more information to explain the problem. I have tried two cases:

1. Using datetime columns as context without alteration: This leads to an InvalidDataError due to a mismatch between the data and the defined metadata, even though these columns are specified as datetime type in the metadata. This prevents fitting the PARSynthesizer.

from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

# Detect the metadata, then mark the sequence key and sequence index
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='ID_P', sdtype='id')
metadata.set_sequence_key(column_name='ID_P')
metadata.set_sequence_index(column_name='DATE_P')
print(metadata)

# The datetime columns are passed as context columns without alteration
print('start')
synthesizer = PARSynthesizer(
    metadata,
    epochs=150,
    context_columns=['data_DM', 'DATE_DIS', 'date_HIP', 'date_DCM', 'date_DID'],
    verbose=True,
    enforce_min_max_values=True,
    enforce_rounding=True,
    cuda=True,
)
# Fitting is where the InvalidDataError in the traceback above is raised
synthesizer.fit(df)

2. Converting datetime to numerical for synthesis: This results in synthetic data with unrealistic dates (e.g., 09-08-1768), indicating a problem in handling or converting these numerical values back to datetime.


import pandas as pd
from sdv.sequential import PARSynthesizer

context_date_columns = ['data_DM', 'DATE_DIS', 'date_HIP', 'date_DCM', 'date_DID']

# Convert each datetime context column to integer nanoseconds since the epoch
for col in context_date_columns:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y').astype('int64')

print(metadata)

# Fit the synthesizer and generate synthetic data
print('start')
synthesizer = PARSynthesizer(
    metadata,
    epochs=150,
    context_columns=context_date_columns,
    verbose=True,
    enforce_min_max_values=True,
    enforce_rounding=True,
    cuda=True,
)
synthesizer.fit(df)
print('end')

synthetic_data = synthesizer.sample(num_sequences=100, sequence_length=None)

# Convert the synthetic numerical values back to dates
for col in context_date_columns:
    synthetic_data[col] = pd.to_datetime(synthetic_data[col], unit='ns').dt.date


@Ng-ms
Author

Ng-ms commented Feb 21, 2024

Hello @npatki, do you have any ideas on how to solve this ?

@npatki
Contributor

npatki commented Mar 1, 2024

Hi @Ng-ms,

Thanks for confirming. The errors indicate that there are mismatches between how you are converting the data from datetime to numerical, and how you're converting them back from numerical to datetime. If you are doing any conversions, you also need to update your metadata as the sdtype is no longer datetime but numerical.

Here is a code snippet that may help:

import pandas as pd
from sdv.sequential import PARSynthesizer

# convert datetime columns to numerical (integer nanoseconds since the epoch)
data[COLUMN_NAME] = pd.to_datetime(data[COLUMN_NAME], format='%d/%m/%Y').astype('int64')

# update these columns to be sdtype 'numerical' in the metadata, as they are no longer datetime!
metadata.update_column(column_name=COLUMN_NAME, sdtype='numerical')

# save this version of metadata!
metadata.save_to_json(filepath='metadata_converted_context.json')

# now you can fit and sample
synthesizer = PARSynthesizer(metadata, epochs=150, context_columns=[COLUMN_NAME])
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_sequences=100)

# synthetic data will have numerical values. convert them to datetime
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.date
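
The conversion in the snippet above assumes every date is present. The thread does not say whether the reporter's columns contain missing dates, so the following is only a hedged sketch of one way to keep missing values as NaN during the round trip instead of casting them to an integer. The helper names `dates_to_ns` and `ns_to_dates` are hypothetical and not part of SDV or pandas.

```python
import pandas as pd


def dates_to_ns(series, fmt='%d/%m/%Y'):
    """Convert date strings to float nanoseconds since the epoch, NaN if missing."""
    parsed = pd.to_datetime(series, format=fmt)
    ns = pd.Series(float('nan'), index=parsed.index, dtype='float64')
    valid = parsed.notna()
    # Force nanosecond resolution so the integers always mean nanoseconds
    ns.loc[valid] = (
        parsed.loc[valid].astype('datetime64[ns]').astype('int64').astype('float64')
    )
    return ns


def ns_to_dates(series):
    """Convert float nanoseconds back to datetimes, NaT where the value is NaN."""
    dates = pd.Series(pd.NaT, index=series.index, dtype='datetime64[ns]')
    valid = series.notna()
    as_int = series.loc[valid].round().astype('int64')
    dates.loc[valid] = pd.to_datetime(as_int, unit='ns')
    return dates


# Toy data made up for illustration, including one missing date
raw = pd.Series(['13/01/1918', None, '26/01/2012'])
numeric = dates_to_ns(raw)
restored = ns_to_dates(numeric)
print(numeric.tolist())
print(restored.tolist())
```

Keeping missing dates as NaN means the column handed to the synthesizer contains only real timestamps plus explicit missing values, which makes the min/max of the numerical column easier to interpret.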

@Ng-ms
Author

Ng-ms commented Mar 1, 2024

Thank you @npatki.
I am actually updating the metadata; the problem is that I am getting very strange dates.
Screenshot from 2024-03-01 17-37-49

@npatki
Contributor

npatki commented Mar 1, 2024

Hi @Ng-ms, that is unfortunate to hear. As I mentioned in my previous message, you may want to double check how you are doing your conversion from datetime --> numerical, and back from numerical --> datetime. Good practice will be to inspect your data every step of the way. What does the input data look like? What are the min/max values in the input data for fit? Etc.

Unfortunately, there is only so much I can do with these screenshots. If you are able to provide access to your real data or metadata, as well as the full and complete code that you have currently in SDV, that will be helpful. If we are not able to replicate your issue, it is unlikely we will be able to provide any kinds of useful information. Please provide any other information you think will be helpful. Thanks.
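
The inspection suggested above (checking min/max values at each step) could be sketched as follows. This is an illustrative helper, not an SDV feature: the function name `range_report` is hypothetical and the toy data is made up, not taken from the reporter's dataset.

```python
import pandas as pd


def range_report(real, synthetic, columns):
    """Compare the min/max of real and synthetic datetime columns."""
    rows = []
    for col in columns:
        real_min, real_max = real[col].min(), real[col].max()
        # Flag synthetic values that fall outside the range seen in the real data
        outside = (synthetic[col] < real_min) | (synthetic[col] > real_max)
        rows.append({
            'column': col,
            'real_min': real_min,
            'real_max': real_max,
            'synthetic_min': synthetic[col].min(),
            'synthetic_max': synthetic[col].max(),
            'n_outside': int(outside.sum()),
        })
    return pd.DataFrame(rows).set_index('column')


# Toy data for illustration only
real = pd.DataFrame({
    'date_DID': pd.to_datetime(['26/01/2012', '15/02/2012', '02/03/2012'], format='%d/%m/%Y'),
})
synthetic = pd.DataFrame({
    'date_DID': pd.to_datetime(['01/02/2012', '09/08/1768'], format='%d/%m/%Y'),
})

report = range_report(real, synthetic, ['date_DID'])
print(report)
```

Running the same report on the numerical columns just before `fit` and just after `sample` would show at which step the out-of-range values first appear.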

@npatki
Contributor

npatki commented Apr 10, 2024

Hi @Ng-ms are you still encountering this problem?

Since this issue has been inactive for a while, I'm closing it off. But please feel free to reply with any more info. We can always reopen the issue to continue investigation.

npatki closed this as completed on Apr 10, 2024
npatki added the resolution:cannot replicate label and removed the under discussion label on Apr 10, 2024
@Ng-ms
Author

Ng-ms commented Apr 18, 2024

Hi @npatki, yes, unfortunately I am still having this problem. Even though my conversion of the data is right, I am still getting illogical dates (outside the min and max).

@Ng-ms
Author

Ng-ms commented May 8, 2024

@npatki Hello, I am still having the same error. Sometimes just one column gives these unrealistic dates, and sometimes (if I train the model longer) more than one date column has unrealistic dates like (1700, 1898).

@ardulat

ardulat commented Jul 10, 2024

Hello @npatki! I'm experiencing a similar issue while dealing with my data. The issue I am getting is sdv.errors.InvalidDataError: The provided data does not match the metadata: Invalid values found for datetime column 'participant__date_of_birth': [-1.1264832e+18, -1.1267424e+18, -1.1268288e+18, '+ 5116 more']. when fitting the data. I tried debugging the .fit method for PARSynthesizer and found that there is a call to the preprocess function, which alters my data. Here is an example:

Input data:

0        1918-01-13
1        1918-01-13
2        1918-01-13
3        1918-01-13
4        1918-01-13
            ...
134358   1930-05-17
134359   1930-05-17
134360   1930-05-17
134361   1930-05-17
134362   1930-05-17
Name: participant__date_of_birth, Length: 134363, dtype: datetime64[ns]

The input data after a call to https://github.com/sdv-dev/SDV/blob/main/sdv/single_table/base.py#L471:

primary_key
1        -1.639958e+18
2        -1.639958e+18
3        -1.639958e+18
4        -1.639958e+18
5        -1.639958e+18
              ...
134359   -1.250554e+18
134360   -1.250554e+18
134361   -1.250554e+18
134362   -1.250554e+18
134363   -1.250554e+18
Name: participant__date_of_birth, Length: 134363, dtype: float64

Can you please suggest a good way to handle datetime columns in the context?
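
As a quick check on the output shown above, the float values are consistent with the input dates expressed as nanoseconds since the Unix epoch, with dates before 1970 coming out negative. This is an observation about the numbers only, using plain pandas; it makes no claim about what SDV does internally beyond what the printed output shows.

```python
import pandas as pd

# First and last dates from the input data shown above
birth_dates = pd.Series(pd.to_datetime(['1918-01-13', '1930-05-17']))

# Nanoseconds since 1970-01-01; dates before the epoch are negative
as_ns = birth_dates.astype('datetime64[ns]').astype('int64')

print(as_ns.tolist())
# [-1639958400000000000, -1250553600000000000]

# Rounded to 7 significant digits these match the printed float column
print([f'{value:.6e}' for value in as_ns])
# ['-1.639958e+18', '-1.250554e+18']
```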


3 participants