
NaN values for numerical variables DISAPPEAR when using CTGANSynthesizer #2288

Open
wilcovanvorstenbosch opened this issue Nov 11, 2024 · 16 comments
Assignees
Labels
bug Something isn't working

Comments

@wilcovanvorstenbosch

Environment Details

  • SDV version: 1.17.1
  • Pandas version: 2.2.3
  • Python version: 3.11.10
  • Operating System: Windows

Error Description

I tried to synthesize a DataFrame containing ~2 million rows of loan data. The dataset has 9 numerical variables and 11 categorical variables. The problem occurs only with the numerical variables: during synthesis, all NaN values disappear, even when the original variable had 90% missing values. I was told SDV should be able to handle this, so I am left confused. Any help would be appreciated!

To be clear: the missing values in the numerical columns are of type np.nan

Steps to reproduce

import pandas as pd

from sdv.metadata import Metadata
from sdv.single_table import CTGANSynthesizer

df_loan_data = pd.read_csv('....csv')

metadata = Metadata.detect_from_dataframe(data=df_loan_data)

synthesizer = CTGANSynthesizer(
    metadata=metadata,
    embedding_dim=128,
    generator_dim=(256, 256),
    discriminator_dim=(256, 256),
    generator_lr=2e-4,
    generator_decay=1e-6,
    discriminator_lr=2e-4,
    discriminator_decay=1e-6,
    discriminator_steps=1,
    batch_size=50,
    pac=5,
    verbose=True,
    epochs=1000,
)

synthesizer.fit(df_loan_data)

synthetic_loan_data = synthesizer.sample(num_rows=10000)
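To make the symptom concrete, here is a small self-contained sketch of the check that exposes it: the per-column NaN fraction of the real vs. synthetic data. The toy DataFrames and the `nan_report` helper are stand-ins of my own (the real loan data is not shareable), not SDV code.

```python
import numpy as np
import pandas as pd

def nan_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Per-column fraction of missing values in the real vs. synthetic data."""
    cols = [c for c in real.columns if c in synthetic.columns]
    return pd.DataFrame({
        'real_nan_frac': real[cols].isna().mean(),
        'synthetic_nan_frac': synthetic[cols].isna().mean(),
    })

# Toy stand-ins: the real column is 90% missing, while the synthetic one
# has no NaNs at all -- mirroring the reported symptom.
real = pd.DataFrame({'col7': [np.nan] * 9 + [1.0]})
synthetic = pd.DataFrame({'col7': [1.0] * 10})

report = nan_report(real, synthetic)
print(report)
```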

I've attached a screenshot to show you the issue. Unfortunately I am very hesitant to share any of the original (meta)data.
Screenshot 2024-11-04 102139

@wilcovanvorstenbosch wilcovanvorstenbosch added bug Something isn't working new Automatic label applied to new issues labels Nov 11, 2024
@srinify
Contributor

srinify commented Nov 12, 2024

Hi there @wilcovanvorstenbosch, are you able to share just the sdtypes in your metadata for your numerical columns? The detect_from_dataframe() method does a good job of auto-detecting the correct SDV sdtypes for each column, but we always recommend double-checking them to ensure they align with what you expect, so I usually start my debugging there.

You can either:

  • Use our handy Metadata.anonymize() method to quickly mask your column names and share the output with us. This way, we will know the sdtypes but not what your columns represent.
  • Or you can just display the metadata (print(metadata)) and share the sdtypes with us for your numerical columns.

With the sdtypes, I can try to reproduce the issue on my end!

@srinify srinify added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Nov 12, 2024
@srinify srinify self-assigned this Nov 12, 2024
@wilcovanvorstenbosch
Author

Dear Srini (@srinify),

First of all: I'm delighted that you are willing to help out. It is beyond my expectations, and I greatly appreciate it.
This is my metadata:

{
  "tables": {
    "table1": {
      "primary_key": "col1",
      "columns": {
        "col1": { "sdtype": "id" },
        "col2": { "sdtype": "categorical" },
        "col3": { "sdtype": "categorical" },
        "col4": { "sdtype": "categorical" },
        "col5": { "sdtype": "categorical" },
        "col6": { "sdtype": "categorical" },
        "col7": { "sdtype": "numerical" },
        "col8": { "sdtype": "numerical" },
        "col9": { "sdtype": "categorical" },
        "col10": { "sdtype": "categorical" },
        "col11": { "sdtype": "categorical" },
        "col12": { "sdtype": "categorical" },
        "col13": { "sdtype": "categorical" },
        "col14": { "sdtype": "categorical" },
        "col15": { "sdtype": "numerical" },
        "col16": { "sdtype": "numerical" },
        "col17": { "sdtype": "numerical" },
        "col18": { "sdtype": "numerical" },
        "col19": { "sdtype": "numerical" },
        "col20": { "sdtype": "numerical" },
        "col21": { "sdtype": "numerical" }
      }
    }
  },
  "relationships": [],
  "METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}

Hope this suffices. If you need any more info, let me know.

Kind regards,
Wilco

@srinify
Contributor

srinify commented Nov 18, 2024

Hi @wilcovanvorstenbosch I tried to reproduce this issue with a fake dataset (with a few thousand rows) but wasn't able to unfortunately. My fake dataset with lots of missing values still had the same ratios in the synthetic data when using your code.

Were you able to train CTGANSynthesizer on your full dataset (with 2 million rows), or did you use a subset? I'm asking because training on such a large dataset takes an incredibly long time, so I wanted to clarify the size of your training data!

Do you mind swapping CTGANSynthesizer out with GaussianCopulaSynthesizer instead to see if that helps unblock you?

@wilcovanvorstenbosch
Author

Dear Srini,

I am indeed using a subset for now: I'm randomly sampling 10,000 rows from the original dataset to test the package. How does GaussianCopulaSynthesizer compare to CTGAN? Does it retain correlations between attributes in a similar fashion?

Either way, I would like to compare the results against the CTGAN synthesizer, so will be looking to fix this.
I'll try and see if I can create a 'fake' dataset that recreates the issue.

Do you think the parameters of the model could somehow prevent this issue?

Kind regards,
Wilco

@wilcovanvorstenbosch
Author

Just to clarify @srinify: the values are not missing at random. Often, the variable was not relevant for a specific row because of a certain value in another variable. I was hoping that the synthesizer would be able to pick up on these correlations. It should, right?

@srinify
Copy link
Contributor

srinify commented Nov 18, 2024

@wilcovanvorstenbosch

Does it retain correlations between attributes

Definitely: all of our synthesizers try their best to learn statistical patterns between columns. In fact, we include the correlation between pairs of columns (we call them "Column Pair Trends") in our Quality Report. This report compares the similarity of the column-level distributions (we call them "Column Shapes") and the Column Pair Trends between your real and synthetic data. You can run and compare these reports for each batch of synthetic data you create, which can also help you compare data created using different synthesizers or synthesizer parameters.
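As a rough, library-free illustration of the "Column Pair Trends" idea (the actual Quality Report lives in SDV's evaluation module; this pandas sketch is only an analogue, and `pairwise_corr_diff` is a made-up helper), one can compare the pairwise correlation matrices of the real and synthetic frames:

```python
import numpy as np
import pandas as pd

def pairwise_corr_diff(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the off-diagonal entries of the
    two correlation matrices: 0.0 means pairwise trends match exactly."""
    num_cols = real.select_dtypes('number').columns
    diff = (real[num_cols].corr() - synthetic[num_cols].corr()).abs()
    off_diag = ~np.eye(len(num_cols), dtype=bool)
    return float(diff.values[off_diag].mean())

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
real = pd.DataFrame({'a': x, 'b': x + rng.normal(scale=0.1, size=1000)})

good = real + rng.normal(scale=0.05, size=(1000, 2))   # keeps the a~b trend
bad = real.apply(lambda c: rng.permutation(c.values))  # destroys it

print(pairwise_corr_diff(real, good))  # near 0: trend preserved
print(pairwise_corr_diff(real, bad))   # large: trend lost
```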

I'll try and see if I can create a 'fake' dataset

That would be awesome if you're able to!

the values are not missing at random. Often, the variable was not relevant for a specific row

Oh that's interesting. So are you saying that the synthetic data contains NaN values but their occurrences aren't in line with some inter-column logic that you expect? Or are you seeing no NaN values at all in your synthetic data for the numeric columns?

We actually created a feature called Constraints to help you define specific rules that the synthetic data must follow. We have a few pre-defined constraint classes or you can create custom constraint classes with more open-ended logic.

@wilcovanvorstenbosch
Author

In my synthetic data, there are no NaN values for the numeric columns.
In the original data, whether a column has a NaN is not random.

By the way, I found out that the discriminator and generator loss values are all over the place; they do not seem to converge. This might be the issue. I will try tweaking the parameters, although I found a comment of yours on another issue saying that this is (a) difficult and (b) outside your expertise, and that sometimes the data simply does not suit this synthesizer?

What do you mean by that last statement? I imagined that a GAN-based synthesizer would be better than other methods at dealing with ANY type of dataset. Have you found that this is not true?

@npatki
Contributor

npatki commented Nov 20, 2024

Hi @wilcovanvorstenbosch and @srinify, quickly jumping in here with a few clarifications.

Details about NaN values

We expect CTGAN to produce synthetic NaN values at roughly the same proportion as the real data.

Internally, NaN-values are handled by SDV at a level that is outside the scope of the internal GAN algorithm. Therefore:

  • GAN convergence -- or lack thereof -- is unrelated to NaN-value handling. Though here is a blog article that discusses convergence in more detail. I suggest we can further discuss this in a new issue if you'd like?
  • Similarly, the parameters of your model (as shown in the code) should not really affect NaN values
  • By default, SDV will assume your data is missing completely at random. I can walk you through how to update this if you'd like, but it might be best to do so in a different issue (and perhaps resolve the current issue with NaNs first).

The only thing that might affect NaN values is if you are (a) manipulating the data in any way after reading from CSV, or (b) making customizations such as updating transformers/adding constraints. This doesn't seem to be the case.

Diagnosing this current issue: Synthetic data doesn't contain NaNs

Unfortunately, neither @srinify nor I have been able to replicate this. We have tried all sorts of combinations of sdtypes (numerical, categorical) with the same proportion of np.nan values. CTGANSynthesizer always gives us back np.nan values as expected. I think @srinify's suggestion to try GaussianCopula is to help discover whether this bug is isolated to CTGAN.

One more thing that may help: For the column you are visualizing, I know that the missing values are stored as np.nan. But what is the overall column stored as? print(df_loan_data[COLUMN_NAME].dtype)

Is there any other information you can provide that might be useful to replicate? Perhaps if you are able to replicate this on unrelated (or made up) data, you can share that? In the meantime, we will continue trying to replicate but it's proving to be a bit tricky!

Other Notes

  • From looking at your visualization, my hunch is that your SDV synthesizer is producing a lot of data points at the mean value that are actually supposed to be marked as NaN. (That's why there's such a high peak there.) This was an issue on older versions of SDV, so I'm surprised it's happening now.
  • It is not always the case that CTGAN (or GAN-based modeling in general) is better than statistical methods such as GaussianCopula. In fact, many users have preferred GaussianCopula because it's faster, and easier to customize to get higher quality. But totally up to you!

@wilcovanvorstenbosch
Author

Sorry for the late reply. I was busy with other work, but will be working on this topic for most of this week so I'll try to further clarify the issue and maybe create a dataset that I can share.

Regarding your question:

The column that I was visualising is of type Int64.
However, the same issue occurs with columns of type float64.

I'm currently testing the GaussianCopula method to see if the same issue persists.

@wilcovanvorstenbosch
Author

Update @npatki @srinify ,

For my dataset, the same problem does not occur with the GaussianCopula method.
This method has some other issues that I'd like to tweak, but see below for the distribution of synthetic data for the same column as shown before:

I used the exact same DataFrame and metadata.

[image: distribution of the synthetic data for the same column, under GaussianCopula]

@wilcovanvorstenbosch
Author

wilcovanvorstenbosch commented Nov 25, 2024

From looking at your visualization, my hunch is that your SDV synthesizer is producing a lot of data points at the mean value that are actually supposed to be marked as NaN. (That's why there's such a high peak there.) This was an issue on older versions of SDV, so I'm surprised it's happening now.

One note on this: the NaN values are not synthesized exactly at the mean value, but they are indeed close.
Below are the results of synth_loan_data[COLUMN_NAME].value_counts(dropna=False, ascending=False):

144.306535    1
144.238072    1
92.817252     1
144.596491    1
144.399917    1
             ..
144.123484    1
144.397720    1
144.376040    1
144.697407    1
144.344513    1
Name: count, Length: 10000, dtype: int64

You mentioned that NaN values are handled outside of the GANs. Can you point me to the pieces of code that handle this? I can't seem to find it.

@npatki
Contributor

npatki commented Nov 25, 2024

Hi @wilcovanvorstenbosch, no problem at all.

I'm glad to hear that the GaussianCopula synthesizer does not have this problem with NaNs. If you want, you're welcome to file a new issue about improving the distribution quality so we can discuss that separately. (Hint: With GaussianCopula, there is a lot more you can do to control and customize the quality. For example, you can specify the exact shape you want using the numerical_distributions parameter -- see the API docs.)

I'm going to update the title of this current NaN issue to mention that it is for CTGAN only. One thing that stands out to me is that you mention your column is dtype Int64. This is a bit odd because whenever I do pd.read_csv to load my data, it always reads in numerical columns as float64. Also, it is my understanding that an Int64 column will represent null values using pd.NA rather than np.nan, so something doesn't seem to match up.

To get an Int64 column, are you doing any kind of data manipulation after reading in the CSV? If so, it would be helpful to share.
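A minimal pandas demonstration of the dtype distinction raised here (pure pandas, no SDV involved):

```python
import numpy as np
import pandas as pd

# Default numeric parsing (e.g. via pd.read_csv) yields float64, with
# missing entries stored as np.nan:
float_col = pd.Series([1.0, None])
print(float_col.dtype)               # float64
print(np.isnan(float_col.iloc[1]))   # True

# A nullable Int64 column represents missing values as pd.NA instead:
int_col = pd.Series([1, None], dtype='Int64')
print(int_col.dtype)                 # Int64
print(int_col.iloc[1] is pd.NA)      # True
```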

You mentioned that NaN values are handled outside of the GANs. Can you point me to the pieces of code that handle this? I can't seem to find it.

Sure, NaN handling is done during the data pre-processing stage, with the help of RDT transformers. Underlying algorithms (such as CTGAN) are not designed to work with NaNs, so this pre-processing stage will typically fill the NaN values with some random other values (and keep track of what it did, so it is reversible later).

After fitting, you can see which transformers were used, and whether it learned the % of missing values in your column:

all_transformers = synthesizer.get_transformers()

# add the name of the numerical column to debug
column_transformer = all_transformers[COLUMN_NAME]

try:
    print('Learned proportion of missing values:', column_transformer.null_transformer._null_percentage)
except AttributeError:
    print('Transformer used:', column_transformer)

For more information see the RDT documentation site and RDT GitHub.
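For intuition, here is a greatly simplified, hypothetical stand-in for that behavior. This is not the actual RDT NullTransformer code; the class name and the mean-fill choice are illustrative assumptions. It learns the NaN percentage, fills NaNs so the downstream model never sees them, and randomly reinserts them on reverse.

```python
import numpy as np
import pandas as pd

class ToyNullTransformer:
    """Hypothetical, simplified stand-in for RDT's NullTransformer:
    learn the NaN percentage, fill NaNs for modeling, reinsert on reverse."""

    def fit(self, column: pd.Series) -> 'ToyNullTransformer':
        self._null_percentage = column.isna().mean()
        self._fill_value = column.mean()  # mean-fill is an assumption here
        return self

    def transform(self, column: pd.Series) -> pd.Series:
        # The downstream model (e.g. a GAN) never sees NaNs.
        return column.fillna(self._fill_value)

    def reverse_transform(self, column: pd.Series, rng=None) -> pd.Series:
        # Missing-completely-at-random: reinsert NaNs at the learned rate.
        if rng is None:
            rng = np.random.default_rng(0)
        mask = rng.random(len(column)) < self._null_percentage
        return column.mask(mask)

col = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0, np.nan, np.nan, np.nan])
t = ToyNullTransformer().fit(col)
print(t._null_percentage)                # 0.625, matching the input column
print(t.transform(col).isna().any())     # False
restored = t.reverse_transform(pd.Series(np.full(1000, 30.0)))
print(round(restored.isna().mean(), 2))  # close to 0.625
```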

@npatki npatki changed the title NaN values for numerical variables DISAPPEAR NaN values for numerical variables DISAPPEAR when using CTGANSynthesizer Nov 25, 2024
@wilcovanvorstenbosch
Copy link
Author

wilcovanvorstenbosch commented Nov 25, 2024

Dear @npatki ,

In fact, I am loading the dataset directly from an SQL database; I did not think this would matter much.
Like I said earlier, the issue is not limited to a specific integer or float dtype; it affects every numerical column in my dataset.

Still, I put it to the test: I saved the data as .csv and then loaded it with pd.read_csv().
Indeed, the column was loaded with dtype float64, and the same problem occurs.

I will come back to you when I have some 'shareable' information about my dataset.
In an earlier reply, you said:

SDV will assume your data is missing completely at random. I can walk you through how to update this if you'd like.

Is this true for the GaussianCopula synthesizer as well?
If so, I would very much appreciate some help towards fixing this.

Kind regards,
Wilco

@npatki
Contributor

npatki commented Nov 25, 2024

Hi @wilcovanvorstenbosch

Is this true for the GaussianCopula synthesizer as well?
If so, I would very much appreciate some help towards fixing this.

I have filed a separate issue, #2310 dedicated to discussing this particular topic (data missing completely at random vs. not).

Re the original issue for being unable to sample NaN values:

In fact, I am loading the dataset directly from an SQL database; I did not think this would matter much.
Like I said earlier, the issue is not limited to a specific integer or float dtype; it affects every numerical column in my dataset.

I understand that the issue is happening for numerical columns in general. I just wanted to sanity-check the dtypes because SDV has only been tested with object, float64, int64, and datetime64, with missing values stored as np.nan. All of our pre-processing and algorithms do data manipulation, so a different dtype or missing-value representation may be causing issues.

Perhaps there is some other property of your dataset (unrelated to dtype) that is causing the bug, but I'm not sure what it could be. Please do let us know if you have any shareable information! This problem has been trickier for us to replicate than most :)

@wilcovanvorstenbosch
Author

wilcovanvorstenbosch commented Nov 26, 2024

Dear @npatki ,

I am pretty sure it has nothing to do with the dtypes.
I just ran CTGAN with a tiny sample, and this time the NaNs did not disappear for the column I showed.
Still, the NaNs are all over the place; I'm trying to grasp what RDT does with those values.
It seems very unpredictable.

From previous tests, it looked like it generated values around the mean.
But this does not always happen. Sometimes the distribution is similar, but with no missing values.

Edit:
I had a look at column_transformer.null_transformer._null_percentage.
The percentages are the same as in the original dataset.
Still, the output is way off.

@npatki
Copy link
Contributor

npatki commented Nov 26, 2024

Hi @wilcovanvorstenbosch,

Still, the NaNs are all over the place. I'm trying to grasp what RDT does with those values.
It seems very unpredictable.

As I mentioned earlier, by default all SDV synthesizers will consider the data "missing completely at random". I.e. they just learn the % of NaN values and randomly add them back into your synthetic data. Please check #2310 for more details.
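A small sketch (toy data of my own, not SDV code) of why missing-completely-at-random reinsertion loses exactly the kind of inter-column pattern described here: the overall NaN rate matches, but the dependence of missingness on another column does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# "Real" data: balance is missing exactly when the loan is closed,
# so missingness depends on another column (not at random).
status = rng.choice(['open', 'closed'], size=n)
balance = np.where(status == 'open', rng.normal(1000, 100, n), np.nan)
real = pd.DataFrame({'status': status, 'balance': balance})

# MCAR-style reinsertion: match the overall NaN rate, but place the
# NaNs uniformly at random, ignoring the status column.
nan_rate = real['balance'].isna().mean()
synthetic = pd.DataFrame({'status': status, 'balance': rng.normal(1000, 100, n)})
synthetic.loc[rng.random(n) < nan_rate, 'balance'] = np.nan

# Overall rates match, but the per-status pattern is lost: in the real
# data 'closed' rows are always NaN; in the synthetic data they are not.
print(real['balance'].isna().groupby(real['status']).mean())
print(synthetic['balance'].isna().groupby(synthetic['status']).mean())
```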

The percentages are the same as in the original dataset.
Still, the output is way off.

Very strange! We will continue to investigate with this new info that a smaller subsample does not have the same issue as the original (full) dataset.

@srinify srinify assigned npatki and unassigned srinify Nov 26, 2024
@npatki npatki removed the under discussion Issue is currently being discussed label Dec 17, 2024