Can it be used for relational datasets? #316
Replies: 7 comments 1 reply
-
This library only works on single tables. To deal with multiple tables, you'd need to combine PyTorch Frame with PyG; you can take a look at RelBench for more examples.
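Not an official recipe from either library, but as a rough illustration of how multi-table data maps onto the kind of graph PyG consumes: each row becomes a node, and each foreign-key reference becomes an edge. A minimal sketch with made-up table and column names:

```python
import numpy as np
import pandas as pd

# Two hypothetical tables linked by a foreign key.
customers = pd.DataFrame({"customer_id": [10, 11, 12]})
orders = pd.DataFrame({"order_id": [1, 2, 3, 4],
                       "customer_id": [10, 10, 12, 11]})

# Map primary keys to contiguous node indices per node type.
cust_idx = {pk: i for i, pk in enumerate(customers["customer_id"])}
order_idx = {pk: i for i, pk in enumerate(orders["order_id"])}

# Each FK reference becomes one (customer -> order) edge.
edge_index = np.array([
    [cust_idx[fk] for fk in orders["customer_id"]],
    [order_idx[pk] for pk in orders["order_id"]],
])
print(edge_index)  # [[0 0 2 1], [0 1 2 3]]
```

An array like this is what you would assign to an edge type in PyG's `HeteroData`, with per-table node features coming from PyTorch Frame's column encoders.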
-
@yiweny Thanks for the info! Is there an example, or do I have to do it from scratch? I was wondering if there is a demo notebook showing how to combine both libraries. It is difficult for me to do on my own because both libraries are new to me.
-
@echatzikyriakidis Here is an example: https://github.com/snap-stanford/relbench/blob/main/examples/gnn.py But I am not sure whether it's still WIP.
-
I see that the example uses the rel-stackex-engage task by default, which the paper describes as a predictive task: predict whether a user will be an active contributor to the site. I am wondering whether there is already a generative task, i.e. a task for generating new records, e.g. creating a new customer along with the related rows in other tables.
-
Hey @echatzikyriakidis, thanks for creating this issue. I'm curious about your use case to see if there's anything we can do on our side! Noob question: do you know what this task is generally called in the literature, by any chance? I'm not very familiar with the task you mentioned, where you want to generate records in tables. There may not be examples you can use out of the box at the moment. However, as @yiweny mentioned above, I think you're already able to train some model for your task by combining PyTorch Frame with PyG.
-
Hi @echatzikyriakidis, I think you may be referring to generative modeling of relational databases. As far as I know, there is no such work on generating synthetic table rows. I am curious why you want to generate new rows. What kind of application do you have in mind?
-
@akihironitta @weihua916 Hi all! Yes, it is exactly that: generative modeling for relational databases.

Most generative models on tabular data model a single table, and the trained model can later be used to generate new records. In those cases, categorical columns usually do not generate new values (which is correct, as "customer type" has a limited set of possible values), while for numerical/datetime columns new values can be sampled from the learned distributions.

However, my use case is a relational database with more than one table, interconnected with primary and foreign keys. The table columns can be text, numerical, or datetime, and some have high cardinality while others have low cardinality. The intuition is that we need to learn the internal statistics and correlations of all possible pairwise columns in the relational schema, and then generate new records for each table that can be connected in a meaningful way; otherwise, the referential integrity of the data (the PK/FK relationships) cannot be enforced.

Use cases include testing analytics or machine learning systems, or sharing synthetic data that follows the same distributions as the real data when the real data cannot be shared. There are also cases where some columns contain PII, so ideally the model would generate new values that don't exist in the real training data (e.g. SSNs, passwords, etc.). Usually, though, this is handled in post-processing, and the data is masked so that the model is not trained on personal data.

Again, this is the overall idea: a relational database with many tables is given. We load the tables into dataframes, and we also know from the schema how they are interconnected. The challenge is to train one or more models to learn the relationships that exist in the data. After learning these relationships and patterns, we can use the models at inference time to generate new records that respect (as much as possible) the same patterns: cardinality (how many orders will be created for a single customer?), distributions (what will the amount of an order be?), and pairwise correlations (if a customer's country is USA, the city should probably be one of the cities in the USA). So far, I haven't found such a generative algorithm. Thanks!
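To make the idea concrete, here is a deliberately naive baseline sketch (not a feature of any library discussed here): fit empirical distributions for each column plus a children-per-parent count, then sample new tables with fresh primary keys and valid foreign keys. All table and column names are hypothetical, and independent per-column sampling ignores the pairwise correlations a real model would have to learn.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy "real" database: customers and their orders.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "country": ["USA", "USA", "DE"]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                       "customer_id": [1, 1, 2, 3],
                       "amount": [5.0, 7.5, 3.2, 9.9]})

# "Learned" statistics: empirical column values and FK cardinality.
country_vals = customers["country"].to_numpy()
amount_vals = orders["amount"].to_numpy()
orders_per_customer = orders.groupby("customer_id").size().to_numpy()

def sample_db(n_customers):
    new_customers = pd.DataFrame({
        "customer_id": np.arange(n_customers),            # fresh PKs
        "country": rng.choice(country_vals, n_customers),
    })
    rows = []
    for cid in new_customers["customer_id"]:
        k = int(rng.choice(orders_per_customer))          # sampled cardinality
        for _ in range(k):
            rows.append({"customer_id": cid,              # valid FK by construction
                         "amount": float(rng.choice(amount_vals))})
    new_orders = pd.DataFrame(rows)
    new_orders["order_id"] = np.arange(len(new_orders))   # fresh PKs
    return new_customers, new_orders

c, o = sample_db(5)
# Referential integrity holds because FKs are drawn from the new PKs.
assert o["customer_id"].isin(c["customer_id"]).all()
```

A real solution would replace the independent `rng.choice` draws with models that capture cross-column and cross-table correlations, but the FK-wiring step would look much the same.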
-
Is it possible for the library to be used for relational datasets? If we load the database data into multiple dataframes (e.g. a Customers table, an Orders table, and a Products table), can we use the library to generate data that is both synthetic and related? A customer could have 1-N orders, and each order could have 1-N products, depending on the distributions found in the database. This is the idea. So far, I see that it only works on single tables.