Can it be used for relational datasets? #316
Replies: 7 comments 1 reply
-
This library only works on single tables. To deal with multiple tables, you'd need to combine PyTorch Frame with PyG; you can take a look at RelBench for more examples.
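Not an official recipe from either library, but as a rough illustration of how multi-table data maps onto the kind of graph PyG consumes: each row becomes a node, and each foreign-key reference becomes an edge. A minimal sketch with made-up table and column names:

```python
import numpy as np
import pandas as pd

# Two hypothetical tables linked by a foreign key.
customers = pd.DataFrame({"customer_id": [10, 11, 12]})
orders = pd.DataFrame({"order_id": [1, 2, 3, 4],
                       "customer_id": [10, 10, 12, 11]})

# Map primary keys to contiguous node indices per node type.
cust_idx = {pk: i for i, pk in enumerate(customers["customer_id"])}
order_idx = {pk: i for i, pk in enumerate(orders["order_id"])}

# Each FK reference becomes one (customer -> order) edge.
edge_index = np.array([
    [cust_idx[fk] for fk in orders["customer_id"]],
    [order_idx[pk] for pk in orders["order_id"]],
])
print(edge_index)  # [[0 0 2 1], [0 1 2 3]]
```

An array like this is what you would assign to an edge type in PyG's `HeteroData`, with per-table node features coming from PyTorch Frame's column encoders.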
-
@yiweny Thanks for the info! Is there an example, or do I have to do it from scratch? I was wondering if there is a demo notebook showing how to combine both libraries. It is difficult for me to do on my own because both libraries are new to me.
-
@echatzikyriakidis Here is an example: https://github.com/snap-stanford/relbench/blob/main/examples/gnn.py But I am not sure whether it's still WIP.
-
I see that the example uses the rel-stackex-engage task by default, which the paper describes as a predictive task: predict whether a user will be an active contributor to the site. I am wondering whether there is already a generative task, i.e. a task for generating new records, e.g. creating a new customer along with the related rows in other tables.
-
Hey @echatzikyriakidis, thanks for creating this issue. I'm curious about your use case to see if there's anything we can do on our side! Noob question: do you know what this task is generally called in the literature, by any chance? I'm not very familiar with the task you mentioned, where you want to generate records in tables. There may not be examples you can use out of the box at the moment. However, as @yiweny mentioned above, I think you're already able to train some model for your task by combining PyTorch Frame with PyG.
-
Hi @echatzikyriakidis, I think you may be referring to generative modeling of relational databases. As far as I know, there is no such work on generating synthetic table rows. I am curious why you want to generate new rows. What kind of application do you have in mind?
-
@akihironitta @weihua916 Hi all! Yes, it is exactly that: generative modeling for relational databases.

Most generative models on tabular data model a single table, and the trained model can later be used to generate new records. In those cases, categorical columns usually do not generate new values (which is correct, as "customer type" has a limited set of possible values), while for numerical/datetime columns new values can be sampled from the learned distributions.

However, my use case is a relational database with more than one table, interconnected with primary and foreign keys. The table columns can be text, numerical, or datetime, and some have high cardinality while others have low cardinality. The intuition is that we need to learn the internal statistics and correlations of all possible pairwise columns in the relational schema, and then generate new records for each table that can be connected in a meaningful way; otherwise, the referential integrity of the data (the PK/FK relationships) cannot be enforced.

Use cases include testing analytics or machine learning systems, or sharing synthetic data that follows the same distributions as the real data when the real data cannot be shared. There are also cases where some columns contain PII, so ideally the model would generate new values that don't exist in the real training data (e.g. SSNs, passwords, etc.). Usually, though, this is handled in post-processing, and the data is masked so that the model is not trained on personal data.

Again, this is the overall idea: a relational database with many tables is given. We load the tables into dataframes, and we also know from the schema how they are interconnected. The challenge is to train one or more models to learn the relationships that exist in the data. After learning these relationships and patterns, we can use the models at inference time to generate new records that respect (as much as possible) the same patterns: cardinality (how many orders will be created for a single customer?), distributions (what will the amount of an order be?), and pairwise correlations (if a customer's country is USA, the city should probably be one of the cities in the USA). So far, I haven't found such a generative algorithm. Thanks!
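To make the idea concrete, here is a deliberately naive baseline sketch (not a feature of any library discussed here): fit empirical distributions for each column plus a children-per-parent count, then sample new tables with fresh primary keys and valid foreign keys. All table and column names are hypothetical, and independent per-column sampling ignores the pairwise correlations a real model would have to learn.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy "real" database: customers and their orders.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "country": ["USA", "USA", "DE"]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                       "customer_id": [1, 1, 2, 3],
                       "amount": [5.0, 7.5, 3.2, 9.9]})

# "Learned" statistics: empirical column values and FK cardinality.
country_vals = customers["country"].to_numpy()
amount_vals = orders["amount"].to_numpy()
orders_per_customer = orders.groupby("customer_id").size().to_numpy()

def sample_db(n_customers):
    new_customers = pd.DataFrame({
        "customer_id": np.arange(n_customers),            # fresh PKs
        "country": rng.choice(country_vals, n_customers),
    })
    rows = []
    for cid in new_customers["customer_id"]:
        k = int(rng.choice(orders_per_customer))          # sampled cardinality
        for _ in range(k):
            rows.append({"customer_id": cid,              # valid FK by construction
                         "amount": float(rng.choice(amount_vals))})
    new_orders = pd.DataFrame(rows)
    new_orders["order_id"] = np.arange(len(new_orders))   # fresh PKs
    return new_customers, new_orders

c, o = sample_db(5)
# Referential integrity holds because FKs are drawn from the new PKs.
assert o["customer_id"].isin(c["customer_id"]).all()
```

A real solution would replace the independent `rng.choice` draws with models that capture cross-column and cross-table correlations, but the FK-wiring step would look much the same.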
-
Is it possible for the library to be used for relational datasets? If we load the database data into multiple dataframes (e.g. a Customers table, an Orders table, and a Products table), can we use the library to generate data that is both synthetic and related? A customer could have 1-N orders, and each order could have 1-N products, depending on the distributions found in the database. This is the idea. So far, I see that it only works on single tables.