
Dataset generation script #3

Open
imoneoi opened this issue Nov 30, 2023 · 1 comment


imoneoi commented Nov 30, 2023

Can you share your dataset generation script for symbolic SQL data? I found some invalid SQL and wanted to improve it.

Some column names contain spaces and appear unquoted in the generated queries, which makes the SQL invalid, as shown in the example below.

[ header: no. | country | 2009 winter universiade | 2007 wjcc | 2007 wwcc | 2008 wjcc | 2008 wwcc | points row 1 : 1 | canada | 24 | 12 | 9 | 12 | 10 | 67
row 2 : 2 | china | 28 | None | 14 | 4 | 6 | 52
row 3 : 3 | sweden | 10 | 5 | 12 | 14 | 9 | 50
row 4 : 4 | great britain | 16 | 14 | 5 | 1 | 12 | 48
row 5 : 5 | russia | 20 | 8 | 6 | 6 | 5 | 45
row 6 : 6 | united states | 4 | 6 | 4 | 10 | 8 | 32
row 7 : 7 | switzerland | None | 10 | 8 | 8 | 3 | 29
row 8 : 8 | germany | None | None | 7 | 2 | 14 | 23
row 9 : 9 | denmark | None | 3 | 10 | None | 7 | 20
row 10 : 10 | czech republic | 12 | 4 | None | 3 | None | 19
row 11 : 11 | south korea | 8 | None | 3 | None | None | 11
row 12 : 12 | japan | 6 | 1 | None | None | 2 | 9
row 13 : 13 | france | None | 2 | None | 5 | None | 7
row 14 : 14 | norway | None | None | 2 | None | 4 | 6
row 15 : 15 | poland | 2 | None | None | None | None | 2
row 16 : 16 | italy | None | None | 1 | None | None | 1
row 17 : 17 | latvia | None | None | None | None | 1 | 1
row 18 : None | turkey (host) | None | None | None | None | None | 0 ] Execute this SQL based on the above table: select country where 2007 wwcc = ( select min ( 2007 wwcc ) )
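
To make the problem concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It loads a few rows from the table above, shows that the query fails to parse as written, and shows one possible fix: double-quoting the column name. The table name `w`, the added `FROM w` clauses, and the `quote_ident` helper are assumptions made for illustration; they are not taken from the generation script.

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


# A small subset of the example table (table name "w" is hypothetical).
columns = ["country", "2007 wwcc"]
rows = [("canada", 9), ("china", 14), ("italy", 1), ("turkey (host)", None)]

conn = sqlite3.connect(":memory:")
col_defs = ", ".join(quote_ident(c) for c in columns)
conn.execute(f"CREATE TABLE w ({col_defs})")
conn.executemany("INSERT INTO w VALUES (?, ?)", rows)

# The query exactly as it appears in the dataset sample.
raw_sql = "select country where 2007 wwcc = ( select min ( 2007 wwcc ) )"
try:
    conn.execute(raw_sql)
    raw_ok = True
except sqlite3.Error:
    # SQLite rejects the statement: "2007 wwcc" is not a valid unquoted name.
    raw_ok = False

# Same intent, with the column quoted and an explicit FROM clause (assumed).
col = quote_ident("2007 wwcc")
fixed_sql = f"SELECT country FROM w WHERE {col} = (SELECT MIN({col}) FROM w)"
result = conn.execute(fixed_sql).fetchall()

print("raw query ran:", raw_ok)
print("fixed query result:", result)
```

Quoting every column name at generation time would avoid having to rename columns, at the cost of slightly longer queries.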
@SivilTaram
Collaborator

Hi @imoneoi, thanks for your interest in our work! Sure, I'm happy to share the dataset generation script. I used the script at https://github.com/microsoft/Table-Pretraining/tree/main/data_generator to synthesize the dataset. I'm still working on a clean repo that synthesizes SQL queries from any table in CSV format, but it may take some time 😂
