pauq

PAUQ🕷️ = Pioneer dAtaset for rUssian text-to-SQL.

A Russian text-to-SQL dataset based on Spider. Three of its components are modified, localized and enlarged: the NL questions, the SQL queries and the content of the databases. Database, table and column names remain unchanged; values are augmented with new Russian examples that differ from the existing ones.

Data Content and Format

Spider dataset

PAUQ train set: 8800 samples

PAUQ dev set: 1076 samples

Databases

If you have already downloaded the Spider data, you can update it with these instructions:

Download the "upload" folder.
Run python converter.py --db_path=PATH-TO-DB-FOLDERS (requires Python 3.5+).
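
The converter call above can also be scripted. The following is a minimal sketch, not part of the repository: the helper names build_converter_command and run_converter are hypothetical, and it assumes that converter.py sits in the working directory you pass in and accepts only the --db_path option quoted above. It uses the current interpreter (sys.executable) in place of a bare python.

```python
import subprocess
import sys
from pathlib import Path


def build_converter_command(db_path):
    """Build the argv list for the converter call quoted in the instructions.

    Hypothetical helper: only the --db_path option named in the README is used.
    """
    return [sys.executable, "converter.py", "--db_path={}".format(Path(db_path))]


def run_converter(db_path, workdir="."):
    """Run converter.py from workdir (assumed to contain the script)."""
    if not Path(db_path).is_dir():
        # Fail early with a clear message instead of letting the script crash.
        raise FileNotFoundError("database folder not found: {}".format(db_path))
    # check=True raises CalledProcessError if the converter exits with an error.
    subprocess.run(build_converter_command(db_path), cwd=str(workdir), check=True)
```

For example, run_converter("PATH-TO-DB-FOLDERS", workdir="upload") would run the script inside the downloaded folder, if that is where converter.py is located.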

Structure:

id [str] primary key
db_id [str] the database id to which this question is addressed
source [str] "train-spider", "train-others", "dev" or "addition" (new samples, not from Spider)
type [str] "train" or "dev"
query Dict[str, str] SQL query (en English, ru Russian)
question Dict[str, str] the natural language question (en English, ru Russian)
sql Dict[str, str] parsed result of the SQL query, produced with the Spider parsing file (en English, ru Russian)
question_toks Dict[str, str] the natural language question tokens (en English, ru Russian)
query_toks Dict[str, str] the SQL query tokens corresponding to the question (en English, ru Russian)
query_toks_no_values Dict[str, str] the SQL query tokens, with column values replaced by a placeholder (en English, ru Russian)
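
To illustrate the structure above, here is a minimal sketch of how the per-language fields might be read. The two records are invented for illustration and are not copied from the dataset; their id values, the helper name select_pairs and the Russian wording are assumptions. Only the field names and the "en"/"ru" keys come from the description above.

```python
# Invented records that follow the field layout described above.
# They are illustrative only and are NOT taken from the PAUQ files.
samples = [
    {
        "id": "example_0001",  # hypothetical id format
        "db_id": "concert_singer",
        "source": "dev",
        "type": "dev",
        "query": {
            "en": "SELECT count(*) FROM singer",
            "ru": "SELECT count(*) FROM singer",
        },
        "question": {
            "en": "How many singers do we have?",
            "ru": "Сколько у нас певцов?",
        },
    },
    {
        "id": "example_0002",  # hypothetical id format
        "db_id": "concert_singer",
        "source": "train-spider",
        "type": "train",
        "query": {
            "en": "SELECT name FROM singer",
            "ru": "SELECT name FROM singer",
        },
        "question": {
            "en": "What are the names of all singers?",
            "ru": "Как зовут всех певцов?",
        },
    },
]


def select_pairs(records, lang="ru", split="dev"):
    """Return (db_id, question, query) triples for one language and one split."""
    return [
        (r["db_id"], r["question"][lang], r["query"][lang])
        for r in records
        if r["type"] == split
    ]


dev_pairs_ru = select_pairs(samples, lang="ru", split="dev")
```

With real data, the list of records would be loaded from the dataset files (for example with json.load) instead of being written inline.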

Citation

Paper link

@inproceedings{bakshandaeva-etal-2022-pauq,
    title = "{PAUQ}: Text-to-{SQL} in {R}ussian",
    author = "Bakshandaeva, Daria  and
      Somov, Oleg  and
      Dmitrieva, Ekaterina  and
      Davydova, Vera  and
      Tutubalina, Elena",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.175",
    pages = "2355--2376",
    abstract = "Semantic parsing is an important task that allows to democratize human-computer interaction. One of the most popular text-to-SQL datasets with complex and diverse natural language (NL) questions and SQL queries is Spider. We construct and complement a Spider dataset for Russian, thus creating the first publicly available text-to-SQL dataset for this language. While examining its components - NL questions, SQL queries and databases content - we identify limitations of the existing database structure, fill out missing values for tables and add new requests for underrepresented categories. We select thirty functional test sets with different features that can be used for the evaluation of neural models{'} abilities. To conduct the experiments, we adapt baseline architectures RAT-SQL and BRIDGE and provide in-depth query component analysis. On the target language, both models demonstrate strong results with monolingual training and improved accuracy in multilingual scenario. In this paper, we also study trade-offs between machine-translated and manually-created NL queries. At present, Russian text-to-SQL is lacking in datasets as well as trained models, and we view this work as an important step towards filling this gap.",
}
