Skip to content

congy/AutoSuggest

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 

Repository files navigation

===============

Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks

Overview

This benchmark dataset contains 78K real invocations of Pandas table-manipulation operators (Join, Pivot, Unpivot, GroupBy), harvested from replaying hundreds of thousands of notebooks publicly available on GitHub. We crawled over 4M notebooks, replayed a significant fraction of these notebooks by programmatically executing them, and used dynamic instrumentation to extract input-tables/parameters used by these aforementioned operators.

We used this dataset to predict/recommend data-preparation steps (particularly for non-expert users) [1]. We hope the release of this data set would facilitate large-scale research on topics such as understanding the characteristics of real data manipulation tasks that data scientists perform in the wild; and building predictive models to automate data preparation tasks for expert and non-expert users alike.

Data link

Data can be downloaded from this link. Each instance of an operator invocation is stored under a separate sub-folder with a structure of [OPERATOR]/[GITREPO]_[FILEPATH]_cell[CELLID]_[OPERATORID]/. Each sub-folder contains a data.csv for the input dataframe fed into the operator (dumped as a csv), and also a param.json for the exact parameters used in this invocation (the ground-truth we want to predict in [1]).

License

This data set is released under the Computational Use of Data Agreement v1.0.

Reference

[1] Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. https://dl.acm.org/doi/abs/10.1145/3318464.3389738.

===============

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published