StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

(Check out other code resources in our group at: https://github.com/sunlab-osu)

1. StaQC dataset

1.1 Introduction

StaQC (Stack Overflow Question-Code pairs) is the largest dataset to date of question-code pairs in the Python and SQL domains, with around 148K Python and 120K SQL pairs. The pairs are automatically mined from Stack Overflow using a Bi-View Hierarchical Neural Network, as described in the paper "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" (WWW'18).

Click to see some quick examples randomly sampled from StaQC!

StaQC is collected from three sources: multi-code answer posts, single-code answer posts, and manual annotations on multi-code answer posts:

Source	# of question-code pairs (Python)	# of question-code pairs (SQL)
Multi-Code Answer Posts	60,083	41,826
Single-Code Answer Posts	85,294	75,637
Manual Annotation	2,169	2,056
Total	147,546	119,519

1.2 Multi-code answer posts & manual annotations

A multi-code answer post is an (accepted) answer post that contains multiple code snippets, some of which may not be standalone code solutions to the question (see Section 1 in the paper). For example, in this multi-code answer post, the third code snippet is not a code solution to the question "How to limit a number to be within a specified range? (Python)".

The ids of question-code pairs automatically mined or manually annotated from multi-code answer posts can be found here: Python and SQL.
Format: Each line corresponds to one code snippet, which can be paired with its question. The code snippet is identified by (question id, code snippet index), where the code snippet index refers to the index (starting from 0) of the code snippet in the accepted answer post of this question. For example, (5996881, 0) refers to the first code snippet in the accepted answer post of the question with id "5996881", which can be paired with its question "How to limit a number to be within a specified range? (Python)".

We also provide the complete source data. Note that the source data contains all available resources, not only the mined question-code pairs. Given the source data, you can retrieve the mined code solutions using the provided question-code ids (see above).
Source data: Python 2.7 pickle files. Please open with pickle.load(open(filename, "rb")).

Code snippets for Python and SQL: A dict of {(question id, code index): code snippet}.
Question titles for Python and SQL: A dict of {question id: question title}.
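
As a rough illustration of how the ids combine with the source data, the sketch below joins a list of (question id, code snippet index) ids with the two dictionaries described above. The function name build_pairs and the toy dictionaries are hypothetical: the id and title come from the example above, the snippet text is a placeholder, and the sketch assumes the id file has already been parsed into tuples with integer question ids.

```python
def build_pairs(pair_ids, code_snippets, question_titles):
    """Join (question id, code index) ids with their title and code snippet."""
    pairs = []
    for qid, code_idx in pair_ids:
        key = (qid, code_idx)
        # Skip ids that are missing from either dictionary instead of failing.
        if key in code_snippets and qid in question_titles:
            pairs.append((question_titles[qid], code_snippets[key]))
    return pairs


# Toy stand-ins for the unpickled dictionaries (snippet text is a placeholder).
code_snippets = {
    (5996881, 0): "<first code snippet>",
    (5996881, 2): "<third code snippet>",
}
question_titles = {
    5996881: "How to limit a number to be within a specified range? (Python)",
}

# Only (5996881, 0) is listed as a mined pair in this toy example.
pairs = build_pairs([(5996881, 0)], code_snippets, question_titles)
```

Snippets whose ids are not listed, such as index 2 here, stay in the source data but are left out of the resulting pairs.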

1.3 Single-code answer posts

A single-code answer post is an (accepted) answer post that contains only one code snippet. We pair this code snippet with the question title to form a question-code pair.

Source data: Python 2.7 pickle files. Please open with pickle.load(open(filename, "rb")).

Code snippets for Python and SQL: A dict of {question id: accepted code snippet}.
Question titles for Python and SQL: A dict of {question id: question title}.
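
For readers working in Python 3 rather than Python 2.7, the following is a minimal sketch of one way to read such pickle files and assemble the single-code pairs. The helper names and file paths are hypothetical. The encoding="latin1" argument is an assumption that works for byte strings in general; if the files hold UTF-8 encoded byte strings, encoding="bytes" followed by explicit decoding may be a better fit.

```python
import pickle


def load_py2_pickle(path):
    """Load a pickle written by Python 2.7 from Python 3."""
    with open(path, "rb") as f:
        # latin1 maps every byte to a code point, so Python 2 str objects
        # can be decoded without raising UnicodeDecodeError.
        return pickle.load(f, encoding="latin1")


def single_code_pairs(code_by_qid, title_by_qid):
    """Pair each accepted code snippet with its question title."""
    return [
        (title_by_qid[qid], code)
        for qid, code in sorted(code_by_qid.items())
        if qid in title_by_qid  # ignore snippets without a known title
    ]


# Hypothetical usage; replace the paths with the downloaded files.
# codes = load_py2_pickle("path/to/single_code_snippets.pickle")
# titles = load_py2_pickle("path/to/single_code_titles.pickle")
# pairs = single_code_pairs(codes, titles)
```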

2. Software

2.1 Prerequisites

Python 2.7
NLTK
Tensorflow (1.0.1 or later)
Raw Stack Overflow (SO) dump or our processed data

[Update 05/27/2019] If you are using our processed data, vocabularies (text_word_vocab.pickle for text, code_token_vocab.pickle for code) can be found in the following folders:

Python: text vocab, code vocab.
SQL: text vocab, code vocab.

2.2 Manual annotations

Human annotations can be found here: Python and SQL. Both are pickle files.

2.3 How-to-do-it question type classifier

The script that extracts features for constructing a "how-to-do-it" question type classifier can be found here. The 250 manually annotated posts for Python and SQL can be found here (label '1' denotes "how-to-do-it"). For details, please refer to Section 2.2.1 in our paper.

2.4 Code snippet processing

The script for processing code snippets can be found here. For details, please read Section 5.1 in our paper. The implementation of the SQL parser is adapted from https://github.com/sriniiyer/codenn.

Installing the package:
cd data_processing/codenn/src/sqlparse/
python setup.py install
Processing code snippets (tokenization, normalizing variable names, etc.):
cd data_processing
The tokenize_code_corpus function receives a dictionary of code snippets and returns the parsing results. Please run python code_processing.py for testing.
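
To give a feel for what tokenization with variable-name normalization can look like, here is a small Python 3 sketch built on the standard tokenize module. It is not the repository's tokenize_code_corpus: the function names, the VAR placeholder, and the rule "every name that is neither a keyword nor a builtin becomes VAR" are all assumptions made for illustration, and the actual script may normalize differently.

```python
import builtins
import io
import keyword
import tokenize

# Layout-only tokens that carry no content for this illustration.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def tokenize_snippet(code):
    """Return token strings, replacing user-defined names with 'VAR'."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in _SKIPPED:
            continue
        is_user_name = (
            tok.type == tokenize.NAME
            and not keyword.iskeyword(tok.string)
            and not hasattr(builtins, tok.string)
        )
        tokens.append("VAR" if is_user_name else tok.string)
    return tokens


def tokenize_corpus(snippets):
    """Apply tokenize_snippet to a dict of {key: code snippet}."""
    return {key: tokenize_snippet(code) for key, code in snippets.items()}


# Toy input; the key and the snippet are made up for this example.
example = tokenize_corpus({(0, 0): "x = max(lo, min(x, hi))"})
```

Snippets that are not valid Python 3 at the token level (for example, unbalanced brackets) raise an error in this sketch, so a real pipeline would need to handle such cases.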

2.5 Run BiV-HNN

We provide processed training/validation/testing files in our experiments here.

Before running, please unzip the word embedding files for Python (code_word_embedding.gz*) as follows:
cd data/data_hnn/python/train/
cat code_word_embedding.gza* | zcat > rnn_partialcontext_word_embedding_code_150.pickle
rm code_word_embedding.gza*
Then go back to the code directory:
cd ../../../../BiV_HNN/.
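
On systems without cat and zcat, the same join-and-decompress step can be approximated in Python. The sketch below is an assumption-laden equivalent, not part of the repository: the function name is hypothetical, and it relies on the split parts sorting correctly by file name (as the shell glob does).

```python
import glob
import gzip
import shutil
import tempfile


def join_and_decompress(part_pattern, out_path):
    """Concatenate split gzip parts in name order and decompress the result."""
    parts = sorted(glob.glob(part_pattern))
    if not parts:
        raise FileNotFoundError("no files match %r" % part_pattern)
    with tempfile.TemporaryFile() as joined:
        # Step 1: rebuild the original .gz stream (the `cat` part).
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, joined)
        joined.seek(0)
        # Step 2: decompress the stream to the output file (the `zcat` part).
        with gzip.GzipFile(fileobj=joined, mode="rb") as gz, open(out_path, "wb") as dst:
            shutil.copyfileobj(gz, dst)
    return parts


# Hypothetical usage from data/data_hnn/python/train/:
# join_and_decompress("code_word_embedding.gza*",
#                     "rnn_partialcontext_word_embedding_code_150.pickle")
```

Unlike the shell commands above, this sketch does not delete the split parts afterwards.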

No additional steps are needed for the SQL data.

Train:
For Python data:

python run.py --train --train_setting=1 --text_model=1 --code_model=1 --query_model=1 --text_model_setting="64-150-24379-0-1-0-1" --code_model_setting="64-150-218900-0-1-0-1" --query_model_setting="64-150-24379-0-1-0-1" --keep_prob=0.5

For SQL data:

python run.py --train --train_setting=2 --text_model=1 --code_model=1 --query_model=1 --text_model_setting="64-150-13698-0-1-0-1" --code_model_setting="64-150-33192-0-1-0-1" --query_model_setting="64-150-13698-0-1-0-1" --keep_prob=0.7

The above program trains the BiV-HNN model. It will print the model's learning process on the training set, and its performance on the validation set and the testing set.

For training Text-HNN, set:
--code_model=0 --query_model=0 --code_model_setting=None --query_model_setting=None
to disable the code and query modeling.

For training Code-HNN, set:
--text_model=0 --text_model_setting=None
to disable the text modeling.

Test:
To test on other datasets, revise the test function in run.py and run the same command as above, replacing --train with --test.
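
The flag combinations above can be assembled programmatically. The helper below is only a convenience sketch: build_command and the SETTINGS table are hypothetical names, while the flag names and values are copied from the commands in this section. It returns an argument list suitable for subprocess.run, launched from the BiV_HNN directory.

```python
# Per-language values copied from the training commands above.
SETTINGS = {
    "python": {
        "train_setting": "1",
        "text": "64-150-24379-0-1-0-1",
        "code": "64-150-218900-0-1-0-1",
        "query": "64-150-24379-0-1-0-1",
        "keep_prob": "0.5",
    },
    "sql": {
        "train_setting": "2",
        "text": "64-150-13698-0-1-0-1",
        "code": "64-150-33192-0-1-0-1",
        "query": "64-150-13698-0-1-0-1",
        "keep_prob": "0.7",
    },
}


def build_command(language, variant="biv", mode="train"):
    """Build the run.py argument list for BiV-HNN, Text-HNN, or Code-HNN."""
    if variant not in ("biv", "text", "code"):
        raise ValueError("variant must be 'biv', 'text', or 'code'")
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    cfg = SETTINGS[language]
    use_text = variant != "code"  # Code-HNN disables the text model
    use_code = variant != "text"  # Text-HNN disables code and query models

    def setting(enabled, value):
        return value if enabled else "None"

    return [
        "python", "run.py", "--" + mode,
        "--train_setting=" + cfg["train_setting"],
        "--text_model=%d" % use_text,
        "--code_model=%d" % use_code,
        "--query_model=%d" % use_code,
        "--text_model_setting=" + setting(use_text, cfg["text"]),
        "--code_model_setting=" + setting(use_code, cfg["code"]),
        "--query_model_setting=" + setting(use_code, cfg["query"]),
        "--keep_prob=" + cfg["keep_prob"],
    ]


# Example: the Text-HNN training command for the SQL data.
text_hnn_sql = build_command("sql", variant="text")
```

Because the arguments are passed as a list, the setting strings do not need the surrounding quotes used in the shell commands.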

3. Cite

If you use the dataset or the code in your research, please cite the following paper:

@inproceedings{yao2018staqc,
  title={StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow},
  author={Yao, Ziyu and Weld, Daniel S and Chen, Wei-Peng and Sun, Huan},
  booktitle={Proceedings of the 2018 World Wide Web Conference on World Wide Web},
  pages={1693--1703},
  year={2018},
  organization={International World Wide Web Conferences Steering Committee}
}

This work is licensed under a Creative Commons Attribution 4.0 International License.
