- analyzed_projects_all.csv contains, in CSV format, the list of all cloned projects at the time of this study:
  - repo_name is the repository name;
  - last_checkout is the hash of the last commit available at the time of the clone;
  - date is the date of the latest available commit.
- detailed-database is a folder containing the two complete datasets we defined:
  - overall.json contains all the instances of our dataset (1,930);
  - language-filtered.json contains 1,115 instances involving files in the following languages: C, Python, C++, JavaScript, Java, PHP, Ruby, and C#.

  Both datasets are JSON arrays. Each element has the following structure:
  - id is a unique ID assigned to every entry during the construction phase;
  - repository is the repository name as hosted on GitHub (owner/project-name);
  - fix contains information about the fix, including:
    - commit: metadata about the commit, including:
      - hash: commit hash;
      - message: commit message;
      - author: commit author;
      - url: GitHub API URL with complete information about the commit;
    - files: an array of files modified in the fix commit; each element provides:
      - name: name of the modified file after the commit (just the file name, not the complete path);
      - old_path/new_path: the path of the file before and after the commit;
      - lang: extension of the file (indicating the programming language);
      - lines_added/lines_deleted: lists of the line numbers added/deleted;
      - change_type: type of change (one of "MODIFY", "ADD", "RENAME", "DELETE");
  - bugs contains the list of bug-inducing commits for the fix; each element includes:
    - commit: metadata about the commit, including:
      - hash: commit hash;
      - message: commit message;
      - author: commit author;
      - url: GitHub API URL with complete information about the commit;
    - files: an array of files modified in the bug-inducing commit; each element provides:
      - name: name of the modified file after the commit (just the file name, not the complete path);
      - old_path/new_path: the path of the file before and after the commit;
      - lang: extension of the file (indicating the programming language);
      - lines_added/lines_deleted: lists of the line numbers added/deleted;
      - change_type: type of change (one of "MODIFY", "ADD", "RENAME", "DELETE");
  - issue_urls is a list of URLs of the issues referenced in the fix commit;
  - earliest_issue_date is the date of the earliest issue referenced in the fix commit (YYYY-MM-DDTHH:MM:SS);
  - best_scenario_issue_date represents the date of an ideal issue reported for the bug; it is the date of the last bug-inducing commit incremented by 60 seconds (YYYY-MM-DDTHH:MM:SS).
- json-input-raw is a folder containing four datasets used as input for our experiments, derived from language-filtered.json:
  - bugfix_commits_all.json and bugfix_commits_issues_only.json contain 1,115 and 129 instances in JSON format, respectively;
  - bugfix_commits_all_java.json and bugfix_commits_issues_only_java.json contain 80 and 10 instances in JSON format, respectively.

  These datasets list the selected fix commits with their respective bug-inducing commits, together with the following additional information used in our SZZ evaluation:
  - id is a unique ID assigned to every entry during the construction phase;
  - repo_name is the repository name as hosted on GitHub;
  - fix_commit_hash is the commit hash of the selected fix;
  - bug_commit_hash is a list of bug-inducing commits;
  - earliest_issue_date is a string containing the timestamp of the earliest issue (YYYY-MM-DDTHH:MM:SS);
  - best_scenario_issue_date represents the date of an ideal issue reported for the bug; it is the date of the last bug-inducing commit incremented by 60 seconds (YYYY-MM-DDTHH:MM:SS);
  - issue_urls is a list of URLs of the issues referenced in the fix commit;
  - language is a list of the programming languages of the files impacted by the fix commit.
- cloned is a placeholder folder where the git repositories must be copied (or cloned) to replicate this work. See the instructions below.
- json-output-raw is a folder containing the JSON files with our pre-calculated results for each SZZ algorithm.
- scripts is a folder that contains all the scripts created to post-process or analyze our data.
- tools is a folder that contains a snapshot of the code we developed. For new studies, please use the extended version, PySZZ v2.
- results is a folder that contains all the calculated metrics, such as Precision, Recall, and F-measure.
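
To make the two schemas above concrete, the following minimal sketch flattens one detailed-database element into the shape used by json-input-raw, and recomputes best_scenario_issue_date from a list of commit dates. The sample record, all of its values, and the helper names to_input_entry and best_scenario_issue_date are illustrative assumptions; the field mapping is inferred from the descriptions above and is not taken from the scripts shipped in this package.

```python
import json
from datetime import datetime, timedelta

# Hypothetical record following the documented detailed-database structure.
# Every value below is made up for illustration.
SAMPLE = {
    "id": 1,
    "repository": "owner/project-name",
    "fix": {
        "commit": {"hash": "f1f1f1", "message": "Fix crash", "author": "alice", "url": ""},
        "files": [{
            "name": "main.py", "old_path": "src/main.py", "new_path": "src/main.py",
            "lang": "py", "lines_added": [10], "lines_deleted": [9],
            "change_type": "MODIFY",
        }],
    },
    "bugs": [{
        "commit": {"hash": "b1b1b1", "message": "Add parser", "author": "bob", "url": ""},
        "files": [],
    }],
    "issue_urls": [],
    "earliest_issue_date": "2020-01-02T10:00:00",
    "best_scenario_issue_date": "2020-01-01T09:01:00",
}


def to_input_entry(record):
    """Flatten one detailed-database element into a json-input-raw style entry."""
    return {
        "id": record["id"],
        "repo_name": record["repository"],
        "fix_commit_hash": record["fix"]["commit"]["hash"],
        "bug_commit_hash": [bug["commit"]["hash"] for bug in record["bugs"]],
        "earliest_issue_date": record["earliest_issue_date"],
        "best_scenario_issue_date": record["best_scenario_issue_date"],
        "issue_urls": record["issue_urls"],
        # Assumption: the language list is built from the "lang" of the fix files.
        "language": sorted({f["lang"] for f in record["fix"]["files"]}),
    }


def best_scenario_issue_date(bic_dates):
    """Date of the most recent bug-inducing commit plus 60 seconds."""
    latest = max(datetime.fromisoformat(d) for d in bic_dates)
    return (latest + timedelta(seconds=60)).isoformat()


entry = to_input_entry(SAMPLE)
print(json.dumps(entry, indent=2))
print(best_scenario_issue_date(["2020-01-01T09:00:00", "2019-12-31T08:00:00"]))
```

The commit dates are passed in explicitly because the documented structure does not list a date field for commits; in practice they would have to be read from the cloned repositories.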
The following instructions explain how to run our suite of tools and generate our results. This example refers to the B-SZZ variant, but any other algorithm can be reproduced by changing the input arguments as detailed in the original guide. See tools/pyszz.zip for more instructions.
- Preparing input data. As a first step, you need to clone the git repository of every project. You can rely on the following approach.
  - Clone each repository into the cloned folder, and then check out the commit hashes listed in analyzed_projects_all.csv and analyzed_projects_issues_only.csv. This recreates the exact conditions of our experiment.
- Running SZZ. PySZZ (see tools/pyszz.zip for a replication snapshot, and check the reported URL for the latest version) is a free, open-source suite of tools that implements all major SZZ variants in Python. You can run a specific variant by passing a pre-defined yml file, or experiment with custom inputs. E.g., conf/bszz.yml activates the B-SZZ variant, and the following command runs the B-SZZ algorithm:

      python3 main.py json-input-raw/bugfix_commits_all.json conf/bszz.yml cloned

  Where:
  - json-input-raw/bugfix_commits_all.json is the input list of fixes;
  - conf/bszz.yml is a pre-defined list of settings used to activate a specific variant (see tools/pyszz.zip for more details);
  - cloned is the folder containing the pre-cloned repositories.
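
The two steps above can be sketched as a small script that builds the git commands for each project and then the PySZZ invocation. This is only a sketch under several assumptions that the text does not confirm: that the CSV has a header row with the three documented column names, that each repository is cloned from github.com into cloned/<repo_name>, and that the date value looks like the one shown. The helper name git_commands and the sample row are hypothetical. The commands are printed rather than executed.

```python
import csv
import io
import shlex

# Hypothetical CSV content using the three documented columns.
SAMPLE_CSV = "repo_name,last_checkout,date\nowner/project-name,abc123,2020-01-01\n"

# Command taken from the instructions above (B-SZZ variant).
PYSZZ_CMD = [
    "python3", "main.py",
    "json-input-raw/bugfix_commits_all.json",
    "conf/bszz.yml",
    "cloned",
]


def git_commands(csv_text, cloned_dir="cloned"):
    """Yield a (clone, checkout) pair of argument lists for every listed project."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        repo = row["repo_name"]
        # Assumption: repositories live under <cloned_dir>/<owner>/<project-name>.
        target = f"{cloned_dir}/{repo}"
        clone = ["git", "clone", f"https://github.com/{repo}.git", target]
        checkout = ["git", "-C", target, "checkout", row["last_checkout"]]
        yield clone, checkout


for clone, checkout in git_commands(SAMPLE_CSV):
    print(shlex.join(clone))
    print(shlex.join(checkout))
print(shlex.join(PYSZZ_CMD))
```

To actually run the commands, each argument list could be passed to subprocess.run(..., check=True); check the layout expected by the PySZZ snapshot in tools/pyszz.zip before doing so.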
NOTE. SZZUnleashed and OpenSZZ are not part of the PySZZ suite. We adapted the original implementations to our input formats.
- The SZZUnleashed implementation has been forked to handle our input format and to add parallel support: SZZUnleashed-adapted (see tools/szz-unleashed.zip for a snapshot of our adapter).
- The OpenSZZ implementation has been forked to exclude the Jira filter: OpenSZZ (see tools/open-szz.zip for a snapshot of our adapter). OpenSZZ needs post-processing to adapt the generated results to our JSON format; see the OpenSZZ post-processing script below.

Both snapshots, tools/szz-unleashed.zip and tools/open-szz.zip, contain the instructions to use our adapters.
json-output-raw contains the JSON files generated by each SZZ variant.
Specifically, bic_<algorithm-name>_bugfix_commits_all.json and bic_<algorithm-name>_bugfix_commits_issues_only.json are the output of the <algorithm-name> SZZ variant.
Instead, bic_<algorithm-name>_bugfix_commits_all-filter.json and bic_<algorithm-name>_bugfix_commits_issues_only-filter.json are the post-filtered outputs obtained when the filter on the issue date is applied.
We use ruby postfilter.rb <json-output> <cloned> to post-process bic_<algorithm-name>_bugfix_commits_all.json and bic_<algorithm-name>_bugfix_commits_issues_only.json and generate bic_<algorithm-name>_bugfix_commits_all-filter.json and bic_<algorithm-name>_bugfix_commits_issues_only-filter.json, a reduced list of data points filtered by issue date.
- postfilter.rb is our Ruby script that parses the output of any SZZ algorithm and filters out the bug-inducing commits (BICs) that do not respect the issue date condition;
- <json-output> is the input folder containing the JSON files produced by PySZZ;
- <cloned> is the path to the pre-cloned (or checked out) repositories.
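
The idea behind the post-filter can be illustrated with a minimal Python sketch. It is not a port of postfilter.rb: it assumes that the issue date condition means a bug-inducing commit must be strictly earlier than the issue date, and it takes the commit dates as a plain dictionary instead of reading them from the cloned repositories. The helper names filter_bics and filtered_name, the hashes, and the algorithm name "b" in the usage lines are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def filter_bics(bic_hashes, commit_dates, issue_date):
    """Keep only the BICs committed strictly before the issue date.

    commit_dates maps a commit hash to a YYYY-MM-DDTHH:MM:SS string.
    Assumption: this is the "issue date condition" mentioned in the text.
    """
    limit = datetime.fromisoformat(issue_date)
    return [h for h in bic_hashes
            if datetime.fromisoformat(commit_dates[h]) < limit]


def filtered_name(output_file):
    """Derive the '-filter' file name from a raw output file name."""
    path = Path(output_file)
    return str(path.with_name(path.stem + "-filter" + path.suffix))


dates = {"aaa111": "2020-01-01T00:00:00", "bbb222": "2020-03-01T00:00:00"}
print(filter_bics(["aaa111", "bbb222"], dates, "2020-02-01T00:00:00"))
print(filtered_name("bic_b_bugfix_commits_all.json"))
```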
overlap.py is a Python script with embedded input paths that can be used to calculate Recall, Precision, F-measure, and overlap.
You may need to adapt the base_path global variable to point to your results directory. E.g., base_path = "json-output-raw/" analyzes the study's results.
This tool produces:
- <dataset>-recall-precision.csv lists Precision, Recall, F-measure, the total number of correct instances (our oracle), and the total number of identified instances;
- <dataset>-overlap_vi_vj.csv lists the overlap, the total number of BICs uniquely identified, the total number of BICs correctly identified, and the union of all BICs correctly identified by all models;
- <dataset>-overlap_vi_but_others.csv is a CSV version of the heatmap for the overlap comparison;
- <dataset>-not-identified.csv summarizes the BICs that were not found;
- <dataset>-heatmap.pdf is the heatmap as reported in the manuscript;
- wrong is a subfolder with a list of CSV files containing the wrongly identified BICs, each with a link to the fix commit on GitHub.
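
For reference, the three metrics in <dataset>-recall-precision.csv follow the standard definitions, sketched below over sets of commit hashes. This is a generic illustration, not the code of overlap.py: the function name evaluate and the sample hashes are hypothetical, and how the script aggregates the counts across fixes or computes the overlap is not shown here.

```python
def evaluate(oracle, identified):
    """Return (precision, recall, f_measure) for one set of identified BICs.

    oracle: the correct bug-inducing commits.
    identified: the commits reported by an SZZ variant.
    """
    oracle, identified = set(oracle), set(identified)
    correct = len(oracle & identified)
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(oracle) if oracle else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Two of the three reported commits are in the oracle of four.
precision, recall, f_measure = evaluate({"a", "b", "c", "d"}, {"a", "b", "x"})
print(f"{precision:.3f} {recall:.3f} {f_measure:.3f}")
```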
OpenSZZ produces three files for each analyzed instance. E.g., AIFDR_inasafe_BugFixingCommit.csv, AIFDR_inasafe_BugInducingCommits.csv, and AIFDR_inasafe.txt.
To transform all these CSV files into a single JSON file compatible with overlap.py, we created a small script, openszz_file_refactoring.py:

    python3 openszz_file_refactoring.py <oracle> <openszz-issue> <bic_open_bugfix_commits_issues_only.json>

Where:
- <oracle> is the list of fixes. E.g., json-input-raw/bugfix_commits_all.json;
- <openszz-issue> is the folder path where OpenSZZ produces its results;
- <bic_open_bugfix_commits_issues_only.json> is the destination file in which the OpenSZZ bug-inducing commits are stored in JSON format.
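
As a rough illustration of the first step such a conversion needs, the sketch below groups OpenSZZ output files by project using only the naming pattern shown in the example above. It does not reproduce openszz_file_refactoring.py and does not parse the CSV contents, whose columns are not described here. The function name group_outputs and the keys "fixing", "inducing", and "txt" are hypothetical.

```python
from collections import defaultdict

# Suffixes taken from the example file names above; the keys are made up.
SUFFIXES = {
    "_BugFixingCommit.csv": "fixing",
    "_BugInducingCommits.csv": "inducing",
    ".txt": "txt",
}


def group_outputs(filenames):
    """Map each project prefix to the OpenSZZ files found for it."""
    groups = defaultdict(dict)
    for name in filenames:
        for suffix, kind in SUFFIXES.items():
            if name.endswith(suffix):
                groups[name[: -len(suffix)]][kind] = name
                break
    return dict(groups)


files = [
    "AIFDR_inasafe_BugFixingCommit.csv",
    "AIFDR_inasafe_BugInducingCommits.csv",
    "AIFDR_inasafe.txt",
]
print(group_outputs(files))
```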