This project contains the artifacts used in the paper "Towards Accurate Recommendations of Merge Conflicts Resolution Strategies", published in the Information and Software Technology (IST) journal. A pre-print is available here.
In this paper, we propose MESTRE, a merge conflict resolution strategy recommender.
The complementary material for the paper can be found in the "complementary" folder.
The dataset can be obtained through the steps outlined below.
The scripts used to reproduce the study can be found in the "scripts" folder.
There are two options for accessing the dataset used in this paper. You can either collect the data yourself (which takes a long time) or download the dataset files directly.
We assume you have access to the conflicts database used in this paper. The database connection settings can be configured in the scripts/database.py file.
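The concrete contents of scripts/database.py are not reproduced here; as a rough sketch, assuming a PostgreSQL database accessed through psycopg2, the configuration might look like the following (all connection values are placeholders, and the actual script may use a different driver or layout):

```python
# Hypothetical sketch of scripts/database.py; the real file may differ.
# Assumes a PostgreSQL conflicts database accessed through psycopg2.
import psycopg2

# Placeholder connection settings; replace with your own values.
DB_CONFIG = {
    "host": "localhost",
    "port": 5432,
    "user": "mestre",
    "password": "changeme",
    "dbname": "conflicts",
}

def get_connection():
    """Open a connection to the conflicts database."""
    return psycopg2.connect(**DB_CONFIG)
```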
Run the scripts in the following order (a driver sketch that automates the sequence is shown after the table):
Script | Input | Output | Description |
---|---|---|---|
extract_initial_dataset.py | Conflicts database | ./data/INITIAL_DATASET.csv | Extracts a csv with conflicting chunks and some descriptive attributes. |
concatenation_relabel.py | ./data/INITIAL_DATASET.csv, Conflicts database | ./data/LABELLED_DATASET.csv | Relabels the developerdecision of each chunk that used the Concatenation strategy. |
clone_projects.py | ./data/INITIAL_DATASET.csv | Repos folder | Clones all projects into the ./repos folder. |
collect_chunk_authors.py | ./data/INITIAL_DATASET.csv, Repos folder | ./data/chunk_authors.csv | Extracts a csv with information about all authors who contributed to a conflicting chunk. Detailed information can be found at this link. |
collect_attributes.py | ./data/INITIAL_DATASET.csv, Repos folder | ./data/collected_attributes1.csv | Extracts a csv with collected attributes from the conflicting chunks. The extracted attributes are described at this link. |
execute_mac_tool.py | ./data/INITIAL_DATASET.csv, Repos folder | Two csv files for each analyzed repo, ./data/macTool_output.csv | Executes a modified version of the macTool to extract merge attributes. More info at this link. |
collect_merge_type.py | ./data/macTool_output.csv, Repos folder | ./data/merge_types_data.csv | Extracts, for each chunk's merge commit, the merge commit message, the merge branch message indicator, and a boolean attribute indicating whether multiple developers worked on each branch of the merge. More info at this link. |
collect_attributes_db.py | ./data/INITIAL_DATASET.csv, Conflicts database, Repos folder | ./data/collected_attributes2.csv | Extracts a csv with attributes of the conflicting chunks that can be calculated from the data in the database. The extracted attributes are described at this link. |
extract_author_self_conflict.py | ./data/chunk_authors.csv | ./data/authors_self_conflicts.csv | Extracts a csv with the calculated self_conflict_perc metric for each conflicting chunk. |
assemble_dataset.py | ./data/collected_attributes1.csv, ./data/collected_attributes2.csv, ./data/authors_self_conflicts.csv, ./data/merge_types_data.csv, ./data/macTool_output.csv | ./data/dataset.csv | Combines all collected data from the previous scripts into a single csv. |
select_projects.py | ./data/LABELLED_DATASET.csv, ./data/number_conflicting_chunks.csv, ./data/dataset.csv | ./data/selected_dataset.csv, ./data/SELECTED_LABELLED_DATASET.csv, ./data/projects_intersection.csv | Extracts only the conflicting chunks that satisfy the criteria contained in the script (currently, chunks from projects that have at least 1,000 conflicting chunks and that are not implicit forks of other selected projects; a selection sketch is shown after the table). |
github_api_data_preprocess.py | ./data/number_conflicting_chunks.csv, ./data/number_chunks__updated_repos.csv, ./data/projects_data_from_github_api.csv | ./data/api_data.csv | Joins the data about projects (collected from the GitHub API) with the number of chunks per project (extracted from Ghiotto's database) and the new owner/names of renamed projects, as well as the projects not found by the API. |
transform_boolean_attributes.py | ./data/selected_dataset.csv | ./data/selected_dataset2.csv | Transforms the language construct column of each conflicting chunk into boolean attributes. |
process_projects_dataset.py | ./data/selected_dataset2.csv, ./data/chunk_authors.csv | Two csv files (training/test) for each selected repository, placed in ./data/projects, ./data/dataset-training.csv, ./data/dataset-test.csv | Splits the dataset into training/validation (80%) and test (20%) parts. Creates the boolean attribute for authors in each selected project. Details can be viewed at this link. |
discretize_dataset.py | ./data/dataset-training.csv, ./data/dataset-test.csv, ./data/projects/{project}-training.csv, ./data/projects/{project}-test.csv | Two csv files (training/test) for each selected repository, placed in ./data/projects/discretized_log2 and ./data/projects/discretized_log10, ./data/dataset-training_log2.csv, ./data/dataset-training_log10.csv, ./data/dataset-test_log2.csv, ./data/dataset-test_log10.csv | Discretizes numerical attributes from the dataset using log2 and log10 functions (see the log-binning sketch after the table). |
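Since the scripts above must run in a strict order, a small driver can automate the sequence. This is a minimal sketch, not part of the original artifact; it assumes each script is invoked from the repository root:

```python
# Hypothetical driver that runs the collection scripts in the
# order listed in the table above.
import subprocess
import sys

PIPELINE = [
    "extract_initial_dataset.py",
    "concatenation_relabel.py",
    "clone_projects.py",
    "collect_chunk_authors.py",
    "collect_attributes.py",
    "execute_mac_tool.py",
    "collect_merge_type.py",
    "collect_attributes_db.py",
    "extract_author_self_conflict.py",
    "assemble_dataset.py",
    "select_projects.py",
    "github_api_data_preprocess.py",
    "transform_boolean_attributes.py",
    "process_projects_dataset.py",
    "discretize_dataset.py",
]

for script in PIPELINE:
    print(f"Running scripts/{script} ...")
    result = subprocess.run([sys.executable, f"scripts/{script}"])
    # Stop immediately if any step fails, since later steps
    # depend on the outputs of earlier ones.
    if result.returncode != 0:
        sys.exit(f"{script} failed with exit code {result.returncode}")
```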
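To illustrate the main selection criterion applied by select_projects.py, the sketch below keeps only chunks from projects with at least 1,000 conflicting chunks. The column name "project" is an assumption, and the implicit-fork filtering is omitted:

```python
# Illustrative selection sketch; the column name "project" is hypothetical,
# and select_projects.py applies additional criteria (e.g. fork filtering).
import pandas as pd

df = pd.read_csv("./data/dataset.csv")

# Count conflicting chunks per project, aligned row by row with df.
counts = df.groupby("project")["project"].transform("size")

# Keep only chunks from projects with at least 1,000 conflicting chunks.
selected = df[counts >= 1000]
selected.to_csv("./data/selected_dataset.csv", index=False)
```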
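As an illustration of the log-based discretization in the final step, the sketch below bins a numeric column by the integer part of its log2 (log10 is analogous). The column name "chunk_size" and the binning details are assumptions; the actual rules in discretize_dataset.py may differ:

```python
# Illustrative log2 discretization; the attribute name and binning
# rules are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"chunk_size": [0, 1, 3, 8, 150, 4000]})

def log2_bin(value: float) -> int:
    # Reserve bin 0 for non-positive values; otherwise bin by
    # the integer part of log2, shifted so bins start at 1.
    return 0 if value <= 0 else int(np.floor(np.log2(value))) + 1

df["chunk_size_log2"] = df["chunk_size"].apply(log2_bin)
print(df)
```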
Alternatively, to download the dataset files directly, execute the script download_dataset_files.py. All data files will be placed in the ./data folder.
Paulo Elias
Heleno de S. Campos Junior
Eduardo Ogasawara
Leonardo Gresta Paulino Murta