Detecting simple stupid bugs (SStuBs) using pre-trained transformer and repairing them with seq2seq model

This repository contains code for working with the ManySStuBs4J dataset, a collection of simple fixes to single-line Java bugs.

Repository Description

`utils`

This package contains some utility modules to fix and prepare the data.

data_reader.py: Loads the JSON dataset and stores the properties of each SStuB in a Bug class. The class also defines helper methods used by other modules, such as generating GitHub URLs and building the directory paths of source files.
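As a rough illustration of what such a reader could look like, here is a minimal sketch. The JSON keys are assumed to follow the published ManySStuBs4J format; the attribute names, `from_record`, `commit_url`, and `load_bugs` are hypothetical and are not taken from data_reader.py.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Bug:
    """One SStuB record; attribute names here are illustrative only."""
    bug_type: str
    project_name: str  # expected form: "owner.repository"
    fix_commit: str
    parent_commit: str
    file_path: str
    line_number: int
    source_before_fix: str
    source_after_fix: str

    @classmethod
    def from_record(cls, record: dict) -> "Bug":
        # Key names are an assumption based on the ManySStuBs4J JSON layout.
        return cls(
            bug_type=record["bugType"],
            project_name=record["projectName"],
            fix_commit=record["fixCommitSHA1"],
            parent_commit=record["fixCommitParentSHA1"],
            file_path=record["bugFilePath"],
            line_number=int(record["bugLineNum"]),
            source_before_fix=record["sourceBeforeFix"],
            source_after_fix=record["sourceAfterFix"],
        )

    @property
    def commit_url(self) -> str:
        # GitHub owner names cannot contain dots, so split on the first one.
        owner, repo = self.project_name.split(".", 1)
        return f"https://github.com/{owner}/{repo}/commit/{self.fix_commit}"


def load_bugs(json_text: str) -> List[Bug]:
    """Parse a JSON array of SStuB records into Bug objects."""
    return [Bug.from_record(r) for r in json.loads(json_text)]
```

Keeping the parsing in one classmethod means other modules only depend on the attribute names, not on the raw JSON keys.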

config.py: Contains configuration variables for dataset paths and other assets. By default, datasets reside in the data directory in the root of the repository:

from pathlib import Path

# Root of the datasets; a Path object so that the `/` operator below works
DATASET_ROOT = Path('../data')
SRC_FILES = DATASET_ROOT / 'src_files'

sstubs = DATASET_ROOT / 'sstubs.json'
bugs = DATASET_ROOT / 'bugs.json'
sstubs_large = DATASET_ROOT / 'sstubsLarge.json'
bugs_large = DATASET_ROOT / 'bugsLarge.json'

fix_dataset.py: Some projects in the dataset have been removed from GitHub or moved to another repository (e.g., b3log.solo has moved to 88250.solo). This module replaces each of them with an appropriate fork or a new repository that contains the same history, so that its commits and files remain accessible. Furthermore, especially in the large version, some project names contain only the repository name (e.g., struts, which should be apache.struts), so we manually found and added the missing repository owner. After the project names are corrected, the GitHub URL for each project is built and checked to confirm that the project exists on GitHub.
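The renaming step can be pictured as a lookup table followed by URL construction. The sketch below uses only the two examples mentioned above; the names `RENAMED`, `fix_project_name`, and `github_url` are hypothetical, and the existence check (an HTTP request in practice) is left out.

```python
# Replacement table; only the two examples from the text are listed here.
RENAMED = {
    "b3log.solo": "88250.solo",   # project moved to another owner
    "struts": "apache.struts",    # owner part was missing
}


def fix_project_name(name: str) -> str:
    """Return the corrected "owner.repository" name for a dataset entry."""
    fixed = RENAMED.get(name, name)
    if "." not in fixed:
        # No owner part and no known replacement: needs manual attention.
        raise ValueError(f"project name without owner: {name!r}")
    return fixed


def github_url(name: str) -> str:
    """Build the GitHub URL of the (possibly renamed) project."""
    # Split on the first dot only, since repository names may contain dots.
    owner, repo = fix_project_name(name).split(".", 1)
    return f"https://github.com/{owner}/{repo}"
```

For instance, `github_url("struts")` yields `https://github.com/apache/struts`, while an unknown name without an owner raises an error instead of producing a broken URL.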

retrieve_files.py: Downloads fixed and buggy source files based on the commit hashes given for each bug. The download process is concurrent, and the maximum number of jobs can be specified using the n_jobs variable in config.py. The directory structure of retrieved source files is like this:

username.repository/commit_hash/dotted_file_path/file.java

For example

https://github.com/apache/camel/commit/d55fc4de68d1c8d9a5aff883e2c5f84ad02aa0b8/components/camel-restlet/src/test/java/org/apache/camel/component/restlet/RestletConfigurationTest.java

is saved in:

apache.camel/d55fc4de68d1c8d9a5aff883e2c5f84ad02aa0b8/components.camel-restlet.src.test.java.org.apache.camel.component.restlet/RestletConfigurationTest.java
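The mapping from a file path inside a repository to this layout can be sketched as follows. The function name `local_source_path` is hypothetical; only the resulting layout is taken from the example above.

```python
from pathlib import PurePosixPath


def local_source_path(project_name: str, commit_hash: str, file_path: str) -> PurePosixPath:
    """Map a repository file path to project/commit/dotted.dir/file.java."""
    path = PurePosixPath(file_path)
    # Join the directory components with dots to form a single folder name.
    dotted_dir = ".".join(path.parent.parts)
    return PurePosixPath(project_name, commit_hash, dotted_dir, path.name)


example = local_source_path(
    "apache.camel",
    "d55fc4de68d1c8d9a5aff883e2c5f84ad02aa0b8",
    "components/camel-restlet/src/test/java/org/apache/camel/"
    "component/restlet/RestletConfigurationTest.java",
)
print(example)
```

Flattening the directory part into one dotted folder keeps the tree shallow while still keeping files with the same name in different packages apart.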

The downloaded source files are also available here:

all	sstubs	bugs	sstubsLarge	bugsLarge
`all_src_files.zip`	`sstubs_src_files.zip`	`bugs_src_files.zip`	`sstubsLarge_src_files.zip`	`bugsLarge_src_files.zip`

These archives use the replacement project names that fix_dataset.py assigns to deleted or moved projects.
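For readers who want to re-download the files themselves, the concurrent download with a bounded number of jobs can be sketched with a thread pool. This is an assumption about one possible approach, not the mechanism used in retrieve_files.py; `download_all` and the injected `fetch` callable are hypothetical, and only `n_jobs` mirrors the variable named in config.py.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(tasks, fetch, n_jobs=4):
    """Run fetch(url, destination) for every task on up to n_jobs threads.

    Returns a list of (url, error) pairs, where error is None on success.
    """
    def run(task):
        url, destination = task
        try:
            fetch(url, destination)
            return url, None
        except Exception as exc:  # keep going if a single download fails
            return url, exc

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() yields results in the same order as the input tasks.
        return list(pool.map(run, tasks))
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library and makes it easy to try out with a stub function.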

line_normalize.py: Line numbers in the dataset are sometimes off and, for example, point to a comment several lines before the intended line. Moreover, the programmer has sometimes broken a single Java statement across multiple lines, and the line number points to only part of this statement. These cases therefore need to be normalized by moving up and down through the lines and checking for Java-specific separators such as { and ; to collect the complete Java statement. This is especially important for the tool used in the repair part to generate patches, since it needs the given buggy line to be a complete Java statement and not just part of one.

This module does this normalization using a heuristic and saves new source files in a directory like

username.repository/commit_hash/dotted_file_path/filename.java/line_number

where line_number shows which line is normalized. These line numbers are the same as the ones in the dataset.
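To make the idea concrete, here is a simplified sketch of such a heuristic. It is not the logic of line_normalize.py: the function names are hypothetical, and it ignores cases a real implementation would need to handle, such as annotations, separators inside string literals, and trailing comments.

```python
TERMINATORS = (";", "{", "}")


def is_comment_or_blank(line: str) -> bool:
    stripped = line.strip()
    # "*" also covers the inner and closing lines of a block comment.
    return not stripped or stripped.startswith(("//", "/*", "*"))


def normalize_line(lines, line_number):
    """Return the full statement around a 1-based line number as one line."""
    idx = line_number - 1
    # If the number points at a comment or blank line, move down to code.
    while idx < len(lines) and is_comment_or_blank(lines[idx]):
        idx += 1
    if idx >= len(lines):
        raise ValueError("no code found at or after the given line")

    # Move up while the previous line is an unfinished part of the statement.
    start = idx
    while start > 0:
        previous = lines[start - 1]
        if is_comment_or_blank(previous) or previous.strip().endswith(TERMINATORS):
            break
        start -= 1

    # Move down until the statement is closed by a separator.
    end = idx
    while end < len(lines) - 1 and not lines[end].strip().endswith(TERMINATORS):
        end += 1

    return " ".join(part.strip() for part in lines[start:end + 1])


source = [
    "class A {",
    "  // compute the value",
    "  int x = foo(a,",
    "      b);",
    "  return x;",
    "}",
]
print(normalize_line(source, 2))  # line 2 is a comment
print(normalize_line(source, 4))  # line 4 is the tail of a statement
```

Both calls collect the same two physical lines into the single statement `int x = foo(a, b);`.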

`detect`

This package contains a simple example-based bug detection tool that uses a pre-trained transformer for the bug classification task.

build_model.py: Fine-tunes a pre-trained CodeBERTa model to build a bug detection model for all the bug types described in the mineSStuBs repository. Fine-tuning is based on the source_before_fix and source_after_fix fields of the dataset for buggy and fixed examples, respectively. During fine-tuning, a checkpoint is saved in the utils.config.DETECT_RESULT directory for each epoch; these checkpoints can be used to continue training or to predict bugs.
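The data preparation that precedes fine-tuning can be sketched as below, assuming a binary labeling where 1 means buggy and 0 means fixed. The function name, the label convention, and the raw JSON keys are assumptions; tokenization and the CodeBERTa training loop are deliberately left out.

```python
BUGGY, FIXED = 1, 0  # label convention assumed for this sketch


def build_examples(records):
    """Turn SStuB records into (code, label) pairs for classification."""
    examples = []
    for record in records:
        # Each record yields one buggy and one fixed example,
        # so the two classes are balanced by construction.
        examples.append((record["sourceBeforeFix"], BUGGY))
        examples.append((record["sourceAfterFix"], FIXED))
    return examples
```

The resulting pairs could then be tokenized and passed to a sequence classification model.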

`repair`

This package generates patches and tries to repair the SStuBs.

get_patches.py: Uses SequenceR to generate patches for each SStuB. SequenceR must be installed separately for this to work. The directory where SequenceR is installed is specified in the sequencer_home variable, which by default points to the home directory of the operating system. The beam size is set to 50.

compare_patches.py: Once the patches are generated, this module determines whether each bug is repaired. It compares the generated patch line with the fixed line from the fix commit. Two backends can be used for this comparison:

spoon-core: Relies on Spoon's default pretty-printing to make separators and whitespace uniform.
gumtree-spoon: Uses the snippet compare functionality of Gumtree Spoon AST Diff.

The default comparison backend is spoon-core, but it can be changed using the backend variable in the main function of this module. Java 11 must be installed for either backend to work.

evaluate.py: The results of the patch comparison from the previous module are written to repair_result.csv. This module parses that file and prints evaluation figures such as the total number of generated patches, the number of repaired bugs, and the number of repaired bugs grouped by bug pattern.

The generated patches for the sstubs.json version of the dataset and the correct ones detected using the spoon-core backend can be downloaded from this table:

Generated Patches	Correct Patches
`repair_output.zip`	`correct_patches.zip`

In this output, a total of 250861 patches were generated for 6430 bugs, an average of 39.01 patches per bug. Of these bugs, 1266 received a correct patch. The following table shows the detailed results for each bug pattern.

Pattern Name	SStuBs	Correct Patches	Ratio
CHANGE_IDENTIFIER	2332	350	15.01%
DIFFERENT_METHOD_SAME_ARGS	1365	136	9.96%
CHANGE_NUMERAL	744	202	27.15%
OVERLOAD_METHOD_MORE_ARGS	649	62	9.55%
CHANGE_OPERATOR	237	135	56.96%
CHANGE_CALLER_IN_FUNCTION_CALL	169	34	20.12%
CHANGE_UNARY_OPERATOR	154	100	64.94%
OVERLOAD_METHOD_DELETED_ARGS	154	92	59.74%
MORE_SPECIFIC_IF	151	20	13.25%
LESS_SPECIFIC_IF	132	7	5.30%
SWAP_ARGUMENTS	117	10	8.55%
SWAP_BOOLEAN_LITERAL	111	86	77.48%
CHANGE_OPERAND	92	20	21.74%
CHANGE_MODIFIER	23	12	52.17%
Total	6430	1266	19.69%
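The Ratio column is the number of bugs with a correct patch divided by the number of SStuBs of that pattern. A grouping of this kind can be sketched as follows; the column names `bug_type` and `repaired` are hypothetical, since the actual layout of repair_result.csv is not described here.

```python
import csv
import io
from collections import Counter


def repair_ratio_by_pattern(csv_text):
    """Return {pattern: (total, repaired, ratio_percent)} from CSV text."""
    total, repaired = Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        total[row["bug_type"]] += 1
        if row["repaired"] == "True":
            repaired[row["bug_type"]] += 1
    return {
        pattern: (total[pattern], repaired[pattern],
                  round(100 * repaired[pattern] / total[pattern], 2))
        for pattern in total
    }


# The same formula applied to two rows of the table above:
change_identifier_ratio = round(100 * 350 / 2332, 2)
overall_ratio = round(100 * 1266 / 6430, 2)
```

Applied to the CHANGE_IDENTIFIER row and the Total row, the formula gives the 15.01% and 19.69% shown in the table.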

How To Use

Install Python 3.8+ and clone this repository:

git clone https://github.com/h4iku/repairSStuBs.git
cd repairSStuBs

Create a virtual environment and activate it:

python -m venv env
# On Windows:
env\Scripts\activate
# Or on Linux:
source env/bin/activate

Then install the dependencies:

python -m pip install -U pip setuptools
pip install -r requirements.txt

To run each module, step outside its package (so you are at the root of the repository) and type:
```
python -m package.module
```
For example, to run the retrieve_files.py module:
```
python -m utils.retrieve_files
```
To run tests:
```
python -m unittest discover -s tests
```
Modules in each package have an order of execution, and they work on top of each other's output. The order is intuitive according to their names, and it's the same order as they are described above.

Also, don't forget to install Java and SequenceR for the repair part.
