Commandline Parameters

Luca Di Grazia edited this page Apr 1, 2022 · 18 revisions

Commandline parameters for DiffSearch

Introduction

To simplify execution, define the following shell function:

dfs() { mvn exec:java -Dexec.mainClass=research.diffsearch.main.App -Dexec.args="$1" ; }

DiffSearch can then be run with all arguments passed as a single quoted string:

dfs "args"
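Because the function forwards only `$1`, everything DiffSearch should receive has to sit inside one quoted string. The sketch below illustrates this with two hypothetical helpers, `dfs_join` and `dfs_preview` (neither is part of DiffSearch): the first joins separate words into that single string, the second prints the Maven command that would result, without running it.

```shell
# Hypothetical helper (not part of DiffSearch): joins all of its arguments
# into the single space-separated string that dfs expects as "$1".
dfs_join() {
    printf '%s' "$*"
}

# Hypothetical helper: prints the Maven command line instead of executing it.
dfs_preview() {
    printf 'mvn exec:java -Dexec.mainClass=research.diffsearch.main.App -Dexec.args="%s"\n' \
        "$(dfs_join "$@")"
}

# Shows the command that dfs "-fe -lang java" would run.
dfs_preview -fe -lang java
```

Calling `dfs -fe -lang java` without quotes would pass only `-fe` to DiffSearch, since the remaining words end up in `$2` and `$3`, which the function ignores.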

Requirements

  • Java 11 and Python 3.7
  • Linux Operating System
  • ANTLR 4 -> apt install antlr
  • Python dependencies:
    • virtualenv -p /usr/bin/python3 diffsearch-env
    • source diffsearch-env/bin/activate
    • pip3 install faiss-cpu
    • pip3 install numpy
    • pip3 install pandas
    • pip3 install dask[dataframe]

General Setup

To create an index, perform the following steps:

  1. Clone all repositories using dfs "-clone <repository-list>". The argument must be the path to a text file that lists the links to the GitHub repositories (one link per line).
  2. Parse a corpus of code changes using dfs "-d -lang <language>", where <language> is java, javascript, or python.
  3. Extract feature vectors and create an index using dfs "-fe -lang <language>". Additional parameters that modify how features are extracted are listed under "Other parameters".
  4. DiffSearch can now accept queries. It is important that additional parameters for the feature extraction are also used for any mode that accepts queries. For example, when using "-extractors node:500" in the feature extraction, this parameter must also be used for the online modes.
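The four steps can be sketched as one script. This is only an illustration under stated assumptions: the file name `repositories.txt` and the `DRY_RUN` switch are hypothetical, and passing `-lang` to the console mode is an assumption as well. With `DRY_RUN=1` (the default here) the script only prints the `dfs` calls; setting `DRY_RUN=0` inside the DiffSearch checkout would run Maven.

```shell
# DRY_RUN is a hypothetical switch for this sketch, not a DiffSearch option.
DRY_RUN="${DRY_RUN:-1}"

dfs() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'dfs "%s"\n' "$1"
    else
        mvn exec:java -Dexec.mainClass=research.diffsearch.main.App -Dexec.args="$1"
    fi
}

LANGUAGE="java"                     # java, javascript, or python
REPO_LIST="repositories.txt"        # hypothetical name: one GitHub link per line
EXTRACTORS="-extractors node:500"   # feature extraction options

# Steps 1 to 3: clone, parse, then extract features and build the index.
build_index() {
    dfs "-clone $REPO_LIST" &&
    dfs "-d -lang $LANGUAGE" &&
    dfs "-fe -lang $LANGUAGE $EXTRACTORS"
}

# Step 4: a query mode must reuse the extraction options from step 3.
start_console() {
    dfs "-n -lang $LANGUAGE $EXTRACTORS"
}

build_index && start_console
```

Keeping the extraction options in one variable is a simple way to make sure the indexing step and every query mode see the same value.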

DiffSearch Modes

| Parameter | Usage | Description |
| --- | --- | --- |
| n | -n | Console mode. Queries are entered in the console, where the results are also shown. |
| g | -g | Serves as the web server for the DiffSearch UI. |
| w | -w | Serves as the web server for the old DiffSearch UI (deprecated). |
| q | -q query | Performs a search for the given query. |
| b | -b inputpath outputpath | Processes all queries in the text file at inputpath and saves the results to outputpath. |
| fe | -fe | Feature extraction mode. Creates feature vectors from code changes and indexes them. |
| clone | -clone repository-list | Clones the listed Git repositories. repository-list is the path to a text file with the links to the GitHub repositories. |
| d | -d | Extracts and parses the code changes of the cloned Git repositories. |

Other parameters

| Parameter | Usage | Description |
| --- | --- | --- |
| pyc | -pyc pycommand | Sets the command that starts the Python environment. On Windows systems, the Windows Subsystem for Linux (WSL) is required. With WSL, this parameter should usually be "wsl python3". |
| lang | -lang language | Sets the target programming language: java, javascript, or python. |
| l | -l | Saves the server log to a log file. (WIP) |
| oj | -oj | Runs only the Java server and does not invoke Python. Only matters in "-fe", "-q", "-n", "-w", "-g", and "-b" mode. |
| p | -p port | Sets the port of the web interface. |
| r | -r | Enables recall measurement. This requires a lot of time. Recall results are saved in the resources/Recall folder. Best used in batch mode. |
| silent | --silent | If given, DiffSearch omits large console outputs, such as the results of a query. |
| py_port | -py_port port | Sets the port of the Python server. |
| k | -k number | Sets the value of k, the number of candidate changes. Higher values increase recall but reduce performance. |
| vl | -vl number | Size of each partition of the feature vectors. |
| cb | -cb number | Number of count bits (default 1). This is the number of bits used for each feature index. Changing this parameter requires reindexing. |
| t | -t number | Number of additional threads used for matching etc. The actual number of threads is 2 + this number. |
| extractors | -extractors (name(:length)?;)+ | Defines the extractors DiffSearch uses, e.g. -extractors parentchild:2000;triangle:2000. Valid extractors are node, triangle, parentchild, sibling, rulecount, editscript. Changing this parameter requires reindexing. |
| mt | -mt seconds | Maximum matching time for a single code change. After this time, a code change is considered not a match. |
| extract-query-placeholders | --extract-query-placeholders | Extracts query placeholders like EXPR. Default is false. Enabling this typically lowers recall. |
| tfidf | -tfidf | Uses tf-idf weights in the feature vectors. Changing this parameter requires reindexing. |
| noquerymultiplication | -noquerymultiplication | Query vectors do not get multiplied. Typically lowers recall. |
| nondividedextraction | -nondividedextraction | Feature extraction is not divided into the old and the new part. Typically lowers recall. |
| lr | -lr | Low RAM mode. Does not use pre-parsed parse trees. Relevant for "-d", "-fe", and all modes that process queries. Lowers the RAM requirements but decreases performance. |
| gurl | -gurl | Sets the URL of the web GUI. |
| effectiveness | -effectiveness | Runs the effectiveness evaluation. |
| scalability | -scalability | Runs the scalability evaluation. |
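Since a changed -extractors value requires reindexing, it can be worth checking the value before starting a long run. The function below is a minimal sketch of such a check; `valid_extractors` is a hypothetical helper, not part of DiffSearch. It assumes that a trailing `;` is optional (the example above omits it) and that a length is a plain decimal number.

```shell
# Hypothetical helper: returns 0 if every ';'-separated entry is a known
# extractor name, optionally followed by ':' and a numeric length.
valid_extractors() {
    local spec entry name length
    spec="${1%;}"                # assumption: one trailing ';' is optional
    [ -n "$spec" ] || return 1
    while [ -n "$spec" ]; do
        entry="${spec%%;*}"      # text before the first ';'
        name="${entry%%:*}"      # text before the first ':'
        case "$name" in
            node|triangle|parentchild|sibling|rulecount|editscript) ;;
            *) return 1 ;;
        esac
        if [ "$entry" != "$name" ]; then
            length="${entry#*:}"
            case "$length" in
                ''|*[!0-9]*) return 1 ;;   # empty or non-numeric length
            esac
        fi
        case "$spec" in
            *';'*) spec="${spec#*;}"; [ -n "$spec" ] || return 1 ;;
            *) spec="" ;;
        esac
    done
    return 0
}

spec="parentchild:2000;triangle:2000"
if valid_extractors "$spec"; then
    echo "ok: -extractors $spec"
else
    echo "invalid extractor specification: $spec" >&2
fi
```

The check only looks at the syntax of the value. Whether DiffSearch itself rejects or silently ignores a malformed value is not covered here.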

Evaluation of DiffSearch

Measuring Recall, Average Search Time and Sensitivity Analysis

Required: A list of queries ( src/main/resources/Recall/Input/queriesForRecall_JAVA.txt )

If the corpus has changed, it is important to delete src/main/resources/Recall/ExpectedValues.csv

Usage: dfs "-b -r -k 5000 -silent -lang java"

Results: src/main/resources/Recall/RecallResults.csv
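The preparation for a recall run can be sketched as a small helper. `prepare_recall_run` is a hypothetical function, not part of DiffSearch; it takes the project root as an optional argument, checks that the query list exists, and removes the stale expected values. It should only be called when the corpus has actually changed.

```shell
# Hypothetical helper: verifies the query list and deletes ExpectedValues.csv.
# $1 is the DiffSearch project root (defaults to the current directory).
prepare_recall_run() {
    local root recall_dir queries
    root="${1:-.}"
    recall_dir="$root/src/main/resources/Recall"
    queries="$recall_dir/Input/queriesForRecall_JAVA.txt"

    # -s is true only if the file exists and is not empty.
    if [ ! -s "$queries" ]; then
        echo "missing or empty query list: $queries" >&2
        return 1
    fi

    # rm -f succeeds even if the file is already absent.
    rm -f "$recall_dir/ExpectedValues.csv"
}

# Example (from the project root), followed by the usage command above:
#   prepare_recall_run . && dfs "-b -r -k 5000 -silent -lang java"
```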

Scalability and Efficiency

  • Java: dfs "-scalability -lang java"
  • Python: dfs "-scalability -lang python"
  • JavaScript: dfs "-scalability -lang javascript"

Results: src/main/resources/Scalability/

Simple Bugs Case Study

  1. Unzip java_simple_bugs.zip.
  2. Run dfs "-effectiveness -lang java".
  3. Run the Python script from its folder: src/main/resources/DiffSearch_effectiveness

Results: src/main/resources/Effectiveness/