Commandline Parameters
To simplify execution, define the following shell function:
dfs() { mvn exec:java -Dexec.mainClass=research.diffsearch.main.App -Dexec.args="$1" ; }
DiffSearch can then be run with
dfs "args"
where "args" is a single quoted string that contains all DiffSearch parameters.
Requirements:
- Java 11 and Python 3.7
- Linux operating system
- ANTLR 4 (install with `apt install antlr`)
- Python dependencies:
  - virtualenv -p /usr/bin/python3 diffsearch-env
  - source diffsearch-env/bin/activate
  - pip3 install faiss-cpu
  - pip3 install numpy
  - pip3 install pandas
  - pip3 install dask[dataframe]
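Before installing the Python packages, it can help to confirm that the basic tools are on the PATH. The following is a minimal sketch. The helper name `missing_tools` is hypothetical, and the list of command names is an assumption derived from the requirements above.

```shell
# Print every given command name that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names; adjust them to your system.
missing_tools java mvn python3 pip3 virtualenv
```

If the call prints nothing, all listed commands were found.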
To create an index, perform the following steps:
- Clone all repositories using `dfs "-clone <path>"`. The path must point to a text file that lists the links to the GitHub repositories, one link per line.
- Parse a corpus of code changes using `dfs "-d -lang <language>"`, where `<language>` is `java`, `javascript`, or `python`.
- Extract feature vectors and create an index using `dfs "-fe -lang <language>"`. Additional parameters that modify how features are extracted are listed further below.
- DiffSearch can now accept queries. Any additional parameters used for the feature extraction must also be passed to every mode that accepts queries. For example, if "-extractors node:500" was used in the feature extraction, the same parameter must also be used for the online modes.
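The steps above can be sketched as a small helper that prints the `dfs` calls in order and repeats the feature-extraction parameters for the query mode. The function name `index_plan` and the file name `repos.txt` are hypothetical. Passing `-lang` to the console mode is an assumption.

```shell
# Print the dfs calls for building an index and then querying it.
# $1: language, $2: repository list file,
# $3: optional additional feature-extraction parameters.
index_plan() {
    lang=$1
    repo_list=$2
    extra=${3:-}
    printf 'dfs "-clone %s"\n' "$repo_list"
    printf 'dfs "-d -lang %s"\n' "$lang"
    printf 'dfs "-fe -lang %s%s"\n' "$lang" "${extra:+ $extra}"
    # The query mode repeats the same feature-extraction parameters.
    printf 'dfs "-n -lang %s%s"\n' "$lang" "${extra:+ $extra}"
}

index_plan java repos.txt "-extractors node:500"
```

Keeping the additional parameters in one place makes it harder to index with one setting and query with another.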
Modes:

Parameter | Usage | Description |
---|---|---|
n | -n | Runs DiffSearch in console mode. Queries are entered in the console, where the results are also shown. |
g | -g | Runs DiffSearch as a web server for the DiffSearch UI. |
w | -w | Runs DiffSearch as a web server for the old DiffSearch UI (deprecated). |
q | -q query | Performs a search for the given query. |
b | -b inputpath outputpath | Processes all queries in the text file at the given input path and saves the results to the given output file. |
fe | -fe | Feature extraction mode. Creates feature vectors from code changes and indexes them. |
clone | -clone repository-list | Clones the list of git repositories. The parameter should be a path to a text file with all links to the GitHub repositories. |
d | -d | Extracts and parses code changes of the cloned git repositories. |

Additional parameters:

Parameter | Usage | Description |
---|---|---|
pyc | -pyc pycommand | Sets the command used to start the Python environment. On Windows, the Windows Subsystem for Linux (WSL) is required; with WSL, this parameter should usually be "wsl python3". |
lang | -lang (java\|javascript\|python) | Sets the target programming language. |
l | -l | Saves the server log to a log file (work in progress). |
oj | -oj | Runs only the Java server and does not invoke Python. Only matters in "-fe", "-q", "-n", "-w", "-g", and "-b" mode. |
p | -p port | Sets the port of the web interface. |
r | -r | Enables recall measurement, which takes a long time. Recall results are saved in the resources/Recall folder. Best used in batch mode. |
silent | --silent | If given, DiffSearch omits large console outputs, such as the results of a query. |
py_port | -py_port port | Sets the port of the Python server. |
k | -k number | Sets the value of k, the number of candidate changes. Higher values increase recall but reduce performance. |
vl | -vl number | Size of each partition of the feature vectors. |
cb | -cb number | Number of count bits (default: 1), i.e. the number of bits used for each feature index. Changing this parameter requires reindexing. |
t | -t number | Number of additional threads used for matching etc. The actual number of threads is 2 + this number. |
extractors | -extractors (name(:length)?;)+ | Defines the extractors DiffSearch uses, e.g. -extractors parentchild:2000;triangle:2000. Valid extractors are node, triangle, parentchild, sibling, rulecount, editscript. Changing this parameter requires reindexing. |
mt | -mt seconds | Maximum matching time for a single code change. After this time, the code change is considered not to match. |
extract-query-placeholders | --extract-query-placeholders | Extracts query placeholders such as EXPR (default: false). Enabling this typically lowers recall. |
tfidf | -tfidf | Uses tf-idf weights in the feature vectors. Changing this parameter requires reindexing. |
noquerymultiplication | -noquerymultiplication | Query vectors are not multiplied. Typically lowers recall. |
nondividedextraction | -nondividedextraction | Feature extraction is not divided into the old and new part. Typically lowers recall. |
lr | -lr | Low-RAM mode. Does not use pre-parsed parse trees. Relevant for "-d", "-fe", and all modes that process queries. Lowers the RAM requirements but decreases performance. |
gurl | -gurl | Sets the URL of the web GUI. |
effectiveness | -effectiveness | Run effectiveness evaluation. |
scalability | -scalability | Run scalability evaluation. |
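
Since a changed `-extractors` value requires reindexing, it may be worth checking the value before starting a long run. The sketch below validates a specification against the extractor names listed in the table. The function name `check_extractors` is hypothetical, and accepting an optional trailing `;` is an assumption based on the usage pattern and the example; DiffSearch itself may parse the value differently.

```shell
# Validate an -extractors value such as "parentchild:2000;triangle:2000".
# Returns 0 if every entry is a known extractor name with an optional
# numeric length, 1 otherwise. Uses plain (non-local) shell variables.
check_extractors() {
    rest=${1%;}                      # tolerate one trailing ';'
    [ -n "$rest" ] || return 1
    while [ -n "$rest" ]; do
        entry=${rest%%;*}            # text before the first ';'
        case $rest in
            *";"*) rest=${rest#*;} ;;
            *)     rest= ;;
        esac
        name=${entry%%:*}
        case $name in
            node|triangle|parentchild|sibling|rulecount|editscript) ;;
            *) echo "unknown extractor: $name" >&2; return 1 ;;
        esac
        if [ "$entry" != "$name" ]; then
            length=${entry#*:}
            case $length in
                ''|*[!0-9]*) echo "invalid length in: $entry" >&2; return 1 ;;
            esac
        fi
    done
    return 0
}
```

For example, `check_extractors "node:500" && dfs "-fe -lang java -extractors node:500"` would start the feature extraction only if the value passes the check.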

Recall evaluation:
Required: a list of queries (src/main/resources/Recall/Input/queriesForRecall_JAVA.txt)
If the corpus has changed, it is important to delete src/main/resources/Recall/ExpectedValues.csv
Usage: dfs "-b -r -k 5000 -silent -lang java"
Results: src/main/resources/Recall/RecallResults.csv
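Deleting the expected values after a corpus change can be wrapped in a small helper, presumably so that stale expected values are not reused. This is a minimal sketch; the function name `reset_recall_expectations` is hypothetical.

```shell
# Remove the expected recall values left over from a previous corpus.
# $1: DiffSearch project root (defaults to the current directory).
reset_recall_expectations() {
    root=${1:-.}
    rm -f "$root/src/main/resources/Recall/ExpectedValues.csv"
}
```

Calling `reset_recall_expectations` from the project root before the recall run removes the file if it exists and does nothing otherwise.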

Scalability evaluation:
Java: dfs "-scalability -lang java"
Python: dfs "-scalability -lang python"
JavaScript: dfs "-scalability -lang javascript"
Results: src/main/resources/Scalability/

Effectiveness evaluation:
Unzip java_simple_bugs.zip, then run
dfs "-effectiveness -lang java"
Then run the Python script from its folder, src/main/resources/DiffSearch_effectiveness.
Results: src/main/resources/Effectiveness/