Learning whether a method should be logged based on code metrics.
- Bash environment to run scripts
- Java >= 8.0
- Python >= 3.6 with pip available
The subject selection step in our experiments relies on an R script with the dplyr package available. This is only required if you want to replicate our study or try different selection criteria.
We highly recommend creating a Python virtual environment in your working directory before starting:
python3 -m venv .venv && source .venv/bin/activate
"Where is the dataset?"
- Raw data is not provided for practical reasons; however, the process to generate and analyze data is fully automated.
- Source files and data related to our industry partner are confidential and unavailable.
Getting started is as easy as 1, 2, 3:
| Step | What | How |
|---|---|---|
| 1 | Install the Python dependencies | `pip3 install -r requirements.txt` |
| 2 | Build the project components | `./gradlew deploy-aux-tools` |
| 3 | Get the selected list of Apache projects (~3.5 GB) | `./gradlew fetch-projects-paper` |
Some Python scripts depend on the packages installed in Step 1 (e.g., Pandas and NumPy) to analyze intermediate data during experimentation.
Step 3 will download 29 Apache projects listed in a CSV file into an apache-downloads directory. It will also generate an apache-projects directory with scripts that export the absolute path and revision of a given project.
If you want to download the initial list of all 69 Apache projects, run ./gradlew fetch-apache-projects instead.
Keep in mind that the full list amounts to ~7 GB.
By now, you should be able to have some fun. Try processing the project Apache Commons BeanUtils:
> ./run-single.sh apache-projects/commons-beanutils.sh
What happened?
- Scripts classified Java source files
- Scripts analyzed the presence of log statements in those files
- Scripts counted log statements, recording where they were placed and how they are distributed
- Scripts created a copy of the analyzed files with log statements removed, for feature extraction
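As an illustration only (the actual log-identifier component is Java-based and its pattern is more elaborate), detecting typical logger calls with a simplified regex might look like this sketch:

```python
import re

# Simplified pattern for common Java logging calls such as log.info(...)
# or LOGGER.debug(...). This is an assumption for illustration, not the
# regex the tool actually ships with.
LOG_CALL = re.compile(
    r"\b(?:log|logger)\s*\.\s*(?:trace|debug|info|warn|error|fatal)\s*\(",
    re.IGNORECASE,
)

source = """
public void save(User u) {
    logger.debug("saving user {}", u.getId());
    repository.persist(u);
}
"""

# Keep only the lines that contain a log call
logged_lines = [line for line in source.splitlines() if LOG_CALL.search(line)]
print(logged_lines)
```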
This is the expected output:
> find out -d 2 -type d
out/analysis/commons-beanutils # Analysis of source files
out/dataset/commons-beanutils # Contains the final dataset for machine learning experiments
out/log-removal/commons-beanutils # Contains a copy of source files without log statements
out/codemetrics/commons-beanutils # Contains code metrics of the analyzed files
With the generated CSV, you can reuse our machine learning package in an interactive environment:
>>> import logpred_method
>>> X, y = logpred_method.load_dataset("out/dataset/commons-beanutils/dataset_full.csv")
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
>>> out = logpred_method.run("rf", X_train, X_test, y_train, y_test, output_to="output.log", tuning_enabled=False)
For details about replicating our study, see docs/Paper Evaluation - Tuning Enabled.pdf.
Feel free to use the generated dataset with your preferred ML library. We encourage you to explore your own ML training process and compare it with our results.
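For example, a plain scikit-learn run could look like the sketch below. Since the raw data is generated locally, it uses a synthetic stand-in with hypothetical metric columns (loc, cyclomatic_complexity, num_method_calls, logged); swap in your generated dataset_full.csv and its real columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for out/dataset/<project>/dataset_full.csv;
# column names here are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "loc": rng.integers(1, 200, 500),
    "cyclomatic_complexity": rng.integers(1, 20, 500),
    "num_method_calls": rng.integers(0, 50, 500),
    "logged": rng.integers(0, 2, 500),  # target: 1 if the method has a log statement
})

X, y = df.drop(columns="logged"), df["logged"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Train a random forest and report balanced accuracy (useful when
# logged/unlogged methods are imbalanced, as in real projects).
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(balanced_accuracy_score(y_test, clf.predict(X_test)))
```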
log-identifier
: Cross-project utility that identifies log statements based on regex

log-placement-analyzer
: Analyzes the placement of log statements in a project given a list of source files

log-remover
: Removes log statements from a project

log-prediction
: ML component for experimentation

java-token-extractor
: Utility for token and method-call extraction
Feel free to post a question in the Q&A Section in the Discussions tab.
Please keep the Issues tab for bugs, code-related concerns, etc.