We explore the calibration of LLMs on code summarization: can an LLM's token probabilities be leveraged to produce a well-calibrated estimate of the likelihood that a generated summary is similar to what a developer would have written for the same code?
This repository stores code to replicate experiments in the paper Calibration of Large Language Models on Code Summarization (FSE 2025).
Data is omitted due to large size. All data is accessible at: https://zenodo.org/records/11646569
If you have difficulty working with the code or need additional details, let me know: yuvivirk344 at gmail dot com
- Thresholds on similarity metrics:
- thresholds.py: Measuring agreement with human judgments of similarity
- best_metric_thresh.ipynb: Searching for the best similarity metrics and thresholds
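A minimal sketch of the threshold search, assuming metric scores paired with binary human similarity judgments (function and variable names are illustrative, not the exact code in best_metric_thresh.ipynb):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def best_threshold(metric_scores, human_labels, candidates=np.linspace(0.0, 1.0, 101)):
    """Pick the cutoff on a similarity metric that maximizes agreement
    with binary human 'similar' / 'not similar' judgments."""
    scores = np.asarray(metric_scores, dtype=float)
    labels = np.asarray(human_labels)

    def agree(t):
        return balanced_accuracy_score(labels, scores >= t)

    best = max(candidates, key=agree)
    return float(best), agree(best)

# e.g. thresh, agreement = best_threshold(sbert_scores, human_judgments)
```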
- Generating code summaries:
- generate_contexts/: Generating few-shot and ASAP contexts
- summarization_inference.py: Prompting LLMs to generate code summaries
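A minimal sketch of the generation step using Hugging Face transformers (model name, prompt shape, and variable names are placeholders; summarization_inference.py may differ). It generates a summary and records the probability the model assigned to each generated token, which later serves as the confidence signal:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-Instruct-hf"   # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

few_shot_context = "..."   # e.g. built by generate_contexts/
code_snippet = "public int add(int a, int b) { return a + b; }"
prompt = few_shot_context + "\n" + code_snippet + "\n// Summary:"   # assumed prompt shape

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=30,
                     return_dict_in_generate=True, output_scores=True)

gen_ids = out.sequences[0, inputs.input_ids.shape[1]:]
summary = tok.decode(gen_ids, skip_special_tokens=True)
# Probability assigned to each generated token (one logits tensor per step)
token_probs = [torch.softmax(step_logits[0], dim=-1)[tid].item()
               for step_logits, tid in zip(out.scores, gen_ids)]
```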
- RQ1 & RQ2: Raw and scaled calibration of LLMs on code summarization
- calibration_metrics.py: Calculating calibration metrics and producing plots for raw and scaled LLM token probabilities (the metrics are sketched below)
- reliability_plot.py: Plotting function for reliability diagrams
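For reference, sketches of two standard calibration measures relevant to RQ1 and RQ2: the Brier score and expected calibration error (ECE), the binned confidence-versus-accuracy gap that reliability diagrams visualize. Here a summary is treated as "correct" when its similarity metric exceeds the chosen threshold; calibration_metrics.py may differ in details such as binning.

```python
import numpy as np

def brier_score(confidence, correct):
    """Mean squared error between confidence and 0/1 correctness."""
    c, y = np.asarray(confidence, float), np.asarray(correct, float)
    return float(np.mean((c - y) ** 2))

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    c, y = np.asarray(confidence, float), np.asarray(correct, float)
    bins = np.minimum((c * n_bins).astype(int), n_bins - 1)   # bin index 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(c[mask].mean() - y[mask].mean())
    return float(ece)
```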
- RQ3: Relationship between token position and confidence
- logit_vs_position_analysis.ipynb: Plotting token position vs. the distribution of token probabilities at that position
- calibration_vs_token_cutoff.py: Computing calibration metrics when only the first k tokens are used to measure confidence (one possible aggregation is sketched after this list)
- brier_score_significant_testing.py: Statistical significance testing for the improvement in Brier score
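One way to compute the k-token confidence referenced above, assuming per-token probabilities like those recorded in the generation sketch (the aggregation in calibration_vs_token_cutoff.py may differ, e.g. arithmetic rather than geometric mean):

```python
import numpy as np

def confidence_at_cutoff(token_probs, k):
    """Aggregate the first k token probabilities into a single confidence
    (geometric mean here, so the value stays in [0, 1])."""
    p = np.clip(np.asarray(token_probs[:k], float), 1e-12, 1.0)
    return float(np.exp(np.mean(np.log(p))))

# e.g. for k in (1, 5, 10, 20):
#     confs = [confidence_at_cutoff(p, k) for p in per_summary_token_probs]
#     print(k, expected_calibration_error(confs, correct))
```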
- Additional experiments:
- reflective_logit_analysis.py: Prompting LLMs with a yes/no question asking whether the summary is similar to what a developer would write (sketched after this list)
- self_reflection_prompting.py: Prompting LLMs to directly generate scores or probabilities that the summary is similar to what a developer would write
- verbalized_confidence_analysis.py: Analysis of self-reflection results
- benchmarking.py: Measuring performance of LLMs on code summarization with different similarity metrics
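A sketch of the yes/no reflective probe, continuing from the generation sketch above (the prompt wording is illustrative, not the exact prompt in reflective_logit_analysis.py): ask the model whether its own summary resembles a developer-written one and read its next-token probability mass on "Yes" versus "No" as a confidence score.

```python
import torch

question = (code_snippet + "\nSummary: " + summary +
            "\nIs this summary similar to what a developer would write? Answer Yes or No: ")
q_inputs = tok(question, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_logits = model(**q_inputs).logits[0, -1]          # logits for the next token
# Take the first sub-token of "Yes"/"No" as an approximation
yes_id = tok("Yes", add_special_tokens=False).input_ids[0]
no_id = tok("No", add_special_tokens=False).input_ids[0]
# Renormalize over just the two answer tokens
p_yes, p_no = torch.softmax(next_logits[[yes_id, no_id]], dim=-1).tolist()
```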
The data directory contains 3 subdirectories: haque_et_al, Java, Python.
- The haque_et_al subdirectory contains all human evaluation data used (from S. Haque, Z. Eberhart, A. Bansal, and C. McMillan, "Semantic Similarity Metrics for Evaluating Source Code Summarization").
- Each language directory contains 3 subdirectories: metrics_results, model_outputs, and prompting_data.
- prompting_data contains all prompts used for each prompting method and the data used to construct the prompts.
- model_outputs contains the generated output per model per prompting method.
- metrics_results contains the calculated summary evaluation metrics for all model outputs.
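The layout described above, for reference:

```
data/
├── haque_et_al/          # human evaluation data from Haque et al.
├── Java/
│   ├── prompting_data/   # prompts per prompting method + data used to build them
│   ├── model_outputs/    # generated summaries per model per prompting method
│   └── metrics_results/  # summary evaluation metrics for all model outputs
└── Python/               # same layout as Java/
```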
The results directory contains the raw results in JSON format for benchmarks, rank correlations, raw and rescaled calibration evaluations, and thresholds, along with the corresponding figures, across multiple similarity metrics (e.g., SentenceBERT, abbreviated sbert).