- 📜 Contains 154 code snippets to test and benchmark.
- 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
- 📂 Organized into 18 distinct categories targeting various Python features.
- 🚢 Seamlessly manages the execution of containerized tools.
- 🔄 Efficiently transforms inferred types into a standardized format.
- 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
- 🤖 Autogenerates code snippets and ground truth to scale the benchmark based on the original TypeEvalPy benchmark.
- 📈 The autogen benchmark now contains:
  - Python files: 7121
  - Type annotations: 78373
Supported ✅ | In-progress 🔧 | Planned 💡 |
---|---|---|
HeaderGen | Intellij PSI | MonkeyType |
Jedi | Pyre | Pyannotate |
Pyright | PySonar2 | |
HiTyper | Pytype | |
Scalpel | TypeT5 | |
Type4Py | | |
GPT | | |
Ollama | | |
Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.
Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
---|---|---|---|---|---|
1 | mistral-large-it-2407-123b | 16701 | 728 | 57550 | 74979 |
2 | qwen2-it-72b | 16488 | 629 | 55160 | 72277 |
3 | llama3.1-it-70b | 16648 | 580 | 54445 | 71673 |
4 | gemma2-it-27b | 16342 | 599 | 49772 | 66713 |
5 | codestral-v0.1-22b | 16456 | 706 | 49379 | 66541 |
6 | codellama-it-34b | 15960 | 473 | 48957 | 65390 |
7 | mistral-nemo-it-2407-12.2b | 16221 | 526 | 48439 | 65186 |
8 | mistral-v0.3-it-7b | 16686 | 472 | 47935 | 65093 |
9 | phi3-medium-it-14b | 16802 | 467 | 45121 | 62390 |
10 | llama3.1-it-8b | 16125 | 492 | 44313 | 60930 |
11 | codellama-it-13b | 16214 | 479 | 43021 | 59714 |
12 | phi3-small-it-7.3b | 16155 | 422 | 38093 | 54670 |
13 | qwen2-it-7b | 15684 | 313 | 38109 | 54106 |
14 | HeaderGen | 14086 | 346 | 36370 | 50802 |
15 | phi3-mini-it-3.8b | 15908 | 320 | 30341 | 46569 |
16 | phi3.5-mini-it-3.8b | 15763 | 362 | 28694 | 44819 |
17 | codellama-it-7b | 13779 | 318 | 29346 | 43443 |
18 | Jedi | 13160 | 0 | 15403 | 28563 |
19 | Scalpel | 15383 | 171 | 18 | 15572 |
20 | gemma2-it-9b | 1611 | 66 | 5464 | 7141 |
21 | Type4Py | 3143 | 38 | 2243 | 5424 |
22 | tinyllama-1.1b | 1514 | 28 | 2699 | 4241 |
23 | mixtral-v0.1-it-8x7b | 3235 | 33 | 377 | 3645 |
24 | phi3.5-moe-it-41.9b | 3090 | 25 | 273 | 3388 |
25 | gemma2-it-2b | 1497 | 41 | 1848 | 3386 |
(Auto-generated based on the analysis run on 30 Aug 2024)
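As a quick sanity check on the table above, the following minimal sketch verifies that the Total column equals the sum of the three category columns for a few rows. It also expresses each total as a share of the 78373 type annotations listed for the autogen benchmark; using that figure as the denominator is an assumption made here, not something the table states.

```python
# Rows copied from the table above:
# (tool, function return, function parameter, local variable, total).
ROWS = [
    ("mistral-large-it-2407-123b", 16701, 728, 57550, 74979),
    ("HeaderGen", 14086, 346, 36370, 50802),
    ("Jedi", 13160, 0, 15403, 28563),
    ("Scalpel", 15383, 171, 18, 15572),
]

# Annotation count of the autogen benchmark from the feature list.
# Assumption: this is the ground-truth size the totals are measured against.
TOTAL_ANNOTATIONS = 78373


def check_total(row):
    """Return True if the Total column equals the sum of the three categories."""
    _, returns, params, local_vars, total = row
    return returns + params + local_vars == total


def exact_match_share(row, denominator=TOTAL_ANNOTATIONS):
    """Exact matches as a percentage of the assumed annotation count."""
    return 100.0 * row[4] / denominator


for row in ROWS:
    name = row[0]
    print(f"{name:<28} consistent={check_total(row)} share={exact_match_share(row):.1f}%")
```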
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
docker build -t typeevalpy .
🕒 Takes about 30 minutes on the first run to build the Docker containers.
📂 Results will be generated in the `results` folder within the root directory of the repository.
Each results folder will have a timestamp, allowing you to easily track and compare different runs.
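To pick up the most recent run programmatically, a small helper along these lines may be useful. It is only a sketch: the function name `latest_results_dir` is hypothetical, and it selects by modification time because the exact timestamp format of the folder names is not specified here.

```python
from pathlib import Path


def latest_results_dir(results_root):
    """Return the most recently modified sub-folder of results_root, or None."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    run_dirs = [path for path in root.iterdir() if path.is_dir()]
    # Modification time is used instead of parsing the folder name, since
    # the timestamp format of the names is an unknown here.
    return max(run_dirs, key=lambda path: path.stat().st_mtime, default=None)


# Prints None when no "results" folder exists in the working directory.
print(latest_results_dir("results"))
```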
Correlation of CSV Files Generated to Tables in ICSE Paper
Here is how the auto-generated CSV tables relate to the paper's tables:
- Table 1 in the paper is derived from three auto-generated CSV tables:
  - `paper_table_1.csv` - details Exact matches by type category.
  - `paper_table_2.csv` - lists Exact matches for 18 micro-benchmark categories.
  - `paper_table_3.csv` - provides Sound and Complete values for tools.
- Table 2 in the paper is based on the following CSV table:
  - `paper_table_5.csv` - shows Exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:
- `paper_table_4.csv` - contains Sound and Complete values for 18 micro-benchmark categories.
- `paper_table_6.csv` - features Sensitivity analysis.
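The CSV tables listed above can be inspected with the standard library alone. The sketch below assumes that the `paper_table_*.csv` files sit directly inside one run folder and that each has a header row; the column names are not documented here, so the code stays generic. The helper names are hypothetical.

```python
import csv
from pathlib import Path


def load_csv_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by its header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize_run(run_dir):
    """Map each paper_table_*.csv found in run_dir to its number of data rows."""
    return {
        path.name: len(load_csv_rows(path))
        for path in sorted(Path(run_dir).glob("paper_table_*.csv"))
    }
```

Calling `summarize_run` on a timestamped results folder gives a quick overview of which tables a run produced before opening any of them.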
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ./results:/app/results \
typeevalpy
🔧 Optionally, run analysis on specific tools:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ./results:/app/results \
typeevalpy --runners headergen scalpel
📊 Run analysis on custom benchmarks:
For example, to run HeaderGen on the autogen benchmark:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ./results:/app/results \
typeevalpy \
--runners headergen \
--custom_benchmark_dir /app/autogen_typeevalpy_benchmark
🛠️ Available options: `headergen`, `pyright`, `scalpel`, `jedi`, `hityper`, `type4py`, `hityperdl`
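When scripting several runs, the `docker run` invocations shown above can be assembled programmatically. The following is a minimal sketch, not part of TypeEvalPy: the helper `build_docker_command` and the `KNOWN_RUNNERS` set are hypothetical, built only from the runner names, flags, and volume mounts that appear in this README.

```python
import shlex

# Runner names taken from this README (the listed options plus "ollama").
KNOWN_RUNNERS = {
    "headergen", "pyright", "scalpel", "jedi",
    "hityper", "type4py", "hityperdl", "ollama",
}


def build_docker_command(runners=None, custom_benchmark_dir=None, image="typeevalpy"):
    """Return the docker run command as an argument list (nothing is executed)."""
    command = [
        "docker", "run",
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        "-v", "./results:/app/results",
        image,
    ]
    if runners:
        unknown = sorted(set(runners) - KNOWN_RUNNERS)
        if unknown:
            raise ValueError(f"Unknown runners: {unknown}")
        command += ["--runners", *runners]
    if custom_benchmark_dir:
        command += ["--custom_benchmark_dir", custom_benchmark_dir]
    return command


# Print the command for review; pass the list to subprocess.run to execute it.
print(shlex.join(build_docker_command(["headergen", "scalpel"])))
```

Returning a list rather than a single string avoids shell-quoting issues if the list is later passed to `subprocess.run`.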
TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:
- Create Configuration File: Copy the `config_template.yaml` from the src directory and rename it to `config.yaml`.

In the `config.yaml`, configure the following:
- `openai_key`: your key for accessing OpenAI's models.
- `ollama_url`: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. Get started with Ollama here.
- `prompt_id`: set this to `questions_based_2` for optimal performance, based on our tests.
- `ollama_models`: select a list of model tags from the Ollama library. For better operation, ensure the model is pre-downloaded with the `ollama pull` command.
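The settings above can be checked before a run with a short script. This is a sketch under stated assumptions: the configuration is represented as a plain Python dict rather than parsed from `config.yaml`, the values are placeholders, the URL assumes Ollama's default local port, and the model tags are examples only. The helper names are hypothetical, and the project may not require every key for every runner.

```python
# Keys described in the list above.
EXPECTED_KEYS = ("openai_key", "ollama_url", "prompt_id", "ollama_models")

# Illustrative values only; replace them with your own settings.
example_config = {
    "openai_key": "YOUR_OPENAI_KEY",              # placeholder
    "ollama_url": "http://localhost:11434",       # assumed: default local Ollama port
    "prompt_id": "questions_based_2",             # value recommended above
    "ollama_models": ["llama3.1:8b", "qwen2:7b"], # example tags, assumed
}


def missing_keys(config):
    """Return the expected keys that are absent or empty in config."""
    return [key for key in EXPECTED_KEYS if not config.get(key)]


def pull_commands(config):
    """Return one `ollama pull <tag>` argument list per configured model."""
    return [["ollama", "pull", tag] for tag in config.get("ollama_models", [])]


print("missing:", missing_keys(example_config))
for command in pull_commands(example_config):
    print(" ".join(command))
```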
With the `config.yaml` configured, run the following command:
docker run \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ./results:/app/results \
typeevalpy --runners ollama
Running From Source...
- Clone the repo
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
- Install Dependencies and Set Up Virtual Environment
Run the following commands to create and activate a virtual environment and install the dependencies:
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
- Navigate to the `src` Directory
cd src
- Execute the Analyzer
Run the following command to start the benchmarking process on all tools:
python main_runner.py
Or run the analysis on specific tools:
python main_runner.py --runners headergen scalpel
To generate an extended version of the original TypeEvalPy benchmark to include many more Python types, run the following commands:
- Navigate to the `autogen` Directory
cd autogen
- Execute the Generation Script
Run the following command to start the generation process:
python generate_typeevalpy_dataset.py
This generates a folder in the repo root that contains the autogen benchmark and is stamped with the current date.
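After generation, one simple way to gauge the size of the new benchmark is to count its Python files. The sketch below uses a hypothetical helper, `count_python_files`, and counts every `.py` file recursively; whether this matches how the published file count was computed is an assumption.

```python
from pathlib import Path


def count_python_files(benchmark_dir):
    """Count .py files anywhere below benchmark_dir."""
    return sum(1 for path in Path(benchmark_dir).rglob("*.py") if path.is_file())
```

Pass the path of the generated folder, for example `count_python_files("path/to/generated_folder")`, where the path is a placeholder for the dated folder created by the script.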
Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, please submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.
To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md
Give a ⭐️ if this project helped you!