diff --git a/docs/examples/java/code_summarization.ipynb b/docs/examples/java/code_summarization.ipynb deleted file mode 100644 index 4ec3855..0000000 --- a/docs/examples/java/code_summarization.ipynb +++ /dev/null @@ -1,493 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "!pip install ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "eebee2515df69b96" - }, - { - "cell_type": "markdown", - "source": [ - "# Using CLDK to explain Java methods\n", - "\n", - "In this tutorial, we will use CLDK to explain or generate code summary for all the methods in a Java Application.\n", - "\n", - "By the end of this tutorial, you will have code summary for all the methods in a Java application. You'll be able to explore some of the benefits of using CLDK to perform fast and easy program analysis and build a LLM-based code summary generation.\n", - "\n", - "You will learn how to do the following:\n", - "\n", - "
    \n", - "
  1. Create a new instance of the CLDK class.\n", - "
  2. Create an analysis object over the Java application.\n", - "
  3. Iterate over all the files in the project.\n", - "
  4. Iterate over all the classes in the file.\n", - "
  5. Iterate over all the methods in the class.\n", - "
  6. Get the code body of the method.\n", - "
  7. Initialize the treesitter utils for the class file content.\n", - "
  8. Sanitize the class for analysis.\n", - "
\n", - "Next, we will write a couple of helper methods to:\n", - "\n", - "
    \n", - "
  1. Format the instruction for the given focal method and class.\n", - "
  2. Prompts the local model on Ollama.\n", - "
  3. Prints the instruction and LLM output.\n", - "
" - ], - "metadata": { - "collapsed": false - }, - "id": "59d05bbe28e62687" - }, - { - "cell_type": "markdown", - "source": [ - "## Prequisites\n", - "\n", - "Before we get started, let's make sure you have the following installed:\n", - "\n", - "
    \n", - "
  1. Python 3.11 or later\n", - "
  2. Ollama 0.3.4 or later\n", - "
\n", - "We will use ollama to spin up a local granite model that will act as our LLM for this turorial." - ], - "metadata": { - "collapsed": false - }, - "id": "92896c8ce12b0e9e" - }, - { - "cell_type": "markdown", - "source": [ - "### Prerequisite 1: Install ollama\n", - "\n", - "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", - "Once you have ollama, start the server and make sure it is running.\n", - "If you're on MacOS, Linux, or WSL, you can check to make sure the server is running by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "bfeb1e1227191e3b" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "systemctl status ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "c53214c8106642ce" - }, - { - "cell_type": "markdown", - "source": [ - "If not, you may have to start the server manually. You can do this by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "34a7b1802be15a3f" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "systemctl start ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "f60e2d9ec12f0bf6" - }, - { - "cell_type": "markdown", - "source": [ - "Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", - "\n", - "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags)." 
- ], - "metadata": { - "collapsed": false - }, - "id": "f629a10841aca9e2" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "ollama pull granite-code:8b-instruct" - ], - "metadata": { - "collapsed": false - }, - "id": "6ff900382e86a18e" - }, - { - "cell_type": "markdown", - "source": [ - "Let's make sure the model is downloaded by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "d076e98c390591b5" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" - ], - "metadata": { - "collapsed": false - }, - "id": "7aff854a031589f0" - }, - { - "cell_type": "markdown", - "source": [ - "### Prerequisite 3: Install ollama Python SDK" - ], - "metadata": { - "collapsed": false - }, - "id": "531205b489bbec73" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "pip install ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "e2a749932a800c9d" - }, - { - "cell_type": "markdown", - "source": [ - "### Prerequisite 4: Install CLDK\n", - "CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "6f42dbd286b3f7a6" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "pip install git+https://github.com/IBM/codellm-devkit.git" - ], - "metadata": { - "collapsed": false - }, - "id": "327e212f20a489d6" - }, - { - "cell_type": "markdown", - "source": [ - "### Step 1: Get the sample Java application\n", - "For this tutorial, we will use apache commons cli. 
You can download the source code to a temporary directory by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "dd8ec5b9c837898f" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" - ], - "metadata": { - "collapsed": false - }, - "id": "c196e58b3ce90c34" - }, - { - "cell_type": "markdown", - "source": [ - "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." - ], - "metadata": { - "collapsed": false - }, - "id": "44e875e7ce6db504" - }, - { - "cell_type": "markdown", - "source": [ - "### Generate code summary\n", - "Code summarization or code explanation is a task that converts a code written in a programming language to a natural language. This particular task has several\n", - "benefits, such as understanding code without looking at its intrinsic details, documenting code for better maintenance, etc. To do that, one needs to\n", - "understand the basic details of code structure works, and use that knowledge to generate the summary using various AI-based approaches. In this particular\n", - "example, we will be using Large Language Models (LLM), specifically Granite 8B, an open-source model built by IBM. We will show how easily a developer can use\n", - "CLDK to expose various parts of the code by calling various APIs without implementing various time-intensive program analyses from scratch." 
- ], - "metadata": { - "collapsed": false - }, - "id": "6ad70b81e8957fc0" - }, - { - "cell_type": "markdown", - "source": [ - "Step 1: Add all the neccessary imports" - ], - "metadata": { - "collapsed": false - }, - "id": "15555404790e1411" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from pathlib import Path\n", - "import ollama\n", - "from cldk import CLDK\n", - "from cldk.analysis import AnalysisLevel" - ], - "metadata": { - "collapsed": false - }, - "id": "8e8e5de7e5c68020" - }, - { - "cell_type": "markdown", - "source": [ - "Step 2: Formulate the LLM prompt. The prompt can be tailored towards various needs. In this case, we show a simple example of generating summary for each\n", - "method in a Java class" - ], - "metadata": { - "collapsed": false - }, - "id": "ffc4ee9a6d27acc2" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def format_inst(code, focal_method, focal_class, language):\n", - " \"\"\"\n", - " Format the instruction for the given focal method and class.\n", - " \"\"\"\n", - " inst = f\"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\\n\"\n", - "\n", - " inst += \"\\n\"\n", - " inst += f\"```{language}\\n\"\n", - " inst += code\n", - " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", - " inst += \"\\n\"\n", - " return inst" - ], - "metadata": { - "collapsed": false - }, - "id": "9e23523c71636727" - }, - { - "cell_type": "markdown", - "source": [], - "metadata": { - "collapsed": false - }, - "id": "a4e9cb4e4f00b25c" - }, - { - "cell_type": "markdown", - "source": [ - "Step 3: Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." 
- ], - "metadata": { - "collapsed": false - }, - "id": "dd8439be222b5caa" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", - " \"\"\"Prompt local model on Ollama\"\"\"\n", - " response_object = ollama.generate(model=model_id, prompt=message)\n", - " return response_object[\"response\"]" - ], - "metadata": { - "collapsed": false - }, - "id": "62807e0cbf985ae6" - }, - { - "cell_type": "markdown", - "source": [ - "Step 4: Create an object of CLDK and provide the programming language of the source code." - ], - "metadata": { - "collapsed": false - }, - "id": "1022e86e38e12767" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "# Create a new instance of the CLDK class\n", - "cldk = CLDK(language=\"java\")" - ], - "metadata": { - "collapsed": false - }, - "id": "a2c8bbe4e3244f60" - }, - { - "cell_type": "markdown", - "source": [ - "Step 5: CLDK uses different analysis engine--Codeanalyzer (built using WALA and Javaparser), Treesitter, and CodeQL (future). By default, codenanalyzer has\n", - "been selected as the default analysis engine. Also, CLDK support different analysis levels--(a) symbol table, (b) call graph, (c) program dependency graph, and\n", - "(d) system dependency graph. Analysis engine can be selected using ```AnalysisLevel``` enum. In this example, we will generate summarization of all the methods\n", - "of an application. 
" - ], - "metadata": { - "collapsed": false - }, - "id": "23dd4a6e5d5cb0c5" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "# Create an analysis object over the java application\n", - "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)" - ], - "metadata": { - "collapsed": false - }, - "id": "fdd09f5e77d4a68a" - }, - { - "cell_type": "markdown", - "source": [ - "Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a customized Java class in the prompt. For instance,\n", - "\n", - "```\n", - "package com.ibm.org;\n", - "import A.B.C.D;\n", - "...\n", - "public class Foo {\n", - " // code comment\n", - " public void bar(){ \n", - " int a;\n", - " a = baz();\n", - " // do something\n", - " }\n", - " private int baz()\n", - " {\n", - " // do something\n", - " }\n", - " public String dummy (String a)\n", - " {\n", - " // do somthing\n", - " } \n", - "```\n", - "Given the above class, let's say we want to generate a summary for the ```bar``` method. To understand what it does, we add the callee of this method in the prompt, which in this case is ```baz```. We also remove imports, comments, etc. All of these are done using a single call to ```sanitize_focal_class``` API. In this process, we also use Treesitter to analyze the code. Once the input code has been sanitized, we call the ```format_inst``` method to create the LLM prompt, which has been passed to ```prompt_ollama``` method to generate the summary using LLM." 
- ], - "metadata": { - "collapsed": false - }, - "id": "f148325e92781e13" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "# Iterate over all the files in the project\n", - "for file_path, class_file in analysis.get_symbol_table().items():\n", - " class_file_path = Path(file_path).absolute().resolve()\n", - " # Iterate over all the classes in the file\n", - " for type_name, type_declaration in class_file.type_declarations.items():\n", - " # Iterate over all the methods in the class\n", - " for method in type_declaration.callable_declarations.values():\n", - " # Get code body of the method\n", - " code_body = class_file_path.read_text()\n", - " \n", - " # Initialize the treesitter utils for the class file content\n", - " tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n", - " \n", - " # Sanitize the class for analysis\n", - " sanitized_class = tree_sitter_utils.sanitize_focal_class(method.declaration)\n", - " \n", - " # Format the instruction for the given focal method and class\n", - " instruction = format_inst(\n", - " code=sanitized_class,\n", - " focal_method=method.declaration,\n", - " focal_class=type_name,\n", - " language=\"java\"\n", - " )\n", - " \n", - " # Prompt the local model on Ollama\n", - " llm_output = prompt_ollama(\n", - " message=instruction,\n", - " model_id=\"granite-code:20b-instruct\",\n", - " )\n", - " \n", - " # Print the instruction and LLM output\n", - " print(f\"Instruction:\\n{instruction}\")\n", - " print(f\"LLM Output:\\n{llm_output}\")" - ], - "metadata": { - "collapsed": false - }, - "id": "462ef7dceae367ad" - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": 
"2.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/examples/java/generate_unit_tests.ipynb b/docs/examples/java/generate_unit_tests.ipynb deleted file mode 100644 index 5acd853..0000000 --- a/docs/examples/java/generate_unit_tests.ipynb +++ /dev/null @@ -1,422 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "!pip install ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "ee51e198aaebcd9b" - }, - { - "cell_type": "markdown", - "source": [ - "# Using CLDK to generate JUnit tests\n", - "\n", - "In this tutorial, we will use CLDK to generate a JUnit test for all the methods in a Java Application.\n", - "\n", - "By the end of this tutorial, you will have a JUnit test for all the methods in a Java application. You'll be able to explore some of the benefits of using CLDK to perform fast and easy program analysis and build a LLM-based test generator.\n", - "\n", - "You will learn how to do the following:\n", - "\n", - "
    \n", - "
  1. Create a new instance of the CLDK class.\n", - "
  2. Create an analysis object over the Java application.\n", - "
  3. Iterate over all the files in the project.\n", - "
  4. Iterate over all the classes in the file.\n", - "
  5. Iterate over all the methods in the class.\n", - "
  6. Get the code body of the method.\n", - "
  7. Initialize the treesitter utils for the class file content.\n", - "
  8. Sanitize the class for analysis.\n", - "
\n", - "Next, we will write a couple of helper methods to:\n", - "\n", - "
    \n", - "
  1. Format the instruction for the given focal method and class.\n", - "
  2. Prompts the local model on Ollama.\n", - "
  3. Prints the instruction and LLM output.\n", - "
" - ], - "metadata": { - "collapsed": false - }, - "id": "428dbbfa206f5417" - }, - { - "cell_type": "markdown", - "source": [ - "## Prequisites\n", - "\n", - "Before we get started, let's make sure you have the following installed:\n", - "\n", - "
    \n", - "
  1. Python 3.11 or later\n", - "
  2. Ollama 0.3.4 or later\n", - "
\n", - "We will use ollama to spin up a local granite model that will act as our LLM for this turorial." - ], - "metadata": { - "collapsed": false - }, - "id": "f619a9379b9dd006" - }, - { - "cell_type": "markdown", - "source": [ - "### Prerequisite 1: Install ollama\n", - "\n", - "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", - "Once you have ollama, start the server and make sure it is running.\n", - "If you're on MacOS, Linux, or WSL, you can check to make sure the server is running by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "3485879a7733bcba" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "systemctl status ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "2f67be6c8c024e12" - }, - { - "cell_type": "markdown", - "source": [ - "If not, you may have to start the server manually. You can do this by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "273e60ca598e0a53" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "systemctl start ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "cc6877ce338e9102" - }, - { - "cell_type": "markdown", - "source": [ - "Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", - "\n", - "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags)." 
- ], - "metadata": { - "collapsed": false - }, - "id": "c024dc7ec2869a72" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "ollama pull granite-code:8b-instruct" - ], - "metadata": { - "collapsed": false - }, - "id": "5ad0e8ac33c7108e" - }, - { - "cell_type": "markdown", - "source": [ - "Let's make sure the model is downloaded by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "14f9946fdc5e2025" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" - ], - "metadata": { - "collapsed": false - }, - "id": "e3410ce4d0afa788" - }, - { - "cell_type": "markdown", - "source": [ - "### Prerequisite 3: Install ollama Python SDK" - ], - "metadata": { - "collapsed": false - }, - "id": "d8c0224c3c4ecf4d" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "pip install ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "5539b5251aee5642" - }, - { - "cell_type": "markdown", - "source": [ - "### Prerequisite 4: Install CLDK\n", - "CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "cea573e625257581" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "pip install git+https://github.com/IBM/codellm-devkit.git" - ], - "metadata": { - "collapsed": false - }, - "id": "eeb38b312427329d" - }, - { - "cell_type": "markdown", - "source": [ - "### Step 1: Get the sample Java application\n", - "For this tutorial, we will use apache commons cli. 
You can download the source code to a temporary directory by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "ca7682c71d844b68" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" - ], - "metadata": { - "collapsed": false - }, - "id": "a4d08ca64b9dbccb" - }, - { - "cell_type": "markdown", - "source": [ - "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." - ], - "metadata": { - "collapsed": false - }, - "id": "51d30f3eb726afc0" - }, - { - "cell_type": "markdown", - "source": [ - "### Building a JUnit test generator using CLDK and Granite Code Instruct Model\n", - "Now that we have all the prerequisites installed, let's start building a JUnit test generator using CLDK and the Granite Code Instruct Model." - ], - "metadata": { - "collapsed": false - }, - "id": "98e69eb0bccedfc9" - }, - { - "cell_type": "markdown", - "source": [ - "Generating unit tests for code is a very tedious task and often takes a significant effort from the developers to write good test cases. There are various tools that are available for automated test generation, such as EvoSuite, which uses evolutionary algorithms to generate test cases. However, the test cases that are being generated are not natural and often developers do not prefer to add them to their test suite. Whereas Large Language Models (LLM) being trained with developer-written code it has a better affinity towards generating more natural code--more readable, maintainable code. In this excercise, we will show we can leverage LLMs to generate test cases with the help of CLDK. 
\n", - "\n", - "For simplicity, we will cover certain aspects of test generation and provide some context information to LLM for better quality of test cases. In this exercise, we will generate a unit test for a non-private method from a Java class and provide the focal method body and the signature of all the constructors of the class so that LLM can understand how to create an object of the focal class during the setup phase of the tests. Also, we will ask LLMs to generate ```N``` number of test cases, where ```N``` is the cyclomatic complexity of the focal method. The intuition is that one test may not be sufficient for covering fairly complex methods, and a cyclomatic complexity score can provide some guidance towards that. \n", - "\n", - "(Step 1) First, we will import all the necessary libraries" - ], - "metadata": { - "collapsed": false - }, - "id": "5856baff4aa64ed7" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "import ollama\n", - "from cldk import CLDK\n", - "from cldk.analysis import AnalysisLevel" - ], - "metadata": { - "collapsed": false - }, - "id": "b3d2498ae092fcc" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 2) Second, we will form the prompt for the model, which will include all the constructor signarures, and the body of the focal method." 
- ], - "metadata": { - "collapsed": false - }, - "id": "67eb24b29826d730" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def format_inst(focal_method_body, focal_method, focal_class, constructor_signatures, cyclomatic_complexity, language):\n", - " \"\"\"\n", - " Format the instruction for the given focal method and class.\n", - " \"\"\"\n", - " inst = f\"Question: Can you generate {cyclomatic_complexity} unit tests for the method `{focal_method}` in the class `{focal_class}` below?\\n\"\n", - "\n", - " inst += \"\\n\"\n", - " inst += f\"```{language}\\n\"\n", - " inst += \"```\\n\"\n", - " inst += \"public class {focal_class} {\"\n", - " inst += f\"<|constructors|>\\n{constructor_signatures}\\n<|constructors|>\\n\"\n", - " inst += f\"<|focal method|>\\n {focal_method_body} \\n <|focal method|>\\n\" \n", - " inst += \"}\"\n", - " inst += \"```\\n\"\n", - " inst += \"Answer:\\n\"\n", - " return inst" - ], - "metadata": { - "collapsed": false - }, - "id": "d7bc9bbaa917df24" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 3) Third, use ollama to call LLM (in case Granite 8b)." - ], - "metadata": { - "collapsed": false - }, - "id": "ae9ceb150f5efa92" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def prompt_ollama(message: str, model_id: str = \"granite-code:20b-instruct\") -> str:\n", - " \"\"\"Prompt local model on Ollama\"\"\"\n", - " response_object = ollama.generate(model=model_id, prompt=message)\n", - " return response_object[\"response\"]" - ], - "metadata": { - "collapsed": false - }, - "id": "52634feae7374599" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 4) Fourth, collect all the information needed for each method. In this process, we go through all the classes in the application, and then for each class, we collect the signature of all the constructors. If there is no constructor present, we add the signature of the default constructor. 
Then, we go through all the non-private methods of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call LLM and get the final output." - ], - "metadata": { - "collapsed": false - }, - "id": "308c3325116b87d4" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "# Create a new instance of the CLDK class\n", - "cldk = CLDK(language=\"java\")\n", - "# Create an analysis object over the java application. Provide the application path.\n", - "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", - "# Go through all the classes in the application\n", - "for class_name in analysis.get_classes():\n", - " class_details = analysis.get_class(qualified_class_name=class_name)\n", - " # Generate test cases for non-interface and non-abstract classes\n", - " if not class_details.is_interface and 'abstract' not in class_details.modifiers:\n", - " # Get all constructor signatures\n", - " constructor_signatures = ''\n", - " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", - " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", - " if method_details.is_constructor:\n", - " constructor_signatures += method_details.signature + '\\n'\n", - " # If no constructor present, then add the signature of the default constructor\n", - " if constructor_signatures=='':\n", - " constructor_signatures = f'public {class_name} ()'\n", - " # Go through all the methods in the class\n", - " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", - " # Get the method details\n", - " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", - " # Generate test cases for non-private methods\n", - " if 'private' not in method_details.modifiers and not 
method_details.is_constructor:\n", - " # Gather all the information needed for the prompt, which are focal method body, focal method name, focal class name, constructor signature, and cyclomatic complexity\n", - " prompt = format_inst(focal_method_body=method_details.code,\n", - " focal_method=method,\n", - " focal_class=class_name,\n", - " constructor_signatures=constructor_signatures,\n", - " cyclomatic_complexity=method_details.cyclomatic_complexity)\n", - " # Prompt the local model on Ollama\n", - " llm_output = prompt_ollama(\n", - " message=prompt,\n", - " model_id=\"granite-code:20b-instruct\",\n", - " )\n", - " \n", - " # Print the instruction and LLM output\n", - " print(f\"Instruction:\\n{prompt}\")\n", - " print(f\"LLM Output:\\n{llm_output}\")" - ], - "metadata": { - "collapsed": false - }, - "id": "65c9558e4de65a52" - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/examples/java/notebook/code_summarization.ipynb b/docs/examples/java/notebook/code_summarization.ipynb new file mode 100644 index 0000000..48a3ee2 --- /dev/null +++ b/docs/examples/java/notebook/code_summarization.ipynb @@ -0,0 +1,426 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "59d05bbe28e62687", + "metadata": { + "collapsed": false + }, + "source": [ + "# Using CLDK to explain Java methods\n", + "\n", + "In this tutorial, we will use CLDK to explain or generate code summary for a Java method. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis and build an LLM-based code summarizer. 
By the end of this tutorial, you will have implemented such a tool and generated code summary for a Java method.\n", + "\n", + "Specifically, you will learn how to perform the following tasks on a Java application to create LLM prompts for code summarization:\n", + "\n", + "1. Create a new instance of the CLDK class.\n", + "2. Create an analysis object for the target Java application.\n", + "3. Iterate over all files in the application.\n", + "4. Iterate over all classes in a file.\n", + "5. Initialize treesitter utils for the class content.\n", + "6. Iterate over all methods in a class.\n", + "7. Get the code body of a method.\n", + "8. Sanitize the class for prompting the LLM.\n", + "\n", + "We will write a couple of helper methods to (1) format the LLM instruction for summarizing a given target method and (2) prompt the LLM via Ollama. We will then use CLDK to go through an application and generate the summary for the target method." + ] + }, + { + "cell_type": "markdown", + "id": "92896c8ce12b0e9e", + "metadata": { + "collapsed": false + }, + "source": [ + "## Prerequisites\n", + "\n", + "Before we get started, let's make sure you have the following installed:\n", + "\n", + "1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)\n", + "2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)\n", + "3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))\n", + "\n", + "We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial." + ] + }, + { + "cell_type": "markdown", + "id": "bfeb1e1227191e3b", + "metadata": { + "collapsed": false + }, + "source": [ + "### Download Granite code model\n", + "\n", + "After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command.
There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags))." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "627e7184", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "ollama pull granite-code:8b-instruct" + ] + }, + { + "cell_type": "markdown", + "id": "8cc1ca5b", + "metadata": {}, + "source": [ + "Let's make sure the model is downloaded by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ff900382e86a18e", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%%bash\n", + "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'\\\"" + ] + }, + { + "cell_type": "markdown", + "id": "531205b489bbec73", + "metadata": { + "collapsed": false + }, + "source": [ + "### Install Ollama Python SDK" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2a749932a800c9d", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%pip install ollama" + ] + }, + { + "cell_type": "markdown", + "id": "6f42dbd286b3f7a6", + "metadata": { + "collapsed": false + }, + "source": [ + "### Install CLDK\n", + "CLDK is available at https://github.com/IBM/codellm-devkit.
You can install it by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "327e212f20a489d6", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pip install git+https://github.com/IBM/codellm-devkit.git" + ] + }, + { + "cell_type": "markdown", + "id": "dd8ec5b9c837898f", + "metadata": { + "collapsed": false + }, + "source": [ + "### Step 1: Get the sample Java application\n", + "For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the sample Java application. You can download the source code to a temporary directory by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c196e58b3ce90c34", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%%bash\n", + "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" + ] + }, + { + "cell_type": "markdown", + "id": "44e875e7ce6db504", + "metadata": { + "collapsed": false + }, + "source": [ + "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "6ad70b81e8957fc0", + "metadata": { + "collapsed": false + }, + "source": [ + "## Generate code summary\n", + "\n", + "Code summarization or code explanation is the task of converting code written in a programming language to natural language. It has several benefits, such as understanding code without looking at its intrinsic details, documenting code for better maintenance, etc. To perform code summarization, one needs to understand the basic details of code implementation, and use that knowledge to generate the summary using various AI-based approaches. In this tutorial, we will use LLMs, specifically Granite code 8b-instruct. 
We will show how a developer can easily use CLDK to analyze code by calling various APIs without having to implement such analyses." + ] + }, + { + "cell_type": "markdown", + "id": "15555404790e1411", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 1: Add the necessary imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e8e5de7e5c68020", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import ollama\n", + "from cldk import CLDK\n", + "from cldk.analysis import AnalysisLevel" + ] + }, + { + "cell_type": "markdown", + "id": "ffc4ee9a6d27acc2", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 2: Define a function for creating the LLM prompt, which instructs the LLM to summarize a Java method and includes relevant code for the task." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e23523c71636727", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def format_inst(code, focal_method, focal_class, language):\n", + " \"\"\"\n", + " Format the instruction for the given focal method and class.\n", + " \"\"\"\n", + " inst = f\"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\\n\"\n", + "\n", + " inst += \"\\n\"\n", + " inst += f\"```{language}\\n\"\n", + " inst += code\n", + " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", + " inst += \"\\n\"\n", + " return inst" + ] + }, + { + "cell_type": "markdown", + "id": "a4e9cb4e4f00b25c", + "metadata": { + "collapsed": false + }, + "source": [] + }, + { + "cell_type": "markdown", + "id": "dd8439be222b5caa", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62807e0cbf985ae6", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", + " \"\"\"Prompt local model on Ollama\"\"\"\n", + " response_object = ollama.generate(model=model_id, prompt=message, options={\"temperature\":0.2})\n", + " return response_object[\"response\"]" + ] + }, + { + "cell_type": "markdown", + "id": "1022e86e38e12767", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 4: Create an instance of CLDK and provide the programming language of the source code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2c8bbe4e3244f60", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create an instance of CLDK for Java analysis\n", + "cldk = CLDK(language=\"java\")" + ] + }, + { + "cell_type": "markdown", + "id": "23dd4a6e5d5cb0c5", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 5: Select the analysis engine and analysis level. CLDK uses different analysis engines---[CodeAnalyzer](https://github.com/IBM/codenet-minerva-code-analyzer) (built over [WALA](https://github.com/wala/WALA) and [JavaParser](https://github.com/javaparser/javaparser)), [Treesitter](https://tree-sitter.github.io/tree-sitter/), and [CodeQL](https://codeql.github.com/) (future)---with CodeAnalyzer being the default analysis engine. CLDK supports different analysis levels: (1) symbol table, (2) call graph, (3) program dependency graph, and (4) system dependency graph. The analysis level can be selected using the `AnalysisLevel` enumerated type. For this example, we select the symbol-table analysis level, with CodeAnalyzer as the default analysis engine." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdd09f5e77d4a68a", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create an analysis object for the Java application\n", + "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)" + ] + }, + { + "cell_type": "markdown", + "id": "f148325e92781e13", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a sanitized Java class in the prompt, containing only the relevant information for summarizing the target method. To illustrate, consider the following class:\n", + "\n", + "```java\n", + "package com.ibm.org;\n", + "import A.B.C.D;\n", + "...\n", + "public class Foo {\n", + " // code comment\n", + " public void bar(){ \n", + " int a;\n", + " a = baz();\n", + " // do something\n", + " }\n", + " private int baz()\n", + " {\n", + " // do something\n", + " }\n", + " public String dummy (String a)\n", + " {\n", + " // do something\n", + " } \n", + "}\n", + "```\n", + "Let's say we want to generate a summary for method `bar`. To understand what it does, we add the callees of this method in the prompt, which in this case includes `baz`. We remove the other methods, imports, comments, etc. All of this can be achieved with a single call to CLDK's `sanitize_focal_class` API. In this process, we also use Treesitter to analyze the code. After creating the sanitized code, we call the previously defined `format_inst` method to create the LLM prompt and pass the prompt to `prompt_ollama` to generate the method summary.
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "462ef7dceae367ad", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# For simplicity, we run the code summarization on a single class and method (this filter can be removed to run this code over the entire application)\n", + "target_class = \"org.apache.commons.cli.GnuParser\"\n", + "target_method = \"flatten(Options, String[], boolean)\"\n", + "\n", + "# Iterate over all classes in the application\n", + "for class_name in analysis.get_classes():\n", + " if class_name == target_class:\n", + " class_file_path = analysis.get_java_file(qualified_class_name=class_name)\n", + "\n", + " # Read code for the class\n", + " with open(class_file_path, \"r\") as f:\n", + " code_body = f.read()\n", + "\n", + " # Initialize treesitter utils for the class file content\n", + " tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n", + " \n", + " # Iterate over all methods in class\n", + " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", + " if method == target_method:\n", + " \n", + " # Get all the method details\n", + " method_details = analysis.get_method(qualified_class_name=class_name,\n", + " qualified_method_name=method)\n", + " \n", + " # Sanitize the class for analysis with respect to the target method\n", + " sanitized_class = tree_sitter_utils.sanitize_focal_class(method_details.declaration)\n", + " \n", + " # Format the instruction for the given target method and class\n", + " instruction = format_inst(\n", + " code=sanitized_class,\n", + " focal_method=method_details.declaration,\n", + " focal_class=class_name.split(\".\")[-1],\n", + " language=\"java\"\n", + " )\n", + " \n", + " print(f\"Instruction:\\n{instruction}\\n\")\n", + " print(f\"Generating code summary ...\\n\")\n", + " \n", + " # Prompt the local model on Ollama\n", + " llm_output = prompt_ollama(message=instruction)\n", + " \n", + " # Print the LLM output\n", + " 
print(f\"LLM Output:\\n{llm_output}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/examples/java/notebook/generate_unit_tests.ipynb b/docs/examples/java/notebook/generate_unit_tests.ipynb new file mode 100644 index 0000000..57cb7b0 --- /dev/null +++ b/docs/examples/java/notebook/generate_unit_tests.ipynb @@ -0,0 +1,373 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "428dbbfa206f5417", + "metadata": { + "collapsed": false + }, + "source": [ + "# Using CLDK to generate JUnit tests\n", + "\n", + "In this tutorial, we will use CLDK to implement a simple unit test generator for Java. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis and build an LLM-based test generator. By the end of this tutorial, you will have implemented such a tool and generated a JUnit test case for a Java application.\n", + "\n", + "Specifically, you will learn how to perform the following tasks on the application under test to create LLM prompts for test generation:\n", + "\n", + "1. Create a new instance of the CLDK class.\n", + "2. Create an analysis object for the Java application under test.\n", + "3. Iterate over all files in the application.\n", + "4. Iterate over all classes in a file.\n", + "5. Iterate over all methods in a class.\n", + "6. Get the code body of a method.\n", + "7. Get the constructors of a class.\n", + "\n", + "\n", + "We will write a couple of helper methods to (1) format the LLM instruction for generating test cases for a given focal method (i.e., method under test) and (2) prompt the LLM via Ollama. 
We will then use CLDK to go through an application and generate unit test cases for the target method." + ] + }, + { + "cell_type": "markdown", + "id": "f619a9379b9dd006", + "metadata": { + "collapsed": false + }, + "source": [ + "## Prerequisites\n", + "\n", + "Before we get started, let's make sure you have the following installed:\n", + "\n", + "1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)\n", + "2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)\n", + "3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))\n", + "\n", + "We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial." + ] + }, + { + "cell_type": "markdown", + "id": "3485879a7733bcba", + "metadata": { + "collapsed": false + }, + "source": [ + "### Download Granite code model\n", + "\n", + "After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command. There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags))."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "670f2b23", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "ollama pull granite-code:8b-instruct" + ] + }, + { + "cell_type": "markdown", + "id": "02d5bbfa", + "metadata": {}, + "source": [ + "Let's make sure the model is downloaded by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e3410ce4d0afa788", + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-28T23:49:03.488152Z", + "start_time": "2024-08-28T23:49:03.424389Z" + }, + "collapsed": false + }, + "outputs": [], + "source": [ + "%%bash\n", + "ollama run granite-code:8b-instruct \\"Write a python function to print 'Hello, World!'\\"" + ] + }, + { + "cell_type": "markdown", + "id": "d8c0224c3c4ecf4d", + "metadata": { + "collapsed": false + }, + "source": [ + "### Install Ollama Python SDK" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5539b5251aee5642", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pip install ollama" + ] + }, + { + "cell_type": "markdown", + "id": "cea573e625257581", + "metadata": { + "collapsed": false + }, + "source": [ + "### Install CLDK\n", + "CLDK is available at https://github.com/IBM/codellm-devkit. You can install it by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eeb38b312427329d", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pip install git+https://github.com/IBM/codellm-devkit.git" + ] + }, + { + "cell_type": "markdown", + "id": "ca7682c71d844b68", + "metadata": { + "collapsed": false + }, + "source": [ + "### Get the sample Java application\n", + "For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the Java application under test. 
You can download the source code to a temporary directory by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4d08ca64b9dbccb", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%%bash\n", + "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" + ] + }, + { + "cell_type": "markdown", + "id": "51d30f3eb726afc0", + "metadata": { + "collapsed": false + }, + "source": [ + "The project will be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "98e69eb0bccedfc9", + "metadata": { + "collapsed": false + }, + "source": [ + "## Build a JUnit test generator using CLDK and Granite Code Model\n", + "\n", + "Now that we have all the prerequisites installed, let's start building a JUnit test generator using CLDK and the Granite Code Instruct Model.\n", + "\n", + "Generating unit tests for code is a tedious task, and developers often have to put in significant effort in writing good test cases. There are various tools available for automated test generation, such as EvoSuite, which uses evolutionary algorithms to generate unit test cases for Java. However, the generated test cases are not natural and often developers do not prefer to add them to their test suites. LLMs, having been trained with developer-written code, have a better affinity towards generating more natural code---code that is more readable, comprehensible, and maintainable. In this exercise, we will show how we can leverage LLMs to generate test cases with the help of CLDK. \n", + "\n", + "For simplicity, we will cover certain aspects of test generation and provide some context information to the LLM to help it create usable test cases. 
In this exercise, we will generate a unit test for a non-private method from a Java class and provide the focal method body and the signatures of all the constructors of the class so that the LLM can understand how to create an object of the focal class during the setup phase of the tests.\n", + "\n", + "\n", + "Step 1: Import the required modules." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3d2498ae092fcc", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import ollama\n", + "from cldk import CLDK\n", + "from cldk.analysis import AnalysisLevel" + ] + }, + { + "cell_type": "markdown", + "id": "67eb24b29826d730", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 2: Define a function for creating the LLM prompt, which instructs the LLM to generate unit test cases and includes signatures of relevant constructors and the body of the focal method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7bc9bbaa917df24", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def format_inst(focal_method_body, focal_method, focal_class, constructor_signatures, language):\n", + " \"\"\"\n", + " Format the LLM instruction for the given focal method and class.\n", + " \"\"\"\n", + " inst = f\"Question: Can you generate junit tests with @Test annotation for the method `{focal_method}` in the class `{focal_class}` below. Only generate the test and no description.\\n\"\n", + " inst += 'Use the constructor signatures to form the object if the method is not static. 
Generate the code under ``` code block.'\n", + " inst += \"\\n\"\n", + " inst += f\"```{language}\\n\"\n", + " inst += f\"public class {focal_class} \" + \"{\\n\"\n", + " inst += f\"{constructor_signatures}\\n\"\n", + " inst += f\"{focal_method_body} \\n\" \n", + " inst += \"}\\n\"\n", + " inst += \"```\\n\"\n", + " inst += \"Answer:\\n\"\n", + " return inst" + ] + }, + { + "cell_type": "markdown", + "id": "ae9ceb150f5efa92", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52634feae7374599", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", + " \"\"\"Prompt local model on Ollama\"\"\"\n", + " response_object = ollama.generate(model=model_id, prompt=message, options={\"temperature\":0.2})\n", + " return response_object[\"response\"]" + ] + }, + { + "cell_type": "markdown", + "id": "308c3325116b87d4", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 4: Collect the relevant information for the focal method and prompt the LLM. To do this, we go through all the classes in the application, and for each class, we collect the signatures of its constructors. If a class has no constructors, we add the signature of the default constructor. Then, we go through each non-private method of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call the LLM to generate test cases and get the LLM response."
Provide the application path.\n", + "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", + "\n", + "# For simplicity, we run the test generation on a single focal class and method (this filter can be removed to run this code over the entire application)\n", + "focal_class = \"org.apache.commons.cli.GnuParser\"\n", + "focal_method = \"flatten(Options, String[], boolean)\"\n", + "\n", + "# Go through all the classes in the application\n", + "for class_name in analysis.get_classes():\n", + "\n", + " if class_name == focal_class:\n", + " class_details = analysis.get_class(qualified_class_name=class_name)\n", + " focal_class_name = class_name.split(\".\")[-1]\n", + "\n", + " # Generate test cases for non-interface and non-abstract classes\n", + " if not class_details.is_interface and \"abstract\" not in class_details.modifiers:\n", + " \n", + " # Get all constructor signatures\n", + " constructor_signatures = \"\"\n", + " \n", + " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", + " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", + " \n", + " if method_details.is_constructor:\n", + " constructor_signatures += method_details.signature + '\\n'\n", + " \n", + " # If no constructor present, then add the signature of the default constructor\n", + " if constructor_signatures == \"\":\n", + " constructor_signatures = f\"public {focal_class_name}() \" + \"{}\"\n", + " \n", + " # Go through all the methods in the class\n", + " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", + " \n", + " if method == focal_method:\n", + " # Get the method details\n", + " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", + " \n", + " # Generate test cases for non-private methods\n", + " if \"private\" not in method_details.modifiers and not 
method_details.is_constructor:\n", + " \n", + " # Gather all the information needed for the prompt: focal method body, focal method name, focal class name, and constructor signatures\n", + " prompt = format_inst(\n", + " focal_method_body=method_details.declaration+method_details.code,\n", + " focal_method=method.split(\"(\")[0],\n", + " focal_class=focal_class_name,\n", + " constructor_signatures=constructor_signatures,\n", + " language=\"java\"\n", + " )\n", + " \n", + " # Print the instruction\n", + " print(f\"Instruction:\\n{prompt}\\n\")\n", + " print(f\"Generating test case ...\\n\")\n", + " \n", + " # Prompt the local model on Ollama\n", + " llm_output = prompt_ollama(message=prompt)\n", + " \n", + " # Print the LLM output\n", + " print(f\"LLM Output:\\n{llm_output}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/examples/java/notebook/validating_code_translation.ipynb b/docs/examples/java/notebook/validating_code_translation.ipynb new file mode 100644 index 0000000..9abc5d8 --- /dev/null +++ b/docs/examples/java/notebook/validating_code_translation.ipynb @@ -0,0 +1,353 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "fcac940432e10687", + "metadata": { + "collapsed": false + }, + "source": [ + "# Using CLDK to validate code translation\n", + "\n", + "In this tutorial, we will use CLDK to translate code and check properties of the translated code. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis for this task. 
By the end of this tutorial, you will have implemented a simple Java-to-Python code translator that also performs light-weight property checking on the translated code.\n", + "\n", + "Specifically, you will learn how to perform the following tasks on a Java application to create LLM prompts for code translation and checking the translated code:\n", + "\n", + "1. Create a new instance of the CLDK class.\n", + "2. Create an analysis object for the target Java application.\n", + "3. Iterate over all files in the application.\n", + "4. Iterate over all classes in a file.\n", + "5. Sanitize the class for prompting the LLM.\n", + "6. Create treesitter-based Java and Python analysis objects.\n", + "\n", + "We will write a couple of helper methods to (1) format the LLM instruction for translating a Java class to Python and (2) prompt the LLM via Ollama. We will then use CLDK to analyze code and get context information for translating code and also checking properties of the translated code." + ] + }, + { + "cell_type": "markdown", + "id": "e9411e761b32fcbc", + "metadata": { + "collapsed": false + }, + "source": [ + "## Prerequisites\n", + "\n", + "Before we get started, let's make sure you have the following installed:\n", + "\n", + "1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)\n", + "2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)\n", + "3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))\n", + "\n", + "We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial." + ] + }, + { + "cell_type": "markdown", + "id": "5c7c3ccb", + "metadata": {}, + "source": [ + "### Download Granite code model\n", + "\n", + "After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command. 
There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags))." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db17a05f", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "ollama pull granite-code:8b-instruct" + ] + }, + { + "cell_type": "markdown", + "id": "930b603c7eb3cd55", + "metadata": { + "collapsed": false + }, + "source": [ + "Let's make sure the model is downloaded by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "635bb847107749f8", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%%bash\n", + "ollama run granite-code:8b-instruct \\"Write a python function to print 'Hello, World!'\\"" + ] + }, + { + "cell_type": "markdown", + "id": "a6015cb7728debca", + "metadata": { + "collapsed": false + }, + "source": [ + "### Install Ollama Python SDK" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9dceb297bbab0ab3", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pip install ollama" + ] + }, + { + "cell_type": "markdown", + "id": "e06325ad56287f0b", + "metadata": { + "collapsed": false + }, + "source": [ + "### Install CLDK\n", + "CLDK is available at https://github.com/IBM/codellm-devkit. 
You can install it by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6dc34436d0f2d15", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pip install git+https://github.com/IBM/codellm-devkit.git" + ] + }, + { + "cell_type": "markdown", + "id": "6e4ef425987e53ed", + "metadata": { + "collapsed": false + }, + "source": [ + "### Get the sample Java application\n", + "For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the sample Java application. You can download the source code to a temporary directory by running the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98ddaf361bb8c025", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%%bash\n", + "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" + ] + }, + { + "cell_type": "markdown", + "id": "7a963481d3c7d083", + "metadata": { + "collapsed": false + }, + "source": [ + "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "47af1410ab0a3b4d", + "metadata": { + "collapsed": false + }, + "source": [ + "## Translate Java code to Python and build a light-weight property checker (for translation validation)\n", + "Code translation aims to convert source code from one programming language to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent work, [presented at ICSE'24](https://dl.acm.org/doi/10.1145/3597503.3639226), we found that LLM-based code translation is very promising. 
In this example, we will walk through the steps of translating a Java class to Python and checking various properties of translated code (e.g., number of methods, number of fields, formal arguments, etc.) as a simple form of translation validation.\n", + "\n", + "Step 1: Import the required modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47a78f61a53b2b55", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from cldk.analysis.python.treesitter import PythonSitter\n", + "from cldk.analysis.java.treesitter import JavaSitter\n", + "import ollama\n", + "from cldk import CLDK\n", + "from cldk.analysis import AnalysisLevel" + ] + }, + { + "cell_type": "markdown", + "id": "c6d2f67e1a17cf1", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 2: Define a function for creating the LLM prompt, which instructs the LLM to translate a Java class to Python and includes the body of the Java class after removing all the comments and import statements." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc1ec56e92e90c15", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def format_inst(code, focal_class, language):\n", + " \"\"\"\n", + " Format the instruction for the given focal method and class.\n", + " \"\"\"\n", + " inst = f\"Question: Can you translate the Java class `{focal_class}` below to Python and generate under code block (```)?\\n\"\n", + "\n", + " inst += \"\\n\"\n", + " inst += f\"```{language}\\n\"\n", + " inst += code\n", + " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", + " inst += \"\\n\"\n", + " return inst" + ] + }, + { + "cell_type": "markdown", + "id": "1239041c3315e5e5", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c86224032a6eb70", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", + " \"\"\"Prompt local model on Ollama\"\"\"\n", + " response_object = ollama.generate(model=model_id, prompt=message)\n", + " return response_object[\"response\"]" + ] + }, + { + "cell_type": "markdown", + "id": "518efea0d8c4d307", + "metadata": { + "collapsed": false + }, + "source": [ + "Step 4: Translate a class of the Java application to Python and check for two properties of the translated code: number of translated methods and number of translated fields. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe3be3de6790f7b3", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create an instance of CLDK for Java analysis\n", + "cldk = CLDK(language=\"java\")\n", + "\n", + "# Create an analysis object for the Java application, providing the application path\n", + "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", + "\n", + "# For simplicity, we run the code translation on a single class (this filter can be removed to run this code over the entire application)\n", + "target_class = \"org.apache.commons.cli.GnuParser\"\n", + "\n", + "# Go through all the classes in the application\n", + "for class_name in analysis.get_classes():\n", + " \n", + " if class_name == target_class:\n", + " # Get the location of the Java class\n", + " class_path = analysis.get_java_file(qualified_class_name=class_name)\n", + " \n", + " # Read the file content (skip the class if its file cannot be located)\n", + " if not class_path:\n", + " continue\n", + " with open(class_path, \"r\", encoding=\"utf-8\", errors=\"ignore\") as f:\n", + " class_body = f.read()\n", + " \n", + " # Sanitize the file content by removing comments\n", + " sanitized_class = 
JavaSitter().remove_all_comments(source_code=class_body)\n", + "\n", + " # Create prompt for translating sanitized Java class to Python\n", + " inst = format_inst(code=sanitized_class, language=\"java\", focal_class=class_name.split(\".\")[-1])\n", + "\n", + " print(f\"Instruction:\\n{inst}\\n\")\n", + " print(f\"Translating Java code to Python . . .\\n\")\n", + "\n", + " # Prompt the local model on Ollama\n", + " translated_code = prompt_ollama(message=inst)\n", + " \n", + " # Print translated code\n", + " print(f\"Translated Python code: {translated_code}\\n\")\n", + "\n", + " # Create python sitter instance for analyzing translated Python code\n", + " py_cldk = PythonSitter()\n", + "\n", + " # Compute methods, function, and field counts for translated code\n", + " all_methods = py_cldk.get_all_methods(module=translated_code)\n", + " all_functions = py_cldk.get_all_functions(module=translated_code)\n", + " all_fields = py_cldk.get_all_fields(module=translated_code)\n", + " \n", + " # Check counts against method and field counts for Java code\n", + " if len(all_methods) + len(all_functions) != len(analysis.get_methods_in_class(qualified_class_name=class_name)):\n", + " print(f'Number of translated method not matching in class {class_name}')\n", + " else:\n", + " print(f'Number of translated method in class {class_name} is {len(all_methods)}')\n", + " if all_fields is not None:\n", + " if len(all_fields) != len(analysis.get_class(qualified_class_name=class_name).field_declarations):\n", + " print(f'Number of translated field not matching in class {class_name}')\n", + " else:\n", + " print(f'Number of translated fields in class {class_name} is {len(all_fields)}')\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + 
"nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/examples/java/code_summarization.py b/docs/examples/java/python/code_summarization.py similarity index 100% rename from docs/examples/java/code_summarization.py rename to docs/examples/java/python/code_summarization.py diff --git a/docs/examples/java/validating_code_translation.ipynb b/docs/examples/java/validating_code_translation.ipynb deleted file mode 100644 index 2266cf0..0000000 --- a/docs/examples/java/validating_code_translation.ipynb +++ /dev/null @@ -1,171 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "!pip install ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "3195a8c0612cb428" - }, - { - "cell_type": "markdown", - "source": [ - "Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent paper [https://dl.acm.org/doi/10.1145/3597503.3639226] published at ICSE'24, we found that LLM-based code translation is very promising. 
In this example, we will walk through the steps of translating each Java class to Python and checking various properties of translated code, such as the number of methods, number of fields, formal arguments, etc.\n", - "\n", - "(Step 1) First, we will import all the necessary libraries" - ], - "metadata": { - "collapsed": false - }, - "id": "47af1410ab0a3b4d" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from cldk.analysis.python.treesitter import PythonSitter\n", - "from cldk.analysis.java.treesitter import JavaSitter\n", - "import ollama\n", - "from cldk import CLDK\n", - "from cldk.analysis import AnalysisLevel" - ], - "metadata": { - "collapsed": false - }, - "id": "47a78f61a53b2b55" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 2) Second, we will form the prompt for the model, which will include the body of the Java class after removing all the comments and the import statements." - ], - "metadata": { - "collapsed": false - }, - "id": "c6d2f67e1a17cf1" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def format_inst(code, focal_class, language):\n", - " \"\"\"\n", - " Format the instruction for the given focal method and class.\n", - " \"\"\"\n", - " inst = f\"Question: Can you translate the Java class `{focal_class}` below to Python and generate under code block (```)?\\n\"\n", - "\n", - " inst += \"\\n\"\n", - " inst += f\"```{language}\\n\"\n", - " inst += code\n", - " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", - " inst += \"\\n\"\n", - " return inst" - ], - "metadata": { - "collapsed": false - }, - "id": "dc1ec56e92e90c15" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 3) Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." 
- ], - "metadata": { - "collapsed": false - }, - "id": "1239041c3315e5e5" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", - " \"\"\"Prompt local model on Ollama\"\"\"\n", - " response_object = ollama.generate(model=model_id, prompt=message)\n", - " return response_object[\"response\"]" - ], - "metadata": { - "collapsed": false - }, - "id": "1c86224032a6eb70" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 4) Translate each class in the application (provide the application path as an environment variable, ```JAVA_APP_PATH```) and check certain properties of the translated code, such as (a) number of translated method, and (b) number of translated fields. " - ], - "metadata": { - "collapsed": false - }, - "id": "518efea0d8c4d307" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "# Create a new instance of the CLDK class\n", - "cldk = CLDK(language=\"java\")\n", - "# Create an analysis object over the java application. 
Provide the application path using JAVA_APP_PATH\n", - "analysis = cldk.analysis(project_path=\"JAVA_APP_PATH\", analysis_level=AnalysisLevel.symbol_table)\n", - "# Go through all the classes in the application\n", - "for class_name in analysis.get_classes():\n", - " # Get the location of the Java class\n", - " class_path = analysis.get_java_file(qualified_class_name=class_name)\n", - " # Read the file content\n", - " if not class_path:\n", - " class_body = ''\n", - " with open(class_path, 'r', encoding='utf-8', errors='ignore') as f:\n", - " class_body = f.read()\n", - " # Sanitize the file content by removing comments.\n", - " tree_sitter_utils = cldk.tree_sitter_utils(source_code=class_body)\n", - " sanitized_class = JavaSitter.remove_all_comments(source_code=class_body)\n", - " translated_code = prompt_ollama(\n", - " message=sanitized_class,\n", - " model_id=\"granite-code:20b-instruct\")\n", - " py_cldk = PythonSitter()\n", - " all_methods = py_cldk.get_all_methods(module=translated_code)\n", - " all_functions = py_cldk.get_all_functions(module=translated_code)\n", - " all_fields = py_cldk.get_all_fields(module=translated_code)\n", - " if len(all_methods) + len(all_functions) != len(analysis.get_methods_in_class(qualified_class_name=class_name)):\n", - " print(f'Number of translated method not matching in class {class_name}')\n", - " if len(all_fields) != len(analysis.get_class(qualified_class_name=class_name).field_declarations):\n", - " print(f'Number of translated field not matching in class {class_name}') " - ], - "metadata": { - "collapsed": false - }, - "id": "fe3be3de6790f7b3" - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" - } - }, - 
"nbformat": 4, - "nbformat_minor": 5 -}
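
Note for reviewers: the prompt-construction helper that both the deleted and the new notebooks rely on can be exercised standalone, without CLDK or Ollama installed. The sketch below reproduces `format_inst` as it appears in the diff; the sample Java class body and the `GnuParser` focal-class argument are illustrative stand-ins for a real sanitized class.

```python
def format_inst(code: str, focal_class: str, language: str) -> str:
    """Format the translation instruction for the given focal class."""
    inst = (
        f"Question: Can you translate the Java class `{focal_class}` below "
        "to Python and generate under code block (```)?\n\n"
    )
    inst += f"```{language}\n"
    inst += code
    # Close the fence, adding a newline first only if the code lacks one
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst

# Hypothetical sample input; any sanitized Java class body works here.
sample = "public class GnuParser {\n}\n"
prompt = format_inst(code=sample, focal_class="GnuParser", language="java")
print(prompt)
```

The fenced block in the prompt is what lets the caller later extract the model's translated code between ``` markers.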