02. Project Architecture

Overview

RepoAudit is a multi-agent framework. Each agent targets a specific code auditing task, such as data-flow bug detection, program slicing, or general bug detection. An agent can utilize parsing-based analyzers (e.g., the interfaces of src/tstool/analyzer/TS_analyzer.py), LLM-driven analyzers (e.g., the LLM-based tools in src/llmtool), and even other agents.

When scanning a repository, we first initialize a parsing-based analyzer (i.e., an instance of TSAnalyzer) for code indexing and then invoke a specific agent for scanning. The outputs of the parsing-based analyzer, the intermediate states of the agent, and the final results (such as bug reports) are all maintained in the memory of RepoAudit.

The pipeline of RepoAudit is as follows:


                         +-------------+     +---------+      
                         |    TSTool   |     | LLMTool |
                         +-------------+     +---------+
                            ↑        ↓            ↓
            +----------------+     +-------------------+     +------------------+
Code   →    |   TSAnalyzer   |  →  |      Agents       |  →  |      Reports     |
            | (AST Parsing)  |     |(Semantic Analysis)|     |  (Final Results) |
            +----------------+     +-------------------+     +------------------+
                    ↓                     ↑    ↓                       ↓
        +--------------------------------------------------------------------------+  
        |   [Syntactic Memory]       [Semantic Memory]         [Report Memory]     |
Memory  |     Syntactic Info        Semantic Properties          Scan Report       |
        |  (Value/Function/API)        (Agent State)          (Bug/Debug results)  |
        +--------------------------------------------------------------------------+
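
The pipeline above can be sketched in a few lines of Python. This is only an illustration of the control flow, not RepoAudit's actual code: Memory, ToyAnalyzer, ToyAgent, and run_pipeline are hypothetical names, and the "analysis" is a plain substring check standing in for AST parsing and LLM-driven reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical container mirroring the three memory regions in the diagram."""
    syntactic: dict = field(default_factory=dict)  # function name -> source text
    semantic: dict = field(default_factory=dict)   # agent name -> intermediate state
    reports: list = field(default_factory=list)    # final findings


class ToyAnalyzer:
    """Stand-in for TSAnalyzer: records functions instead of parsing an AST."""

    def index(self, functions: dict, memory: Memory) -> None:
        memory.syntactic.update(functions)


class ToyAgent:
    """Stand-in for an agent: flags functions whose body mentions NULL."""
    name = "toy-agent"

    def scan(self, memory: Memory) -> None:
        suspicious = [f for f, body in memory.syntactic.items() if "NULL" in body]
        memory.semantic[self.name] = {"suspicious": suspicious}
        for func in suspicious:
            memory.reports.append({"function": func, "kind": "possible NPD source"})


def run_pipeline(functions: dict) -> Memory:
    memory = Memory()
    ToyAnalyzer().index(functions, memory)  # step 1: code indexing
    ToyAgent().scan(memory)                 # step 2: agent-driven scan
    return memory
```

Calling run_pipeline({"f": "return NULL;", "g": "return 1;"}) fills all three regions of the memory object in the order shown in the diagram.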

Core Components

TSAnalyzer: Parsing-based Analysis

RepoAudit leverages tree-sitter to derive the abstract syntax tree (AST) of the repository code. Specifically, it extracts the basic constructs of each function, including critical values (e.g., parameters, arguments, output values, and return values), branches (e.g., if-statements), and loops (e.g., for-loops and while-loops). Based on the derived constructs, it further builds a call graph (matching calls to functions by name and by the number of parameters/arguments) and performs control-flow order analysis and CFL-reachability analysis. Notably, such parsing-based analysis only approximates semantic properties, especially caller-callee relationships: it may be neither sound nor complete in cases involving class hierarchies and function pointers.
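
To make the name-and-arity matching concrete, here is a minimal sketch of such a call-graph construction. FunctionDef, CallSite, and build_call_graph are hypothetical names introduced for illustration; the real implementation works on tree-sitter nodes rather than on these simplified records.

```python
from collections import defaultdict
from typing import NamedTuple


class FunctionDef(NamedTuple):
    name: str
    num_params: int


class CallSite(NamedTuple):
    caller: str
    callee_name: str
    num_args: int


def build_call_graph(defs, calls):
    """Map each caller to the names of definitions matching a call by name and arity."""
    by_signature = defaultdict(list)
    for d in defs:
        by_signature[(d.name, d.num_params)].append(d)

    graph = defaultdict(set)
    for c in calls:
        # A call is linked to every definition with the same name and argument count.
        for d in by_signature.get((c.callee_name, c.num_args), []):
            graph[c.caller].add(d.name)
    return dict(graph)
```

A call through a function pointer, such as fp(x), finds no definition named fp and is silently dropped, which is one way the approximation becomes incomplete. Conversely, two unrelated methods with the same name and arity in different classes would both be linked, which makes it imprecise.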

The above functionalities are supported by different sub-classes of TSAnalyzer, targeting different programming languages. The source code is located in the directory src/tstool/analyzer.

Agent

An agent is a component targeting a specific code auditing task, such as program slicing, debugging, bug detection, or program repair. Currently, RepoAudit only targets the bug detection task, but it can be easily extended to support other tasks. Notably, as a multi-agent framework, an agent in RepoAudit can leverage the results of other agents. The file src/agent/agent.py defines the base class Agent, which has the following sub-classes focusing on concrete tasks:

MetaScanAgent

MetaScanAgent is a simple agent for demonstration purposes. It wraps the parsing-based analyzer without additional symbolic or neural analysis.

DFBScanAgent

DFBScanAgent is our current open-sourced agent for data-flow bug detection. It implements the analysis workflow presented in this paper. Our implemented version can support the detection of the following bug types in different programming languages.

Bug Type                  C    C++  Java  Python  Go
Null Pointer Dereference  ✓    ✓    ✓     ✓       ✓
Memory Leak               ✓    ✓
Use After Free            ✓    ✓

TSTool: Parsing-based Tools

To support a specific agent, we currently offer several additional parsing-based tools for different bug types in different programming languages. For example, in the detection of Null Pointer Dereference (NPD) in C++ programs, we need to identify the source values (i.e., potential NULL values). Utilizing the interfaces offered by TSAnalyzer, we create Cpp_NPD_Extractor for the extraction of NULL values.

You can also follow the definition of Cpp_NPD_Extractor when you define your own agent and the relevant parsing-based tools. We will also integrate a synthesis agent into RepoAudit to synthesize specific parsing-based tools, such as source extractors, by following the design in our previous work LLMDFA.
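
The idea of a source extractor can be sketched as a tree walk that collects NULL-like literals. The sketch below does not use tree-sitter or the actual Cpp_NPD_Extractor interface: Node is a toy syntax-tree type, and the node type names in NULL_NODE_TYPES are assumptions that may differ from those of a real grammar.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Toy syntax-tree node; the fields loosely mirror a parser's node object."""
    type: str
    text: str
    line: int
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


# Assumed node types for NULL-like literals; real grammars may name them differently.
NULL_NODE_TYPES = {"null", "nullptr"}


def extract_null_sources(root: Node) -> List[Node]:
    """Collect nodes that may introduce a NULL value (the NPD sources)."""
    return [n for n in walk(root) if n.type in NULL_NODE_TYPES]
```

An extractor for another bug type would keep the same traversal and change only the predicate that decides which nodes count as sources.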

LLMTool: LLM-driven Tools

LLM-driven tools enable semantic analysis of source code without compilation. Like traditional IR-based program analyzers, these tools derive program facts or transform source code for further analysis, playing a role comparable to that of LLVM passes in LLVM-based C/C++ analyzers.

As shown in the file src/llmtool/LLM_tool.py, which contains the base class LLMTool, an instance of an LLM-driven tool receives and returns a specific form of input and output objects, respectively. When defining an LLM-driven tool, i.e., a sub-class of LLMTool, we also need to define the corresponding sub-classes of LLMToolInput and LLMToolOutput. In addition, we have to provide the corresponding prompting template in the directory src/prompt.
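
The input/output/template structure described above might look roughly like the following. This is a sketch under assumptions, not the interface in src/llmtool/LLM_tool.py: ToolInput, ToolOutput, Tool, MayReturnNull, and their methods are hypothetical, the prompt is inlined instead of being loaded from src/prompt, and the model is any callable that maps a prompt string to a response string.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolInput:
    """Hypothetical counterpart of LLMToolInput."""
    function_code: str


@dataclass
class ToolOutput:
    """Hypothetical counterpart of LLMToolOutput."""
    answer: bool


class Tool(ABC):
    """Hypothetical counterpart of LLMTool: build a prompt, query, parse."""

    def __init__(self, model: Callable[[str], str]):
        self.model = model

    @abstractmethod
    def build_prompt(self, tool_input: ToolInput) -> str: ...

    @abstractmethod
    def parse_response(self, response: str) -> ToolOutput: ...

    def invoke(self, tool_input: ToolInput) -> ToolOutput:
        response = self.model(self.build_prompt(tool_input))
        return self.parse_response(response)


class MayReturnNull(Tool):
    """Example tool asking whether a function can return NULL."""
    TEMPLATE = "Can this function return NULL? Answer yes or no.\n{code}"

    def build_prompt(self, tool_input: ToolInput) -> str:
        return self.TEMPLATE.format(code=tool_input.function_code)

    def parse_response(self, response: str) -> ToolOutput:
        return ToolOutput(answer=response.strip().lower().startswith("yes"))
```

Because the model is passed in as a plain callable, a tool defined this way can be exercised with a canned response function before it is connected to a real LLM.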

Consider the LLM-driven tools used by DFBScanAgent. We include two LLM-driven tools in the directory src/llmtool/dfbscan.

IntraDataFlowAnalyzer derives the data-flow facts along different program paths within a single function. It corresponds to the explorer in the paper.
PathValidator validates the feasibility of a program path. It corresponds to the validator in the paper.
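
One way to picture how the two tools cooperate is an explore-then-validate filter. The sketch below is an assumption about the overall shape only: feasible_paths is a hypothetical helper, a path is modelled as a list of fact strings, and the two tools are represented by plain callables rather than by the real IntraDataFlowAnalyzer and PathValidator classes.

```python
from typing import Callable, List

Path = List[str]  # a path is modelled as a list of human-readable data-flow facts


def feasible_paths(
    function_code: str,
    explore: Callable[[str], List[Path]],
    validate: Callable[[Path], bool],
) -> List[Path]:
    """Keep only the explored paths that the validator accepts."""
    return [path for path in explore(function_code) if validate(path)]
```

Keeping exploration and validation separate means that a path rejected as infeasible never contributes to a bug report, regardless of which data-flow facts were derived along it.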

Memory

As a multi-agent framework for code auditing, RepoAudit contains three kinds of memory, which are implemented in the directory src/memory.

Syntactic Memory

Syntactic memory maintains critical constructs for code auditing. RepoAudit mainly focuses on the program values in different functions. Utilizing TSAnalyzer, it stores the Function, API, and Value info in the syntactic memory. These three constructs are then retrieved by agents when the agents invoke the LLM-driven tools.

In the future, we may need to extend the syntactic memory and maintain more expressive compilation-independent IR constructs.

Semantic Memory

Semantic memory maintains the intermediate states of agents. For each agent, we define a corresponding state as a sub-class of State. For example, DFBScanState stores the data-flow facts along different paths, together with the relevant parameters, return values, arguments, and output values. Based on the semantic memory, the agents finally compute their outputs and produce their reports.
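
As a rough illustration of a per-agent state, the sketch below records which values a source value was observed to reach. ToyState and ToyDFBState are hypothetical classes and do not reproduce the fields of State or DFBScanState; a value is simplified to a (name, line number) pair.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Tuple

Value = Tuple[str, int]  # (name, line number): a simplified program value


@dataclass
class ToyState:
    """Hypothetical base class playing the role of State."""
    agent_name: str


@dataclass
class ToyDFBState(ToyState):
    """Records, for each value, the values it was observed to flow to."""
    reaches: DefaultDict[Value, List[Value]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add_fact(self, src: Value, dst: Value) -> None:
        # Ignore duplicate facts so repeated scans do not grow the state.
        if dst not in self.reaches[src]:
            self.reaches[src].append(dst)

    def reachable_from(self, src: Value) -> List[Value]:
        """Transitively follow recorded facts starting at src."""
        seen, stack = [], [src]
        while stack:
            for nxt in self.reaches.get(stack.pop(), []):
                if nxt not in seen:
                    seen.append(nxt)
                    stack.append(nxt)
        return seen
```

An agent could fill such a state one function at a time and only afterwards chain the facts across functions to decide whether a source reaches a sink.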

Report Memory

Report memory maintains the final results of the agents. Currently, it holds bug reports and debug reports, which are the outputs of the end-user agents, including DFBScanAgent. Since MetaScanAgent does not compute additional program facts, we do not define a specific report format for it.
