WaDec is an approach that leverages a fine-tuned LLM to decompile Wasm binary code into more comprehensible source code. Training the model on a specialized wat-c dataset, combined with self-supervised learning, has proven pivotal to decompilation efficacy. Our results show that WaDec significantly outperforms existing tools, achieving a minimal code inflation rate while maintaining high recompilability and re-execution rates. This advancement not only improves the readability and analyzability of Wasm code but also paves the way for more robust automated code analysis, optimization, and security auditing.
Our dataset is specifically designed for decompiling WebAssembly (Wasm). It includes 100k+ pairs of WebAssembly Text (Wat) snippets and C code snippets at the loop level, providing a finer granularity than function-level datasets. The dataset has been uploaded to Hugging Face and is available at https://huggingface.co/datasets/wadecc/FGW2C. The main features of the dataset are as follows:
- Wat snippet: Segmented based on loop blocks.
- C snippet: Segmented based on loop blocks, corresponding to the Wat snippet.
- Spatial info: Function declarations for called functions.
- Temporal info: Local variables already defined before the current snippet.
- Offset2string: Mapping from offsets to string constants.
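To make the record layout concrete, here is a minimal sketch of assembling a decompilation prompt from one record. The field names and the prompt format here are illustrative assumptions, not the dataset's exact schema; see the Hugging Face dataset card for the authoritative field names.

```python
# Hypothetical record mirroring the five features listed above
# (illustrative field names; consult the dataset card for the real schema).
record = {
    "wat_snippet": "(loop $L0 (br_if $L0 (i32.lt_s (local.get 0) (local.get 1))))",
    "c_snippet": "while (i < n) { i++; }",
    "spatial_info": "int helper(int x);",          # declarations of called functions
    "temporal_info": "int i; int n;",              # locals defined before this snippet
    "offset2string": {"1024": "hello"},            # offset -> string constant mapping
}

def build_prompt(rec: dict) -> str:
    """Combine a record's context fields with its Wat snippet into one prompt."""
    return (
        f"// Called functions:\n{rec['spatial_info']}\n"
        f"// Variables defined earlier:\n{rec['temporal_info']}\n"
        f"// String constants: {rec['offset2string']}\n"
        f"// Wat to decompile:\n{rec['wat_snippet']}"
    )
```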
Ensure you have the following prerequisites installed:
Clone the repository to your local machine:

```shell
git clone https://anonymous.4open.science/r/WaDec-EDDE
cd WaDec-EDDE
```
Our fine-tuned LLM has been uploaded to Hugging Face and can be accessed at https://huggingface.co/wadecc/Wat2c.
For inference, please run infering.py:

```shell
python infering.py \
    --base_model wadecc/Wat2c \
    --wat_path {wat_path} \
    --dst_path {output_path} \
    --invoke {invoked_functions}
```
In Section 5.2 of our paper, we discuss one of the external threats to validity, i.e., the limitations of CodeBLEU, which is detailed in -> Limitations of CodeBLEU
For calculating CodeBLEU scores, please run cal_codebleu.py:

```shell
python cal_codebleu.py \
    --reference {source_c} \
    --prediction {decompiled_c} \
    --lang c
```
In addition, we evaluate our method using further metrics: AST edit distance similarity, cosine similarity, cyclomatic complexity similarity, and code bloat rate. To compute these metrics, please run eval.ipynb.
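As a rough illustration of two of these metrics, the sketch below computes a bag-of-words cosine similarity and a line-count code bloat rate. These are simplified stand-ins for exposition only; the exact definitions used in eval.ipynb (e.g., AST-aware tokenization) may differ.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over whitespace-token counts (a simplification)."""
    ta, tb = Counter(a.split()), Counter(b.split())
    dot = sum(ta[t] * tb[t] for t in set(ta) & set(tb))
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def code_bloat_rate(source: str, decompiled: str) -> float:
    """Ratio of decompiled lines to source lines; values > 1 indicate inflation."""
    return len(decompiled.splitlines()) / len(source.splitlines())
```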
We thank all the reviewers for the valuable feedback!