
Commit e7377dd (parent 4be4045)

readme update with results

1 file changed: +95 −4 lines

README.md (+95 −4)
@@ -18,7 +18,8 @@ Implementation of Devign Model in Python with code for processing the dataset an
 * [Create Task](#create-task)
 * [Embed Task](#embed-task)
 * [Process Task](#process-task)
-* [Roadmap](#Roadmap)
+* [Results](#results)
+* [Roadmap](#roadmap)
 * [License](#license)
 * [Contact](#contact)
 * [Acknowledgements](#acknowledgements)
@@ -59,6 +60,7 @@ That can be done by changing the ```"slice_size"``` value under ```"create"``` i
 needs to match ```"in_channels"```, under ```"devign" -> "model" -> "conv_args" -> "conv1d_1"```.
 * The embedding size is equal to Word2Vec vector size plus 1.
 * When executing the **Create** task, a directory named ```joern``` is created and deleted automatically under ```'project'\data\```.
+* The dataset split for modeling during the **Process** task is done in ```src/data/datamanger.py```. The sets are balanced, and the train/val/test ratios are 0.8/0.1/0.1 respectively.
 ### Setup
 
 ---
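The balanced 0.8/0.1/0.1 split described in the added bullet can be sketched roughly as follows. This is a minimal illustration, not the actual code in ```src/data/datamanger.py```; the function name and the `(sample, label)` data layout here are assumptions for the sketch.

```python
import random

def balanced_splits(rows, ratios=(0.8, 0.1, 0.1), seed=0):
    """Downsample each class to the smallest class size, shuffle,
    then cut into train/val/test by the given ratios."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:                     # rows are (sample, label) pairs
        by_label.setdefault(row[1], []).append(row)
    n = min(len(group) for group in by_label.values())  # smallest class size
    balanced = [r for group in by_label.values() for r in rng.sample(group, n)]
    rng.shuffle(balanced)
    n_train = int(ratios[0] * len(balanced))
    n_val = int(ratios[1] * len(balanced))
    return (balanced[:n_train],
            balanced[n_train:n_train + n_val],
            balanced[n_train + n_val:])
```

With 60 negative and 40 positive entries, this yields an 80-entry balanced set split 64/8/8.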
@@ -163,8 +165,8 @@ The dataset used is the [partial dataset](https://sites.google.com/view/devign)
 The dataset is handled with Pandas and the file ```src/data/datamanger.py``` contains wrapper functions for the most essential operations.
 <br/>
 <br/>
-A small sample from the original dataset is available for testing purposes.
-The sample dataset contains functions from the **FFmpeg** project with maximum of 287 nodes per function.
+A small sample of 994 entries from the original dataset is available for testing purposes.
+The sample dataset contains functions from the **FFmpeg** project with a maximum of 287 nodes per function.
 For each task, the necessary dataset files are available under the respective folders.
 <br/>
 <br/>
@@ -229,11 +231,53 @@ for the initial embeddings. The nodes embeddings are done as explained in the pa
 Execute with:
 ``` console
 $ python main.py -e
+
+```
+
+##### Tokenization example
+Source code:
+```
+static void v4l2_free_buffer(void *opaque, uint8_t *unused)
+{
+    V4L2Buffer* avbuf = opaque;
+    V4L2m2mContext *s = buf_to_m2mctx(avbuf);
+
+    if (atomic_fetch_sub(&avbuf->context_refcount, 1) == 1) {
+        atomic_fetch_sub_explicit(&s->refcount, 1, memory_order_acq_rel);
+
+        if (s->reinit) {
+            if (!atomic_load(&s->refcount))
+                sem_post(&s->refsync);
+        } else if (avbuf->context->streamon)
+            ff_v4l2_buffer_enqueue(avbuf);
+
+        av_buffer_unref(&avbuf->context_ref);
+    }
+}
 ```
+Tokens:
+['static', 'void', 'FUN1', '(', 'void', '*', 'VAR1', ',', 'uint8_t', '*', 'VAR2)', '{', 'VAR3', '*', 'VAR4', '=', 'VAR1', ';', 'V4L2m2mContext', '*', 'VAR5', '=', 'FUN2', '(', 'VAR4)', ';', 'if', '(', 'FUN3', '(', '&', 'VAR4', '-', '>', 'VAR6', ',', '1)', '==', '1)', '{', 'FUN4', '(', '&', 'VAR5', '-', '>', 'VAR7', ',', '1', ',', 'VAR8)', ';', 'if', '(', 'VAR5', '-', '>', 'VAR9)', '{', 'if', '(', '!', 'FUN5', '(', '&', 'VAR5', '-', '>', 'VAR7))', 'FUN6', '(', '&', 'VAR5', '-', '>', 'VAR10)', ';', '}', 'else', 'if', '(', 'VAR4', '-', '>', 'VAR11', '-', '>', 'VAR12)', 'FUN7', '(', 'VAR4)', ';', 'FUN8', '(', '&', 'VAR4', '-', '>', 'VAR13)', ';', '}', '}']
 
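The identifier normalization visible in the token list above (user-defined functions become ```FUNk```, variables become ```VARk```) can be sketched as follows. This is an illustrative approximation, not the project's actual tokenizer, which differs in details (for instance, it keeps some type names such as `V4L2m2mContext` and glues punctuation onto neighboring tokens).

```python
import re

# Minimal keyword/type whitelist for the sketch; the real tokenizer's
# list of preserved words is not shown in this README.
C_KEYWORDS = {"static", "void", "if", "else", "return", "int", "char"}

def normalize_tokens(code, keywords=C_KEYWORDS):
    """Tokenize C-like code and replace user identifiers with
    FUNk (followed by '(') or VARk placeholders, keeping keywords."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    mapping, out = {}, []
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in keywords:
            if tok not in mapping:
                is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
                prefix = "FUN" if is_call else "VAR"
                count = sum(1 for v in mapping.values() if v.startswith(prefix))
                mapping[tok] = f"{prefix}{count + 1}"
            out.append(mapping[tok])
        else:
            out.append(tok)
    return out
```

For example, `normalize_tokens("static void free_buf(void *opaque) { int x = opaque; }")` maps `free_buf` to `FUN1`, `opaque` to `VAR1`, and `x` to `VAR2`, reusing a name's first mapping on every later occurrence.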
 #### Process Task
 In this task the previously transformed dataset is split into train, validation and test sets which are
-used to train an evaluate the model.
+used to train and evaluate the model. The accuracy reported in the training output is **softmax accuracy**.
 
 Execute with:
 ``` console
@@ -246,6 +290,53 @@ Enable EarlyStopping for training with:
 $ python main.py -pS
 ```
 
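A patience-based early-stopping rule like the one toggled by ```-pS``` can be sketched as below. This is a generic illustration under the assumption of a "no validation-loss improvement for `patience` epochs" criterion; the class name and exact stopping condition are not taken from the project's code.

```python
class EarlyStopping:
    """Signal a stop when the monitored validation loss has not
    improved for `patience` consecutive epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With `patience=2`, losses `1.0, 0.9, 0.95, 0.96` trigger the stop on the fourth epoch, since epochs three and four both fail to beat `0.9`.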
+## Results
+Train/Val/Test ratios: 0.8/0.1/0.1
+
+Example results of training with early stopping on the sample dataset.
+The last model checkpoint was at 5 epochs.
+
+Parameters used:
+- "learning_rate" : 1e-4
+- "weight_decay" : 1.3e-6
+- "loss_lambda" : 1.3e-6
+- "epochs" : 100
+- "patience" : 10
+- "batch_size" : 8
+- "dataset_ratio" : 1 (total entries)
+- "shuffle" : false
+
+True Pos.: 37, False Pos.: 27, True Neg.: 22, False Neg.: 15
+- Accuracy: 0.5841584158415841
+- Precision: 0.578125
+- Recall: 0.7115384615384616
+- F-measure: 0.6379310344827586
+- Precision-Recall AUC: 0.5388430220841324
+- AUC: 0.5569073783359497
+- MCC: 0.166507096257419
+
+Example results of training without early stopping on the sample dataset.
+
+Parameters used:
+- "learning_rate" : 1e-4
+- "weight_decay" : 1.3e-6
+- "loss_lambda" : 1.3e-6
+- "epochs" : 30
+- "patience" : 10
+- "batch_size" : 8
+- "dataset_ratio" : 1 (total entries)
+- "shuffle" : false
+
+True Pos.: 38, False Pos.: 34, True Neg.: 15, False Neg.: 14
+- Accuracy: 0.5247524752475248
+- Precision: 0.5277777777777778
+- Recall: 0.7307692307692307
+- F-measure: 0.6129032258064515
+- Precision-Recall AUC: 0.5592493611149129
+- AUC: 0.5429748822605965
+- MCC: 0.04075331061223071
+- Error: 53.56002758457897
+
 
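The scalar metrics in the Results section follow the standard confusion-matrix definitions and can be re-derived from the reported TP/FP/TN/FN counts. The sketch below is a sanity check only; `metrics` is a hypothetical helper, not part of the repository.

```python
import math

def metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        # harmonic mean of precision and recall
        "f_measure": 2 * precision * recall / (precision + recall),
        # Matthews correlation coefficient
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

Plugging in the first run's counts (37, 27, 22, 15) reproduces the reported accuracy 0.5841..., precision 0.578125, recall 0.7115..., F-measure 0.6379..., and MCC 0.1665...; the AUC values additionally require the model's raw scores, which are not listed here.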
 ## Roadmap
 
 See the [open issues](https://github.com/epicosy/devign/issues) for a list of proposed features (and known issues).
