what is the relation between target and cwe in dataset? #7

Closed
Vela-zz opened this issue Nov 29, 2023 · 3 comments

Vela-zz commented Nov 29, 2023

I think target means 'label', where 1 corresponds to 'vulnerable'.
In the dataset, some entries with target = 1 do not have a CWE label:

  {'func': 'static int vfat_ioctl32(unsigned fd, unsigned cmd, unsigned long arg)\n{\n\tstruct compat_dirent __user *p = compat_ptr(arg);\n\tint ret;\n\tmm_segment_t oldfs = get_fs();\n\tstruct dirent d[2];\n\n\tswitch(cmd)\n\t{\n        \tcase VFAT_IOCTL_READDIR_BOTH32:\n                \tcmd = VFAT_IOCTL_READDIR_BOTH;\n                \tbreak;\n        \tcase VFAT_IOCTL_READDIR_SHORT32:\n                \tcmd = VFAT_IOCTL_READDIR_SHORT;\n                \tbreak;\n\t}\n\n\tset_fs(KERNEL_DS);\n\tret = sys_ioctl(fd,cmd,(unsigned long)&d);\n\tset_fs(oldfs);\n\tif (ret >= 0) {\n\t\tret |= put_dirent32(&d[0], p);\n\t\tret |= put_dirent32(&d[1], p + 1);\n\t}\n\treturn ret;\n}',
   'target': 1,
   'cwe': [],
   'project': 'linux-2.6',
   'commit_id': '188f83dfe0eeecd1427d0d255cc97dbf7ef6b4b7',
   'hash': 1.1189872415117941e+38,
   'size': 26,
   'message': "[PATCH] BLOCK: Move the msdos device ioctl compat stuff to the msdos driver [try #6]\n\nMove the msdos device ioctl compat stuff from fs/compat_ioctl.c to the msdos\ndriver so that the msdos header file doesn't need to be included.\n\nSigned-Off-By: David Howells <dhowells@redhat.com>\nSigned-off-by: Jens Axboe <axboe@kernel.dk>"}

But other entries labeled 0 do have a CWE ID:

{'func': 'static int bget_one(handle_t *handle, struct buffer_head *bh)\n{\n\tget_bh(bh);\n\treturn 0;\n}',
 'target': 0,
 'cwe': ['CWE-399'],
 'project': 'linux-2.6',
 'commit_id': '06a279d636734da32bb62dd2f7b0ade666f65d7c',
 'hash': 2.2296779682414648e+38,
 'size': 5,
 'message': 'ext4: only use i_size_high for regular files\n\nDirectories are not allowed to be bigger than 2GB, so don\'t use\ni_size_high for anything other than regular files.  E2fsck should\ncomplain about these inodes, but the simplest thing to do for the\nkernel is to only use i_size_high for regular files.\n\nThis prevents an intentially corrupted filesystem from causing the\nkernel to burn a huge amount of CPU and issuing error messages such\nas:\n\nEXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max\n\nThanks to David Maciejak from Fortinet\'s FortiGuard Global Security\nResearch Team for reporting this issue.\n\nhttp://bugzilla.kernel.org/show_bug.cgi?id=12375\n\nSigned-off-by: "Theodore Ts\'o" <tytso@mit.edu>\nCc: stable@kernel.org'}

Another question: did your group do any preprocessing on the code before training, such as removing comments, stripping trailing whitespace, or removing blank lines between code sections?

habaneraa commented Dec 13, 2023

According to my understanding: both vulnerable and non-vulnerable functions come from vulnerability-fixing commits, and those commits can carry CWE information because they are mined from vulnerability datasets. A non-vulnerable (target == 0) function in the dataset may be extracted from (1) the fixed version of a vulnerable function, or (2) another function in a file touched by the fix that the fixing commit itself did not change. So in summary, there is no direct relation between "target" and the cve/cwe information: a target == 0 function can still carry a CWE ID from its commit. The cwe information is only meaningful for functions with target == 1.
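
One way to check this on a local copy is to cross-tabulate target against whether cwe is non-empty. This is only a minimal sketch, assuming each record is a dict with the keys shown in the examples above. The helper names cross_tab and load_jsonl are made up for illustration, and reading the file as one JSON object per line is an assumption about its layout.

```python
import json
from collections import Counter


def cross_tab(records):
    """Count records by (target, has_cwe), where has_cwe means a non-empty cwe list."""
    counts = Counter()
    for rec in records:
        has_cwe = bool(rec.get("cwe"))
        counts[(rec["target"], has_cwe)] += 1
    return counts


def load_jsonl(path):
    """Yield one record per non-empty line (assumed JSON-lines layout)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Two records mirroring the examples quoted in this issue.
sample = [
    {"target": 1, "cwe": []},
    {"target": 0, "cwe": ["CWE-399"]},
]
counts = cross_tab(sample)
for (target, has_cwe), n in sorted(counts.items()):
    print(f"target={target} has_cwe={has_cwe} count={n}")
```

If all four combinations show up with non-trivial counts on the full dataset, that would be consistent with cwe being attached per commit rather than per function.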

It appears that DiverseVul was not preprocessed. I tried applying a CST parser (specifically tree-sitter) to the functions in DiverseVul and got lots of errors. Some functions seem to be truncated, and a few functions end with a "\r" (carriage return) character, which does not make sense ("\n" or "\r\n" is generally expected). So I tend to believe that this dataset presents the raw source code without any preprocessing.
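
For anyone who wants to clean this up before training, here is a rough sketch of the kind of checks I mean. It is not what the dataset authors did; the function names are hypothetical, and the brace count is only a crude heuristic for truncation because it ignores braces inside strings and comments.

```python
def normalize_newlines(func):
    """Convert "\\r\\n" and bare "\\r" line endings to "\\n"."""
    return func.replace("\r\n", "\n").replace("\r", "\n")


def has_trailing_bare_cr(func):
    """True if the text ends with a carriage return (no line feed after it)."""
    return func.endswith("\r")


def brace_balance(func):
    """Opening minus closing braces; a positive value hints at a truncated body."""
    return func.count("{") - func.count("}")


example = "static int f(void)\r\n{\r\n\treturn 0;\r"
print(has_trailing_bare_cr(example))
print(brace_balance(example))
print(repr(normalize_newlines(example)))
```

A parser such as tree-sitter gives a much more reliable signal than the brace count, but these checks are cheap enough to run over the whole dataset first.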

Vela-zz commented Dec 23, 2023

Thanks for your answer!

Vela-zz closed this as completed Dec 27, 2023

surrealyz (Collaborator) commented

This is correct. Thank you!
