Xref paths
The shares of the different function-to-string cross-reference path types were measured to better understand and balance the datasets.
For simplicity, mixed paths were not extracted; only down-only and up-only paths were measured. Only unique paths were extracted, that is, those characterized by a unique key of 4 values:
- root function address
- 1st path node function address
- 2nd path node function address
- target string address
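A minimal sketch of how deduplication by this 4-value key could look. The `PathKey` type, the `unique_paths` helper and the use of `None` for an absent path node are assumptions made for illustration; the addresses are made up.

```python
from typing import NamedTuple, Optional


class PathKey(NamedTuple):
    root_func: int            # root function address
    node1: Optional[int]      # 1st path node function address (None if absent, assumed)
    node2: Optional[int]      # 2nd path node function address (None if absent, assumed)
    string_addr: int          # target string address


def unique_paths(raw_paths):
    """Keep the first occurrence of every 4-value key, preserving input order."""
    seen = set()
    result = []
    for path in raw_paths:
        key = PathKey(*path)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


raw = [
    (0x401000, 0x401200, None, 0x51A930),
    (0x401000, 0x401200, None, 0x51A930),  # same key again, dropped
    (0x401000, 0x401300, None, 0x51A930),  # different 1st node, kept as a distinct path
]
deduplicated = unique_paths(raw)
print(len(deduplicated))
```

Two paths that reach the same string through different intermediate functions stay distinct, because the intermediate nodes are part of the key.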
Three mid-sized binaries were evaluated to estimate the average share of each cross-reference type within the extracted datasets. Additionally, a large debug build of an Unreal Engine 5 game was evaluated for comparison.
Let path length be the number of function-to-function edges in the cross-reference graph that form a path from the root function to a string.
For each binary and for each subroutine, the following paths were extracted using experimentally chosen values of the path length parameter:
- direct string references (path length = 0)
- upward reference paths (path length = 1)
- downward reference paths (path length <= 2)
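The three path types can be sketched on a toy call graph as follows. This is an illustration under assumptions, not the extraction code used for the measurements: `extract_paths`, the `calls` and `string_refs` dictionaries, and the reading of "upward" as "a string referenced by a caller of the root" are all hypothetical.

```python
from collections import defaultdict


def extract_paths(calls, string_refs, max_down=2, max_up=1):
    """Return a set of (type, length, root, node1, node2, string) tuples.

    calls:       function -> set of functions it calls
    string_refs: function -> set of strings it references directly
    The limits are expected to stay <= 2 so that the key has two node slots.
    """
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)

    def walk(root, edges, limit):
        # Enumerate function chains of 1..limit edges that start at root.
        frontier = [(root,)]
        for _ in range(limit):
            frontier = [chain + (nxt,)
                        for chain in frontier
                        for nxt in edges.get(chain[-1], ())]
            yield from frontier

    paths = set()
    functions = set(calls) | set(callers) | set(string_refs)
    for root in functions:
        # Path length 0: the root references the string itself.
        for s in string_refs.get(root, ()):
            paths.add(("direct", 0, root, None, None, s))
        for kind, edges, limit in (("upward", callers, max_up),
                                   ("downward", calls, max_down)):
            for chain in walk(root, edges, limit):
                nodes = chain[1:] + (None,) * (2 - len(chain[1:]))
                for s in string_refs.get(chain[-1], ()):
                    paths.add((kind, len(chain) - 1, root, nodes[0], nodes[1], s))
    return paths


calls = {"main": {"init"}, "init": {"log"}}
string_refs = {"main": {"s_main"}, "init": {"s_init"}, "log": {"s_log"}}
paths = extract_paths(calls, string_refs)
for p in sorted(paths, key=str):
    print(p)
```

Returning a set means every path is unique by construction, and mixed paths (up then down, or down then up) are never generated because each walk follows a single edge direction.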
Bomber UE5 game binary evaluation:
- 723632 decompiled subroutines
- 194635 strings
- 15335354 unique function-string path samples
Type | Path length | Samples | Share (%) |
---|---|---|---|
upward | 1 | 5321662 | 34,70 |
direct | 0 | 573797 | 3,74 |
downward | 1 | 1609203 | 10,49 |
downward | 2 | 7830692 | 51,06 |
Diamond Engine binary evaluation:
- 2717 decompiled subroutines
- 4032 strings
- 178167 unique function-string path samples
Type | Path length | Samples | Share (%) |
---|---|---|---|
upward | 1 | 126380 | 70,93 |
direct | 0 | 5598 | 3,14 |
downward | 1 | 12884 | 7,23 |
downward | 2 | 33305 | 18,69 |
Nginx for Visual Studio 2015 binary evaluation:
- 2387 decompiled subroutines
- 1894 strings
- 20844 unique function-string path samples
Type | Path length | Samples | Share (%) |
---|---|---|---|
upward | 1 | 15446 | 74,10 |
direct | 0 | 1756 | 8,42 |
downward | 1 | 1225 | 5,88 |
downward | 2 | 2417 | 11,60 |
Xash3D engine binary evaluation:
- 2380 decompiled subroutines
- 7184 strings
- 97551 unique function-string path samples
Type | Path length | Samples | Share (%) |
---|---|---|---|
upward | 1 | 48533 | 49,75 |
direct | 0 | 5681 | 5,82 |
downward | 1 | 12871 | 13,19 |
downward | 2 | 30466 | 31,23 |
Average binary evaluation:
- 296562 unique function-string path samples (total)
Type | Path length | Samples | Share (%) |
---|---|---|---|
upward | 1 | 190359 | 64,19 |
direct | 0 | 13035 | 4,40 |
downward | 1 | 26980 | 9,10 |
downward | 2 | 66188 | 22,32 |
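The average row values appear to be pooled: the samples of each type are summed over the three mid-sized binaries and divided by the combined total. A short sketch of that arithmetic, using the per-binary counts from the tables above (the `pooled_shares` helper is a hypothetical name):

```python
counts = {
    "Diamond Engine": {("upward", 1): 126380, ("direct", 0): 5598,
                       ("downward", 1): 12884, ("downward", 2): 33305},
    "Nginx":          {("upward", 1): 15446, ("direct", 0): 1756,
                       ("downward", 1): 1225, ("downward", 2): 2417},
    "Xash3D":         {("upward", 1): 48533, ("direct", 0): 5681,
                       ("downward", 1): 12871, ("downward", 2): 30466},
}


def pooled_shares(counts):
    """Sum samples per (type, length) over all binaries and compute shares in %."""
    totals = {}
    for per_binary in counts.values():
        for key, n in per_binary.items():
            totals[key] = totals.get(key, 0) + n
    grand_total = sum(totals.values())
    shares = {key: 100.0 * n / grand_total for key, n in totals.items()}
    return totals, shares, grand_total


totals, shares, grand_total = pooled_shares(counts)
for key in totals:
    print(key, totals[key], f"{shares[key]:.2f}")
print("total", grand_total)
```

Because the shares are computed from a single pooled total, they sum to 100% up to rounding, which is a quick sanity check for the table.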
In the mid-sized binaries, the share of upward references varies roughly within the 50-75% range, about 7 times the average share of downward references of the same path length (64,19% against 9,10%).
The experiment showed that upward references constitute the majority of paths in an average binary. To evaluate their relevance for machine learning model predictions, positive conversion rates were calculated for each path type.
All positive paths (those referencing a string that contains the root function's name) were labelled. The positive-to-all path conversion rates are:
Bomber UE5 game binary conversion rates:
- 15335354 path samples
- 501 positives
- 15334853 negatives
Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
---|---|---|---|---|---|
upward | 1 | 5321662 | 106 | 21,16 | 1,99E-3 |
direct | 0 | 573797 | 353 | 70,46 | 6,15E-2 |
downward | 1 | 1609203 | 27 | 5,39 | 1,68E-3 |
downward | 2 | 7830692 | 15 | 2,99 | 1,92E-4 |
Diamond Engine binary conversion rates:
- 178167 path samples
- 53 positives
- 178114 negatives
Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
---|---|---|---|---|---|
upward | 1 | 126380 | 21 | 39,62 | 0,02 |
direct | 0 | 5598 | 15 | 28,30 | 0,27 |
downward | 1 | 12884 | 12 | 22,64 | 0,09 |
downward | 2 | 33305 | 5 | 9,43 | 0,015 |
Nginx for Visual Studio 2015 binary conversion rates:
- 20844 path samples
- 10 positives
- 20834 negatives
Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
---|---|---|---|---|---|
upward | 1 | 15446 | 0 | 0,00 | 0,00 |
direct | 0 | 1756 | 3 | 30,00 | 0,17 |
downward | 1 | 1225 | 1 | 10,00 | 0,08 |
downward | 2 | 2417 | 6 | 60,00 | 0,25 |
Xash3D engine binary conversion rates:
- 97551 path samples
- 294 positives
- 97257 negatives
Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
---|---|---|---|---|---|
upward | 1 | 48533 | 30 | 10,20 | 0,06 |
direct | 0 | 5681 | 236 | 80,27 | 4,15 |
downward | 1 | 12871 | 22 | 7,48 | 0,17 |
downward | 2 | 30466 | 6 | 2,04 | 0,02 |
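The two percentage columns can be reproduced from the sample and positive counts. The sketch below uses the 97551-sample table; `conversion_rates` is a hypothetical helper name.

```python
def conversion_rates(rows):
    """rows: (type, path length, samples, positives) tuples for one binary.

    Returns, per row, the share of all positives (%) and the share of
    positives within that type's own samples (%).
    """
    total_positives = sum(positives for _, _, _, positives in rows)
    result = []
    for kind, length, samples, positives in rows:
        all_positives_share = 100.0 * positives / total_positives if total_positives else 0.0
        type_sample_share = 100.0 * positives / samples if samples else 0.0
        result.append((kind, length, all_positives_share, type_sample_share))
    return result


rows = [
    ("upward", 1, 48533, 30),
    ("direct", 0, 5681, 236),
    ("downward", 1, 12871, 22),
    ("downward", 2, 30466, 6),
]
rates = conversion_rates(rows)
for kind, length, all_share, type_share in rates:
    print(f"{kind} | {length} | {all_share:.2f} | {type_share:.2f}")
```

The second column is the one that matters for dataset balance: it shows how many samples of a given type have to be kept to gain one positive.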
The results show that positive upward references are rare, even though upward references are the most common path type in the mid-sized binaries. Given that callbacks are ignored because their cross-references are of the data type, and given the results above, it is reasonable to limit the dataset to direct references and downward references of path length 1.
- Select positive tokens:
SELECT * FROM tokens WHERE is_name = 1
- For every token, take its value and search for the correct function name(s) (methodology below) in the PDB JSON or in the IDB with the PDB applied (the latter is preferred).
- Retrieve the addresses of the function and of the token's parent string to complete the path query:
SELECT * FROM paths WHERE func_addr = 4528400 AND string_addr = 5351792
- Take the returned path's `id` and label the path positive:
UPDATE paths SET to_name = 1 WHERE id = 190995
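The positive-labelling steps above can be sketched with Python's `sqlite3` module. The table and column names `tokens.is_name`, `paths.id`, `paths.func_addr`, `paths.string_addr` and `paths.to_name` come from the queries above; the `tokens.value` and `tokens.string_addr` columns, the `name_to_addr` mapping (standing in for the manual PDB/IDB lookup) and the sample rows are assumptions made for illustration.

```python
import sqlite3


def label_positive_paths(conn, name_to_addr):
    """Set to_name = 1 on paths that link a name token's string to the named function."""
    labelled = []
    tokens = conn.execute(
        "SELECT value, string_addr FROM tokens WHERE is_name = 1").fetchall()
    for value, string_addr in tokens:
        func_addr = name_to_addr.get(value)
        if func_addr is None:
            continue  # no function found for this token, leave it for manual review
        rows = conn.execute(
            "SELECT id FROM paths WHERE func_addr = ? AND string_addr = ?",
            (func_addr, string_addr)).fetchall()
        for (path_id,) in rows:
            conn.execute("UPDATE paths SET to_name = 1 WHERE id = ?", (path_id,))
            labelled.append(path_id)
    conn.commit()
    return labelled


# Hypothetical minimal schema and data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tokens (value TEXT, string_addr INTEGER, is_name INTEGER);
CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER,
                    string_addr INTEGER, to_name INTEGER);
INSERT INTO tokens VALUES ('AreFileApisANSI', 5351792, 1), ('failed', 5351792, 0);
INSERT INTO paths VALUES (190995, 4528400, 5351792, NULL),
                         (190996, 4528999, 5351792, NULL);
""")
labelled = label_positive_paths(conn, {"AreFileApisANSI": 4528400})
print(labelled)
```

Parameterized queries (`?` placeholders) avoid pasting addresses into SQL text by hand, and the other path to the same string keeps `to_name` as `NULL` until it is reviewed.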
- Select unlabelled paths:
SELECT * FROM paths WHERE to_name IS NULL
- Look up the referenced string in the IDB file, then look up the function's name:
-- provide the function's offset
SELECT * FROM pdb WHERE func_addr = 12345678
- If you are sure the string does not contain any function names, mark all paths to it as negative:
-- provide the string's offset
UPDATE paths SET to_name = 0 WHERE string_addr = 5376670492
If you are unsure about other references, only mark the reviewed path:
-- provide path id
UPDATE paths SET to_name = 0 WHERE id = 1234
A function is:
- a positive when it has at least 1 positive path
- a negative if it has no positive paths
- unlabelled if it has any `NULL`-labelled paths
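A sketch of how a function-level label could be derived from its path labels. The rules above overlap for a function that has both unreviewed and negative paths, so the precedence used here (positive first, then unlabelled, then negative) is an assumption; `function_label` is a hypothetical helper.

```python
def function_label(path_labels):
    """Derive a function label from its path labels.

    path_labels: iterable of 1 (positive), 0 (negative) or None (NULL, not reviewed).
    Assumed precedence: positive > unlabelled > negative.
    """
    labels = list(path_labels)
    if any(label == 1 for label in labels):
        return "positive"      # at least one positive path
    if any(label is None for label in labels):
        return "unlabelled"    # some path still has to be reviewed
    return "negative"          # every path reviewed, none positive


print(function_label([0, 1, None]))
print(function_label([0, None]))
print(function_label([0, 0]))
```

With this precedence a function only becomes a negative once every one of its paths has been reviewed.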
A function-string path is labelled positive when its string includes a substring that is the full or partial original function name retrieved from the symbol file.
A function-token path is labelled positive when it ends with a token that is the full or partial original function name retrieved from the symbol file.
A function name is labelled positive when:
- the positive token name exactly matches the name retrieved from the PDB (full)
- the positive token name is a substring of the name retrieved from the PDB and is functionally equivalent to the original (partial)
Examples for the latter case:
Token value | PDB name (demangled) |
---|---|
AreFileApisANSI | __acrt_AreFileApisANSI |
InitializeCriticalSectionEx | __vcrt_InitializeCriticalSectionEx |
AppPolicyGetProcessTerminationMethod | __acrt_AppPolicyGetProcessTerminationMethodInternal |
acos | __CIacos |
Functions that meet the second requirement but are outside of the maximum reference path length should be labelled as negatives.
In cases where the token name included namespaces, all occurrences of the function name without namespace prefixes were checked.
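The textual part of the full/partial rule can be pre-screened automatically, as in the sketch below. The helpers `strip_namespaces` and `match_kind` are hypothetical, and treating a namespace-stripped exact match as "full" is an assumption. A "partial" result is only a candidate: functional equivalence to the original and the path length limit still need a manual check.

```python
def strip_namespaces(name):
    """Return the last component of a '::'-qualified name."""
    return name.rsplit("::", 1)[-1]


def match_kind(token, pdb_name):
    """Classify a token against a demangled PDB name.

    Returns 'full' for an exact match, 'partial' when the token (with or
    without its namespace prefix) is a substring of the PDB name, else None.
    """
    candidates = {token, strip_namespaces(token)}
    if pdb_name in candidates:
        return "full"
    if any(c and c in pdb_name for c in candidates):
        return "partial"
    return None


print(match_kind("acos", "__CIacos"))
print(match_kind("AreFileApisANSI", "__acrt_AreFileApisANSI"))
print(match_kind("foo::bar", "bar"))
print(match_kind("memcpy", "__CIacos"))
```

Short tokens match many unrelated names as substrings, so the output is best used to queue candidates for review rather than to write labels directly.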