Xref paths

Cross reference paths

Shares of different function-to-string cross reference path types were evaluated to better understand and balance the datasets.

For simplicity, mixed paths were not extracted: only down-only and up-only paths were measured. Only unique paths were extracted, that is, paths identified by a unique key of 4 values:

root function address
1st path node function address
2nd path node function address
target string address
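A minimal sketch of deduplication by this 4-value key. The `PathKey` type and `unique_paths` helper are hypothetical names, and representing an absent path node as `None` is an assumption, not something the extraction tooling is known to do.

```python
from typing import Iterable, List, NamedTuple, Optional


class PathKey(NamedTuple):
    """Hypothetical 4-value key identifying one function-to-string path."""
    root_func_addr: int
    node1_func_addr: Optional[int]  # None when the path has no 1st node (assumption)
    node2_func_addr: Optional[int]  # None when the path has no 2nd node (assumption)
    string_addr: int


def unique_paths(paths: Iterable[PathKey]) -> List[PathKey]:
    """Drop repeated keys while keeping the first-seen order."""
    return list(dict.fromkeys(paths))


# Illustrative addresses only.
a = PathKey(0x401000, None, None, 0x500000)
b = PathKey(0x401000, 0x402000, None, 0x500000)
deduplicated = unique_paths([a, b, a])
```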

Three medium-sized binaries were evaluated to estimate the average share of different cross-reference types within the extracted datasets.

Let path length be the number of function-to-function edges in the cross-reference graph that form a path from the root function to a string.

For each binary and for each subroutine, the following paths were extracted using experimentally chosen values of the path length parameter:

direct string references (path length = 0)
upward reference paths (path length = 1)
downward reference paths (path length <= 2)
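The three path types listed above can be sketched as a small enumeration over a call graph. This is an illustration under stated assumptions, not the extraction code behind the datasets: `extract_paths`, the `calls` mapping (caller to callees) and the `string_refs` mapping (function to referenced string addresses) are hypothetical, and an upward path is assumed to mean that a caller of the root function references the string.

```python
from collections import defaultdict


def extract_paths(calls, string_refs, max_down=2):
    """Enumerate unique (type, length, root, node1, node2, string) tuples.

    calls:       dict mapping a function to the set of functions it calls
    string_refs: dict mapping a function to the set of strings it references
    """
    # Invert the call graph so callers can be looked up for upward paths.
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)

    funcs = set(calls) | set(callers) | set(string_refs)
    paths = set()
    for root in funcs:
        # Direct reference: the root itself references the string (length 0).
        for s in string_refs.get(root, ()):
            paths.add(("direct", 0, root, None, None, s))
        # Upward reference: a caller of the root references the string (length 1).
        for up in callers.get(root, ()):
            for s in string_refs.get(up, ()):
                paths.add(("upward", 1, root, up, None, s))
        # Downward references: a callee (length 1) or a callee of a callee (length 2).
        for n1 in calls.get(root, ()):
            for s in string_refs.get(n1, ()):
                paths.add(("downward", 1, root, n1, None, s))
            if max_down >= 2:
                for n2 in calls.get(n1, ()):
                    for s in string_refs.get(n2, ()):
                        paths.add(("downward", 2, root, n1, n2, s))
    return paths


# Toy graph: 1 calls 2, 2 calls 3; each function references one string.
toy_paths = extract_paths({1: {2}, 2: {3}}, {1: {100}, 2: {200}, 3: {300}})
```

Collecting the tuples in a set keeps only unique paths, and mixed up-then-down paths are never generated.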

Type frequency

Diamond Engine

Diamond Engine binary evaluation:

2717 decompiled subroutines
4032 strings
178167 unique function-string path samples

Type	Path length	Samples	Share (%)
upward	1	126380	70,93
direct	0	5598	3,14
downward	1	12884	7,23
downward	2	33305	18,69

Nginx CLI

Nginx for Visual Studio 2015 binary evaluation:

2387 decompiled subroutines
1894 strings
20844 unique function-string path samples

Type	Path length	Samples	Share (%)
upward	1	15446	74,10
direct	0	1756	8,42
downward	1	1225	5,88
downward	2	2417	11,60

Xash3D

Xash3D engine binary evaluation:

2380 decompiled subroutines
7184 strings
97551 unique function-string path samples

Type	Path length	Samples	Share (%)
upward	1	48533	49,75
direct	0	5681	5,82
downward	1	12871	13,19
downward	2	30466	31,23

Average

Aggregate evaluation across the three binaries:

296562 unique function-string path samples (total)

Type	Path length	Samples	Share (%)
upward	1	190359	64,19
direct	0	13035	4,40
downward	1	26980	9,10
downward	2	66188	22,32

The share of upward references varies within the 50-75% range. On average it is about 7 times the share of downward references of the same path length (64,19% against 9,10%).
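The aggregate figures can be recomputed from the per-binary tables. The sketch below sums the sample counts per path type and derives the shares; the `COUNTS` layout and the helper names are hypothetical, while the numbers are copied from the tables above.

```python
# Sample counts per (type, path length), copied from the per-binary tables.
COUNTS = {
    "Diamond Engine": {("upward", 1): 126380, ("direct", 0): 5598,
                       ("downward", 1): 12884, ("downward", 2): 33305},
    "Nginx CLI": {("upward", 1): 15446, ("direct", 0): 1756,
                  ("downward", 1): 1225, ("downward", 2): 2417},
    "Xash3D": {("upward", 1): 48533, ("direct", 0): 5681,
               ("downward", 1): 12871, ("downward", 2): 30466},
}


def aggregate(counts):
    """Sum the sample counts of every path type over all binaries."""
    total = {}
    for per_type in counts.values():
        for key, n in per_type.items():
            total[key] = total.get(key, 0) + n
    return total


def shares(per_type):
    """Share of each path type in percent, rounded to two decimals."""
    all_samples = sum(per_type.values())
    return {key: round(100 * n / all_samples, 2) for key, n in per_type.items()}


total = aggregate(COUNTS)
total_shares = shares(total)
# Ratio of upward to downward references of the same path length (1).
up_to_down = total[("upward", 1)] / total[("downward", 1)]
```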

The experiment showed that upward references constitute the majority of paths in an average binary. To evaluate their relevance for machine learning model predictions, positive conversion rates were calculated for each path type.

Positives

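All positive paths (those referencing a string that contains the root function's name) were labelled. For each path type, the tables below give its share of all positives in the binary ("All positives share") and the share of positives among that type's samples ("Type sample share"), that is, the positive-to-all conversion rate: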

Diamond Engine

178167 path samples
53 positives
178114 negatives

Type	Path length	Samples	Positives	All positives share (%)	Type sample share (%)
upward	1	126380	21	39,62	0,02
direct	0	5598	15	28,30	0,27
downward	1	12884	12	22,64	0,09
downward	2	33305	5	9,43	0,015

Nginx CLI

20844 path samples
10 positives
20834 negatives

Type	Path length	Samples	Positives	All positives share (%)	Type sample share (%)
upward	1	15446	0	0,00	0,00
direct	0	1756	3	30,00	0,17
downward	1	1225	1	10,00	0,08
downward	2	2417	6	60,00	0,25

Xash3D

97551 path samples
294 positives
97257 negatives

Type	Path length	Samples	Positives	All positives share (%)	Type sample share (%)
upward	1	48533	30	10,20	0,06
direct	0	5681	236	80,27	4,15
downward	1	12871	22	7,48	0,17
downward	2	30466	6	2,04	0,02

The results show that upward references are rarely positive (0,00-0,06% of their samples) despite consistently being the most common reference type. Because callbacks are ignored (their cross references are of the data type), and given the results above, it is reasonable to limit the dataset to direct references and downward references of path length 1.
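The two share columns of the positives tables can be recomputed as follows. This is a small sketch with a hypothetical `positive_shares` helper; the input rows are the Xash3D figures from the table above.

```python
def positive_shares(rows):
    """rows: list of (type, path length, samples, positives).

    Returns (type, path length, share of all positives in %,
    share of positives among the type's samples in %).
    """
    total_positives = sum(positives for _, _, _, positives in rows)
    result = []
    for kind, length, samples, positives in rows:
        share_of_all = 100 * positives / total_positives if total_positives else 0.0
        rate_in_type = 100 * positives / samples if samples else 0.0
        result.append((kind, length, round(share_of_all, 2), round(rate_in_type, 2)))
    return result


# Xash3D rows copied from the table above.
XASH3D = [
    ("upward", 1, 48533, 30),
    ("direct", 0, 5681, 236),
    ("downward", 1, 12871, 22),
    ("downward", 2, 30466, 6),
]
xash3d_shares = positive_shares(XASH3D)
```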

Path labelling

Steps

To label paths (to_name column):

Select positive tokens:

SELECT * FROM tokens WHERE is_name = 1

For every token, take its value and search for the correct function name(s) (methodology below) in the PDB JSON or in the IDB with the PDB applied (the latter is preferred).
Retrieve the addresses of the function and of the token's parent string to complete the path query:

SELECT * FROM paths WHERE func_addr = 4528400 AND string_addr = 5351792

Take returned path's id and label the path positive:

UPDATE paths SET to_name = 1 WHERE id = 190995
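The last two steps can be combined in a short script. The sketch below assumes the dataset is an SQLite database reachable through Python's `sqlite3` module; the `label_path` helper and the toy table definition are hypothetical, and only the column names come from the queries above. If several paths connect the same function and string (through different intermediate nodes), this sketch labels all of them, which is an assumption.

```python
import sqlite3


def label_path(conn, func_addr, string_addr):
    """Set to_name = 1 on every path from func_addr to string_addr; return the ids."""
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM paths WHERE func_addr = ? AND string_addr = ?",
        (func_addr, string_addr),
    )]
    conn.executemany("UPDATE paths SET to_name = 1 WHERE id = ?",
                     [(path_id,) for path_id in ids])
    conn.commit()
    return ids


# Toy in-memory database; the real schema may differ (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER, "
    "string_addr INTEGER, to_name INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO paths (id, func_addr, string_addr) VALUES (?, ?, ?)",
    [(190995, 4528400, 5351792), (190996, 4528400, 5351800)],
)
labelled = label_path(conn, 4528400, 5351792)
```

Passing the addresses as query parameters avoids editing the SQL text by hand for every token.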

Methodology

A function name is labelled positive when:

the positive token name matches the exact name retrieved from PDB
the positive token name is a substring of the name retrieved from PDB and is functionally equivalent to the original

Examples for the latter case:

Token value	PDB name (demangled)
`AreFileApisANSI`	`__acrt_AreFileApisANSI`
`InitializeCriticalSectionEx`	`__vcrt_InitializeCriticalSectionEx`
`AppPolicyGetProcessTerminationMethod`	`__acrt_AppPolicyGetProcessTerminationMethodInternal`
`acos`	`__CIacos`

Functions that meet the second requirement but lie outside the maximum reference path length are labelled as negatives.

Where a token name included namespaces, all occurrences of the function name without namespace prefixes were checked.
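The name comparison can be pre-screened automatically. The sketch below only proposes candidates: the `match_kind` and `strip_namespaces` helpers are hypothetical, splitting namespaces on `::` is an assumption, and a substring hit still needs a manual check for functional equivalence and for the maximum reference path length.

```python
def strip_namespaces(name):
    """Return the part of a name after the last '::' (assumed namespace separator)."""
    return name.rsplit("::", 1)[-1]


def match_kind(token, pdb_name):
    """Classify a token against a demangled PDB name.

    Returns "exact", "substring" (a candidate that needs manual review),
    or None when the names do not match at all.
    """
    candidates = {token, strip_namespaces(token)}
    if pdb_name in candidates:
        return "exact"
    if any(c and c in pdb_name for c in candidates):
        return "substring"
    return None


# Pairs from the examples table above, plus a hypothetical namespaced token.
kinds = [
    match_kind("AreFileApisANSI", "__acrt_AreFileApisANSI"),
    match_kind("acos", "__CIacos"),
    match_kind("ns::foo", "__wrap_foo"),
]
```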
