Xref paths

Cross reference paths

The shares of the different function-to-string cross-reference path types were measured to better understand and balance the datasets.

For simplicity, mixed paths were not extracted; only down-only and up-only paths were measured. Only unique paths were extracted, that is, those identified by a unique key of 4 values:

root function address
1st path node function address
2nd path node function address
target string address
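
A minimal sketch of deduplicating paths by this 4-value key. The `PathKey` type, its field names, and the use of `None` for an absent path node are illustrative assumptions, not taken from the extraction tooling.

```python
from typing import NamedTuple, Optional

class PathKey(NamedTuple):
    """Hypothetical container for the 4-value unique key described above."""
    root_func: int            # root function address
    node1: Optional[int]      # 1st path node function address (None if absent, an assumption)
    node2: Optional[int]      # 2nd path node function address (None if absent, an assumption)
    string_addr: int          # target string address

def unique_paths(raw):
    """Drop repeated keys while keeping first-seen order."""
    return list(dict.fromkeys(raw))

raw = [
    PathKey(0x401000, None, None, 0x51A000),
    PathKey(0x401000, 0x402000, None, 0x51A000),
    PathKey(0x401000, None, None, 0x51A000),  # exact duplicate of the first entry
]
paths = unique_paths(raw)
```

Two paths between the same function and string stay distinct as long as they pass through different intermediate nodes.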

Three medium-sized binaries were evaluated to estimate the average share of each cross-reference type within the extracted datasets. Additionally, a large debug build of an Unreal Engine 5 game was evaluated for comparison.

Let the path length be the number of function-to-function edges in the cross-reference graph that form the path from the root function to a string.

For each binary and each subroutine, the following paths were extracted, using experimentally chosen values for the path length parameter:

direct string references (path length = 0)
upward reference paths (path length = 1)
downward reference paths (path length <= 2)
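
The extraction can be sketched as a bounded walk over the call graph. This is an illustration under assumptions: the text does not define the directions explicitly, so the sketch takes "downward" to mean following callee edges and "upward" to mean following caller edges. The `calls` and `string_refs` mappings and the tuple layout are hypothetical.

```python
from collections import defaultdict

MAX_DOWN = 2  # downward path length <= 2, as listed above
MAX_UP = 1    # upward path length = 1, as listed above

def extract_paths(calls, string_refs):
    """Return a set of (type, length, root, node1, node2, string) tuples.

    calls:       function -> set of functions it calls (assumed input shape)
    string_refs: function -> set of strings it references directly
    """
    callers = defaultdict(set)
    for func, callees in calls.items():
        for callee in callees:
            callers[callee].add(func)

    funcs = set(calls) | set(callers) | set(string_refs)
    out = set()
    for root in funcs:
        # Path length 0: the root references the string itself.
        for s in string_refs.get(root, ()):
            out.add(("direct", 0, root, None, None, s))
        for kind, edges, limit in (("downward", calls, MAX_DOWN),
                                   ("upward", callers, MAX_UP)):
            frontier = [(root,)]
            for depth in range(1, limit + 1):
                next_frontier = []
                for chain in frontier:
                    for node in edges.get(chain[-1], ()):
                        if node in chain:
                            continue  # do not revisit functions on this path
                        new_chain = chain + (node,)
                        next_frontier.append(new_chain)
                        # Pad the intermediate nodes to exactly two slots.
                        nodes = new_chain[1:] + (None,) * (2 - depth)
                        for s in string_refs.get(node, ()):
                            out.add((kind, depth, root, nodes[0], nodes[1], s))
                frontier = next_frontier
    return out

# Toy graph: function 1 calls 2, which calls 3; each references one string.
calls = {1: {2}, 2: {3}}
string_refs = {1: {"a"}, 2: {"b"}, 3: {"c"}}
paths = extract_paths(calls, string_refs)
```

Because the result is a set of full tuples, repeated discoveries of the same path collapse automatically.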

Type frequency

Bomber game

Bomber UE5 game binary evaluation:

723632 decompiled subroutines
194635 strings
15335354 unique function-string path samples

Type	Path length	Samples	Share (%)
upward	1	5321662	34,70
direct	0	573797	3,74
downward	1	1609203	10,49
downward	2	7830692	51,06

Diamond Engine

Diamond Engine binary evaluation:

2717 decompiled subroutines
4032 strings
178167 unique function-string path samples

Type	Path length	Samples	Share (%)
upward	1	126380	70,93
direct	0	5598	3,14
downward	1	12884	7,23
downward	2	33305	18,69

Nginx CLI

Nginx for Visual Studio 2015 binary evaluation:

2387 decompiled subroutines
1894 strings
20844 unique function-string path samples

Type	Path length	Samples	Share (%)
upward	1	15446	74,10
direct	0	1756	8,42
downward	1	1225	5,88
downward	2	2417	11,60

Xash3D

Xash3D engine binary evaluation:

2380 decompiled subroutines
7184 strings
97551 unique function-string path samples

Type	Path length	Samples	Share (%)
upward	1	48533	49,75
direct	0	5681	5,82
downward	1	12871	13,19
downward	2	30466	31,23

Average

Pooled evaluation of the three medium-sized binaries (Diamond Engine, Nginx CLI, Xash3D; the Bomber game is excluded):

296562 unique function-string path samples (total)

Type	Path length	Samples	Share (%)
upward	1	190359	64,19
direct	0	13035	4,40
downward	1	26980	9,10
downward	2	66188	22,32

In the medium-sized binaries, the share of upward references varies within the 50-75% range. On average it is about 7 times the share of downward references of the same path length (64,19% versus 9,10%).

The experiment showed that upward references constitute the majority of paths in an average binary. To evaluate their relevance for machine learning model predictions, we calculate the positive conversion rate for each path type.
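
The pooled figures can be recomputed from the per-binary tables above. The sketch below only sums the published sample counts; the dictionary layout and key names are illustrative.

```python
# Sample counts copied from the per-binary tables (medium-sized binaries only).
counts = {
    "Diamond Engine": {"upward_1": 126380, "direct_0": 5598, "downward_1": 12884, "downward_2": 33305},
    "Nginx CLI":      {"upward_1": 15446,  "direct_0": 1756, "downward_1": 1225,  "downward_2": 2417},
    "Xash3D":         {"upward_1": 48533,  "direct_0": 5681, "downward_1": 12871, "downward_2": 30466},
}

path_types = ["upward_1", "direct_0", "downward_1", "downward_2"]

# Pool the samples of each path type across the binaries.
pooled = {t: sum(binary[t] for binary in counts.values()) for t in path_types}
total = sum(pooled.values())

# Share of each type in percent, and the upward / downward ratio at path length 1.
shares = {t: 100 * n / total for t, n in pooled.items()}
ratio = pooled["upward_1"] / pooled["downward_1"]

for t in path_types:
    print(f"{t}\t{pooled[t]}\t{shares[t]:.2f}")
print(f"upward_1 / downward_1 = {ratio:.2f}")
```

Note that this pools samples rather than averaging the per-binary percentages, so larger binaries weigh more.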

Positives

All positive paths (those referencing a string that contains the root function's name) were labelled. The positive-to-all path conversion rates are:

Bomber game

15335354 path samples
501 positives
15334853 negatives

Type	Path length	Samples	Positives	Share of all positives (%)	Positive rate within type (%)
upward	1	5321662	106	21,16	1,99E-3
direct	0	573797	353	70,46	6,15E-2
downward	1	1609203	27	5,39	1,68E-3
downward	2	7830692	15	2,99	1,92E-4

Diamond Engine

178167 path samples
53 positives
178114 negatives

Type	Path length	Samples	Positives	Share of all positives (%)	Positive rate within type (%)
upward	1	126380	21	39,62	0,02
direct	0	5598	15	28,30	0,27
downward	1	12884	12	22,64	0,09
downward	2	33305	5	9,43	0,015

Nginx CLI

20844 path samples
10 positives
20834 negatives

Type	Path length	Samples	Positives	Share of all positives (%)	Positive rate within type (%)
upward	1	15446	0	0,00	0,00
direct	0	1756	3	30,00	0,17
downward	1	1225	1	10,00	0,08
downward	2	2417	6	60,00	0,25

Xash3D

97551 path samples
294 positives
97257 negatives

Type	Path length	Samples	Positives	Share of all positives (%)	Positive rate within type (%)
upward	1	48533	30	10,20	0,06
direct	0	5681	236	80,27	4,15
downward	1	12871	22	7,48	0,17
downward	2	30466	6	2,04	0,02
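
The two percentage columns in these tables can be reproduced as follows. The sketch uses the Xash3D counts from the table above: the first figure divides a type's positives by all positives in the binary, the second divides them by the samples of that type. The variable names are illustrative.

```python
# (samples, positives) per path type, copied from the Xash3D table.
rows = {
    "upward_1":   (48533, 30),
    "direct_0":   (5681, 236),
    "downward_1": (12871, 22),
    "downward_2": (30466, 6),
}

total_positives = sum(positives for _, positives in rows.values())

result = {}
for path_type, (samples, positives) in rows.items():
    share_of_positives = 100 * positives / total_positives  # share among all positives, %
    rate_within_type = 100 * positives / samples            # conversion rate within the type, %
    result[path_type] = (share_of_positives, rate_within_type)

for path_type, (share, rate) in result.items():
    print(f"{path_type}\t{share:.2f}\t{rate:.2f}")
```

The second figure is the one that matters when comparing path types, since it does not depend on how many samples each type contributes.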

The results show that upward paths are rarely positive even though they are the most common path type in all three medium-sized binaries. Given these results, and given that callbacks are ignored because their cross-references are of the data type, it is reasonable to limit the dataset to direct references and downward references of path length 1.

Path labelling

Steps

Positives

Select positive tokens:

SELECT * FROM tokens WHERE is_name = 1

For every token, take its value and search for the correct function name(s) (see Methodology below) in the PDB JSON or in the IDB with the PDB applied (the latter is preferred).
Retrieve the addresses of the function and of the token's parent string to complete the path query:

SELECT * FROM paths WHERE func_addr = 4528400 AND string_addr = 5351792

Take the returned path's id and label the path positive:

UPDATE paths SET to_name = 1 WHERE id = 190995
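
The lookup and update above can be combined in a small helper. This is a sketch that assumes the database is SQLite and that `paths` has at least the columns `id`, `func_addr`, `string_addr`, and `to_name`; the real schema may contain more. The helper name `label_positive` is hypothetical.

```python
import sqlite3

def label_positive(conn, func_addr, string_addr):
    """Label every path from func_addr to string_addr as positive; return their ids."""
    rows = conn.execute(
        "SELECT id FROM paths WHERE func_addr = ? AND string_addr = ?",
        (func_addr, string_addr),
    ).fetchall()
    ids = [row[0] for row in rows]
    conn.executemany("UPDATE paths SET to_name = 1 WHERE id = ?", [(i,) for i in ids])
    conn.commit()
    return ids

# In-memory stand-in for the real database (assumed minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER, "
    "string_addr INTEGER, to_name INTEGER)"
)
conn.executemany(
    "INSERT INTO paths VALUES (?, ?, ?, ?)",
    [(190995, 4528400, 5351792, None), (190996, 4528400, 5351800, None)],
)

labelled = label_positive(conn, 4528400, 5351792)
labels = dict(conn.execute("SELECT id, to_name FROM paths"))
```

Query parameters are bound with `?` placeholders rather than pasted into the SQL text, which avoids typos when copying addresses by hand.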

Negatives

Select unlabelled paths:

SELECT * FROM paths WHERE to_name IS NULL

Look up the referenced string in the IDB file, then look up the function's name:

-- provide the function's offset
SELECT * FROM pdb WHERE func_addr = 12345678

If you are sure the string does not contain any function names, mark all paths to it as negative:

-- provide the string's offset
UPDATE paths SET to_name = 0 WHERE string_addr = 5376670492

If you are unsure about other references, only mark the reviewed path:

-- provide path id
UPDATE paths SET to_name = 0 WHERE id = 1234
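
Both negative-labelling updates can be wrapped in helpers. This is a sketch that assumes SQLite and the same minimal `paths` columns used in the queries above; the helper names are hypothetical. Unlike the plain query, the string-wide helper adds an `AND to_name IS NULL` guard so that it cannot overwrite a path that was already labelled; that guard is an addition for safety, not part of the original steps.

```python
import sqlite3

def mark_string_negative(conn, string_addr):
    """Mark all still-unlabelled paths to a string as negative; return rows changed."""
    cur = conn.execute(
        "UPDATE paths SET to_name = 0 WHERE string_addr = ? AND to_name IS NULL",
        (string_addr,),
    )
    conn.commit()
    return cur.rowcount

def mark_path_negative(conn, path_id):
    """Mark a single reviewed path as negative; return rows changed."""
    cur = conn.execute("UPDATE paths SET to_name = 0 WHERE id = ?", (path_id,))
    conn.commit()
    return cur.rowcount

# In-memory stand-in for the real database (assumed minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER, "
    "string_addr INTEGER, to_name INTEGER)"
)
conn.executemany(
    "INSERT INTO paths VALUES (?, ?, ?, ?)",
    [(1, 100, 9000, None), (2, 200, 9000, 1), (3, 300, 9000, None), (1234, 400, 9500, None)],
)

changed_by_string = mark_string_negative(conn, 9000)
changed_by_id = mark_path_negative(conn, 1234)
labels = dict(conn.execute("SELECT id, to_name FROM paths"))
```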

Methodology

A function is:

a positive when it has at least 1 positive path
a negative if it has no positive paths
unlabelled if it has any NULL-labelled paths
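
The three rules can overlap, for example when a function has one positive path and one NULL-labelled path. The sketch below resolves this with an assumed precedence that the text does not state: positive first, then unlabelled, then negative. The function name and the label encoding (`1`, `0`, `None`) follow the `to_name` column but are otherwise illustrative.

```python
def function_label(path_labels):
    """Derive a function label from the to_name values of its paths.

    Assumed precedence (not stated in the text): positive > unlabelled > negative.
    """
    labels = list(path_labels)
    if any(label == 1 for label in labels):
        return "positive"      # at least 1 positive path
    if any(label is None for label in labels):
        return "unlabelled"    # some path has not been reviewed yet
    return "negative"          # every path reviewed, none positive

examples = {
    "all reviewed, one hit": function_label([0, 1, 0]),
    "not fully reviewed": function_label([0, None]),
    "all reviewed, no hit": function_label([0, 0]),
}
```

Under this ordering a function is only reported as negative once all of its paths have been reviewed.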

A function-string path is labelled positive when its target string includes a substring that is the full or partial original function name retrieved from the symbol file.

A function-token path is labelled positive when it ends with a token that is the full or partial original function name retrieved from the symbol file.

A token name counts as a positive match when:

the token value exactly matches the name retrieved from the PDB (full)
the token value is a substring of the name retrieved from the PDB and is functionally equivalent to the original (partial)

Examples of the latter case:

Token value	PDB name (demangled)
`AreFileApisANSI`	`__acrt_AreFileApisANSI`
`InitializeCriticalSectionEx`	`__vcrt_InitializeCriticalSectionEx`
`AppPolicyGetProcessTerminationMethod`	`__acrt_AppPolicyGetProcessTerminationMethodInternal`
`acos`	`__CIacos`
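
The string part of these two rules can be pre-checked automatically. This is a sketch only: the substring test finds candidates, while the "functionally equivalent" requirement still needs manual review. The function name `match_kind` and its return values are hypothetical.

```python
def match_kind(token, pdb_name):
    """Classify a token against a demangled PDB name.

    Returns "full" for an exact match, "partial-candidate" when the token is a
    proper substring (functional equivalence must still be confirmed by hand),
    and None otherwise.
    """
    if not token:
        return None
    if token == pdb_name:
        return "full"
    if token in pdb_name:
        return "partial-candidate"
    return None

# Pairs taken from the examples table above, plus one exact match.
checks = [
    match_kind("AreFileApisANSI", "__acrt_AreFileApisANSI"),
    match_kind("acos", "__CIacos"),
    match_kind("AppPolicyGetProcessTerminationMethod",
               "__acrt_AppPolicyGetProcessTerminationMethodInternal"),
    match_kind("memcpy", "memcpy"),
    match_kind("memcpy", "strlen"),
]
```

Short tokens such as `acos` will also match unrelated names that happen to contain them, which is why the partial case is only reported as a candidate.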

Functions that meet the second requirement but lie beyond the maximum reference path length should be labelled as negatives.

In cases where the token name included namespaces, all occurrences of the function name without namespace prefixes were checked.
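
Stripping the namespace prefix before searching can be sketched as below. It assumes C++-style `::` separators and plain qualified names; names whose last component contains template arguments with their own `::` would need a real demangled-name parser. The helper name and the sample names are hypothetical.

```python
def strip_namespaces(qualified_name):
    """Return the last component of a '::'-qualified name."""
    return qualified_name.rsplit("::", 1)[-1]

def name_variants(token):
    """Names to search for: the token as written and its unqualified form."""
    bare = strip_namespaces(token)
    return [token] if bare == token else [token, bare]

variants = name_variants("engine::render::DrawFrame")  # hypothetical token
unqualified = name_variants("DrawFrame")
```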
