Xref paths

michal-kapala edited this page Aug 23, 2023 · 11 revisions

Cross reference paths

The shares of different function-to-string cross-reference path types were evaluated to better understand and balance the datasets.

Mixed paths were not extracted, for simplicity; the measured cases are down-only and up-only paths. Only unique paths were extracted, that is, those characterized by a unique key of 4 values:

  • root function address
  • 1st path node function address
  • 2nd path node function address
  • target string address

3 mid-sized binaries were evaluated to estimate the average share of each cross-reference type within the extracted datasets. Additionally, a large debug build of an Unreal Engine 5 game was evaluated for comparison.

Let path length be the number of function-to-function edges in the cross-reference graph on the path from the root function to a string.

For each binary and each subroutine, the following paths were extracted using experimental values of the path length parameter:

  • direct string references (path length = 0)
  • upward reference paths (path length = 1)
  • downward reference paths (path length <= 2)
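
The extraction described above can be sketched in Python as follows. This is a minimal illustration, not the project's extractor: it assumes that an upward path goes from the root function to one of its callers and then to a string referenced by that caller, and that a downward path goes through one or two callees. The input dictionaries `calls` and `string_refs` and the function name `extract_paths` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# (root function, 1st path node, 2nd path node, target string);
# path nodes that are not used by a path are stored as None.
PathKey = Tuple[int, Optional[int], Optional[int], int]


def extract_paths(
    calls: Dict[int, Set[int]],        # caller address -> callee addresses
    string_refs: Dict[int, Set[int]],  # function address -> string addresses
) -> Dict[str, Set[PathKey]]:
    """Collect unique direct, upward (length 1) and downward (length <= 2) paths."""
    # Invert the call graph once so the callers of a function can be looked up.
    callers: Dict[int, Set[int]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    functions = set(calls) | set(callers) | set(string_refs)
    paths: Dict[str, Set[PathKey]] = {
        "direct": set(), "upward": set(), "downward_1": set(), "downward_2": set(),
    }
    for root in functions:
        # Path length 0: the root references the string itself.
        for s in string_refs.get(root, ()):
            paths["direct"].add((root, None, None, s))
        # Assumed upward path of length 1: root -> caller -> string.
        for parent in callers.get(root, ()):
            for s in string_refs.get(parent, ()):
                paths["upward"].add((root, parent, None, s))
        # Downward paths of length 1 and 2: root -> callee (-> callee) -> string.
        for child in calls.get(root, ()):
            for s in string_refs.get(child, ()):
                paths["downward_1"].add((root, child, None, s))
            for grandchild in calls.get(child, ()):
                for s in string_refs.get(grandchild, ()):
                    paths["downward_2"].add((root, child, grandchild, s))
    return paths


# Toy graph: 0x1000 calls 0x2000, which calls 0x3000; each references one string.
demo_calls = {0x1000: {0x2000}, 0x2000: {0x3000}}
demo_refs = {0x1000: {0xA0}, 0x2000: {0xB0}, 0x3000: {0xC0}}
demo_paths = extract_paths(demo_calls, demo_refs)
```

Storing paths in sets keyed by the 4-value tuple keeps only unique paths, which matches the uniqueness rule stated earlier.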

Type frequency

Bomber game

Bomber UE5 game binary evaluation:

  • 723632 decompiled subroutines
  • 194635 strings
  • 15335354 unique function-string path samples

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 5321662 | 34.70 |
| direct | 0 | 573797 | 3.74 |
| downward | 1 | 1609203 | 10.49 |
| downward | 2 | 7830692 | 51.06 |

Diamond Engine

Diamond Engine binary evaluation:

  • 2717 decompiled subroutines
  • 4032 strings
  • 178167 unique function-string path samples

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 126380 | 70.93 |
| direct | 0 | 5598 | 3.14 |
| downward | 1 | 12884 | 7.23 |
| downward | 2 | 33305 | 18.69 |

Nginx CLI

Nginx for Visual Studio 2015 binary evaluation:

  • 2387 decompiled subroutines
  • 1894 strings
  • 20844 unique function-string path samples

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 15446 | 74.10 |
| direct | 0 | 1756 | 8.42 |
| downward | 1 | 1225 | 5.88 |
| downward | 2 | 2417 | 11.60 |

Xash3D

Xash3D engine binary evaluation:

  • 2380 decompiled subroutines
  • 7184 strings
  • 97551 unique function-string path samples

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 48533 | 49.75 |
| direct | 0 | 5681 | 5.82 |
| downward | 1 | 12871 | 13.19 |
| downward | 2 | 30466 | 31.23 |

Average

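Average binary evaluation (the three mid-sized binaries pooled, Bomber excluded):

  • 296562 unique function-string path samples (total)

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 190359 | 64.19 |
| direct | 0 | 13035 | 4.40 |
| downward | 1 | 26980 | 9.10 |
| downward | 2 | 66188 | 22.32 |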

The share of upward references varies within roughly the 50-75% range across the mid-sized binaries. Their pooled share is about 7 times the pooled share of downward references of the same path length.

The experiment showed that upward references constitute the majority of paths in an average binary. To evaluate their relevance for machine learning model predictions, positive conversion rates were calculated for each path type.
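
The pooled shares and the upward-to-downward ratio can be re-derived from the per-binary sample counts. The following is a minimal Python sketch; the dictionary layout and the function name `pooled_shares` are illustrative and not part of the project.

```python
# Per-binary sample counts of the three mid-sized binaries, copied from the tables.
COUNTS = {
    "Diamond Engine": {"upward 1": 126380, "direct 0": 5598, "downward 1": 12884, "downward 2": 33305},
    "Nginx CLI": {"upward 1": 15446, "direct 0": 1756, "downward 1": 1225, "downward 2": 2417},
    "Xash3D": {"upward 1": 48533, "direct 0": 5681, "downward 1": 12871, "downward 2": 30466},
}


def pooled_shares(counts):
    """Sum samples per path type over all binaries and return (samples, share in %)."""
    totals = {}
    for per_type in counts.values():
        for path_type, samples in per_type.items():
            totals[path_type] = totals.get(path_type, 0) + samples
    grand_total = sum(totals.values())
    return {t: (n, 100.0 * n / grand_total) for t, n in totals.items()}


shares = pooled_shares(COUNTS)
# Ratio of the pooled upward share to the pooled downward share at path length 1.
ratio = shares["upward 1"][1] / shares["downward 1"][1]

for path_type, (samples, pct) in shares.items():
    print(f"{path_type:<12}{samples:>8}{pct:>8.2f}")
print(f"upward / downward (path length 1): {ratio:.2f}")
```

Pooling the raw counts weights each binary by its number of samples; averaging the three per-binary percentages instead would give different figures.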

Positives

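All positive paths (those referencing a string that contains the root function's name) were labelled. For each binary, the tables below give the number of positives per path type, their share of all positives in the binary (All positives share), and the positive-to-all conversion rate within the path type (Type sample share):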

Bomber game

  • 15335354 path samples
  • 501 positives
  • 15334853 negatives

| Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
|---|---|---|---|---|---|
| upward | 1 | 5321662 | 106 | 21.16 | 1.99E-3 |
| direct | 0 | 573797 | 353 | 70.46 | 6.15E-2 |
| downward | 1 | 1609203 | 27 | 5.39 | 1.68E-3 |
| downward | 2 | 7830692 | 15 | 2.99 | 1.92E-4 |

Diamond Engine

  • 178167 path samples
  • 53 positives
  • 178114 negatives

| Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
|---|---|---|---|---|---|
| upward | 1 | 126380 | 21 | 39.62 | 0.02 |
| direct | 0 | 5598 | 15 | 28.30 | 0.27 |
| downward | 1 | 12884 | 12 | 22.64 | 0.09 |
| downward | 2 | 33305 | 5 | 9.43 | 0.015 |

Nginx CLI

  • 20844 path samples
  • 10 positives
  • 20834 negatives

| Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
|---|---|---|---|---|---|
| upward | 1 | 15446 | 0 | 0.00 | 0.00 |
| direct | 0 | 1756 | 3 | 30.00 | 0.17 |
| downward | 1 | 1225 | 1 | 10.00 | 0.08 |
| downward | 2 | 2417 | 6 | 60.00 | 0.25 |

Xash3D

  • 97551 path samples
  • 294 positives
  • 97257 negatives

| Type | Path length | Samples | Positives | All positives share (%) | Type sample share (%) |
|---|---|---|---|---|---|
| upward | 1 | 48533 | 30 | 10.20 | 0.06 |
| direct | 0 | 5681 | 236 | 80.27 | 4.15 |
| downward | 1 | 12871 | 22 | 7.48 | 0.17 |
| downward | 2 | 30466 | 6 | 2.04 | 0.02 |
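
The two share columns of the tables above follow directly from the sample and positive counts. A minimal Python sketch, using the Xash3D rows as input; the function name `positive_shares` and the tuple layout are illustrative assumptions.

```python
def positive_shares(rows):
    """rows: (type, path length, samples, positives) tuples for one binary.

    Returns (type, path length, all positives share %, type sample share %).
    """
    total_positives = sum(positives for _, _, _, positives in rows)
    result = []
    for path_type, length, samples, positives in rows:
        # Share of this path type among all positives of the binary.
        all_share = 100.0 * positives / total_positives if total_positives else 0.0
        # Conversion rate: positives among all samples of this path type.
        type_share = 100.0 * positives / samples if samples else 0.0
        result.append((path_type, length, all_share, type_share))
    return result


# Xash3D counts copied from the table above.
XASH3D = [
    ("upward", 1, 48533, 30),
    ("direct", 0, 5681, 236),
    ("downward", 1, 12871, 22),
    ("downward", 2, 30466, 6),
]

xash_shares = positive_shares(XASH3D)
for path_type, length, all_share, type_share in xash_shares:
    print(f"{path_type:<9}{length:>2}{all_share:>8.2f}{type_share:>8.2f}")
```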

The results show that upward paths rarely convert to positives, even though they are the most common path type in all three mid-sized binaries. Given these results, and given that callbacks are ignored because their cross-references are of the data type, it is reasonable to limit the dataset to direct references and downward references of path length 1.

Path labelling

Steps

Positives

  1. Select positive tokens:
     SELECT * FROM tokens WHERE is_name = 1
  2. For every token, take its value and search for the correct function name(s) (methodology below) in the PDB JSON or in the IDB with the PDB applied (the latter is preferred).
  3. Retrieve the addresses of the function and of the token's parent string to complete the path query:
     SELECT * FROM paths WHERE func_addr = 4528400 AND string_addr = 5351792
  4. Take the returned path's id and label the path positive:
     UPDATE paths SET to_name = 1 WHERE id = 190995
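
The last two steps can be combined into a small helper. The sketch below uses Python's standard `sqlite3` module on an in-memory database and assumes the database is SQLite; the `CREATE TABLE` statement is a hypothetical minimal schema containing only the `paths` columns used in the queries above, and `label_positive` is an illustrative name. It labels every path between the given function and string, on the assumption that all such paths are positive.

```python
import sqlite3


def label_positive(conn: sqlite3.Connection, func_addr: int, string_addr: int):
    """Label all paths from func_addr to string_addr as positive; return their ids."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM paths WHERE func_addr = ? AND string_addr = ?",
            (func_addr, string_addr),
        )
    ]
    conn.executemany("UPDATE paths SET to_name = 1 WHERE id = ?", [(i,) for i in ids])
    conn.commit()
    return ids


# Throwaway database with a hypothetical minimal schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER,"
    " string_addr INTEGER, to_name INTEGER)"
)
conn.executemany(
    "INSERT INTO paths (id, func_addr, string_addr) VALUES (?, ?, ?)",
    [(190995, 4528400, 5351792), (190996, 4528400, 5351800)],
)

labelled_ids = label_positive(conn, 4528400, 5351792)
```

Parameterized queries (`?` placeholders) avoid pasting addresses into the SQL text by hand.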

Negatives

  1. Select unlabelled paths:
     SELECT * FROM paths WHERE to_name IS NULL
  2. Look up the referenced string in the IDB file, then look up the function's name:
     -- provide the function's offset
     SELECT * FROM pdb WHERE func_addr = 12345678
  3. If you are sure the string does not contain any function names, mark all paths to it as negatives:
     -- provide the string's offset
     UPDATE paths SET to_name = 0 WHERE string_addr = 5376670492

     If you are unsure about other references, only mark the reviewed path:

     -- provide path id
     UPDATE paths SET to_name = 0 WHERE id = 1234
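
When negatives are marked in bulk, it may be safer to touch only paths that are still unlabelled, so that an earlier positive label is never overwritten by mistake. The sketch below adds an `AND to_name IS NULL` guard for that purpose. The guard is a design choice of this example rather than part of the queries above, and the function names and the minimal schema are hypothetical.

```python
import sqlite3


def mark_string_negative(conn: sqlite3.Connection, string_addr: int) -> int:
    """Label every still-unlabelled path to string_addr as negative."""
    cur = conn.execute(
        "UPDATE paths SET to_name = 0 WHERE string_addr = ? AND to_name IS NULL",
        (string_addr,),
    )
    conn.commit()
    return cur.rowcount  # number of paths that were changed


def mark_path_negative(conn: sqlite3.Connection, path_id: int) -> int:
    """Label a single reviewed path as negative if it is still unlabelled."""
    cur = conn.execute(
        "UPDATE paths SET to_name = 0 WHERE id = ? AND to_name IS NULL",
        (path_id,),
    )
    conn.commit()
    return cur.rowcount


# Throwaway database with a hypothetical minimal schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER,"
    " string_addr INTEGER, to_name INTEGER)"
)
conn.executemany(
    "INSERT INTO paths VALUES (?, ?, ?, ?)",
    [(1, 100, 900, None), (2, 200, 900, 1), (3, 300, 901, None)],
)

changed = mark_string_negative(conn, 900)  # path 2 keeps its positive label
```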

Methodology

A function is:

  • a positive when it has at least 1 positive path
  • a negative if it has no positive paths
  • unlabelled if it has any NULL-labelled paths
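
The three rules can overlap, for example when a function has one positive path and several NULL-labelled ones. The sketch below assumes one possible precedence, positive first, then unlabelled, then negative. This ordering is an assumption made for the example and is not stated above. Path labels are taken to be 1, 0 or None, as in the `to_name` column.

```python
from typing import Iterable, Optional


def function_label(path_labels: Iterable[Optional[int]]) -> str:
    """Derive a function's label from the to_name values of its paths."""
    labels = list(path_labels)
    # Assumed precedence: a single positive path makes the function positive.
    if any(label == 1 for label in labels):
        return "positive"
    # Otherwise any path that has not been reviewed keeps the function unlabelled.
    if any(label is None for label in labels):
        return "unlabelled"
    # All paths were reviewed and none is positive.
    return "negative"


demo_labels = {
    "all reviewed, one hit": function_label([0, 1, 0]),
    "partly reviewed": function_label([0, None]),
    "all reviewed, no hit": function_label([0, 0]),
}
```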

A function-string path is labelled positive when its string includes a substring that is the full or partial original function name retrieved from the symbol file.

A function-token path is labelled positive when it ends at a token that is the full or partial original function name retrieved from the symbol file.

A token is labelled as a positive function name when:

  • the token matches the exact name retrieved from the PDB (full)
  • the token is a substring of the name retrieved from the PDB and is functionally equivalent to the original (partial)

Examples of the latter case:

| Token value | PDB name (demangled) |
|---|---|
| AreFileApisANSI | __acrt_AreFileApisANSI |
| InitializeCriticalSectionEx | __vcrt_InitializeCriticalSectionEx |
| AppPolicyGetProcessTerminationMethod | __acrt_AppPolicyGetProcessTerminationMethodInternal |
| acos | __CIacos |

Functions that meet the second requirement but lie beyond the maximum reference path length should be labelled as negatives.

Where a token name included namespaces, all occurrences of the function name without the namespace prefixes were checked.
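
A first-pass filter for full and partial matches can be sketched as below. It only finds candidates: whether a partial match is functionally equivalent to the original still has to be judged manually. The function names are illustrative, `ns::Widget::draw` is a made-up token, and treating `::` as the namespace separator is an assumption.

```python
from typing import Optional


def strip_namespaces(name: str) -> str:
    """Drop namespace or class prefixes, e.g. 'a::b::f' -> 'f'."""
    return name.rsplit("::", 1)[-1]


def match_candidate(token: str, pdb_name: str) -> Optional[str]:
    """Return 'full', 'partial' (needs manual review) or None."""
    if token == pdb_name:
        return "full"
    bare = strip_namespaces(token)
    # A partial candidate is a non-empty bare token contained in the PDB name.
    if bare and bare in pdb_name:
        return "partial"
    return None


candidates = [
    match_candidate("acos", "__CIacos"),
    match_candidate("AreFileApisANSI", "__acrt_AreFileApisANSI"),
    match_candidate("ns::Widget::draw", "draw"),  # hypothetical namespaced token
    match_candidate("acos", "strlen"),
]
```

A plain substring test is deliberately permissive. Short tokens will match many unrelated names, which is one reason the functional-equivalence check remains a manual step.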