Xref paths

michal-kapala edited this page Jun 22, 2023 · 11 revisions

Cross reference paths

The shares of the different function-to-string cross-reference path types were evaluated to better understand and balance the datasets.

For simplicity, mixed paths were not extracted; the measured cases are down-only and up-only paths. Only unique paths were extracted, that is, those characterized by a unique key of 4 values:

  • root function address
  • 1st path node function address
  • 2nd path node function address
  • target string address
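A minimal sketch of deduplicating paths by the 4-value key listed above. The `PathKey` type, the `unique_paths` helper, the use of `None` for an absent path node, and the sample addresses are all assumptions made for illustration; the source does not say how missing nodes are stored.

```python
from typing import NamedTuple, Optional


class PathKey(NamedTuple):
    """Hypothetical container for the 4 values that identify a path."""
    root_func: int          # root function address
    node1: Optional[int]    # 1st path node function address (None if absent)
    node2: Optional[int]    # 2nd path node function address (None if absent)
    string_addr: int        # target string address


def unique_paths(raw_paths):
    """Drop repeated paths while keeping first-seen order."""
    seen = set()
    result = []
    for path in raw_paths:
        key = PathKey(*path)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Made-up sample addresses, for illustration only.
raw = [
    (0x401000, None, None, 0x51A000),      # direct reference
    (0x401000, 0x402000, None, 0x51A000),  # one intermediate function
    (0x401000, 0x402000, None, 0x51A000),  # duplicate of the previous path
]
paths = unique_paths(raw)
```

Two paths that reach the same string from the same root through different intermediate functions have different keys, so both are kept.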

Three medium-sized binaries were evaluated to estimate the average share of the different cross-reference types within the extracted datasets.

Let the path length be the number of function-to-function edges in the cross-reference graph that form a path from the root function to a string.

For each binary and for each subroutine, the following paths were extracted, using experimental values for the path length parameter:

  • direct string references (path length = 0)
  • upward reference paths (path length = 1)
  • downward reference paths (path length <= 2)
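The extraction described above can be sketched as a bounded walk over a call graph. This is not the project's extractor. It assumes that a downward path follows calls from the root function to its callees and that an upward path goes from the root function to its callers, which the text does not spell out. The `extract_paths` function, the `calls` and `string_refs` dictionaries, and the toy graph are hypothetical.

```python
from collections import defaultdict


def extract_paths(calls, string_refs, max_down=2, max_up=1):
    """Return a set of (type, length, root, node1, node2, string) tuples.

    calls:       dict mapping a function to the set of functions it calls
    string_refs: dict mapping a function to the set of strings it references
    """
    # Invert the call graph so that upward walks can follow callers.
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)

    functions = set(calls) | set(callers) | set(string_refs)
    paths = set()

    def walk(root, edges, kind, max_len):
        # Walk in one direction only (no mixed paths) and never revisit
        # a function that is already on the current chain.
        frontier = [(root,)]
        for length in range(1, max_len + 1):
            next_frontier = []
            for chain in frontier:
                for nxt in edges.get(chain[-1], ()):
                    if nxt in chain:
                        continue
                    new_chain = chain + (nxt,)
                    next_frontier.append(new_chain)
                    # Pad the intermediate nodes to the 2 slots of the key.
                    nodes = new_chain[1:] + (None,) * (2 - length)
                    for s in string_refs.get(nxt, ()):
                        paths.add((kind, length, root, nodes[0], nodes[1], s))
            frontier = next_frontier

    for root in functions:
        for s in string_refs.get(root, ()):
            paths.add(("direct", 0, root, None, None, s))
        walk(root, calls, "downward", max_down)
        walk(root, callers, "upward", max_up)
    return paths


# Toy graph: A calls B, B calls C; each function references one string.
calls = {"A": {"B"}, "B": {"C"}}
string_refs = {"A": {"s0"}, "B": {"s1"}, "C": {"s2"}}
found = extract_paths(calls, string_refs)
```

Because the result is a set of full tuples, repeated paths collapse automatically, which matches the uniqueness requirement on the 4-value key.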

Type frequency

Diamond Engine

Diamond Engine binary evaluation:

  • 2717 decompiled subroutines
  • 4032 strings
  • 178167 unique function-string path samples

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 126380 | 70.93 |
| direct | 0 | 5598 | 3.14 |
| downward | 1 | 12884 | 7.23 |
| downward | 2 | 33305 | 18.69 |

Nginx CLI

Nginx for Visual Studio 2015 binary evaluation:

  • 2387 decompiled subroutines
  • 1894 strings
  • 20844 unique function-string path samples

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 15446 | 74.10 |
| direct | 0 | 1756 | 8.42 |
| downward | 1 | 1225 | 5.88 |
| downward | 2 | 2417 | 11.60 |

Xash3D

Xash3D engine binary evaluation:

  • 2380 decompiled subroutines
  • 7184 strings
  • 97551 unique function-string path samples

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 48533 | 49.75 |
| direct | 0 | 5681 | 5.82 |
| downward | 1 | 12871 | 13.19 |
| downward | 2 | 30466 | 31.23 |

Average

Average binary evaluation:

  • 296562 unique function-string path samples (total)

| Type | Path length | Samples | Share (%) |
|---|---|---|---|
| upward | 1 | 190359 | 64.19 |
| direct | 0 | 13035 | 4.40 |
| downward | 1 | 26980 | 9.10 |
| downward | 2 | 66188 | 22.32 |
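The average figures are pooled over all samples: the per-type counts of the three binaries are summed and divided by the 296562 total, rather than averaging the three per-binary percentages. A short sketch of that arithmetic, using the per-binary counts from the tables above; the `counts` layout and the `pooled_shares` helper are illustrative names only.

```python
# Per-binary sample counts, keyed by (type, path length).
counts = {
    "Diamond Engine": {("upward", 1): 126380, ("direct", 0): 5598,
                       ("downward", 1): 12884, ("downward", 2): 33305},
    "Nginx CLI":      {("upward", 1): 15446, ("direct", 0): 1756,
                       ("downward", 1): 1225, ("downward", 2): 2417},
    "Xash3D":         {("upward", 1): 48533, ("direct", 0): 5681,
                       ("downward", 1): 12871, ("downward", 2): 30466},
}


def pooled_shares(counts):
    """Sum counts per path type across binaries and compute pooled shares."""
    totals = {}
    for per_type in counts.values():
        for key, n in per_type.items():
            totals[key] = totals.get(key, 0) + n
    grand_total = sum(totals.values())
    return {key: (n, round(100 * n / grand_total, 2))
            for key, n in totals.items()}


shares = pooled_shares(counts)
for (kind, length), (n, pct) in shares.items():
    print(f"{kind:<9}{length:>2}{n:>8}{pct:>7.2f}")
```

Pooling weights each binary by its number of samples, so Diamond Engine, with the most paths, influences the average the most.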

The share of upward references varies between roughly 50% and 75% across the binaries. On average it is about 7 times the share of downward references of the same path length (64.19% versus 9.10%).

The experiment showed that upward references make up the majority of paths in an average binary. To evaluate their relevance for machine learning model predictions, positive conversion rates were calculated for each path type.

Positives

All positive paths (those referencing a string that contains the root function's name) were labelled. The positive-to-all path conversion rates are:

Diamond Engine

  • 178167 path samples
  • 53 positives
  • 178114 negatives
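
| Type | Path length | Samples | Positives | Share of all positives (%) | Positives within type (%) |
|---|---|---|---|---|---|
| upward | 1 | 126380 | 21 | 39.62 | 0.02 |
| direct | 0 | 5598 | 15 | 28.30 | 0.27 |
| downward | 1 | 12884 | 12 | 22.64 | 0.09 |
| downward | 2 | 33305 | 5 | 9.43 | 0.015 |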

Nginx CLI

  • 20844 path samples
  • 10 positives
  • 20834 negatives
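
| Type | Path length | Samples | Positives | Share of all positives (%) | Positives within type (%) |
|---|---|---|---|---|---|
| upward | 1 | 15446 | 0 | 0.00 | 0.00 |
| direct | 0 | 1756 | 3 | 30.00 | 0.17 |
| downward | 1 | 1225 | 1 | 10.00 | 0.08 |
| downward | 2 | 2417 | 6 | 60.00 | 0.25 |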

Xash3D

  • 97551 path samples
  • 294 positives
  • 97257 negatives
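
| Type | Path length | Samples | Positives | Share of all positives (%) | Positives within type (%) |
|---|---|---|---|---|---|
| upward | 1 | 48533 | 30 | 10.20 | 0.06 |
| direct | 0 | 5681 | 236 | 80.27 | 4.15 |
| downward | 1 | 12871 | 22 | 7.48 | 0.17 |
| downward | 2 | 30466 | 6 | 2.04 | 0.02 |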
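The two percentage columns in the tables above can be recomputed from the sample and positive counts. A small sketch using the Xash3D figures; `conversion_rates` is a hypothetical helper, not part of the project.

```python
def conversion_rates(rows):
    """For each (type, length, samples, positives) row, return the share of
    all positives and the positive rate within the type, both in percent."""
    total_pos = sum(pos for _, _, _, pos in rows)
    out = []
    for kind, length, samples, pos in rows:
        share_of_pos = 100 * pos / total_pos if total_pos else 0.0
        rate_in_type = 100 * pos / samples if samples else 0.0
        out.append((kind, length, round(share_of_pos, 2), round(rate_in_type, 2)))
    return out


# Xash3D counts from the table above.
xash3d = [
    ("upward", 1, 48533, 30),
    ("direct", 0, 5681, 236),
    ("downward", 1, 12871, 22),
    ("downward", 2, 30466, 6),
]
rates = conversion_rates(xash3d)
for kind, length, share, rate in rates:
    print(f"{kind:<9}{length:>2}{share:>8.2f}{rate:>7.2f}")
```

The first percentage shows where the positives come from, while the second shows how likely a path of a given type is to be positive; a type can dominate the dataset and still have a very low rate.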

The results show that positives are rare among upward references, even though upward references are consistently the most common path type. Given this, and given that callbacks are ignored because their cross-reference type is data, it is reasonable to limit the dataset to direct references and downward references of path length 1.

Path labelling

Steps

To label paths (the to_name column):

  1. Select the positive tokens:

    SELECT * FROM tokens WHERE is_name = 1

  2. For every token, take its value and search for the correct function name(s) (see Methodology below) in the PDB JSON or in an IDB with the PDB applied (the latter is preferred).

  3. Retrieve the addresses of the function and of the token's parent string to complete the path query:

    SELECT * FROM paths WHERE func_addr = 4528400 AND string_addr = 5351792

  4. Take the returned path's id and label the path as positive:

    UPDATE paths SET to_name = 1 WHERE id = 190995
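The last two steps can be scripted once the function address has been found manually. Below is a minimal sketch with Python's built-in `sqlite3` module, assuming the dataset is an SQLite database. The in-memory schema is a guess that contains only the columns named in the queries above, and the `label_path` helper is hypothetical. The addresses and the path id are reused from the example queries; the second path row is made up.

```python
import sqlite3

# Hypothetical, reduced schema: only columns that appear in the queries above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE paths (
        id          INTEGER PRIMARY KEY,
        func_addr   INTEGER,
        string_addr INTEGER,
        to_name     INTEGER DEFAULT 0
    );
    INSERT INTO paths VALUES (190995, 4528400, 5351792, 0);
    INSERT INTO paths VALUES (190996, 4528400, 5351800, 0);
""")


def label_path(conn, func_addr, string_addr):
    """Mark every path from func_addr to string_addr as positive.

    Returns the ids of the labelled paths so they can be reviewed.
    """
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM paths WHERE func_addr = ? AND string_addr = ?",
        (func_addr, string_addr),
    )]
    conn.executemany(
        "UPDATE paths SET to_name = 1 WHERE id = ?",
        [(path_id,) for path_id in ids],
    )
    conn.commit()
    return ids


labelled = label_path(conn, 4528400, 5351792)
```

Parameterized queries avoid pasting addresses into SQL strings by hand, and returning the ids keeps the manual check of each labelled path possible.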

Methodology

A function name is labelled positive when either of the following holds:

  • the positive token name matches the exact name retrieved from PDB
  • the positive token name is a substring of the name retrieved from PDB and is functionally equivalent to the original

Examples for the latter case:

| Token value | PDB name (demangled) |
|---|---|
| AreFileApisANSI | __acrt_AreFileApisANSI |
| InitializeCriticalSectionEx | __vcrt_InitializeCriticalSectionEx |
| AppPolicyGetProcessTerminationMethod | __acrt_AppPolicyGetProcessTerminationMethodInternal |
| acos | __CIacos |

Functions that meet the second requirement but lie beyond the maximum reference path length are labelled as negatives.

In cases where the token name included namespaces, all occurrences of the function name without namespace prefixes were checked.
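The textual part of the matching rules can be sketched as a candidate filter. This is only an aid: whether a substring match is functionally equivalent to the original still has to be judged manually, and the path length limit is not checked here. The `match_kind` and `strip_namespaces` helpers are hypothetical, and the `::` separator assumes C++-style namespaces.

```python
def strip_namespaces(name):
    """Return the last component of a '::'-qualified name."""
    return name.rsplit("::", 1)[-1]


def match_kind(token, pdb_name):
    """Classify a token against a demangled PDB name.

    Returns 'exact', 'substring' (needs manual review), or None.
    The token is tried both as given and with namespace prefixes removed.
    """
    candidates = {token, strip_namespaces(token)}
    if pdb_name in candidates:
        return "exact"
    if any(c and c in pdb_name for c in candidates):
        return "substring"
    return None


examples = [
    ("AreFileApisANSI", "__acrt_AreFileApisANSI"),
    ("acos", "__CIacos"),
    ("acos", "acos"),
    ("acos", "sqrt"),
]
results = [match_kind(token, name) for token, name in examples]
```

A `substring` result only flags a candidate; a short token such as `acos` would also match unrelated names that happen to contain it, which is why the equivalence check stays manual.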
