GraphSense TagPacks

mdragaschnig edited this page Nov 7, 2023 · 13 revisions

A GraphSense TagPack is a data structure for packaging and sharing attribution tags in an interoperable, machine-processable format.

What is an attribution tag?

An attribution tag associates a cryptoasset address with any form of context information. The following example attributes a Bitcoin address (1Archive1n2C579dMsAu3iC6tWzuQJz8dN) to the Internet Archive, which, according to the listed source, controls that address:

label: Internet Archive
address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
source: https://archive.org/donate/cryptocurrency/

What is a TagPack?

Attribution tags are typically created by someone (the creator), who can be an individual or an organization. To share tags with others, they can be packaged into a so-called TagPack, which is represented as a YAML file and can easily be created by hand or exported automatically from any system.

Here is a minimal TagPack example containing two attribution tags with mandatory properties:

title: First Address Tag Example
creator: John Doe
tags:
    - address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
      label: Internet Archive
      source: https://archive.org/donate/cryptocurrency/
      lastmod: 2019-03-15
      currency: BTC
    - address: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2
      label: Example
      source: https://example.com
      lastmod: 2019-03-15
      currency: BTC

TagPacks can be stored and managed using any public or private storage system. We recommend using a Git service, because it provides version control and fine-grained recording of modifications, and thus full provenance. This is important for safeguarding the evidential value of forensic cryptocurrency investigations.

TagPacks can be shared among users and forensic tool providers using any communication channel (e.g. via email).

Here is a collection of public, collaboratively collected TagPacks.

Why are attribution tags important?

Cryptoasset analytics relies on two complementary techniques: address clustering, which uses heuristics to group multiple addresses into maximal subsets that can likely be assigned to the same real-world actor, and attribution tags as shown above. The strength lies in the combination of these techniques: a tag attributed to a single address belonging to a larger cluster can easily add contextual information to hundreds of thousands of cryptocurrency addresses.

Note: certain types of transactions (e.g., CoinJoins, Mixing Services) can distort clustering results and lead to false, unreliable, or intentionally misplaced attribution tags that could associate unrelated actors with a given cluster.

TagPack properties

A TagPack defines a header with several mandatory and optional fields and a body containing a list of tags. In the above example, title and creator are part of the TagPack header; the list of tags represents the body.

The allowed properties are defined in the TagPack schema, which specifies the mandatory and optional fields of the TagPack header and body. In the above example, the tag properties label, address, source, and currency are mandatory. The source property in particular records where a piece of information comes from (either in the form of a URI or a textual description).

The current TagPack schema is available here and looks as follows:

header:
  title:
    type: text
    mandatory: true
  creator:
    type: text
    mandatory: true
  description:
    type: text
    mandatory: false
  tags:
    type: list
    mandatory: true
tag:
  label:
    type: text
    mandatory: true
  source:
    type: text
    mandatory: true
  currency:
    type: text
    mandatory: true
  context:
    type: text
    mandatory: false
  confidence:
    type: text
    mandatory: false
  is_cluster_definer:
    type: boolean
    mandatory: false
  lastmod:
    type: datetime
    mandatory: false
  category:
    type: text
    mandatory: false
    taxonomy: entity
  abuse:
    type: text
    mandatory: false
    taxonomy: abuse
  address:
    type: text
    mandatory: true
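
As a rough illustration of how the mandatory flags in this schema could be checked, the following Python sketch inspects a TagPack that has already been parsed into a dictionary. The two field sets are copied from the schema above; the function name missing_mandatory_fields is hypothetical and is not part of the tagpack tool. Body fields placed in the header are treated as available to every tag, as described under property inheritance.

```python
# Mandatory fields copied from the schema listing above.
MANDATORY_HEADER = {"title", "creator", "tags"}
MANDATORY_TAG = {"label", "source", "currency", "address"}


def missing_mandatory_fields(tagpack):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for field in sorted(MANDATORY_HEADER - set(tagpack)):
        problems.append(f"header: missing {field}")
    # Body fields placed in the header are inherited by every tag.
    inherited = {k: v for k, v in tagpack.items() if k != "tags"}
    for index, tag in enumerate(tagpack.get("tags", [])):
        effective = {**inherited, **tag}
        for field in sorted(MANDATORY_TAG - set(effective)):
            problems.append(f"tag {index}: missing {field}")
    return problems


tagpack = {
    "title": "First Address Tag Example",
    "creator": "John Doe",
    "tags": [
        {
            "address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN",
            "label": "Internet Archive",
            "source": "https://archive.org/donate/cryptocurrency/",
            "currency": "BTC",
        },
        # Deliberately incomplete: source and currency are missing.
        {"address": "1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2", "label": "Example"},
    ],
}

for problem in missing_mandatory_fields(tagpack):
    print(problem)
```

The sketch only checks for the presence of fields; it does not validate types or taxonomy concepts.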

The source should provide a backlink to the web source from which the tag information originates, e.g. https://archive.org/donate/cryptocurrency/. If no backlink can be provided (e.g. for private TagPacks), the field can contain other informative text (e.g., Manual transaction).

For currency, use the corresponding currency codes such as: BCH, BTC, ETH, LTC, ZEC.

Confidence values are described in the Confidence score section below.

The context property contains any additional information associated with the tag, in JSON format, e.g.

context: '{"count": 42, "verified_by": "operator", "flags": ["scam", "misuse"]}'
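
Because the context value is a JSON document stored as a string, a consumer has to decode it before use. A minimal sketch using Python's standard json module follows; the helper name parse_context is hypothetical, and returning an empty dictionary for tags without context is an assumption made for convenience.

```python
import json


def parse_context(tag):
    """Decode the JSON string stored in a tag's context field, if present."""
    raw = tag.get("context")
    if raw is None:
        return {}
    # json.loads raises json.JSONDecodeError if the string is not valid JSON.
    return json.loads(raw)


tag = {
    "address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN",
    "context": '{"count": 42, "verified_by": "operator", "flags": ["scam", "misuse"]}',
}

context = parse_context(tag)
print(context["count"])        # 42
print(context["verified_by"])  # operator
print(context["flags"])        # ['scam', 'misuse']
```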

Property inheritance

In the above example, the same lastmod and currency values are repeated for both tags, which is unnecessary.

To avoid repeating field values shared by all tags, one can add body fields to the header, and they will automatically be applied to all tags in the body. In other words, the shared fields are abstracted into the header and then inherited by all body elements.

Here is an example that abstracts the currency and lastmod fields into the header and avoids unnecessary repetition of values:

title: Second TagPack Example
creator: John Doe
lastmod: 2019-03-15
currency: BTC
tags:
    - label: Internet Archive
      address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
      source: https://archive.org/donate/cryptocurrency/
    - label: Example
      address: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2
      source: https://example.com
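
For exporters that generate TagPacks automatically, the abstraction step itself can be automated. The following Python sketch finds the fields that carry the same value in every tag and moves them into a shared dictionary that can be written to the header. The function name hoist_shared_fields and the never_hoist parameter are hypothetical; the tags are those of the first example, already parsed into dictionaries.

```python
def hoist_shared_fields(tags, never_hoist=("address",)):
    """Return (shared, remaining): fields with one common value move to shared."""
    if not tags:
        return {}, []
    shared = {}
    for key, value in tags[0].items():
        if key in never_hoist:
            continue
        # A field is shared only if every tag has it with the same value.
        if all(key in tag and tag[key] == value for tag in tags):
            shared[key] = value
    remaining = [
        {k: v for k, v in tag.items() if k not in shared} for tag in tags
    ]
    return shared, remaining


tags = [
    {
        "address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN",
        "label": "Internet Archive",
        "source": "https://archive.org/donate/cryptocurrency/",
        "lastmod": "2019-03-15",
        "currency": "BTC",
    },
    {
        "address": "1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2",
        "label": "Example",
        "source": "https://example.com",
        "lastmod": "2019-03-15",
        "currency": "BTC",
    },
]

shared, remaining = hoist_shared_fields(tags)
print(shared)     # the lastmod and currency values common to both tags
print(remaining)  # each tag keeps only address, label, and source
```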

Property override

It is also possible to override abstracted fields in the body. This is relevant, for example, if someone creates a TagPack comprising several tags and adds further tags later on, which then have different lastmod values.

The following example shows several tags associating addresses from various cryptocurrencies with the label Internet Archive. Most of them were collected at the same time (2019-03-15), except the Zcash tag, which was collected and added later (2019-04-16).

title: Third TagPack Example
creator: John Doe
description: A collection of tags commonly used for demonstrating GraphSense features
lastmod: 2019-03-15
label: Internet Archive
source: https://archive.org/donate/cryptocurrency
category: organization
tags:
    - address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
      currency: BTC
    - address: 1K1rgZ1dz9w7dsR1HGS1drmzfUHMtqx1Tc
      currency: BCH
    - address: "0xFA8E3920daF271daB92Be9B87d9998DDd94FEF08"
      currency: ETH
    - address: rGeyCsqc6vKXuyTGF39WJxmTRemoV3c97h
      currency: XRP
    - address: t1ZmpK4QFcvyQZ3ghTgSboBW8b4HgiZHQF9
      currency: ZEC
      lastmod: 2019-04-16
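
The inheritance and override rules amount to a dictionary merge in which tag-level values take precedence over header-level values. The following Python sketch shows this for an abbreviated version of the example above, already parsed into a dictionary. The function name resolve_tags is hypothetical, and the set of header-only fields is taken from the schema; this is not the tagpack tool's own implementation.

```python
# Header fields that describe the TagPack itself and are not passed on to tags.
HEADER_ONLY = {"title", "creator", "description", "tags"}


def resolve_tags(tagpack):
    """Return one fully resolved dict per tag; tag values win over header values."""
    inherited = {k: v for k, v in tagpack.items() if k not in HEADER_ONLY}
    return [{**inherited, **tag} for tag in tagpack["tags"]]


tagpack = {
    "title": "Third TagPack Example",
    "creator": "John Doe",
    "lastmod": "2019-03-15",
    "label": "Internet Archive",
    "source": "https://archive.org/donate/cryptocurrency",
    "category": "organization",
    "tags": [
        {"address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN", "currency": "BTC"},
        {
            "address": "t1ZmpK4QFcvyQZ3ghTgSboBW8b4HgiZHQF9",
            "currency": "ZEC",
            "lastmod": "2019-04-16",  # overrides the header value
        },
    ],
}

for tag in resolve_tags(tagpack):
    print(tag["currency"], tag["label"], tag["lastmod"])
```

The BTC tag inherits the header's lastmod, while the ZEC tag keeps its own value.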

Identification and Uniqueness of TagPacks and Tags

TagPacks are uniquely identified by a URI, which can be resolvable.

If TagPack files are maintained in a Git repository, they can be uniquely identified by their Git URI (e.g., https://github.com/graphsense/graphsense-tagpacks/blob/master/packs/demo.yaml).

Within a TagPack, tags are treated as first-class objects that are identified by the combination of the mandatory body fields address, label, source.

That implies that the same label (e.g., Internet Archive) can be assigned several times to the same address (e.g., 1Archive1n2C579dMsAu3iC6tWzuQJz8dN), typically by different parties.
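
The identity rule can be expressed as a composite key. The following Python sketch builds that key from the three identifying fields and uses it to drop exact duplicates within one TagPack. The function names are hypothetical, and keeping the first occurrence of a duplicate is an assumption made for illustration.

```python
def tag_key(tag):
    """Identity of a tag within a TagPack: address, label, and source."""
    return (tag["address"], tag["label"], tag["source"])


def deduplicate(tags):
    """Keep the first tag seen for each identity key, preserving order."""
    seen = {}
    for tag in tags:
        seen.setdefault(tag_key(tag), tag)
    return list(seen.values())


tags = [
    {
        "address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN",
        "label": "Internet Archive",
        "source": "https://archive.org/donate/cryptocurrency/",
    },
    # Same address, label, and source: treated as the same tag.
    {
        "address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN",
        "label": "Internet Archive",
        "source": "https://archive.org/donate/cryptocurrency/",
    },
    # Same address and label, different source: a distinct tag.
    {
        "address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN",
        "label": "Internet Archive",
        "source": "Manual transaction",
    },
]

unique = deduplicate(tags)
print(len(unique))  # 2
```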

Using Concepts from Public Taxonomies

The use of common terminology is essential for data sharing and establishing interoperability across tools. Therefore, the TagPack schema defines two properties that take concepts from agreed-upon taxonomies as values: category, which draws on the entity taxonomy, and abuse, which draws on the abuse taxonomy.

Both category and abuse properties are optional, and can be also combined, e.g.

category: organization
abuse: extremism 

The "narrower" (more specific) concept should always be preferred over the "broader" (more abstract) concept in the taxonomy hierarchy, e.g. using ico_wallet should be preferred over wallet_service.

Confidence score

Attribution tags originate from different sources, which warrant different levels of confidence. For example, a Bitcoin address retrieved via a web crawl is less trustworthy than a Bitcoin address with proven private key ownership.

The TagPack creator can choose an id from the list of confidence score ids available here, e.g.

confidence: ownership

Mapping attribution tags to UTXO clusters

An attribution tag assigned to a single address may or may not be applicable to the entire address cluster. To signal this applicability, the field is_cluster_definer must be set to true or false, respectively. GraphSense selects all address tags with is_cluster_definer: true as candidates for becoming the cluster's tag.

Whether a tag is applicable to the cluster depends very much on the context of the TagPack and the cluster, and must be decided on an individual basis, i.e. after manually inspecting cluster characteristics.

Thus the recommended approach for TagPack creators is:

  • always set is_cluster_definer to false, unless you are experienced and have done the necessary cluster analysis

Up to tagpack-tool release 23.01, the default value of is_cluster_definer was NULL, which was confusing. As of release 23.03, the default value is false.

How are cluster mappings resolved?

As a rule-of-thumb, the following aspects are considered when resolving address attribution tags for a cluster:

  1. If an address maps to a cluster that already carries a tag with higher confidence, then the address tag does not define the cluster.

  2. If an address maps to a cluster of size 1, then the address tag also defines the entity.

  3. If an address is not a service address (e.g. exchange) and maps to a large unknown cluster, then it might be some form of custodial wallet, i.e. is_cluster_definer can be set to true.

In many cases, one may end up with several different address tags that all map to the same cluster and carry the flag is_cluster_definer: true. Imagine, for example, different user wallets hosted by an exchange and tagged as ransomware or extortion. In such cases, the tag with the highest confidence value is selected. In case of conflicts, address tag mappings must be revised as part of an attribution tag curation workflow.
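
The selection rule can be sketched in a few lines of Python. The numeric confidence levels below are hypothetical placeholders chosen only so that the ids can be compared; the real levels are defined in the list of confidence score ids. The function name pick_cluster_tag is likewise hypothetical, and treating equally confident candidates with different labels as a conflict is one possible reading of the text, not the documented behaviour of GraphSense.

```python
# Hypothetical numeric levels; the actual values come from the confidence list.
HYPOTHETICAL_LEVELS = {"ownership": 100, "forensic": 50, "web_crawl": 20}


def pick_cluster_tag(address_tags, levels=HYPOTHETICAL_LEVELS):
    """Return the cluster-defining candidate with the highest confidence, or None."""
    candidates = [t for t in address_tags if t.get("is_cluster_definer") is True]
    if not candidates:
        return None

    def level(tag):
        # Unknown or missing confidence ids rank lowest.
        return levels.get(tag.get("confidence"), 0)

    best = max(level(t) for t in candidates)
    winners = [t for t in candidates if level(t) == best]
    if len({t["label"] for t in winners}) > 1:
        raise ValueError("conflicting cluster definers; manual curation required")
    return winners[0]


address_tags = [
    {"label": "Some Exchange", "confidence": "ownership", "is_cluster_definer": True},
    {"label": "ransomware", "confidence": "web_crawl", "is_cluster_definer": True},
    {"label": "Other", "confidence": "ownership", "is_cluster_definer": False},
]

print(pick_cluster_tag(address_tags)["label"])  # Some Exchange
```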

Organizing large TagPack collections

Large TagPack files

TagPacks are based on the YAML data-serialisation format. YAML has advantages over other data formats, such as well-defined data-type handling and good readability. It also has downsides: parsing large YAML files can be very slow and can even exhaust the available memory. To avoid this, we recommend keeping files small (< 200 MB) by partitioning them, e.g. by actor or collection time.
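
One way to partition a TagPack is to group its tags by a key and emit one smaller TagPack per group, each with a copy of the header. The Python sketch below groups by collection month, derived from lastmod. The function name partition_tags is hypothetical, the TagPack is assumed to be parsed into a dictionary, and every tag is assumed to carry its own lastmod string.

```python
from collections import defaultdict


def partition_tags(tagpack, key):
    """Split a TagPack into several TagPacks, one per value returned by key(tag)."""
    groups = defaultdict(list)
    for tag in tagpack["tags"]:
        groups[key(tag)].append(tag)
    header = {k: v for k, v in tagpack.items() if k != "tags"}
    # Every partition repeats the header and holds only its own tags.
    return {name: {**header, "tags": tags} for name, tags in groups.items()}


tagpack = {
    "title": "Partition Example",
    "creator": "John Doe",
    "label": "Internet Archive",
    "source": "https://archive.org/donate/cryptocurrency",
    "tags": [
        {"address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN", "currency": "BTC", "lastmod": "2019-03-15"},
        {"address": "1K1rgZ1dz9w7dsR1HGS1drmzfUHMtqx1Tc", "currency": "BCH", "lastmod": "2019-03-15"},
        {"address": "t1ZmpK4QFcvyQZ3ghTgSboBW8b4HgiZHQF9", "currency": "ZEC", "lastmod": "2019-04-16"},
    ],
}

# Group by the year and month part of lastmod, e.g. "2019-03".
parts = partition_tags(tagpack, key=lambda tag: tag["lastmod"][:7])
for month, part in sorted(parts.items()):
    print(month, len(part["tags"]))
```

Partitioning by actor works the same way with a different key function, for example one that returns the tag's label.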

File Includes

TagPacks are often created and exported automatically, for instance, by some cron-triggered script. To avoid data duplication, it is possible to outsource common header fields into a separate file, which is included by individual TagPack files. Here is a simple example directory structure:

home
    user
        tagpack_provider
            header.yaml
            2021
                01
                    tp_20200101.yaml
                    ..
                02
                    tp_20200201.yaml

The header.yaml file contains all common fields:

title: BadHack TagPack
creator: GraphSense Team
description: Addresses used for BadHack
confidence: forensic
abuse: scam
currency: BTC
label: BadHack

and each tagpack file (e.g. tp_20200101.yaml) includes the header file:

header: !include header.yaml
tags:
     - address: bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh
       context: '{"validated": true}'

The directory structure can be arbitrarily deep. The syntax is always the same:

header: !include header.yaml

Header file location resolution

The tagpack tool resolves the header file as follows: starting from the directory given on the command line, it traverses the parent directories, i.e. upwards, until the header.yaml is found.

For the directory structure example given above, starting with /home/user/tagpack_provider/2021/01, the header file is detected in /home/user/tagpack_provider/.
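
The upward search can be sketched with Python's pathlib as follows. This is an illustration of the described behaviour, not the tagpack tool's actual code; the function name find_header is hypothetical, and checking the start directory itself before its parents is an assumption.

```python
import tempfile
from pathlib import Path


def find_header(start_dir, name="header.yaml"):
    """Search start_dir and then each parent directory for name; None if absent."""
    start = Path(start_dir).resolve()
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


# Recreate part of the example directory layout in a temporary location.
with tempfile.TemporaryDirectory() as root:
    provider = Path(root) / "tagpack_provider"
    (provider / "2021" / "01").mkdir(parents=True)
    (provider / "header.yaml").write_text("title: BadHack TagPack\n")

    found = find_header(provider / "2021" / "01")
    print(found.parent.name, found.name)  # tagpack_provider header.yaml
```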

TagPack Repository Configuration

Versioning with git

TagPacks are stored in a Git repository - a so-called TagPack Repository.

When inserting tagpacks into a TagStore database, this git repository information will be used to provide the backlink to the remote repository, e.g.:

tagstore=# select id,uri from tagpack ;
               id                |                                                                   uri
---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------
 testpacks:20210101.yaml         | https://github.com/graphsense/graphsense-tagpack-tool/tree/develop/tests/testfiles/yaml_inclusion/2021/01/20210101.yaml
 testpacks:20210106-special.yaml | https://github.com/graphsense/graphsense-tagpack-tool/tree/develop/tests/testfiles/yaml_inclusion/2021/01/special/20210106-special.yaml

Currently, only a single remote is supported.

Taxonomy configuration

For most use cases it is recommended to rely on the standard taxonomies included in the GraphSense tagpack tool.

If required, however, a TagPack repository can include a custom file called config.yaml, which defines explicit pointers to the taxonomies the TagPacks are built upon, e.g.

taxonomies:
  entity: https://graphsense.github.io/DW-VA-Taxonomy/assets/data/entities.csv
  abuse: https://graphsense.github.io/DW-VA-Taxonomy/assets/data/abuses.csv