GraphSense TagPacks
A GraphSense TagPack is a data structure for packaging and sharing attribution tags in an interoperable, machine-processable format.
An attribution tag associates a cryptoasset address with any form of context information. The following example attributes a Bitcoin address (1Archive1n2C579dMsAu3iC6tWzuQJz8dN) to the Internet Archive, which according to this source controls that address:
label: Internet Archive
address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
source: https://archive.org/donate/cryptocurrency/
Attribution tags are typically created by someone (the creator), who can be an individual or an organization. To share tags with others, tags can be packaged into a so-called TagPack, which is represented as a YAML file and can easily be created by hand or exported automatically from any system.
Here is a minimal TagPack example containing two attribution tags with mandatory properties:
title: First Address Tag Example
creator: John Doe
tags:
  - address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
    label: Internet Archive
    source: https://archive.org/donate/cryptocurrency/
    lastmod: 2019-03-15
    currency: BTC
  - address: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2
    label: Example
    source: https://example.com
    lastmod: 2019-03-15
    currency: BTC
TagPacks can be stored and managed using any public or private storage system. It is recommended to use a Git service, because this enables version control and fine-grained recording of modifications, and thus full provenance. This is important for safeguarding the evidential value of forensic cryptocurrency investigations.
TagPacks can be shared among users and forensic tool providers using any communication channel (e.g. via email).
Here is a collection of public, collaboratively collected TagPacks.
Cryptoasset analytics relies on two complementary techniques: address clustering, which uses heuristics to group multiple addresses into maximal subsets that can likely be assigned to the same real-world actor, and attribution tags as shown above. The strength lies in the combination of these techniques: a tag attributed to a single address belonging to a larger cluster can easily add contextual information to hundreds of thousands of cryptocurrency addresses.
Note: certain types of transactions (e.g., CoinJoins, Mixing Services) can distort clustering results and lead to false, unreliable, or intentionally misplaced attribution tags that could associate unrelated actors with a given cluster.
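The effect of combining both techniques can be illustrated with a small Python sketch. The cluster ids and the placeholder addresses addr_a to addr_d are hypothetical, and real clustering output would come from an analytics tool rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical clustering result: address -> cluster id.
address_to_cluster = {"addr_a": 1, "addr_b": 1, "addr_c": 1, "addr_d": 2}

# A single attribution tag on one address of cluster 1.
tags = [{"address": "addr_a", "label": "Internet Archive"}]


def labels_per_address(address_to_cluster, tags):
    """Propagate each tag label to all addresses in the tagged cluster."""
    cluster_labels = defaultdict(set)
    for tag in tags:
        cluster_id = address_to_cluster.get(tag["address"])
        if cluster_id is not None:
            cluster_labels[cluster_id].add(tag["label"])
    return {
        address: sorted(cluster_labels.get(cluster_id, set()))
        for address, cluster_id in address_to_cluster.items()
    }


propagated = labels_per_address(address_to_cluster, tags)
print(propagated["addr_c"])  # ['Internet Archive']
print(propagated["addr_d"])  # []
```

One tag on addr_a reaches every address of its cluster, which is also why a clustering error would spread a wrong label just as widely.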
A TagPack defines a header with several mandatory and optional fields and a body containing a list of tags. In the above example, title and creator are part of the TagPack header; the list of tags represents the body.
Please note that allowed properties are defined in the TagPack schema, which defines mandatory and optional fields for the TagPack header and body. In the above example, label, address, and source are mandatory properties as they describe where a certain piece of information is coming from (either in the form of a URI or a textual description).
The current TagPack schema is available here and looks as follows:
header:
  title:
    type: text
    mandatory: true
  creator:
    type: text
    mandatory: true
  description:
    type: text
    mandatory: false
  tags:
    type: list
    mandatory: true
tag:
  label:
    type: text
    mandatory: true
  source:
    type: text
    mandatory: true
  currency:
    type: text
    mandatory: true
  context:
    type: text
    mandatory: false
  confidence:
    type: text
    mandatory: false
  is_cluster_definer:
    type: boolean
    mandatory: false
  lastmod:
    type: datetime
    mandatory: false
  category:
    type: text
    mandatory: false
    taxonomy: entity
  abuse:
    type: text
    mandatory: false
    taxonomy: abuse
  address:
    type: text
    mandatory: true
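As an illustration of how the mandatory flags in the schema might be checked, here is a minimal Python sketch that works on a TagPack already loaded into a plain dictionary. The function name find_missing_fields is hypothetical and this is not the validation code of the tagpack tool. A tag field also counts as present when it is set at the top level of the TagPack, an assumption based on the header inheritance this document describes.

```python
# Mandatory fields copied from the schema shown above.
MANDATORY_HEADER_FIELDS = ("title", "creator", "tags")
MANDATORY_TAG_FIELDS = ("label", "source", "currency", "address")


def find_missing_fields(tagpack):
    """Return one message per missing mandatory field (hypothetical helper)."""
    problems = []
    for field in MANDATORY_HEADER_FIELDS:
        if field not in tagpack:
            problems.append(f"header: missing '{field}'")
    for index, tag in enumerate(tagpack.get("tags", [])):
        for field in MANDATORY_TAG_FIELDS:
            # Assumption: a value set in the header is inherited by the tag.
            if field not in tag and field not in tagpack:
                problems.append(f"tag {index}: missing '{field}'")
    return problems


example = {
    "title": "First Address Tag Example",
    "creator": "John Doe",
    "tags": [
        {
            "address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN",
            "label": "Internet Archive",
            "source": "https://archive.org/donate/cryptocurrency/",
            "currency": "BTC",
        },
        # Deliberately incomplete tag: source and currency are missing.
        {"address": "1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2", "label": "Example"},
    ],
}

print(find_missing_fields(example))
# ["tag 1: missing 'source'", "tag 1: missing 'currency'"]
```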
The source should provide a backlink to the web source where the tag information originates, e.g. https://archive.org/donate/cryptocurrency/. If no backlink is to be provided (e.g. for private TagPacks), the field can contain other informative text (e.g., Manual transaction).
For currency, use the corresponding currency codes such as: BCH, BTC, ETH, LTC, ZEC.
For confidence values, read on below.
The context property contains any additional information associated with the tag, in JSON format, e.g.
context: '{"count": 42, "verified_by": "operator", "flags": ["scam", "misuse"]}'
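Because the context value is quoted, YAML treats it as a plain string, so a consumer has to decode the JSON separately. A minimal sketch using Python's standard json module, with the tag represented as a dictionary for illustration:

```python
import json

# The tag as it might look after loading the YAML: context is still a string.
tag = {
    "label": "Example",
    "context": '{"count": 42, "verified_by": "operator", "flags": ["scam", "misuse"]}',
}

# Decode the JSON string into regular Python data structures.
context = json.loads(tag["context"])

print(context["count"])  # 42
print(context["flags"])  # ['scam', 'misuse']
```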
In the first TagPack example above, the same lastmod and currency property values are repeated for both tags, which is an unnecessary repetition of the same information.
To avoid repeating field values shared by all tags, one can add body fields to the header, and they will automatically be applied to all tags in the body. Thus, they are abstracted into the header and then inherited by all body elements.
Here is an example that abstracts the currency and lastmod fields into the header and avoids unnecessary repetition of values:
title: Second TagPack Example
creator: John Doe
lastmod: 2019-03-15
currency: BTC
tags:
  - label: Internet Archive
    address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
    source: https://archive.org/donate/cryptocurrency/
  - label: Example
    address: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2
    source: https://example.com
It is also possible to override abstracted fields in the body. This could be relevant if someone creates a TagPack comprising several tags and then adds additional tags later on, which then, of course, have different lastmod property values.
The following example shows several tags associating addresses from various cryptocurrencies with the label Internet Archive. Most of them were collected at the same time (2019-03-15), except the Zcash tag, which was collected and added later and therefore overrides lastmod with its own value.
title: Third TagPack Example
creator: John Doe
description: A collection of tags commonly used for demonstrating GraphSense features
lastmod: 2019-03-15
label: Internet Archive
source: https://archive.org/donate/cryptocurrency
category: organization
tags:
  - address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
    currency: BTC
  - address: 1K1rgZ1dz9w7dsR1HGS1drmzfUHMtqx1Tc
    currency: BCH
  - address: "0xFA8E3920daF271daB92Be9B87d9998DDd94FEF08"
    currency: ETH
  - address: rGeyCsqc6vKXuyTGF39WJxmTRemoV3c97h
    currency: XRP
  - address: t1ZmpK4QFcvyQZ3ghTgSboBW8b4HgiZHQF9
    currency: ZEC
    lastmod: 2019-04-16
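The inheritance and override rules can be sketched in a few lines of Python. This is only an illustration on plain dictionaries, not the tagpack tool's actual code; the function name resolve_tags is hypothetical, and dates are kept as strings for simplicity.

```python
# Fields that belong to the header only, according to the schema.
HEADER_ONLY_FIELDS = {"title", "creator", "description", "tags"}


def resolve_tags(tagpack):
    """Return the tags with header-level body fields filled in."""
    inherited = {
        key: value
        for key, value in tagpack.items()
        if key not in HEADER_ONLY_FIELDS
    }
    # Later entries win in a dict merge, so a tag's own value
    # overrides the value inherited from the header.
    return [{**inherited, **tag} for tag in tagpack["tags"]]


tagpack = {
    "title": "Third TagPack Example",
    "creator": "John Doe",
    "lastmod": "2019-03-15",
    "label": "Internet Archive",
    "source": "https://archive.org/donate/cryptocurrency",
    "tags": [
        {"address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN", "currency": "BTC"},
        {
            "address": "t1ZmpK4QFcvyQZ3ghTgSboBW8b4HgiZHQF9",
            "currency": "ZEC",
            "lastmod": "2019-04-16",
        },
    ],
}

resolved = resolve_tags(tagpack)
print(resolved[0]["lastmod"])  # 2019-03-15 (inherited from the header)
print(resolved[1]["lastmod"])  # 2019-04-16 (overridden in the body)
```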
TagPacks are uniquely identified by a URI, which can be resolvable.
If TagPack files are maintained in a Git repository, they can be uniquely identified by their Git URI (e.g., https://github.com/graphsense/graphsense-tagpacks/blob/master/packs/demo.yaml).
Within a TagPack, tags are treated as first-class objects that are identified by the combination of the mandatory body fields address, label, and source.
That implies that the same label (e.g., Internet Archive) can be assigned several times to the same address (e.g., 1Archive1n2C579dMsAu3iC6tWzuQJz8dN), typically by different parties.
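The identity rule can be expressed as a tuple key. The following Python sketch, with the hypothetical helpers tag_identity and find_duplicates, treats two tags as the same object only if address, label, and source all match. The second source value is a placeholder.

```python
def tag_identity(tag):
    """Identity of a tag within a TagPack: (address, label, source)."""
    return (tag["address"], tag["label"], tag["source"])


def find_duplicates(tags):
    """Return the identity keys that occur more than once, in order found."""
    seen = set()
    duplicates = []
    for tag in tags:
        key = tag_identity(tag)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


address = "1Archive1n2C579dMsAu3iC6tWzuQJz8dN"
tags = [
    {"address": address, "label": "Internet Archive",
     "source": "https://archive.org/donate/cryptocurrency/"},
    # Same address and label, but a different source: a distinct tag.
    {"address": address, "label": "Internet Archive",
     "source": "https://example.com"},
    # Exact repetition of the first tag: a duplicate.
    {"address": address, "label": "Internet Archive",
     "source": "https://archive.org/donate/cryptocurrency/"},
]

duplicates = find_duplicates(tags)
print(len(duplicates))  # 1
```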
The use of common terminology is essential for data sharing and establishing interoperability across tools. Therefore, the TagPack schema defines two properties that take concepts from agreed-upon taxonomies as values:
- category: defines the type of real-world entity that is in control of a given address. Possible concepts (e.g., exchange, marketplace) are defined in the INTERPOL Darkweb and Cryptoassets Entity Taxonomy.
- abuse: if an address was involved in some abusive behavior, this property's value defines the type of abuse and can take values from the INTERPOL Darkweb and Cryptoassets Abuse Taxonomy.
Both category and abuse properties are optional and can also be combined, e.g.
category: organization
abuse: extremism
The "narrower" (more specific) concept should always be preferred over the "broader" (more abstract) concept in the taxonomy hierarchy, e.g. using ico_wallet should be preferred over wallet_service.
Attribution tags originate from distinct sources, which have various confidence levels. For example, Bitcoin addresses retrieved via a Web crawl are less trustworthy than a Bitcoin address with proven private key ownership.
The TagPack creator can choose an id from the list of confidence score ids available here, e.g.
confidence: ownership
An attribution tag assigned to a single address may or may not be applicable to the entire address cluster. To signal this applicability, the field is_cluster_definer must be set to true or false, respectively. GraphSense selects all address tags with is_cluster_definer: true as candidates for becoming the cluster's tag.
Whether a tag is applicable to the cluster depends very much on the context of the TagPack and the cluster, and must be decided on an individual basis, i.e. after manually inspecting cluster characteristics.
Thus, the recommended approach for TagPack creators is:
- always set is_cluster_definer to false, unless you are experienced and have done the necessary cluster analysis
Up to tagpack-tool release 23.01, the default value of is_cluster_definer was NULL, which was confusing. From release 23.03 on, the default value is false.
As a rule of thumb, the following aspects are considered when resolving address attribution tags for a cluster:
- If an address maps to a cluster that already carries a tag with higher confidence, then the address tag does not define the cluster.
- If an address maps to a cluster of size 1, then the address tag also defines the entity.
- If an address is not a service address (e.g. exchange) and maps to a large unknown cluster, then it might be some form of custodial wallet, i.e. is_cluster_definer can be set to true.
In many cases one may end up with several different address tags, all of which are mapped to the same cluster and carry the flag is_cluster_definer: true. Imagine, for example, different user wallets tagged as ransomware or extortion which are hosted by an exchange.
In such cases, we select the one with the higher confidence value. In case of conflicts, address tag mappings must be revised as part of an attribution tag curation workflow.
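The selection step can be sketched in Python as follows. The numeric ranks, the web_crawl id, and the function name top_cluster_definers are assumptions made for this illustration; the real confidence ids and their levels come from the confidence score list referenced above. Treating a tie as a conflict is likewise an interpretation, not documented GraphSense behavior.

```python
# Hypothetical numeric ranks; higher means more trustworthy.
HYPOTHETICAL_CONFIDENCE_RANK = {"ownership": 100, "forensic": 50, "web_crawl": 20}


def top_cluster_definers(address_tags):
    """Return the cluster-defining candidates with the highest confidence.

    One element means a clear winner; several elements mean a tie that
    needs manual curation; an empty list means no candidate at all.
    """
    candidates = [
        tag for tag in address_tags if tag.get("is_cluster_definer", False)
    ]
    if not candidates:
        return []
    best = max(HYPOTHETICAL_CONFIDENCE_RANK[tag["confidence"]] for tag in candidates)
    return [
        tag
        for tag in candidates
        if HYPOTHETICAL_CONFIDENCE_RANK[tag["confidence"]] == best
    ]


# All tags below are assumed to map to the same cluster.
cluster_tags = [
    {"label": "Some Exchange", "confidence": "ownership", "is_cluster_definer": True},
    {"label": "ransomware", "confidence": "web_crawl", "is_cluster_definer": True},
    {"label": "extortion", "confidence": "forensic", "is_cluster_definer": False},
]

winners = top_cluster_definers(cluster_tags)
print([tag["label"] for tag in winners])  # ['Some Exchange']
```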
TagPacks are based on the YAML data-serialisation format. YAML has many advantages in comparison to other data formats, like well-defined data-type handling and good readability. Unfortunately, it also comes with some downsides. Processing large YAML files can be very slow and can even lead to resource exhaustion errors (out of memory) while parsing them. To avoid such situations, we recommend keeping the files small (< 200 MB) by partitioning them, e.g. by actor or collection time.
TagPacks are often created and exported automatically, for instance, by some cron-triggered script. To avoid data duplication, it is possible to outsource common header fields into a separate file, which is included by individual TagPack files. Here is a simple example directory structure:
home
  user
    tagpack_provider
      header.yaml
      2021
        01
          tp_20200101.yaml
          ..
        02
          tp_20200201.yaml
The header.yaml file contains all common fields:
title: BadHack TagPack
creator: GraphSense Team
description: Addresses used for BadHack
confidence: forensic
abuse: scam
currency: BTC
label: BadHack
and each tagpack file (e.g. tp_20200101.yaml) includes the header file:
header: !include header.yaml
tags:
  - address: bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh
    context: '{"validated": true}'
The directory structure can be arbitrarily deep. The syntax is always the same:
header: !include header.yaml
The tagpack tool resolves the header file as follows: starting from the directory given on the command line, it traverses the parent directories, i.e. upwards, until the header.yaml is found.
For the directory structure example given above, starting with /home/user/tagpack_provider/2021/01, the header file is detected in /home/user/tagpack_provider/.
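The upward search can be approximated with Python's pathlib. This is a sketch of the described behavior under the assumption that the nearest header.yaml wins; it is not the tagpack tool's implementation, and find_header is a hypothetical name. The demo builds the example directory layout inside a temporary directory.

```python
import tempfile
from pathlib import Path


def find_header(start_dir, filename="header.yaml"):
    """Walk from start_dir up through its parents until filename is found."""
    start = Path(start_dir).resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None  # reached the filesystem root without finding the file


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "tagpack_provider"
    (root / "2021" / "01").mkdir(parents=True)
    (root / "header.yaml").write_text("title: BadHack TagPack\n")

    found = find_header(root / "2021" / "01")
    found_in_expected_place = found == root / "header.yaml"
    print(found_in_expected_place)  # True
```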
TagPacks are stored in a Git repository - a so-called TagPack Repository.
When inserting tagpacks into a TagStore database, this git repository information will be used to provide the backlink to the remote repository, e.g.:
tagstore=# select id,uri from tagpack ;
               id                |                                                                   uri
---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------
 testpacks:20210101.yaml         | https://github.com/graphsense/graphsense-tagpack-tool/tree/develop/tests/testfiles/yaml_inclusion/2021/01/20210101.yaml
 testpacks:20210106-special.yaml | https://github.com/graphsense/graphsense-tagpack-tool/tree/develop/tests/testfiles/yaml_inclusion/2021/01/special/20210106-special.yaml
Currently, only a single remote is supported.
For most use cases it is recommended to rely on the standard taxonomies included in the GraphSense tagpack tool.
If required, however, a TagPack repository can include a custom file called config.yaml, which defines explicit pointers to the taxonomies the tagpacks are built upon, e.g.
taxonomies:
  entity: https://graphsense.github.io/DW-VA-Taxonomy/assets/data/entities.csv
  abuse: https://graphsense.github.io/DW-VA-Taxonomy/assets/data/abuses.csv