diff --git a/data-formats/README.md b/data-formats/README.md index 1295228a..496c7f00 100644 --- a/data-formats/README.md +++ b/data-formats/README.md @@ -2,7 +2,7 @@ | Authors | Arturo Filastò et al. | |------------|-----------------------| -| Version | 0.4.0 | +| Version | 0.2.0 | | Maintainer | Simone Basso | ## Overview @@ -21,15 +21,8 @@ of more fields generated by the experiment itself. That is: ```JavaScript { - "data_format_version": "0.3.1", - // other toplevel keys that are part of the base data format - // ... - "test_keys": { - // keys written by test templates - // ... - // keys written by each experiment - // ... - } + "data_format_version": "0.2.0", + "test_keys": {} } ``` @@ -42,30 +35,14 @@ emit `null`s) unless this has been explicitly documented in the field descriptio ## Data format version -For compatibility with the legacy OONI pipeline, since `data_format_version` equal -to `0.4.0`, a OONI probe SHOULD send `data_format_version` equal to `0.2.0`. In such -case, the probe MUST use the `real_data_format_version` annotation to declare the -the real data format version being used: - -```JavaScript -{ - "annotations": { - "real_data_format_version": "0.4.0" - }, - "data_format_version": "0.2.0", - // other toplevel keys that are part of the base data format - // ... - "test_keys": { - // keys written by test templates - // ... - // keys written by each experiment - // ... - } -} -``` +The current `data_format_version` is `0.2.0`. This applies only to the keys +in the external envelope. Since 2020-04-06, the `extensions` top-level key describes +the data formats contained inside the `test_keys` (see below). -The annotation allows to distinguish between probes using data format version -`0.2.0` and probes using subsequent versions. +Between November 2019 and April 2020, experimental versions of OONI probe had +version numbers ranging from `0.2.1` to `0.4.0`. Since 2020-04-06, the version +is back again to `0.2.0`. 
Because such larger version numbers were used by +experimental versions of OONI, the next major data format version will be `0.5.0`. ## Example @@ -75,11 +52,15 @@ The following is a valid JSON that was edited for brevity. { "annotations": { - "platform": "macos", - "real_data_format_version": "0.4.0" + "platform": "macos" }, "data_format_version": "0.2.0", + "extensions": { + "dnst": 0, + "httpt": 0, + "tcpconnect": 0 + }, + "input": null, "measurement_start_time": "2020-01-10 17:25:19", - "test_runtime": 4.426603178, "probe_asn": "AS30722", "probe_cc": "IT", "probe_ip": "127.0.0.1", @@ -181,6 +162,7 @@ The following is a valid JSON that was edited for brevity. "telegram_web_status": "ok" }, "test_name": "telegram", + "test_runtime": 4.426603178, "test_start_time": "2020-01-10 17:25:19", "test_version": "0.0.4" } @@ -191,22 +173,26 @@ In this example: - all toplevel keys belong to the base data format. - the `agent` and `requests` keys within the `test_keys` -belong to the HTTP data format. +belong to the HTTP data format, which is declared as +`httpt` in the `extensions` map. - the `queries` key within the `test_keys` belongs to -the DNS data format. +the DNS data format, which is declared as `dnst` in the +`extensions` map. - the `tcp_connect` key within the `test_keys` belongs -to the TCPConnect data format. +to the TCPConnect data format, which is declared as +`tcpconnect` in the `extensions` map. - all the other keys within `test_keys` are generated by the `telegram` experiment. ## Index -This directory contains the specification of: +This directory contains the specification of +[the base data format](df-000-base.md) as well as +of the following extensions: -- [the common data format](df-000-base.md) - [the HTTP data format](df-001-httpt.md) - [the DNS data format](df-002-dnst.md) - [the Scapy data format](df-003-scapy.md) @@ -225,29 +211,7 @@ See the [nettests](../nettests) directory for the experiments' specs. - `0.2.0` [2016-01-27]: the new JSON format. 
OONI Probe CLI v2.x and OONI Probe Mobile when using Measurement Kit as the measurement engine. -- `0.2.1` [2019-11-11]: should have been `0.3.0` because we moved the -`resolver_ip` field to toplevel keys. Was briefly used by unstable OONI -Probe CLI v3.0.0. - -- `0.3.0` [2019-12-02]: same as `0.2.1` but renamed for correctness. - -- `0.3.1` [2019-12-29]: added the `resolver_asn` and -`resolver_network_name` toplevel keys. - -- `0.3.2` [2020-01-10]: initially used to deprecate `test_start_time`, but -that was a mistake, and it was reverted. Thus this is now an alias for `0.3.1`. - -- `0.3.3` [2020-01-11]: document TLSHandshake and NetworkEvents templates, -add timing support via `t` to DNS and TCPConnect templates, allow cross -referencing DNS, TCPConnect, NetworkEvents and possibly TLSHandshake using -the `conn_id` and `dial_id` identifiers. - -- `0.3.4` [2020-01-11]: add `transaction_id` to link the HTTP template -with DNS template, TCPConnect template, TLSHandshake template, and -NetworkEvents template. Make sure we clarify which IDs may be reused -and which are expected to appear only once per measurement session. - -- `0.4.0` [2020-02-10]: clients SHOULD set `data_format_version` to -`0.2.0` to avoid breaking the legacy pipeline. If they do so, they -MUST set the `real_data_format_version` annotation to the real data -format version that is being used. +Between `2019-11-11` and `2020-04-06`, experimental versions of OONI have +used the following versions: `0.2.1`, `0.3.0`, `0.3.1`, `0.3.2`, `0.3.3`, +`0.3.4`, and `0.4.0`. Since `2020-04-06`, `0.2.0` is again used by experimental and +stable versions of OONI Probe. diff --git a/data-formats/df-000-base.md b/data-formats/df-000-base.md index b2ab1bcb..aff256ba 100644 --- a/data-formats/df-000-base.md +++ b/data-formats/df-000-base.md @@ -10,7 +10,8 @@ directory's [README](README.md) for the basic concepts. 
"annotations": {}, "backend_version": "", "bucket_date": "2019-10-10", - "data_format_version": "0.3.1", + "data_format_version": "0.2.0", + "extensions": {}, "id": "bc1ff44a-04e7-45e0-81a6-46bc95c7c6b0", "input": "http://example.com/", "input_hashes": [], @@ -28,7 +29,7 @@ directory's [README](README.md) for the basic concepts. "software_name": "ooniprobe-ios", "software_version": "2.1.0", "test_helpers": {}, - "test_keys": {}, + "test_keys": {}, "test_name": "web_connectivity", "test_runtime": 2.2955930233, "test_start_time": "2019-10-10 23:58:13", @@ -50,6 +51,12 @@ of course are not supposed to emit this field. - `data_format_version` (`string`): indicates the data format version. See [README.md](README.md) for the current version and for the versions history. +- `extensions` (`map[string]int`; optional): SHOULD describe the extensions +to the base data format included in the `test_keys` field. The name of an +extension is the obtained directly from the file name describing the extension +in this directory, with the `df-xxx` prefix and the `.md` suffix removed. A +probe SHOULD describe the extensions included by its measurements. + - `id` (`string`; optional): client-generated UUID4 identifying this measurement in the context of a set of measurements (i.e. a report). Consumers of OONI data SHOULD NOT trust this identifier to uniquely identify the measurement. This @@ -57,14 +64,15 @@ identifier is only meaningful for measurements that have not been submitted to a OONI collector yet. In fact, OONI collectors SHOULD clear this field to avoid any potential confusion caused by it. -- `input` (`string`; optional; nullable): if this experiment accepts any input, the +- `input` (`string`; nullable): if this experiment accepts any input, the input that was used to produce this measurement. For example, the Web -Connectivity experiment uses URLs as input. +Connectivity experiment uses URLs as input. Otherwise, this field SHOULD +be present and set to `null`. 
- `input_hashes` (`[]string`; optional; deprecated): historical field that used to contain the SHA256s of all inputs provided to the experiment. Modern implementations, e.g. Measurement Kit, typically emit an empty list. All -clients using `v0.2.1`, or greater, of the data format SHOULD NOT emit this field at all. +modern clients SHOULD NOT emit this field at all. - `measurement_start_time` (`string`): time when this measurement was started in UTC, using the `"2006-01-02 08:04:05"` format. Note that @@ -72,8 +80,8 @@ ooniprobe <= 1.4.0 generates skewed time information. - `options` (`[]string`; optional; deprecated): list of options passed on the command line when running this specific experiment. Modern implementations, -e.g. Measurement Kit, typically emit an empty list here. All clients using -`v0.2.1` or greater of the data format SHOULD NOT emit this field at all. +e.g. Measurement Kit, typically emit an empty list here. All modern clients +SHOULD NOT emit this field at all. - `probe_asn` (`string`): AS Number of the probe (prefixed by AS, e.g., `"AS1234"`), or `"AS0"` if the user does not want to share their ASN. @@ -83,8 +91,7 @@ e.g. Measurement Kit, typically emit an empty list here. All clients using - `probe_city` (`string`; optional; deprecated): name of the city where the measurement was run. If the user does not want to share this information, -this field should be set to `null` by `v0.2.0` clients. Clients using `v0.2.1` -or greater SHOULD NOT emit this field. +this field was historically set to `null`; modern clients SHOULD NOT emit it. - `probe_ip` (`string`): IP address of the probe, or `"127.0.0.1"` if the user does not want to share their IP. @@ -96,14 +103,14 @@ course are not supposed to emit this field. - `report_id` (`string`): identifier of a set of related measurements generated by OONI backends when submitting one or more measurements. 
-- `resolver_asn` (`string`; since `0.3.1`): like `probe_asn` but for +- `resolver_asn` (`string`; optional; since `2019-12-29`): like `probe_asn` but for `resolver_ip` rather than for `probe_ip`. -- `resolver_ip` (`string`; since `0.2.1`): IP of the DNS resolver used by +- `resolver_ip` (`string`; optional; since `2019-11-11`): IP of the DNS resolver used by the probe, as determined by the measurement engine. -- `resolver_network_name` (`string`; since `0.3.1`): like `probe_network_name` -but for `resolver_ip` rather than for `probe_ip`. +- `resolver_network_name` (`string`; optional; since `2019-12-29`): like +`probe_network_name` but for `resolver_ip` rather than for `probe_ip`. - `software_name` (`string`): name of the software that has generated this specific measurement (e.g., `"ooniprobe"`). @@ -155,36 +162,35 @@ list of input URLs. ``` Annotations is defined as `map[string]string` but the consumer of this field -SHOULD NOT assume that clients using `data_format_version < 0.2.1` have always -used string values. Since `v0.2.1` clients MUST always use string values. A -client SHOULD always add to the map of annotations: +SHOULD NOT assume that measurements use string values. A client SHOULD always +add to the map of annotations: - - `engine_name` (`string`): the name of the measurement engine +- `engine_name` (`string`): the name of the measurement engine - - `engine_version` (`string`): the version of the measurement engine +- `engine_version` (`string`): the version of the measurement engine - - `engine_version_full` (`string`): the version of the measurement - engine as generated by `git describe --tags` +- `engine_version_full` (`string`): the version of the measurement +engine as generated by `git describe --tags` - - `network_type` (`string`): one of: +- `network_type` (`string`): one of: - - `mobile`: when OONI Probe Mobile is using 2G/3G/4G/5G networks. + - `mobile`: when OONI Probe Mobile is using 2G/3G/4G/5G networks. 
- - `wifi`: when OONI Probe Mobile is using Wi-Fi networks. + - `wifi`: when OONI Probe Mobile is using Wi-Fi networks. - - `platform` (`string`): one of: +- `platform` (`string`): one of: - - `android` + - `android` - - `ios` + - `ios` - - `lepidopter` + - `lepidopter` - - `linux` + - `linux` - - `macos` + - `macos` - - `windows` + - `windows` ## Test Helpers @@ -218,9 +224,14 @@ because it was not relevant for this discussion. "annotations": { "platform": "macos" }, - "data_format_version": "0.3.1", + "data_format_version": "0.2.0", + "extensions": { + "dnst": 0, + "httpt": 0, + "tcpconnect": 0 + }, + "input": null, "measurement_start_time": "2020-01-10 17:25:19", - "test_runtime": 4.426603178, "probe_asn": "AS30722", "probe_cc": "IT", "probe_ip": "127.0.0.1", @@ -232,6 +243,7 @@ because it was not relevant for this discussion. "software_version": "0.1.0-dev", "test_keys": {}, "test_name": "telegram", + "test_runtime": 4.426603178, "test_start_time": "2020-01-10 17:25:19", "test_version": "0.0.4" } diff --git a/data-formats/df-001-httpt.md b/data-formats/df-001-httpt.md index 380e77ae..293ae4f1 100644 --- a/data-formats/df-001-httpt.md +++ b/data-formats/df-001-httpt.md @@ -4,6 +4,10 @@ This document describes the keys with `test_keys` that all experiments using HTTP SHOULD populate, possibly using directly the specific template code. See this directory's [README](README.md) for the basic concepts. +| Name | `httpt` | +|------------|---------| +| Version | 0 | + ## Specification ```JSON @@ -46,7 +50,7 @@ a string indicating the error, otherwise it MUST be `null`. contains the response length and is typically set to `null` or directly omitted by modern clients (e.g. from Measurement Kit onwards). -- `transaction_id` (`int`; optional; since v0.3.4): unique ID of this +- `transaction_id` (`int`; optional; since 2020-01-11): unique ID of this transaction. A zero transaction ID means "unspecified". The code SHOULD NOT include the transaction ID in this case. 
This ID will be unique within a single measurement session; do not assume it will be unique @@ -69,7 +73,7 @@ for longer than that. `string` if it can be represented using UTF-8. Otherwise it is a `BinaryData` instance, as described below. See also `MaybeBinaryData` below. -- `body_is_truncated` (`bool`; optional; since v0.2.1): `true` if the body +- `body_is_truncated` (`bool`; optional; since 2019-12-02): `true` if the body has been truncated, `false` or omitted otherwise. - `headers` (`map[string]MaybeBinaryData`): legacy map containing HTTP headers @@ -77,7 +81,7 @@ where the value is `string` if it can be represented using UTF-8 and a `BinaryData` instance otherwise. In case multiple headers have the same key, the map SHOULD only contain the first value. -- `headers_list` (`[]HeaderValue`); since v0.2.1): this is a better +- `headers_list` (`[]HeaderValue`; since 2019-12-02): this is a better representation of headers that allows us to represent the case where there are multiple values for the same header key. See below the definition of `HeaderValue`, which in the value-is-UTF-8 case boils down to the @@ -90,7 +94,7 @@ instance of `tor` that we may be using for measuring. In case we have the following headers: -``` +```bash > Foo: bar > Foo: ``` diff --git a/data-formats/df-002-dnst.md b/data-formats/df-002-dnst.md index 1c96f8fe..333e2773 100644 --- a/data-formats/df-002-dnst.md +++ b/data-formats/df-002-dnst.md @@ -4,6 +4,10 @@ This document describes the keys with `test_keys` that all experiments using DNS SHOULD populate, possibly using directly the specific template code. See this directory's [README](README.md) for the basic concepts. +| Name | `dnst` | +|------------|--------| +| Version | 0 | + ## Specification ```JSON @@ -34,7 +38,7 @@ code. See this directory's [README](README.md) for the basic concepts. - `answers` (`[]Answer`): list of answer objects. See below. 
-- `dial_id` (`int`; optional; since v0.3.3): identifier of a dialing +- `dial_id` (`int`; optional; since 2020-01-11): identifier of a dialing operation (i.e. name resolution followed by connect). The zero dial_id means that we don't know the real dial ID. Applications SHOULD NOT emit the dial_id when it is zero. Rest assured that the dial_id will @@ -52,7 +56,7 @@ be reversed for PTR lookups like `1.0.0.127.in-addr.arpa`. - `query_type`: (`string`): a valid DNS query type (e.g. `MX`). -- `resolver_address`: (`string`; since v0.3.1): more flexible way of +- `resolver_address`: (`string`; since 2019-12-29): more flexible way of specifying the resolver address that also allows for DoH, because it does not assume that the resolver is identified by an address, port tuple. @@ -67,7 +71,7 @@ used by ooni/probe-engine, which sets it to `null`. - `t` (`float`): number of seconds elapsed since `measurement_start_time` measured in the moment in which `failure` is determined. -- `transaction_id` (`int`; optional; since v0.3.4): if present, this is the +- `transaction_id` (`int`; optional; since 2020-01-11): if present, this is the ID of the HTTP transaction that caused this query. ## Answer @@ -115,7 +119,7 @@ mailbox of the person responsible for this zone. that should be elapsed before the zone should be retried in case of failure. - `serial_number` (`int`; only for SOA answers): version number -of the original copy of the zone. +of the original copy of the zone. - `ttl` (`int`; nullable): the TTL if known, otherwise `null`. diff --git a/data-formats/df-005-tcpconnect.md b/data-formats/df-005-tcpconnect.md index a566bef9..c08af283 100644 --- a/data-formats/df-005-tcpconnect.md +++ b/data-formats/df-005-tcpconnect.md @@ -5,6 +5,10 @@ performing TCP connects SHOULD populate, possibly using directly the specific template code. See this directory's [README](README.md) for the basic concepts. 
+| Name | `tcpconnect` | +|------------|--------------| +| Version | 0 | + ## Specification ```JSON @@ -29,10 +33,10 @@ basic concepts. } ``` -- `conn_id` (`int`; optional; since v0.3.3): identifier of the connection. See +- `conn_id` (`int`; optional; since 2020-01-11): identifier of the connection. See the discussion in `df-008-netevents.md`. -- `dial_id` (`int`; optional; since v0.3.3): identifier of a dialing +- `dial_id` (`int`; optional; since 2020-01-11): identifier of a dialing operation (i.e. name resolution followed by connect). See the discussion in `df-002-dnst.md`. @@ -45,7 +49,7 @@ discussion in `df-002-dnst.md`. - `t` (`float`): number of seconds elapsed since `measurement_start_time` measured when `connect` is complete. -- `transaction_id` (`int`; optional; since v0.3.4): if present, this is the +- `transaction_id` (`int`; optional; since 2020-01-11): if present, this is the ID of the HTTP transaction that caused this TCP connect. ## Status diff --git a/data-formats/df-006-tlshandshake.md b/data-formats/df-006-tlshandshake.md index 789ae342..9702a943 100644 --- a/data-formats/df-006-tlshandshake.md +++ b/data-formats/df-006-tlshandshake.md @@ -4,7 +4,9 @@ This document describes the keys with `test_keys` that all experiments using TLS SHOULD populate, possibly using directly the specific template code. See this directory's [README](README.md) for the basic concepts. -This data format is available since data format version 0.3.3. +| Name | `tlshandshake` | +|------------|----------------| +| Version | 0 | ## Specification @@ -33,7 +35,7 @@ This data format is available since data format version 0.3.3. - `cipher_suite` (`string`): the negotiated cipher suite, if any. -- `conn_id` (`int`; optional; since v0.3.3): identifier of the connection. See +- `conn_id` (`int`; optional; since 2020-01-11): identifier of the connection. See the discussion in `df-008-netevents.md`. 
- `failure` (`string`; nullable): if there was an error, this field is @@ -50,7 +52,7 @@ measured in the moment in which `failure` is determined. - `tls_version` (`string`): the negotiated TLS version, if any. -- `transaction_id` (`int`; optional; since v0.3.4): if present, this is the +- `transaction_id` (`int`; optional; since 2020-01-11): if present, this is the ID of the HTTP transaction that caused this TLS handshake. ## Example diff --git a/data-formats/df-008-netevents.md b/data-formats/df-008-netevents.md index 104eb53a..cafbf3e0 100644 --- a/data-formats/df-008-netevents.md +++ b/data-formats/df-008-netevents.md @@ -4,7 +4,9 @@ This document describes the keys with `test_keys` that experiments MAY use to include network-level events. See this directory's [README](README.md) for the basic concepts. -This data format is available since data format version 0.3.3. +| Name | `netevents` | +|------------|-------------| +| Version | 0 | ## Specification @@ -37,7 +39,7 @@ This data format is available since data format version 0.3.3. - `conn_id` (`int`; optional): identifier of this connection (see below). When zero, it means we don't know the conn ID. SHOULD be omitted when zero. -- `dial_id` (`int`; optional; since v0.3.3): identifier of a dialing +- `dial_id` (`int`; optional; since 2020-01-11): identifier of a dialing operation (i.e. name resolution followed by connect). See the discussion in `df-002-dnst.md`. @@ -54,7 +56,7 @@ a string indicating the error, otherwise it MUST be `null`. - `t` (`float`): number of seconds elapsed since `measurement_start_time` measured when `operation` is complete. -- `transaction_id` (`int`; optional; since v0.3.4): if present, this is the +- `transaction_id` (`int`; optional; since 2020-01-11): if present, this is the ID of the HTTP transaction that caused this TCP connect. ## Connection ID and Life Cycle
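
The `extensions` naming rule that this change adds to `df-000-base.md` (the spec file name with the `df-xxx` prefix and the `.md` suffix removed) can be illustrated with a minimal Python sketch. The helper names `extension_name` and `build_extensions` and the regular expression are assumptions made for illustration only; they are not part of the specification or of any OONI codebase. The sketch also assumes that the prefix is always `df-`, three digits, and a dash, as in the file names listed in the README index.

```python
import re
from typing import Dict

# Hypothetical pattern (an assumption, not taken from the spec): "df-",
# three digits, a dash, the extension name, then the ".md" suffix.
_SPEC_FILE = re.compile(r"^df-\d{3}-(?P<name>[a-z0-9]+)\.md$")


def extension_name(spec_file: str) -> str:
    """Derive the extension name from a spec file name.

    Example: "df-001-httpt.md" yields "httpt".
    """
    match = _SPEC_FILE.match(spec_file)
    if match is None:
        raise ValueError(f"not an extension spec file name: {spec_file!r}")
    return match.group("name")


def build_extensions(spec_versions: Dict[str, int]) -> Dict[str, int]:
    """Build the `extensions` map from {spec file name: version} pairs."""
    return {extension_name(name): version
            for name, version in spec_versions.items()}


# Mirrors the `extensions` object shown in the README example.
extensions = build_extensions({
    "df-001-httpt.md": 0,
    "df-002-dnst.md": 0,
    "df-005-tcpconnect.md": 0,
})
print(extensions)  # {'httpt': 0, 'dnst': 0, 'tcpconnect': 0}
```

Note that `df-000-base.md` describes the base data format rather than an extension, so a real implementation would presumably not pass it to such a helper.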