
OONI Pipeline v5 alpha #64

Merged · 65 commits · Apr 30, 2024

Commits
3693337
Bump version number to v5
hellais Apr 24, 2024
2238346
Improvements to the docker based test setup
hellais Apr 24, 2024
c0209f2
Fix bug in web analysis that led to tcp ok loni not being set
hellais Apr 24, 2024
f5b530c
Tweak loni value calculations
hellais Apr 24, 2024
9feef77
Improvements to the analysis tests
hellais Apr 24, 2024
b88f5cc
Switch to ubuntu-20.04 for oonipipeline tests
hellais Apr 24, 2024
eccbc59
Only dump failing rows when flag is set
hellais Apr 24, 2024
ff68c06
Run tests by default with less verbosity
hellais Apr 24, 2024
58f41a2
Add more test cases
hellais Apr 24, 2024
012660e
Fix bugs in DNS analysis for NXDOMAIN case
hellais Apr 24, 2024
f4c7d52
Add test case for down NXDOMAIN case
hellais Apr 24, 2024
d5aab62
Remove test that's already implemented
hellais Apr 24, 2024
207c371
Add test case for nxdomain blocking
hellais Apr 24, 2024
64342c2
When a blockpage isn't country consistent we consider it a "down" mea…
hellais Apr 24, 2024
0786412
Add tests for the create tables
hellais Apr 25, 2024
d63befb
Enable more tests
hellais Apr 25, 2024
5381d2f
Rename workflows module to temporal
hellais Apr 25, 2024
9b8adc5
Move observations into activities
hellais Apr 25, 2024
daf1546
Move ground_truths into activities
hellais Apr 25, 2024
205a53c
Move analysis into activities
hellais Apr 25, 2024
df867e7
Port Observation generation over to schedule based workflow
hellais Apr 25, 2024
c690ab2
Fix arguments of tests
hellais Apr 25, 2024
c3e3d41
Adjust semantics of the start_at and end_at intervals
hellais Apr 25, 2024
386b588
Implement temporary workaround to obtain the timestamp of the run
hellais Apr 25, 2024
f7f463a
Fix start and end time arithmetic
hellais Apr 25, 2024
1035bd3
Re-enable optimize table call
hellais Apr 25, 2024
ee36957
Refactor observations workflow based on pattern in https://github.com…
hellais Apr 25, 2024
88dfc5f
Refactor analysis and groundtruth workflows
hellais Apr 25, 2024
e05a863
Fix time ranges in tests
hellais Apr 25, 2024
236c399
Fix missing activity import
hellais Apr 25, 2024
5f3cba0
Pass parallelism cli option to commands
hellais Apr 25, 2024
2ed17ad
Calculate the final_ok value as the min of ok values
hellais Apr 26, 2024
cc0e302
Properly setup logging
hellais Apr 26, 2024
01e7ccf
Properly setup workflow_ids
hellais Apr 26, 2024
f370a7f
Disabling the sandbox is no longer needed following the refactor
hellais Apr 26, 2024
da9ec28
Add support for exporting prometheus metrics
hellais Apr 26, 2024
85c3c93
Add support for collecting full traces using open telemetry
hellais Apr 26, 2024
8f35368
Port observations metrics over to open telemetry
hellais Apr 26, 2024
34338cc
Improvements to telemetry generation
hellais Apr 26, 2024
b325d1a
Move design outside of the main readme file
hellais Apr 26, 2024
1f07342
Move analysis over to open telemetry
hellais Apr 26, 2024
0b5e912
Add more useful metadata to traces
hellais Apr 26, 2024
6d32e00
Fix observation count trace collection
hellais Apr 26, 2024
8fe6117
Take note of an improvement related to idempotence of activities
hellais Apr 26, 2024
7c7f958
Fix logging outputs
hellais Apr 26, 2024
f02fe0e
Bump to alpha0 release
hellais Apr 26, 2024
fa6d069
Pass telemetry endpoint option
hellais Apr 26, 2024
d5c3387
Add temporal address endpoint
hellais Apr 26, 2024
0fd4bce
Refactoring of workers for production
hellais Apr 26, 2024
f28d95e
Update docs based on separate worker processes
hellais Apr 26, 2024
152008d
Add note about zombies
hellais Apr 26, 2024
0af4726
Properly handle ctrl-c process pool termination
hellais Apr 26, 2024
d6b2adc
Bump up the max times
hellais Apr 26, 2024
8a06b7a
Make the processing file logs less noisy
hellais Apr 26, 2024
e6a8bd5
Properly handle ctrl-c in workers
hellais Apr 30, 2024
31a820a
Implement docker-compose.yml for the oonipipeline
hellais Apr 30, 2024
3c975fe
Fix nesting of spans
hellais Apr 30, 2024
330c50b
Change telemetry port
hellais Apr 30, 2024
5ca70b5
Fix readme
hellais Apr 30, 2024
92a7c23
Set correct kibana version
hellais Apr 30, 2024
7833789
Add support for running superset in docker-compose
hellais Apr 30, 2024
630d863
Fix /var/run/superset permissions
hellais Apr 30, 2024
f34f217
Add persistence of elastic and postgres data
hellais Apr 30, 2024
686fe8e
Update instructions for superset setup
hellais Apr 30, 2024
16ebc35
Add pgdata to .gitignore
hellais Apr 30, 2024
2 changes: 1 addition & 1 deletion .github/workflows/test_oonipipeline.yml
@@ -2,7 +2,7 @@ name: test oonipipeline
on: push
jobs:
run_tests:
runs-on: ubuntu-latest
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v3

1 change: 0 additions & 1 deletion .gitignore
@@ -9,4 +9,3 @@ coverage.xml
/output
/attic
/prof
/clickhouse-data
218 changes: 4 additions & 214 deletions Readme.md
@@ -7,230 +7,20 @@
Most users will likely be interested in using this as a CLI tool for downloading
measurements.

If that is your goal, getting started is easy. Run:

```
pip install oonidata
```

You will then be able to download measurements via:

```
oonidata sync --probe-cc IT --start-day 2022-10-01 --end-day 2022-10-02 --output-dir measurements/
```

This will download all OONI measurements for Italy that were uploaded between
2022-10-01 and 2022-10-02 into the directory `./measurements`.

If you are interested in learning more about the design of the analysis tooling,
please read on.

## Developer setup

This project makes use of [poetry](https://python-poetry.org/) for dependency
management. Follow [their
instructions](https://python-poetry.org/docs/#installation) on how to set it up.

Once you have done that you should be able to run:
```
poetry install
poetry run python -m oonidata --help
```
## Architecture overview

The analysis engine is made up of several components:
* Observation generation
* Response body archiving
* Ground truth generation
* Experiment result generation

Below we explain each step of this process in detail.

At a high level the pipeline looks like this:

```mermaid
graph
M{{Measurement}} --> OGEN[[make_observations]]
OGEN --> |many| O{{Observations}}
NDB[(NetInfoDB)] --> OGEN
OGEN --> RB{{ResponseBodies}}
RB --> BA[(BodyArchive)]
FDB[(FingerprintDB)] --> FPH
FPH --> BA
RB --> FPH[[fingerprint_hunter]]
O --> ODB[(ObservationTables)]

ODB --> MKGT[[make_ground_truths]]
MKGT --> GTDB[(GroundTruthDB)]
GTDB --> MKER
BA --> MKER
ODB --> MKER[[make_experiment_results]]
MKER --> |one| ER{{ExperimentResult}}
```

### Observation generation

The goal of the Observation generation stage is to take raw OONI measurements
as input data and produce observations as output.

An observation is a timestamped statement about some network condition that was
observed by a particular vantage point. For example, an observation could be
"the TLS handshake to 8.8.4.4:443 with SNI equal to dns.google failed with
a connection reset by peer error".
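As an illustration, such an observation might be modelled as a flat row; the sketch below uses assumed field names, not the actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class WebObservationSketch:
    # All field names here are illustrative assumptions
    measurement_uid: str                   # measurement this row was derived from
    observed_at: datetime                  # when the network event was observed
    probe_cc: str                          # vantage point country code
    probe_asn: int                         # vantage point ASN
    ip: Optional[str] = None               # target IP, e.g. "8.8.4.4"
    port: Optional[int] = None             # target port, e.g. 443
    tls_server_name: Optional[str] = None  # SNI, e.g. "dns.google"
    tls_failure: Optional[str] = None      # e.g. "connection_reset_by_peer"
```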

What these observations mean for the
target in question (e.g., is there blocking or is the target down?) is something
to be determined when looking at data in aggregate, and is the
responsibility of the experiment result generation stage.

During this stage we are also going to enrich observations with metadata about
IP addresses (using the IPInfoDB).

Each measurement ends up producing observations that are all of the same
type and are written to the same DB table.

This has the benefit that we don't need to look up the observations we care
about across several disparate tables, but can query a single one, which is
fast.

A side effect is that we end up with tables that can be a bit sparse (several
columns are NULL).
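As an illustration of the single-table lookup, here is a minimal sketch using the `clickhouse-driver` package; the table name `obs_web` and the column names are assumptions, not the actual schema:

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

# Hypothetical query: DNS, TCP, TLS and HTTP fields all live in one wide
# table (assumed here to be called "obs_web"), so one query covers them all.
client = Client("localhost")
rows = client.execute(
    """
    SELECT measurement_uid, ip, port, tls_failure
    FROM obs_web
    WHERE probe_cc = %(cc)s
      AND measurement_start_time >= %(since)s
    """,
    {"cc": "IT", "since": "2022-10-01 00:00:00"},
)
```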

The tricky part, in the case of complex tests like web_connectivity, is to
figure out which individual sub-measurements fit into the same observation row.
For example, we would like the TCP connect result to appear in the same
row as the DNS query that led to it, together with the TLS handshake towards
that IP and port combination.

You can run the observation generation with a clickhouse backend like so:
```
poetry run python -m oonidata mkobs --clickhouse clickhouse://localhost/ --data-dir tests/data/datadir/ --start-day 2022-08-01 --end-day 2022-10-01 --create-tables --parallelism 20
```

Here is the list of supported observations so far:
* [x] WebObservation, which has information about DNS, TCP, TLS and HTTP(s)
* [x] WebControlObservation, has the control measurements run by web connectivity (is used to generate ground truths)
* [ ] CircumventionToolObservation, still needs to be designed and implemented
(ideally we would use the same for OpenVPN, Psiphon, VanillaTor)

### Response body archiving

Optionally, WAR archives of HTTP response bodies can also be created when
running the observation generation.
### OONI Pipeline

This is enabled by passing the extra command line argument `--archives-dir`.

Whenever a response body is detected in a measurement, it is sent to the
archiving queue, which looks up in the database whether the body has been seen
already (so we don't store exact duplicate bodies).
If we haven't archived it yet, we write the body to a WAR file and record its
sha1 hash, together with the filename it was written to, in a database.

These WAR archives can then be mined asynchronously for blockpages using the
fingerprint hunter command:
```
oonidata fphunt --data-dir tests/data/datadir/ --archives-dir warchives/ --parallelism 20
```

When a blockpage matching a fingerprint is detected, the relevant database row
is updated with the ID of the fingerprint that was detected.
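A minimal sketch of the dedup-then-archive logic described above, assuming a plain sqlite3 index table and treating the WAR file as an opaque append-only archive (the actual implementation differs):

```python
import hashlib
import sqlite3

conn = sqlite3.connect("bodies-index.sqlite3")
conn.execute(
    "CREATE TABLE IF NOT EXISTS bodies (sha1 TEXT PRIMARY KEY, filename TEXT)"
)

def archive_body(body: bytes, archive_filename: str) -> None:
    digest = hashlib.sha1(body).hexdigest()
    # Skip bodies we have already archived, so no exact duplicates are stored
    if conn.execute("SELECT 1 FROM bodies WHERE sha1 = ?", (digest,)).fetchone():
        return
    with open(archive_filename, "ab") as f:
        f.write(body)  # the real pipeline writes a proper WAR record here
    conn.execute(
        "INSERT INTO bodies (sha1, filename) VALUES (?, ?)",
        (digest, archive_filename),
    )
    conn.commit()
```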

### Ground Truth generation

In order to establish if something is being blocked or not, we need some ground truth for comparison.

The goal of the ground truth generation task is to build a ground truth
database, which contains all the ground truths for every target that has been
tested on a particular day.

Currently it's implemented using the WebControlObservations, but in the future
we could also use other WebObservations.

Each ground truth database is actually just a sqlite3 database. For a given day
it's approximately 150MB in size, and we load it fully into memory when running
the analysis workflow.
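Loading a sqlite3 database fully into memory can be done with the standard library's backup API; a sketch, with a hypothetical per-day filename:

```python
import sqlite3

# Copy the whole on-disk ground truth database into RAM before analysis;
# the filename pattern below is a hypothetical example.
disk_db = sqlite3.connect("ground_truths/2022-10-01.sqlite3")
mem_db = sqlite3.connect(":memory:")
disk_db.backup(mem_db)
disk_db.close()
# ...analysis queries now run against mem_db without touching disk...
```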

### ExperimentResult generation

An experiment result is the interpretation of one or more observations with a
determination of whether the target is `BLOCKED`, `DOWN` or `OK`.

For each of these states a confidence indicator is given, which is an estimate
of the likelihood of that result being accurate.

For each of the 3 states, it's also possible to specify a `blocking_detail`,
which gives more information as to why the block might be occurring.

It's important to note that for a given measurement, multiple experiment results
can be generated, because a target might be blocked in multiple ways, or be OK
in some regards but not in others.

This is best explained through a concrete example. Let's say a censor is
blocking https://facebook.com/ with the following logic:
* any DNS query for facebook.com gets "127.0.0.1" as an answer
* any TCP connect request to 157.240.231.35 gets a RST
* any TLS handshake with SNI facebook.com gets a RST

In this scenario, assuming the probe has discovered other IPs for facebook.com
through other means (e.g., through the test helper or DoH, as web_connectivity
0.5 does), we would like to emit the following experiment results:
* BLOCKED, `dns.bogon`, `facebook.com`
* BLOCKED, `tcp.rst`, `157.240.231.35:80`
* BLOCKED, `tcp.rst`, `157.240.231.35:443`
* OK, `tcp.ok`, `157.240.231.100:80`
* OK, `tcp.ok`, `157.240.231.100:443`
* BLOCKED, `tls.rst`, `157.240.231.35:443`
* BLOCKED, `tls.rst`, `157.240.231.100:443`

This way we fully characterise the block across all the methods through which
it is implemented.
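The sketch below shows one way these results could be represented as data; the class and field names are illustrative assumptions, not the actual `ExperimentResult` schema:

```python
from dataclasses import dataclass

@dataclass
class ExperimentResultSketch:
    state: str            # "BLOCKED", "DOWN" or "OK"
    confidence: float     # estimated likelihood that the state is accurate
    blocking_detail: str  # e.g. "dns.bogon", "tcp.rst", "tls.rst"
    target: str           # e.g. "facebook.com" or "157.240.231.35:443"

# A few of the results from the facebook.com example above:
results = [
    ExperimentResultSketch("BLOCKED", 0.9, "dns.bogon", "facebook.com"),
    ExperimentResultSketch("BLOCKED", 0.8, "tcp.rst", "157.240.231.35:443"),
    ExperimentResultSketch("OK", 0.9, "tcp.ok", "157.240.231.100:80"),
    ExperimentResultSketch("BLOCKED", 0.8, "tls.rst", "157.240.231.100:443"),
]
```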

### Current pipeline

This section documents the current [ooni/pipeline](https://github.com/ooni/pipeline)
design.

```mermaid
graph LR

Probes --> ProbeServices
ProbeServices --> Fastpath
Fastpath --> S3MiniCans
Fastpath --> S3JSONL
Fastpath --> FastpathClickhouse
S3JSONL --> API
FastpathClickhouse --> API
API --> Explorer
```

```mermaid
classDiagram
direction RL
class CommonMeta{
measurement_uid
report_id
input
domain
probe_cc
probe_asn
test_name
test_start_time
measurement_start_time
platform
software_name
software_version
}

class Measurement{
+Dict test_keys
}

class Fastpath{
anomaly
confirmed
msm_failure
blocking_general
+Dict scores
}
Fastpath "1" --> "1" Measurement
Measurement *-- CommonMeta
Fastpath *-- CommonMeta
```
For documentation on OONI Pipeline v5, see the subdirectory `oonipipeline`.
12 changes: 12 additions & 0 deletions oonipipeline/.env
@@ -0,0 +1,12 @@
COMPOSE_PROJECT_NAME=temporal
CASSANDRA_VERSION=3.11.9
ELASTICSEARCH_VERSION=7.16.2
MYSQL_VERSION=8
TEMPORAL_VERSION=1.23.0
TEMPORAL_UI_VERSION=2.26.2
POSTGRESQL_VERSION=13
POSTGRES_PASSWORD=temporal
POSTGRES_USER=temporal
POSTGRES_DEFAULT_PORT=5432
OPENSEARCH_VERSION=2.5.0
JAEGER_VERSION=1.56
1 change: 1 addition & 0 deletions oonipipeline/.gitignore
@@ -0,0 +1 @@
/_clickhouse-data