diff --git a/.github/workflows/test_oonipipeline.yml b/.github/workflows/test_oonipipeline.yml index abd1cb91..cf22ebe0 100644 --- a/.github/workflows/test_oonipipeline.yml +++ b/.github/workflows/test_oonipipeline.yml @@ -2,7 +2,7 @@ name: test oonipipeline on: push jobs: run_tests: - runs-on: ubuntu-latest + runs-on: ubuntu-20.04 steps: - uses: actions/checkout@v3 diff --git a/.gitignore b/.gitignore index 52d8bbe3..999050a9 100644 --- a/.gitignore +++ b/.gitignore @@ -9,4 +9,3 @@ coverage.xml /output /attic /prof -/clickhouse-data diff --git a/Readme.md b/Readme.md index 8e81ddfe..98159cda 100644 --- a/Readme.md +++ b/Readme.md @@ -7,11 +7,13 @@ Most users will likely be interested in using this as a CLI tool for downloading measurements. If that is your goal, getting started is easy, run: + ``` pip install oonidata ``` You will then be able to download measurements via: + ``` oonidata sync --probe-cc IT --start-day 2022-10-01 --end-day 2022-10-02 --output-dir measurements/ ``` @@ -19,218 +21,6 @@ oonidata sync --probe-cc IT --start-day 2022-10-01 --end-day 2022-10-02 --output This will download all OONI measurements for Italy into the directory `./measurements` that were uploaded between 2022-10-01 and 2022-10-02. -If you are interested in learning more about the design of the analysis tooling, -please read on. - -## Developer setup - -This project makes use of [poetry](https://python-poetry.org/) for dependency -management. Follow [their -instructions](https://python-poetry.org/docs/#installation) on how to set it up. 
- -Once you have done that you should be able to run: -``` -poetry install -poetry run python -m oonidata --help -``` -## Architecture overview - -The analysis engine is made up of several components: -* Observation generation -* Response body archiving -* Ground truth generation -* Experiment result generation - -Below we explain each step of this process in detail - -At a high level the pipeline looks like this: - -```mermaid -graph - M{{Measurement}} --> OGEN[[make_observations]] - OGEN --> |many| O{{Observations}} - NDB[(NetInfoDB)] --> OGEN - OGEN --> RB{{ResponseBodies}} - RB --> BA[(BodyArchive)] - FDB[(FingerprintDB)] --> FPH - FPH --> BA - RB --> FPH[[fingerprint_hunter]] - O --> ODB[(ObservationTables)] - - ODB --> MKGT[[make_ground_truths]] - MKGT --> GTDB[(GroundTruthDB)] - GTDB --> MKER - BA --> MKER - ODB --> MKER[[make_experiment_results]] - MKER --> |one| ER{{ExperimentResult}} -``` - -### Observation generation - -The goal of the Observation generation stage is to take raw OONI measurements -as input data and produce as output observations. - -An observation is a timestamped statement about some network condition that was -observed by a particular vantage point. For example, an observation could be -"the TLS handshake to 8.8.4.4:443 with SNI equal to dns.google failed with -a connection reset by peer error". - -What these observations mean for the -target in question (e.g., is there blocking or is the target down?) is something -that is to be determined when looking at data in aggregate and is the -responsibility of the Verdict generation stage. - -During this stage we are also going to enrich observations with metadata about -IP addresses (using the IPInfoDB). - -Each each measurement ends up producing observations that are all of the same -type and are written to the same DB table. 
- -This has the benefit that we don't need to lookup the observations we care about -in several disparate tables, but can do it all in the same one, which is -incredibly fast. - -A side effect is that we end up with tables are can be a bit sparse (several -columns are NULL). - -The tricky part, in the case of complex tests like web_connectivity, is to -figure out which individual sub measurements fit into the same observation row. -For example we would like to have the TCP connect result to appear in the same -row as the DNS query that lead to it with the TLS handshake towards that IP, -port combination. - -You can run the observation generation with a clickhouse backend like so: -``` -poetry run python -m oonidata mkobs --clickhouse clickhouse://localhost/ --data-dir tests/data/datadir/ --start-day 2022-08-01 --end-day 2022-10-01 --create-tables --parallelism 20 -``` - -Here is the list of supported observations so far: -* [x] WebObservation, which has information about DNS, TCP, TLS and HTTP(s) -* [x] WebControlObservation, has the control measurements run by web connectivity (is used to generate ground truths) -* [ ] CircumventionToolObservation, still needs to be designed and implemented - (ideally we would use the same for OpenVPN, Psiphon, VanillaTor) - -### Response body archiving - -It is optionally possible to also create WAR archives of HTTP response bodies -when running the observation generation. +### OONI Pipeline -This is enabled by passing the extra command line argument `--archives-dir`. - -Whenever a response body is detected in a measurement it is sent to the -archiving queue which takes the response body, looks up in the database if it -has seen it already (so we don't store exact duplicate bodies). -If we haven't archived it yet, we write the body to a WAR file and record it's -sha1 hash together with the filename where we wrote it to into a database. 
- -These WAR archives can then be mined asynchronously for blockpages using the -fingerprint hunter command: -``` -oonidata fphunt --data-dir tests/data/datadir/ --archives-dir warchives/ --parallelism 20 -``` - -When a blockpage matching the fingerprint is detected, the relevant database row -for that fingerprint is updated with the ID of the fingerprint which was -detected. - -### Ground Truth generation - -In order to establish if something is being blocked or not, we need some ground truth for comparison. - -The goal of the ground truth generation task is to build a ground truth -database, which contains all the ground truths for every target that has been -tested in a particular day. - -Currently it's implemented using the WebControlObservations, but in the future -we could just use other WebObservation. - -Each ground truth database is actually just a sqlite3 database. For a given day -it's approximately 150MB in size and we load them in memory when we are running -the analysis workflow. - -### ExperimentResult generation - -An experiment result is the interpretation of one or more observations with a -determination of whether the target is `BLOCKED`, `DOWN` or `OK`. - -For each of these states a confidence indicator is given which is an estimate of the -likelyhood of that result to be accurate. - -For each of the 3 states, it's possible also specify a `blocking_detail`, which -gives more information as to why the block might be occurring. - -It's important to note that for a given measurement, multiple experiment results -can be generated, because a target might be blocked in multiple ways or be OK in -some regards, but not in orders. - -This is best explained through a concrete example. 
Let's say a censor is -blocking https://facebook.com/ with the following logic: -* any DNS query for facebook.com get's as answer "127.0.0.1" -* any TCP connect request to 157.240.231.35 gets a RST -* any TLS handshake with SNI facebook.com gets a RST - -In this scenario, assuming the probe has discovered other IPs for facebook.com -through other means (ex. through the test helper or DoH as web_connectivity 0.5 -does), we would like to emit the following experiment results: -* BLOCKED, `dns.bogon`, `facebook.com` -* BLOCKED, `tcp.rst`, `157.240.231.35:80` -* BLOCKED, `tcp.rst`, `157.240.231.35:443` -* OK, `tcp.ok`, `157.240.231.100:80` -* OK, `tcp.ok`, `157.240.231.100:443` -* BLOCKED, `tls.rst`, `157.240.231.35:443` -* BLOCKED, `tls.rst`, `157.240.231.100:443` - -This way we are fully characterising the block in all the methods through which -it is implemented. - -### Current pipeline - -This section documents the current [ooni/pipeline](https://github.com/ooni/pipeline) -design. - -```mermaid -graph LR - - Probes --> ProbeServices - ProbeServices --> Fastpath - Fastpath --> S3MiniCans - Fastpath --> S3JSONL - Fastpath --> FastpathClickhouse - S3JSONL --> API - FastpathClickhouse --> API - API --> Explorer -``` - -```mermaid -classDiagram - direction RL - class CommonMeta{ - measurement_uid - report_id - input - domain - probe_cc - probe_asn - test_name - test_start_time - measurement_start_time - platform - software_name - software_version - } - - class Measurement{ - +Dict test_keys - } - - class Fastpath{ - anomaly - confirmed - msm_failure - blocking_general - +Dict scores - } - Fastpath "1" --> "1" Measurement - Measurement *-- CommonMeta - Fastpath *-- CommonMeta -``` +For documentation on OONI Pipeline v5, see the subdirectory `oonipipeline`. 
diff --git a/oonipipeline/.env b/oonipipeline/.env new file mode 100644 index 00000000..cc96a50b --- /dev/null +++ b/oonipipeline/.env @@ -0,0 +1,12 @@
+COMPOSE_PROJECT_NAME=temporal
+CASSANDRA_VERSION=3.11.9
+ELASTICSEARCH_VERSION=7.16.2
+MYSQL_VERSION=8
+TEMPORAL_VERSION=1.23.0
+TEMPORAL_UI_VERSION=2.26.2
+POSTGRESQL_VERSION=13
+POSTGRES_PASSWORD=temporal
+POSTGRES_USER=temporal
+POSTGRES_DEFAULT_PORT=5432
+OPENSEARCH_VERSION=2.5.0
+JAEGER_VERSION=1.56
diff --git a/oonipipeline/.gitignore b/oonipipeline/.gitignore new file mode 100644 index 00000000..b537087e --- /dev/null +++ b/oonipipeline/.gitignore @@ -0,0 +1 @@
+/_clickhouse-data
diff --git a/oonipipeline/Design.md b/oonipipeline/Design.md new file mode 100644 index 00000000..5f46a05e --- /dev/null +++ b/oonipipeline/Design.md @@ -0,0 +1,207 @@
+## Architecture overview
+
+The analysis engine is made up of several components:
+
+- Observation generation
+- Response body archiving
+- Ground truth generation
+- Experiment result generation
+
+Below we explain each step of this process in detail.
+
+At a high level the pipeline looks like this:
+
+```mermaid
+graph
+  M{{Measurement}} --> OGEN[[make_observations]]
+  OGEN --> |many| O{{Observations}}
+  NDB[(NetInfoDB)] --> OGEN
+  OGEN --> RB{{ResponseBodies}}
+  RB --> BA[(BodyArchive)]
+  FDB[(FingerprintDB)] --> FPH
+  FPH --> BA
+  RB --> FPH[[fingerprint_hunter]]
+  O --> ODB[(ObservationTables)]
+
+  ODB --> MKGT[[make_ground_truths]]
+  MKGT --> GTDB[(GroundTruthDB)]
+  GTDB --> MKER
+  BA --> MKER
+  ODB --> MKER[[make_experiment_results]]
+  MKER --> |one| ER{{ExperimentResult}}
+```
+
+### Observation generation
+
+The goal of the Observation generation stage is to take raw OONI measurements
+as input data and produce as output observations.
+
+An observation is a timestamped statement about some network condition that was
+observed by a particular vantage point.
For example, an observation could be
+"the TLS handshake to 8.8.4.4:443 with SNI equal to dns.google failed with
+a connection reset by peer error".
+
+What these observations mean for the
+target in question (e.g., is there blocking or is the target down?) is something
+that is to be determined when looking at data in aggregate and is the
+responsibility of the Verdict generation stage.
+
+During this stage we are also going to enrich observations with metadata about
+IP addresses (using the IPInfoDB).
+
+Each measurement ends up producing observations that are all of the same
+type and are written to the same DB table.
+
+This has the benefit that we don't need to look up the observations we care
+about in several disparate tables, but can do it all in the same one, which is
+incredibly fast.
+
+A side effect is that we end up with tables that can be a bit sparse (several
+columns are NULL).
+
+The tricky part, in the case of complex tests like web_connectivity, is to
+figure out which individual sub-measurements fit into the same observation row.
+For example, we would like the TCP connect result to appear in the same row as
+the DNS query that led to it, together with the TLS handshake towards that IP
+and port combination.
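To make the single-table layout concrete, here is a minimal sketch of what a flattened observation row could look like (the field names are illustrative, not the actual observation table schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WebObservationRow:
    # One row groups the DNS, TCP and TLS results that belong together.
    # Most columns are optional, so any given row is sparse (lots of NULLs).
    # Field names here are hypothetical, for illustration only.
    measurement_uid: str
    hostname: Optional[str] = None
    ip: Optional[str] = None
    port: Optional[int] = None
    dns_failure: Optional[str] = None
    tcp_success: Optional[bool] = None
    tls_failure: Optional[str] = None

row = WebObservationRow(
    measurement_uid="20221001T000000Z_webconnectivity_IT",
    hostname="facebook.com",
    ip="157.240.231.35",
    port=443,
    tcp_success=True,
    tls_failure="connection_reset",
)
```

Querying one wide table like this avoids the joins that would otherwise be needed across per-protocol tables, at the cost of the sparseness described above.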
+
+You can run the observation generation with a clickhouse backend like so:
+
+```
+poetry run python -m oonidata mkobs --clickhouse clickhouse://localhost/ --data-dir tests/data/datadir/ --start-day 2022-08-01 --end-day 2022-10-01 --create-tables --parallelism 20
+```
+
+Here is the list of supported observations so far:
+
+- [x] WebObservation, which has information about DNS, TCP, TLS and HTTP(s)
+- [x] WebControlObservation, which has the control measurements run by web connectivity (used to generate ground truths)
+- [ ] CircumventionToolObservation, still needs to be designed and implemented
+      (ideally we would use the same for OpenVPN, Psiphon, VanillaTor)
+
+### Response body archiving
+
+It is optionally possible to also create WARC archives of HTTP response bodies
+when running the observation generation.
+
+This is enabled by passing the extra command line argument `--archives-dir`.
+
+Whenever a response body is detected in a measurement it is sent to the
+archiving queue, which checks in the database whether the body has already been
+seen (so we don't store exact duplicate bodies).
+If we haven't archived it yet, we write the body to a WARC file and record its
+sha1 hash, together with the filename we wrote it to, in a database.
+
+These WARC archives can then be mined asynchronously for blockpages using the
+fingerprint hunter command:
+
+```
+oonidata fphunt --data-dir tests/data/datadir/ --archives-dir warchives/ --parallelism 20
+```
+
+When a blockpage matching a fingerprint is detected, the relevant database row
+is updated with the ID of the fingerprint that was detected.
+
+### Ground Truth generation
+
+In order to establish if something is being blocked or not, we need some ground
+truth for comparison.
+
+The goal of the ground truth generation task is to build a ground truth
+database, which contains all the ground truths for every target that has been
+tested on a particular day.
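To illustrate, a per-day ground truth store can be sketched as an in-memory sqlite3 table (the schema below is hypothetical, not the actual GroundTruthDB layout):

```python
import sqlite3

# Build an in-memory ground truth store for one day.
# NOTE: this table layout is made up for illustration; the real
# GroundTruthDB schema lives in the oonipipeline codebase.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ground_truth ("
    " hostname TEXT, ip TEXT, port INTEGER,"
    " vp_cc TEXT, vp_asn INTEGER,"
    " dns_success INTEGER, tls_success INTEGER)"
)
db.execute(
    "INSERT INTO ground_truth VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("facebook.com", "157.240.231.35", 443, "US", 3269, 1, 1),
)

# The analysis workflow can then ask: did any control vantage point
# successfully resolve this hostname on that day?
row = db.execute(
    "SELECT COUNT(*) FROM ground_truth WHERE hostname = ? AND dns_success = 1",
    ("facebook.com",),
).fetchone()
print(row[0])  # number of successful DNS ground truths
```

Keeping the whole day's table in memory is what makes the comparison against many observations cheap.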
+
+Currently it's implemented using the WebControlObservations, but in the future
+we could also use other WebObservations.
+
+Each ground truth database is actually just a sqlite3 database. For a given day
+it's approximately 150MB in size and we load them into memory when we are
+running the analysis workflow.
+
+### ExperimentResult generation
+
+An experiment result is the interpretation of one or more observations with a
+determination of whether the target is `BLOCKED`, `DOWN` or `OK`.
+
+For each of these states a confidence indicator is given, which is an estimate
+of the likelihood of that result being accurate.
+
+For each of the 3 states, it's also possible to specify a `blocking_detail`,
+which gives more information as to why the block might be occurring.
+
+It's important to note that for a given measurement, multiple experiment results
+can be generated, because a target might be blocked in multiple ways or be OK in
+some regards, but not in others.
+
+This is best explained through a concrete example. Let's say a censor is
+blocking https://facebook.com/ with the following logic:
+
+- any DNS query for facebook.com gets "127.0.0.1" as an answer
+- any TCP connect request to 157.240.231.35 gets an RST
+- any TLS handshake with SNI facebook.com gets an RST
+
+In this scenario, assuming the probe has discovered other IPs for facebook.com
+through other means (e.g. through the test helper or DoH, as web_connectivity
+0.5 does), we would like to emit the following experiment results:
+
+- BLOCKED, `dns.bogon`, `facebook.com`
+- BLOCKED, `tcp.rst`, `157.240.231.35:80`
+- BLOCKED, `tcp.rst`, `157.240.231.35:443`
+- OK, `tcp.ok`, `157.240.231.100:80`
+- OK, `tcp.ok`, `157.240.231.100:443`
+- BLOCKED, `tls.rst`, `157.240.231.35:443`
+- BLOCKED, `tls.rst`, `157.240.231.100:443`
+
+This way we are fully characterising the block in all the methods through which
+it is implemented.
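The facebook.com example above can be sketched as plain data, one result per (outcome, detail, subject) tuple (the class below is illustrative, not the actual ExperimentResult model):

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    # Illustrative sketch, not the real model: one result per
    # (outcome, blocking_detail, subject), with a confidence estimate.
    outcome: str          # "BLOCKED", "DOWN" or "OK"
    blocking_detail: str  # e.g. "dns.bogon", "tcp.rst", "tls.rst"
    subject: str          # hostname or ip:port being characterised
    confidence: float

# A single measurement yields several results, characterising the
# block across all the methods through which it is implemented.
results = [
    ExperimentResult("BLOCKED", "dns.bogon", "facebook.com", 0.9),
    ExperimentResult("BLOCKED", "tcp.rst", "157.240.231.35:443", 0.8),
    ExperimentResult("OK", "tcp.ok", "157.240.231.100:443", 0.8),
    ExperimentResult("BLOCKED", "tls.rst", "157.240.231.100:443", 0.8),
]
blocked = [r for r in results if r.outcome == "BLOCKED"]
```

Note how a target can be simultaneously `BLOCKED` on one address and `OK` on another, which is exactly why one measurement maps to many results.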
+ +### Current pipeline + +This section documents the current [ooni/pipeline](https://github.com/ooni/pipeline) +design. + +```mermaid +graph LR + + Probes --> ProbeServices + ProbeServices --> Fastpath + Fastpath --> S3MiniCans + Fastpath --> S3JSONL + Fastpath --> FastpathClickhouse + S3JSONL --> API + FastpathClickhouse --> API + API --> Explorer +``` + +```mermaid +classDiagram + direction RL + class CommonMeta{ + measurement_uid + report_id + input + domain + probe_cc + probe_asn + test_name + test_start_time + measurement_start_time + platform + software_name + software_version + } + + class Measurement{ + +Dict test_keys + } + + class Fastpath{ + anomaly + confirmed + msm_failure + blocking_general + +Dict scores + } + Fastpath "1" --> "1" Measurement + Measurement *-- CommonMeta + Fastpath *-- CommonMeta +``` diff --git a/oonipipeline/Readme.md b/oonipipeline/Readme.md index 8775c9b2..1c2f87e5 100644 --- a/oonipipeline/Readme.md +++ b/oonipipeline/Readme.md @@ -3,37 +3,121 @@ This it the fifth major iteration of the OONI Data Pipeline. For historical context, these are the major revisions: -* `v0` - The "pipeline" is basically just writing the RAW json files into a public `www` directory. Used until ~2013 -* `v1` - OONI Pipeline based on custom CLI scripts using mongodb as a backend. Used until ~2015. -* `v2` - OONI Pipeline based on [luigi](https://luigi.readthedocs.io/en/stable/). Used until ~2017. -* `v3` - OONI Pipeline based on [airflow](https://airflow.apache.org/). Used until ~2020. -* `v4` - OONI Pipeline basedon custom script and systemd units (aka fastpath). Currently in use in production. -* `v5` - Next generation OONI Pipeline. What this readme is relevant to. Expected to become in production by Q4 2024. + +- `v0` - The "pipeline" is basically just writing the RAW json files into a public `www` directory. Used until ~2013 +- `v1` - OONI Pipeline based on custom CLI scripts using mongodb as a backend. Used until ~2015. 
+- `v2` - OONI Pipeline based on [luigi](https://luigi.readthedocs.io/en/stable/). Used until ~2017.
+- `v3` - OONI Pipeline based on [airflow](https://airflow.apache.org/). Used until ~2020.
+- `v4` - OONI Pipeline based on custom scripts and systemd units (aka fastpath). Currently in use in production.
+- `v5` - Next generation OONI Pipeline, which this readme describes. Expected to be in production by Q4 2024.

## Setup

In order to run the pipeline you should setup the following dependencies:

-* [Temporal for python](https://learn.temporal.io/getting_started/python/dev_environment/)
-* [Clickhouse](https://clickhouse.com/docs/en/install)
-* [hatch](https://hatch.pypa.io/1.9/install/)
+- [Temporal for python](https://learn.temporal.io/getting_started/python/dev_environment/)
+- [Clickhouse](https://clickhouse.com/docs/en/install)
+- [hatch](https://hatch.pypa.io/1.9/install/)

### Quick start

Start temporal dev server:
+
```
temporal server start-dev
```

Start clickhouse server:
+
```
-mkdir -p clickhouse-data
+mkdir -p _clickhouse-data
+cd _clickhouse-data
clickhouse server
```

You can then start the desired workflow, for example to create signal observations for the US:
+
```
hatch run oonipipeline mkobs --probe-cc US --test-name signal --start-day 2024-01-01 --end-day 2024-01-02
```

Monitor the workflow executing by accessing: http://localhost:8233/
+
+If you would like to also collect OpenTelemetry traces, you can set it up like so:
+
+```
+docker run -d --name jaeger \
+  -e COLLECTOR_OTLP_ENABLED=true \
+  -p 16686:16686 \
+  -p 4317:4317 \
+  -p 4318:4318 \
+  jaegertracing/all-in-one:latest
+```
+
+Traces are then visible at the following address: http://localhost:16686/search
+
+### Production usage
+
+By default we use thread-based parallelism, but in production you really want
+multiple worker processes, each running multiple threads.
+
+You should also be using the production temporal server with an elasticsearch
+backend as opposed to the dev server.
+
+To start all the server-side components, we have a handy docker-compose.yml
+that sets everything up.
+
+It can be started from this directory by running:
+
+```
+docker compose up
+```
+
+The important services you can access are the following:
+
+- Temporal UI: http://localhost:8080
+- Superset UI: http://localhost:8083 (u: `admin`, p: `oonity`)
+- OpenTelemetry UI: http://localhost:8088
+
+We don't include a clickhouse instance inside of the docker-compose file by
+design: it's recommended you set that up separately, outside of docker.
+
+To start the worker processes:
+
+```
+hatch run oonipipeline startworkers
+```
+
+Then you can trigger the workflow by passing the `--no-start-workers` flag:
+
+```
+hatch run oonipipeline mkobs --probe-cc US --start-day 2024-01-01 --end-day 2024-01-20 --no-start-workers
+```
+
+#### Superset
+
+Superset is a neat data viz platform.
+
+In order to set it up to speak to your clickhouse instance, assuming it's
+listening on localhost of the host machine, you should:
+
+1. Click Settings -> Data - Database connections
+2. Click + Database
+3. In the Supported Databases drop down pick "Clickhouse Connect"
+4. Enter as Host `host.docker.internal` and port `8123`
+
+Note: `host.docker.internal` only works reliably on Windows, macOS and very
+recent linux+docker versions. On Linux the needed configuration is a bit more
+complex: it requires discovering the gateway IP of the docker network,
+adjusting the clickhouse setup to bind to that IP and setting up correct nft or
+similar firewall rules.
+
+5. Click connect
+6. Go to datasets and click + Dataset
+7. Add all the tables from the `clickhouse` database in the `default` schema.
+   Recommended tables to add are `obs_web` and `measurement_experiment_result`.
+8.
You are now able to start building dashboards + +For more information on superset usage and setup refer to [their +documentation](https://superset.apache.org/docs/). diff --git a/oonipipeline/docker-compose.yml b/oonipipeline/docker-compose.yml new file mode 100644 index 00000000..aa90f17f --- /dev/null +++ b/oonipipeline/docker-compose.yml @@ -0,0 +1,215 @@ +--- +version: "3.5" +services: + +#### Common services + elasticsearch: + container_name: elasticsearch + hostname: elasticsearch + environment: + - cluster.routing.allocation.disk.threshold_enabled=true + - cluster.routing.allocation.disk.watermark.low=512mb + - cluster.routing.allocation.disk.watermark.high=256mb + - cluster.routing.allocation.disk.watermark.flood_stage=128mb + - discovery.type=single-node + - ES_JAVA_OPTS=-Xms256m -Xmx256m + - xpack.security.enabled=false + image: elasticsearch:${ELASTICSEARCH_VERSION} + networks: + - main-network + expose: + - 9200 + volumes: + - ./docker/esdata/:/var/lib/elasticsearch/data + healthcheck: + test: curl -s http://elasticsearch:9200 >/dev/null || exit 1 + interval: 30s + timeout: 10s + retries: 50 + postgresql: + container_name: postgresql + hostname: postgresql + environment: + POSTGRES_PASSWORD: oonipipeline + POSTGRES_USER: oonipipeline + image: postgres:${POSTGRESQL_VERSION} + networks: + - main-network + expose: + - 5432 + volumes: + - ./docker/pgdata:/var/lib/postgresql/data + kibana: + image: docker.elastic.co/kibana/kibana:${ELASTICSEARCH_VERSION} + ports: + - "5601:5601" + environment: + ELASTICSEARCH_URL: http://elasticsearch:9200 + depends_on: + - elasticsearch + networks: + - main-network +#### Temporal + temporal: + container_name: temporal + hostname: temporal + depends_on: + - postgresql + - elasticsearch + environment: + - DB=postgres12 + - DB_PORT=5432 + - POSTGRES_USER=oonipipeline + - POSTGRES_PWD=oonipipeline + - POSTGRES_SEEDS=postgresql + - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development-sql.yaml + - ENABLE_ES=true + - 
ES_SEEDS=elasticsearch + - ES_VERSION=v7 + image: temporalio/auto-setup:${TEMPORAL_VERSION} + networks: + - main-network + ports: + - 7233:7233 + labels: + kompose.volume.type: configMap + volumes: + - ./docker/temporal-config:/etc/temporal/config/dynamicconfig + temporal-admin-tools: + container_name: temporal-admin-tools + depends_on: + - temporal + environment: + - TEMPORAL_ADDRESS=temporal:7233 + - TEMPORAL_CLI_ADDRESS=temporal:7233 + image: temporalio/admin-tools:${TEMPORAL_VERSION} + networks: + - main-network + stdin_open: true + tty: true + temporal-ui: + container_name: temporal-ui + depends_on: + - temporal + environment: + - TEMPORAL_ADDRESS=temporal:7233 + - TEMPORAL_CORS_ORIGINS=http://localhost:3000 + image: temporalio/ui:${TEMPORAL_UI_VERSION} + networks: + - main-network + ports: + - 8080:8080 + +#### Jaeger for open telemetry + jaeger: + image: jaegertracing/all-in-one:${JAEGER_VERSION} + ports: + - "8088:16686" + - "6831:6831/udp" + - "6832:6832/udp" + - "5778:5778" + - "4317:4317" + - "4318:4318" + - "14250:14250" + - "14268:14268" + - "14269:14269" + - "9411:9411" + container_name: jaeger + hostname: jaeger + restart: unless-stopped + networks: + - main-network + environment: + COLLECTOR_ZIPKIN_HOST_PORT: ":9411" + + +### TODO(art): currently jaeger setup with elastic is not working, so we +## are temporarily just using the all-in-one container that's not meant for production use + # jaeger-collector: + # image: jaegertracing/jaeger-collector:${JAEGER_VERSION} + # ports: + # - "14267:14267" + # - "14268:14268" + # - "9411:9411" + # - "4317:4317" + # - "4318:4318" + # depends_on: + # - elasticsearch + # container_name: jaeger-collector + # hostname: jaeger-collector + # restart: unless-stopped + # networks: + # - main-network + # volumes: + # - ./scripts/:/scripts + # environment: + # SPAN_STORAGE_TYPE: "elasticsearch" + # ES_SERVER_URLS: "http://elasticsearch:9200" + # entrypoint: ["/bin/sh", "/scripts/wait-for.sh", "elasticsearch:9200"] + # 
command: + # - "/go/bin/collector-linux" + + # jaeger-agent: + # image: jaegertracing/jaeger-agent:${JAEGER_VERSION} + # ports: + # - "5775:5775/udp" + # - "5778:5778" + # - "6831:6831/udp" + # - "6832:6832/udp" + # depends_on: + # - elasticsearch + # - jaeger-collector + # restart: unless-stopped + # container_name: jaeger-agent + # hostname: jaeger-agent + # networks: + # - main-network + # command: + # - "--reporter.grpc.host-port=jaeger-collector:14250" + + # jaeger-query: + # image: jaegertracing/jaeger-query:${JAEGER_VERSION} + # ports: + # - 8081:16686 + # depends_on: + # - elasticsearch + # - jaeger-collector + # restart: unless-stopped + # container_name: jaeger-query + # hostname: jaeger-query + # networks: + # - main-network + # volumes: + # - ./scripts/:/scripts + # entrypoint: ["/bin/sh", "/scripts/wait-for.sh", "elasticsearch:9200"] + # environment: + # SPAN_STORAGE_TYPE: "elasticsearch" + # ES_SERVER_URLS: "http://elasticsearch:9200" + # command: + # - "/go/bin/query-linux" + +### Superset + superset: + image: ooni/oonipipeline-superset + build: + context: . + dockerfile: ./docker/superset.Dockerfile + + ports: + - "8083:8088" + container_name: superset + hostname: superset + restart: unless-stopped + networks: + - main-network + volumes: + - ./docker/superset-config:/etc/superset + depends_on: + - postgresql + environment: + SUPERSET_CONFIG_PATH: "/etc/superset/superset_config.py" + +networks: + main-network: + driver: bridge + name: main-network diff --git a/oonipipeline/docker/.gitignore b/oonipipeline/docker/.gitignore new file mode 100644 index 00000000..16377704 --- /dev/null +++ b/oonipipeline/docker/.gitignore @@ -0,0 +1 @@ +/pgdata diff --git a/oonipipeline/docker/run-server-with-setup.sh b/oonipipeline/docker/run-server-with-setup.sh new file mode 100644 index 00000000..f3606f9b --- /dev/null +++ b/oonipipeline/docker/run-server-with-setup.sh @@ -0,0 +1,18 @@ +#!/usr/bin/env bash +set -ex + +echo "starting superset" + +if [ ! 
-f /var/run/superset/superset_is_configured ]; then + echo "superset is not configured, setting it up" + superset fab create-admin \ + --username admin \ + --firstname OONI \ + --lastname Tarian \ + --email admin@ooni.org \ + --password oonity + superset db upgrade + superset init + touch /var/run/superset/superset_is_configured +fi +/usr/bin/run-server.sh \ No newline at end of file diff --git a/oonipipeline/src/oonipipeline/workflows/__init__.py b/oonipipeline/docker/superset-config/_isconfigured similarity index 100% rename from oonipipeline/src/oonipipeline/workflows/__init__.py rename to oonipipeline/docker/superset-config/_isconfigured diff --git a/oonipipeline/docker/superset-config/superset_config.py b/oonipipeline/docker/superset-config/superset_config.py new file mode 100644 index 00000000..e4c21b3e --- /dev/null +++ b/oonipipeline/docker/superset-config/superset_config.py @@ -0,0 +1,2 @@ +SQLALCHEMY_DATABASE_URI = 'postgresql://oonipipeline:oonipipeline@postgresql' +SECRET_KEY = 'oonity_superset_supersecret_CHANGEME' diff --git a/oonipipeline/docker/superset.Dockerfile b/oonipipeline/docker/superset.Dockerfile new file mode 100644 index 00000000..0d847942 --- /dev/null +++ b/oonipipeline/docker/superset.Dockerfile @@ -0,0 +1,10 @@ + +FROM apache/superset +USER root +RUN pip install clickhouse-connect + +RUN mkdir -p /var/run/superset/ && chown superset:superset /var/run/superset/ +COPY --chown=superset --chmod=755 ./docker/run-server-with-setup.sh /usr/bin/ + +USER superset +CMD ["/usr/bin/env", "bash", "/usr/bin/run-server-with-setup.sh"] \ No newline at end of file diff --git a/oonipipeline/docker/temporal-config/development-cass.yml b/oonipipeline/docker/temporal-config/development-cass.yml new file mode 100644 index 00000000..4b916163 --- /dev/null +++ b/oonipipeline/docker/temporal-config/development-cass.yml @@ -0,0 +1,3 @@ +system.forceSearchAttributesCacheRefreshOnRead: + - value: true # Dev setup only. Please don't turn this on in production. 
+ constraints: {} diff --git a/oonipipeline/docker/temporal-config/development-sql.yaml b/oonipipeline/docker/temporal-config/development-sql.yaml new file mode 100644 index 00000000..8862dfad --- /dev/null +++ b/oonipipeline/docker/temporal-config/development-sql.yaml @@ -0,0 +1,6 @@ +limit.maxIDLength: + - value: 255 + constraints: {} +system.forceSearchAttributesCacheRefreshOnRead: + - value: true # Dev setup only. Please don't turn this on in production. + constraints: {} diff --git a/oonipipeline/docker/temporal-config/docker.yaml b/oonipipeline/docker/temporal-config/docker.yaml new file mode 100644 index 00000000..e69de29b diff --git a/oonipipeline/pyproject.toml b/oonipipeline/pyproject.toml index f6ae82a5..902bce2f 100644 --- a/oonipipeline/pyproject.toml +++ b/oonipipeline/pyproject.toml @@ -35,6 +35,8 @@ dependencies = [ "flask ~= 2.2.0", "jupyterlab ~= 4.0.7", "temporalio ~= 1.5.1", + "temporalio[opentelemetry] ~= 1.5.1", + "opentelemetry-exporter-otlp-proto-grpc ~= 1.18.0" ] [tool.hatch.build.targets.sdist] @@ -55,6 +57,7 @@ dependencies = [ "memray", "viztracer", "pytest-docker", + "ipdb" ] python = "3.11" path = ".venv/" @@ -65,6 +68,7 @@ path = "src/oonipipeline/__about__.py" [tool.hatch.envs.default.scripts] oonipipeline = "python -m oonipipeline.main {args}" test = "pytest {args:tests}" -test-cov = "pytest -s --full-trace --log-level=INFO --log-cli-level=INFO -v --setup-show --cov=./ --cov-report=xml --cov-report=html --cov-report=term {args:tests}" +# --full-trace --log-level=INFO --log-cli-level=INFO -v --setup-show -s +test-cov = "pytest --cov=./ --cov-report=xml --cov-report=html --cov-report=term {args:tests}" cov-report = ["coverage report"] cov = ["test-cov", "cov-report"] diff --git a/oonipipeline/scripts/wait-for.sh b/oonipipeline/scripts/wait-for.sh new file mode 100755 index 00000000..0acf2a94 --- /dev/null +++ b/oonipipeline/scripts/wait-for.sh @@ -0,0 +1,15 @@ +#!/bin/sh + +set -e + +host="$1" +shift +cmd="$@" + +until wget --spider 
--quiet $host > /dev/null; do + >&2 echo "Waiting for $host to become available..." + sleep 1 +done + +>&2 echo "$host is up - executing command $cmd" +exec $cmd diff --git a/oonipipeline/src/oonipipeline/__about__.py b/oonipipeline/src/oonipipeline/__about__.py index 8f3f4b21..ed042c45 100644 --- a/oonipipeline/src/oonipipeline/__about__.py +++ b/oonipipeline/src/oonipipeline/__about__.py @@ -1 +1 @@ -VERSION = "4.0.0dev1" +VERSION = "5.0.0a0" diff --git a/oonipipeline/src/oonipipeline/analysis/control.py b/oonipipeline/src/oonipipeline/analysis/control.py index 1f910381..6af25208 100644 --- a/oonipipeline/src/oonipipeline/analysis/control.py +++ b/oonipipeline/src/oonipipeline/analysis/control.py @@ -197,6 +197,11 @@ def build_from_rows(self, rows: Iterable): self.db.execute("pragma optimize;") self.create_indexes() + def count_rows(self) -> int: + row = self.db.execute(f"SELECT COUNT() FROM {self._table_name};").fetchone() + assert len(row) == 1 + return row[0] + def build_from_existing(self, db_str: str): with sqlite3.connect(db_str) as src_db: self.db = sqlite3.connect(":memory:") @@ -283,8 +288,13 @@ def select_query( if hostnames: sub_q = "(" sub_q += "OR ".join( - # When hostname was supplied, we only care about it in relation to DNS resolutions - [" hostname = ? AND dns_success = 1 " for _ in range(len(hostnames))] + # When hostname was supplied, we only care about it in relation + # to DNS resolutions, so we only get DNS failure or DNS success + # rows + [ + " hostname = ? 
AND (dns_success = 1 OR dns_failure IS NOT NULL) " + for _ in range(len(hostnames)) + ] ) sub_q += ")" q_args += hostnames diff --git a/oonipipeline/src/oonipipeline/analysis/signal.py b/oonipipeline/src/oonipipeline/analysis/signal.py index 95d0eb99..11ffbc0c 100644 --- a/oonipipeline/src/oonipipeline/analysis/signal.py +++ b/oonipipeline/src/oonipipeline/analysis/signal.py @@ -12,6 +12,9 @@ from ..fingerprintdb import FingerprintDB +## TODO(art): port this over to the new MeasurementExperimentResult model + + def make_signal_experiment_result( web_observations: List[WebObservation], fingerprintdb: FingerprintDB, diff --git a/oonipipeline/src/oonipipeline/analysis/web_analysis.py b/oonipipeline/src/oonipipeline/analysis/web_analysis.py index 48cf7677..0f793a1a 100644 --- a/oonipipeline/src/oonipipeline/analysis/web_analysis.py +++ b/oonipipeline/src/oonipipeline/analysis/web_analysis.py @@ -199,7 +199,7 @@ def make_dns_ground_truth(ground_truths: Iterable[WebGroundTruth]): failure_count = 0 nxdomain_count = 0 for gt in ground_truths: - if gt.dns_success is None: + if gt.dns_success is None and gt.dns_failure is None: continue if gt.dns_failure == "dns_nxdomain_error": @@ -207,7 +207,7 @@ def make_dns_ground_truth(ground_truths: Iterable[WebGroundTruth]): nxdomain_cc_asn.add((gt.vp_cc, gt.vp_asn)) continue - if not gt.dns_success: + if gt.dns_failure is not None: failure_count += gt.count failure_cc_asn.add((gt.vp_cc, gt.vp_asn)) continue @@ -697,18 +697,7 @@ def make_web_analysis( ) if dns_analysis: - website_analysis.dns_ground_truth_nxdomain_count = ( - dns_analysis.ground_truth.nxdomain_count - ) - website_analysis.dns_ground_truth_ok_cc_asn_count = ( - dns_analysis.ground_truth.ok_cc_asn_count - ) - website_analysis.dns_ground_truth_failure_cc_asn_count = ( - dns_analysis.ground_truth.failure_cc_asn_count - ) - website_analysis.dns_ground_truth_nxdomain_cc_asn_count = ( - dns_analysis.ground_truth.nxdomain_cc_asn_count - ) + 
website_analysis.dns_consistency_system_answers = ( dns_analysis.consistency_system.answers ) @@ -775,6 +764,26 @@ def make_web_analysis( website_analysis.dns_consistency_system_answer_asn_ground_truth_asn_count = ( dns_analysis.consistency_system.answer_asn_ground_truth_asn_count ) + + website_analysis.dns_ground_truth_failure_count = ( + dns_analysis.ground_truth.failure_count + ) + website_analysis.dns_ground_truth_ok_count = ( + dns_analysis.ground_truth.ok_count + ) + website_analysis.dns_ground_truth_nxdomain_count = ( + dns_analysis.ground_truth.nxdomain_count + ) + website_analysis.dns_ground_truth_ok_cc_asn_count = ( + dns_analysis.ground_truth.ok_cc_asn_count + ) + website_analysis.dns_ground_truth_failure_cc_asn_count = ( + dns_analysis.ground_truth.failure_cc_asn_count + ) + website_analysis.dns_ground_truth_nxdomain_cc_asn_count = ( + dns_analysis.ground_truth.nxdomain_cc_asn_count + ) + """ website_analysis.dns_ground_truth_nxdomain_cc_asn = ( dns_analysis.ground_truth.nxdomain_cc_asn @@ -782,15 +791,9 @@ def make_web_analysis( website_analysis.dns_ground_truth_failure_cc_asn = ( dns_analysis.ground_truth.failure_cc_asn ) - website_analysis.dns_ground_truth_failure_count = ( - dns_analysis.ground_truth.failure_count - ) website_analysis.dns_ground_truth_ok_cc_asn = ( dns_analysis.ground_truth.ok_cc_asn ) - website_analysis.dns_ground_truth_ok_count = ( - dns_analysis.ground_truth.ok_count - ) website_analysis.dns_ground_truth_other_ips = ( dns_analysis.ground_truth.other_ips ) diff --git a/oonipipeline/src/oonipipeline/analysis/website_experiment_results.py b/oonipipeline/src/oonipipeline/analysis/website_experiment_results.py index de4c4e3d..9c442e60 100644 --- a/oonipipeline/src/oonipipeline/analysis/website_experiment_results.py +++ b/oonipipeline/src/oonipipeline/analysis/website_experiment_results.py @@ -48,10 +48,13 @@ def to_dict(self) -> Dict[str, float]: return d def sum(self) -> float: - s = 0 - for _, val in self.to_dict().items(): - s += 
val - return s + return sum([v for v in self.to_dict().values()]) + + def max(self) -> float: + return max([v for v in self.to_dict().values()]) + + def min(self) -> float: + return min([v for v in self.to_dict().values()]) @dataclass @@ -214,21 +217,23 @@ def calculate_web_loni( blocked_key = "dns.confirmed" blocking_scope = web_analysis.dns_consistency_system_answer_fp_scope blocked_value = 0.9 + down_value = 0.0 if ( web_analysis.dns_consistency_system_is_answer_fp_country_consistent == True ): blocked_key = "dns.confirmed.country_consistent" blocked_value = 1.0 + down_value = 0.0 elif ( web_analysis.dns_consistency_system_is_answer_fp_country_consistent == False ): - # We let the blocked value be slightly less for cases where the fingerprint is not country consistent + # If the fingerprint is not country consistent, we consider it down to avoid false positives blocked_key = "dns.confirmed.not_country_consistent" - blocked_value = 0.8 - ok_value = 0 - down_value = 0 + down_value = 0.8 + blocked_value = 0.2 + ok_value = 0.0 elif web_analysis.dns_consistency_system_is_answer_bogon == True: # Bogons are always fishy, yet we don't know if we see it because # the site is misconfigured. 
@@ -383,6 +388,7 @@ def calculate_web_loni( down_value, blocked_value = 0.0, 0.0 blocked.tcp = OutcomeStatus(key=blocked_key, value=blocked_value) down.tcp = OutcomeStatus(key=down_key, value=down_value) + ok.tcp = OutcomeStatus(key="tcp", value=1 - (blocked.sum() + down.sum())) elif web_analysis.tcp_success == False: analysis_transcript.append("web_analysis.tcp_success == False") @@ -649,7 +655,10 @@ def calculate_web_loni( blocked_value = 0.8 elif web_analysis.http_is_http_fp_false_positive == True: blocked_value = 0.0 - else: + elif ( + web_analysis.http_response_body_length is not None + and web_analysis.http_ground_truth_body_length is not None + ): # We need to apply some fuzzy logic to fingerprint it # TODO(arturo): in the future can use more features, such as the following """ @@ -815,6 +824,9 @@ def make_website_experiment_results( loni_ok_list: List[OutcomeSpace] = [] for wa in web_analysis: loni, analysis_transcript = calculate_web_loni(wa) + log.debug("wa: %s", wa) + log.debug("analysis_transcript: %s", analysis_transcript) + log.debug("loni: %s", loni) analysis_transcript_list.append(analysis_transcript) loni_list.append(loni) loni_blocked_list.append(loni.blocked) @@ -953,7 +965,7 @@ def get_agg_outcome(loni_list, category, agg_func) -> Optional[OutcomeStatus]: ) log.debug(f"final_loni: {final_loni}") - loni_ok_value = final_loni.ok_final + loni_ok_value = final_ok.min() loni_down = final_loni.down.to_dict() loni_down_keys, loni_down_values = list(loni_down.keys()), list(loni_down.values()) diff --git a/oonipipeline/src/oonipipeline/cli/commands.py b/oonipipeline/src/oonipipeline/cli/commands.py index 072948b0..659703d2 100644 --- a/oonipipeline/src/oonipipeline/cli/commands.py +++ b/oonipipeline/src/oonipipeline/cli/commands.py @@ -1,86 +1,204 @@ +from concurrent.futures import ProcessPoolExecutor, as_completed +from dataclasses import dataclass +import dataclasses import logging import multiprocessing from pathlib import Path +import asyncio +import 
signal import sys from typing import List, Optional -from datetime import date, timedelta, datetime +from datetime import date, timedelta, datetime, timezone from typing import List, Optional +import opentelemetry.context +from opentelemetry import trace +from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter +from opentelemetry.sdk.resources import SERVICE_NAME, Resource +from opentelemetry.sdk.trace import TracerProvider +from opentelemetry.sdk.trace.export import BatchSpanProcessor + import click from click_loglevel import LogLevel +from temporalio.runtime import ( + OpenTelemetryConfig, + Runtime as TemporalRuntime, + TelemetryConfig, +) +from temporalio.client import ( + Client as TemporalClient, +) +from temporalio.types import MethodAsyncSingleParam, SelfType, ParamType, ReturnType + +from temporalio.contrib.opentelemetry import TracingInterceptor + +from ..temporal.workers import make_threaded_worker + +from ..temporal.workflows import ( + AnalysisBackfillWorkflow, + BackfillWorkflowParams, + GroundTruthsWorkflow, + GroundTruthsWorkflowParams, + ObservationsBackfillWorkflow, + TASK_QUEUE_NAME, +) + from ..__about__ import VERSION from ..db.connections import ClickhouseConnection from ..db.create_tables import create_queries, list_all_table_diffs from ..netinfo import NetinfoDB -log = logging.getLogger("oonidata") - -import asyncio -import concurrent.futures +def init_runtime_with_telemetry(endpoint: str) -> TemporalRuntime: + provider = TracerProvider(resource=Resource.create({SERVICE_NAME: "oonipipeline"})) + exporter = OTLPSpanExporter( + endpoint=endpoint, insecure=endpoint.startswith("http://") + ) + provider.add_span_processor(BatchSpanProcessor(exporter)) + trace.set_tracer_provider(provider) -from temporalio.client import Client as TemporalClient -from temporalio.worker import Worker, SharedStateManager + return TemporalRuntime( + telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=endpoint)) + ) -from 
temporalio.types import MethodAsyncSingleParam, SelfType, ParamType, ReturnType -from ..workflows.observations import ( - ObservationsWorkflow, - ObservationsWorkflowParams, - make_observation_in_day, -) - -from ..workflows.ground_truths import ( - GroundTruthsWorkflow, - GroundTruthsWorkflowParams, - make_ground_truths_in_day, -) - -from ..workflows.analysis import ( - AnalysisWorkflow, - AnalysisWorkflowParams, - make_analysis_in_a_day, -) +async def temporal_connect(telemetry_endpoint: str, temporal_address: str): + runtime = init_runtime_with_telemetry(telemetry_endpoint) + client = await TemporalClient.connect( + temporal_address, + interceptors=[TracingInterceptor()], + runtime=runtime, + ) + return client -TASK_QUEUE_NAME = "oonipipeline-task-queue" +@dataclass +class WorkerParams: + temporal_address: str + telemetry_endpoint: str + thread_count: int + process_idx: int = 0 -async def run_workflow( +async def start_threaded_worker(params: WorkerParams): + client = await temporal_connect( + telemetry_endpoint=params.telemetry_endpoint, + temporal_address=params.temporal_address, + ) + worker = make_threaded_worker(client, parallelism=params.thread_count) + await worker.run() + + +def run_worker(params: WorkerParams): + try: + asyncio.run(start_threaded_worker(params)) + except KeyboardInterrupt: + print("shutting down") + + +def start_workers(params: WorkerParams, process_count: int): + def signal_handler(signal, frame): + print("shutdown requested: Ctrl+C detected") + sys.exit(0) + + signal.signal(signal.SIGINT, signal_handler) + + process_params = [ + dataclasses.replace(params, process_idx=idx) for idx in range(process_count) + ] + executor = ProcessPoolExecutor(max_workers=process_count) + try: + futures = [executor.submit(run_worker, param) for param in process_params] + for future in as_completed(futures): + future.result() + except KeyboardInterrupt: + print("ctrl+C detected, cancelling tasks...") + for future in futures: + future.cancel() + 
executor.shutdown(wait=True) + print("all tasks have been cancelled and cleaned up") + except Exception as e: + print(f"an error occurred: {e}") + executor.shutdown(wait=False) + raise + + +async def execute_workflow_with_workers( workflow: MethodAsyncSingleParam[SelfType, ParamType, ReturnType], arg: ParamType, - parallelism: int = 5, - temporal_address: str = "localhost:7233", + parallelism, + workflow_id_prefix: str, + telemetry_endpoint: str, + temporal_address: str, ): - client = await TemporalClient.connect(temporal_address) - async with Worker( - client, - task_queue=TASK_QUEUE_NAME, - workflows=[ - ObservationsWorkflow, - GroundTruthsWorkflow, - AnalysisWorkflow, - ], - activities=[ - make_observation_in_day, - make_ground_truths_in_day, - make_analysis_in_a_day, - ], - activity_executor=concurrent.futures.ProcessPoolExecutor(parallelism + 2), - max_concurrent_activities=parallelism, - shared_state_manager=SharedStateManager.create_from_multiprocessing( - multiprocessing.Manager() - ), - ): + click.echo( + f"running workflow {workflow} temporal_address={temporal_address} telemetry_address={telemetry_endpoint} parallelism={parallelism}" + ) + ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S") + client = await temporal_connect( + telemetry_endpoint=telemetry_endpoint, temporal_address=temporal_address + ) + async with make_threaded_worker(client, parallelism=parallelism): await client.execute_workflow( workflow, arg, - id=TASK_QUEUE_NAME, + id=f"{workflow_id_prefix}-{ts}", task_queue=TASK_QUEUE_NAME, ) +async def execute_workflow( + workflow: MethodAsyncSingleParam[SelfType, ParamType, ReturnType], + arg: ParamType, + parallelism, + workflow_id_prefix: str, + telemetry_endpoint: str, + temporal_address: str, +): + click.echo( + f"running workflow {workflow} temporal_address={temporal_address} telemetry_address={telemetry_endpoint} parallelism={parallelism}" + ) + ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S") + client = await temporal_connect( 
+        telemetry_endpoint=telemetry_endpoint, temporal_address=temporal_address
+    )
+    await client.execute_workflow(
+        workflow,
+        arg,
+        id=f"{workflow_id_prefix}-{ts}",
+        task_queue=TASK_QUEUE_NAME,
+    )
+
+
+def run_workflow(
+    workflow: MethodAsyncSingleParam[SelfType, ParamType, ReturnType],
+    arg: ParamType,
+    parallelism,
+    start_workers: bool,
+    workflow_id_prefix: str,
+    telemetry_endpoint: str,
+    temporal_address: str,
+):
+    action = execute_workflow
+    if start_workers:
+        print("also starting workers")
+        action = execute_workflow_with_workers
+    try:
+        asyncio.run(
+            action(
+                workflow=workflow,
+                arg=arg,
+                parallelism=parallelism,
+                workflow_id_prefix=workflow_id_prefix,
+                telemetry_endpoint=telemetry_endpoint,
+                temporal_address=temporal_address,
+            )
+        )
+    except KeyboardInterrupt:
+        print("shutting down")
+
+
 def _parse_csv(ctx, param, s: Optional[str]) -> List[str]:
     if s:
         return s.split(",")
@@ -115,9 +233,35 @@ def _parse_csv(ctx, param, s: Optional[str]) -> List[str]:
     """,
 )
+start_at_option = click.option(
+    "--start-at",
+    type=click.DateTime(),
+    default=str(datetime.now(timezone.utc).date() - timedelta(days=14)),
+    help="""the timestamp of the day for which we should start processing data (inclusive).
+
+    Note: this is the upload date, which doesn't necessarily match the measurement date.
+    """,
+)
+end_at_option = click.option(
+    "--end-at",
+    type=click.DateTime(),
+    default=str(datetime.now(timezone.utc).date() + timedelta(days=1)),
+    help="""the timestamp of the day for which we should stop processing data (inclusive).
+
+    Note: this is the upload date, which doesn't necessarily match the measurement date.
+ """, +) + clickhouse_option = click.option( "--clickhouse", type=str, required=True, default="clickhouse://localhost" ) +telemetry_endpoint_option = click.option( + "--telemetry-endpoint", type=str, required=True, default="http://localhost:4317" +) +temporal_address_option = click.option( + "--temporal-address", type=str, required=True, default="localhost:7233" +) +start_workers_option = click.option("--start-workers/--no-start-workers", default=True) datadir_option = click.option( "--data-dir", @@ -126,10 +270,15 @@ def _parse_csv(ctx, param, s: Optional[str]) -> List[str]: default="tests/data/datadir", help="data directory to store fingerprint and geoip databases", ) +parallelism_option = click.option( + "--parallelism", + type=int, + default=multiprocessing.cpu_count() + 2, + help="number of processes to use. Only works when writing to a database", +) @click.group() -@click.option("--error-log-file", type=Path) @click.option( "-l", "--log-level", @@ -139,13 +288,8 @@ def _parse_csv(ctx, param, s: Optional[str]) -> List[str]: show_default=True, ) @click.version_option(VERSION) -def cli(error_log_file: Path, log_level: int): - log.addHandler(logging.StreamHandler(sys.stderr)) - log.setLevel(log_level) - if error_log_file: - logging.basicConfig( - filename=error_log_file, encoding="utf-8", level=logging.ERROR - ) +def cli(log_level: int): + logging.basicConfig(level=log_level) @cli.command() @@ -155,12 +299,10 @@ def cli(error_log_file: Path, log_level: int): @end_day_option @clickhouse_option @datadir_option -@click.option( - "--parallelism", - type=int, - default=multiprocessing.cpu_count() + 2, - help="number of processes to use. 
Only works when writing to a database", -) +@parallelism_option +@telemetry_endpoint_option +@temporal_address_option +@start_workers_option @click.option( "--fast-fail", is_flag=True, @@ -187,6 +329,9 @@ def mkobs( fast_fail: bool, create_tables: bool, drop_tables: bool, + telemetry_endpoint: str, + temporal_address: str, + start_workers: bool, ): """ Make observations for OONI measurements and write them into clickhouse or a CSV file @@ -207,22 +352,24 @@ def mkobs( NetinfoDB(datadir=Path(data_dir), download=True) click.echo("downloaded netinfodb") - arg = ObservationsWorkflowParams( + params = BackfillWorkflowParams( probe_cc=probe_cc, test_name=test_name, - start_day=start_day, - end_day=end_day, clickhouse=clickhouse, data_dir=str(data_dir), fast_fail=fast_fail, + start_day=start_day, + end_day=end_day, ) - click.echo(f"starting to make observations with arg={arg}") - asyncio.run( - run_workflow( - ObservationsWorkflow.run, - arg, - parallelism=parallelism, - ) + click.echo(f"starting to make observations with params={params}") + run_workflow( + ObservationsBackfillWorkflow.run, + params, + parallelism=parallelism, + workflow_id_prefix="oonipipeline-mkobs", + telemetry_endpoint=telemetry_endpoint, + temporal_address=temporal_address, + start_workers=start_workers, ) @@ -233,12 +380,10 @@ def mkobs( @end_day_option @clickhouse_option @datadir_option -@click.option( - "--parallelism", - type=int, - default=multiprocessing.cpu_count() + 2, - help="number of processes to use. 
Only works when writing to a database", -) +@parallelism_option +@telemetry_endpoint_option +@temporal_address_option +@start_workers_option @click.option( "--fast-fail", is_flag=True, @@ -249,11 +394,6 @@ def mkobs( is_flag=True, help="should we attempt to create the required clickhouse tables", ) -@click.option( - "--rebuild-ground-truths", - is_flag=True, - help="should we force the rebuilding of ground truths", -) def mkanalysis( probe_cc: List[str], test_name: List[str], @@ -264,7 +404,9 @@ def mkanalysis( parallelism: int, fast_fail: bool, create_tables: bool, - rebuild_ground_truths: bool, + telemetry_endpoint: str, + temporal_address: str, + start_workers: bool, ): if create_tables: with ClickhouseConnection(clickhouse) as db: @@ -276,24 +418,23 @@ def mkanalysis( NetinfoDB(datadir=Path(data_dir), download=True) click.echo("downloaded netinfodb") - arg = AnalysisWorkflowParams( + params = BackfillWorkflowParams( probe_cc=probe_cc, test_name=test_name, start_day=start_day, end_day=end_day, clickhouse=clickhouse, data_dir=str(data_dir), - parallelism=parallelism, fast_fail=fast_fail, - rebuild_ground_truths=rebuild_ground_truths, ) - click.echo(f"starting to make analysis with arg={arg}") - asyncio.run( - run_workflow( - AnalysisWorkflow.run, - arg, - parallelism=parallelism, - ) + run_workflow( + AnalysisBackfillWorkflow.run, + params, + parallelism=parallelism, + workflow_id_prefix="oonipipeline-mkanalysis", + telemetry_endpoint=telemetry_endpoint, + temporal_address=temporal_address, + start_workers=start_workers, ) @@ -302,28 +443,64 @@ def mkanalysis( @end_day_option @clickhouse_option @datadir_option +@parallelism_option +@telemetry_endpoint_option +@temporal_address_option +@start_workers_option def mkgt( start_day: str, end_day: str, clickhouse: str, data_dir: Path, + parallelism: int, + telemetry_endpoint: str, + temporal_address: str, + start_workers: bool, ): click.echo("Starting to build ground truths") NetinfoDB(datadir=Path(data_dir), 
download=True) click.echo("downloaded netinfodb") - arg = GroundTruthsWorkflowParams( + params = GroundTruthsWorkflowParams( start_day=start_day, end_day=end_day, clickhouse=clickhouse, data_dir=str(data_dir), ) - click.echo(f"starting to make ground truths with arg={arg}") - asyncio.run( - run_workflow( - GroundTruthsWorkflow.run, - arg, - ) + click.echo(f"starting to make ground truths with arg={params}") + run_workflow( + GroundTruthsWorkflow.run, + params, + parallelism=parallelism, + workflow_id_prefix="oonipipeline-mkgt", + telemetry_endpoint=telemetry_endpoint, + temporal_address=temporal_address, + start_workers=start_workers, + ) + + +@cli.command() +@datadir_option +@parallelism_option +@telemetry_endpoint_option +@temporal_address_option +def startworkers( + data_dir: Path, + parallelism: int, + telemetry_endpoint: str, + temporal_address: str, +): + click.echo(f"starting {parallelism} workers") + click.echo(f"downloading NetinfoDB to {data_dir}") + NetinfoDB(datadir=Path(data_dir), download=True) + click.echo("done downloading netinfodb") + start_workers( + params=WorkerParams( + temporal_address=temporal_address, + telemetry_endpoint=telemetry_endpoint, + thread_count=parallelism, + ), + process_count=parallelism, ) diff --git a/oonipipeline/src/oonipipeline/db/connections.py b/oonipipeline/src/oonipipeline/db/connections.py index 43f29eec..62be40f0 100644 --- a/oonipipeline/src/oonipipeline/db/connections.py +++ b/oonipipeline/src/oonipipeline/db/connections.py @@ -6,6 +6,7 @@ from datetime import datetime, timezone from pprint import pformat import logging +from typing import Optional log = logging.getLogger("oonidata.processing") @@ -26,7 +27,13 @@ def close(self): class ClickhouseConnection(DatabaseConnection): - def __init__(self, conn_url, row_buffer_size=0, max_block_size=1_000_000): + def __init__( + self, + conn_url, + row_buffer_size=0, + max_block_size=1_000_000, + dump_failing_rows: Optional[str] = None, + ): from clickhouse_driver import 
Client self.clickhouse_url = conn_url @@ -37,6 +44,7 @@ def __init__(self, conn_url, row_buffer_size=0, max_block_size=1_000_000): self._column_names = {} self._row_buffer = defaultdict(list) + self.dump_failing_rows = dump_failing_rows def __enter__(self): return self @@ -91,8 +99,10 @@ def flush_rows(self, table_name, rows): time.sleep(0.1) except Exception as exc: log.error(f"Failed to write {row} ({exc}) {query_str}") - with open(f"failing-rows.pickle", "ab") as out_file: - pickle.dump({"query_str": query_str, "row": row}, out_file) + + if self.dump_failing_rows: + with open(self.dump_failing_rows, "ab") as out_file: + pickle.dump({"query_str": query_str, "row": row}, out_file) def flush_all_rows(self): for table_name, rows in self._row_buffer.items(): diff --git a/oonipipeline/src/oonipipeline/temporal/__init__.py b/oonipipeline/src/oonipipeline/temporal/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/oonipipeline/src/oonipipeline/temporal/activities/__init__.py b/oonipipeline/src/oonipipeline/temporal/activities/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/oonipipeline/src/oonipipeline/temporal/activities/analysis.py b/oonipipeline/src/oonipipeline/temporal/activities/analysis.py new file mode 100644 index 00000000..37b3615a --- /dev/null +++ b/oonipipeline/src/oonipipeline/temporal/activities/analysis.py @@ -0,0 +1,241 @@ +import dataclasses +from dataclasses import dataclass +import pathlib + +from datetime import datetime +from typing import Dict, List + +import opentelemetry.trace +from temporalio import workflow, activity + +with workflow.unsafe.imports_passed_through(): + import clickhouse_driver + + import orjson + + from oonidata.models.analysis import WebAnalysis + from oonidata.models.experiment_result import MeasurementExperimentResult + + from ...analysis.control import BodyDB, WebGroundTruthDB + from ...analysis.datasources import iter_web_observations + from ...analysis.web_analysis import 
make_web_analysis + from ...analysis.website_experiment_results import make_website_experiment_results + from ...db.connections import ClickhouseConnection + from ...fingerprintdb import FingerprintDB + + from ..common import ( + get_prev_range, + make_db_rows, + maybe_delete_prev_range, + ) + +log = activity.logger + + +def make_cc_batches( + cnt_by_cc: Dict[str, int], + probe_cc: List[str], + parallelism: int, +) -> List[List[str]]: + """ + The goal of this function is to spread the load of each batch of + measurements by probe_cc. This allows us to parallelize analysis on a + per-country basis based on the number of measurements. + We assume that the measurements are uniformly distributed over the tested + interval and then break them up into a number of batches equivalent to the + parallelism count based on the number of measurements in each country. + + Here is a concrete example, suppose we have 3 countries IT, IR, US with 300, + 400, 1000 measurements respectively and a parallelism of 2, we will be + creating 2 batches where the first has in it IT, IR and the second has US. + """ + if len(probe_cc) > 0: + selected_ccs_with_cnt = set(probe_cc).intersection(set(cnt_by_cc.keys())) + if len(selected_ccs_with_cnt) == 0: + raise Exception( + f"No observations for {probe_cc} in the time range. Try adjusting the date range or choosing different countries" + ) + # We remove from the cnt_by_cc all the countries we are not interested in + cnt_by_cc = {k: cnt_by_cc[k] for k in selected_ccs_with_cnt} + + total_obs_cnt = sum(cnt_by_cc.values()) + + # We assume uniform distribution of observations per (country, day) + max_obs_per_batch = total_obs_cnt / parallelism + + # We break up the countries into batches where the count of observations in + # each batch is roughly equal. + # This is done so that we can spread the load based on the countries in + # addition to the time range. 
+ cc_batches = [] + current_cc_batch_size = 0 + current_cc_batch = [] + cnt_by_cc_sorted = sorted(cnt_by_cc.items(), key=lambda x: x[0]) + while cnt_by_cc_sorted: + while current_cc_batch_size <= max_obs_per_batch: + try: + cc, cnt = cnt_by_cc_sorted.pop() + except IndexError: + break + current_cc_batch.append(cc) + current_cc_batch_size += cnt + cc_batches.append(current_cc_batch) + current_cc_batch = [] + current_cc_batch_size = 0 + if len(current_cc_batch) > 0: + cc_batches.append(current_cc_batch) + return cc_batches + + +@dataclass +class MakeAnalysisParams: + probe_cc: List[str] + test_name: List[str] + clickhouse: str + data_dir: str + fast_fail: bool + day: str + + +@activity.defn +def make_analysis_in_a_day(params: MakeAnalysisParams) -> dict: + data_dir = pathlib.Path(params.data_dir) + clickhouse = params.clickhouse + day = datetime.strptime(params.day, "%Y-%m-%d").date() + probe_cc = params.probe_cc + test_name = params.test_name + + tracer = opentelemetry.trace.get_tracer(__name__) + + with opentelemetry.trace.get_current_span(): + fingerprintdb = FingerprintDB(datadir=data_dir, download=False) + body_db = BodyDB(db=ClickhouseConnection(clickhouse)) + db_writer = ClickhouseConnection(clickhouse, row_buffer_size=10_000) + db_lookup = ClickhouseConnection(clickhouse) + + column_names_wa = [f.name for f in dataclasses.fields(WebAnalysis)] + column_names_er = [ + f.name for f in dataclasses.fields(MeasurementExperimentResult) + ] + + # TODO(art): this previous range search and deletion makes the idempotence + # of the activity not 100% accurate. + # We should look into fixing it. 
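The batching strategy described in the `make_cc_batches` docstring above can be illustrated with a standalone sketch. Note this is a simplified reimplementation for illustration only (it drops the `probe_cc` filtering step), not the function the pipeline imports:

```python
# Illustrative sketch of the country-batching strategy documented in the
# make_cc_batches docstring: spread observation load across `parallelism`
# batches, capping each batch at an equal share of the total count.
from typing import Dict, List


def batch_by_cc(cnt_by_cc: Dict[str, int], parallelism: int) -> List[List[str]]:
    total_obs_cnt = sum(cnt_by_cc.values())
    # Assuming measurements are uniformly distributed over the interval,
    # cap each batch at an equal share of the total observation count.
    max_obs_per_batch = total_obs_cnt / parallelism

    cc_batches: List[List[str]] = []
    current_batch: List[str] = []
    current_size = 0
    # Sort by country code so batching is deterministic; pop() then
    # consumes the list from the end.
    remaining = sorted(cnt_by_cc.items(), key=lambda x: x[0])
    while remaining:
        while current_size <= max_obs_per_batch:
            try:
                cc, cnt = remaining.pop()
            except IndexError:
                break
            current_batch.append(cc)
            current_size += cnt
        cc_batches.append(current_batch)
        current_batch = []
        current_size = 0
    if current_batch:
        cc_batches.append(current_batch)
    return cc_batches


# The docstring's example: IT=300, IR=400, US=1000 with parallelism=2
# splits into one batch holding US alone and one holding IT and IR.
batches = batch_by_cc({"IT": 300, "IR": 400, "US": 1000}, parallelism=2)
```

With 1700 total observations and a parallelism of 2, the per-batch cap is 850, so US (1000) overflows into its own batch while IT and IR (700 combined) share the other.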
+ prev_range_list = [ + get_prev_range( + db=db_lookup, + table_name=WebAnalysis.__table_name__, + timestamp=datetime.combine(day, datetime.min.time()), + test_name=[], + probe_cc=probe_cc, + timestamp_column="measurement_start_time", + ), + get_prev_range( + db=db_lookup, + table_name=MeasurementExperimentResult.__table_name__, + timestamp=datetime.combine(day, datetime.min.time()), + test_name=[], + probe_cc=probe_cc, + timestamp_column="timeofday", + probe_cc_column="location_network_cc", + ), + ] + + log.info(f"loading ground truth DB for {day}") + with tracer.start_as_current_span( + "MakeObservations:load_ground_truths" + ) as span: + ground_truth_db_path = ( + data_dir / "ground_truths" / f"web-{day.strftime('%Y-%m-%d')}.sqlite3" + ) + web_ground_truth_db = WebGroundTruthDB() + web_ground_truth_db.build_from_existing( + str(ground_truth_db_path.absolute()) + ) + log.info(f"loaded ground truth DB for {day}") + span.add_event(f"loaded ground truth DB for {day}") + span.set_attribute("day", day.strftime("%Y-%m-%d")) + span.set_attribute( + "ground_truth_row_count", web_ground_truth_db.count_rows() + ) + + failures = 0 + no_exp_results = 0 + observation_count = 0 + with tracer.start_as_current_span( + "MakeObservations:iter_web_observations" + ) as span: + for web_obs in iter_web_observations( + db_lookup, + measurement_day=day, + probe_cc=probe_cc, + test_name="web_connectivity", + ): + try: + relevant_gts = web_ground_truth_db.lookup_by_web_obs( + web_obs=web_obs + ) + except: + log.error( + f"failed to lookup relevant_gts for {web_obs[0].measurement_uid}", + exc_info=True, + ) + failures += 1 + continue + + try: + website_analysis = list( + make_web_analysis( + web_observations=web_obs, + body_db=body_db, + web_ground_truths=relevant_gts, + fingerprintdb=fingerprintdb, + ) + ) + if len(website_analysis) == 0: + log.info(f"no website analysis for {probe_cc}, {test_name}") + no_exp_results += 1 + continue + + observation_count += 1 + table_name, rows = 
make_db_rows( + dc_list=website_analysis, column_names=column_names_wa + ) + + db_writer.write_rows( + table_name=table_name, + rows=rows, + column_names=column_names_wa, + ) + + website_er = list(make_website_experiment_results(website_analysis)) + table_name, rows = make_db_rows( + dc_list=website_er, + column_names=column_names_er, + custom_remap={"loni_list": orjson.dumps}, + ) + + db_writer.write_rows( + table_name=table_name, + rows=rows, + column_names=column_names_er, + ) + + except: + web_obs_ids = ",".join(map(lambda wo: wo.observation_id, web_obs)) + log.error( + f"failed to generate analysis for {web_obs_ids}", exc_info=True + ) + failures += 1 + + span.set_attribute("total_failure_count", failures) + span.set_attribute("total_observation_count", observation_count) + span.set_attribute("no_experiment_results_count", no_exp_results) + span.set_attribute("day", day.strftime("%Y-%m-%d")) + span.set_attribute("probe_cc", probe_cc) + + for prev_range in prev_range_list: + maybe_delete_prev_range(db=db_lookup, prev_range=prev_range) + db_writer.close() + + return {"count": observation_count} diff --git a/oonipipeline/src/oonipipeline/temporal/activities/common.py b/oonipipeline/src/oonipipeline/temporal/activities/common.py new file mode 100644 index 00000000..f623fcc6 --- /dev/null +++ b/oonipipeline/src/oonipipeline/temporal/activities/common.py @@ -0,0 +1,48 @@ +from dataclasses import dataclass +from datetime import date +from typing import Dict, List, Tuple +from oonipipeline.db.connections import ClickhouseConnection +from oonipipeline.db.create_tables import create_queries + +from temporalio import activity + + +@dataclass +class ClickhouseParams: + clickhouse_url: str + + +@activity.defn +def optimize_all_tables(params: ClickhouseParams): + with ClickhouseConnection(params.clickhouse_url) as db: + for _, table_name in create_queries: + db.execute(f"OPTIMIZE TABLE {table_name}") + + +@dataclass +class ObsCountParams: + clickhouse_url: str + # 
TODO(art): we should also be using test_name here + # test_name: List[str] + start_day: str + end_day: str + table_name: str = "obs_web" + + +@activity.defn +def get_obs_count_by_cc( + params: ObsCountParams, +) -> Dict[str, int]: + with ClickhouseConnection(params.clickhouse_url) as db: + q = f""" + SELECT + probe_cc, COUNT() + FROM {params.table_name} + WHERE measurement_start_time > %(start_day)s AND measurement_start_time < %(end_day)s + GROUP BY probe_cc + """ + cc_list: List[Tuple[str, int]] = db.execute( + q, {"start_day": params.start_day, "end_day": params.end_day} + ) # type: ignore + assert isinstance(cc_list, list) + return dict(cc_list) diff --git a/oonipipeline/src/oonipipeline/temporal/activities/ground_truths.py b/oonipipeline/src/oonipipeline/temporal/activities/ground_truths.py new file mode 100644 index 00000000..54df2fe4 --- /dev/null +++ b/oonipipeline/src/oonipipeline/temporal/activities/ground_truths.py @@ -0,0 +1,54 @@ +from dataclasses import dataclass +import pathlib +import logging + +from datetime import datetime + +from temporalio import workflow, activity + +with workflow.unsafe.imports_passed_through(): + import clickhouse_driver + + from oonidata.datautils import PerfTimer + from ...analysis.control import WebGroundTruthDB, iter_web_ground_truths + from ...netinfo import NetinfoDB + from ...db.connections import ( + ClickhouseConnection, + ) + +log = activity.logger + + +@dataclass +class MakeGroundTruthsParams: + clickhouse: str + data_dir: str + day: str + + +def get_ground_truth_db_path(data_dir: str, day: str): + ground_truth_dir = pathlib.Path(data_dir) / "ground_truths" + ground_truth_dir.mkdir(exist_ok=True) + return ground_truth_dir / f"web-{day}.sqlite3" + + +@activity.defn +def make_ground_truths_in_day(params: MakeGroundTruthsParams): + clickhouse = params.clickhouse + + db = ClickhouseConnection(clickhouse) + netinfodb = NetinfoDB(datadir=pathlib.Path(params.data_dir), download=False) + + dst_path = 
get_ground_truth_db_path(data_dir=params.data_dir, day=params.day)
+
+    if dst_path.exists():
+        dst_path.unlink()
+
+    t = PerfTimer()
+    day = datetime.strptime(params.day, "%Y-%m-%d").date()
+    log.info(f"building ground truth DB for {day}")
+    web_ground_truth_db = WebGroundTruthDB(connect_str=str(dst_path.absolute()))
+    web_ground_truth_db.build_from_rows(
+        rows=iter_web_ground_truths(db=db, measurement_day=day, netinfodb=netinfodb)
+    )
+    log.info(f"built ground truth DB {day} in {t.pretty}")
diff --git a/oonipipeline/src/oonipipeline/temporal/activities/observations.py b/oonipipeline/src/oonipipeline/temporal/activities/observations.py
new file mode 100644
index 00000000..51631fa9
--- /dev/null
+++ b/oonipipeline/src/oonipipeline/temporal/activities/observations.py
@@ -0,0 +1,201 @@
+from dataclasses import dataclass
+import dataclasses
+from typing import List, Sequence, Tuple
+from oonidata.dataclient import (
+    ccs_set,
+    list_file_entries_batches,
+    load_measurement,
+    stream_measurements,
+)
+from oonidata.datautils import PerfTimer
+from oonidata.models.nettests import SupportedDataformats
+from oonipipeline.db.connections import ClickhouseConnection
+from oonipipeline.netinfo import NetinfoDB
+from oonipipeline.temporal.common import (
+    get_prev_range,
+    make_db_rows,
+    maybe_delete_prev_range,
+)
+
+from opentelemetry import trace
+
+from temporalio import activity
+
+
+import pathlib
+from datetime import datetime, timedelta
+
+from oonipipeline.transforms.observations import measurement_to_observations
+
+log = activity.logger
+
+
+@dataclass
+class MakeObservationsParams:
+    probe_cc: List[str]
+    test_name: List[str]
+    clickhouse: str
+    data_dir: str
+    fast_fail: bool
+    bucket_date: str
+
+
+def write_observations_to_db(
+    msmt: SupportedDataformats,
+    netinfodb: NetinfoDB,
+    db: ClickhouseConnection,
+    bucket_date: str,
+):
+    for observations in measurement_to_observations(msmt=msmt, netinfodb=netinfodb):
+        if len(observations) == 0:
+            continue
+
+        
column_names = [f.name for f in dataclasses.fields(observations[0])]
+        table_name, rows = make_db_rows(
+            bucket_date=bucket_date,
+            dc_list=observations,
+            column_names=column_names,
+        )
+        db.write_rows(table_name=table_name, rows=rows, column_names=column_names)
+
+
+def make_observations_for_file_entry_batch(
+    file_entry_batch: Sequence[Tuple[str, str, str, int]],
+    clickhouse: str,
+    row_buffer_size: int,
+    data_dir: pathlib.Path,
+    bucket_date: str,
+    probe_cc: List[str],
+    fast_fail: bool,
+):
+    netinfodb = NetinfoDB(datadir=data_dir, download=False)
+    tbatch = PerfTimer()
+
+    tracer = trace.get_tracer(__name__)
+
+    total_failure_count = 0
+    current_span = trace.get_current_span()
+    with current_span, ClickhouseConnection(
+        clickhouse, row_buffer_size=row_buffer_size
+    ) as db:
+        ccs = ccs_set(probe_cc)
+        idx = 0
+        for bucket_name, s3path, ext, fe_size in file_entry_batch:
+            failure_count = 0
+            # Nest the traced span within the current span
+            with tracer.start_as_current_span(
+                "MakeObservations:stream_file_entry"
+            ) as span:
+                log.debug(f"processing file s3://{bucket_name}/{s3path}")
+                t = PerfTimer()
+                try:
+                    for msmt_dict in stream_measurements(
+                        bucket_name=bucket_name, s3path=s3path, ext=ext
+                    ):
+                        # Legacy cans don't allow us to pre-filter on the probe_cc, so
+                        # we need to check for probe_cc consistency in here.
+                        if ccs and msmt_dict["probe_cc"] not in ccs:
+                            continue
+                        msmt = None
+                        try:
+                            t_msmt = PerfTimer()  # per-measurement timer, kept separate from the per-file timer above
+                            msmt = load_measurement(msmt_dict)
+                            if not msmt.test_keys:
+                                log.error(
+                                    f"measurement with empty test_keys: ({msmt.measurement_uid})",
+                                    exc_info=True,
+                                )
+                                continue
+                            write_observations_to_db(msmt, netinfodb, db, bucket_date)
+                            idx += 1
+                        except Exception as exc:
+                            msmt_str = msmt_dict.get("report_id", None)
+                            if msmt:
+                                msmt_str = msmt.measurement_uid
+                            log.error(
+                                f"failed at idx: {idx} ({msmt_str})", exc_info=True
+                            )
+                            failure_count += 1
+
+                            if fast_fail:
+                                db.close()
+                                raise exc
+                    log.debug(f"done processing file s3://{bucket_name}/{s3path}")
+                except Exception as exc:
+                    log.error(
+                        f"failed to stream measurements from s3://{bucket_name}/{s3path}"
+                    )
+                    log.error(exc)
+                # TODO(art): figure out if the rate of these metrics is too
+                # much. For each processed file a telemetry event is generated.
+                span.set_attribute("kb_per_sec", fe_size / 1024 / t.s)
+                span.set_attribute("fe_size", fe_size)
+                span.set_attribute("failure_count", failure_count)
+                span.add_event(f"s3_path: s3://{bucket_name}/{s3path}")
+                total_failure_count += failure_count
+
+    current_span.set_attribute("total_runtime_ms", tbatch.ms)
+    current_span.set_attribute("total_failure_count", total_failure_count)
+    return idx
+
+
+@activity.defn
+def make_observation_in_day(params: MakeObservationsParams) -> dict:
+    day = datetime.strptime(params.bucket_date, "%Y-%m-%d").date()
+
+    # TODO(art): this previous range search and deletion makes the idempotence
+    # of the activity not 100% accurate.
+    # We should look into fixing it.
+    with ClickhouseConnection(params.clickhouse, row_buffer_size=10_000) as db:
+        prev_ranges = []
+        for table_name in ["obs_web"]:
+            prev_ranges.append(
+                (
+                    table_name,
+                    get_prev_range(
+                        db=db,
+                        table_name=table_name,
+                        bucket_date=params.bucket_date,
+                        test_name=params.test_name,
+                        probe_cc=params.probe_cc,
+                    ),
+                )
+            )
+    log.info(f"prev_ranges: {prev_ranges}")
+
+    t = PerfTimer()
+    total_t = PerfTimer()
+    file_entry_batches, total_size = list_file_entries_batches(
+        probe_cc=params.probe_cc,
+        test_name=params.test_name,
+        start_day=day,
+        end_day=day + timedelta(days=1),
+    )
+    log.info(f"listing {len(file_entry_batches)} batches took {t.pretty}")
+
+    total_msmt_count = 0
+    for batch in file_entry_batches:
+        msmt_cnt = make_observations_for_file_entry_batch(
+            batch,
+            params.clickhouse,
+            10_000,
+            pathlib.Path(params.data_dir),
+            params.bucket_date,
+            params.probe_cc,
+            params.fast_fail,
+        )
+        total_msmt_count += msmt_cnt
+
+    mb_per_sec = round(total_size / total_t.s / 10**6, 1)
+    msmt_per_sec = round(total_msmt_count / total_t.s)
+    log.info(
+        f"finished processing all batches in {total_t.pretty} speed: {mb_per_sec}MB/s ({msmt_per_sec}msmt/s)"
+    )
+
+    if len(prev_ranges) > 0:
+        with ClickhouseConnection(params.clickhouse, row_buffer_size=10_000) as db:
+            for table_name, pr in prev_ranges:
+                log.info(f"deleting previous range of {pr}")
+                maybe_delete_prev_range(db=db, prev_range=pr)
+
+    return {"size": total_size, "measurement_count": total_msmt_count}
diff --git a/oonipipeline/src/oonipipeline/workflows/common.py b/oonipipeline/src/oonipipeline/temporal/common.py
similarity index 79%
rename from oonipipeline/src/oonipipeline/workflows/common.py
rename to oonipipeline/src/oonipipeline/temporal/common.py
index 25764d3a..6c1bf32e 100644
--- a/oonipipeline/src/oonipipeline/workflows/common.py
+++ b/oonipipeline/src/oonipipeline/temporal/common.py
@@ -4,7 +4,7 @@
 import multiprocessing as mp
 from multiprocessing.synchronize import Event as EventClass
 
-from 
datetime import date, datetime, timedelta
+from datetime import datetime, timedelta
 
 from typing import (
     Any,
@@ -21,7 +21,6 @@
     MeasurementListProgress,
 )
 from ..db.connections import ClickhouseConnection
-from ..db.create_tables import create_queries
 
 log = logging.getLogger("oonidata.processing")
 
@@ -89,7 +88,7 @@ def maybe_delete_prev_range(db: ClickhouseConnection, prev_range: PrevRange):
        q_args["max_created_at"] = prev_range.max_created_at
        q_args["min_created_at"] = prev_range.min_created_at
        where = f"{where} AND created_at <= %(max_created_at)s AND created_at >= %(min_created_at)s"
-        log.info(f"running {where} with {q_args}")
+        log.debug(f"running {where} with {q_args}")
 
     q = f"ALTER TABLE {prev_range.table_name} DELETE "
     final_query = q + where
@@ -165,27 +164,6 @@
     return prev_range
 
 
-def optimize_all_tables(clickhouse):
-    with ClickhouseConnection(clickhouse) as db:
-        for _, table_name in create_queries:
-            db.execute(f"OPTIMIZE TABLE {table_name}")
-
-
-def get_obs_count_by_cc(
-    db: ClickhouseConnection,
-    test_name: List[str],
-    start_day: date,
-    end_day: date,
-    table_name: str = "obs_web",
-) -> Dict[str, int]:
-    q = f"SELECT probe_cc, COUNT() FROM {table_name} WHERE measurement_start_time > %(start_day)s AND measurement_start_time < %(end_day)s GROUP BY probe_cc"
-    cc_list: List[Tuple[str, int]] = db.execute(
-        q, {"start_day": start_day, "end_day": end_day}
-    )  # type: ignore
-    assert isinstance(cc_list, list)
-    return dict(cc_list)
-
-
 def make_db_rows(
     dc_list: List,
     column_names: List[str],
@@ -208,32 +186,3 @@ def maybe_remap(k, value):
     rows.append(tuple(maybe_remap(k, getattr(d, k)) for k in column_names))
 
     return table_name, rows
-
-
-class StatusMessage(NamedTuple):
-    src: str
-    exception: Optional[Exception] = None
-    traceback: Optional[str] = None
-    progress: Optional[MeasurementListProgress] = None
-    idx: Optional[int] = None
-    day_str: Optional[str] = None
-    archive_queue_size: Optional[int] = None
-
-
-def 
run_progress_thread(
-    status_queue: mp.Queue, shutdown_event: EventClass, desc: str = "analyzing data"
-):
-    pbar = tqdm(position=0)
-
-    log.info("starting error handling thread")
-    while not shutdown_event.is_set():
-        try:
-            count = status_queue.get(block=True, timeout=0.1)
-        except queue.Empty:
-            continue
-
-        try:
-            pbar.update(count)
-            pbar.set_description(desc)
-        finally:
-            status_queue.task_done()  # type: ignore
diff --git a/oonipipeline/src/oonipipeline/workflows/to_port/fingerprint_hunter.py b/oonipipeline/src/oonipipeline/temporal/to_port/fingerprint_hunter.py
similarity index 100%
rename from oonipipeline/src/oonipipeline/workflows/to_port/fingerprint_hunter.py
rename to oonipipeline/src/oonipipeline/temporal/to_port/fingerprint_hunter.py
diff --git a/oonipipeline/src/oonipipeline/workflows/to_port/response_archiver.py b/oonipipeline/src/oonipipeline/temporal/to_port/response_archiver.py
similarity index 100%
rename from oonipipeline/src/oonipipeline/workflows/to_port/response_archiver.py
rename to oonipipeline/src/oonipipeline/temporal/to_port/response_archiver.py
diff --git a/oonipipeline/src/oonipipeline/temporal/workers.py b/oonipipeline/src/oonipipeline/temporal/workers.py
new file mode 100644
index 00000000..93d5e3b8
--- /dev/null
+++ b/oonipipeline/src/oonipipeline/temporal/workers.py
@@ -0,0 +1,64 @@
+import multiprocessing
+from oonipipeline.temporal.activities.analysis import make_analysis_in_a_day
+from oonipipeline.temporal.activities.common import (
+    get_obs_count_by_cc,
+    optimize_all_tables,
+)
+from oonipipeline.temporal.activities.ground_truths import make_ground_truths_in_day
+from oonipipeline.temporal.activities.observations import make_observation_in_day
+from oonipipeline.temporal.workflows import (
+    TASK_QUEUE_NAME,
+    AnalysisBackfillWorkflow,
+    AnalysisWorkflow,
+    GroundTruthsWorkflow,
+    ObservationsBackfillWorkflow,
+    ObservationsWorkflow,
+)
+
+
+from temporalio.client import Client as TemporalClient
+from temporalio.worker 
import SharedStateManager, Worker
+
+
+from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
+
+WORKFLOWS = [
+    ObservationsWorkflow,
+    GroundTruthsWorkflow,
+    AnalysisWorkflow,
+    ObservationsBackfillWorkflow,
+    AnalysisBackfillWorkflow,
+]
+
+ACTIVITIES = [
+    make_observation_in_day,
+    make_ground_truths_in_day,
+    make_analysis_in_a_day,
+    optimize_all_tables,
+    get_obs_count_by_cc,
+]
+
+
+def make_threaded_worker(client: TemporalClient, parallelism: int) -> Worker:
+    return Worker(
+        client,
+        task_queue=TASK_QUEUE_NAME,
+        workflows=WORKFLOWS,
+        activities=ACTIVITIES,
+        activity_executor=ThreadPoolExecutor(parallelism + 2),
+        max_concurrent_activities=parallelism,
+    )
+
+
+def make_multiprocess_worker(client: TemporalClient, parallelism: int) -> Worker:
+    return Worker(
+        client,
+        task_queue=TASK_QUEUE_NAME,
+        workflows=WORKFLOWS,
+        activities=ACTIVITIES,
+        activity_executor=ProcessPoolExecutor(parallelism + 2),
+        max_concurrent_activities=parallelism,
+        shared_state_manager=SharedStateManager.create_from_multiprocessing(
+            multiprocessing.Manager()
+        ),
+    )
diff --git a/oonipipeline/src/oonipipeline/temporal/workflows.py b/oonipipeline/src/oonipipeline/temporal/workflows.py
new file mode 100644
index 00000000..738e47f1
--- /dev/null
+++ b/oonipipeline/src/oonipipeline/temporal/workflows.py
@@ -0,0 +1,410 @@
+from dataclasses import dataclass
+from typing import List, Optional
+
+import logging
+import asyncio
+from datetime import datetime, timedelta, timezone
+
+
+from temporalio import workflow
+from temporalio.common import SearchAttributeKey
+from temporalio.client import (
+    Client as TemporalClient,
+    Schedule,
+    ScheduleActionStartWorkflow,
+    ScheduleIntervalSpec,
+    ScheduleSpec,
+    ScheduleState,
+)
+
+from oonipipeline.temporal.activities.common import (
+    optimize_all_tables,
+    ClickhouseParams,
+)
+from oonipipeline.temporal.activities.ground_truths import get_ground_truth_db_path
+
+with workflow.unsafe.imports_passed_through():
+    
import clickhouse_driver
+
+    from oonidata.dataclient import date_interval
+    from oonidata.datautils import PerfTimer
+    from oonipipeline.db.connections import ClickhouseConnection
+    from oonipipeline.temporal.activities.analysis import (
+        MakeAnalysisParams,
+        log,
+        make_analysis_in_a_day,
+        make_cc_batches,
+    )
+    from oonipipeline.temporal.activities.common import (
+        get_obs_count_by_cc,
+        ObsCountParams,
+    )
+    from oonipipeline.temporal.activities.observations import (
+        MakeObservationsParams,
+        make_observation_in_day,
+    )
+
+    from oonipipeline.temporal.activities.ground_truths import (
+        MakeGroundTruthsParams,
+        make_ground_truths_in_day,
+    )
+
+# Handle temporal sandbox violations related to calls to self.processName =
+# mp.current_process().name in logger, see:
+# https://github.com/python/cpython/blob/1316692e8c7c1e1f3b6639e51804f9db5ed892ea/Lib/logging/__init__.py#L362
+logging.logMultiprocessing = False
+
+log = workflow.logger
+
+TASK_QUEUE_NAME = "oonipipeline-task-queue"
+OBSERVATION_WORKFLOW_ID = "oonipipeline-observations"
+
+MAKE_OBSERVATIONS_START_TO_CLOSE_TIMEOUT = timedelta(hours=24)
+MAKE_GROUND_TRUTHS_START_TO_CLOSE_TIMEOUT = timedelta(hours=1)
+MAKE_ANALYSIS_START_TO_CLOSE_TIMEOUT = timedelta(hours=10)
+
+
+def get_workflow_start_time() -> datetime:
+    workflow_start_time = workflow.info().typed_search_attributes.get(
+        SearchAttributeKey.for_datetime("TemporalScheduledStartTime")
+    )
+    assert workflow_start_time is not None, "TemporalScheduledStartTime not set"
+    return workflow_start_time
+
+
+@dataclass
+class ObservationsWorkflowParams:
+    probe_cc: List[str]
+    test_name: List[str]
+    clickhouse: str
+    data_dir: str
+    fast_fail: bool
+    log_level: int = logging.INFO
+    bucket_date: Optional[str] = None
+
+
+@workflow.defn
+class ObservationsWorkflow:
+    @workflow.run
+    async def run(self, params: ObservationsWorkflowParams) -> dict:
+        if params.bucket_date is None:
+            params.bucket_date = (
+                get_workflow_start_time() - timedelta(days=1)
+            
).strftime("%Y-%m-%d")
+
+        await workflow.execute_activity(
+            optimize_all_tables,
+            ClickhouseParams(clickhouse_url=params.clickhouse),
+            start_to_close_timeout=timedelta(minutes=5),
+        )
+
+        log.info(
+            f"Starting observation making with probe_cc={params.probe_cc}, test_name={params.test_name} bucket_date={params.bucket_date}"
+        )
+
+        res = await workflow.execute_activity(
+            make_observation_in_day,
+            MakeObservationsParams(
+                probe_cc=params.probe_cc,
+                test_name=params.test_name,
+                clickhouse=params.clickhouse,
+                data_dir=params.data_dir,
+                fast_fail=params.fast_fail,
+                bucket_date=params.bucket_date,
+            ),
+            start_to_close_timeout=MAKE_OBSERVATIONS_START_TO_CLOSE_TIMEOUT,
+        )
+        res["bucket_date"] = params.bucket_date
+        return res
+
+
+@dataclass
+class BackfillWorkflowParams:
+    probe_cc: List[str]
+    test_name: List[str]
+    start_day: str
+    end_day: str
+    clickhouse: str
+    data_dir: str
+    fast_fail: bool
+    log_level: int = logging.INFO
+
+
+@workflow.defn
+class ObservationsBackfillWorkflow:
+    @workflow.run
+    async def run(self, params: BackfillWorkflowParams) -> dict:
+        start_day = datetime.strptime(params.start_day, "%Y-%m-%d")
+        end_day = datetime.strptime(params.end_day, "%Y-%m-%d")
+
+        t = PerfTimer(unstoppable=True)
+        task_list = []
+        workflow_id = workflow.info().workflow_id
+        for day in date_interval(start_day, end_day):
+            bucket_date = day.strftime("%Y-%m-%d")
+            task_list.append(
+                workflow.execute_child_workflow(
+                    ObservationsWorkflow.run,
+                    ObservationsWorkflowParams(
+                        bucket_date=bucket_date,
+                        probe_cc=params.probe_cc,
+                        test_name=params.test_name,
+                        clickhouse=params.clickhouse,
+                        data_dir=params.data_dir,
+                        fast_fail=params.fast_fail,
+                        log_level=params.log_level,
+                    ),
+                    id=f"{workflow_id}/{bucket_date}",
+                )
+            )
+
+        total_size = 0
+        total_measurement_count = 0
+
+        for task in asyncio.as_completed(task_list):
+            res = await task
+            bucket_date = res["bucket_date"]
+            total_size += res["size"]
+            total_measurement_count += res["measurement_count"]
+
+            
mb_per_sec = round(total_size / t.s / 10**6, 1)
+            msmt_per_sec = round(total_measurement_count / t.s)
+            log.info(
+                f"finished processing {bucket_date} speed: {mb_per_sec}MB/s ({msmt_per_sec}msmt/s)"
+            )
+
+        mb_per_sec = round(total_size / t.s / 10**6, 1)
+        msmt_per_sec = round(total_measurement_count / t.s)
+        log.info(
+            f"finished processing {params.start_day} - {params.end_day} speed: {mb_per_sec}MB/s ({msmt_per_sec}msmt/s)"
+        )
+
+        return {
+            "size": total_size,
+            "measurement_count": total_measurement_count,
+            "runtime_ms": t.ms,
+            "mb_per_sec": mb_per_sec,
+            "msmt_per_sec": msmt_per_sec,
+            "start_day": params.start_day,
+            "end_day": params.end_day,
+        }
+
+
+OBSERVATIONS_SCHEDULE_ID = "oonipipeline-observations-schedule-id"
+
+
+def gen_observation_schedule_id(params: ObservationsWorkflowParams) -> str:
+    probe_cc_key = "ALLCCS"
+    if len(params.probe_cc) > 0:
+        probe_cc_key = ".".join(map(lambda x: x.lower(), sorted(params.probe_cc)))
+    test_name_key = "ALLTNS"
+    if len(params.test_name) > 0:
+        test_name_key = ".".join(map(lambda x: x.lower(), sorted(params.test_name)))
+
+    return f"oonipipeline-observations-{probe_cc_key}-{test_name_key}"
+
+
+async def schedule_observations(
+    client: TemporalClient, params: ObservationsWorkflowParams
+):
+    schedule_id = gen_observation_schedule_id(params)
+
+    await client.create_schedule(
+        schedule_id,
+        Schedule(
+            action=ScheduleActionStartWorkflow(
+                ObservationsWorkflow.run,
+                params,
+                id=OBSERVATION_WORKFLOW_ID,
+                task_queue=TASK_QUEUE_NAME,
+                execution_timeout=MAKE_OBSERVATIONS_START_TO_CLOSE_TIMEOUT,
+                task_timeout=MAKE_OBSERVATIONS_START_TO_CLOSE_TIMEOUT,
+                run_timeout=MAKE_OBSERVATIONS_START_TO_CLOSE_TIMEOUT,
+            ),
+            spec=ScheduleSpec(
+                intervals=[
+                    ScheduleIntervalSpec(
+                        every=timedelta(days=1), offset=timedelta(hours=2)
+                    )
+                ],
+            ),
+            state=ScheduleState(
+                note="Run the observations workflow every day with an offset of 2 hours to ensure the files have been written to s3"
+            ),
+        ),
+    )
+
+
+@dataclass
+class GroundTruthsWorkflowParams:
+    start_day: str
+    end_day: str
+    clickhouse: str
+    data_dir: str
+
+
+@workflow.defn
+class GroundTruthsWorkflow:
+    @workflow.run
+    async def run(
+        self,
+        params: GroundTruthsWorkflowParams,
+    ):
+        start_day = datetime.strptime(params.start_day, "%Y-%m-%d").date()
+        end_day = datetime.strptime(params.end_day, "%Y-%m-%d").date()
+
+        async with asyncio.TaskGroup() as tg:
+            for day in date_interval(start_day, end_day):
+                tg.create_task(
+                    workflow.execute_activity(
+                        make_ground_truths_in_day,
+                        MakeGroundTruthsParams(
+                            clickhouse=params.clickhouse,
+                            data_dir=params.data_dir,
+                            day=day.strftime("%Y-%m-%d"),
+                        ),
+                        start_to_close_timeout=MAKE_GROUND_TRUTHS_START_TO_CLOSE_TIMEOUT,
+                    )
+                )
+
+
+@dataclass
+class AnalysisWorkflowParams:
+    probe_cc: List[str]
+    test_name: List[str]
+    day: str
+    clickhouse: str
+    data_dir: str
+    parallelism: int
+    fast_fail: bool
+    force_rebuild_ground_truths: bool = False
+    log_level: int = logging.INFO
+
+
+@workflow.defn
+class AnalysisWorkflow:
+    @workflow.run
+    async def run(self, params: AnalysisWorkflowParams) -> dict:
+        await workflow.execute_activity(
+            optimize_all_tables,
+            ClickhouseParams(clickhouse_url=params.clickhouse),
+            start_to_close_timeout=timedelta(minutes=5),
+        )
+
+        log.info("building ground truth databases")
+        t = PerfTimer()
+        if (
+            params.force_rebuild_ground_truths
+            or not get_ground_truth_db_path(
+                day=params.day, data_dir=params.data_dir
+            ).exists()
+        ):
+            await workflow.execute_activity(
+                make_ground_truths_in_day,
+                MakeGroundTruthsParams(
+                    clickhouse=params.clickhouse,
+                    data_dir=params.data_dir,
+                    day=params.day,
+                ),
+                start_to_close_timeout=timedelta(minutes=30),
+            )
+            log.info(f"built ground truth db in {t.pretty}")
+
+        start_day = datetime.strptime(params.day, "%Y-%m-%d").date()
+        cnt_by_cc = await workflow.execute_activity(
+            get_obs_count_by_cc,
+            ObsCountParams(
+                clickhouse_url=params.clickhouse,
+                start_day=start_day.strftime("%Y-%m-%d"),
+                
end_day=(start_day + timedelta(days=1)).strftime("%Y-%m-%d"),
+            ),
+            start_to_close_timeout=timedelta(minutes=30),
+        )
+
+        cc_batches = make_cc_batches(
+            cnt_by_cc=cnt_by_cc,
+            probe_cc=params.probe_cc,
+            parallelism=params.parallelism,
+        )
+
+        log.info(
+            f"starting processing of {len(cc_batches)} batches for {params.day} (parallelism = {params.parallelism})"
+        )
+        log.info(f"({cc_batches})")
+
+        task_list = []
+        async with asyncio.TaskGroup() as tg:
+            for probe_cc in cc_batches:
+                task = tg.create_task(
+                    workflow.execute_activity(
+                        make_analysis_in_a_day,
+                        MakeAnalysisParams(
+                            probe_cc=probe_cc,
+                            test_name=params.test_name,
+                            clickhouse=params.clickhouse,
+                            data_dir=params.data_dir,
+                            fast_fail=params.fast_fail,
+                            day=params.day,
+                        ),
+                        start_to_close_timeout=MAKE_ANALYSIS_START_TO_CLOSE_TIMEOUT,
+                    )
+                )
+                task_list.append(task)
+
+        total_obs_count = sum(map(lambda x: x.result()["count"], task_list))
+        return {"obs_count": total_obs_count, "day": params.day}
+
+
+@workflow.defn
+class AnalysisBackfillWorkflow:
+    @workflow.run
+    async def run(self, params: BackfillWorkflowParams) -> dict:
+        start_day = datetime.strptime(params.start_day, "%Y-%m-%d")
+        end_day = datetime.strptime(params.end_day, "%Y-%m-%d")
+
+        t = PerfTimer(unstoppable=True)
+        task_list = []
+        workflow_id = workflow.info().workflow_id
+        for day in date_interval(start_day, end_day):
+            day_str = day.strftime("%Y-%m-%d")
+            task_list.append(
+                workflow.execute_child_workflow(
+                    AnalysisWorkflow.run,
+                    AnalysisWorkflowParams(
+                        day=day_str,
+                        probe_cc=params.probe_cc,
+                        test_name=params.test_name,
+                        clickhouse=params.clickhouse,
+                        data_dir=params.data_dir,
+                        fast_fail=params.fast_fail,
+                        log_level=params.log_level,
+                        parallelism=10,
+                    ),
+                    id=f"{workflow_id}/{day_str}",
+                )
+            )
+
+        total_obs_count = 0
+
+        for task in asyncio.as_completed(task_list):
+            res = await task
+            day = res["day"]
+            total_obs_count += res["obs_count"]
+
+            obs_per_sec = round(total_obs_count / t.s, 1)
+            log.info(
+                
f"finished processing {day} in {t.pretty} total_obs_count={total_obs_count} ({obs_per_sec}obs/s)"
+            )
+
+        obs_per_sec = round(total_obs_count / t.s, 1)
+        log.info(
+            f"finished processing {params.start_day} - {params.end_day} in {t.pretty} total_obs_count={total_obs_count} ({obs_per_sec}obs/s)"
+        )
+
+        return {
+            "observation_count": total_obs_count,
+            "runtime_ms": t.ms,
+            "obs_per_sec": obs_per_sec,
+            "start_day": params.start_day,
+            "end_day": params.end_day,
+        }
diff --git a/oonipipeline/src/oonipipeline/workflows/analysis.py b/oonipipeline/src/oonipipeline/workflows/analysis.py
deleted file mode 100644
index 76f854e2..00000000
--- a/oonipipeline/src/oonipipeline/workflows/analysis.py
+++ /dev/null
@@ -1,339 +0,0 @@
-import asyncio
-import dataclasses
-from dataclasses import dataclass
-import logging
-import pathlib
-
-from datetime import date, datetime, timedelta, timezone
-from typing import Dict, List
-
-from temporalio import workflow, activity
-
-with workflow.unsafe.imports_passed_through():
-    import clickhouse_driver
-
-    import orjson
-    import statsd
-
-    from oonidata.dataclient import date_interval
-    from oonidata.datautils import PerfTimer
-    from oonidata.models.analysis import WebAnalysis
-    from oonidata.models.experiment_result import MeasurementExperimentResult
-
-    from ..analysis.control import BodyDB, WebGroundTruthDB
-    from ..analysis.datasources import iter_web_observations
-    from ..analysis.web_analysis import make_web_analysis
-    from ..analysis.website_experiment_results import make_website_experiment_results
-    from ..db.connections import ClickhouseConnection
-    from ..fingerprintdb import FingerprintDB
-
-    from .ground_truths import make_ground_truths_in_day, MakeGroundTruthsParams
-
-    from .common import (
-        get_obs_count_by_cc,
-        get_prev_range,
-        make_db_rows,
-        maybe_delete_prev_range,
-        optimize_all_tables,
-    )
-
-log = logging.getLogger("oonidata.processing")
-
-
-@dataclass
-class AnalysisWorkflowParams:
-    probe_cc: List[str]
-    test_name: List[str]
-    start_day: 
str - end_day: str - clickhouse: str - data_dir: str - parallelism: int - fast_fail: bool - rebuild_ground_truths: bool - log_level: int = logging.INFO - - -@dataclass -class MakeAnalysisParams: - probe_cc: List[str] - test_name: List[str] - clickhouse: str - data_dir: str - fast_fail: bool - day: str - - -@activity.defn -def make_analysis_in_a_day(params: MakeAnalysisParams) -> dict: - t_total = PerfTimer() - log.info("Optimizing all tables") - optimize_all_tables(params.clickhouse) - data_dir = pathlib.Path(params.data_dir) - clickhouse = params.clickhouse - day = datetime.strptime(params.day, "%Y-%m-%d").date() - probe_cc = params.probe_cc - test_name = params.test_name - - statsd_client = statsd.StatsClient("localhost", 8125) - fingerprintdb = FingerprintDB(datadir=data_dir, download=False) - body_db = BodyDB(db=ClickhouseConnection(clickhouse)) - db_writer = ClickhouseConnection(clickhouse, row_buffer_size=10_000) - db_lookup = ClickhouseConnection(clickhouse) - - column_names_wa = [f.name for f in dataclasses.fields(WebAnalysis)] - column_names_er = [f.name for f in dataclasses.fields(MeasurementExperimentResult)] - - prev_range_list = [ - get_prev_range( - db=db_lookup, - table_name=WebAnalysis.__table_name__, - timestamp=datetime.combine(day, datetime.min.time()), - test_name=[], - probe_cc=probe_cc, - timestamp_column="measurement_start_time", - ), - get_prev_range( - db=db_lookup, - table_name=MeasurementExperimentResult.__table_name__, - timestamp=datetime.combine(day, datetime.min.time()), - test_name=[], - probe_cc=probe_cc, - timestamp_column="timeofday", - probe_cc_column="location_network_cc", - ), - ] - - log.info(f"loading ground truth DB for {day}") - t = PerfTimer() - ground_truth_db_path = ( - data_dir / "ground_truths" / f"web-{day.strftime('%Y-%m-%d')}.sqlite3" - ) - web_ground_truth_db = WebGroundTruthDB() - web_ground_truth_db.build_from_existing(str(ground_truth_db_path.absolute())) - 
statsd_client.timing("oonidata.web_analysis.ground_truth", t.ms) - log.info(f"loaded ground truth DB for {day} in {t.pretty}") - - idx = 0 - for web_obs in iter_web_observations( - db_lookup, measurement_day=day, probe_cc=probe_cc, test_name="web_connectivity" - ): - try: - t_er_gen = PerfTimer() - t = PerfTimer() - relevant_gts = web_ground_truth_db.lookup_by_web_obs(web_obs=web_obs) - except: - log.error( - f"failed to lookup relevant_gts for {web_obs[0].measurement_uid}", - exc_info=True, - ) - continue - - try: - statsd_client.timing("oonidata.web_analysis.gt_lookup", t.ms) - website_analysis = list( - make_web_analysis( - web_observations=web_obs, - body_db=body_db, - web_ground_truths=relevant_gts, - fingerprintdb=fingerprintdb, - ) - ) - log.info(f"generated {len(website_analysis)} website_analysis") - if len(website_analysis) == 0: - log.info(f"no website analysis for {probe_cc}, {test_name}") - continue - idx += 1 - table_name, rows = make_db_rows( - dc_list=website_analysis, column_names=column_names_wa - ) - statsd_client.incr("oonidata.web_analysis.analysis.obs", 1, rate=0.1) # type: ignore - statsd_client.gauge("oonidata.web_analysis.analysis.obs_idx", idx, rate=0.1) # type: ignore - statsd_client.timing("oonidata.web_analysis.analysis.obs", t_er_gen.ms, rate=0.1) # type: ignore - - with statsd_client.timer("db_write_rows.timing"): - db_writer.write_rows( - table_name=table_name, - rows=rows, - column_names=column_names_wa, - ) - - with statsd_client.timer("oonidata.web_analysis.experiment_results.timing"): - website_er = list(make_website_experiment_results(website_analysis)) - log.info(f"generated {len(website_er)} website_er") - table_name, rows = make_db_rows( - dc_list=website_er, - column_names=column_names_er, - custom_remap={"loni_list": orjson.dumps}, - ) - - db_writer.write_rows( - table_name=table_name, - rows=rows, - column_names=column_names_er, - ) - - except: - web_obs_ids = ",".join(map(lambda wo: wo.observation_id, web_obs)) - 
log.error(f"failed to generate analysis for {web_obs_ids}", exc_info=True) - - for prev_range in prev_range_list: - maybe_delete_prev_range(db=db_lookup, prev_range=prev_range) - db_writer.close() - - with ClickhouseConnection(clickhouse) as db: - db.execute( - "INSERT INTO oonidata_processing_logs (key, timestamp, runtime_ms, bytes, msmt_count, comment) VALUES", - [ - [ - "oonidata.analysis.made_day_analysis", - datetime.now(timezone.utc).replace(tzinfo=None), - int(t_total.ms), - 0, - idx, - day.strftime("%Y-%m-%d"), - ] - ], - ) - return {"count": idx} - - -def make_cc_batches( - cnt_by_cc: Dict[str, int], - probe_cc: List[str], - parallelism: int, -) -> List[List[str]]: - """ - The goal of this function is to spread the load of each batch of - measurements by probe_cc. This allows us to parallelize analysis on a - per-country basis based on the number of measurements. - We assume that the measurements are uniformly distributed over the tested - interval and then break them up into a number of batches equivalent to the - parallelism count based on the number of measurements in each country. - - Here is a concrete example, suppose we have 3 countries IT, IR, US with 300, - 400, 1000 measurements respectively and a parallelism of 2, we will be - creating 2 batches where the first has in it IT, IR and the second has US. - """ - if len(probe_cc) > 0: - selected_ccs_with_cnt = set(probe_cc).intersection(set(cnt_by_cc.keys())) - if len(selected_ccs_with_cnt) == 0: - raise Exception( - f"No observations for {probe_cc} in the time range. 
Try adjusting the date range or choosing different countries" - ) - # We remove from the cnt_by_cc all the countries we are not interested in - cnt_by_cc = {k: cnt_by_cc[k] for k in selected_ccs_with_cnt} - - total_obs_cnt = sum(cnt_by_cc.values()) - - # We assume uniform distribution of observations per (country, day) - max_obs_per_batch = total_obs_cnt / parallelism - - # We break up the countries into batches where the count of observations in - # each batch is roughly equal. - # This is done so that we can spread the load based on the countries in - # addition to the time range. - cc_batches = [] - current_cc_batch_size = 0 - current_cc_batch = [] - cnt_by_cc_sorted = sorted(cnt_by_cc.items(), key=lambda x: x[0]) - while cnt_by_cc_sorted: - while current_cc_batch_size <= max_obs_per_batch: - try: - cc, cnt = cnt_by_cc_sorted.pop() - except IndexError: - break - current_cc_batch.append(cc) - current_cc_batch_size += cnt - cc_batches.append(current_cc_batch) - current_cc_batch = [] - current_cc_batch_size = 0 - if len(current_cc_batch) > 0: - cc_batches.append(current_cc_batch) - return cc_batches - - -# TODO(art) -# We disable the sanbox for all this workflow, since otherwise pytz fails to -# work which is a requirement for clickhouse. -# This is most likely due to it doing an open() in order to read the timezone -# definitions. -# I spent some time debugging this, but eventually gave up. We should eventually -# look into making this run OK inside of the sandbox. 
-@workflow.defn(sandboxed=False)
-class AnalysisWorkflow:
-    @workflow.run
-    async def run(self, params: AnalysisWorkflowParams) -> dict:
-        t_total = PerfTimer()
-
-        t = PerfTimer()
-        start_day = datetime.strptime(params.start_day, "%Y-%m-%d").date()
-        end_day = datetime.strptime(params.end_day, "%Y-%m-%d").date()
-
-        log.info("building ground truth databases")
-
-        async with asyncio.TaskGroup() as tg:
-            for day in date_interval(start_day, end_day):
-                tg.create_task(
-                    workflow.execute_activity(
-                        make_ground_truths_in_day,
-                        MakeGroundTruthsParams(
-                            day=day.strftime("%Y-%m-%d"),
-                            clickhouse=params.clickhouse,
-                            data_dir=params.data_dir,
-                            rebuild_ground_truths=params.rebuild_ground_truths,
-                        ),
-                        start_to_close_timeout=timedelta(minutes=2),
-                    )
-                )
-        log.info(f"built ground truth db in {t.pretty}")
-
-        with ClickhouseConnection(params.clickhouse) as db:
-            cnt_by_cc = get_obs_count_by_cc(
-                db, start_day=start_day, end_day=end_day, test_name=params.test_name
-            )
-        cc_batches = make_cc_batches(
-            cnt_by_cc=cnt_by_cc,
-            probe_cc=params.probe_cc,
-            parallelism=params.parallelism,
-        )
-        log.info(
-            f"starting processing of {len(cc_batches)} batches over {(end_day - start_day).days} days (parallelism = {params.parallelism})"
-        )
-        log.info(f"({cc_batches} from {start_day} to {end_day})")
-
-        task_list = []
-        async with asyncio.TaskGroup() as tg:
-            for probe_cc in cc_batches:
-                for day in date_interval(start_day, end_day):
-                    task = tg.create_task(
-                        workflow.execute_activity(
-                            make_analysis_in_a_day,
-                            MakeAnalysisParams(
-                                probe_cc=probe_cc,
-                                test_name=params.test_name,
-                                clickhouse=params.clickhouse,
-                                data_dir=params.data_dir,
-                                fast_fail=params.fast_fail,
-                                day=day.strftime("%Y-%m-%d"),
-                            ),
-                            start_to_close_timeout=timedelta(minutes=30),
-                        )
-                    )
-                    task_list.append(task)
-
-        t = PerfTimer()
-        # size, msmt_count =
-        total_obs_count = 0
-        for task in task_list:
-            res = task.result()
-
-            total_obs_count += res["count"]
-
-        log.info(f"produced a total of {total_obs_count} analysis results")
-        obs_per_sec = round(total_obs_count / t_total.s)
-        log.info(
-            f"finished processing {start_day} - {end_day} speed: {obs_per_sec}obs/s"
-        )
-        log.info(f"{total_obs_count} msmts in {t_total.pretty}")
-        return {"total_obs_count": total_obs_count}
diff --git a/oonipipeline/src/oonipipeline/workflows/ground_truths.py b/oonipipeline/src/oonipipeline/workflows/ground_truths.py
deleted file mode 100644
index eda81727..00000000
--- a/oonipipeline/src/oonipipeline/workflows/ground_truths.py
+++ /dev/null
@@ -1,90 +0,0 @@
-import asyncio
-from dataclasses import dataclass
-import pathlib
-import logging
-
-from datetime import datetime, timedelta
-
-from temporalio import workflow, activity
-
-with workflow.unsafe.imports_passed_through():
-    import clickhouse_driver
-
-    from oonidata.dataclient import date_interval
-    from oonidata.datautils import PerfTimer
-    from ..analysis.control import WebGroundTruthDB, iter_web_ground_truths
-    from ..netinfo import NetinfoDB
-    from ..db.connections import (
-        ClickhouseConnection,
-    )
-
-log = logging.getLogger("oonidata.processing")
-
-
-@dataclass
-class GroundTruthsWorkflowParams:
-    start_day: str
-    end_day: str
-    clickhouse: str
-    data_dir: str
-
-
-@dataclass
-class MakeGroundTruthsParams:
-    clickhouse: str
-    data_dir: str
-    day: str
-    rebuild_ground_truths: bool
-
-
-@activity.defn
-def make_ground_truths_in_day(params: MakeGroundTruthsParams):
-    clickhouse = params.clickhouse
-    day = datetime.strptime(params.day, "%Y-%m-%d").date()
-    data_dir = pathlib.Path(params.data_dir)
-    rebuild_ground_truths = params.rebuild_ground_truths
-
-    db = ClickhouseConnection(clickhouse)
-    netinfodb = NetinfoDB(datadir=data_dir, download=False)
-    ground_truth_dir = data_dir / "ground_truths"
-    ground_truth_dir.mkdir(exist_ok=True)
-    dst_path = ground_truth_dir / f"web-{day.strftime('%Y-%m-%d')}.sqlite3"
-    if not dst_path.exists() or rebuild_ground_truths:
-        if dst_path.exists():
-            dst_path.unlink()
-
-        t = 
PerfTimer() - log.info(f"building ground truth DB for {day}") - web_ground_truth_db = WebGroundTruthDB(connect_str=str(dst_path.absolute())) - web_ground_truth_db.build_from_rows( - rows=iter_web_ground_truths(db=db, measurement_day=day, netinfodb=netinfodb) - ) - log.info(f"built ground truth DB {day} in {t.pretty}") - - -@workflow.defn -class GroundTruthsWorkflow: - @workflow.run - async def run( - self, - params: GroundTruthsWorkflowParams, - ): - task_list = [] - start_day = datetime.strptime(params.start_day, "%Y-%m-%d").date() - end_day = datetime.strptime(params.end_day, "%Y-%m-%d").date() - - async with asyncio.TaskGroup() as tg: - for day in date_interval(start_day, end_day): - task = tg.create_task( - workflow.execute_activity( - make_ground_truths_in_day, - MakeGroundTruthsParams( - clickhouse=params.clickhouse, - data_dir=params.data_dir, - day=day.strftime("%Y-%m-%d"), - rebuild_ground_truths=True, - ), - start_to_close_timeout=timedelta(minutes=30), - ) - ) - task_list.append(task) diff --git a/oonipipeline/src/oonipipeline/workflows/observations.py b/oonipipeline/src/oonipipeline/workflows/observations.py deleted file mode 100644 index 1232165a..00000000 --- a/oonipipeline/src/oonipipeline/workflows/observations.py +++ /dev/null @@ -1,278 +0,0 @@ -import asyncio -import pathlib -import logging -import dataclasses -from dataclasses import dataclass -from datetime import datetime, timedelta - -from typing import ( - List, - Sequence, - Tuple, -) - -from temporalio import workflow, activity - -with workflow.unsafe.imports_passed_through(): - import statsd - import clickhouse_driver - from oonidata.datautils import PerfTimer - from oonidata.dataclient import ( - date_interval, - list_file_entries_batches, - stream_measurements, - ccs_set, - load_measurement, - ) - from oonidata.models.nettests import SupportedDataformats - - from ..netinfo import NetinfoDB - from ..db.connections import ClickhouseConnection - from ..transforms.observations import 
measurement_to_observations - - from .common import ( - get_prev_range, - make_db_rows, - maybe_delete_prev_range, - optimize_all_tables, - ) - -log = logging.getLogger("oonidata.processing") - - -def write_observations_to_db( - msmt: SupportedDataformats, - netinfodb: NetinfoDB, - db: ClickhouseConnection, - bucket_date: str, -): - for observations in measurement_to_observations(msmt, netinfodb=netinfodb): - if len(observations) == 0: - continue - - column_names = [f.name for f in dataclasses.fields(observations[0])] - table_name, rows = make_db_rows( - bucket_date=bucket_date, - dc_list=observations, - column_names=column_names, - ) - db.write_rows(table_name=table_name, rows=rows, column_names=column_names) - - -def make_observations_for_file_entry_batch( - file_entry_batch: Sequence[Tuple[str, str, str, int]], - clickhouse: str, - row_buffer_size: int, - data_dir: pathlib.Path, - bucket_date: str, - probe_cc: List[str], - fast_fail: bool, -): - netinfodb = NetinfoDB(datadir=data_dir, download=False) - tbatch = PerfTimer() - with ClickhouseConnection(clickhouse, row_buffer_size=row_buffer_size) as db: - statsd_client = statsd.StatsClient("localhost", 8125) - ccs = ccs_set(probe_cc) - idx = 0 - for bucket_name, s3path, ext, fe_size in file_entry_batch: - log.info(f"processing file s3://{bucket_name}/{s3path}") - t = PerfTimer() - try: - for msmt_dict in stream_measurements( - bucket_name=bucket_name, s3path=s3path, ext=ext - ): - # Legacy cans don't allow us to pre-filter on the probe_cc, so - # we need to check for probe_cc consistency in here. 
- if ccs and msmt_dict["probe_cc"] not in ccs: - continue - msmt = None - try: - t = PerfTimer() - msmt = load_measurement(msmt_dict) - if not msmt.test_keys: - log.error( - f"measurement with empty test_keys: ({msmt.measurement_uid})", - exc_info=True, - ) - continue - write_observations_to_db(msmt, netinfodb, db, bucket_date) - # following types ignored due to https://github.com/jsocol/pystatsd/issues/146 - statsd_client.timing("oonidata.make_observations.timed", t.ms, rate=0.1) # type: ignore - statsd_client.incr("oonidata.make_observations.msmt_count", rate=0.1) # type: ignore - idx += 1 - except Exception as exc: - msmt_str = msmt_dict.get("report_id", None) - if msmt: - msmt_str = msmt.measurement_uid - log.error(f"failed at idx: {idx} ({msmt_str})", exc_info=True) - - if fast_fail: - db.close() - raise exc - log.info(f"done processing file s3://{bucket_name}/{s3path}") - except Exception as exc: - log.error( - f"failed to stream measurements from s3://{bucket_name}/{s3path}" - ) - log.error(exc) - statsd_client.timing("oonidata.dataclient.stream_file_entry.timed", t.ms, rate=0.1) # type: ignore - statsd_client.gauge("oonidata.dataclient.file_entry.kb_per_sec.gauge", fe_size / 1024 / t.s, rate=0.1) # type: ignore - statsd_client.timing("oonidata.dataclient.batch.timed", tbatch.ms) # type: ignore - return idx - - -@dataclass -class ObservationsWorkflowParams: - probe_cc: List[str] - test_name: List[str] - start_day: str - end_day: str - clickhouse: str - data_dir: str - fast_fail: bool - log_level: int = logging.INFO - - -@dataclass -class MakeObservationsParams: - probe_cc: List[str] - test_name: List[str] - clickhouse: str - data_dir: str - fast_fail: bool - bucket_date: str - - -@activity.defn -def make_observation_in_day(params: MakeObservationsParams) -> dict: - statsd_client = statsd.StatsClient("localhost", 8125) - - day = datetime.strptime(params.bucket_date, "%Y-%m-%d").date() - - with ClickhouseConnection(params.clickhouse, row_buffer_size=10_000) as 
db: - prev_ranges = [] - for table_name in ["obs_web"]: - prev_ranges.append( - ( - table_name, - get_prev_range( - db=db, - table_name=table_name, - bucket_date=params.bucket_date, - test_name=params.test_name, - probe_cc=params.probe_cc, - ), - ) - ) - - t = PerfTimer() - total_t = PerfTimer() - file_entry_batches, total_size = list_file_entries_batches( - probe_cc=params.probe_cc, - test_name=params.test_name, - start_day=day, - end_day=day + timedelta(days=1), - ) - log.info(f"running {len(file_entry_batches)} batches took {t.pretty}") - - total_msmt_count = 0 - for batch in file_entry_batches: - msmt_cnt = make_observations_for_file_entry_batch( - batch, - params.clickhouse, - 10_000, - pathlib.Path(params.data_dir), - params.bucket_date, - params.probe_cc, - params.fast_fail, - ) - total_msmt_count += msmt_cnt - - mb_per_sec = round(total_size / total_t.s / 10**6, 1) - msmt_per_sec = round(total_msmt_count / total_t.s) - log.info( - f"finished processing all batches in {total_t.pretty} speed: {mb_per_sec}MB/s ({msmt_per_sec}msmt/s)" - ) - statsd_client.timing("oonidata.dataclient.daily.timed", total_t.ms) - - if len(prev_ranges) > 0: - with ClickhouseConnection(params.clickhouse, row_buffer_size=10_000) as db: - for table_name, pr in prev_ranges: - maybe_delete_prev_range(db=db, prev_range=pr) - - return {"size": total_size, "measurement_count": total_msmt_count} - - -@workflow.defn -class ObservationsWorkflow: - @workflow.run - async def run(self, params: ObservationsWorkflowParams) -> dict: - log.info("Optimizing all tables") - optimize_all_tables(params.clickhouse) - - t_total = PerfTimer() - log.info( - f"Starting observation making on {params.probe_cc} ({params.start_day} - {params.end_day})" - ) - task_list = [] - start_day = datetime.strptime(params.start_day, "%Y-%m-%d").date() - end_day = datetime.strptime(params.end_day, "%Y-%m-%d").date() - - async with asyncio.TaskGroup() as tg: - for day in date_interval(start_day, end_day): - task = 
tg.create_task( - workflow.execute_activity( - make_observation_in_day, - MakeObservationsParams( - probe_cc=params.probe_cc, - test_name=params.test_name, - clickhouse=params.clickhouse, - data_dir=params.data_dir, - fast_fail=params.fast_fail, - bucket_date=day.strftime("%Y-%m-%d"), - ), - start_to_close_timeout=timedelta(minutes=30), - ) - ) - task_list.append(task) - - t = PerfTimer() - # size, msmt_count = - total_size, total_msmt_count = 0, 0 - for task in task_list: - res = task.result() - - total_size += res["size"] - total_msmt_count += res["measurement_count"] - - # This needs to be adjusted once we get the the per entry concurrency working - # mb_per_sec = round(total_size / t.s / 10**6, 1) - # msmt_per_sec = round(total_msmt_count / t.s) - # log.info( - # f"finished processing {day} speed: {mb_per_sec}MB/s ({msmt_per_sec}msmt/s)" - # ) - - # with ClickhouseConnection(params.clickhouse) as db: - # db.execute( - # "INSERT INTO oonidata_processing_logs (key, timestamp, runtime_ms, bytes, msmt_count, comment) VALUES", - # [ - # [ - # "oonidata.bucket_processed", - # datetime.now(timezone.utc).replace(tzinfo=None), - # int(t.ms), - # total_size, - # total_msmt_count, - # day.strftime("%Y-%m-%d"), - # ] - # ], - # ) - - mb_per_sec = round(total_size / t_total.s / 10**6, 1) - msmt_per_sec = round(total_msmt_count / t_total.s) - log.info( - f"finished processing {params.start_day} - {params.end_day} speed: {mb_per_sec}MB/s ({msmt_per_sec}msmt/s)" - ) - log.info( - f"{round(total_size/10**9, 2)}GB {total_msmt_count} msmts in {t_total.pretty}" - ) - return {"size": total_size, "measurement_count": total_msmt_count} diff --git a/oonipipeline/tests/_fixtures.py b/oonipipeline/tests/_fixtures.py index 47960b32..358dbe95 100644 --- a/oonipipeline/tests/_fixtures.py +++ b/oonipipeline/tests/_fixtures.py @@ -33,6 +33,11 @@ "20221101055235.141387_RU_webconnectivity_046ce024dd76b564", # ru_blocks_twitter "20230907000740.785053_BR_httpinvalidrequestline_bdfe6d70dcbda5e9", 
# middlebox detected "20221110235922.335062_IR_webconnectivity_e4114ee32b8dbf74", # Iran blocking reddit + "20240420235427.477327_US_webconnectivity_9b3cac038dc2ba22", # down site + "20240302000048.790188_RU_webconnectivity_e7ffd3bc0f525eb7", # connection reset RU + "20240302000050.000654_SN_webconnectivity_fe4221088fbdcb0a", # nxdomain down + "20240302000305.316064_EG_webconnectivity_397bca9091b07444", # nxdomain blocked, unknown_failure and from the future + "20240309112858.009725_SE_webconnectivity_dce757ef4ec9b6c8", # blockpage for Iran in Sweden ] SAMPLE_POSTCANS = ["2024030100_AM_webconnectivity.n1.0.tar.gz"] diff --git a/oonipipeline/tests/data/.gitignore b/oonipipeline/tests/data/.gitignore index 213f74a9..ec372b8e 100644 --- a/oonipipeline/tests/data/.gitignore +++ b/oonipipeline/tests/data/.gitignore @@ -1,2 +1,3 @@ /datadir /measurements +/raw_measurements diff --git a/oonipipeline/tests/docker-compose.yml b/oonipipeline/tests/docker-compose.yml index 7546ca5b..b0dcb40d 100644 --- a/oonipipeline/tests/docker-compose.yml +++ b/oonipipeline/tests/docker-compose.yml @@ -3,4 +3,4 @@ services: clickhouse: image: "clickhouse/clickhouse-server" ports: - - "9000:9000" + - "19000:9000" diff --git a/oonipipeline/tests/fixme_test_workers.py b/oonipipeline/tests/fixme_test_workers.py deleted file mode 100644 index 17556d91..00000000 --- a/oonipipeline/tests/fixme_test_workers.py +++ /dev/null @@ -1,341 +0,0 @@ -from datetime import date, datetime, timedelta, timezone -import gzip -from pathlib import Path -import sqlite3 -from typing import List, Tuple -from unittest.mock import MagicMock -import time - -from oonidata.dataclient import stream_jsonl, load_measurement -from oonidata.models.nettests.dnscheck import DNSCheck -from oonidata.models.nettests.web_connectivity import WebConnectivity -from oonidata.models.nettests.http_invalid_request_line import HTTPInvalidRequestLine -from oonidata.models.observations import HTTPMiddleboxObservation - -from 
oonipipeline.workflows.analysis import ( - make_analysis_in_a_day, - make_cc_batches, - make_ctrl, -) -from oonipipeline.workflows.common import ( - get_obs_count_by_cc, - get_prev_range, - maybe_delete_prev_range, -) -from oonipipeline.workflows.observations import ( - make_observations_for_file_entry_batch, - write_observations_to_db, -) -from oonipipeline.workflows.response_archiver import ResponseArchiver -from oonipipeline.workflows.fingerprint_hunter import fingerprint_hunter -from oonipipeline.transforms import measurement_to_observations -from oonipipeline.transforms.nettests.measurement_transformer import ( - MeasurementTransformer, -) - - -def wait_for_mutations(db, table_name): - while True: - res = db.execute( - f"SELECT * FROM system.mutations WHERE is_done=0 AND table='{table_name}';" - ) - if len(res) == 0: # type: ignore - break - time.sleep(1) - - -def test_get_prev_range(db): - db.execute("DROP TABLE IF EXISTS test_range") - db.execute( - """CREATE TABLE test_range ( - created_at DateTime64(3, 'UTC'), - bucket_date String, - test_name String, - probe_cc String - ) - ENGINE = MergeTree - ORDER BY (bucket_date, created_at) - """ - ) - bucket_date = "2000-01-01" - test_name = "web_connectivity" - probe_cc = "IT" - min_time = datetime(2000, 1, 1, 23, 42, 00) - rows = [(min_time, bucket_date, test_name, probe_cc)] - for i in range(200): - rows.append((min_time + timedelta(seconds=i), bucket_date, test_name, probe_cc)) - db.execute( - "INSERT INTO test_range (created_at, bucket_date, test_name, probe_cc) VALUES", - rows, - ) - prev_range = get_prev_range( - db, - "test_range", - test_name=[test_name], - bucket_date=bucket_date, - probe_cc=[probe_cc], - ) - assert prev_range.min_created_at and prev_range.max_created_at - assert prev_range.min_created_at == (min_time - timedelta(seconds=1)) - assert prev_range.max_created_at == (rows[-1][0] + timedelta(seconds=1)) - db.execute("TRUNCATE TABLE test_range") - - bucket_date = "2000-03-01" - test_name = 
"web_connectivity" - probe_cc = "IT" - min_time = datetime(2000, 1, 1, 23, 42, 00) - rows: List[Tuple[datetime, str, str, str]] = [] - for i in range(10): - rows.append( - (min_time + timedelta(seconds=i), "2000-02-01", test_name, probe_cc) - ) - min_time = rows[-1][0] - for i in range(10): - rows.append((min_time + timedelta(seconds=i), bucket_date, test_name, probe_cc)) - - db.execute( - "INSERT INTO test_range (created_at, bucket_date, test_name, probe_cc) VALUES", - rows, - ) - prev_range = get_prev_range( - db, - "test_range", - test_name=[test_name], - bucket_date=bucket_date, - probe_cc=[probe_cc], - ) - assert prev_range.min_created_at and prev_range.max_created_at - assert prev_range.min_created_at == (min_time - timedelta(seconds=1)) - assert prev_range.max_created_at == (rows[-1][0] + timedelta(seconds=1)) - - maybe_delete_prev_range( - db=db, - prev_range=prev_range, - ) - wait_for_mutations(db, "test_range") - res = db.execute("SELECT COUNT() FROM test_range") - assert res[0][0] == 10 - db.execute("DROP TABLE test_range") - - -def test_make_cc_batches(): - cc_batches = make_cc_batches( - cnt_by_cc={"IT": 100, "IR": 300, "US": 1000}, - probe_cc=["IT", "IR", "US"], - parallelism=2, - ) - assert len(cc_batches) == 2 - # We expect the batches to be broken up into (IT, IR), ("US") - assert any([set(x) == set(["US"]) for x in cc_batches]) == True - - -def test_make_file_entry_batch(datadir, db): - file_entry_batch = [ - ( - "ooni-data-eu-fra", - "raw/20231031/15/IR/webconnectivity/2023103115_IR_webconnectivity.n1.0.tar.gz", - "tar.gz", - 4074306, - ) - ] - obs_msmt_count = make_observations_for_file_entry_batch( - file_entry_batch, db.clickhouse_url, 100, datadir, "2023-10-31", "IR", False - ) - assert obs_msmt_count == 453 - - make_ctrl( - clickhouse=db.clickhouse_url, - data_dir=datadir, - rebuild_ground_truths=True, - day=date(2023, 10, 31), - ) - analysis_msmt_count = make_analysis_in_a_day( - probe_cc=["IR"], - test_name=["webconnectivity"], - 
clickhouse=db.clickhouse_url, - data_dir=datadir, - day=date(2023, 10, 31), - fast_fail=False, - ) - assert analysis_msmt_count == obs_msmt_count - - -def test_write_observations(measurements, netinfodb, db): - msmt_uids = [ - ("20210101190046.780850_US_webconnectivity_3296f126f79ca186", "2021-01-01"), - ("20210101181154.037019_CH_webconnectivity_68ce38aa9e3182c2", "2021-01-01"), - ("20231031032643.267235_GR_dnscheck_abcbfc460b9424b6", "2023-10-31"), - ( - "20231101164541.763506_NP_httpinvalidrequestline_0cf676868fa36cc4", - "2023-10-31", - ), - ( - "20231101164544.534107_BR_httpheaderfieldmanipulation_4caa0b0556f0b141", - "2023-10-31", - ), - ("20231101164649.235575_RU_tor_ccf7519bf683c022", "2023-10-31"), - ( - "20230907000740.785053_BR_httpinvalidrequestline_bdfe6d70dcbda5e9", - "2023-09-07", - ), - ] - for msmt_uid, bucket_date in msmt_uids: - msmt = load_measurement(msmt_path=measurements[msmt_uid]) - write_observations_to_db(msmt, netinfodb, db, bucket_date) - db.close() - cnt_by_cc = get_obs_count_by_cc( - db, - test_name=[], - start_day=date(2020, 1, 1), - end_day=date(2023, 12, 1), - ) - assert cnt_by_cc["CH"] == 2 - assert cnt_by_cc["GR"] == 4 - assert cnt_by_cc["US"] == 3 - assert cnt_by_cc["RU"] == 3 - - -def test_hirl_observations(measurements, netinfodb): - msmt = load_measurement( - msmt_path=measurements[ - "20230907000740.785053_BR_httpinvalidrequestline_bdfe6d70dcbda5e9" - ] - ) - assert isinstance(msmt, HTTPInvalidRequestLine) - middlebox_obs: List[HTTPMiddleboxObservation] = measurement_to_observations( - msmt, netinfodb=netinfodb - )[0] - assert isinstance(middlebox_obs[0], HTTPMiddleboxObservation) - assert middlebox_obs[0].hirl_success == True - assert middlebox_obs[0].hirl_sent_0 != middlebox_obs[0].hirl_received_0 - - -def test_insert_query_for_observation(measurements, netinfodb): - http_blocked = load_measurement( - msmt_path=measurements[ - "20220608121828.356206_RU_webconnectivity_80e3fa60eb2cd026" - ] - ) - assert 
isinstance(http_blocked, WebConnectivity) - mt = MeasurementTransformer(measurement=http_blocked, netinfodb=netinfodb) - all_web_obs = [ - obs - for obs in mt.make_http_observations( - http_blocked.test_keys.requests, - ) - ] - assert all_web_obs[-1].request_url == "http://proxy.org/" - - -def test_web_connectivity_processor(netinfodb, measurements): - msmt = load_measurement( - msmt_path=measurements[ - "20220627131742.081225_GB_webconnectivity_e1e2cf4db492b748" - ] - ) - assert isinstance(msmt, WebConnectivity) - - web_obs_list, web_ctrl_list = measurement_to_observations(msmt, netinfodb=netinfodb) - assert len(web_obs_list) == 3 - assert len(web_ctrl_list) == 3 - - -def test_dnscheck_processor(measurements, netinfodb): - db = MagicMock() - db.write_row = MagicMock() - - msmt = load_measurement( - msmt_path=measurements["20221013000000.517636_US_dnscheck_bfd6d991e70afa0e"] - ) - assert isinstance(msmt, DNSCheck) - obs_list = measurement_to_observations(msmt=msmt, netinfodb=netinfodb)[0] - assert len(obs_list) == 20 - - -def test_full_processing(raw_measurements, netinfodb): - for msmt_path in raw_measurements.glob("*/*/*.jsonl.gz"): - with msmt_path.open("rb") as in_file: - for msmt_dict in stream_jsonl(in_file): - msmt = load_measurement(msmt_dict) - measurement_to_observations( - msmt=msmt, - netinfodb=netinfodb, - ) - - -def test_archive_http_transaction(measurements, tmpdir): - db = MagicMock() - db.write_row = MagicMock() - - msmt = load_measurement( - msmt_path=measurements[ - "20220627131742.081225_GB_webconnectivity_e1e2cf4db492b748" - ] - ) - assert isinstance(msmt, WebConnectivity) - assert msmt.test_keys.requests - dst_dir = Path(tmpdir) - with ResponseArchiver(dst_dir=dst_dir) as archiver: - for http_transaction in msmt.test_keys.requests: - if not http_transaction.response or not http_transaction.request: - continue - request_url = http_transaction.request.url - status_code = http_transaction.response.code or 0 - response_headers = 
http_transaction.response.headers_list_bytes or [] - response_body = http_transaction.response.body_bytes - assert response_body - archiver.archive_http_transaction( - request_url=request_url, - status_code=status_code, - response_headers=response_headers, - response_body=response_body, - matched_fingerprints=[], - ) - - warc_files = list(dst_dir.glob("*.warc.gz")) - assert len(warc_files) == 1 - with gzip.open(warc_files[0], "rb") as in_file: - assert b"Run OONI Probe to detect internet censorship" in in_file.read() - - conn = sqlite3.connect(dst_dir / "graveyard.sqlite3") - res = conn.execute("SELECT COUNT() FROM oonibodies_archive") - assert res.fetchone()[0] == 1 - - -def test_fingerprint_hunter(fingerprintdb, measurements, tmpdir): - db = MagicMock() - db.write_rows = MagicMock() - - archives_dir = Path(tmpdir) - http_blocked = load_measurement( - msmt_path=measurements[ - "20220608121828.356206_RU_webconnectivity_80e3fa60eb2cd026" - ] - ) - assert isinstance(http_blocked, WebConnectivity) - with ResponseArchiver(dst_dir=archives_dir) as response_archiver: - assert http_blocked.test_keys.requests - for http_transaction in http_blocked.test_keys.requests: - if not http_transaction.response or not http_transaction.request: - continue - request_url = http_transaction.request.url - status_code = http_transaction.response.code or 0 - response_headers = http_transaction.response.headers_list_bytes or [] - response_body = http_transaction.response.body_bytes - assert response_body - response_archiver.archive_http_transaction( - request_url=request_url, - status_code=status_code, - response_headers=response_headers, - response_body=response_body, - matched_fingerprints=[], - ) - - archive_path = list(archives_dir.glob("*.warc.gz"))[0] - detected_fps = list( - fingerprint_hunter( - fingerprintdb=fingerprintdb, - archive_path=archive_path, - ) - ) - assert len(detected_fps) == 1 diff --git a/oonipipeline/tests/test_analysis.py b/oonipipeline/tests/test_analysis.py index 
242da280..7b9cea84 100644 --- a/oonipipeline/tests/test_analysis.py +++ b/oonipipeline/tests/test_analysis.py @@ -1,15 +1,23 @@ from base64 import b64decode from datetime import datetime +from pprint import pprint import random -from typing import List +from typing import List, Tuple from unittest.mock import MagicMock import pytest from oonidata.dataclient import load_measurement +from oonidata.models.analysis import WebAnalysis +from oonidata.models.experiment_result import MeasurementExperimentResult from oonidata.models.nettests.signal import Signal from oonidata.models.nettests.web_connectivity import WebConnectivity -from oonidata.models.observations import WebObservation, print_nice, print_nice_vertical +from oonidata.models.observations import ( + WebControlObservation, + WebObservation, + print_nice, + print_nice_vertical, +) from oonidata.datautils import validate_cert_chain from oonipipeline.analysis.web_analysis import make_web_analysis @@ -19,10 +27,14 @@ iter_ground_truths_from_web_control, WebGroundTruthDB, ) -from oonipipeline.analysis.signal import make_signal_experiment_result from oonipipeline.transforms.nettests.signal import SIGNAL_PEM_STORE from oonipipeline.transforms.observations import measurement_to_observations +from oonipipeline.analysis.signal import make_signal_experiment_result +from oonipipeline.analysis.website_experiment_results import ( + make_website_experiment_results, +) + def test_signal(fingerprintdb, netinfodb, measurements): signal_old_ca = load_measurement( @@ -120,67 +132,6 @@ def test_signal(fingerprintdb, netinfodb, measurements): assert blocking_event[0].confirmed == True -def test_website_dns_blocking_event(fingerprintdb, netinfodb, measurements): - pytest.skip("TODO(arturo): implement this with the new analysis") - msmt_path = measurements[ - "20220627030703.592775_IR_webconnectivity_80e199b3c572f8d3" - ] - er = list(make_experiment_result_from_wc_ctrl(msmt_path, fingerprintdb, netinfodb)) - be = list( - filter( - 
lambda be: be.outcome_scope == "n", - er, - ) - ) - assert len(be) == 1 - - msmt_path = measurements[ - "20220627134426.194308_DE_webconnectivity_15675b61ec62e268" - ] - er = list(make_experiment_result_from_wc_ctrl(msmt_path, fingerprintdb, netinfodb)) - be = list( - filter( - lambda be: be.blocked_score > 0.5, - er, - ) - ) - assert len(be) == 1 - assert be[0].outcome_detail == "inconsistent.bogon" - - msmt_path = measurements[ - "20220627125833.737451_FR_webconnectivity_bca9ad9d3371919a" - ] - er = make_experiment_result_from_wc_ctrl(msmt_path, fingerprintdb, netinfodb) - be = list( - filter( - lambda be: be.blocked_score > 0.6, - er, - ) - ) - # TODO: is it reasonable to double count NXDOMAIN for AAAA and A queries? - assert len(be) == 2 - assert be[0].outcome_detail == "inconsistent.nxdomain" - - msmt_path = measurements[ - "20220625234824.235023_HU_webconnectivity_3435a5df0e743d39" - ] - er = list(make_experiment_result_from_wc_ctrl(msmt_path, fingerprintdb, netinfodb)) - be = list( - filter( - lambda be: be.ok_score > 0.5, - er, - ) - ) - nok_be = list( - filter( - lambda be: be.ok_score < 0.5, - er, - ) - ) - assert len(be) == len(er) - assert len(nok_be) == 0 - - def make_experiment_result_from_wc_ctrl(msmt_path, fingerprintdb, netinfodb): msmt = load_measurement(msmt_path=msmt_path) assert isinstance(msmt, WebConnectivity) @@ -203,32 +154,41 @@ def make_experiment_result_from_wc_ctrl(msmt_path, fingerprintdb, netinfodb): return [] -def test_website_experiment_result_blocked(fingerprintdb, netinfodb, measurements): - pytest.skip("TODO(arturo): implement this with the new analysis") - experiment_results = list( - make_experiment_result_from_wc_ctrl( - measurements["20220627030703.592775_IR_webconnectivity_80e199b3c572f8d3"], - fingerprintdb, - netinfodb, - ) +def make_web_er_from_msmt(msmt, fingerprintdb, netinfodb) -> Tuple[ + List[MeasurementExperimentResult], + List[WebAnalysis], + List[WebObservation], + List[WebControlObservation], +]: + assert 
isinstance(msmt, WebConnectivity) + web_observations, web_control_observations = measurement_to_observations( + msmt, netinfodb=netinfodb + ) + assert isinstance(msmt.input, str) + web_ground_truth_db = WebGroundTruthDB() + web_ground_truth_db.build_from_rows( + rows=iter_ground_truths_from_web_control( + web_control_observations=web_control_observations, + netinfodb=netinfodb, + ), ) - assert len(experiment_results) == 1 - assert experiment_results[0].anomaly == True - -def test_website_experiment_result_ok(fingerprintdb, netinfodb, measurements): - pytest.skip("TODO(arturo): implement this with the new analysis") - experiment_results = list( - make_experiment_result_from_wc_ctrl( - measurements["20220608132401.787399_AM_webconnectivity_2285fc373f62729e"], - fingerprintdb, - netinfodb, + web_ground_truths = web_ground_truth_db.lookup_by_web_obs(web_obs=web_observations) + web_analysis = list( + make_web_analysis( + web_observations=web_observations, + web_ground_truths=web_ground_truths, + body_db=BodyDB(db=None), # type: ignore + fingerprintdb=fingerprintdb, ) ) - assert len(experiment_results) == 4 - assert experiment_results[0].anomaly == False - for er in experiment_results: - assert er.ok_score > 0.5 + + return ( + list(make_website_experiment_results(web_analysis)), + web_analysis, + web_observations, + web_control_observations, + ) def test_website_web_analysis_blocked(fingerprintdb, netinfodb, measurements, datadir): @@ -237,74 +197,266 @@ def test_website_web_analysis_blocked(fingerprintdb, netinfodb, measurements, da "20221110235922.335062_IR_webconnectivity_e4114ee32b8dbf74" ], ) - web_obs: List[WebObservation] = measurement_to_observations( - msmt, netinfodb=netinfodb - )[0] - FASTLY_IPS = [ - "151.101.1.140", - "151.101.129.140", - "151.101.193.140", - "151.101.65.140", - "199.232.253.140", - "2a04:4e42:400::396", - "2a04:4e42::396", - "2a04:4e42:fd3::396", + er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt( + msmt, 
fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 5
+
+    assert len(er) == 1
+    assert er[0].loni_blocked_values == [1.0]
+    assert er[0].loni_ok_value == 0
+    assert er[0].loni_blocked_keys[0].startswith("dns.")
+
+
+def test_website_web_analysis_plaintext_ok(fingerprintdb, netinfodb, measurements):
+    msmt = load_measurement(
+        msmt_path=measurements[
+            "20220608132401.787399_AM_webconnectivity_2285fc373f62729e"
+        ],
+    )
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 2
+
+    assert len(er) == 1
+    ok_dict = dict(zip(er[0].loni_ok_keys, er[0].loni_ok_values))
+    assert ok_dict["dns"] > 0.8
+    assert ok_dict["tcp"] > 0.8
+    assert ok_dict["tls"] > 0.8
+    assert ok_dict["http"] > 0.8
+
+    assert er[0].loni_ok_value > 0.8
+
+
+def test_website_web_analysis_blocked_2(fingerprintdb, netinfodb, measurements):
+    msmt = load_measurement(
+        msmt_path=measurements[
+            "20220627030703.592775_IR_webconnectivity_80e199b3c572f8d3"
+        ],
+    )
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 6
+
+    assert len(er) == 1
+    assert er[0].loni_blocked_values == [1.0]
+    assert er[0].loni_ok_value == 0
+    assert er[0].loni_blocked_keys[0].startswith("dns.")
+
+
+def test_website_dns_blocking_event(fingerprintdb, netinfodb, measurements):
+    msmt_path = measurements[
+        "20220627134426.194308_DE_webconnectivity_15675b61ec62e268"
+    ]
+    msmt = load_measurement(
+        msmt_path=msmt_path,
+    )
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 6
+
+    assert len(er) == 1
+    assert er[0].loni_ok_value == 0
+    assert er[0].loni_blocked_values[0] > 0.7
+    assert er[0].loni_blocked_keys[0].startswith("dns.")
+
+
+def test_website_dns_blocking_event_2(fingerprintdb, netinfodb, measurements):
+    msmt_path = measurements[
+        "20220627125833.737451_FR_webconnectivity_bca9ad9d3371919a"
     ]
-    # Equivalent to the following call, but done manually
-    # relevant_gts = web_ground_truth_db.lookup_by_web_obs(web_obs=web_obs)
-    relevant_gts = []
-    for is_trusted in [True, False]:
-        for ip in FASTLY_IPS:
-            relevant_gts.append(
-                WebGroundTruth(
-                    vp_asn=0,
-                    vp_cc="ZZ",
-                    # TODO FIXME in lookup
-                    is_trusted_vp=is_trusted,
-                    hostname="www.reddit.com",
-                    ip=ip,
-                    # TODO FIXME in webgroundtruth lookup
-                    port=443,
-                    dns_failure=None,
-                    # TODO fixme in lookup
-                    dns_success=True,
-                    tcp_failure=None,
-                    # TODO fixme in lookup
-                    tcp_success=True,
-                    tls_failure=None,
-                    tls_success=True,
-                    tls_is_certificate_valid=True,
-                    http_request_url=None,
-                    http_failure=None,
-                    http_success=None,
-                    # FIXME in lookup function "ZZ",
-                    http_response_body_length=131072 - random.randint(0, 100),
-                    # TODO FIXME in lookup function
-                    timestamp=datetime(
-                        2022,
-                        11,
-                        10,
-                        0,
-                        0,
-                    ),
-                    count=2,
-                    ip_asn=54113,
-                    # TODO FIXME in lookup function
-                    ip_as_org_name="Fastly, Inc.",
-                ),
-            )
-    # XXX currently not working
-    body_db = BodyDB(db=None)  # type: ignore
+    msmt = load_measurement(
+        msmt_path=msmt_path,
+    )
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 5
-    web_analysis = list(
-        make_web_analysis(
-            web_observations=web_obs,
-            body_db=body_db,
-            web_ground_truths=relevant_gts,
-            fingerprintdb=fingerprintdb,
-        )
+    assert len(er) == 1
+    assert er[0].loni_ok_value == 0
+    assert er[0].loni_blocked_values[0] > 0.5
+    assert er[0].loni_blocked_keys[0].startswith("dns.")
+
+
+def test_website_dns_ok(fingerprintdb, netinfodb, measurements):
+    msmt_path = measurements[
+        "20220625234824.235023_HU_webconnectivity_3435a5df0e743d39"
+    ]
+    msmt = load_measurement(
+        msmt_path=msmt_path,
+    )
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    # assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 5
+
+    assert len(er) == 1
+    assert er[0].loni_ok_value == 1
+
+
+# Check this for wc 0.5 overwriting tls analsysis
+# 20231031000227.813597_MY_webconnectivity_2f0b80761373aa7e
+def test_website_experiment_results(measurements, netinfodb, fingerprintdb):
+    msmt = load_measurement(
+        msmt_path=measurements[
+            "20221101055235.141387_RU_webconnectivity_046ce024dd76b564"
+        ]
+    )
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 3
+
+    assert len(er) == 1
+    assert er[0].loni_ok_value < 0.2
+    ok_dict = dict(zip(er[0].loni_ok_keys, er[0].loni_ok_values))
+    assert ok_dict["tcp"] == 0
+
+    blocked_dict = dict(zip(er[0].loni_blocked_keys, er[0].loni_blocked_values))
+    assert blocked_dict["tcp.timeout"] > 0.4
+
+
+def test_website_web_analysis_down(measurements, netinfodb, fingerprintdb):
+    msmt = load_measurement(
+        msmt_path=measurements[
+            "20240420235427.477327_US_webconnectivity_9b3cac038dc2ba22"
+        ]
+    )
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
     )
     assert len(web_analysis) == len(web_obs)
-    # for wa in web_analysis:
-    #     print(wa.measurement_uid)
-    #     print_nice_vertical(wa)
+    assert len(web_ctrl_obs) == 3
+
+    assert len(er) == 1
+    assert er[0].loni_ok_value < 0.2
+    ok_dict = dict(zip(er[0].loni_ok_keys, er[0].loni_ok_values))
+    assert ok_dict["tcp"] == 0
+
+    down_dict = dict(zip(er[0].loni_down_keys, er[0].loni_down_values))
+
+    blocked_dict = dict(zip(er[0].loni_blocked_keys, er[0].loni_blocked_values))
+
+    assert sum(down_dict.values()) > sum(blocked_dict.values())
+    assert down_dict["tcp.timeout"] > 0.5
+
+
+def test_website_web_analysis_blocked_connect_reset(
+    measurements, netinfodb, fingerprintdb
+):
+    msmt_path = measurements[
+        "20240302000048.790188_RU_webconnectivity_e7ffd3bc0f525eb7"
+    ]
+    msmt = load_measurement(msmt_path=msmt_path)
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    # assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 4
+
+    assert len(er) == 1
+    # TODO(art): this should be changed
+    # assert er[0].loni_ok_value == 0
+    assert er[0].loni_ok_value < 0.2
+
+    ok_dict = dict(zip(er[0].loni_ok_keys, er[0].loni_ok_values))
+    assert ok_dict["tls"] == 0
+
+    down_dict = dict(zip(er[0].loni_down_keys, er[0].loni_down_values))
+    blocked_dict = dict(zip(er[0].loni_blocked_keys, er[0].loni_blocked_values))
+
+    assert sum(down_dict.values()) < sum(blocked_dict.values())
+    assert blocked_dict["tls.connection_reset"] > 0.5
+
+
+def print_debug_er(er):
+    for idx, e in enumerate(er):
+        print(f"\n# ER#{idx}")
+        for idx, transcript in enumerate(e.analysis_transcript_list):
+            print(f"## Analysis #{idx}")
+            print("\n".join(transcript))
+    pprint(er)
+
+
+def test_website_web_analysis_nxdomain_down(measurements, netinfodb, fingerprintdb):
+    msmt_path = measurements[
+        "20240302000050.000654_SN_webconnectivity_fe4221088fbdcb0a"
+    ]
+    msmt = load_measurement(msmt_path=msmt_path)
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 2
+
+    assert len(er) == 1
+    assert er[0].loni_ok_value < 0.2
+
+    ok_dict = dict(zip(er[0].loni_ok_keys, er[0].loni_ok_values))
+    assert ok_dict["dns"] == 0
+
+    down_dict = dict(zip(er[0].loni_down_keys, er[0].loni_down_values))
+    blocked_dict = dict(zip(er[0].loni_blocked_keys, er[0].loni_blocked_values))
+
+    assert sum(down_dict.values()) > sum(blocked_dict.values())
+    assert down_dict["dns.nxdomain"] > 0.7
+
+
+def test_website_web_analysis_nxdomain_blocked(measurements, netinfodb, fingerprintdb):
+    msmt_path = measurements[
+        "20240302000305.316064_EG_webconnectivity_397bca9091b07444"
+    ]
+    msmt = load_measurement(msmt_path=msmt_path)
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 7
+
+    assert len(er) == 1
+    assert er[0].loni_ok_value < 0.2
+
+    ok_dict = dict(zip(er[0].loni_ok_keys, er[0].loni_ok_values))
+    assert ok_dict["dns"] == 0
+
+    down_dict = dict(zip(er[0].loni_down_keys, er[0].loni_down_values))
+    blocked_dict = dict(zip(er[0].loni_blocked_keys, er[0].loni_blocked_values))
+
+    assert sum(down_dict.values()) < sum(blocked_dict.values())
+    assert blocked_dict["dns.nxdomain"] > 0.7
+
+
+def test_website_web_analysis_blocked_inconsistent_country(
+    measurements, netinfodb, fingerprintdb
+):
+    msmt_path = measurements[
+        "20240309112858.009725_SE_webconnectivity_dce757ef4ec9b6c8"
+    ]
+    msmt = load_measurement(msmt_path=msmt_path)
+    er, web_analysis, web_obs, web_ctrl_obs = make_web_er_from_msmt(
+        msmt, fingerprintdb=fingerprintdb, netinfodb=netinfodb
+    )
+    assert len(web_analysis) == len(web_obs)
+    assert len(web_ctrl_obs) == 3
+
+    assert len(er) == 1
+    assert er[0].loni_ok_value < 0.2
+
+    ok_dict = dict(zip(er[0].loni_ok_keys, er[0].loni_ok_values))
+    assert ok_dict["dns"] == 0
+
+    down_dict = dict(zip(er[0].loni_down_keys, er[0].loni_down_values))
+    blocked_dict = dict(zip(er[0].loni_blocked_keys, er[0].loni_blocked_values))
+
+    assert sum(down_dict.values()) > sum(blocked_dict.values())
diff --git a/oonipipeline/tests/test_cli.py b/oonipipeline/tests/test_cli.py
new file mode 100644
index 00000000..9e2f9959
--- /dev/null
+++ b/oonipipeline/tests/test_cli.py
@@ -0,0 +1,140 @@
+import asyncio
+from multiprocessing import Process
+from pathlib import Path
+import time
+
+from oonipipeline.cli.commands import cli
+
+
+def wait_for_mutations(db, table_name):
+    while True:
+        res = db.execute(
+            f"SELECT * FROM system.mutations WHERE is_done=0 AND table='{table_name}';"
+        )
+        if len(res) == 0:  # type: ignore
+            break
+        time.sleep(1)
+
+
+def test_full_workflow(
+    db,
+    cli_runner,
+    fingerprintdb,
+    netinfodb,
+    datadir,
+    tmp_path: Path,
+    temporal_dev_server,
+):
+    result = cli_runner.invoke(
+        cli,
+        [
+            "mkobs",
+            "--probe-cc",
+            "BA",
+            "--start-day",
+            "2022-10-20",
+            "--end-day",
+            "2022-10-21",
+            "--test-name",
+            "web_connectivity",
+            "--create-tables",
+            "--data-dir",
+            datadir,
+            "--clickhouse",
+            db.clickhouse_url,
+            # "--archives-dir",
+            # tmp_path.absolute(),
+        ],
+    )
+    assert result.exit_code == 0
+    # assert len(list(tmp_path.glob("*.warc.gz"))) == 1
+    res = db.execute(
+        "SELECT bucket_date, COUNT(DISTINCT(measurement_uid)) FROM obs_web WHERE probe_cc = 'BA' GROUP BY bucket_date"
+    )
+    bucket_dict = dict(res)
+    assert "2022-10-20" in bucket_dict, bucket_dict
+    assert bucket_dict["2022-10-20"] == 200, bucket_dict
+    obs_count = bucket_dict["2022-10-20"]
+
+    result = cli_runner.invoke(
+        cli,
+        [
+            "mkobs",
+            "--probe-cc",
+            "BA",
+            "--start-day",
+            "2022-10-20",
+            "--end-day",
+            "2022-10-21",
+            "--test-name",
+            "web_connectivity",
+            "--create-tables",
+            "--data-dir",
+            datadir,
+            "--clickhouse",
+            db.clickhouse_url,
+        ],
+    )
+    assert result.exit_code == 0
+
+    # Wait for the mutation to finish running
+    wait_for_mutations(db, "obs_web")
+    res = db.execute(
+        "SELECT bucket_date, COUNT(DISTINCT(measurement_uid)) FROM obs_web WHERE probe_cc = 'BA' GROUP BY bucket_date"
+    )
+    bucket_dict = dict(res)
+    assert "2022-10-20" in bucket_dict, bucket_dict
+    # By re-running it against the same date, we should still get the same observation count
+    assert bucket_dict["2022-10-20"] == obs_count, bucket_dict
+
+    result = cli_runner.invoke(
+        cli,
+        [
+            "mkgt",
+            "--start-day",
+            "2022-10-20",
+            "--end-day",
+            "2022-10-21",
+            "--data-dir",
+            datadir,
+            "--clickhouse",
+            db.clickhouse_url,
+        ],
+    )
+    assert result.exit_code == 0
+
+    # result = cli_runner.invoke(
+    #     cli,
+    #     [
+    #         "fphunt",
+    #         "--data-dir",
+    #         datadir,
+    #         "--archives-dir",
+    #         tmp_path.absolute(),
+    #     ],
+    # )
+    # assert result.exit_code == 0
+
+    result = cli_runner.invoke(
+        cli,
+        [
+            "mkanalysis",
+            "--probe-cc",
+            "BA",
+            "--start-day",
+            "2022-10-20",
+            "--end-day",
+            "2022-10-21",
+            "--test-name",
+            "web_connectivity",
+            "--data-dir",
+            datadir,
+            "--clickhouse",
+            db.clickhouse_url,
+        ],
+    )
+    assert result.exit_code == 0
+    res = db.execute(
+        "SELECT COUNT(DISTINCT(measurement_uid)) FROM measurement_experiment_result WHERE measurement_uid LIKE '20221020%' AND location_network_cc = 'BA'"
+    )
+    assert res[0][0] == 200  # type: ignore
diff --git a/oonipipeline/tests/test_ctrl.py b/oonipipeline/tests/test_ctrl.py
index 1a7f384a..f03418bc 100644
--- a/oonipipeline/tests/test_ctrl.py
+++ b/oonipipeline/tests/test_ctrl.py
@@ -8,7 +8,9 @@
     WebGroundTruthDB,
     iter_web_ground_truths,
 )
-from oonipipeline.workflows.observations import make_observations_for_file_entry_batch
+from oonipipeline.temporal.activities.observations import (
+    make_observations_for_file_entry_batch,
+)
 
 
 def test_web_ground_truth_from_clickhouse(db, datadir, netinfodb, tmp_path):
diff --git a/oonipipeline/tests/test_db.py b/oonipipeline/tests/test_db.py
index a1b025a5..e07ac08b 100644
--- a/oonipipeline/tests/test_db.py
+++ b/oonipipeline/tests/test_db.py
@@ -1,8 +1,71 @@
+from dataclasses import dataclass
+from typing import Dict, List, Optional, Tuple
 from unittest.mock import MagicMock, call
 
 from clickhouse_driver import Client
 
 from oonipipeline.db.connections import ClickhouseConnection
+from oonipipeline.db.create_tables import (
+    get_table_column_diff,
+    get_column_map_from_create_query,
+    typing_to_clickhouse,
+)
+
+
+def test_create_tables():
+    col_map = get_column_map_from_create_query(
+        """
+    CREATE TABLE IF NOT EXISTS my_table
+    (
+        col_int Int32,
+        col_str String,
+        col_dict String,
+        col_opt_list_str Nullable(Array(String)),
+        col_opt_tup_str_str Nullable(Tuple(String, String)),
+        col_opt_list_tup_str_byt Nullable(Array(Array(String))),
+        col_dict_str_str Map(String, String)
+    )
+    ENGINE = MergeTree()
+    PRIMARY KEY (col_int)
+"""
+    )
+    assert col_map["col_int"] == typing_to_clickhouse(int)
+    assert col_map["col_str"] == typing_to_clickhouse(str)
+    assert col_map["col_dict"] == typing_to_clickhouse(dict)
+    assert col_map["col_opt_list_str"] == typing_to_clickhouse(Optional[List[str]])
+    assert col_map["col_opt_tup_str_str"] == typing_to_clickhouse(
+        Optional[Tuple[str, str]]
+    )
+    assert col_map["col_opt_list_tup_str_byt"] == typing_to_clickhouse(
+        Optional[List[Tuple[str, bytes]]]
+    )
+    assert col_map["col_dict_str_str"] == typing_to_clickhouse(Dict[str, str])
+
+    @dataclass
+    class SampleTable:
+        __table_name__ = "my_table"
+
+        my_col_int: int
+        my_new_col_str: str
+
+    db = MagicMock()
+    db.execute.return_value = [
+        [
+            """
+    CREATE TABLE IF NOT EXISTS my_table
+    (
+        my_col_int Int32,
+    )
+    ENGINE = MergeTree()
+    PRIMARY KEY (my_col_int)"""
+        ]
+    ]
+    diff = get_table_column_diff(db=db, base_class=SampleTable)
+    assert len(diff) == 1
+    assert diff[0].table_name == "my_table"
+    assert diff[0].column_name == "my_new_col_str"
+    assert diff[0].expected_type == "String"
+    assert diff[0].actual_type == None
 
 
 def test_flush_rows(db):
diff --git a/oonipipeline/tests/test_experiment_results.py b/oonipipeline/tests/test_experiment_results.py
deleted file mode 100644
index ffa1a551..00000000
--- a/oonipipeline/tests/test_experiment_results.py
+++ /dev/null
@@ -1,71 +0,0 @@
-from pprint import pprint
-
-from oonidata.models.observations import print_nice, print_nice_vertical
-from oonidata.dataclient import load_measurement
-
-from oonipipeline.analysis.control import (
-    BodyDB,
-    WebGroundTruthDB,
-    iter_ground_truths_from_web_control,
-)
-from oonipipeline.analysis.web_analysis import make_web_analysis
-from oonipipeline.analysis.website_experiment_results import (
-    make_website_experiment_results,
-)
-from oonipipeline.transforms.observations import measurement_to_observations
-
-
-# Check this for wc 0.5 overwriting tls analsysis
-# 20231031000227.813597_MY_webconnectivity_2f0b80761373aa7e
-def test_website_experiment_results(measurements, netinfodb, fingerprintdb):
-    msmt = load_measurement(
-        msmt_path=measurements[
-            "20221101055235.141387_RU_webconnectivity_046ce024dd76b564"
-        ]
-    )
-    web_observations, web_control_observations = measurement_to_observations(
-        msmt, netinfodb=netinfodb
-    )
-    assert isinstance(msmt.input, str)
-    web_ground_truth_db = WebGroundTruthDB()
-    web_ground_truth_db.build_from_rows(
-        rows=iter_ground_truths_from_web_control(
-            web_control_observations=web_control_observations,
-            netinfodb=netinfodb,
-        ),
-    )
-
-    web_ground_truths = web_ground_truth_db.lookup_by_web_obs(web_obs=web_observations)
-    web_analysis = list(
-        make_web_analysis(
-            web_observations=web_observations,
-            web_ground_truths=web_ground_truths,
-            body_db=BodyDB(db=None),  # type: ignore
-            fingerprintdb=fingerprintdb,
-        )
-    )
-
-    # TODO(arturo): there is currently an edge case here which is when we get an
-    # IPv6 answer, since we are ignoring them in the analysis, we will have N
-    # less analysis where N is the number of IPv6 addresses.
-    assert len(web_analysis) == len(web_observations)
-    # for wa in web_analysis:
-    #     print_nice_vertical(wa)
-
-    website_er = list(make_website_experiment_results(web_analysis))
-    assert len(website_er) == 1
-
-    wer = website_er[0]
-    analysis_transcript_list = wer.analysis_transcript_list
-
-    assert (
-        sum(wer.loni_blocked_values) + sum(wer.loni_down_values) + wer.loni_ok_value
-        == 1
-    )
-    assert wer.anomaly == True
-
-    # wer.analysis_transcript_list = None
-    # print_nice_vertical(wer)
-    # for loni in wer.loni_list:
-    #     pprint(loni.to_dict())
-    # print(analysis_transcript_list)
diff --git a/oonipipeline/tests/test_scoring.py b/oonipipeline/tests/test_scoring.py
deleted file mode 100644
index 790fd2c8..00000000
--- a/oonipipeline/tests/test_scoring.py
+++ /dev/null
@@ -1,55 +0,0 @@
-from unittest.mock import MagicMock
-
-import pytest
-
-from oonidata.models.experiment_result import print_nice_er
-from oonidata.dataclient import load_measurement
-
-from oonipipeline.analysis.control import (
-    WebGroundTruthDB,
-    iter_ground_truths_from_web_control,
-)
-from oonipipeline.transforms.observations import measurement_to_observations
-
-
-def test_tcp_scoring(measurements, netinfodb, fingerprintdb):
-    pytest.skip("TODO(arturo): implement this with the new analysis")
-    msmt = load_measurement(
-        msmt_path=measurements[
-            "20221101055235.141387_RU_webconnectivity_046ce024dd76b564"
-        ]
-    )
-    web_observations, web_control_observations = measurement_to_observations(
-        msmt, netinfodb=netinfodb
-    )
-    assert isinstance(msmt.input, str)
-    web_ground_truth_db = WebGroundTruthDB()
-    web_ground_truth_db.build_from_rows(
-        rows=iter_ground_truths_from_web_control(
-            web_control_observations=web_control_observations,
-            netinfodb=netinfodb,
-        ),
-    )
-    gt = web_ground_truth_db.lookup(
-        probe_cc="RU", probe_asn=8402, ip_ports=[("104.244.42.1", 443)]
-    )
-    assert len(gt) == 1
-    assert gt[0].tcp_success == 1
-
-    body_db = MagicMock()
-    body_db.lookup = MagicMock()
-    body_db.lookup.return_value = []
-
-    web_ground_truths = web_ground_truth_db.lookup_by_web_obs(web_obs=web_observations)
-    assert len(web_ground_truths) == 3
-    er = make_website_experiment_result(
-        web_observations=web_observations,
-        web_ground_truths=web_ground_truths,
-        body_db=body_db,
-        fingerprintdb=fingerprintdb,
-    )
-    all_er = list(er)
-
-    tcp_er = list(filter(lambda er: er.outcome_category == "tcp", all_er))
-    assert len(tcp_er) == 1
-    assert tcp_er[0].blocked_score > 0.6
diff --git a/oonipipeline/tests/test_workflows.py b/oonipipeline/tests/test_workflows.py
index 4305c993..fea940e7 100644
--- a/oonipipeline/tests/test_workflows.py
+++ b/oonipipeline/tests/test_workflows.py
@@ -1,9 +1,44 @@
-import asyncio
-from multiprocessing import Process
+from datetime import date, datetime, timedelta, timezone
+import gzip
 from pathlib import Path
+import sqlite3
+from typing import List, Tuple
+from unittest.mock import MagicMock
 import time
-from oonipipeline.cli.commands import cli
+import pytest
+
+from oonidata.dataclient import stream_jsonl, load_measurement
+from oonidata.models.nettests.dnscheck import DNSCheck
+from oonidata.models.nettests.web_connectivity import WebConnectivity
+from oonidata.models.nettests.http_invalid_request_line import HTTPInvalidRequestLine
+from oonidata.models.observations import HTTPMiddleboxObservation
+
+from oonipipeline.temporal.activities.common import get_obs_count_by_cc, ObsCountParams
+from oonipipeline.temporal.activities.observations import (
+    make_observations_for_file_entry_batch,
+)
+from oonipipeline.transforms.measurement_transformer import MeasurementTransformer
+from oonipipeline.transforms.observations import measurement_to_observations
+from oonipipeline.temporal.activities.analysis import (
+    MakeAnalysisParams,
+    make_analysis_in_a_day,
+    make_cc_batches,
+)
+from oonipipeline.temporal.common import (
+    get_prev_range,
+    maybe_delete_prev_range,
+)
+from oonipipeline.temporal.activities.ground_truths import (
+    MakeGroundTruthsParams,
+    make_ground_truths_in_day,
+)
+from oonipipeline.temporal.activities.observations import (
+    write_observations_to_db,
+)
+
+# from oonipipeline.workflows.response_archiver import ResponseArchiver
+# from oonipipeline.workflows.fingerprint_hunter import fingerprint_hunter
 
 
 def wait_for_mutations(db, table_name):
@@ -16,124 +51,303 @@
         time.sleep(1)
 
 
-def test_full_workflow(
-    db,
-    cli_runner,
-    fingerprintdb,
-    netinfodb,
-    datadir,
-    tmp_path: Path,
-    temporal_dev_server,
-):
-    result = cli_runner.invoke(
-        cli,
-        [
-            "mkobs",
-            "--probe-cc",
-            "BA",
-            "--start-day",
-            "2022-10-20",
-            "--end-day",
-            "2022-10-21",
-            "--test-name",
-            "web_connectivity",
-            "--create-tables",
-            "--data-dir",
-            datadir,
-            "--clickhouse",
-            db.clickhouse_url,
-            # "--archives-dir",
-            # tmp_path.absolute(),
-        ],
-    )
-    assert result.exit_code == 0
-    # assert len(list(tmp_path.glob("*.warc.gz"))) == 1
-    res = db.execute(
-        "SELECT COUNT(DISTINCT(measurement_uid)) FROM obs_web WHERE bucket_date = '2022-10-20' AND probe_cc = 'BA'"
-    )
-    assert res[0][0] == 200  # type: ignore
-    res = db.execute(
-        "SELECT COUNT() FROM obs_web WHERE bucket_date = '2022-10-20' AND probe_cc = 'BA'"
-    )
-    obs_count = res[0][0]  # type: ignore
-
-    result = cli_runner.invoke(
-        cli,
-        [
-            "mkobs",
-            "--probe-cc",
-            "BA",
-            "--start-day",
-            "2022-10-20",
-            "--end-day",
-            "2022-10-21",
-            "--test-name",
-            "web_connectivity",
-            "--create-tables",
-            "--data-dir",
-            datadir,
-            "--clickhouse",
-            db.clickhouse_url,
-        ],
-    )
-    assert result.exit_code == 0
-
-    # Wait for the mutation to finish running
-    wait_for_mutations(db, "obs_web")
-    res = db.execute(
-        "SELECT COUNT() FROM obs_web WHERE bucket_date = '2022-10-20' AND probe_cc = 'BA'"
-    )
-    # By re-running it against the same date, we should still get the same observation count
-    assert res[0][0] == obs_count  # type: ignore
-
-    result = cli_runner.invoke(
-        cli,
-        [
-            "mkgt",
-            "--start-day",
-            "2022-10-20",
-            "--end-day",
-            "2022-10-21",
-            "--data-dir",
-            datadir,
-            "--clickhouse",
-            "clickhouse://localhost/testing_oonidata",
-        ],
-    )
-    assert result.exit_code == 0
-
-    # result = cli_runner.invoke(
-    #     cli,
-    #     [
-    #         "fphunt",
-    #         "--data-dir",
-    #         datadir,
-    #         "--archives-dir",
-    #         tmp_path.absolute(),
-    #     ],
-    # )
-    # assert result.exit_code == 0
-
-    result = cli_runner.invoke(
-        cli,
-        [
-            "mkanalysis",
-            "--probe-cc",
-            "BA",
-            "--start-day",
-            "2022-10-20",
-            "--end-day",
-            "2022-10-21",
-            "--test-name",
-            "web_connectivity",
-            "--data-dir",
-            datadir,
-            "--clickhouse",
-            db.clickhouse_url,
-        ],
-    )
-    assert result.exit_code == 0
-    res = db.execute(
-        "SELECT COUNT(DISTINCT(measurement_uid)) FROM measurement_experiment_result WHERE measurement_uid LIKE '20221020%' AND location_network_cc = 'BA'"
-    )
-    assert res[0][0] == 200  # type: ignore
+def test_get_prev_range(db):
+    db.execute("DROP TABLE IF EXISTS test_range")
+    db.execute(
+        """CREATE TABLE test_range (
+            created_at DateTime64(3, 'UTC'),
+            bucket_date String,
+            test_name String,
+            probe_cc String
+        )
+        ENGINE = MergeTree
+        ORDER BY (bucket_date, created_at)
+        """
+    )
+    bucket_date = "2000-01-01"
+    test_name = "web_connectivity"
+    probe_cc = "IT"
+    min_time = datetime(2000, 1, 1, 23, 42, 00)
+    rows = [(min_time, bucket_date, test_name, probe_cc)]
+    for i in range(200):
+        rows.append((min_time + timedelta(seconds=i), bucket_date, test_name, probe_cc))
+    db.execute(
+        "INSERT INTO test_range (created_at, bucket_date, test_name, probe_cc) VALUES",
+        rows,
+    )
+    prev_range = get_prev_range(
+        db,
+        "test_range",
+        test_name=[test_name],
+        bucket_date=bucket_date,
+        probe_cc=[probe_cc],
+    )
+    assert prev_range.min_created_at and prev_range.max_created_at
+    assert prev_range.min_created_at == (min_time - timedelta(seconds=1))
+    assert prev_range.max_created_at == (rows[-1][0] + timedelta(seconds=1))
+    db.execute("TRUNCATE TABLE test_range")
+
+    bucket_date = "2000-03-01"
+    test_name = "web_connectivity"
+    probe_cc = "IT"
+    min_time = datetime(2000, 1, 1, 23, 42, 00)
+    rows: List[Tuple[datetime, str, str, str]] = []
+    for i in range(10):
+        rows.append(
+            (min_time + timedelta(seconds=i), "2000-02-01", test_name, probe_cc)
+        )
+    min_time = rows[-1][0]
+    for i in range(10):
+        rows.append((min_time + timedelta(seconds=i), bucket_date, test_name, probe_cc))
+
+    db.execute(
+        "INSERT INTO test_range (created_at, bucket_date, test_name, probe_cc) VALUES",
+        rows,
+    )
+    prev_range = get_prev_range(
+        db,
+        "test_range",
+        test_name=[test_name],
+        bucket_date=bucket_date,
+        probe_cc=[probe_cc],
+    )
+    assert prev_range.min_created_at and prev_range.max_created_at
+    assert prev_range.min_created_at == (min_time - timedelta(seconds=1))
+    assert prev_range.max_created_at == (rows[-1][0] + timedelta(seconds=1))
+
+    maybe_delete_prev_range(
+        db=db,
+        prev_range=prev_range,
+    )
+    wait_for_mutations(db, "test_range")
+    res = db.execute("SELECT COUNT() FROM test_range")
+    assert res[0][0] == 10
+    db.execute("DROP TABLE test_range")
+
+
+def test_make_cc_batches():
+    cc_batches = make_cc_batches(
+        cnt_by_cc={"IT": 100, "IR": 300, "US": 1000},
+        probe_cc=["IT", "IR", "US"],
+        parallelism=2,
+    )
+    assert len(cc_batches) == 2
+    # We expect the batches to be broken up into (IT, IR), ("US")
+    assert any([set(x) == set(["US"]) for x in cc_batches]) == True
+
+
+def test_make_file_entry_batch(datadir, db):
+    file_entry_batch = [
+        (
+            "ooni-data-eu-fra",
+            "raw/20231031/15/IR/webconnectivity/2023103115_IR_webconnectivity.n1.0.tar.gz",
+            "tar.gz",
+            4074306,
+        )
+    ]
+    obs_msmt_count = make_observations_for_file_entry_batch(
+        file_entry_batch, db.clickhouse_url, 100, datadir, "2023-10-31", ["IR"], False
+    )
+    assert obs_msmt_count == 453
+    make_ground_truths_in_day(
+        MakeGroundTruthsParams(
+            day=date(2023, 10, 31).strftime("%Y-%m-%d"),
+            clickhouse=db.clickhouse_url,
+            data_dir=datadir,
+        ),
+    )
+    analysis_res = make_analysis_in_a_day(
+        MakeAnalysisParams(
+            probe_cc=["IR"],
+            test_name=["webconnectivity"],
+            clickhouse=db.clickhouse_url,
+            data_dir=datadir,
+            fast_fail=False,
+            day=date(2023, 10, 31).strftime("%Y-%m-%d"),
+        ),
+    )
+    assert analysis_res["count"] == obs_msmt_count
+
+
+def test_write_observations(measurements, netinfodb, db):
+    msmt_uids = [
+        ("20210101190046.780850_US_webconnectivity_3296f126f79ca186", "2021-01-01"),
+        ("20210101181154.037019_CH_webconnectivity_68ce38aa9e3182c2", "2021-01-01"),
+        ("20231031032643.267235_GR_dnscheck_abcbfc460b9424b6", "2023-10-31"),
+        (
+            "20231101164541.763506_NP_httpinvalidrequestline_0cf676868fa36cc4",
+            "2023-10-31",
+        ),
+        (
+            "20231101164544.534107_BR_httpheaderfieldmanipulation_4caa0b0556f0b141",
+            "2023-10-31",
+        ),
+        ("20231101164649.235575_RU_tor_ccf7519bf683c022", "2023-10-31"),
+        (
+            "20230907000740.785053_BR_httpinvalidrequestline_bdfe6d70dcbda5e9",
+            "2023-09-07",
+        ),
+    ]
+    for msmt_uid, bucket_date in msmt_uids:
+        msmt = load_measurement(msmt_path=measurements[msmt_uid])
+        write_observations_to_db(msmt, netinfodb, db, bucket_date)
+    db.close()
+    cnt_by_cc = get_obs_count_by_cc(
+        ObsCountParams(
+            clickhouse_url=db.clickhouse_url,
+            start_day="2020-01-01",
+            end_day="2023-12-01",
+        )
+    )
+    assert cnt_by_cc["CH"] == 2
+    assert cnt_by_cc["GR"] == 4
+    assert cnt_by_cc["US"] == 3
+    assert cnt_by_cc["RU"] == 3
+
+
+def test_hirl_observations(measurements, netinfodb):
+    msmt = load_measurement(
+        msmt_path=measurements[
+            "20230907000740.785053_BR_httpinvalidrequestline_bdfe6d70dcbda5e9"
+        ]
+    )
+    assert isinstance(msmt, HTTPInvalidRequestLine)
+    middlebox_obs: List[HTTPMiddleboxObservation] = measurement_to_observations(
+        msmt, netinfodb=netinfodb
+    )[0]
+    assert isinstance(middlebox_obs[0], HTTPMiddleboxObservation)
+    assert middlebox_obs[0].hirl_success == True
+    assert middlebox_obs[0].hirl_sent_0 != middlebox_obs[0].hirl_received_0
+
+
+def test_insert_query_for_observation(measurements, netinfodb):
+    http_blocked = load_measurement(
+        msmt_path=measurements[
+            "20220608121828.356206_RU_webconnectivity_80e3fa60eb2cd026"
+        ]
+    )
+    assert isinstance(http_blocked, WebConnectivity)
+    mt = MeasurementTransformer(measurement=http_blocked, netinfodb=netinfodb)
+    all_web_obs = [
+        obs
+        for obs in mt.make_http_observations(
+            http_blocked.test_keys.requests,
+        )
+    ]
+    assert all_web_obs[-1].request_url == "http://proxy.org/"
+
+
+def test_web_connectivity_processor(netinfodb, measurements):
+    msmt = load_measurement(
+        msmt_path=measurements[
+            "20220627131742.081225_GB_webconnectivity_e1e2cf4db492b748"
+        ]
+    )
+    assert isinstance(msmt, WebConnectivity)
+
+    web_obs_list, web_ctrl_list = measurement_to_observations(msmt, netinfodb=netinfodb)
+    assert len(web_obs_list) == 3
+    assert len(web_ctrl_list) == 3
+
+
+def test_dnscheck_processor(measurements, netinfodb):
+    db = MagicMock()
+    db.write_row = MagicMock()
+
+    msmt = load_measurement(
+        msmt_path=measurements["20221013000000.517636_US_dnscheck_bfd6d991e70afa0e"]
+    )
+    assert isinstance(msmt, DNSCheck)
+    obs_list = measurement_to_observations(msmt=msmt, netinfodb=netinfodb)[0]
+    assert len(obs_list) == 20
+
+
+def test_full_processing(raw_measurements, netinfodb):
+    for msmt_path in raw_measurements.glob("*/*/*.jsonl.gz"):
+        with msmt_path.open("rb") as in_file:
+            for msmt_dict in stream_jsonl(in_file):
+                msmt = load_measurement(msmt_dict)
+                measurement_to_observations(
+                    msmt=msmt,
+                    netinfodb=netinfodb,
+                )
+
+
+def test_archive_http_transaction(measurements, tmpdir):
+    pytest.skip("TODO(art): fixme")
+    db = MagicMock()
+    db.write_row = MagicMock()
+
+    msmt = load_measurement(
+        msmt_path=measurements[
+            "20220627131742.081225_GB_webconnectivity_e1e2cf4db492b748"
+        ]
+    )
+    assert isinstance(msmt, WebConnectivity)
+    assert msmt.test_keys.requests
+    dst_dir = Path(tmpdir)
+    with ResponseArchiver(dst_dir=dst_dir) as archiver:
+        for http_transaction in msmt.test_keys.requests:
+            if not http_transaction.response or not http_transaction.request:
+                continue
+            request_url = http_transaction.request.url
+            status_code = http_transaction.response.code or 0
+            response_headers = http_transaction.response.headers_list_bytes or []
+            response_body = http_transaction.response.body_bytes
+            assert response_body
+            archiver.archive_http_transaction(
+                request_url=request_url,
+                status_code=status_code,
+                response_headers=response_headers,
+                response_body=response_body,
+                matched_fingerprints=[],
+            )
+
+    warc_files = list(dst_dir.glob("*.warc.gz"))
+    assert len(warc_files) == 1
+    with gzip.open(warc_files[0], "rb") as in_file:
+        assert b"Run OONI Probe to detect internet censorship" in in_file.read()
+
+    conn = sqlite3.connect(dst_dir / "graveyard.sqlite3")
+    res = conn.execute("SELECT COUNT() FROM oonibodies_archive")
+    assert res.fetchone()[0] == 1
+
+
+def test_fingerprint_hunter(fingerprintdb, measurements, tmpdir):
+    pytest.skip("TODO(art): fixme")
+    db = MagicMock()
+    db.write_rows = MagicMock()
+
+    archives_dir = Path(tmpdir)
+    http_blocked = load_measurement(
+        msmt_path=measurements[
+            "20220608121828.356206_RU_webconnectivity_80e3fa60eb2cd026"
+        ]
+    )
+    assert isinstance(http_blocked, WebConnectivity)
+    with ResponseArchiver(dst_dir=archives_dir) as response_archiver:
+        assert http_blocked.test_keys.requests
+        for http_transaction in http_blocked.test_keys.requests:
+            if not http_transaction.response or not http_transaction.request:
+                continue
+            request_url = http_transaction.request.url
+            status_code = http_transaction.response.code or 0
+            response_headers = http_transaction.response.headers_list_bytes or []
+            response_body = http_transaction.response.body_bytes
+            assert response_body
+            response_archiver.archive_http_transaction(
+                request_url=request_url,
+                status_code=status_code,
+                response_headers=response_headers,
+                response_body=response_body,
+                matched_fingerprints=[],
+            )
+
+    archive_path = list(archives_dir.glob("*.warc.gz"))[0]
+    detected_fps = list(
+        fingerprint_hunter(
+            fingerprintdb=fingerprintdb,
+            archive_path=archive_path,
+        )
+    )
+    assert len(detected_fps) == 1