test-lists: Create script to automatically delete expired and parked domains #1227

Closed
agrabeli opened this issue Sep 9, 2022 · 1 comment
Labels: data quality, duplicate, test lists

agrabeli commented Sep 9, 2022

Given that the Citizen Lab test lists (https://github.com/citizenlab/test-lists/tree/master/lists) were originally created by Open Net Initiative researchers between 2008 and 2012, they include many URLs with expired and parked domains.

It would therefore be great if we could create a script that automatically detects and deletes URLs with expired and parked domains.

This would significantly simplify the test list review process for researchers, and it would also improve OONI measurement quality.

This activity has been included as an OONI challenge in Roskomsvoboda's DEMHACK hackathon (September 2022): https://demhack.ru/

If this activity is not implemented as part of the hackathon, the OONI team should pick it up.

[Update: 2023-03-15 - we did half of the work; please see https://github.com/ooni/probe/issues/1826, which covers the remaining part of the work originally covered by this issue.]

bassosimone (Contributor) commented

I am going to close this issue as a duplicate because:

Because this issue covers both cases, it is a full duplicate of those two issues.

@bassosimone closed this as not planned (duplicate) on Mar 15, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Mar 16, 2023
This pull request introduces the gardener, a tool to curate the test
lists. With @sloncocs, @hellais, @agrabeli and other colleagues from
OONI and Netalytica we have been working on improving the policies to
update the test lists for quite some time. The tool included in this
pull request helps address one of the easiest cases, i.e., the one
where a domain inside the test list does not exist anymore. Because it
is important to balance removing the domain with the fact that the
domain could still be censored, this tool does not automatically remove
a domain from the test lists. Rather, if you run the `dnsreport`
subcommand, it just produces a CSV report that a researcher could
inspect to choose whether to keep the domain. That said, the committed
tool also includes a `dnsfix` subcommand that applies `dnsreport`
results to the test lists and removes all entries for which we did not
observe any anomaly or confirmed blocking in the last month.

This pull request touches upon several issues related to managing the
test lists that we opened:

* ooni/probe#1748 advocates for creating a
gardener prototype, which we did here;

* ooni/probe#1747 advocates for automatically
removing expired domains, which is possible by combining the
`dnsreport` and `dnsfix` gardener subcommands;

* ooni/probe#1745 advocates for creating a
process for test lists maintenance, which we can now start doing thanks
to the gardener tool introduced in this PR;

* ooni/ooni.org#1227 advocates for creating a
script to automatically remove expired and parked domains, which we
start to address here by having a documented way of removing uncensored,
expired domains;

* ooni/ooni.org#363 is an umbrella issue about
collaborating with Netalytica and writing software to make the
collaboration easier, and we have done that by releasing a tool that
starts moving us in the right direction and helps us to know which
domains have now expired and automatically remove _some_ of them.

Updating the test list is a delicate balancing exercise between removing
what is now parked or expired and keeping what is still heavily censored
and helps us fingerprint censorship in a country. It took us quite
some time and lots of internal and external discussion to figure out the
requirements for the gardener. Now that all this discussion is finally
being converted to pull requests, we should all celebrate a bit to
acknowledge that this work is a stepping stone towards making the whole
test lists ecosystem easier to maintain and evolve. 🥳 🥳 🥳 🥳

The related test-lists pull request is
citizenlab/test-lists#1247.
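
The two-step flow described in the commit message above (first report which domains no longer resolve, then remove only those with no recent anomaly or confirmed blocking) can be illustrated with a minimal Python sketch. This is not the gardener's code: the function names `domain_exists`, `dns_report`, and `dns_fix`, the CSV columns, and the `recent_findings` mapping are all hypothetical and chosen only for illustration. The text does not say how the real tool obtains measurement data, so the sketch takes it as an input.

```python
import csv
import io
import socket
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlsplit


def domain_exists(domain: str) -> bool:
    """Hypothetical check: True if the system resolver returns any address.

    Any lookup failure is treated as "does not exist". A real tool would
    need to distinguish NXDOMAIN from transient resolver errors.
    """
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False


def dns_report(urls: Iterable[str],
               resolver: Callable[[str], bool] = domain_exists) -> str:
    """Build a CSV report (assumed columns: url, domain, exists)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["url", "domain", "exists"])
    for url in urls:
        domain = urlsplit(url).hostname or ""
        exists = "true" if resolver(domain) else "false"
        writer.writerow([url, domain, exists])
    return out.getvalue()


def dns_fix(urls: Iterable[str], report_csv: str,
            recent_findings: Dict[str, int]) -> List[str]:
    """Return the URLs to keep.

    A URL is dropped only if the report says its domain does not exist
    AND recent_findings (assumed: url -> number of anomalous or confirmed
    measurements in the last month) has no positive count for it.
    """
    missing = {
        row["url"]
        for row in csv.DictReader(io.StringIO(report_csv))
        if row["exists"] == "false"
    }
    return [
        url for url in urls
        if url not in missing or recent_findings.get(url, 0) > 0
    ]
```

Passing a stub such as `resolver=lambda d: d != "gone.example"` to `dns_report` lets a researcher try the flow without touching the network. Keeping the report as a separate CSV step mirrors the point made above: a person can inspect the report before any entry is removed.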
cyBerta pushed a commit to cyBerta/probe-cli that referenced this issue Aug 4, 2023