Look into applying some more rigorous approach to generating experiment results #84

Open
5 tasks
hellais opened this issue Aug 26, 2024 · 1 comment

Comments

@hellais
Member

hellais commented Aug 26, 2024

Currently, experiment results are coded semi-manually, using Bayesian-style reasoning to come up with the weights.

It is, however, possible to do this more rigorously by using well-established graph-based modeling systems such as Bayesian networks.

Work on this started a few months ago, and I had a very fruitful conversation about the topic with Joss, who provided key insight.

As part of this activity, the plan is to move this forward by doing more modeling with Bayesian networks and seeing how well it works.
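
To make the idea concrete, here is a minimal sketch of what such a network could look like, written in plain Python with inference by enumeration rather than with any particular library. The node names (`blocked`, `dns`, `tcp`) and every probability below are made up for illustration; they are not taken from the pipeline or from the notebook.

```python
from itertools import product

# Hypothetical structure: blocked -> dns (inconsistent), blocked -> tcp (failure).
# All numbers are illustrative assumptions, not fitted or measured values.
P_BLOCKED = {True: 0.05, False: 0.95}

# P(child = True | blocked)
P_DNS_GIVEN_BLOCKED = {True: 0.7, False: 0.02}
P_TCP_GIVEN_BLOCKED = {True: 0.6, False: 0.05}


def bernoulli(p_true, value):
    """Probability of a boolean `value` when P(value = True) is `p_true`."""
    return p_true if value else 1.0 - p_true


def joint(blocked, dns, tcp):
    """Joint probability, factorised along the edges of the graph."""
    return (
        P_BLOCKED[blocked]
        * bernoulli(P_DNS_GIVEN_BLOCKED[blocked], dns)
        * bernoulli(P_TCP_GIVEN_BLOCKED[blocked], tcp)
    )


def posterior_blocked(evidence):
    """P(blocked = True | evidence) by brute-force enumeration.

    `evidence` maps "dns" and/or "tcp" to a bool. Variables missing
    from it are summed out (marginalised) instead of being guessed.
    """
    totals = {True: 0.0, False: 0.0}
    for blocked, dns, tcp in product([True, False], repeat=3):
        assignment = {"dns": dns, "tcp": tcp}
        if any(assignment[name] != value for name, value in evidence.items()):
            continue  # inconsistent with what was observed
        totals[blocked] += joint(blocked, dns, tcp)
    return totals[True] / (totals[True] + totals[False])


if __name__ == "__main__":
    print(posterior_blocked({"dns": True, "tcp": True}))
    print(posterior_blocked({"dns": True}))
    print(posterior_blocked({}))  # no evidence: falls back to the prior
```

Compared with hand-tuned weights, the appeal is that the only inputs are the graph and the conditional probabilities, and the posterior follows mechanically from them. Enumeration is only practical for toy networks like this one.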

Some sub-activities as part of this might include:

  • Come up with labeled data (probably enriched with what we have from the feedback reporting system) to validate the model and/or bootstrap/train it
    • Build a web interface to make labeling data faster (labeling many measurements via Explorer currently takes too many clicks)
  • Refine and experiment with different features for the Bayes net
  • Iterate on various configurations of the Bayes network
  • Consider extending the observation data format to make it easier to extract the necessary features
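
For the validation item, one possible way to score a model against labeled data is precision and recall over thresholded posteriors. This is only a sketch: the `(is_blocked, posterior)` pair used as the record shape, the 0.5 threshold, and the sample values are assumptions for illustration, not the format of the feedback reporting system.

```python
def confusion_counts(samples, threshold=0.5):
    """Count (tp, fp, fn, tn) for (is_blocked, posterior) pairs.

    A sample is predicted as blocked when its posterior is >= threshold.
    """
    tp = fp = fn = tn = 0
    for is_blocked, posterior in samples:
        predicted = posterior >= threshold
        if predicted and is_blocked:
            tp += 1
        elif predicted and not is_blocked:
            fp += 1
        elif not predicted and is_blocked:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(samples, threshold=0.5):
    """Precision and recall, defined as 0.0 when the denominator is empty."""
    tp, fp, fn, _ = confusion_counts(samples, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up labels and posteriors, purely to exercise the functions.
    labeled = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
    print(precision_recall(labeled))
```

Sweeping `threshold` over the same labeled set would show how the trade-off between false positives and missed blocks moves.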
@hellais hellais self-assigned this Aug 26, 2024
@hellais
Member Author

hellais commented Aug 26, 2024

Work in progress on this is happening on this branch: #85

In particular, see the notebook, which implements an early-stage version of the Bayes net: https://github.com/ooni/data/blob/bayes-net/oonipipeline/notebooks/web-analysis-bn.ipynb

There are still a few critical theoretical hurdles to overcome. These are questions I would like to pose to people who have more experience with this, namely:

  • What are some best practices or rules of thumb for determining the optimal cardinality of the nodes, and when is it appropriate to split a particular proposition into more sub-propositions?
  • How do you deal with the fact that the state of a particular proposition might be undefined? Is it OK for it to just be T | F, or is it recommended to explicitly add an "unknown" state?
  • Are there best practices on the optimal cardinality of the CPD tables? (pgmpy has a hard limit of 32, but manually populating tables even of width 10+ is extremely tedious.) Are there tricks for splitting the nodes up in such a way as to keep the cardinality low?
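
To frame the table-size question with some arithmetic: a full CPD table has one column per combination of parent states, so its width is the product of the parent cardinalities. One commonly used way to avoid filling such a table by hand is a noisy-OR parameterisation, which needs one number per parent plus an optional leak term. The sketch below is plain Python, not pgmpy, and is not what the notebook currently does; the cause names and strengths are hypothetical.

```python
from math import prod


def cpd_columns(parent_cardinalities):
    """Number of parent-state combinations (columns) in a full CPD table."""
    return prod(parent_cardinalities)


def full_cpd_parameters(node_cardinality, parent_cardinalities):
    """Free parameters in a full table: (k - 1) per column for a k-state node."""
    return (node_cardinality - 1) * cpd_columns(parent_cardinalities)


def noisy_or(active_causes, strengths, leak=0.0):
    """P(effect = True) under a noisy-OR model.

    strengths[name] is the probability that the cause `name`, acting
    alone, produces the effect. `leak` is the probability that the
    effect occurs with no active cause at all.
    """
    p_effect_absent = 1.0 - leak
    for name in active_causes:
        p_effect_absent *= 1.0 - strengths[name]
    return 1.0 - p_effect_absent


if __name__ == "__main__":
    print(cpd_columns([2] * 5))  # five binary parents
    print(cpd_columns([3] * 5))  # same parents with an explicit "unknown" state
    # Hypothetical causes of an anomaly, with made-up strengths.
    strengths = {"dns_tampering": 0.5, "tcp_reset": 0.5}
    print(noisy_or(["dns_tampering", "tcp_reset"], strengths))
```

With five binary parents the full table has 32 columns, and adding an explicit "unknown" state to each of them grows it to 243, whereas a noisy-OR over the same five parents needs five strengths plus a leak. The trade-off is that noisy-OR assumes the causes act independently, which may or may not hold for the propositions in question.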
