Sampling from the DAG with probabilities (and constraints) #28

Open
marybarker opened this issue Oct 12, 2022 · 2 comments
marybarker commented Oct 12, 2022

Sampling uniformly from the DAG:

We would like to sample trees from the DAG in a way that takes into account the distribution of trees in the DAG itself.

In order to do this efficiently, we would like to attach probability distributions to edges.
If the function $\mathrm{count}(e)$ yields the number of possible subtrees below edge $e$, then we assign a probability to every edge $e$ in a clade $c$ as its subtree count divided by the total subtree count over all edges in the clade:
$$P(e) = \frac{\mathrm{count}(e)}{\sum_{e_i \in c} \mathrm{count}(e_i)}$$
Then trees can be sampled uniformly by following an iterative process beginning at the UA node:
for each clade below the current node, choose an edge (and hence a child node) from the set of edges descending from that clade according to the weights assigned to those edges.
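
To make the procedure concrete, here is a minimal sketch of that downward pass, assuming each node exposes a `clades` mapping from each child clade to `(child_node, subtree_count)` pairs; these names are illustrative placeholders (not the historydag API), and `subtree_count` plays the role of count(e).

```python
import random

# Hedged sketch of uniform downward sampling. Assumes each node has a
# `clades` attribute mapping each child clade to a list of
# (child_node, subtree_count) pairs; these are placeholder names,
# not the historydag API.
def sample_tree_uniformly(ua_node):
    tree_edges = []
    stack = [ua_node]
    while stack:
        node = stack.pop()
        for clade, edges in node.clades.items():
            children, counts = zip(*edges)
            total = sum(counts)
            # P(e) = count(e) / sum of count(e_i) over the clade's edges
            weights = [c / total for c in counts]
            child = random.choices(children, weights=weights, k=1)[0]
            tree_edges.append((node, child))
            stack.append(child)
    return tree_edges
```

Because the edge weights within each clade are proportional to the subtree counts below them, the per-tree probabilities telescope and every tree in the DAG is returned with equal probability.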

Sampling uniformly from a constrained DAG:

We would also like to sample uniformly from the set of trees in the DAG that contain a given (fixed) node.
There are two components to this problem:

  1. Getting the set of trees that contain a fixed node.
  2. Sampling uniformly from that set.

It turns out that this has a fairly straightforward answer if we have already implemented a uniform sampling method.
There is a way to efficiently count the number of trees that a node shows up in (see the method count_nodes in the historydag repo).
Then we can assign a probability to each node: $P(n) = \frac{|\text{trees containing }n|}{|\text{trees in the DAG}|}$.
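
A tiny sketch of that assignment, assuming we already have per-node containment counts (e.g. from something like the count_nodes method mentioned above) and the total number of trees in the DAG; the argument names are placeholders.

```python
# Hedged sketch: P(n) = |trees containing n| / |trees in the DAG|.
# `node_tree_counts` is an assumed dict from node to the number of trees
# containing it, and `num_trees` is the total number of trees in the DAG.
def node_containment_probabilities(node_tree_counts, num_trees):
    return {node: count / num_trees for node, count in node_tree_counts.items()}
```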

Here is a bit of useful notation:

  • For a node $n$ in a DAG, the leafset of $n$ is the set of leaf nodes reachable below $n$.
  • If we have a tree $T$ containing a node $n$ (and hence a subtree below $n$), then the subtree complement of $n$ is the tree obtained by pruning the entire subtree below $n$ from $T$.

In a DAG, if we fix a given node $n$, then any tree in the DAG that contains $n$ can be partitioned into the subtree below $n$ and a subtree complement, whose leafset is disjoint from the leafset of $n$.

Note that, for a fixed node $n$, sampling a subtree below $n$ and sampling a subtree complement from the DAG are independent.
Therefore, to sample uniformly from the constrained DAG, we can take a uniform sample below $n$ and a uniform sample of the subtree complement separately.
The difficulty lies in how to choose a path 'upward' from $n$ to the UA node such that the resulting path is also drawn from a uniform distribution on such paths.

This is analogous to the 'downward' sampling problem, with the following two small modifications:

  1. Only a single parent node is chosen at each step, whereas in the downward problem we have a single edge chosen from each clade, and a node may have multiple clades.
  2. The probabilities assigned to edges in the upward direction must be calculated based on the choice of the node at each step.

At a given node $n$ with $k$ potential parents $p_1, \ldots, p_k$, we define the probability assigned in the upward direction to the edge $e_i$ connecting $n$ to $p_i$ using Bayes' rule:

$$P^{\text{upward}}(e_i) := P(p_i \mid n) = \frac{P(n \mid p_i)\,P(p_i)}{P(n)} = \frac{P(e_i)\,P(p_i)}{P(n)}$$

Here $P(n)$ and $P(p_i)$ are probabilities assigned to nodes, which we can compute.
Also $P(e_i)$ is the probability assigned to edge $e_i$ based on its membership in a clade.
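
A minimal sketch of the upward walk this defines, assuming precomputed node probabilities and downward conditional edge probabilities; all of the data structures below are assumptions for illustration, not the historydag API.

```python
import random

# Hedged sketch of sampling a path upward from a fixed node n to the UA node,
# using P_upward(e_i) = P(e_i) * P(p_i) / P(n) as above. Assumed placeholders:
#   node_prob[v]      -- P(v), a precomputed node probability
#   edge_prob[(p, v)] -- P(v | p), the downward conditional edge probability
#   parents(v)        -- returns the parent nodes of v in the DAG
def sample_upward_path(n, ua_node, node_prob, edge_prob, parents):
    path = [n]
    current = n
    while current is not ua_node:
        candidate_parents = list(parents(current))
        weights = [
            edge_prob[(p, current)] * node_prob[p] / node_prob[current]
            for p in candidate_parents
        ]
        current = random.choices(candidate_parents, weights=weights, k=1)[0]
        path.append(current)
    return path  # nodes from n up to the UA node
```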

willdumm commented Nov 1, 2022

In general, given downward conditional edge probabilities $P(n_c \mid n_p)$ for each directed edge $(n_p, n_c)$, we can compute the probability of a node $n$ with the following recursion, where $S(n)$ is the set of parent nodes of $n$ in the hDAG and the probability of the UA node $\rho$ is $P(\rho) = 1$:

$$ P(n)=\sum\limits_{n_{p} \in S(n)} P(n_p) P(n | n_p) $$

(We think)

Then the probability of an edge is the probability of its parent node, times its own downward conditional probability.
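
A sketch of that recursion, plus the edge probability just described, computed by visiting nodes in topological order from the UA node; the inputs are assumed placeholders rather than historydag API.

```python
# Hedged sketch of P(n) = sum over parents n_p in S(n) of P(n_p) * P(n | n_p),
# seeded with P(rho) = 1 for the UA node rho. Assumed placeholders:
#   topological_order -- nodes ordered so every parent precedes its children
#   edge_prob[(p, c)] -- the downward conditional probability P(c | p)
#   children(p)       -- returns the child nodes of p in the DAG
def node_probabilities_from_edges(ua_node, topological_order, edge_prob, children):
    prob = {ua_node: 1.0}
    for parent in topological_order:
        for child in children(parent):
            prob[child] = prob.get(child, 0.0) + prob[parent] * edge_prob[(parent, child)]
    return prob

def edge_probabilities(prob, edge_prob):
    # An edge (p, c) gets probability P(p) times its downward conditional P(c | p).
    return {(p, c): prob[p] * p_cond for (p, c), p_cond in edge_prob.items()}
```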

matsen commented Jun 20, 2023

We can discuss, but this doesn't appear to be a direction we are heading because we aren't thinking that the hDAG itself will "cover" enough of the probabilistically-interesting space.
