Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for DP noise during aggregation for Prio3SumVec, Prio3Histogram #1072

Merged
merged 21 commits into from
Jul 19, 2024

Conversation

divergentdave
Copy link
Collaborator

This PR implements support for differential privacy via aggregator randomization for two more Prio3 VDAFs, Prio3SumVec and Prio3Histogram. I tacked on Prio3Histogram because it was easy to do so, and it has nice utility behavior thanks to the mutual exclusion property enforced by the circuit. These implementations combine a pure DP budget and the discrete Laplace distribution. We already have a routine to sample the discrete Laplace, as it is a subroutine in our discrete Gaussian sampler, so this PR additionally packages it up in a new public type.

I still want to add some tests to smoke test the AggregatorWithNoise interface for these, and perhaps a distributional test on the noised results. I could probably extract the common noise-adding loop here and in fixedpoint_l2, in the spirit of DRY.

This resolves #1068.

@divergentdave
Copy link
Collaborator Author

FWIW, draft-wang-ppm-differential-privacy provides a combination of Prio3Histogram with discrete Gaussian noise to achieve approximate differential privacy. We could also implement that, in parallel with the strategy provided here, but it would be blocked on first addressing #694, which would not be too hard. (it needs a floored square root implemented via binary search and a new constructor for the discrete Gaussian distribution)

fn add_noise_to_result(
&self,
dp_strategy: &PureDpDiscreteLaplace,
agg_result: &mut [Self::Field],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this interface, but add_noise_to_agg_share seems more appropriate as the function name. However, my understanding of the original issue (#1068) is to add noise to the final aggregate result at collector stage, so this argument is more like agg_result: &mut [Self::Field::Integer]? Obviously that doesn't hold up the DAP/VDAF trust model, because ideally we want to add noise at aggregator stage (at least independently to start with).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, AggregatorWithNoise has an add_noise_to_agg_share() method already, while TypeWithNoise has add_noise_to_result(). I think changing the name of the latter to add_noise_to_agg_share() would probably be clearer, considering the nouns used in the description of FLPs. I'll file a separate issue for this, since it would be a breaking change to the trait.

Prio3's add_noise_to_agg_share() dispatches to the inner FLP's add_noise_to_result(), so this does indeed happen on both aggregators, before the final aggregate result is computed, in the field domain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a TODO for the issue filed for the rename?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is #1073.


/// Create a new sampler for the discrete Laplace distribution with a scale parameter calibrated
/// to provide `epsilon`-differential privacy when added to the result of an integer-valued
/// function with l1-sensitivity `sensitivity`, following Lemma 29 from [[CKS20]]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed in Lemma 29, the query function outputs an integer $q: X^n \to \mathbb{Z}$. Since we are adding discrete Laplace independently on each coordinate, does this calculation still apply?

CKS20 does provide a multivariate version for discrete Gaussian in Theorem 14, but not for discrete Laplace

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. Proposition 1 in Calibrating Noise to Sensitivity in Private Data Analysis shows the same thing, but with the continuous Laplace distribution. I think the mechanism is correct, but this will need a better reference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked around for a good cite for applying the discrete Laplace mechanism/geometric mechanism to a query that produces $\mathbb{Z}^n$, but came up short. I added an intuitive argument here, instead.


// Project it into the field by taking the modulus, converting to a fixed-precision
// integer, and converting to a field element.
let noise_wrapped = noise.mod_floor(&modulus);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: does this function throw an exception if for example, noise is more than modulus?

Another consideration is: because noise could be negative, we may need to reserve some tail end of the field modulus for negative values in aggregate result, so there may need to be a reasonable upper bound on the magnitude of noise, perhaps |noise| < modulus/2?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this succeeds unconditionally, and would wrap around on very large noise values. The rationale, (see #578 (comment) and #578 (comment)) is that adding wrapped noise in the field, then unsharding to produce an integer, is equivalent to a trusted curator adding noise to the true sum using arbitrary-precision integers, and then taking the modulus as a post-processing step. (and post-processing preserves DP guarantees) I should bake this reasoning into a comment when I extract this loop into a helper function.

The mod_floor() operation does wrap negative noise values around to positive values below modulus; we're fine with this overlapping with the extreme upper tail of the noise distribution.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense. Out of curiosity, how do you plan to interpret the aggregate result with noise in this case? Given a reasonably large field, and assuming number of clients cannot exceed the field modulus, how much of the tail end of the field modulus do you plan to reserve for noise?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't thought about it too much. I suppose the optimal choice will depend on the batch size (and the VDAF's type and parameters).

@junyechen1996
Copy link

junyechen1996 commented Jun 6, 2024

FWIW, draft-wang-ppm-differential-privacy provides a combination of Prio3Histogram with discrete Gaussian noise to achieve approximate differential privacy. We could also implement that, in parallel with the strategy provided here, but it would be blocked on first addressing #694, which would not be too hard. (it needs a floored square root implemented via binary search and a new constructor for the discrete Gaussian distribution)

To implement that, I think we could also start with some concentrated-DP budget, and see what level of CDP gives us the approximate-DP we want. The analysis given in draft-wang-ppm-differential-privacy uses BW18, which assumes continuous Gaussian technically, and we need to be careful when Gaussian sigma is small (also raised as a future TODO in the draft).
And for the L2-sensitivity, I'm leaning towards having a "squared L2-sensitivity" argument so we don't have to worry about taking the square root, and can avoid implementing a binary search to look for the next rational number of the square root.

@divergentdave divergentdave marked this pull request as ready for review June 7, 2024 22:51
@divergentdave divergentdave requested a review from a team as a code owner June 7, 2024 22:51
Copy link

@martinthomson martinthomson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have questions.

// Compute the l1-sensitivity of the aggregation function (assuming the substitution-DP
// model). The worst case is when one individual's measurement changes such that each vector

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that substitution-DP is relevant to this case. Your goal is to measure the contribution of a single submission. A submission could be $[0..2^b)$ on all dimensions equally, which leads to this outcome. But that does not imply substitution. If you have substitution, then you could have $\{A=0 , B=2^b-1\}$ or $\{A=2^b-1 , B=0\}$. That means sensitivity under substitution effectively doubles, but I don't see how substitution applies to DAP.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I relied on the reasoning in https://wangshan.github.io/draft-wang-ppm-differential-privacy/draft-wang-ppm-differential-privacy.html#appendix-B.2 for the choice of substitution-DP. The collector also learns the number of reports (which is needed for unsharding in some VDAFs, or for RAPPOR-style debiasing) and thus we can only get indistinguishable protocol runs if we assume substitution.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with that reasoning. It will depend on the setting, but you have to assume that each contribution is unknown to the collector. A histogram contribution could be a 0 just as much as it could be a 1. That's the contribution that we're hiding, not that a given person is participating or not. Substitution assumes that each report contributes something, which is not something we rely on.

My suggestion would be to externalize whether you want to use substitution as the basis of your sensitivity analysis. That is, if someone believes that substitution is necessary in their setting, they can set epsilon accordingly (i.e., halve it). Otherwise, settings where it is directly detrimental (i.e., the case we currently care about) will have to adjust in the other direction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of Prio3Histogram, the validity circuit enforces that every contribution adds 1 to one of the buckets, so there is no possible 'zero measurement' to exchange with. Using Prio3MultihotCountVec with max_weight = 1 instead would permit contributions to add nothing.

We could allow a choice between substitution and "replacement with zero" (as a modification of deletion-DP) by rolling that choice into the DifferentialPrivacyStrategy trait as well. Then, the different strategy-specific implementations of AggregatorWithNoise/TypeWithNoise could use their own sensitivity formulas as appropriate.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with this reasoning for the choice of substitution-DP:

The collector also learns the number of reports (which is needed for unsharding in some VDAFs, or for RAPPOR-style debiasing) and thus we can only get indistinguishable protocol runs if we assume substitution.

For unsharding in VDAF, num_measurements is an argument in the public API, and in DAP, report_count is in Collection, so the number of reports in the batch is public information.

My suggestion would be to externalize whether you want to use substitution as the basis of your sensitivity analysis. That is, if someone believes that substitution is necessary in their setting, they can set epsilon accordingly (i.e., halve it).

If the code assumes substitution-DP in its sensitivity analysis, and if someone wants to use deletion-DP, they can double the target central epsilon?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does @martinthomson's suggestion imply specific client behavior, such as randomized response? For example, a client might submit a 1-hot vector, or a vector of all 0s, depending on the outcome of a coinflip?

Exposing in the API what notion of "neighboring databases" we consider seems fraught to me. It's perhaps safer if we don't ask the user to go through this analysis themselves. Of course, this is just applying a more general principle of good API design: with DP we may be in a situation where we really do need to expose a choice (and have an expert make that choice).

Question: Is there a way to be "conservative" here? In particular, is it the case that substitution-DP implies deletion-DP with the same bound, but perhaps at the cost of some amount of utility?

I understand there is time pressure to merge this, so if we merge as-is, I would suggest at least making sure it's documented that we're assuming the application requires substitution-DP.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The neighboring dataset definitions used are documented on the PureDpDiscreteLaplace and ZCdpDiscreteGaussian items now.

The comparison between substitution versus deletion differs based on the query -- in this PR, you get the same numeric global sensitivity value either way for Prio3SumVec, whereas you get 2 or 1 for Prio3Histogram. I think it will always be the case that you get sensitivities that are greater or equal with substitution, since a substitution is equivalent to a deletion and an insertion. However, the different assumptions complicate things, because an epsilon value used in a substitution-DP mechanism measures against a different set of counterfactuals than an epsilon value used in a deletion-DP mechanism.

let two = BigUint::from(2u64);
let bits = BigUint::from(self.bits);
let length = BigUint::from(self.len);
let sensitivity = (Pow::pow(two, &bits) - BigUint::one()) * length;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going to generate too much noise? The sensitivity for each dimension is $2^b-1$, which requires IID noise for each dimension proportional to $2^b-1$, not $\ell\cdot2^b-1$. Unless I'm misreading your IID function, that is.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, do you expect to support self.bits > 127? Such a vector would be very expensive, but it would still be possible to do let sensitivity = BigUint::from((1u128 << self.bits) - 1) or (clever) let sensitivity = BigUint::from(u128::max >> 128.checked_sub(self.bits).ok_or(Error...)?).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would add noise computed based on a sensitivity of $\ell \cdot (2^b-1)$ to each dimension. I think this is correct as far as I can see, and I think this is the main reason why other VDAFs, when composed with DP, benefit from tighter constraints on measurements. (i.e. Prio3Histogram here, Prio3FixedPointBoundedL2VecSum previously, or PINE)

If we split up one $\epsilon$-DP Prio3SumVec query into $\ell$ many Prio3Sum queries, we would have a pure DP budget of $\epsilon/\ell$ for each. The sensitivity of Prio3Sum is $2^b-1$. Thus, we'd add noise drawn from $Lap_\mathbb{Z}((2^b-1)/(\epsilon/\ell))=Lap_\mathbb{Z}(\ell\cdot(2^b-1)/\epsilon)$, which is the same as we'd get from this code as-is.

Another way I thought about this: if the adversary happens to know a priori that each dimension of some client's measurement is perfectly correlated or anti-correlated to the rest, then we'd need that factor of $\ell$ to avoid effectively giving them free queries beyond our budget.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to support self.bits > 127, because the output share would be able to wrap around Field128, and we haven't defined any bigger FFT-friendly fields as of yet, so that works.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just acknowledging that factor of $\ell$. I've spent so long thinking about histograms, I'd forgotten how to think about vectors. That's going to be a lot of noise, but ¯\_(ツ)_/¯

Comment on lines 136 to 142
// The l1-sensitivity of the aggregation function is two, assuming the substitution-DP
// model. Substituting a measurement may, at worst, cause one cell of the query result
// to be incremented by one, and another to be decremented by one.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm not following why $\Delta_1 = 2$. DAP doesn't operate on substitution.

src/dp.rs Show resolved Hide resolved
// Compute the l1-sensitivity of the aggregation function (assuming the substitution-DP
// model). The worst case is when one individual's measurement changes such that each vector

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with that reasoning. It will depend on the setting, but you have to assume that each contribution is unknown to the collector. A histogram contribution could be a 0 just as much as it could be a 1. That's the contribution that we're hiding, not that a given person is participating or not. Substitution assumes that each report contributes something, which is not something we rely on.

My suggestion would be to externalize whether you want to use substitution as the basis of your sensitivity analysis. That is, if someone believes that substitution is necessary in their setting, they can set epsilon accordingly (i.e., halve it). Otherwise, settings where it is directly detrimental (i.e., the case we currently care about) will have to adjust in the other direction.

let length = BigUint::from(self.len);
let sensitivity = BigUint::from(
1u128
.checked_shl(self.bits as u32)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so you know, because I've been stung by this one too....

checked_shl does not protect against the shifted value overflowing, it only checks that the shift operand (self.bits here) is in the right range. That means that, for this example, it will fail only if self.bits >= 128, which is fine when the value is 1, but has a surprising effect if the value is larger.
u128::MAX.checked_shr(u128::BITS - self.bits.try_into().ok_or(FlpError::...)?).ok_or(FlpError::...)) would be safer. Or you could just use >>, because checked_shr is a trap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we take this suggestion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're OK as is. Since we are shifting a constant one, there's no risk of the result getting silently truncated.

let two = BigUint::from(2u64);
let bits = BigUint::from(self.bits);
let length = BigUint::from(self.len);
let sensitivity = (Pow::pow(two, &bits) - BigUint::one()) * length;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just acknowledging that factor of $\ell$. I've spent so long thinking about histograms, I'd forgotten how to think about vectors. That's going to be a lot of noise, but ¯\_(ツ)_/¯

@divergentdave
Copy link
Collaborator Author

I derived some additional traits on DiscreteLaplaceDpStrategy (matching the existing concentrated DP discrete Gaussian strategy) and added input validation to PureDpBudget based on experience using this over in divviup/janus#3302. I added mention of the sensitivity definition, including the concept of neighboring datasets, to the documentation of both DP strategies. I also rebased to fix an easy merge conflict.

Copy link
Contributor

@tgeoghegan tgeoghegan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing on the basis of code quality and maintainability rather than on correctness of the DP stuff, which is well covered by other reviewers (thanks!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the LyX system, but I think we should be cautious about adding new file formats for documentation. I don't suppose it can render to Markdown? We already have several Markdown documents across our projects, and the format is well-supported on GitHub and docs.rs, which are the principal ways we expect people to consume libprio-rs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some instructions to do so using pandoc here: https://wiki.lyx.org/Tips/ConvertMarkdown. I'm going to try that out and see how GitHub handles the results.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm running into a number of issues with this.

  • MathJax doesn't support \nicefrac
  • GitHub Markdown has a different experimental syntax for "alerts", like > [!NOTE], etc. Pandoc is using the same triple colon syntax for definitions and proofs here that Docusaurus uses for inset notes, etc.
  • There seems to be a precedence issue in GitHub's Markdown parser, because underscores inside different $ equations are getting grouped together, interpreted as italicizing the text they contain, and thus breaking equations. I've seen suggestions to escape the underscore with a backslash, but those come along with complaints that other tools don't handle this the same way, and thus render literal underscores inside equations.
  • Decorations around the big sigmas and pis are not showing up right, this must be another MathJax incompatibility.
  • The bibliographic reference did not get converted or rendered right, it may need to be turned into a footnote.

Ultimately, I think we'll need to pick one ecosystem, either pandoc/LaTeX or GitHub-flavored Markdown, and target it, as there are some fundamental incompatibilities in Markdown handling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I don't want to be too difficult here, because you put a lot of effort into good docs, and LaTeX is more natural for math at this level. Two further ideas:

  • Can we get this to where it requires a LaTeX installaton but not LyX? Can a .lyx file be handled by LaTeX? If not, can we use a format that is LaTeX friendly?
  • Could we check in a PDF of the document so that there's something in the repo people can read without needing additional tools?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I worked through most of the way through these issues, but two remain. Some absolute value bars disappear on my browser at 100% zoom level, but reappear at 110%. In one equation, we have a sum in an exponent, and that is falling apart. It looks fine when I try to render it using the example JSBin link on mathjax.org, but here both bounds just show up below the sigma.

I can check in either a .tex file or a .pdf file. The PDF is only ~180kB, so I think that should be okay to check in, especially since it'll be the most straightforward to read.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well written LaTeX is very readable :)

@@ -95,13 +99,57 @@ impl ZCdpBudget {
/// for a `rho`-ZCDP budget.
///
/// [CKS20]: https://arxiv.org/pdf/2004.00010.pdf
// TODO(#1095): This should be fallible, and it should return an error if epsilon is zero.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it panic on epsilon == 0 instead? Or do we want crate prio to defend itself gracefully from misconfiguration at a higher level (i.e., Janus and its control plane)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It currently panics when adding noise, due to a division by zero. Ideally prio as a library crate shouldn't panic at all, and just return an error upon misconfiguration.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, input validation should err. I think panics are more appropriate in situations where a situation shouldn't occur assuming input validation was performed.

let length = BigUint::from(self.len);
let sensitivity = BigUint::from(
1u128
.checked_shl(self.bits as u32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we take this suggestion?

fn add_noise_to_result(
&self,
dp_strategy: &PureDpDiscreteLaplace,
agg_result: &mut [Self::Field],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a TODO for the issue filed for the rename?

/// A DP strategy using the discrete Laplace distribution, providing pure DP.
///
/// This uses l1-sensitivity, with the substitution definition of neighboring datasets.
pub type PureDpDiscreteLaplace = DiscreteLaplaceDpStrategy<PureDpBudget>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When naming this stuff, we should make sure to leave room for new DP schemes that may arrive soon, based on the drafts mt and the Bens are working on.

Is this the only DP strategy we'd ever implement that is pure and samples the discrete Laplace distribution? IIUC (but I could easily be wrong), this strategy is where each aggregator independently adds sufficient noise to protect against either aggregator defecting, but the incoming schemes use MPC to meet privacy goals with less noise. So should this be PureDpDiscreteLaplaceButWithMaybeTwiceAsMuchNoiseAsYouNeedSorryAboutThat? (deliberately terrible name to force a better choice)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the binomial noise in MPC mechanism would be different enough that it wouldn't need to interact with these traits, so I'm not too worried about namespace overlap. Plus, this strategy ultimately gets passed into an AggregatorWithNoise implementation, which makes it pretty explicit how noise is added.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A binomial noise sampler w/o MPC seems conceivable, but I don't think we need to plan for this.

Copy link
Collaborator

@cjpatton cjpatton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took at a look at the DP argument first: some editorial feedback, plus a couple of steps I didn't understand. I'll go on to the rest of the code now.

documentation/Pure DP Mechanism.lyx Show resolved Hide resolved
documentation/Pure DP Mechanism.lyx Outdated Show resolved Hide resolved
.
\end_layout

\begin_layout Definition
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's good practice to provide a reference to the definition. In latex this is something like

\begin{definition}[{\cite[Definition~XXX]{XXX}}]
   XXX
\end{definition}

documentation/Pure DP Mechanism.lyx Outdated Show resolved Hide resolved
documentation/Pure DP Mechanism.lyx Show resolved Hide resolved
documentation/Pure DP Mechanism.lyx Outdated Show resolved Hide resolved
documentation/Pure DP Mechanism.lyx Show resolved Hide resolved
documentation/Pure DP Mechanism.lyx Outdated Show resolved Hide resolved
documentation/Pure DP Mechanism.lyx Show resolved Hide resolved
documentation/Pure DP Mechanism.lyx Show resolved Hide resolved
Copy link
Collaborator

@cjpatton cjpatton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good! I'll await your response to the comments on the writeup before approving.

@@ -95,13 +99,57 @@ impl ZCdpBudget {
/// for a `rho`-ZCDP budget.
///
/// [CKS20]: https://arxiv.org/pdf/2004.00010.pdf
// TODO(#1095): This should be fallible, and it should return an error if epsilon is zero.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, input validation should err. I think panics are more appropriate in situations where a situation shouldn't occur assuming input validation was performed.

src/dp.rs Show resolved Hide resolved
src/dp/distributions.rs Outdated Show resolved Hide resolved
///
/// [CKS20]: https://arxiv.org/pdf/2004.00010.pdf
pub struct DiscreteLaplace {
/// The scale parameter of the distribution.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this $t$ in your writeup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though I've seen t, b, and sigma used for it. I think I just want to refer to it as the "scale" in documentation for clarity.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, scale is a good term to use here. I just wanted to clarify. Perhaps mention in the write up that $t = \texttt{scale}$.

src/dp/distributions.rs Show resolved Hide resolved
src/flp/types.rs Show resolved Hide resolved
src/flp/types/dp.rs Show resolved Hide resolved
src/flp/types/dp.rs Outdated Show resolved Hide resolved
// Compute the l1-sensitivity of the aggregation function (assuming the substitution-DP
// model). The worst case is when one individual's measurement changes such that each vector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does @martinthomson's suggestion imply specific client behavior, such as randomized response? For example, a client might submit a 1-hot vector, or a vector of all 0s, depending on the outcome of a coinflip?

Exposing in the API what notion of "neighboring databases" we consider seems fraught to me. It's perhaps safer if we don't ask the user to go through this analysis themselves. Of course, this is just applying a more general principle of good API design: with DP we may be in a situation where we really do need to expose a choice (and have an expert make that choice).

Question: Is there a way to be "conservative" here? In particular, is it the case that substitution-DP implies deletion-DP with the same bound, but perhaps at the cost of some amount of utility?

I understand there is time pressure to merge this, so if we merge as-is, I would suggest at least making sure it's documented that we're assuming the application requires substitution-DP.

src/flp/types/dp.rs Outdated Show resolved Hide resolved
@divergentdave divergentdave merged commit c5ffd83 into main Jul 19, 2024
6 checks passed
@divergentdave divergentdave deleted the david/dp-sumvec branch July 19, 2024 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Differential Privacy for Prio3SumVec
6 participants