This repository provides an implementation of the FastJoint algorithm for differentially private top-k selection, as detailed in the paper:
Faster Differentially Private Top-k Selection: A Joint Exponential Mechanism with Pruning
The implementation can be found in `FastTopk.py`, in the following function:

`def fast_joint_sampling_dp_top_k(item_counts, k, epsilon, neighbor_type, failure_probability)`
Arguments:
- `item_counts`: A 1D numpy array representing the histogram (non-negative integer counts or scores for the items).
- `k`: The number of items to select.
- `epsilon`: The privacy parameter.
- `neighbor_type`: The type of neighboring dataset. Currently supports `DP_Parameters.NeighborType.ADD_REMOVE`, as defined in `DP_Parameters.py`.
- `failure_probability`: The probability that the algorithm returns a sequence whose error exceeds the truncation threshold. The default value used in the experiments is $2^{-10}$.
Returns:
An array containing the indices of the top $k$ items selected by the FastJoint algorithm.
Example usage:

```python
import numpy as np

import DP_Parameters
from FastTopk import fast_joint_sampling_dp_top_k

# Create a histogram with counts 20, 40, ..., 2000 (1 to 100, each multiplied by 20).
hist = np.arange(1, 101) * 20
k = 10
epsilon = 1
failure_probability = 2 ** (-10)

# Get the top k items using the FastJoint algorithm.
top_k_items = fast_joint_sampling_dp_top_k(item_counts=hist, k=k, epsilon=epsilon,
                                           neighbor_type=DP_Parameters.NeighborType.ADD_REMOVE,
                                           failure_probability=failure_probability)
print(top_k_items)
```
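Since the mechanism is randomized, repeated runs may return different index sets. For a histogram as well separated as this one, the output will typically match the exact (non-private) top-$k$, which you can compute for comparison (a short check, not part of the repository):

```python
import numpy as np

hist = np.arange(1, 101) * 20
k = 10

# Exact, non-private top-k indices, for comparison with the private output.
exact_top_k = np.argsort(hist)[::-1][:k]
print(exact_top_k)  # [99 98 97 96 95 94 93 92 91 90]
```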
This section provides a brief overview of the problem and the algorithms used. For detailed information, please refer to the paper.
Given a histogram $h \in \mathbb{Z}_{\geq 0}^{d}$ of item counts and a parameter $k$, the goal is to release the indices of the $k$ items with (approximately) the largest counts, subject to $\epsilon$-differential privacy.

Differential privacy relies on defining neighboring datasets. Two histograms $h$ and $h'$ are neighbors under the add/remove notion if one can be obtained from the other by adding or removing a single user's data.

The FastJoint algorithm samples a sequence $s = (s_1, \ldots, s_k)$ of $k$ distinct items with probability proportional to $\exp\left(-\tfrac{\epsilon}{2} \cdot \mathrm{err}(h, s)\right)$, where $\mathrm{err}(h, s) = \max_{i \in [k]} \left(h_{(i)} - h_{s_i}\right)$ and $h_{(i)}$ denotes the $i$-th largest count in $h$.
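For intuition, here is a minimal sketch of the error function $\mathrm{err}(h, s)$ described above (the function name and code are illustrative only, not part of the repository):

```python
import numpy as np

def sequence_error(item_counts, sequence):
    """Illustrative helper: the largest gap between the i-th biggest
    count and the count of the i-th selected item."""
    sorted_counts = np.sort(item_counts)[::-1]  # h_(1) >= h_(2) >= ...
    return max(sorted_counts[i] - item_counts[s] for i, s in enumerate(sequence))

hist = np.arange(1, 101) * 20
print(sequence_error(hist, [99, 98, 97]))  # exact top-3 -> error 0
print(sequence_error(hist, [99, 98, 0]))   # a poor third pick -> error 1940
```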
This section provides instructions for reproducing the empirical results presented in the paper. It compares the FastJoint algorithm with the Joint, CDP Peel, and PNF Peel algorithms. The implementations of the latter three algorithms can be found in the public dp_topk repository (as of May 2024). Follow the instructions below to set up the environment and download the necessary datasets.
To integrate our modifications with the original `dp_topk` repository:
- Clone the dp_topk repository.
- Copy the files `baseline_mechanisms.py`, `differential_privacy.py`, and `joint.py` from this repository.
- Place them into the `dp_topk` folder.
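These copy steps can also be scripted with Python's standard library (a minimal sketch; the source and destination paths below are assumptions about your local checkout layout and may need adjusting):

```python
import shutil
from pathlib import Path

# Assumed layout: this repository and the cloned dp_topk repository
# live side by side in the current working directory.
src = Path("FastTopk")  # hypothetical path to this repository's checkout
dst = Path("dp_topk")   # the cloned dp_topk folder

for name in ("baseline_mechanisms.py", "differential_privacy.py", "joint.py"):
    shutil.copy(src / name, dst / name)  # overwrite the original files
```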
To replicate our results, first download the datasets from the following sources:
- Goodreads Books (Goodreads-books)
- Steam Video Games
- Tweets Dataset
- Online News Popularity (UCI Online News Popularity)
- MovieLens 25M Dataset
- Amazon Product Data, 2014 (Amazon Grocery and Gourmet Food)
Save the datasets as:
- `books.csv` for the Goodreads dataset
- `games.csv` for the Steam Video Games dataset
- `tweets.csv` for the Tweets dataset
- `news.csv` for the Online News Popularity dataset
- `movies.csv` for the MovieLens dataset
- `foods.csv` for the Amazon dataset
Place each file into a folder named `datasets`.
Note: For the MovieLens 25M Dataset, the download will provide a zip file named `ml-25m.zip`. Extract its contents and locate the file `ratings.csv`. Rename this file to `movies.csv` and move it to the `datasets` folder.
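The extraction and renaming can also be done programmatically (a sketch assuming `ml-25m.zip` is in the current directory; the `ml-25m/ratings.csv` member name reflects the standard layout of this archive):

```python
import zipfile
from pathlib import Path

# Make sure the datasets folder exists.
Path("datasets").mkdir(exist_ok=True)

# The MovieLens 25M archive stores its files under a top-level ml-25m/ folder.
with zipfile.ZipFile("ml-25m.zip") as zf:
    zf.extract("ml-25m/ratings.csv")

# Rename ratings.csv to movies.csv, as expected by the experiments.
Path("ml-25m/ratings.csv").rename("datasets/movies.csv")
```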
To run the experiments, execute the following command:

```bash
python3 RunExp.py [num]
```

where `[num]` is an integer corresponding to a specific dataset:
| Value | Dataset |
|-------|---------|
| 0     | books   |
| 1     | games   |
| 2     | news    |
| 3     | movies  |
| 4     | tweets  |
| 5     | food    |
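If you want results for every dataset in one go, a small driver loop works (a sketch, not part of the repository; it assumes `RunExp.py` is in the current directory and simply invokes the command above once per dataset index):

```python
import subprocess

# Dataset indices 0 through 5, as listed in the table above.
for num in range(6):
    subprocess.run(["python3", "RunExp.py", str(num)], check=True)
```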
The current code repeats each experiment 200 times. To obtain preliminary results faster, you can reduce the number of trials by changing the initialization `num_trials = 200` in `RunExp.py` to a smaller value.

For further details, refer to the file `RunExp.py` in the root directory.