Skip to content

idiap/identifying-privacy-personas

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Identifying Privacy Personas

  • This is the code accompanying the following paper O. Hrynenko, A. Cavallaro, "Identifying Privacy Personas" paper, accepted at Proceeding on Privacy Enhancement Technologies, 2025.
  • This code computes the dissimilarity matrix and constructs a dendrogram (without pruning) as part of a processing pipeline described in the paper.
  • We provide randomly generated dummy data for demonstration purposes feature_vector_generation_set_p_dummy.csv and feature_vector_generation_set_p_dummy_prime.csv.
  • The paper includes both qualitative and quantitative analyses. The code builds on the previously conducted qualitative analysis (coding, trait formation, annotation). The output of this code can be used for the subsequent quantitative analysis, namely Boschloo's test.

Install and Setup

Install R:

sudo apt install r-base-core

Clone the project:

git clone git@github.com:idiap/identifying-privacy-personas.git
cd identifying-privacy-personas

Open the constants.py file to provide the following essential information:

  • path_to_data – the name of your data folder.
  • feature_vector_generation_set_p – the name of the participants' feature vectors, $p_i$,
  • feature_vector_generation_set_p_prime – the name of the participants' feature vectors, $p_i’$,
  • max_likert_distance – the maximum possible distance between the participants in the Likert space,
  • number_of_likert_variables – the number of the Likert explanatory variables,
  • num_of_participants_generation_set – the number of the participants in the generation set.

Install the project:

pip install .

Usage

Command-Line

Run ./scripts/run.sh

Detailed steps in Python and R

Computing the dissimilarity matrix (see Section 5.1 from the paper for details)

The input to this step is a feature_vector_generation_set_p_prime file that contains $p_i'$ representation of participant $i$, a feature vector of Likert and binary explanatory variables.

compute_dissimilarity_matrix(path_to_data = path_to_data, 
                              input_file_name = feature_vector_generation_set_p_prime,
                              outfile_name = dissimilarity_matrix_generation_set
                              )

This function is called in the run.sh file:

python ipp/steps/step_1.py

Dendrogram construction (see Section 5.2 from the paper for details)

For dendrogram construction, use the corresponding R script step_2.R. The default name of the output folder is "Converted_R_output_generation_set". The output folder contains $n$ (where $n$ is a number of participants) .csv files with participants' IDs and their corresponding cluster labels. The naming convention for the files in the output folder is cluster_labels_level_i, where $i$ is the level of the dendrogram. The dendrogram is built by running the following function:

cluster_in_r()

The path to the output folder and to the input file and the function call is completed in the run.sh file:

Rscript "ipp/steps/step_2.R" $path_to_data/$dissimilarity_matrix_generation_set $path_to_data/$path_to_r_results

Unparsing dendrogram construction for Python

For consequent analysis we recommend using Python, hence we unparse the output from R into Python. The function below saves each of the clusters’ information into a separate file (for each cluster, for each level of the dendrogram). The output of this call is a set of files "u.v.csv", where $u$ is the dendrogram level, and $v$ is a cluster ID.

unparsing_for_python(path_to_data = path_to_data, 
                    file_name_binary_descriptor = feature_vector_generation_set_p, 
                    path_to_r_results = path_to_r_results,
                    path_to_parsed_results = path_to_parsed_results
                    )

Saving clusters’ descriptors, cluster sizes, and cluster splits

In the paper, we represent a descriptor as the frequency of appearance of the traits in a cluster. For further use of the pipeline described in the paper, namely for Boschloo's test, we recommend storing the count of how many times a trait appeared and the number of people in a cluster separately.

save_descriptors_to_table(path_to_data = path_to_data, 
                          path_to_parsed_results = path_to_parsed_results, 
                          number_of_participants = num_of_participants_generation_set
                          )

save_number_of_ppl_to_dictionary(path_to_data = path_to_data, 
                                  path_to_parsed_results = path_to_parsed_results, 
                                  number_of_participants = num_of_participants_generation_set
                                  )

Additionally, we save how a parent cluster u.v is split into clusters u+1.j and u+1.k into a table.

save_cluster_splits(path_to_data = path_to_data, 
                    path_to_r_results = path_to_r_results, 
                    number_of_participants = num_of_participants_generation_set, 
                    outfile_name = dendrogram_cluster_splits_generation_set
                    )

These functions are called in the run.sh file:

python ipp/steps/step_3.py

About

Identifying privacy personas

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published