This Java library offers the core functionalities needed to implement an end-to-end Private Set Intersection protocol (PSI). Private Set Intersection is a secure multiparty computation that allows two parties to compute the intersection of their encrypted datasets without revealing anything outside the intersection (or as close as possible to anything).
In the literature, we can find multiple alternative PSI algorithms based on different cryptographic primitives. This library provides an implementation of the following algorithms:
- Blind Signature PSI (BS).
- Diffie-Hellman PSI (DH).
- Blind Signature PSI based on Elliptic Curves (ECBS).
- Diffie-Hellman PSI based on Elliptic Curves (ECDH).
The description of these algorithms, as well as their detailed comparison, can be found in this file (currently only available in italian).
An example showing a client-server PSI implementation based on this library can be found on the following repositories: server , client.
The elliptic curve implementation is partially based on the Bouncy Castle library.
The build process is based on Maven. To generate the jar file (which will be located in the target folder) run the following command:
mvn clean install
The generated jar is automatically installed on the local maven repository.
A detailed description of all the classes and methods contained in this repository can be obtained by generating the JavaDoc of this library with the following command:
mvn javadoc:javadoc
By default, the JavaDoc generated through this command will be located in the folder
target/site/apidocs
.
Based on the specific PSI algorithm, the cryptographic operations performed by the parties might differ. To cope with this distinction, this library is based on a client-server abstraction that considers as client the party that starts new PSI calculations (also referred as sessions) and that is interested in getting the result of the intersection, while the server is a passive party that answers the requests made by the client and has no interest in acquiring the result of the PSI. As a consequence of this approach, we note that the problem of computing the result of the PSI in a single session for both parties is outside the scope of this library and that it might not be feasible for all the supported PSI algorithms.
Following the client-server paradigm, the core PSI functionalities are provided by two distinct classes:
PsiServer
and PsiClient
. Remarkably, the interface of these classes is independent of the specific
PSI algorithm, and can be generated from the respective factory classes: PsiServerFactory
and
PsiClientFactory
. The configuration of the PSI calculation
(such as the selection of the algorithm, the keys used for encryption, the cache configuration or the number of threads
used by the computation) is defined at the moment of the creation of the PsiServer
and
PsiClient
objects by calling the proper factory methods. After the creation of these objects, the subsequent
steps of the PSI computation are the same for all the algorithms or configurations.
The PsiServer
class is designed to operate on portions of the datasets, and does not require loading the
entire datasets in-memory. Conversely, the PsiClient
requires loading different representations of the
datasets in-memory. There are two main reasons that motivate this alternative approach. First, in traditional
client-server scenarios, we can expect that the server might run multiple PSI sessions in parallel with different
clients, each with their peculiar datasets, which makes loading all of them in-memory unfeasible. Second, by not being
interested in acquiring the result of the PSI, the server does not need to store any representation of the client
dataset (either in-memory or on disk). An alternative approach where the PsiClient
relies on secondary
storage to load different representations of the datasets is feasible but would result in a more complex interface and
degraded performance. The implementation of this approach is not provided by this repository.
In its external interface, either in input or output, this library considers the items of the datasets as sets of type String. Internally, the String objects are converted to BigInteger objects after being read to speed up the execution of encryption operations, and are converted back to String before being returned to the user. This approach simplifies the interface of the library as strings can be easily serialized and deserialized (either for network communications or database interactions).
The package structure of this repository is mostly motivated by visibility concerns. A notable exception is the
model
package which contains the objects that define the data-layer of the PSI protocol, which, based on
the specific implementations, could either be used in communications or stored in stable storage.
This library provides an abstraction of the steps that constitute a PSI calculation. This allows its users to compute
the PSI, for any of the supported algorithm or configuration, by calling the same sequence of methods on the
PsiServer
and PsiClient
objects.
From a high-level abstraction, the client computes the PSI by comparing the following representations of the two datasets:
- a server-side encryption of the client-side encryption of the client dataset;
- a client-side encryption of the server-side encryption of the server dataset.
To reach this state, the client dataset should undergo the following transformations:
- Load the client dataset on the
PsiClient
object by calling theloadAndEncryptClientDataset
method which converts the dataset to a map by assigning to each item an identifier. If required by the algorithm, it also applies a hash function to the dataset. - Encrypt the resulting dataset map by calling the
encryptClientDataset
method of thePsiClient
object. The actual encryption operations performed by this method depend on the specific PSI algorithm. - The result of the client-side encryption is passed to the
encryptDatasetMap
method of thePsiServer
object, which applies algorithm-specific encryption operations. - The result is loaded by the
PsiClient
object by calling theloadDoubleEncryptedClientDataset
method.
Conversely, the transformations that should be applied to the server dataset are the following:
- The server dataset is encrypted by the
PsiServer
object by calling theencryptDataset
method. The encryption operations applied by this method depend on the specific PSI algorithm. - The resulting set is processed and loaded by the
PsiClient
object by calling theloadAndProcessServerDataset
method. Based on the PSI algorithm, the processing portion of this method might apply algorithm-specific encryption operations.
Once these representations of the datasets are loaded on the PsiClient
object, the result of the PSI can be
computed by calling the computePsi
method.
In this section we provide a basic example showing how the library could be used to run a PSI calculation in a
simplified scenario where a single system acts both as the client and the server. In actual client-server environments,
any data transfer between the PsiServer
and the PsiClient
objects should be implemented
through proper communication primitives, such as REST APIs.
// Could be any supported pair of PsiAlgorithm and key size
PsiAlgorithmParameter psiAlgorithmParameter = new PsiAlgorithmParameter(PsiAlgorithm.BS, 2048)
// PsiServerSession objects containing the session metadata required to init PsiServer objects
PsiServerSession psiServerSession = PsiServerFactory.initSession(psiAlgorithmParameter);
PsiServer psiServer = PsiServerFactory.loadSession(psiServerSession);
// PsiClientSession objects are generated from PsiServerSession objects by
// removing any references to private key fields.
PsiClientSession psiClientSession = PsiClientSession.getFromServerSession(psiServerSession);
PsiClient psiClient = PsiClientFactory.loadSession(psiClientSession);
// Client loads the double encrypted client dataset map
Map<Long, String> clientEncryptedDatasetMap = psiClient.loadAndEncryptClientDataset(clientDataset);
Map<Long, String> doubleEncryptedClientDatasetMap = psiServer.encryptDatasetMap(clientEncryptedDatasetMap);
psiClient.loadDoubleEncryptedClientDataset(doubleEncryptedClientDatasetMap);
// Client loads and processes the encrypted server dataset
Set<String> serverEncryptedDataset = psiServer.encryptDataset(serverDataset);
psiClient.loadAndProcessServerDataset(serverEncryptedDataset);
// Compute PSI
Set<String> psiResult = psiClient.computePsi();
We note that in this example, the entire datasets are always passed as inputs of the methods. However, the
methods loadAndEncryptClientDataset
,
loadDoubleEncryptedClientDataset
and loadAndProcessServerDataset
of the PsiClient
class, as well as the methods encryptDataset
and encryptDatasetMap
of the PsiServer
class, which in this example are called once each,
could be called multiple times (even concurrently) to encrypt or load different portions of the datasets. Moreover, the
operations on the client dataset and the server datasets could be performed in different order or even concurrently. The
only requirement for computing a correct result of the PSI is that the final representations of the two datasets should
be completely loaded by the PsiClient
object before calling the computePsi
method.
In this repository, we refer to keys as the instances of the PsiKeyDescription
interface which contain the security parameters used to set up the encryption functions of the PsiServer
and PsiClient
objects. The PsiKeyDescription
interface is implemented by the
PsiServerKeyDescription
and PsiClientKeyDescription
classes, which contain the all the
security parameters used by the supported PSI algorithms. Independently of their source type, all the fields of the keys
are converted to the String type to simplify their serialization and deserialization.
In the previous example, the keys are generated automatically by the library when creating the PSI objects. In
particular, the PsiServer
keys are generated based on the input PsiAlgorithmParameter
object passed as parameter to the initSession
method of the PsiServerFactory
class.
Conversely, based on the specific PSI algorithm, the PsiClient
might generate its own private key (DH or
ECDH) matching the parameters set by the PsiServer
, or directly use the public key of the
PsiServer
(BS or ECBS).
In addition to generating the keys automatically, the methods of the PsiServerFactory
and the
PsiClientFactory
classes also allow to provide an external PsiKeyDescription
when generating
the PsiServer
and PsiClient
objects. This allows the users of the library to generate its keys
externally, or save previously used keys for subsequent executions, which, as discussed in the following section, is
essential to achieve performance speed-up through caching.
Users of this library can create instances of PsiServerKeyDescription
and
PsiClientKeyDescription
objects through their respective factory classes
(PsiServerKeyDescriptionFactory
and PsiClientKeyDescriptionFactory
). These classes offer
specific methods for each PSI algorithm, allowing the users to only provide the parameters that are relevant for the
selected algorithm. Moreover, the users can either pass the input parameters as String objects (which is the format
exported directly by the library) or as standard key specifications objects. In particular, the keys associated to the
BS algorithm can be generated from RSAPrivateKeySpec
and RSAPublicKeySpec
objects (Java
Security), DH keys can be generated from DHPrivateKeySpec
objects (Java Security), and keys associated to
the ECBS and ECDH algorithms can be generated from
ECPrivateKey
and ECPublicKey
objects (Bouncy Castle).
The computational cost of encryption operations, in particular when using relatively large keys, can be significant. For PSI calculations that use the same keys, saving the result of encryption operations, and then reading these values in subsequent executions (instead than performing once again the encryption operations) can result in significant performance improvements. However, the actual implementation of the caching layer, which could be based on external systems, in-memory data structures or databases, is highly dependent on the specific implementation, and thus, is not provided directly by this library.
To support caching while also offering flexibility and ease-of-use, this library exposes a
PsiCacheProvider
interface, which requires its users to implement a simple key-store interface constituted
by a put
and a get
method. Internally, the library relies on BASE64 and JSON conversions to
convert complex objects to key-value entries, also hiding the heterogeneity of the different PSI algorithms.
We note that the only peculiarity of this interface when compared to the traditional key-store paradigm is that the put operation should not overwrite a value associated to a key already present. This is required to avoid potential errors in scenarios where multiple PSI calculations with the same key are initialized concurrently.
Considering that the goal of the cache is to reduce the execution time, reading and writing to the cache should be
computationally cheap compared to encryption operations, whose cost might change based on the specific algorithm and key
size. For testing purposes, in this repository we provide a basic implementation of the
PsiCacheProvider
interface based on the ConcurrentHashMap
class.
As anticipated, the caching layer can provide performance improvements only when the same keys are used for subsequent
PSI calculations. For instances of the PsiServer
class, this can only happen if an external key is provided
to its factory methods (which might also be keys previously generated by the library). Conversely, for instances of
the PsiClient
class, the requirements might change based on the specific PSI algorithm. For the BS and ECBS
algorithms, the PsiClient
methods can get cache hits even without passing an external key to its factory
methods as long as the public key provided by the PsiServer
object is the same as previous executions.
Conversely, for the DH and ECDH algorithms, the methods of the PsiClient
class can get cache hits only if
its instance is created by passing to its factory method an external
PsiClientKeyDescription
object which was already used in previous executions.
To enable caching during the execution, an object implementing the PsiCacheProvider
interface should be
passed to the factory methods that create PsiServer
and PsiClient
instances.
Running encryption operations in parallel on multiple threads can be particularly effective as each portion of data can be processed independently by each thread with little or no synchronization. To help users exploit the performance benefits of parallelism, this library supports multi-threading natively. Indeed, all methods that operate on the datasets split the data in different portions, and each portion is processed by different threads.
By default, all methods that operate on the datasets are executed concurrently on 4 threads. To change this
configuration, as well as to configure the timeout of the threads, the user can call the method
setConfiguration
on the PsiServer
or PsiClient
instances, passing an instance of a PsiThreadConfiguration
with the preferred thread configuration. We note
that the user could set a different number of threads for each method by setting its distinctive
PsiThreadConfiguration
object prior to its execution. Despite this internal support, all the classes and methods
of this library are thread-safe, which allows the users of this library to also implement external ad-hoc thread management
techniques.