PaddleDTX is a solution focused on distributed machine learning technology based on decentralized storage. It addresses the difficulties that arise when massive amounts of private data need to be securely stored and exchanged, and it helps different parties break through isolated data islands to maximize the value of their data.
The computing layer of PaddleDTX is a network composed of three kinds of nodes: Requester, Executor and DataOwner. The training samples and prediction datasets are stored in a decentralized storage network composed of DataOwner and Storage nodes. Both this decentralized storage network and the computing layer are supported by an underlying blockchain network.
The Requester is a party with a prediction demand, and the Executor is a party authorized by the DataOwner to access the sample data for model training and result prediction. Multiple Executor nodes form an SMPC (secure multi-party computation) network. The Requester nodes publish tasks to the blockchain network, and the Executor nodes execute a task after authorization. The Executor nodes obtain sample data through the DataOwner, which vouches for the trustworthiness of the data.
The SMPC network is a framework that supports multiple distributed learning processes running in parallel. More vertical federated learning and horizontal federated learning algorithms will be supported in the future.
A DataOwner node processes its private data using encryption, segmentation and replication algorithms, and then distributes the encrypted fragments to multiple Storage nodes. A Storage node proves that it honestly holds the data fragments by answering challenges generated by the DataOwner. Through these mechanisms, storage resources can be maintained safely without violating data privacy. Please refer to XuperDB for more about the design principles and implementation.
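The segmentation, replication and challenge steps described above can be illustrated with a minimal Python sketch. It is not XuperDB's implementation: the function names (`split`, `place`, `prepare_challenge`, `answer`), the round-robin placement and the nonce-plus-hash challenge are all assumptions chosen for illustration, and the input is assumed to be already encrypted.

```python
import hashlib
import secrets


def split(data: bytes, slice_size: int):
    """Cut an (already encrypted) byte string into fixed-size slices."""
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]


def place(num_slices: int, nodes, replicas: int):
    """Assign each slice index to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("not enough storage nodes for the requested replicas")
    return {
        idx: [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
        for idx in range(num_slices)
    }


def prepare_challenge(slice_bytes: bytes):
    """Owner side: create a nonce and the expected answer before upload."""
    nonce = secrets.token_bytes(16)
    expected = hashlib.sha256(nonce + slice_bytes).hexdigest()
    return nonce, expected


def answer(nonce: bytes, stored_bytes: bytes) -> str:
    """Storage side: prove possession by hashing the nonce with the slice."""
    return hashlib.sha256(nonce + stored_bytes).hexdigest()
```

In this sketch the owner keeps only the nonce and the expected digest for each slice. A node that has dropped or altered a slice cannot reproduce the digest, so comparing `answer(nonce, stored)` with `expected` detects the loss.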
Training tasks and prediction tasks are broadcast to the Executor nodes through the blockchain network, and the Executor nodes involved then execute these tasks. The DataOwner node and the Storage node exchange information through the blockchain network when monitoring file and node health status, and also during the challenge-answer-verify process that proves replicas are being held.
Currently, XuperChain is the only blockchain framework that PaddleDTX supports.
The open source version of PaddleDTX supports vertical federated learning (VFL) algorithms, including two-party Linear Regression, two-party Logistic Regression and three-party DNN (Deep Neural Networks). Please refer to crypto/ml for more about the background and implementation of the two-party VFL algorithms. The DNN implementation relies on the PaddleFL framework, and all neural network models provided by PaddleFL can be used in PaddleDTX. More algorithms will be open sourced soon, including multi-party VFL and multi-party HFL (horizontal federated learning) algorithms.
Taking two-party VFL algorithms as an example, the training and prediction steps are as follows:
An FL task needs to specify the sample files that will be used in computation or prediction, and these files are stored in the decentralized storage system (XuperDB). Before executing a task, the Executor (often also a DataOwner) needs to fetch its own sample files from XuperDB.
Both VFL training and prediction tasks require a sample alignment process, which finds the intersection of the samples using all the participants' ID lists. Training and prediction are performed on the intersected samples.
The project implements PSI (Private Set Intersection) for sample alignment without leaking any participant's IDs. Refer to crypto/psi for more details about PSI.
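To make the idea of aligning samples without revealing raw IDs concrete, here is a toy Python sketch of a Diffie-Hellman-style PSI. This is a swapped-in textbook technique, not necessarily the protocol implemented in crypto/psi. The modulus, the hash-to-group mapping and the function names are assumptions for illustration, and the parameters are far too weak for real use.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Illustrative only; a real
# deployment would use a vetted group and a proper hash-to-group function.
P = 2**127 - 1


def hash_to_group(sample_id: str) -> int:
    """Map an ID to an element of [1, P-1] via SHA-256."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret() -> int:
    """Pick a secret exponent coprime to P - 1, so masking is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def mask(values, secret):
    """Raise every group element to a party's secret exponent."""
    return [pow(v, secret, P) for v in values]


def align(ids_a, ids_b):
    """Return the IDs of party A that also appear in party B's list."""
    a, b = new_secret(), new_secret()
    # Round 1: each party masks its own hashed IDs and sends them over.
    a_once = mask([hash_to_group(i) for i in ids_a], a)
    b_once = mask([hash_to_group(i) for i in ids_b], b)
    # Round 2: each party masks the other's values with its own secret.
    a_twice = mask(a_once, b)        # computed by B, returned to A in order
    b_twice = set(mask(b_once, a))   # computed by A
    # H(id)^(a*b) matches exactly when the underlying IDs match.
    return [i for i, v in zip(ids_a, a_twice) if v in b_twice]
```

For example, `align(["u1", "u2", "u3", "u4"], ["u3", "u5", "u1"])` yields the shared IDs `u1` and `u3` in party A's order. Both parties run in one process here for brevity; in a real protocol only the masked values would cross the network.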
Model training is an iterative process that relies on collaborative computation over the two parties' samples. Participants need to exchange intermediate parameters over many training epochs in order to obtain a proper local model for each party.
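The following Python sketch shows, in plaintext, why the two parties must exchange intermediate values on every epoch. It is an illustrative stand-in for two-party VFL linear regression, not the project's algorithm: the function names, learning rate and epoch count are assumptions, and the residual that is combined in the clear here would be protected by encryption in the real protocol.

```python
def partial_output(weights, rows):
    """Linear output of one party over its own feature columns."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]


def update(weights, rows, residual, lr):
    """Gradient step for one party, using only its own features."""
    n = len(rows)
    return [
        w - lr * sum(r * row[j] for r, row in zip(residual, rows)) / n
        for j, w in enumerate(weights)
    ]


def train(x_a, x_b, y, lr=0.1, epochs=2000):
    """Plaintext stand-in for two-party VFL linear regression training."""
    w_a = [0.0] * len(x_a[0])
    w_b = [0.0] * len(x_b[0])
    for _ in range(epochs):
        u_a = partial_output(w_a, x_a)   # computed by party A
        u_b = partial_output(w_b, x_b)   # computed by party B
        # Combining both partial outputs with the labels is the step that
        # requires exchanging intermediate values between the parties.
        residual = [a + b - t for a, b, t in zip(u_a, u_b, y)]
        w_a = update(w_a, x_a, residual, lr)
        w_b = update(w_b, x_b, residual, lr)
    return w_a, w_b
```

Each party ends up with weights only for its own feature columns, which matches the idea of a separate local model per participant.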
To ensure the confidentiality of each participant's data, the Paillier cryptosystem is used to encrypt and decrypt parameters. Paillier is an additively homomorphic scheme: ciphertexts can be added to each other, or multiplied by a plaintext scalar, without being decrypted. Refer to crypto/paillier for more details about Paillier.
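The homomorphic property can be seen in a small textbook Paillier sketch in Python. It uses the common simplification g = n + 1 and tiny primes so that it stays readable. The function names and key layout are assumptions for illustration and are unrelated to the API in crypto/paillier, and keys this small offer no security.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a toy key pair from two distinct primes; returns (n, (lam, mu))."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt 0 <= m < n as (1 + m*n) * r^n mod n^2, using g = n + 1."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n: int, priv, c: int) -> int:
    """Recover m as L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n


def add(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


def scalar_mul(n: int, c: int, k: int) -> int:
    """Raising a ciphertext to k multiplies the plaintext by k."""
    return pow(c, k, n * n)
```

With `n, priv = keygen(1009, 1013)`, decrypting `add(n, encrypt(n, 1234), encrypt(n, 4321))` gives 5555, and decrypting `scalar_mul(n, encrypt(n, 1234), 3)` gives 3702, as long as the results stay below `n`.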
A prediction task requires a model, so the related training task must be completed before the prediction task starts. Models are stored separately in each participant's local storage. Participants compute local prediction results using their own models, and the partial prediction results are then gathered to derive the final result.
For linear regression, a destandardization step is performed after gathering all partial results. Only the party that holds the labels can perform this step, so all partial results are sent to that party, which derives the final result and stores it as a file in XuperDB for the Requester to use.
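A minimal Python sketch of this aggregation step is given below. It assumes that labels were z-score standardized during training, so destandardization multiplies by the label standard deviation and adds the label mean. That assumption, along with the function and parameter names, is illustrative and not taken from the project code.

```python
def partial_prediction(weights, features, bias=0.0):
    """One party's contribution, computed from its own model and features."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def destandardize(y_std, label_mean, label_std):
    """Map a standardized prediction back to the original label scale."""
    return y_std * label_std + label_mean


def final_prediction(partials, label_mean, label_std):
    """Run by the label-holding party after gathering all partial results."""
    return destandardize(sum(partials), label_mean, label_std)
```

For instance, if one party contributes `partial_prediction([0.5, -1.0], [2.0, 1.0])` and the label holder contributes `partial_prediction([2.0], [0.25], bias=0.5)`, the label holder combines them with `final_prediction(..., label_mean=50.0, label_std=10.0)`. Only that party knows the label statistics, which is why the final step runs there.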
There are two ways of installing PaddleDTX:
We highly recommend running PaddleDTX in Docker. You can install all the components with the Docker images we provide; please refer to starting network. If you want to build the Docker images locally, please refer to building image of PaddleDTX and building image of XuperDB.
To build PaddleDTX from source code, you need:
- go 1.13 or greater
# In dai directory
make
# In xdb directory
make
You can get the installation package from ./output and install it manually.
We provide test scripts for you to test, understand and use PaddleDTX.