This repository contains the implementation of Privacy preserving vertical federated learning for tree-based models. This paper proposes a private and efficient solution for tree-based models, including decision tree (DT), random forest (RF), and gradient boosting decision tree (GBDT), under the vertical federated learning (VFL) setting. The solution is based on a hybrid of threshold partially homomorphic encryption (TPHE) and secure multiparty computation (MPC) techniques.
- Pivot-SPDZ
- libhcs
- libscapi
- Pivot uses libscapi for network communications among clients.
- Clone this repository and follow the guide in libscapi to install it.
- Python
- We implemented the non-private baselines and generated the synthetic datasets using sklearn.
- Install the necessary dependencies of python (see tools/README.md).
- Protobuf
- We used protobuf version 3.14.0 for the messages communicated among clients.
You can build the docker image using tools/docker/Dockerfile
(test passed on Ubuntu20.04), or download
the pre-built image from docker hub here.
After building the image, follow the steps in tools/docker/README.md
to run the test on a single machine.
If want to build from source, you can follow the steps in tools/docker/Dockerfile
, but need to update
some configurations on your host machine.
- In Pivot, update the following if needed:
-
data/networks/Parties.txt
: defining the participating parties' ip addresses and ports -
src/include/common.h
:-
DEFAULT_PARAM_DATA_FILE
: the SPDZ related party file (in the Pivot-SPDZ folder) -
SPDZ_PORT_NUM_DT
: the port for connecting to SPDZ decision tree MPC program -
SPDZ_PORT_NUM_DT_ENHANCED
: the port for connecting to SPDZ decision tree prediction of the enhanced protocol
-
- other algorithm-related default parameters in
src/include/common.h
: e.g., the number of parties - revise ${SPDZ_HOME} in CMakeLists.txt to ${PIVOT_SPDZ_HONE}
-
- In Pivot-SPDZ, update the following if needed:
-
Programs/Source/vfl_decision_tree.mpc
-
PORT_NUM
: same asSPDZ_PORT_NUM_DT
-
MAX_NUM_CLIENTS
: the maximum number of clients could handle -
MAX_CLASSES_NUM
: the maximum number of classes for classification (by default is 2 for regression) - other algorithm-related parameters
-
-
Programs/Source/vfl_dt_enhanced_prediction.mpc
-
PORT_NUM
: same asSPDZ_PORT_NUM_DT_ENHANCED
-
MAX_NUM_CLIENTS
: the maximum number of clients could handle -
MAX_TREE_DEPTH
: the maximum depth of the evaluated tree, must be the same as in Pivot -
TESTING_NUM
: the number of samples in the testing stage, must be the exact at the moment
-
-
fast-make.sh
: modify Setup.x and setup-online.sh (the security parameter is 128 bits)
-
- Build Pivot-SPDZ
- cd ${PIVOT_SPDZ_HOME}, make sure that
MY_CFLAGS = -DINSECURE
is in the CONFIG.mine file (for running fake online protocol) - cd ${PIVOT_SPDZ_HOME},
make mpir
to generate required mpir lib; - cd ${PIVOT_SPDZ_HOME}, run
bash fast-make.sh
to generate pre-requisite programs and parameters; - compile the MPC programs
./compile.py ${PIVOT_SPDZ_HOME}/Programs/Source/vfl_decision_tree.mpc ./compile.py ${PIVOT_SPDZ_HOME}/Programs/Source/vfl_dt_enhanced_prediction.mpc
- cd ${PIVOT_SPDZ_HOME}, make sure that
- Build Pivot
- build the program as follows:
mkdir build cmake -Bbuild -H. cd build/ make
- To run the Pivot training, for example, the DT algorithm with 3 clients, execute:
- cd ${PIVOT_SPDZ_HOME}, run 3 MPC programs in separate terminals
./semi-party.x -F -N 3 -I -p 0 vfl_decision_tree ./semi-party.x -F -N 3 -I -p 1 vfl_decision_tree ./semi-party.x -F -N 3 -I -p 2 vfl_decision_tree
- cd ${PIVOT_HOME}, run 3 programs in separate terminals for DT model
./Pivot --client-id 0 --client-num 3 --class-num 2 --algorithm-type 0 --tree-type 0 --solution-type 0 --optimization-type 1 --network-file ${PIVOT_HOME}/data/networks/Parties.txt --data-file ${PIVOT_HOME}/data/bank_marketing_data/client_0.txt --logger-file ${PIVOT_HOME}/log/release_test/bank_marketing_data --max-bins 16 --max-depth 3 --num-trees 1 ./Pivot --client-id 1 --client-num 3 --class-num 2 --algorithm-type 0 --tree-type 0 --solution-type 0 --optimization-type 1 --network-file ${PIVOT_HOME}/data/networks/Parties.txt --data-file ${PIVOT_HOME}/data/bank_marketing_data/client_1.txt --logger-file ${PIVOT_HOME}/log/release_test/bank_marketing_data --max-bins 16 --max-depth 3 --num-trees 1 ./Pivot --client-id 2 --client-num 3 --class-num 2 --algorithm-type 0 --tree-type 0 --solution-type 0 --optimization-type 1 --network-file ${PIVOT_HOME}/data/networks/Parties.txt --data-file ${PIVOT_HOME}/data/bank_marketing_data/client_2.txt --logger-file ${PIVOT_HOME}/log/release_test/bank_marketing_data --max-bins 16 --max-depth 3 --num-trees 1
- cd ${PIVOT_SPDZ_HOME}, run 3 MPC programs in separate terminals
- To run RF and GBDT model, modify the corresponding parameter for invoking Pivot
- To run the enhanced protocol, besides of modifying the corresponding parameter for invoking Pivot,
need to run another MPC program
vfl_dt_enhanced_prediction
for the model prediction stage.- cd ${PIVOT_SPDZ_HOME}, run another 3 MPC programs in separate terminals
./semi-party.x -F -N 3 -I -p 0 -pn 6000 vfl_dt_enhanced_prediction ./semi-party.x -F -N 3 -I -p 1 -pn 6000 vfl_dt_enhanced_prediction ./semi-party.x -F -N 3 -I -p 2 -pn 6000 vfl_dt_enhanced_prediction
- the above
-pn
parameter is the port forvfl_dt_enhanced_prediction
connections (if not specified, default is 5000, as used forvfl_decision_tree
)
- cd ${PIVOT_SPDZ_HOME}, run another 3 MPC programs in separate terminals
If you use our code in your research, please kindly cite:
@article{DBLP:journals/pvldb/WuCXCO20,
author = {Yuncheng Wu and
Shaofeng Cai and
Xiaokui Xiao and
Gang Chen and
Beng Chin Ooi},
title = {Privacy Preserving Vertical Federated Learning for Tree-based Models},
journal = {Proc. {VLDB} Endow.},
volume = {13},
number = {11},
pages = {2090--2103},
year = {2020}
}
To ask questions or report issues, please drop us an email.