- Overview
- Installation and Deployment
- Quick Start
- Secure Multi-Party Computation
- Privacy-Preserving Machine Learning
- Privacy-Preserving Deep Learning
- Support multitasking concurrency
- Conclusion
- Additional Notes
If you have not set up a Rosetta environment yet, please refer to the Deployment Documentation.
To keep this tutorial simple, the examples below run in the single-machine, multi-node mode; please refer to the Deployment Documentation for how to deploy in this way.
All of the following content assumes you have completed the installation and deployment above. To save time, please make sure that your environment is ready.
Unless otherwise noted, all commands are run from the path example/tutorials/code/.
Now, let us enter this exciting moment together: how to use Rosetta
in the easiest way?
The steps are very simple. Whenever you want to use Rosetta in a Python script file, just import our Rosetta package, as follows:
import latticex.rosetta as rtt
Note:
rtt
is short for Rosetta
, just like tf
for TensorFlow
, np
for numpy
, pd
for pandas
. Let's just view this as a convention.
You can run directly in the same terminal, as follows:
./tutorials.sh rtt quickstart
Or, you can run the following program on three different terminals (just to simulate the scenario that three different parties are running on their own private data on their own machine):
# node 0
python3 rtt-quickstart.py --party_id=0
# node 1
python3 rtt-quickstart.py --party_id=1
# node 2
python3 rtt-quickstart.py --party_id=2
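Alternatively, a small shell loop such as the following (illustrative only; any process launcher works) starts all three parties from a single terminal:
# launch the three parties in the background and wait for them to finish
for id in 0 1 2; do
    python3 rtt-quickstart.py --party_id=$id &
done
wait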
If you see DONE!
in the output, you have accomplished the first goal.
--party_id is a command-line option that specifies which role the currently running script plays. In the following, to describe the examples concisely, we will just use ./tutorials.sh to show that we are running the three steps above.
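For reference, here is a minimal sketch of how a script might read this option (the real handling lives in rtt-quickstart.py; the argparse code below is illustrative):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--party_id', type=int, required=True,
                    help='which party (0, 1, or 2) this process plays')
args = parser.parse_args()
print("running as party", args.party_id)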
The following tutorial is just as easy as this Quick Start
. Let's continue.
Let's assume that there are two honest rich millionaires who are arguing about who has more wealth, while neither of them wants to tell the other the specific number in their bank account. What can they do? We cannot help them directly, but Rosetta can. Now, let's try to solve this problem.
The problem we have just proposed is a famous example in MPC, called Yao's Millionaires' Problem.
Let's be a little more specific. In the following example, the two millionaires, one called Alice and the other called Bob, have $2000001 and $2000000 respectively. You see, the difference between the two people's wealth is just one dollar.
For comparison, let's first solve this problem without taking their privacy into consideration.
This is trivial, every child can do this in less than one second. But in order to compare with Rosetta
, here we write a toy program in TensorFlow
style:
The first step is to import the package:
import tensorflow as tf
The second step is to set how much wealth each has:
Alice = tf.Variable(2000001)
Bob = tf.Variable(2000000)
The third step is to call session.run
and check the results:
res = tf.greater(Alice, Bob)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
ret = sess.run(res)
print("ret:", ret) # ret: True
Refer to tf-millionaire.py for the full version.
And now, we run this program as follows:
./tutorials.sh tf millionaire
The output will show:
ret: True
The result shows that Alice
has more wealth than Bob
.
It's intuitive, I won't go into details.
The above is a contrived plaintext example. Now let's see how to use Rosetta
to solve the original problem of millionaires without leaking their privacy. Let's begin, and you will see it is very simple!
The first step is to import the Rosetta
package.
import latticex.rosetta as rtt
import tensorflow as tf
The second step is to activate a protocol.
rtt.activate("SecureNN")
The third step is to set how much wealth each has privately.
We use the built-in rtt.private_console_input
to get the private data entered by the user in the terminal.
Alice = tf.Variable(rtt.private_console_input(0))
Bob = tf.Variable(rtt.private_console_input(1))
The fourth step is exactly the same as in TensorFlow
.
res = tf.greater(Alice, Bob)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
res = sess.run(res)
print('res:', res) # res: b'\x90\xa3\xff\x14\x87f\x95\xc3#'
The above output of res is a sharing value in the secret-sharing scheme. A sharing value is essentially a random number: an original true value x is randomly split into two 64-bit values x0 and x1 (with x = x0 + x1), which are held by P0 and P1, respectively.
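To build intuition, here is a minimal plain-Python sketch of two-party additive sharing (this is not Rosetta's internal implementation; the 64-bit ring below is an assumption matching the description above):
import random

RING = 1 << 64  # assumed 64-bit ring, matching the 64-bit shares described above

def share(x):
    # split x into two random shares; each share alone reveals nothing about x
    x0 = random.randrange(RING)
    x1 = (x - x0) % RING
    return x0, x1  # held by P0 and P1 respectively

def reveal(x0, x1):
    # reconstruct the original value from both shares
    return (x0 + x1) % RING

x0, x1 = share(2000001)
assert reveal(x0, x1) == 2000001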
How do we know the plaintext value? We provide a reveal interface to get the plaintext value, for convenience in debugging and testing. Just add the following after the fourth step:
with tf.Session() as sess:
# ...
ret = rtt.SecureReveal(res)
print('ret:', sess.run(ret)) # ret: b'1.000000'
For the complete program of console version, refer to rtt-millionaire-console.py.
For the complete program of script version, refer to rtt-millionaire.py.
Run this program as follows:
./tutorials.sh rtt millionaire
The output is as follows:
ret: 1.0
The result indicates that Alice
has more wealth than Bob
.
For a description of all operators supported by Rosetta, including SecureReveal, please refer to our API Document.
Of course, Rosetta
can not only be used to solve simple problems like Millionaire's Problem
. Next, let’s see how we can tackle real problems in Machine Learning
with the help of Rosetta
framework.
We will talk about the combination of Privacy-Preserving techniques and Machine Learning (ML) in this part. Let's start with the simplest machine learning model: Linear Regression.
This section introduces how to use Rosetta
to perform a complete Linear Regression
task, including data processing
, training and model saving
, model loading and prediction, and evaluation.
Before using Rosetta for machine learning, in order to compare with the Rosetta-backed MPC version, let's see how we do the same task in native TensorFlow style, without any concern for data privacy.
Here is a simple Linear Regression
with TensorFlow
.
- Import necessary packages, set training parameters, etc.
import math
import os
import csv
import tensorflow as tf
import numpy as np
import pandas as pd
from util import read_dataset
np.set_printoptions(suppress=True)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
np.random.seed(0)
EPOCHES = 10
BATCH_SIZE = 16
learning_rate = 0.0002
- Load dataset
Please refer to the appendix at the end of this tutorial.
We highlight the code that differs from the Rosetta version, and we will come back to it later.
# real data
# ######################################## difference from rosetta
file_x = '../dsets/ALL/reg_train_x.csv'
file_y = '../dsets/ALL/reg_train_y.csv'
real_X, real_Y = pd.read_csv(file_x).to_numpy(), pd.read_csv(file_y).to_numpy()
# ######################################## difference from rosetta
DIM_NUM = real_X.shape[1]
- Write a Linear Regression model.
We will not go into details here and only present the code below. For details on writing machine learning models, please refer to the TensorFlow™ official website.
X = tf.placeholder(tf.float32, [None, DIM_NUM])
Y = tf.placeholder(tf.float32, [None, 1])
# initialize W & b
W = tf.Variable(tf.zeros([DIM_NUM, 1]))
b = tf.Variable(tf.zeros([1]))
# predict
pred_Y = tf.matmul(X, W) + b
# loss
loss = tf.square(Y - pred_Y)
loss = tf.reduce_mean(loss)
# optimizer
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
init = tf.global_variables_initializer()
- Model training.
with tf.Session() as sess:
sess.run(init)
xW, xb = sess.run([W, b])
print("init weight:{} \nbias:{}".format(xW, xb))
# train
BATCHES = math.ceil(len(real_X) / BATCH_SIZE)
for e in range(EPOCHES):
for i in range(BATCHES):
bX = real_X[(i * BATCH_SIZE): (i + 1) * BATCH_SIZE]
bY = real_Y[(i * BATCH_SIZE): (i + 1) * BATCH_SIZE]
sess.run(train, feed_dict={X: bX, Y: bY})
j = e * BATCHES + i
if j % 50 == 0 or (j == EPOCHES * BATCHES - 1 and j % 50 != 0):
xW, xb = sess.run([W, b])
print("I,E,B:{:0>4d},{:0>4d},{:0>4d} weight:{} \nbias:{}".format(
j, e, i, xW, xb))
# predict
Y_pred = sess.run(pred_Y, feed_dict={X: real_X, Y: real_Y})
print("Y_pred:", Y_pred)
Please refer to tf-linear_regression.py for the complete source code.
And then, we run this program as follows:
./tutorials.sh tf linear_regression
The output is as follows:
Y_pred: [[4.8402567]
[5.297159 ]
[5.81963 ]
...
[4.9908857]
[5.8464894]
[6.157756 ]]
As mentioned above, if you have an existing model training script (.py) written with TensorFlow, then all you need to do is import the following package on the first line of the script:
import latticex.rosetta as rtt
Yes, it's that simple! You don't need to modify any already-written code. Even if you have no knowledge of cryptography, you can use it easily. The only difference is how you set your private data.
- Activate protocol
rtt.activate("SecureNN")
Note: The protocol must be activated before any MPC-related API can be used.
- Load dataset
Please refer to the appendix at the end of this article for the dataset description.
We have highlighted the spots that differ from the native TensorFlow version without data privacy. Except for the import of the Rosetta package, only these few lines are different.
Rosetta
provides a class, PrivateDataset
, specifically for handling private data sets. Check the relevant source code for details.
# real data
# ######################################## difference from tensorflow
file_x = '../dsets/P' + str(rtt.mpc_player.id) + "/reg_train_x.csv"
file_y = '../dsets/P' + str(rtt.mpc_player.id) + "/reg_train_y.csv"
real_X, real_Y = rtt.PrivateDataset(data_owner=(
0, 1), label_owner=1).load_data(file_x, file_y, header=None)
# ######################################## difference from tensorflow
DIM_NUM = real_X.shape[1]
Please refer to rtt-linear_regression.py for the complete source code.
OK. Now let's briefly summarize the differences from the TensorFlow version:
- Importing the Rosetta package.
- Activating a protocol.
- Loading the data sets.
Now, we have written the complete code, then how do we run it?
Do you still remember the above Millionaires' Problem
example that you have learned? There is no difference in running them.
Just run it as follows:
./tutorials.sh rtt linear_regression
The output is as follows:
Y_pred: [[b'\x9f\xf5\n\xc2\x81\x06\x00\x00#']
[b'g6j\x7fq\x0f\x00\x00#']
[b'\x95\xfc\x06\x1cA}\x00\x00#']
...
[b'\x19\x02\xd5\xfd\xf1c\x00\x00#']
[b'\xe1\xd5\x16pGz\x00\x00#']
[b'}\xfe8\xd3,\x91\xff\xff#']]
Don't panic: what you're seeing are random values in the secret-sharing scheme, rather than plaintext values, because Rosetta is protecting the privacy of your data right now.
The sharing values output in the previous section are not human-readable. For testing, debugging, comparison with plaintext, and so on, we provide a reveal interface to get the plaintext value.
Reminder: please be cautious when using this reveal interface in a production environment.
Let's modify the previous (basic) program by adding reveal, and see what effect it has. The new program is as follows:
# ########### for test, reveal
reveal_W = rtt.SecureReveal(W)
reveal_b = rtt.SecureReveal(b)
reveal_Y = rtt.SecureReveal(pred_Y)
# ########### for test, reveal
with tf.Session() as sess:
sess.run(init)
rW, rb = sess.run([reveal_W, reveal_b])
print("init weight:{} \nbias:{}".format(rW, rb))
# train
BATCHES = math.ceil(len(real_X) / BATCH_SIZE)
for e in range(EPOCHES):
for i in range(BATCHES):
bX = real_X[(i * BATCH_SIZE): (i + 1) * BATCH_SIZE]
bY = real_Y[(i * BATCH_SIZE): (i + 1) * BATCH_SIZE]
sess.run(train, feed_dict={X: bX, Y: bY})
j = e * BATCHES + i
if j % 50 == 0 or (j == EPOCHES * BATCHES - 1 and j % 50 != 0):
rW, rb = sess.run([reveal_W, reveal_b])
print("I,E,B:{:0>4d},{:0>4d},{:0>4d} weight:{} \nbias:{}".format(
j, e, i, rW, rb))
# predict
Y_pred = sess.run(reveal_Y, feed_dict={X: real_X, Y: real_Y})
print("Y_pred:", Y_pred)
Please refer to rtt-linear_regression_reveal.py for the complete source code.
Then, we run it as follows:
./tutorials.sh rtt linear_regression_reveal
And, the output will be as follows:
Y_pred: [[b'4.844925']
[b'5.297165']
[b'5.819885']
...
[b'4.992172']
[b'5.845917']
[b'6.159866']]
Try to compare this output with the output of the TensorFlow
version to see how much the error is.
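For instance, here is a quick, hypothetical way to quantify the gap, using the first three predictions copied from the two outputs above:
import numpy as np

tf_pred = np.array([4.8402567, 5.297159, 5.81963])   # from tf-linear_regression
rtt_pred = np.array([4.844925, 5.297165, 5.819885])  # revealed Rosetta values

abs_err = np.abs(tf_pred - rtt_pred)
rel_err = abs_err / np.abs(tf_pred)
print("max abs err:", abs_err.max())
print("max rel err:", rel_err.max())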
At this point in the tutorial, you can get both the predicted values and the weights of the TensorFlow version and the Rosetta version.
For models with few parameters, the error in the previous section can be roughly judged at first sight, but if there are many parameters and the dataset is very large, then auxiliary tools are needed.
We only list the final comparison results here. For details, refer to Comparison and Evaluation 2
.
Below is a comparison of the evaluation result of TensorFlow
and Rosetta
.
TensorFlow:
{
"tag": "tensorflow",
"mse": 0.5228572335042407,
"rmse": 0.7230886761001314,
"mae": 0.4290781021000001,
"evs": 0.2238489236789002,
"r2": 0.18746385319936198
}
Rosetta:
{
"tag": "rosetta",
"mse": 0.5219412461669367,
"rmse": 0.72245501324784,
"mae": 0.4286960000000004,
"evs": 0.2244437402223386,
"r2": 0.18888732556213872
}
We can see that the evaluation scores of the two versions are almost the same, with only a little loss of precision.
Note: the R^2 score is low because this dataset was generated for Logistic Regression rather than Linear Regression; here we only need to care about the error between the two versions (it is very small).
Error comparison (linear regression)
The following figure is about the absolute error comparison between the predicted values of TensorFlow
and Rosetta
.
The following figure is about the relative error comparison between the predicted values of TensorFlow
and Rosetta
.
We only showed the final results in the Comparison and Evaluation 1 section. Here we dive a little deeper into the inner process. You can skip this section if you are not interested.
In this section, Linear Regression is evaluated using R^2, and Logistic Regression is evaluated using AUC/ACC/F1.
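These metrics can be computed with scikit-learn; here is a minimal, self-contained sketch (the sample arrays are made up; in the tutorial the values come from the CSV files saved by the stat scripts):
from sklearn import metrics
import numpy as np

# made-up sample values; in the tutorial these come from the saved CSV files
y_true = np.array([4.9, 5.3, 5.8, 6.1])
y_pred = np.array([4.84, 5.30, 5.82, 6.16])

# linear regression metrics (R^2 and friends)
mse = metrics.mean_squared_error(y_true, y_pred)
rmse = mse ** 0.5
mae = metrics.mean_absolute_error(y_true, y_pred)
evs = metrics.explained_variance_score(y_true, y_pred)
r2 = metrics.r2_score(y_true, y_pred)
print(mse, rmse, mae, evs, r2)

# logistic regression metrics (AUC/ACC/F1 on labels vs. predicted probabilities)
y_cls = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.4])
auc = metrics.roc_auc_score(y_cls, y_prob)
acc = metrics.accuracy_score(y_cls, (y_prob > 0.5).astype(int))
f1 = metrics.f1_score(y_cls, (y_prob > 0.5).astype(int))
print(auc, acc, f1)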
Next, let’s modify the last part of the program and add the statistical functionality (this modification is the same for the TensorFlow
version and the Rosetta
version)
# #############################################################
# save to csv for comparing, for debug (this fragment also requires `import sys`;
# savecsv is a small CSV-writing helper defined in the tutorial code)
scriptname = os.path.basename(sys.argv[0]).split(".")[0]
csvprefix = "./log/" + scriptname
os.makedirs(csvprefix, exist_ok=True)
csvprefix = csvprefix + "/tf"
# #############################################################
with tf.Session() as sess:
sess.run(init)
xW, xb = sess.run([W, b])
print("init weight:{} \nbias:{}".format(xW, xb))
# train
BATCHES = math.ceil(len(real_X) / BATCH_SIZE)
for e in range(EPOCHES):
for i in range(BATCHES):
bX = real_X[(i * BATCH_SIZE): (i + 1) * BATCH_SIZE]
bY = real_Y[(i * BATCH_SIZE): (i + 1) * BATCH_SIZE]
sess.run(train, feed_dict={X: bX, Y: bY})
j = e * BATCHES + i
if j % 50 == 0 or (j == EPOCHES * BATCHES - 1 and j % 50 != 0):
xW, xb = sess.run([W, b])
print("I,E,B:{:0>4d},{:0>4d},{:0>4d} weight:{} \nbias:{}".format(
j, e, i, xW, xb))
savecsv("{}-{:0>4d}-{}.csv".format(csvprefix, j, "W"), xW)
savecsv("{}-{:0>4d}-{}.csv".format(csvprefix, j, "b"), xb)
# predict
Y_pred = sess.run(pred_Y, feed_dict={X: real_X, Y: real_Y})
print("Y_pred:", Y_pred)
savecsv("{}-pred-{}.csv".format(csvprefix, "Y"), Y_pred)
# save real y for evaluation
savecsv("{}-real-{}.csv".format(csvprefix, "Y"), real_Y)
Please refer to tf-linear_regression_stat.py for the complete code.
Please refer to rtt-linear_regression_stat.py for the complete code.
Then, you will get the evaluation results after running it as follows:
./tutorials.sh tf linear_regression_stat
./tutorials.sh rtt linear_regression_stat
./tutorials.sh stat linear_regression_stat linear
So far, we just output the model parameters and predicted values to the terminal. How can we save the trained model as we often do in machine learning?
You may wonder: since we are doing all this in a multi-party way, WHERE will the trained model be saved after running with Rosetta? And HOW will it be saved? Good question. Let's talk about model saving.
There are several conventions:
- If you want to use Rosetta for prediction on a shared private dataset, please save the model as cipher text.
- If you save the model as plain text, and want to use this model to make predictions on plaintext, please use TensorFlow directly to make predictions.
Regarding the saving of the plaintext result, you can choose to save at node 0, node 1, node 2, or all three nodes. This setting is in the configuration file.
You can try to modify the value of SAVER_MODE in the configuration file to see how it works. SAVER_MODE is interpreted as a bitmap combination, and its meaning is as follows:
// 0: Save the cipher text. (And the number 1 ~ 7 below indicates which specific parties to save the plain text files)
// 1: P0,
// 2: P1,
// 4: P2,
// 3: P0 and P1
// 5: P0 and P2
// 6: P1 and P2
// 7: P0, P1 and P2
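To make the bitmap concrete, here is a tiny illustrative sketch (the helper name is ours, not a Rosetta API):
def plaintext_savers(saver_mode):
    # bit i set => party Pi saves the plaintext model; 0 means cipher text only
    return [i for i in range(3) if saver_mode & (1 << i)]

print(plaintext_savers(0))  # [] : cipher text
print(plaintext_savers(5))  # [0, 2] : P0 and P2
print(plaintext_savers(7))  # [0, 1, 2] : all three parties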
In this section, we will use Rosetta to train the model, save it as plaintext, and then load this plaintext model into the TensorFlow version for prediction. Finally, we check how it differs from using TensorFlow directly on the plaintext dataset without data privacy.
Based on the previous Rosetta version of the program, we add some code related to saving.
Before training starts:
# save
saver = tf.train.Saver(var_list=None, max_to_keep=5, name='v2')
os.makedirs("./log/ckpt"+str(party_id), exist_ok=True)
After training:
saver.save(sess, './log/ckpt'+str(party_id)+'/model')
Please refer to rtt-linear_regression_saver.py for details.
And then run it as follows:
./tutorials.sh rtt linear_regression_saver
The model has been saved to the corresponding node in the previous step (according to the configuration file). Now use TensorFlow
to load the plaintext model saved in the previous step and make predictions.
# save
saver = tf.train.Saver(var_list=None, max_to_keep=5, name='v2')
os.makedirs("./log/ckpt0", exist_ok=True)
# restore mpc's model and predict
with tf.Session() as sess:
sess.run(init)
if os.path.exists("./log/ckpt0/checkpoint"):
saver.restore(sess, './log/ckpt0/model')
# predict
Y_pred = sess.run(pred_Y)
print("Y_pred:", Y_pred)
For the complete code, please refer to tf-linear_regression_restore.py.
And then run it as follows:
./tutorials.sh tf linear_regression_restore
The output will be as follows:
Y_pred: [[6.17608922]
[6.15961048]
[5.40468624]
...
[5.20862467]
[5.49407074]
[6.21659464]]
Summary:
Complete source code list for reference.
TensorFlow version:

| Functionality | Source file |
| --- | --- |
| Basics | tf-linear_regression.py |
| Model training and saving | tf-linear_regression_saver.py |
| Model loading and prediction | tf-linear_regression_restore.py |
| Evaluation | tf-linear_regression_stat.py |
Rosetta version:

| Functionality | Source file |
| --- | --- |
| Basics | rtt-linear_regression.py |
| Basics (reveal plaintext output) | rtt-linear_regression_reveal.py |
| Model training and saving | rtt-linear_regression_saver.py |
| Model (cipher) loading and prediction | rtt-linear_regression_restore.py |
| Evaluation | rtt-linear_regression_stat.py |
With the foundation of Linear Regression above, Logistic Regression is very, very, very simple.
Based on Linear Regression, we use sigmoid
as a binary classifier and cross entropy
as a loss function to build a Logistic Regression model.
Whether it is the TensorFlow version or the Rosetta version, the changes are the same. Compared with the Linear Regression version, only the model-construction part needs to change, that is, only the following modifications are needed:
- On the prediction part, add the sigmoid function.
- On the loss part, replace the loss with the cross-entropy function.
# predict
pred_Y = tf.sigmoid(tf.matmul(X, W) + b)
# loss
logits = tf.matmul(X, W) + b
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=Y, logits=logits)
loss = tf.reduce_mean(loss)
Complete source code list for reference:
TensorFlow version:

| Functionality | Source file |
| --- | --- |
| Basics | tf-logistic_regression.py |
| Model training and saving | tf-logistic_regression_saver.py |
| Model loading and prediction | tf-logistic_regression_restore.py |
| Evaluation | tf-logistic_regression_stat.py |

Rosetta version:

| Functionality | Source file |
| --- | --- |
| Basics | rtt-logistic_regression.py |
| Basics (reveal plaintext output) | rtt-logistic_regression_reveal.py |
| Model training and saving | rtt-logistic_regression_saver.py |
| Model (cipher) loading and prediction | rtt-logistic_regression_restore.py |
| Evaluation | rtt-logistic_regression_stat.py |
Just run these programs in the same way as in Linear Regression.
Here is the comparison of the two different versions of Logistic Regression model.
TensorFlow:
{
"tag": "tensorflow",
"score_auc": 0.698190821826193,
"score_ks": 0.2857188520398128,
"threshold_opt": 0.6037812829,
"score_accuracy": 0.6458170445660673,
"score_precision": 0.6661931818181818,
"score_recall": 0.6826783114992722,
"score_f1": 0.6743350107836089
}
Rosetta:
{
"tag": "rosetta",
"score_auc": 0.6977740568078996,
"score_ks": 0.2857188520398128,
"threshold_opt": 0.607208,
"score_accuracy": 0.6458170445660673,
"score_precision": 0.6661931818181818,
"score_recall": 0.6826783114992722,
"score_f1": 0.6743350107836089
}
Here we only care about the error between the two versions; we can see that it is very small (most scores are identical, and the AUC differs by only about 0.0004).
Error comparison (logistic regression)
The following figure is about the absolute error comparison between the predicted values of TensorFlow
and Rosetta
.
The following figure is about the relative error comparison between the predicted values of TensorFlow
and Rosetta
.
Warm tip: there may be individual rtt prediction values here that differ significantly from the tf prediction values (without affecting the scores), for the following reason:
- Rosetta uses a 6-segment piecewise function (on the interval [-4, 4]) to approximate sigmoid, setting the output to 0 or 1 when the sigmoid input is less than -4 or greater than 4. This results in a larger relative error when computing (tf prediction value - rtt prediction value) / rtt prediction value.
- As shown above, for the 425th and 727th samples, the input values of sigmoid are -4.1840362549 and -4.6936187744, respectively.
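To illustrate the effect, here is a rough piecewise-linear stand-in for such an approximation (the knot placement below is our assumption; Rosetta's actual segment coefficients differ):
import numpy as np

def sigmoid_piecewise(x):
    # sample the true sigmoid at a few knots in [-4, 4], interpolate linearly,
    # and clamp to 0/1 outside the interval, as described above
    knots = np.array([-4.0, -2.0, -1.0, 1.0, 2.0, 4.0])
    vals = 1.0 / (1.0 + np.exp(-knots))
    return np.interp(x, knots, vals, left=0.0, right=1.0)

# inputs below -4 are clamped to exactly 0, which blows up the relative error
print(sigmoid_piecewise(np.array([-4.6936187744, -4.1840362549, 0.5])))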
The linear regression and logistic regression models above all load the entire dataset into memory and then feed it batch by batch for training; as the dataset grows, loading it into memory at once becomes impractical.
Mainstream plaintext AI frameworks are aware of this and provide solutions: TensorFlow provides the Dataset APIs to build low-memory, complex, reusable data pipelines. Since Rosetta uses TensorFlow as a backend, these can be reused with minor modifications.
We use logistic regression model as an example to illustrate how to train a model with large datasets.
For the TensorFlow version complete code, please refer to tf-ds-lr.py.
For the Rosetta version complete code, please refer to rtt-ds-lr.py.
Analysis of the code in tf-ds-lr.py and rtt-ds-lr.py reveals two main differences.
- Creating a text-line dataset: TensorFlow uses the TextLineDataset class, while Rosetta uses the PrivateTextLineDataset class. The code used in TensorFlow is as follows:

dataset_x = tf.data.TextLineDataset(file_x)
...
The code used in Rosetta is as follows:

dataset_x0 = rtt.PrivateTextLineDataset(
    file_x, data_owner=0)  # P0 holds the file data
...
- The decode functions are implemented differently. The TensorFlow version of the decode function splits each row into fields and then converts the fields to floating-point numbers, while the Rosetta version also first splits each row into fields, but then calls the PrivateInput function to share the data. The code used in TensorFlow is as follows:

# dataset decode
def decode_x(line):
    fields = tf.string_split([line], ',').values
    fields = tf.string_to_number(fields, tf.float64)
    return fields
The code used in Rosetta is as follows:

# dataset decode
def decode_p0(line):
    fields = tf.string_split([line], ',').values
    fields = rtt.PrivateInput(fields, data_owner=0)  # P0 holds the file data
    return fields
The Rosetta backend uses the TensorFlow execution engine to execute the computation graph built from Rosetta's privacy operations, so you can stop the graph at any time during the training of a privacy AI model, in the same way as in TensorFlow. To stop graph execution in TensorFlow, we simply call the Python API Session.close(); in Rosetta we can likewise stop the execution of the computation graph by calling Session.close().
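A minimal TF1-style sketch of the pattern (the graph here is just a placeholder):
import tensorflow as tf
# import latticex.rosetta as rtt  # with Rosetta imported, the same calls apply

x = tf.Variable(1.0)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# ... run training steps here ...
sess.close()  # stops graph execution, for a privacy graph just as for plain TF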
After reading [Privacy-Preserving Machine Learning](#Privacy-Preserving Machine Learning), you should have a certain understanding of the grammar of Rosetta and TensorFlow. So, in this section, we offer an example of classifying the MNIST dataset with an MLP neural network.
First, here is the TensorFlow version.
- Import related packages and load dataset
from tensorflow.examples.tutorials.mnist import input_data
import os
import tensorflow as tf
mnist_home = os.path.join("/tmp/data/", 'mnist')
mnist = input_data.read_data_sets(mnist_home, one_hot=True)
# split the data into train and test
X_train = mnist.train.images
X_test = mnist.test.images
Y_train = mnist.train.labels
Y_test = mnist.test.labels
# make iterator
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
train_dataset = train_dataset.batch(100).repeat()
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, Y_test))
test_dataset = test_dataset.batch(100).repeat()
train_iterator = train_dataset.make_one_shot_iterator()
train_next_iterator = train_iterator.get_next()
test_iterator = test_dataset.make_one_shot_iterator()
test_next_iterator = test_iterator.get_next()
- Set hyperparameters and construct MLP model
num_outputs = 10
num_inputs = 784
w=[]
b=[]
def mlp(x, num_inputs, num_outputs, num_layers, num_neurons):
w = []
b = []
for i in range(num_layers):
# weights
w.append(tf.Variable(tf.random_normal(
[num_inputs if i == 0 else num_neurons[i - 1],
num_neurons[i]], seed = 1, dtype=tf.float64),
name="w_{0:04d}".format(i), dtype=tf.float64
))
# biases
b.append(tf.Variable(tf.random_normal(
[num_neurons[i]], seed = 1, dtype=tf.float64),
name="b_{0:04d}".format(i), dtype=tf.float64
))
w.append(tf.Variable(tf.random_normal(
[num_neurons[num_layers - 1] if num_layers > 0 else num_inputs,
num_outputs], seed = 1, dtype=tf.float64), name="w_out", dtype=tf.float64))
b.append(tf.Variable(tf.random_normal([num_outputs], seed = 1, dtype=tf.float64), name="b_out", dtype=tf.float64))
# x is input layer
layer = x
# add hidden layers
for i in range(num_layers):
layer = tf.nn.relu(tf.matmul(layer, w[i]) + b[i])
# add output layer
layer = tf.matmul(layer, w[num_layers]) + b[num_layers]
return layer
- Implement training function and related functions
def mnist_batch_func(batch_size=100):
X_batch, Y_batch = mnist.train.next_batch(batch_size)
return [X_batch, Y_batch]
def tensorflow_classification(n_epochs, n_batches,
batch_size,
model, optimizer, loss, accuracy_function,
X_test, Y_test):
with tf.Session() as tfs:
tfs.run(tf.global_variables_initializer())
for epoch in range(n_epochs):
epoch_loss = 0.0
for batch in range(n_batches):
X_batch, Y_batch = tfs.run(train_next_iterator)
feed_dict = {x: X_batch, y: Y_batch}
_, batch_loss = tfs.run([optimizer, loss], feed_dict)
epoch_loss += batch_loss
average_loss = epoch_loss / n_batches
print("epoch: {0:04d} loss = {1:0.6f}".format(
epoch, average_loss))
feed_dict = {x: X_test, y: Y_test}
accuracy_score = tfs.run(accuracy_function, feed_dict=feed_dict)
print("accuracy={0:.8f}".format(accuracy_score))
# construct input
x = tf.placeholder(dtype=tf.float64, name="x",
shape=[None, num_inputs])
# construct output
y = tf.placeholder(dtype=tf.float64, name="y",
shape=[None, 10])
# hidden layers' parameters
num_layers = 2
num_neurons = [128, 256]
learning_rate = 0.01
n_epochs = 30
batch_size = 100
n_batches = int(mnist.train.num_examples/batch_size)
model = mlp(x=x,
num_inputs=num_inputs,
num_outputs=num_outputs,
num_layers=num_layers,
num_neurons=num_neurons)
loss = tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(logits=model, labels=y))
optimizer = tf.train.GradientDescentOptimizer(
learning_rate=learning_rate).minimize(loss)
predictions_check = tf.equal(tf.argmax(model, 1), tf.argmax(y, 1))
accuracy_function = tf.reduce_mean(tf.cast(predictions_check, dtype=tf.float64))
# train
tensorflow_classification(n_epochs=n_epochs,
n_batches=n_batches,
batch_size=batch_size,
model = model,
optimizer = optimizer,
loss = loss,
accuracy_function = accuracy_function,
X_test = X_test,
Y_test = Y_test
)
For the complete code, please refer to tf-mlp_mnist.py
Run it as follows:
python ./tf-mlp_mnist.py
Output as follows:
epoch: 0000 loss = 17.504000
epoch: 0001 loss = 6.774922
epoch: 0002 loss = 4.993065
epoch: 0003 loss = 4.047511
epoch: 0004 loss = 3.440471
epoch: 0005 loss = 3.006049
...
epoch: 0027 loss = 0.874316
epoch: 0028 loss = 0.847307
epoch: 0029 loss = 0.821776
accuracy=0.91100000
- Import packages and activate protocol
import os
import tensorflow as tf
import latticex.rosetta as rtt
import csv
import numpy as np
rtt.set_backend_loglevel(1)
np.set_printoptions(suppress=True)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
np.random.seed(0)
rtt.activate("SecureNN")
mpc_player_id = rtt.py_protocol_handler.get_party_id()
- Load dataset
If you want to understand how the dataset is loaded, please refer to [Support big data sets](#Support big data sets).
# load data
file_x = '../dsets/P' + str(mpc_player_id) + "/mnist_train_x.csv"
file_y = '../dsets/P' + str(mpc_player_id) + "/mnist_train_y.csv"
X_train_0 = rtt.PrivateTextLineDataset(file_x, data_owner=0)
X_train_1 = rtt.PrivateTextLineDataset(file_x, data_owner=1)
Y_train = rtt.PrivateTextLineDataset(file_y, data_owner=1)
# dataset decode
def decode_p0(line):
fields = tf.string_split([line], ',').values
fields = rtt.PrivateInput(fields, data_owner=0)
return fields
def decode_p1(line):
fields = tf.string_split([line], ',').values
fields = rtt.PrivateInput(fields, data_owner=1)
return fields
# dataset pipeline (BATCH_SIZE and cache_dir are defined earlier in the full script)
X_train_0 = X_train_0.map(decode_p0).cache(f"{cache_dir}/cache_p0_x0").batch(BATCH_SIZE).repeat()
X_train_1 = X_train_1.map(decode_p1).cache(f"{cache_dir}/cache_p1_x1").batch(BATCH_SIZE).repeat()
Y_train = Y_train.map(decode_p1).cache(f"{cache_dir}/cache_p1_y").batch(BATCH_SIZE).repeat()
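The training loop below initializes iterators named iter_x0, iter_x1, and iter_y; in the full script they are built from these datasets roughly as follows (a sketch; see rtt-mlp_mnist.py for the actual code):
# build initializable iterators over the three private datasets
iter_x0 = X_train_0.make_initializable_iterator()
iter_x1 = X_train_1.make_initializable_iterator()
iter_y = Y_train.make_initializable_iterator()
# get_next() tensors feed the model's input and label placeholders
batch_x0, batch_x1 = iter_x0.get_next(), iter_x1.get_next()
batch_y = iter_y.get_next()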
- The model can be saved in the folder
log/
def tensorflow_classification(n_epochs, n_batches,
batch_size,
model, optimizer, loss
):
with tf.Session() as tfs:
tfs.run(tf.global_variables_initializer())
tfs.run([iter_x0.initializer, iter_x1.initializer, iter_y.initializer])
for epoch in range(n_epochs):
epoch_loss = 0.0
for i in range(n_batches):
tfs.run([optimizer, loss])
saver.save(tfs, './log/ckpt'+str(mpc_player_id)+'/model')  # saver: a tf.train.Saver created earlier
The rest is the same as the TensorFlow version of the MLP. For the complete code, please refer to rtt-mlp_mnist.py.
Run it as follows:
./tutorials.sh rtt mlp_mnist
The plaintext model is saved in the folder log/; you can use the following command to evaluate the model:
python tf-test_mlp_acc.py
Example results are as follows:
['w_out:0', 'b_out:0']
[array([[ 0.24279785, -0.35299683, -0.83648682, ..., -1.71618652,
0.65466309, -0.75320435],
[ 0.16467285, -1.01171875, 0.07672119, ..., 2.43850708,
-0.34365845, 0.50970459],
[ 0.68637085, 0.59738159, 0.21426392, ..., -2.1026001 ,
-1.08334351, -0.51135254],
...,
[ 1.18951416, 0.50506592, -0.19161987, ..., 0.31906128,
-0.21728516, -1.74258423],
[ 0.47128296, -1.10772705, -1.14147949, ..., -0.80792236,
-0.2272644 , -0.60620117],
[-1.35250854, -0.00039673, -1.37692261, ..., 0.28158569,
-1.86367798, 0.2359314 ]]), array([ 0.05230713, -0.48815918, -0.75996399, -0.41955566, 1.78201294,
-0.42456055, -0.03417969, -1.80670166, 0.40750122, -0.93180847])]
accuracy=0.14000000
Because the model has no hidden layers and uses a mini dataset, the accuracy is very low. If you are interested, you can adjust the structure of the model and the size of the dataset to improve its accuracy.
Multitasking is not supported in versions prior to Rosetta v1.0.0, which means that only one privacy protocol (e.g. SecureNN, Helix, etc.) can be active at any time; if you wanted to execute multiple tasks concurrently, you had to use multiple processes. We refactored the code in Rosetta v1.0.0 to support multitasking, allowing different tasks to run concurrently under different privacy protocols. If users have concurrency requirements across multiple tasks, the business code can now also be written in a multi-threaded way, besides a multi-process implementation, which gives users more choices.
Suppose we need to compare the accuracy of the TrueDiv operation under the SecureNN and Helix protocols. Without multitasking, we would write a case, run it using the SecureNN protocol, then modify it to run again using the Helix protocol, and compare the results. With multitasking support, we can write the case in a simple but elegant way (this is just a simple test requirement, which can be extended to more specific business scenarios).
This example program with simple requirements is written using multitasking concurrency. The code is as follows:
import concurrent.futures
import numpy as np
import tensorflow as tf
import latticex.rosetta as rtt
def multi_task_fw(funcs):
task_id = 1
all_task = []
try:
with concurrent.futures.ThreadPoolExecutor() as executor:
for unit_func in funcs:
all_task.append(executor.submit(unit_func, str(task_id)))
task_id += 1
concurrent.futures.wait(all_task, return_when=concurrent.futures.ALL_COMPLETED)
except Exception as e:
print(str(e))
def run_truediv_op(protocol, task_id, x_init, y_init):
local_g = tf.Graph()
with local_g.as_default():
X = tf.Variable(x_init)
Y = tf.Variable(y_init)
Z = tf.truediv(X, Y)
rv_Z = rtt.SecureReveal(Z)
init = tf.compat.v1.global_variables_initializer()
try:
rtt.activate(protocol, task_id=task_id) # Add the task_id parameter
with tf.Session(task_id=task_id) as sess: # Add the task_id parameter
sess.run(init)
real_Z = sess.run(rv_Z)
print("The result of the truediv calculation using {0} is: {1}".format(protocol, real_Z))
rtt.deactivate(task_id=task_id)
except Exception as e:
print(str(e))
def Snn_Div(task_id):
return run_trurediv_op("SecureNN", task_id,
[1.1, 1200.5, -1.1, -23489.56],
[102.2, 812435.6, 0.95, 0.1234])
def Helix_Div(task_id):
return run_trurediv_op("Helix", task_id,
[1.1, 1200.5, -1.1, -23489.56],
[102.2, 812435.6, 0.95, 0.1234])
# run cases
multi_task_fw([Snn_Div, Helix_Div])
The output of the above code is as follows (only the calculation result logs are kept; other logs are removed):
The result of the truediv calculation using Helix is: [b'0.010742' b'0.001465' b'-1.157837' b'-190521.267212']
The result of the truediv calculation using SecureNN is: [b'0.010742' b'0.001465' b'-1.157715' b'-190521.267212']
The above code for multitasking support differs from the code before Rosetta v1.0.0
in only two places.
- The rtt.activate interface adds a task_id parameter to bind a specific privacy protocol to a task, in order to support multitasking. The task_id parameter is optional; if it is not set, only single tasks are supported, maintaining compatibility with previous versions.
- The SecureSession constructor adds a task_id parameter to bind a specific session to a task, so that when the session later executes the graph, the operation layer can find and execute the specific privacy protocol based on task_id. The task_id parameter is optional; if it is not set, only single tasks are supported, maintaining compatibility with previous versions.

Note: the above code uses tf.Session rather than rtt.SecureSession, because tf.Session is statically overridden by rtt.SecureSession, and rtt.SecureSession is derived from tf.Session with the extension parameter task_id. (See the implementation of rtt.SecureSession for more.)
That's all. Now you have fully mastered the usage of Rosetta; go find a real problem to play with. Welcome!
The dataset comes from the source referenced here.
We performed simple preprocessing to obtain the files used above; they live under dsets/, with the following directory structure:
dsets/
├── ALL
│ ├── cls_test_x.csv
│ ├── cls_test_y.csv
│ ├── cls_train_x.csv
│ ├── cls_train_y.csv
│ ├── reg_test_x.csv
│ ├── reg_test_y.csv
│ ├── reg_train_x.csv
│ ├── reg_train_y.csv
│ ├── mnist_test_x.csv
│ └── mnist_test_y.csv
├── P0
│ ├── cls_test_x.csv
│ ├── cls_test_y.csv
│ ├── cls_train_x.csv
│ ├── cls_train_y.csv
│ ├── reg_test_x.csv
│ ├── reg_train_x.csv
│ └── mnist_train_x.csv
├── P1
│ ├── cls_test_x.csv
│ ├── cls_train_x.csv
│ ├── reg_test_x.csv
│ ├── reg_test_y.csv
│ ├── reg_train_x.csv
│ ├── reg_train_y.csv
│ └── mnist_train_x.csv
└── P2
| Path | Description |
| --- | --- |
| ALL | Raw data of the dataset |
| P* | The private data owned by each node |
| cls* | Binary-classification dataset, used for Logistic Regression |
| reg* | Regression dataset, used for Linear Regression |
| *train | Dataset used for training |
| *test | Dataset used for prediction |
| *x | Samples (features) |
| *y | Labels |
Description:
For comparison with the plaintext (TensorFlow) version, we split the original dataset vertically into two parts: one as the private data of P0, and the other as the private data of P1.
- The data under ALL is used for TensorFlow version.
- The private data of each node of P0/P1 is stored on each node.
- P2 has no data.
- The label for Logistic Regression is owned by P0, and the label for Linear Regression is owned by P1.