docs: dataframe schemas, deploy training, pre-trained tree import (#735)
andrei-stoian-zama authored Jun 20, 2024
1 parent 5a79e4f commit 79d4a60
Showing 7 changed files with 61 additions and 12 deletions.
18 changes: 18 additions & 0 deletions docs/built-in-models/encrypted_dataframe.md
@@ -45,6 +45,24 @@ df_decrypted = client.decrypt_to_pandas(df_encrypted)
- **Quantized Float**: Floating-point numbers are quantized to integers within the supported range. This is achieved by computing a scale and zero point for each column, which are used to map the floating-point numbers to the quantized integer space.
- **String Enum**: String columns are mapped to integers starting from 1. This mapping is stored and later used for de-quantization. If the number of unique strings exceeds 15, a `ValueError` is raised.
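As a rough illustration of these two encodings, here is a minimal plain-Python sketch. It is not the Concrete ML implementation, which also handles encryption and edge cases, and the `n_bits` parameter is only an assumption for the example:

```python
def quantize_float_column(values, n_bits=4):
    """Map floats onto an integer grid using a per-column scale and zero point."""
    v_min, v_max = min(values), max(values)
    zero_point = v_min
    scale = (v_max - v_min) / (2**n_bits - 1)
    return [round((v - zero_point) / scale) for v in values]

def encode_string_column(values):
    """Map unique strings to integers starting from 1; at most 15 are allowed."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    if len(mapping) > 15:
        raise ValueError("String columns support at most 15 unique values")
    return [mapping[v] for v in values], mapping
```

De-quantization on the client side inverts these mappings after decryption, using the stored scale, zero point, and string mapping.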

### Using a user-defined schema

Before encryption, the data is pre-processed: **string enums** are mapped to integers and floating-point values are quantized. By default, these mappings are computed automatically. However, when two clients encrypt their data separately, the automatic mappings may differ, for example because some values are missing from one client's DataFrame. In that case, the mismatched column cannot be used as a key when merging encrypted DataFrames.

To avoid this, the encrypted DataFrame supports user-defined mappings. These schemas are defined as a dictionary whose keys are column names and whose values contain metadata about the column. Supported column metadata are:

- string columns: mapping between string values and integers.
- float columns: the min/max range that the column values lie in.

<!--pytest-codeblocks:cont-->

```python
schema = {
    "string_column": {"abc": 1, "bcd": 2},
    "float_column": {"min": 0.1, "max": 0.5},
}
```
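With a shared schema, two clients that encrypt their data separately produce consistent encodings. The following plain-Python sketch illustrates the idea only; in Concrete ML the schema is applied internally during encryption:

```python
schema = {
    "string_column": {"abc": 1, "bcd": 2},
    "float_column": {"min": 0.1, "max": 0.5},
}

def apply_string_schema(values, mapping):
    """Encode strings with the user-defined mapping instead of an automatic one."""
    return [mapping[v] for v in values]

# Client A's column never contains "bcd"; client B's starts with it.
client_a = apply_string_schema(["abc", "abc"], schema["string_column"])
client_b = apply_string_schema(["bcd", "abc"], schema["string_column"])

# "abc" still encodes to the same integer on both sides, so the column
# remains usable as a key when merging the encrypted DataFrames.
```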

## Supported operations

Encrypted DataFrame is designed to support a subset of operations that are available for pandas DataFrames. For now, only the `merge` operation is supported. More operations will be added in the future releases.
12 changes: 9 additions & 3 deletions docs/built-in-models/linear.md
@@ -1,10 +1,10 @@
# Linear models

This document introduces some [scikit-learn](https://scikit-learn.org/stable/)'s linear models for `regression` and `classification` that Concrete ML provides.
This page explains Concrete ML linear models for both classification and regression. These models are based on [scikit-learn](https://scikit-learn.org/stable/) linear models.

## Supported models
## Supported models for encrypted inference

The use of the following models in FHE is very similar to the use of scikit-learn's [API](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model). They are also compatible with some of scikit-learn's main workflows, such as `Pipeline()` and `GridSearch()`.
The following models are supported for training on clear data and predicting on encrypted data. Their API is similar to the one of [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model). These models are also compatible with some of scikit-learn's main workflows, such as `Pipeline()` and `GridSearch()`.

| Concrete ML | scikit-learn |
| :--------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------: |
Expand All @@ -20,6 +20,12 @@ The use of the following models in FHE is very similar to the use of scikit-lear
| [ElasticNet](../references/api/concrete.ml.sklearn.linear_model.md#class-elasticnet) | [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) |
| [SGDRegressor](../references/api/concrete.ml.sklearn.linear_model.md#class-sgdregressor) | [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) |

## Supported models for encrypted training

In addition to predicting on encrypted data, the following models support training on encrypted data.

| Concrete ML | scikit-learn |
| :--------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------: |
| [SGDClassifier](../references/api/concrete.ml.sklearn.linear_model.md#class-sgdclassifier) | [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) |

## Quantization parameters

The `n_bits` parameter controls the bit-width of the inputs and weights of the linear models. Linear models do not use table lookups and thus allow weights and inputs to be high-precision integers.
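To illustrate why the absence of table lookups lets linear models work with high-precision integers, the following sketch (plain Python, not Concrete ML internals) quantizes weights and inputs and evaluates a dot product entirely with integer multiplies and adds:

```python
def quantize_symmetric(values, n_bits):
    """Quantize floats to signed n_bits-bit integers with a symmetric scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / (2 ** (n_bits - 1) - 1)
    return [round(v / scale) for v in values], scale

weights = [0.5, -0.25, 1.0]
inputs = [1.0, 2.0, -1.0]

q_w, s_w = quantize_symmetric(weights, n_bits=8)
q_x, s_x = quantize_symmetric(inputs, n_bits=8)

# Integer-only accumulation: no table lookup is needed, so the
# accumulator can hold high-precision integer values.
acc = sum(w * x for w, x in zip(q_w, q_x))

# Rescaling recovers an approximation of the floating-point result.
approx = acc * s_w * s_x
```

Higher `n_bits` values shrink the gap between `approx` and the exact floating-point dot product.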
29 changes: 22 additions & 7 deletions docs/built-in-models/training.md
@@ -1,10 +1,25 @@
# Encrypted training

This document explains how to train [SGD Logistic Regression](../references/api/concrete.ml.sklearn.linear_model.md#class-sgdclassifier) on encrypted data. The [logistic regression training](../advanced_examples/LogisticRegressionTraining.ipynb) example shows this feature in action.
This document explains how to train [SGD Logistic Regression](../references/api/concrete.ml.sklearn.linear_model.md#class-sgdclassifier) on encrypted data.

Training on encrypted data is done through an FHE program that is generated by Concrete ML, based on the characteristics of the data that are given to the `fit` function. Once the FHE program associated with the `SGDClassifier` object has fit the encrypted data, the resulting model is specific to that data's distribution and dimensionality.

When deploying encrypted training services, you need to consider the type of data that future users of your services will train on:

- The distribution of the data should match the one the FHE program was designed for, to achieve good accuracy.
- The dimensionality of the data needs to match, since the deployed FHE programs are compiled for a fixed number of dimensions.

See the [deployment](#deployment) section for more details.

{% hint style="info" %}
Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import models trained through federated learning using 3rd party tools. All model types are supported - linear, tree-based and neural networks - through the [`from_sklearn_model` function](linear.md#pre-trained-models) and the [`compile_torch_model`](../deep-learning/torch_support.md) function.
{% endhint %}

## Example

This example shows how to instantiate a logistic regression model that trains on encrypted data:
The [logistic regression training](../advanced_examples/LogisticRegressionTraining.ipynb) example shows logistic regression training on encrypted data in action.

The following snippet shows how to instantiate a logistic regression model that trains on encrypted data:

```python
from concrete.ml.sklearn import SGDClassifier

model = SGDClassifier(
    # ... constructor arguments elided in this diff; the `fit_encrypted`
    # and `parameters_range` parameters are discussed below ...
)
```

To activate encrypted training, simply set `fit_encrypted=True` in the constructor. If this value is not set, training is performed on clear data using `scikit-learn` gradient descent.
To activate encrypted training, simply set `fit_encrypted=True` in the constructor. When this value is set, Concrete ML generates an FHE program which, when called through the `fit` function, processes encrypted training data, labels, and initial weights, and outputs trained model weights. If this value is not set, training is performed on clear data using `scikit-learn` gradient descent.

Next, to perform the training on encrypted data, call the `fit` function with the `fhe="execute"` argument:

Expand All @@ -28,10 +43,6 @@ Next, to perform the training on encrypted data, call the `fit` function with th
model.fit(X_binary, y_binary, fhe="execute")
```

{% hint style="info" %}
Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import linear models, including logistic regression, that are trained using federated learning using the [`from_sklearn` function](linear.md#pre-trained-models).
{% endhint %}

## Training configuration

The `max_iter` parameter controls the number of batches that are processed by the training algorithm.
Expand All @@ -43,3 +54,7 @@ The `parameters_range` parameter determines the initialization of the coefficien
The trainable logistic model uses Stochastic Gradient Descent (SGD) and quantizes the data, weights, gradients and the error measure. It currently supports training 6-bit models, including both the coefficients and the bias.

The `SGDClassifier` does not currently support training models with other bit-width values. The execution time to train a model is proportional to the number of features and the number of training examples in the batch. The `SGDClassifier` training does not currently support client/server deployment for training.
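The quantized SGD update described above can be sketched in plain Python. This is an illustration of the idea only; the actual FHE training program operates on encrypted values, and the learning rate shown is an assumption:

```python
import math

def quantize_6bit(values):
    """Round values onto a 6-bit signed grid (range [-32, 31] times a scale)."""
    max_abs = max(max(abs(v) for v in values), 1e-9)
    scale = max_abs / 31
    return [round(v / scale) * scale for v in values]

def sgd_logistic_step(weights, bias, batch_x, batch_y, lr=0.1):
    """One batch of SGD on the logistic loss, re-quantizing the result."""
    n = len(batch_x)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in zip(batch_x, batch_y):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid prediction
        for i, xi in enumerate(x):
            grad_w[i] += (p - y) * xi / n
        grad_b += (p - y) / n
    new_w = [w - lr * g for w, g in zip(weights, grad_w)]
    new_b = bias - lr * grad_b
    # Both the coefficients and the bias are kept at 6-bit precision.
    return quantize_6bit(new_w), quantize_6bit([new_b])[0]
```

Each call processes one batch; during `fit`, `max_iter` such batches are processed.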

## Deployment

Once you have tested an `SGDClassifier` that trains on encrypted data, you can build an FHE training service by deploying the FHE training program of the `SGDClassifier`. See the [Production Deployment](../guides/client_server.md) page for more details on how to use the Concrete ML deployment utility classes. To deploy an FHE training program, you must pass the `mode='training'` parameter to the `FHEModelDev` class.
4 changes: 4 additions & 0 deletions docs/built-in-models/tree.md
@@ -26,6 +26,10 @@ For a formal explanation of the mechanisms that enable FHE-compatible decision t
Increasing the maximum depth parameter of decision trees and tree-ensemble models strongly increases the number of nodes in the trees. Therefore, we recommend using the XGBoost models, which achieve better performance with lower depth.
{% endhint %}

## Pre-trained models

You can convert an already trained scikit-learn tree-based model to a Concrete ML one by using the [`from_sklearn_model`](../references/api/concrete.ml.sklearn.base.md#classmethod-from_sklearn_model) method.

## Example

Here's an example of how to use this model in FHE on a popular data-set using some of scikit-learn's pre-processing tools. You can find a more complete example in the [XGBClassifier notebook](../tutorials/ml_examples.md).
2 changes: 1 addition & 1 deletion docs/getting-started/README.md
@@ -12,7 +12,7 @@ Concrete ML is an open source, privacy-preserving, machine learning framework ba

- **Training on encrypted data**: FHE is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. Learn more about FHE in [this introduction](https://www.zama.ai/post/tfhe-deep-dive-part-1) or join the [FHE.org](https://fhe.org) community.

- **Federated learning**: Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import linear models, including logistic regression, that are trained using federated learning using the [`from_sklearn` function](../built-in-models/linear.md#pre-trained-models).
- **Federated learning**: Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import all model types trained through federated learning: linear and tree-based models through the [`from_sklearn_model` function](../built-in-models/linear.md#pre-trained-models), and neural networks through the [`compile_torch_model`](../deep-learning/torch_support.md) function.

## Example usage

2 changes: 1 addition & 1 deletion docs/guides/client_server.md
@@ -19,7 +19,7 @@ The compiled model (`server.zip`) is deployed to a server and the cryptographic

The `FHEModelDev`, `FHEModelClient`, and `FHEModelServer` classes in the `concrete.ml.deployment` module make it easy to deploy and interact between the client and server:

- **`FHEModelDev`**: This class is used during the development phase to prepare and save the model artifacts (`client.zip` and `server.zip`). It handles the serialization of the underlying FHE circuit as well as the crypto-parameters used for generating the keys.
- **`FHEModelDev`**: Use the `save` method of this class during the development phase to prepare and save the model artifacts (`client.zip` and `server.zip`). This class handles the serialization of the underlying FHE circuit as well as the crypto-parameters used for generating the keys. By changing the `mode` parameter of the `save` method, you can deploy a trained model or a [training FHE program](../built-in-models/training.md).

- **`FHEModelClient`**: This class is used on the client side to generate and serialize the cryptographic keys, encrypt the data before sending it to the server, and decrypt the results received from the server. It also handles the loading of quantization parameters and pre/post-processing from `serialized_processing.json`.

6 changes: 6 additions & 0 deletions docs/tutorials/ml_examples.md
@@ -67,3 +67,9 @@ Two different configurations of the built-in, fully-connected neural networks ar
- [Regressor comparison](../advanced_examples/RegressorComparison.ipynb)

Based on three different synthetic data-sets, all the built-in classifiers are demonstrated in this notebook, showing accuracies, inference times, accumulator bit-widths, and decision boundaries.

### 7. Training on encrypted data

- [LogisticRegression training](../advanced_examples/LogisticRegressionTraining.ipynb)

This example shows how to configure a training algorithm that works on encrypted data and how to deploy it in a client/server application.
