
Implement basic decision tree model functionality with SPU #212

Closed
Candicepan opened this issue Jun 7, 2023 · 13 comments · Fixed by #367
Labels: enhancement (New feature or request), never-stale, OSCP (SecretFlow Open Source Contribution Plan)

Comments

Candicepan (Contributor) commented Jun 7, 2023

This issue is a task of Phase 2 of the SecretFlow Open Source Contribution Plan (SF OSCP); community developers are welcome to join the effort~
If you would like to claim this task but have not yet registered, please complete registration first~

Task description

  • Task name: implement basic decision tree model functionality with SPU
  • Technical direction: SPU/SML
  • Difficulty: advanced 🌟🌟

Detailed requirements:

  • Security: reveal as little as possible
  • Functionality: implement the basic features of the algorithm, including:
    • basic training and prediction at a minimum
    • at least one criterion, e.g. gini (see the sketch after this list)
    • some simple pre-pruning mechanism or other means of preventing overfitting
  • Convergence: include experimental results from the simulator that demonstrate convergence
  • Code style: Python code must be formatted with black + isort (the CI pipeline includes a style-check gate)
  • Submission: reference this issue and submit the code to https://github.com/secretflow/spu/tree/main/sml
  • Special notes: if a feature has special requirements, e.g. FM128 or extra fxp precision, state them explicitly in the comments/documentation
  • Main difficulty: the task consists almost entirely of comparison operations, and under secret sharing each child node theoretically still carries O(n) data, which can make it very slow; beyond implementing the functionality, try to improve efficiency as much as possible
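For concreteness, the gini criterion named above can be computed from per-class counts; this is a minimal illustrative sketch (the function name and shapes are my assumptions, not part of the task spec):

```python
import jax.numpy as jnp

def gini_impurity(counts):
    """Gini impurity 1 - sum(p_k^2) from per-class sample counts.

    counts: hypothetical (n_labels,) array of counts in one partition.
    """
    total = jnp.sum(counts)
    p = counts / jnp.maximum(total, 1)  # guard against an empty partition
    return 1.0 - jnp.sum(p**2)
```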

Skill requirements:

  • Familiar with classic machine learning algorithms
  • Familiar with JAX or NumPy; able to implement the algorithm in NumPy

Instructions:

Candicepan converted this from a draft issue Jun 7, 2023
Candicepan added the enhancement (New feature or request) and OSCP (SecretFlow Open Source Contribution Plan) labels Jun 7, 2023
shazi4399 commented

Give it to me

Candicepan moved this from Needs Triage to In Progress in OSCP Phase 1 Jun 8, 2023
Candicepan moved this from In Progress to Needs Triage in OSCP Phase 1 Jul 14, 2023
Candicepan (Contributor, Author) commented

After discussion, this task has been returned and is now open for claiming again; feel free to pick it up~

tarantula-leo (Contributor) commented

Back on the LR task I asked and was told that "condition checks can only be implemented via conversion to plaintext"; if that also holds for XGB, it can't be implemented with simulation, right?

deadlywing (Contributor) commented

> Back on the LR task I asked and was told that "condition checks can only be implemented via conversion to plaintext"; if that also holds for XGB, it can't be implemented with simulation, right?

Tree-based algorithms are indeed special. A purely secret-shared implementation will foreseeably be very slow, but as I understand it, it can and should be implemented (by using 0/1 indicators to mark node membership, as sketched below).

To improve performance you have to reveal some information, at which point security can no longer be fully guaranteed. In principle we can accept such implementations as long as there is no obvious security risk, but the developer must clearly document all information leaked during training/prediction.
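To make the 0/1-indicator idea concrete, here is a minimal sketch (my own illustration; the helper name and shapes are hypothetical): node membership stays a secret 0/1 vector, and per-node label counts are computed over all n samples, so membership is never revealed:

```python
import jax.numpy as jnp

def masked_label_counts(y_onehot, indicator):
    """Per-label counts for one node without revealing which samples it holds.

    y_onehot:  (n_samples, n_labels) one-hot labels (secret).
    indicator: (n_samples,) secret 0/1 membership vector for the node.
    """
    return jnp.sum(y_onehot * indicator[:, jnp.newaxis], axis=0)
```

The price of obliviousness is visible here: every node touches all n samples, which matches the O(n)-per-node cost noted in the task description.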

deadlywing (Contributor) commented

If information is revealed midway, the implementation indeed cannot be tested directly with simulation; you need something like SecretFlow's distributed API. SPU implements a SecretFlow-like device mechanism under spu.utils.distributed, which can be used for testing.

Candicepan moved this to Needs Triage in OSCP Phase 2 Jul 21, 2023

github-actions commented

Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.

ElleryQu (Contributor) commented

Give it to me

deadlywing (Contributor) commented

> Give it to me

Would you mind describing your implementation in several sentences?

Thanks

ElleryQu (Contributor) commented

> Would you mind describing your implementation in several sentences?

  1. Implement the decision tree in JAX.
  2. Use an indicator-based scheme for secure training; I may need to hand-write the indicator and sorting protocols (see the sketch below).
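For the sorting part, the standard data-independent building block is a compare-exchange step as used in sorting networks; a minimal sketch (hypothetical helper, my own illustration, not the final protocol):

```python
import jax.numpy as jnp

def compare_exchange(a, b):
    """One oblivious compare-exchange: returns (min, max) element-wise.

    The comparison result stays a secret 0/1 value, so control flow and
    memory access pattern are independent of the data.
    """
    swap = jnp.greater(a, b).astype(a.dtype)  # secret 0/1 indicator
    lo = swap * b + (1 - swap) * a
    hi = swap * a + (1 - swap) * b
    return lo, hi
```

Composed into, e.g., a bitonic network, such steps sort without revealing the permutation.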

@Candicepan Candicepan moved this from Needs Triage to In Progress in OSCP Phase 2 Oct 7, 2023
ElleryQu (Contributor) commented Oct 9, 2023

Implemented the algorithm based on GTree, with some adjustments, and extended it to support multi-class labels.
code:

```python
# sml/tree/tree.py
import jax.numpy as jnp
'''
The protocols of GTree.
'''
def oblivious_array_access(array, index):
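    # Oblivious array access (OAA): returns array[index] without revealing
    # `index`. Comparing `index` with every position yields a one-hot matrix
    # E; multiplying by the array and summing selects the target element.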
    # (n_array)
    count_array = jnp.arange(0, array.shape[-1])
    # (n_array, n_index)
    E = jnp.equal(index, count_array[:, jnp.newaxis])
    
    # OAA basic case
    if len(array.shape) == 1:
        # (n_array, n_index)
        O = array[:, jnp.newaxis] * E                         # select shares
        zu = jnp.sum(O, axis=0)
    # OAA vectorization variant
    elif len(array.shape) == 2:
        # (n_arrays, n_array, n_index)
        O = array[:, :, jnp.newaxis] * E[jnp.newaxis, :, :]   # select shares
        zu = jnp.sum(O, axis=1)
    else:
        raise "Not implemented."
    return zu

def oaa_elementwise(array, index_array):
    assert index_array.shape[0] == array.shape[0], "n_arrays must be equal to n_index."
    count_array = jnp.arange(0, array.shape[-1])
    # (n_array, n_index)
    E = jnp.equal(index_array[:, jnp.newaxis], count_array)
    if len(array.shape) == 2:
        O = array * E
        zu = jnp.sum(O, axis=1)
    else:
        raise "Not implemented."
    return zu

# def oblivious_learning(X, y, T, F, M, Cn, h):
def oblivious_learning(X, y, T, F, M, h, Cn, n_labels):
    ''' Partition the data and count the number of data samples.
    
    params:
        D:  data samples, split into X and y.  X: (n_samples, n_features), y: (n_samples, 1).
        T:  tree structure representing split features. (total_nodes)
        F:  tree structure representing node types. (total_nodes)
            0 for internal, 1 for leaf, 2 for dummy.
        M:  which leaf node D[i] belongs to (for level h-1).  (n_samples)
        Cn: statistical information of the data samples. (n_leaves, n_labels+1, 2*n_features)
        h:  int, current depth of the tree.
        n_labels: the number of label classes.
    '''
    # lines 1-5: partition the data into new leaves.
    n_d, n_f = X.shape
    n_h = 2 ** h
    if h != 0:
        Tval = oaa(T, M)
        Dval = oaae(X, Tval)
        M = 2 * M + Dval + 1
        
    # (n_leaves)
    LCidx = jnp.arange(0, n_h)
    isLeaf = jnp.equal(F[n_h-1:2*n_h-1], jnp.ones(n_h))
    # (n_samples, n_leaves)
    LCF = jnp.equal(M[:, jnp.newaxis] - n_h + 1, LCidx)
    LCF = LCF * isLeaf
    # (n_samples, n_leaves, n_labels + 1, 2 * n_features)
    Cd = jnp.zeros((n_d, n_h, n_labels + 1, 2 * n_f))
    Cd = Cd.at[:,:,0,0::2].set(jnp.tile((1 - X) [:,jnp.newaxis,:], (1, n_h, 1)))
    Cd = Cd.at[:,:,0,1::2].set(jnp.tile((X)     [:,jnp.newaxis,:], (1, n_h, 1)))
    for i in range(n_labels):
        Cd = Cd.at[:,:,i+1,0::2].set(jnp.tile(((1 - X) * (i == y)[:,jnp.newaxis])[:,jnp.newaxis,:], (1, n_h, 1)))
        Cd = Cd.at[:,:,i+1,1::2].set(jnp.tile(((X)     * (i == y)[:,jnp.newaxis])[:,jnp.newaxis,:], (1, n_h, 1)))
    Cd = Cd * LCF[:,:,jnp.newaxis,jnp.newaxis]
    # (n_leaves, n_labels+1, 2*n_features)
    new_Cn = jnp.sum(Cd, axis=0)
    
    if h != 0:
        Cn = Cn.repeat(2, axis=0)
    new_Cn = new_Cn[:,:,:] + Cn[:,:,:] * (1 - isLeaf[:,jnp.newaxis,jnp.newaxis])
    
    return new_Cn, M
    
def oblivious_heuristic_computation(Cn, gamma, F, h, n_labels):
    ''' Compute the Gini index, find the best feature, and update F.
    
    params:
        Cn:         statistical information of the data samples. (n_leaves, n_labels+1, 2*n_features)
        gamma:      gamma[n][i] indicates whether feature si has been assigned at node n. (n_leaves, n_features)
        F:          tree structure representing node types. (total_nodes)
                    0 for internal, 1 for leaf, 2 for dummy.
        h:          int, current depth of the tree.
        n_labels:   int, number of label classes.
    '''
    n_leaves = Cn.shape[0]
    n_features = gamma.shape[1]
    Ds0 = Cn[:,0,0::2]
    Ds1 = Cn[:,0,1::2]
    D = Ds0 + Ds1
    Q = D * Ds0 * Ds1
    P = jnp.zeros(gamma.shape)
    for i in range(n_labels):
        P = P - Ds1 * (Cn[:,i+1,0::2] ** 2) - Ds0 * (Cn[:,i+1,1::2] ** 2)
    gini = Q / (Q + P + 1)
    gini = gini * gamma
    # (n_leaves)
    SD = jnp.argmax(gini, axis=1)
    index = jnp.arange(0, n_features)
    gamma = gamma * jnp.not_equal(index[jnp.newaxis, :], SD[:, jnp.newaxis])
    new_gamma = jnp.zeros((n_leaves * 2, n_features))
    new_gamma = new_gamma.at[0::2, :].set(gamma)
    new_gamma = new_gamma.at[1::2, :].set(gamma)
    
    psi = jnp.zeros((n_leaves, n_labels))
    for i in range(n_labels):
        psi = psi.at[:,i].set(Cn[:,i+1,0] + Cn[:,i+1,1])
    total = jnp.sum(psi, axis=1)
    psi = total[:, jnp.newaxis] - psi
    psi = jnp.prod(psi, axis=1)
    F = F.at[2**h-1:2**(h+1)-1].set(jnp.equal(psi * F[2**h-1:2**(h+1)-1], 0))
    F = F.at[2**(h+1)-1:2**(h+2)-1:2].set(2-jnp.equal(F[2**h-1:2**(h+1)-1], 0))
    F = F.at[2**(h+1):2**(h+2)-1:2].set(F[2**(h+1)-1:2**(h+2)-1:2])
    return SD, new_gamma, F
    

def oblivious_node_split(SD, T, F, Cn, h, max_depth):
    ''' Convert each node into an internal node and generate new leaves at the next level.
    params:
        SD: the best feature at the current level. (n_leaves)
        T:  tree structure representing split features. (total_nodes)
        F:  tree structure representing node types. (total_nodes)
            0 for internal, 1 for leaf, 2 for dummy.
        Cn: statistical information of the data samples. (n_leaves, n_labels+1, 2*n_features)
        h:  current level.
        max_depth: max depth of the tree.
    '''
        
    T = T.at[2**h-1:2**(h+1)-1].set(SD)
    # return T
    return T, Cn


def oblivious_DT_training(X, y, max_depth, n_labels):
    n_samples, n_features = X.shape
    T = jnp.zeros((2 ** (max_depth + 1) - 1))
    F = jnp.ones((2 ** max_depth - 1))
    M = jnp.zeros(n_samples)
    gamma = jnp.ones((1, n_features))
    Cn = jnp.zeros((1, n_labels + 1, 2 * n_features))
    
    h = 0
    while h < max_depth:
        Cn, M = ol(X, y, T, F, M, h, Cn, n_labels)
        SD, gamma, F = ohc(Cn, gamma, F, h, n_labels)
        T, Cn = ons(SD, T, F, Cn, h, max_depth)

        h += 1
    
    n_leaves = 2 ** h
    psi = jnp.zeros((n_leaves, n_labels))
    for i in range(2 ** (h-1)):
        t1 = oaa(Cn[i, 1:], 2*SD[i:i+1]).squeeze()
        t2 = oaa(Cn[i, 1:], 2*SD[i:i+1]+1).squeeze()
        psi = psi.at[2*i, :].set(t1)
        psi = psi.at[2*i+1, :].set(t2)
        # psi = psi.at[0::2, i].set(Cn[0::2, i+1, 0])
        # psi = psi.at[1::2, i].set(Cn[1::2, i+1, 1])
    T = T.at[n_leaves - 1:].set(jnp.argmax(psi, axis=1))
    return T, F

def oblivious_DT_inference(X, T, max_height):
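    # Obliviously walk the tree level by level: fetch each sample's current
    # split feature via OAA, read the corresponding feature bit via OAAE, and
    # step to the left/right child without revealing the path taken.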
    n_samples, n_features = X.shape
    Tidx = jnp.zeros((n_samples))
    i = 0
    while i < max_height:
        Tval = oaa(T, Tidx)
        Dval = oaae(X, Tval)
        Tidx = Tidx * 2 + Dval + 1
        i += 1
    Tval = oaa(T, Tidx)
    return Tval

oaa = oblivious_array_access
oaae = oaa_elementwise
ol = oblivious_learning
ohc = oblivious_heuristic_computation
ons = oblivious_node_split
odtt = oblivious_DT_training 
odti = oblivious_DT_inference


# sml/tree/emulations/tree_emul.py
import time

import jax.numpy as jnp
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

import sml.utils.emulation as emulation
from sml.tree.tree import odtt, odti

MAX_DEPTH = 2
CONFIG_FILE = "3pc.json"

def emul_tree(mode=emulation.Mode.MULTIPROCESS):
    def proc_wrapper(max_depth=2, n_labels=3):
        def proc(X, y):
            T, F = odtt(X, y, max_depth=max_depth, n_labels=n_labels)
            result = odti(X, T, max_depth)
            return result
        return proc

    def load_data():
        iris = load_iris()
        iris_data, iris_label = jnp.array(iris.data), jnp.array(iris.target)
        # sorted_features: n_samples * n_features_in
        n_samples, n_features_in = iris_data.shape
        n_labels = len(jnp.unique(iris_label))
        sorted_features = jnp.sort(iris_data, axis=0)
        new_threshold = (sorted_features[:-1, :] + sorted_features[1:, :]) / 2
        # binarize each feature against every candidate threshold (0/1 ints)
        new_features = jnp.greater_equal(
            iris_data[:, :], new_threshold[:, jnp.newaxis, :]
        ).astype(jnp.int32)
        new_features = new_features.transpose([1,0,2]).reshape(n_samples, -1)

        X, y = new_features, iris_label
        return X, y

    try:
        # bandwidth and latency only work for docker mode
        emulator = emulation.Emulator(
            CONFIG_FILE, mode, bandwidth=300, latency=20
        )
        emulator.up()

        # load mock data
        X, y = load_data()
        n_samples = y.shape[0]
        n_labels = jnp.unique(y).shape[0]

        # compare with sklearn
        clf = DecisionTreeClassifier(max_depth=MAX_DEPTH, criterion='gini', splitter='best', random_state=None)        
        start = time.time()
        clf = clf.fit(X, y)
        score_plain = clf.score(X, y)
        end = time.time()
        print(f"Running time in SKlearn: {end - start:.2f}s")

        # mark these data to be protected in SPU
        X_spu, y_spu = emulator.seal(X, y)

        # run
        proc = proc_wrapper(MAX_DEPTH, n_labels)
        start = time.time()
        result = emulator.run(proc)(X_spu, y_spu)
        end = time.time()
        score_encrypted = jnp.sum(result == y) / n_samples
        print(f"Running time in SPU: {end - start:.2f}s")

        # print acc
        print(f"Accuracy in SKlearn: {score_plain:.2f}")
        print(f"Accuracy in SPU: {score_encrypted:.2f}")

    finally:
        emulator.down()


if __name__ == "__main__":
    emul_tree(emulation.Mode.MULTIPROCESS)

```

My remaining questions:

  1. Do the test cases need to cover different datasets? If so, is there a designated dataset?
  2. How should I write the bazel files and submit the code?

deadlywing (Contributor) commented

> Implemented the algorithm based on GTree, with some adjustments, and extended it to support multi-class labels. code: (full implementation and questions quoted above)

Thanks for the implementation.

  1. In general the unit tests should cover at least one dataset; you can use one bundled with sklearn, such as load_breast_cancer.
  2. For the bazel files you can follow the other algorithms in sml; to submit code, the usual flow is to fork/clone spu and then open a PR from your own repo (see https://github.com/secretflow/spu/blob/main/sml/development.md#contributing-code)
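For reference, unit tests in sml usually drive the same function through SPU's simulator; below is a minimal sketch along those lines (the file path, protocol/field choices, feature thresholding, and accuracy bound are my assumptions, not a prescription):

```python
# sml/tree/tests/tree_test.py (hypothetical path)
import unittest

import jax.numpy as jnp
from sklearn.datasets import load_breast_cancer

import spu
import spu.utils.simulation as spsim
from sml.tree.tree import odtt, odti


class TreeTest(unittest.TestCase):
    def test_tree(self):
        # 3-party ABY3 simulator over a 64-bit ring (assumed settings)
        sim = spsim.Simulator.simple(3, spu.ProtocolKind.ABY3, spu.FieldType.FM64)

        def proc(X, y):
            T, F = odtt(X, y, max_depth=2, n_labels=2)
            return odti(X, T, 2)

        data = load_breast_cancer()
        # the implementation expects binary features; a crude mean threshold
        X = (data.data > data.data.mean(axis=0)).astype(jnp.int32)
        y = jnp.array(data.target)

        result = spsim.sim_jax(sim, proc)(X, y)
        acc = jnp.sum(result == y) / y.shape[0]
        self.assertGreater(acc, 0.8)  # illustrative bound, not a measured result


if __name__ == "__main__":
    unittest.main()
```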

ElleryQu (Contributor) commented

Hi, I have opened a PR; it has not been reviewed yet.

deadlywing (Contributor) commented

> Hi, I have opened a PR; it has not been reviewed yet.

Sorry, I've been a bit busy lately; I'll try to finish the review this week~

Candicepan moved this from In Progress to In Review in OSCP Phase 2 Oct 17, 2023
deadlywing pushed a commit that referenced this issue Oct 24, 2023
# Pull Request

## What problem does this PR solve?

Issue Number: Fixed #212 

## Possible side effects?

- Performance:

**test**:

```
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[2023-10-13 18:32:00.502] [info] [thread_pool.cc:30] Create a fixed thread pool with size 19
Accuracy in SKlearn: 0.96
Accuracy in SPU: 0.96
.
----------------------------------------------------------------------
Ran 1 test in 78.026s

OK
```

**emulation**:

```
Running time in SPU: 77.37s
Accuracy in SKlearn: 0.96
Accuracy in SPU: 0.96
[2023-10-13 18:35:38,091] Shutdown multiprocess cluster...
```

- Backward compatibility:
github-project-automation bot moved this from In Review to Done in OSCP Phase 2 Oct 24, 2023