
Problem with setting the number of clients #42

Open
Lyy838354973 opened this issue Sep 27, 2023 · 8 comments

Comments

@Lyy838354973

[screenshot: error message]
Hello, when the number of clients is set to 800, an error occurs. How can this be solved?

@WwZzz
Owner

WwZzz commented Sep 27, 2023

Hello, thanks for the feedback. The error seems to be caused by a client whose train_data has size 0. However, when I tried the same parameter settings:

import flgo
import flgo.benchmark.cifar10_classification as cifar
import flgo.benchmark.partition as fbp
task = 'cifar10_dir1_800'
flgo.gen_task_by_(cifar, fbp.DirichletPartitioner(num_clients=800, alpha=1.0, imbalance=0.7), task)

# run
import flgo.algorithm.fedavg as fedavg
runner = flgo.init(task, fedavg, {"gpu":0})
print(min([len(c.train_data) for c in runner.clients]))

The output was:
[screenshot: run log]
The smallest client training set had size 3, and the run completed without error.

To reproduce the same error, I increased the number of clients to 10000, which did produce training sets of size 0. The bug has two causes: first, the part of BasicPartitioner.data_imbalance_generator that generates the imbalanced dataset sizes lacks a check for zero-size training sets; second, when sampling the Dirichlet distribution, a class proportion can still be so small that a zero-size dataset is produced. The bug appears when the imbalance coefficient is large and the number of clients is high. I fixed both places and added a minvol keyword to control the minimum client dataset size under the Dirichlet partition. Finally, I tested the same Dirichlet partition with 3000 clients; the generation code is:

import flgo
import flgo.benchmark.cifar10_classification as cifar
import flgo.benchmark.partition as fbp
task = 'cifar10_dir1_3000'
flgo.gen_task_by_(cifar, fbp.DirichletPartitioner(num_clients=3000, alpha=1.0, imbalance=0.7, minvol=10), task)

# run
import flgo.algorithm.fedavg as fedavg
runner = flgo.init(task, fedavg, {"gpu":0})
print(min([len(c.train_data) for c in runner.clients]))

The result is:
[screenshot: run log]

The smallest client dataset size is 10.
I have fixed this bug in the latest version, v1.0.7, and uploaded it to GitHub and PyPI; run pip install --upgrade flgo to update.
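The minvol idea described above can be illustrated with a small standalone sketch. This is an illustrative re-implementation, not flgo's actual code, and the helper name enforce_minvol is made up for this example: after a Dirichlet partition, samples are moved from the largest shards into any shard below the minimum, so no client ends up with fewer than minvol samples.

```python
import numpy as np

def enforce_minvol(shards, minvol):
    """Move samples from the largest shards to undersized ones until
    every shard holds at least `minvol` samples (assumes this is feasible,
    i.e. total samples >= num_shards * minvol)."""
    shards = [list(s) for s in shards]
    while True:
        small = min(range(len(shards)), key=lambda i: len(shards[i]))
        if len(shards[small]) >= minvol:
            break
        big = max(range(len(shards)), key=lambda i: len(shards[i]))
        shards[small].append(shards[big].pop())
    return shards

# Partition 100 sample indices across 5 clients with Dirichlet proportions,
# then repair any shard that came out below the minimum size.
rng = np.random.default_rng(0)
props = rng.dirichlet([1.0] * 5)
cuts = (np.cumsum(props)[:-1] * 100).astype(int)
shards = [s.tolist() for s in np.split(np.arange(100), cuts)]
fixed = enforce_minvol(shards, minvol=10)
print(min(len(s) for s in fixed))  # never smaller than minvol
```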

@Lyy838354973
Author

class Server(flgo.algorithm.fedbase.BasicServer):
    def initialize(self, *args, **kwargs):
        # frequency fi
        self.fi = [1] * len(self.clients)
        self.entropy = [0] * len(self.clients)
        self.xuhao_client = [i for i in range(self.num_clients)]
        self.losses_befor = [self.clients[cid].test(self.model)['loss'] for cid in self.xuhao_client]
        self.test_metric = self.test()
        print(self.test_metric['loss'])
Hello, I can now increase the number of clients, but during initialization the loss computation seems to fail on the clients that previously had the error; with a smaller number of clients this problem does not occur.


@WwZzz
Owner

WwZzz commented Oct 3, 2023

class Server(flgo.algorithm.fedbase.BasicServer):
    def initialize(self, *args, **kwargs):
        # frequency fi
        self.fi = [1] * len(self.clients)
        self.entropy = [0] * len(self.clients)
        self.xuhao_client = [i for i in range(self.num_clients)]
        self.losses_befor = [self.clients[cid].test(self.model)['loss'] for cid in self.xuhao_client]
        self.test_metric=self.test()
        print(self.test_metric['loss'])

Hello. The KeyError is likely because the local dataset is still too small, so the local validation and test sets produced by train_holdout and local_test come out empty. For example, a dataset of size 3 with a fraction of 0.2 held out for other purposes yields an empty split. When the test function runs on an empty dataset it returns an empty dict, which causes the KeyError. A quick fix is to ensure that minimum_dataset_size * train_holdout * 0.5 >= 1.
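The holdout arithmetic above can be sketched to show why a size-3 dataset breaks. This assumes a split rule where a train_holdout fraction of each client's data is held out and half of that becomes the local test set; flgo's exact rounding may differ.

```python
def local_split_sizes(n, train_holdout=0.2):
    """Return (train, valid, test) sizes for a client with n samples,
    under the assumed holdout rule described above."""
    holdout = int(n * train_holdout)  # samples held out from training
    test = holdout // 2               # half of the holdout -> local test set
    valid = holdout - test
    return n - holdout, valid, test

print(local_split_sizes(3))   # -> (3, 0, 0): empty test set, hence the KeyError
print(local_split_sizes(10))  # -> (8, 1, 1): every split non-empty
```

With minvol=10 and train_holdout=0.2, the condition 10 * 0.2 * 0.5 >= 1 holds, so every client gets a non-empty test set.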

@Lyy838354973
Author

Hello, another question: can I obtain the model parameters of clients without transmitting their models? How would I access every client's model parameters after each round of training?

@WwZzz
Owner

WwZzz commented Oct 11, 2023


Hi. By default, clients do not keep the trained model after local training, to avoid excessive memory/GPU-memory usage. To access a model without transmitting it, save the trained model locally before packing and sending, i.e. self.local_model = model; the server can then access it via self.clients[i].local_model.
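A minimal sketch of this suggestion follows. A stand-in base class is used here so the snippet runs without flgo installed; in practice Client would subclass e.g. flgo.algorithm.fedavg.Client, where pack() is the hook called before the reply is sent to the server.

```python
import copy

class BaseClient:
    """Stand-in for an flgo-style client whose pack() builds the reply package."""
    def pack(self, model, *args, **kwargs):
        return {'model': model}

class Client(BaseClient):
    def pack(self, model, *args, **kwargs):
        # Save a private copy of the trained model before packing, so the
        # server can later read it via self.clients[i].local_model without
        # the model having been transmitted.
        self.local_model = copy.deepcopy(model)
        return super().pack(model, *args, **kwargs)

c = Client()
pkg = c.pack([1.0, 2.0])  # pretend the "model" is a list of weights
print(c.local_model)      # -> [1.0, 2.0]
```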

@Lyy838354973
Author


Can I set the amount of local data on each client?

@WwZzz
Owner

WwZzz commented Oct 14, 2023


Hi. To assign exact amounts, you would need to write a custom Partitioner yourself; here is an example:

import flgo.benchmark.mnist_classification
import flgo.benchmark.partition as fbp
import numpy as np
# 1. Define the partitioner
class MyIIDPartitioner(fbp.BasicPartitioner):
    def __init__(self, samples_per_client=[15000, 15000, 15000, 15000]):
        self.samples_per_client = samples_per_client
        self.num_clients = len(samples_per_client)

    def __call__(self, data):
        # 1.1 shuffle the indices of samples
        d_idxs = np.random.permutation(len(data))
        # 1.2 Divide all the indices into num_clients shards
        local_datas = np.split(d_idxs, np.cumsum(self.samples_per_client))[:-1]
        local_datas = [di.tolist() for di in local_datas]
        return local_datas

# 2. Specify the Partitioner in task configuration
task_config = {
    'benchmark': flgo.benchmark.mnist_classification,
    'partitioner':{
        'name':MyIIDPartitioner,
        'para':{
            'samples_per_client':[5000, 14000, 19000, 22000]
        }
    }
}
task = 'my_test_partitioner'

# 3. Test it now
flgo.gen_task(task_config, task)
import flgo.algorithm.fedavg as fedavg
runner = flgo.init(task, fedavg)
runner.run()

The generated partition looks like this:
[screenshot: partition visualization]
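The index-splitting step inside MyIIDPartitioner can be checked on its own: np.split at the cumulative sizes yields one shard per client plus a trailing shard of leftover samples, which the [:-1] drops.

```python
import numpy as np

rng = np.random.default_rng(0)
samples_per_client = [3, 5, 2]
d_idxs = rng.permutation(12)  # 12 samples in total; only 10 get assigned
# Split the shuffled indices at the cumulative sizes [3, 8, 10] and drop
# the trailing leftover shard.
shards = np.split(d_idxs, np.cumsum(samples_per_client))[:-1]
print([len(s) for s in shards])  # -> [3, 5, 2]
```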
