[BUG] Job config of FedJobConfig is not generated correctly #2935
Comments
@KCC13 Thank you very much for using the job API functions, extending their usage, and raising these issues with us. After examining your use case, here are the causes of the errors you experienced:
Note:
Thank you very much for your clear reply. 🤜🤛
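@KCC13 Here's an example of how to wrap the Python function in a Python class to work around this issue:

    from typing import Any, Optional

    from torchvision.models import ResNet18_Weights
    from torchvision.models._utils import _ovewrite_named_param
    from torchvision.models.resnet import BasicBlock, ResNet


    class Resnet18(ResNet):
        def __init__(self, num_classes, weights: Optional[ResNet18_Weights] = None, progress: bool = True, **kwargs: Any):
            self.num_classes = num_classes
            weights = ResNet18_Weights.verify(weights)
            # Pass num_classes through kwargs only, so it is never given to ResNet twice.
            kwargs["num_classes"] = num_classes
            if weights is not None:
                # Raises ValueError if num_classes disagrees with the pretrained weights.
                _ovewrite_named_param(kwargs, "num_classes", len(weights.meta["categories"]))
            super().__init__(BasicBlock, [2, 2, 2, 2], **kwargs)
            if weights is not None:
                super().load_state_dict(weights.get_state_dict(progress=progress))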
Describe the bug
Hi, as the title says, the bug occurred when I tried to replace the model (SimpleNetwork) in the hello-pt example with resnet18 from torchvision. The error message showed that the config was not organized correctly and therefore could not be serialized.
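To illustrate why a config that holds a class object rather than plain data can fail to serialize, here is a minimal sketch. It assumes the job config ends up being written as JSON, which this issue does not state, and `FakeBatchNorm` is a hypothetical stand-in for a PyTorch layer class so the snippet runs without PyTorch installed.

```python
import json


class FakeBatchNorm:
    """Hypothetical stand-in for a layer class such as torch.nn.BatchNorm2d."""


def try_serialize(config):
    """Return (True, text) if json.dumps succeeds, else (False, error message)."""
    try:
        return True, json.dumps(config)
    except TypeError as exc:
        return False, str(exc)


# A plain value serializes; a class object (not an instance) does not.
ok_plain, _ = try_serialize({"num_classes": 10})
ok_class, error = try_serialize({"norm_layer": FakeBatchNorm})
print(ok_plain, ok_class, error)
```

Other serializers may fail differently, so treat this only as an illustration of the general problem.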
To Reproduce
1. Add `from torchvision.models import resnet18`, and replace `model = SimpleNetwork()` with `model = resnet18(num_classes=10)`.
2. Replace `initial_model=SimpleNetwork()` with `initial_model=resnet18(num_classes=10)` in `FedAvgJob`.
Expected behavior
The simulation should be executed correctly.
Additional context
If we look at the content of `server_app` in `_get_server_app`, as the error message indicates, several observations follow:
1. The content includes `{'norm_layer': <class 'torch.nn.modules.batchnorm.BatchNorm2d'>}`. The `norm_layer` argument of `resnet18` takes classes (subclasses of torch.nn.Module) as input, not instances. However, it seems the current FedJobConfig API cannot handle this kind of input.
2. The `num_classes` argument of `resnet18` is not the same as the default value (1000), but it is not listed in the `server_app`.
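Both observations would be consistent with a config generator that inspects the constructor signature and keeps only the parameters it can find as attributes on the object. The sketch below is a guess at that mechanism, not NVFlare's actual code: `collect_args`, `ToyNet`, `WrappedToyNet`, and `ToyNorm` are all hypothetical names, and the toy model only imitates a network that keeps its norm layer class in a private attribute while not storing `num_classes` under that name.

```python
import inspect


class ToyNorm:
    """Hypothetical stand-in for torch.nn.BatchNorm2d."""


class ToyNet:
    """Hypothetical model: stores norm_layer privately, num_classes under another name."""

    def __init__(self, num_classes=1000, norm_layer=None):
        self._norm_layer = norm_layer or ToyNorm
        self.out_features = num_classes  # not discoverable as "num_classes"


class WrappedToyNet(ToyNet):
    """Wrapper that also stores num_classes under its own name."""

    def __init__(self, num_classes=1000, norm_layer=None):
        super().__init__(num_classes, norm_layer)
        self.num_classes = num_classes


def collect_args(obj):
    """Assumed behavior: keep constructor parameters whose same-named
    (or underscore-prefixed) attribute differs from the declared default."""
    found = {}
    for name, param in inspect.signature(type(obj).__init__).parameters.items():
        if name == "self":
            continue
        for attr in (name, "_" + name):
            if hasattr(obj, attr):
                value = getattr(obj, attr)
                if value != param.default:
                    found[name] = value
                break
    return found


plain_args = collect_args(ToyNet(num_classes=10))
wrapped_args = collect_args(WrappedToyNet(num_classes=10))
print(plain_args)    # norm_layer (a class) is picked up, num_classes is missed
print(wrapped_args)  # the wrapper exposes num_classes as well
```

Under this assumption, storing `num_classes` on the instance is what makes it visible to the generator. The class-valued `norm_layer` would still need separate handling.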