
How to evaluate worker performance independently in distributed training #520

Open

delucca opened this issue Feb 15, 2022 · 0 comments
delucca commented Feb 15, 2022

Hi

I'm trying to evaluate the performance of each worker independently in a cluster of multiple machines while training the same model across all of them. My goal is to record each worker's training time.

With every setup and config I try, I get the same time for all workers (probably because of gradient synchronization). So even if one of my workers is a machine that is 4x faster, it still records the same time as the slowest machine in the cluster.

Does anyone have an idea how I can do that?
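For illustration, here is a minimal sketch of the kind of measurement I'm after (this assumes PyTorch `DistributedDataParallel` as the framework; `ddp_model`, `loader`, and `optimizer` are placeholder names). The idea is to time only the local forward/backward pass, skipping DDP's built-in gradient all-reduce via `no_sync()` and performing the synchronization manually afterwards, so the timer captures each machine's real compute speed rather than the synchronized step time:

```python
import time
import torch
import torch.distributed as dist
import torch.nn.functional as F

def train_and_profile(ddp_model, loader, optimizer, max_steps=100):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    compute_time = 0.0

    for step, (inputs, targets) in enumerate(loader):
        if step >= max_steps:
            break

        t0 = time.perf_counter()
        loss = F.cross_entropy(ddp_model(inputs), targets)
        # no_sync() skips DDP's gradient all-reduce, so the timed region
        # contains only this worker's local forward/backward compute.
        with ddp_model.no_sync():
            loss.backward()
        # On GPU, call torch.cuda.synchronize() here so the timer sees
        # the finished kernels before reading the clock.
        compute_time += time.perf_counter() - t0

        # Now synchronize gradients explicitly; this collective is the
        # part that makes every worker's step time look identical.
        for p in ddp_model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
        optimizer.step()
        optimizer.zero_grad()

    # Each rank reports its own number, so a 4x-faster machine should
    # show roughly 4x less local compute time.
    print(f"rank {rank}: local compute time {compute_time:.2f}s "
          f"over {max_steps} steps")
```

With something like this, each rank would report its own compute time independently of the slowest worker, but I'm not sure if it's the right approach.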
