No multiprocessing/low CPU utilization on CPU and/or Windows #9271

nicklasb opened this issue Jan 10, 2025 · 9 comments
@nicklasb

Search before asking

  • I have searched the question and found no related answer.

Please ask your question

Hi,

I am training a small PicoDet model, and I am doing it using the CPU on a Windows box.
The odd thing is that no matter what I do, I never get more than 5-6 percent CPU load from the training Python process, and only one process is ever running, even though I have set worker_num to all kinds of values. I have also set use_multiprocess: True, and so on.

I tried to run it in a Debian Docker container, and while I had some issues there, that did not seem to start multiple processes either.

Is there something I am missing? What settings should be enabled?

@Bobholamovic
Member

Such inefficiency is expected when training deep learning models on CPUs. Training comprises several distinct steps: data loading, data preprocessing, the model forward pass, loss computation, the model backward pass, and so on. Increasing worker_num adds worker processes that accelerate data loading and preprocessing, but it does not improve the subsequent steps: no matter how many workers handle loading and preprocessing, only one process runs everything else. In addition, CPUs are simply not optimized for training deep learning models and are much less efficient at it. To sum up, I strongly discourage training on CPUs, as the time required could be prohibitively long; I recommend using GPUs or other dedicated devices to expedite your training.
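
(For illustration, a minimal sketch using plain PaddlePaddle APIs, not the actual PaddleDetection training loop; it assumes worker_num ultimately maps to the DataLoader's num_workers. Only the data loading is multi-process here; the forward pass, loss, backward pass, and optimizer step all run in the single main process.)

import numpy as np
import paddle
from paddle.io import Dataset, DataLoader

class ToyDataset(Dataset):
    # Synthetic samples; __getitem__ is the part that extra workers parallelize.
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        image = np.random.random([784]).astype(np.float32)
        label = np.random.randint(0, 10, (1,)).astype(np.int64)
        return image, label

if __name__ == "__main__":
    model = paddle.nn.Linear(784, 10)
    opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())

    # num_workers only parallelizes the loader; everything below it runs in one process.
    loader = DataLoader(ToyDataset(), batch_size=32, num_workers=4)
    for image, label in loader:
        logits = model(image)                                     # forward pass
        loss = paddle.nn.functional.cross_entropy(logits, label)  # loss computation
        loss.backward()                                           # backward pass
        opt.step()                                                # optimizer update
        opt.clear_grad()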

@Bobholamovic
Member

Sorry, I just misunderstood your question. This looks like a bug. Please provide the versions of PaddlePaddle and PaddleDetection you are using, so that we can locate the problem.

@nicklasb
Author

Hi!
No problem, of course. It is Friday. :-)
I am running from the master branch of PaddleDetection and PaddlePaddle 2.6.2 from pip.

WRT not using a GPU, I have a comparatively small dataset and train for a specific purpose, so I just run the training overnight. But don't worry, I will get a really badass setup later. :-)

@Bobholamovic
Member

I found this in the PaddlePaddle 2.6 source code. Unfortunately, it looks like the latest version of PaddlePaddle (3.0.0b2) still does not support multi-process data loaders on Windows. As for the low utilization issue, one reason could be that data loading takes too much time and the code lacks concurrency. I would suggest setting up some timers in the main training loop to measure the time taken for each step; this will help identify the bottleneck, allowing us to design targeted strategies to improve performance.
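
(For illustration, one hedged way to do that timing; timed_steps, the fake loader, and the fake train step below are made-up names for this sketch, not part of PaddleDetection:)

import time

def timed_steps(loader, train_step):
    # Wraps any data loader and per-batch training function with simple timers,
    # so you can see whether time goes to data loading or to the compute steps.
    data_start = time.perf_counter()
    for step, batch in enumerate(loader):
        data_time = time.perf_counter() - data_start        # waiting on the loader
        compute_start = time.perf_counter()
        train_step(batch)                                    # forward/backward/optimizer
        compute_time = time.perf_counter() - compute_start
        print(f"step {step}: data {data_time:.3f}s, compute {compute_time:.3f}s")
        data_start = time.perf_counter()

if __name__ == "__main__":
    # Tiny demo with a fake loader and a fake train step that just sleeps.
    fake_loader = ([0.0] * 784 for _ in range(4))
    timed_steps(fake_loader, lambda batch: time.sleep(0.01))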

@nicklasb
Author

nicklasb commented Jan 10, 2025

Hm. Unfortunate indeed.
Oddly, I am not getting that warning as far as I can see; perhaps it does not identify the platform correctly?
Would I benefit from running it in something like a Linux Docker container then? (I tried and had some issues, but obviously I will pursue that if it is a possibility.)

WRT efficiency, I'll see what I find.

@nicklasb
Author

nicklasb commented Jan 11, 2025

Oddly, (in a Debian Docker container) it still doesn't start several processes, and CPU usage still sits at about 5 percent.
I am beginning to suspect something else, like antivirus (though then I would expect that to take a lot of CPU, and it doesn't). I have also played around with process priorities, to no avail.
However, I am not having this issue with Torch or any other ML framework. Really strange.

@nicklasb
Author

I set a large number of workers in the Docker image, and there it does seem to create workers; I got this when I Ctrl-C'd that process:
DataLoader 36 workers exit unexpectedly, pids: 619, 621, 623, 625, 627, 629, 631, 633, 635, 637, 639, 641, 643, 645, 647, 649, 651, 653, 655, 657, 659, 661, 663, 665, 667, 669, 671, 673, 675, 677, 679, 681, 683, 685, 687, 689

But I am not sure that DataLoader workers are the same as training workers; I am not that well versed in PaddleDetection.

@nicklasb
Author

OK, I solved the efficiency issue under Windows 11.
The operating system had to be set to performance mode, and sometimes it put the VSCode process in "Efficiency mode", which constrained the ability of that process, and the processes it spawns, to use the CPU. With those settings changed, the Python script is able to fully use a logical core (not a physical one).

However, the issue with the docker instance is odd.
Running top, I see a lot of worker processes, but they are not doing anything; they are completely idle.

@Bobholamovic
Member

However, the issue with the docker instance is odd.
Running top, I see a lot of worker processes, but they are not doing anything; they are completely idle.

I guess this is because the underlying host OS is still Windows, and the PaddlePaddle framework does not offer a workable solution for multiprocessing on Windows. To verify this, I wrote a simple code snippet:

import numpy as np
from paddle.io import IterableDataset, DataLoader

class RandomDataset(IterableDataset):
    def __iter__(self):
        # Yield random samples forever; the worker processes run this generator.
        while True:
            image = np.random.random([784]).astype(np.float32)
            label = np.random.randint(0, 9, (1,)).astype(np.int64)
            yield image, label

# Four worker processes should be spawned for data loading.
for i, _ in enumerate(DataLoader(RandomDataset(), num_workers=4)):
    print(i)

You can run this snippet inside the Docker container to check if 4 worker processes are running with relatively high utilization, as would be expected (I have verified this on my Linux machine). If there are still idle worker processes, the issue is likely with PaddlePaddle, possibly due to the lack of support for multiprocessing on Windows. In that case, I would suggest opening an issue here.
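
(If it helps, here is a rough sketch using psutil — an extra dependency this thread does not otherwise assume — to print the CPU usage of a process and its children, such as the DataLoader workers:)

import sys
import psutil

def report_children(pid):
    # Samples CPU usage of the given process and all of its children over one second.
    parent = psutil.Process(pid)
    procs = [parent] + parent.children(recursive=True)
    for p in procs:
        p.cpu_percent(interval=None)      # first call just primes the counters
    psutil.cpu_percent(interval=1.0)      # sleep for the sampling window
    for p in procs:
        print(p.pid, p.name(), f"{p.cpu_percent(interval=None):.1f}%")

if __name__ == "__main__":
    report_children(int(sys.argv[1]))     # pass the PID of the training process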
