any way to profile memory usage in dlio? #141

Open
krehm opened this issue Jan 23, 2024 · 18 comments

@krehm (Contributor) commented Jan 23, 2024

Is there any way to track memory usage in a DLIO benchmark process?

I ask because I have a test node with 256 GB of memory. When running unet3d with 9375 files, I see the MPI process grow from 5.5 GB to 11 GB relatively quickly. Each of the 4 spawned reader processes is 1.5 GB. That works out to about 18 GB per MPI process. I can run 12 MPI processes with DAOS and get full accelerator efficiency, but I can't scale up further because the machine runs out of memory, and the processes are too big to swap; I have to reboot the node to get it working again.

My thought was to try to track down what causes the jumps in memory size in the MPI process; maybe there is a way to reduce the amount of RSS consumed. But if this has been done before, I'd rather learn from the experts. :-)
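
For anyone else poking at this, the kind of tracking I have in mind is just polling RSS of the benchmark rank and its spawned reader processes from the outside. A minimal sketch using psutil (psutil is an extra dependency, not something DLIO pulls in, so this is only an illustration):

    import time

    import psutil

    def log_rss(pid, interval=5):
        """Print RSS of a process and all of its children every `interval` seconds."""
        parent = psutil.Process(pid)
        while True:
            for p in [parent] + parent.children(recursive=True):
                try:
                    print(f"pid={p.pid:>7}  rss={p.memory_info().rss / 2**30:6.2f} GB  cmd={p.name()}")
                except psutil.NoSuchProcess:
                    pass  # a reader worker exited between listing and sampling
            print("-" * 40)
            time.sleep(interval)

    # log_rss(<pid of one dlio_benchmark MPI rank>)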

@hariharan-devarajan (Collaborator) commented:

Can you set persistent workers in the data loader to false? I was seeing similar memory increases with persistent workers, as PyTorch keeps spawning new workers while keeping the previous workers alive.
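
Roughly like this when the DataLoader is constructed (toy dataset just to show the flag; the real call lives in dlio_benchmark's torch_data_loader.py and its argument list may differ):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.zeros(16, 1))  # stand-in for DLIO's torch dataset

    loader = DataLoader(
        dataset,
        batch_size=4,
        num_workers=4,
        persistent_workers=False,  # tear workers down after each epoch instead of keeping them alive
    )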

@krehm (Contributor, Author) commented Jan 25, 2024

I will give that a shot on Monday, thanks for the tip.

@krehm (Contributor, Author) commented Feb 16, 2024

@hariharan-devarajan So the memory usage is caused by the pin_memory=True parameter in the DataLoader() instantiation in torch_data_loader.py. If it is changed to False, memory usage drops from 11+ GB to 2.7 GB. Individual steps are a tad slower, from 1.36 seconds up to 1.40 seconds. My config has 4 reader_threads with 2 samplers per thread, so PyTorch is adding 1 GB of memory to the process for each reader_thread/sampler combo.

Any idea what is going on when memory is pinned, what is different about the behavior versus when not pinned? The pytorch documentation is not very helpful here.

Why is 1 GB being pinned per reader_thread/sampler combo when the tensors are only a few hundred bytes in size? There doesn't seem to be a way to control the amount of memory that is pinned. I'd like to understand this better.
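
For my testing I just flipped the hardcoded value. A hypothetical sketch of what making it switchable could look like (build_loader and its pin_memory knob are my own invention, not existing DLIO options):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def build_loader(dataset, batch_size, num_workers, pin_memory=False):
        """Hypothetical helper: the same DataLoader call, but pin_memory is a knob
        instead of being hardcoded to True as in torch_data_loader.py today."""
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=num_workers,
            pin_memory=pin_memory,
        )

    # pin_memory=False is what dropped the parent process from 11+ GB to 2.7 GB for me.
    loader = build_loader(TensorDataset(torch.zeros(16, 1)), batch_size=4, num_workers=4)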

@hariharan-devarajan (Collaborator) commented:

My understanding was that persistent workers + memory pinning were screwing things up.

With the new changes I recently did, it might have been fixed.

@krehm (Contributor, Author) commented Feb 17, 2024

Nope, I just upgraded to today's main branch and reran with top (shift-M, sorted by memory) running in a separate window. The parent process RES value is 11.2 GB, SHR is 9.5 GB, VIRT is 22.0 GB.

PID     USER     PR  NI  VIRT    RES    SHR    S %CPU  %MEM  TIME+     COMMAND
135981  root     20  0   22.0g   11.2g  9.5g   S 8.0   4.5   0:38.18   python3
136124  root     20  0   12.0g   3.2g   1.0g   R 56.1  1.3   1:20.03   python3
136187  root     20  0   11.6g   2.8g   597736 R 14.0  1.1   1:22.17   python3
135998  root     20  0   11.5g   2.7g   597644 S 8.3   1.1   1:21.67   python3
136061  root     20  0   11.5g   2.7g   597644 S 43.5  1.1   1:23.41   python3

Top commit is:

commit 4c8818c10e5cbdf36d877b46c3f337b73bd37fe5 (HEAD -> main, origin/mlperf_storage_v1.0, origin/main, origin/HEAD)
Author: Huihuo Zheng <huihuo.zheng@anl.gov>
Date:   Thu Feb 15 20:47:24 2024 +0000

    added cosmoflow workload configs for v100 and h100

@krehm (Contributor, Author) commented Feb 18, 2024

The memory of the parent process starts to grow immediately as it walks through the 4 child processes, 2 samplers each, sending the "start sending tensors" message on the sockets. By the time all 8 messages have been sent, the parent process is already near the 11 GB size. This is in the very first epoch, so I don't think persistent workers have any effect here; the problem occurs with the first set of workers.

I will mention that my test node does happen to have a GPU in it. The torch code checks both that pin_memory=True and that an attached GPU is present, and only if both are true is pinning activated. I mention that in case your test machines do NOT have GPUs in them; in that case I don't think pin_memory would have any effect even if pin_memory=True.
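
In other words, roughly this check (my paraphrase of the condition, not a verbatim copy of the torch source):

    import torch

    pin_memory_requested = True  # what DLIO passes to the DataLoader
    # Pinning only actually happens when it was requested AND a CUDA device is visible.
    pinning_active = pin_memory_requested and torch.cuda.is_available()
    print(f"pin_memory thread will run: {pinning_active}")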

@krehm (Contributor, Author) commented Feb 18, 2024

The pin_memory call is being made in pin_memory() in python3.9/site-packages/torch/utils/data/_utils/pin_memory.py, at the first type check, with device being None:

    if isinstance(data, torch.Tensor):
        return data.pin_memory(device)

I can't step into the data.pin_memory() code because it's C++, but I can see memory allocation occurring there. At the moment I haven't figured out how to dig deeper to see the actual memory allocations and why they happen. Of course I expect some allocation, since incoming samples are copied to a Queue() in pinned memory, but I don't see why 8 GB would be allocated given that each step contains only 4 samples. The main process quickly grows to 11 GB RES and then stays steady for all subsequent epochs, so there is not a continuous memory leak.
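
The closest I have gotten to observing it from the outside is watching RSS around a single pin_memory() call. A small sketch (assumes psutil is installed and a GPU is visible, since pinning goes through the CUDA host allocator):

    import os

    import psutil
    import torch

    proc = psutil.Process(os.getpid())

    def rss_gb():
        return proc.memory_info().rss / 2**30

    before = rss_gb()
    t = torch.empty(512, dtype=torch.uint8)  # a tiny tensor, like the samples DLIO returns
    p = t.pin_memory()                       # the allocation happens in C++, in the caching host allocator
    after = rss_gb()
    print(f"RSS before {before:.3f} GB, after {after:.3f} GB, delta {after - before:.3f} GB")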

@krehm (Contributor, Author) commented Feb 19, 2024

I changed the number of reader_threads from 4 to 8. In this case the parent process RES size jumped from 11 GB to 19.2 GB, which is about another 8 GB, so the expansion size of the main process appears to go up by 1 GB for each reader_thread/sampler combo.

Looking at the init code in PyTorch's dataloader.py, each worker thread gets its own index_queue Queue(), where the parent sends the next sample index to retrieve. All the workers share a common _worker_result_queue Queue(), where the returned samples are stored. If there is no pin_memory, the parent process gets samples from the _worker_result_queue directly. If there is pin_memory, a single pin_memory thread is started with its own single Queue(); it copies samples from _worker_result_queue into that queue using pinned memory, and the parent process reads samples from the pin_memory queue. The pin_memory thread doesn't appear to have any knowledge of how many worker threads there are (it is a thread, not a process), so it shouldn't be allocating 1 GB of pinned memory based on the worker count. The only Queue() that scales with the number of workers is each worker's index_queue. Could it be each worker's index_queue that is growing to 1 GB? And why would it stop growing at 1 GB?
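
To spell out the data path as I read dataloader.py (paraphrased, not verbatim):

    # main process  --(sample index)-->  index_queues[i]        one Queue() per worker
    # worker i      --(sample)------->   _worker_result_queue   shared by all workers
    # pin_memory thread:  _worker_result_queue -> pin_memory(data) -> its own queue
    # main process reads batches from the pin_memory queue (or straight from
    # _worker_result_queue when pinning is off)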

@krehm (Contributor, Author) commented Feb 22, 2024

@hariharan-devarajan Not sure, is a name-ping required for anyone other than the ticket creator to see additions to a ticket?

Anyway, I would be interested in your thoughts on the above. I am running the unet3d benchmark, with num_files_train set to 9376, multiprocessing_context is fork.

@hariharan-devarajan (Collaborator) commented:

Sorry, I haven't yet had a chance to look at this. Hopefully I will get to check it out by Saturday.

@hariharan-devarajan (Collaborator) commented:

@krehm This seems odd. A couple of things could be happening. Do you think this could be a garbage collection issue? If so, can you del the numpy array data once it has been returned, to free the memory reference for garbage collection?

For my testing, I generally use a GPU-based machine myself, but maybe I didn't come across this due to the number of files, and we have large memories on our clusters.

Maybe you can do a Valgrind run with more workers on one process and see if you spot any memory leaks. Note: I have never tried it with a Python app, let alone a PyTorch app.
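
For the del idea, I mean something roughly like this in the reader's sample path (read_sample and the "x" key are placeholders, not the actual DLIO reader code):

    import gc

    import numpy as np
    import torch

    def read_sample(path):
        """Placeholder reader: load one .npz sample and drop the numpy reference right away."""
        with np.load(path) as npz:
            array = npz["x"]                          # "x" stands in for the real dataset key
            sample = torch.from_numpy(array).clone()  # copy out of the numpy buffer
        del array, npz                                # release the references for garbage collection
        gc.collect()
        return sample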

@hariharan-devarajan (Collaborator) commented:

Another quick thing to check: if your samples were smaller (say 4096 bytes), would the memory consumption be proportionally smaller?

@krehm (Contributor, Author) commented Feb 25, 2024

I am running a single DLIO accelerator instance. I reduced the number of files from 9375 to 168, it made no difference. By step 5 in the first epoch, the main process is already 11.1 GB in size. The child processes are 2.7 GB, which is what the parent was just before it started talking to the child processes. Throughout the rest of the benchmark, the main process does not grow further in size, so it is not a simple continuous memory leak.

I thought it might have something to do with the PyTorch getCachingHostAllocator used with CUDA, just a wild guess. I tried setting PYTORCH_NO_CUDA_MEMORY_CACHING=1, but it made no difference.

@hariharan-devarajan (Collaborator) commented:

Another hunch: what is the sample size? The default sample size is 200 MB, right? Also, is there any prefetch cache set up? If so, what is its value?

A couple of things to try again.

  1. Make smaller samples. It will reduce memory, hopefully.
  2. Set the prefetch cache to 0 to force no memory caching for the data loader.

@krehm (Contributor, Author) commented Feb 28, 2024

@hariharan-devarajan Looking at the training files, there seems to be a distribution of .npz files from 36M up to 300M with 10 files of each size. With 9376 files the average file size is 148M.

prefetch_size in the config is the default 0, so torch will use a prefetch_factor of 2. There are 4 reader children, so that's 8 samples prefetched; plus the batch_size is 4, so there will be 4 samples in use by the sleeping accelerator, for a total of at most 12 samples in memory at any point in time. Using the average size, that's 1.78 GB of memory growth max in the parent process. Even if all the files were the maximum size, that would be 3.6 GB of growth max in the parent process. I consistently see an increase of 8 GB by step 5 of the first epoch; after that the parent process size is stable. Not all of the growth happens in the first step; it takes a few steps to reach the max size.

I hacked the code to set prefetch_factor to 1; I can't set it to 0 with reader threads or it asserts. Memory growth was 4 GB rather than the usual 8 GB seen with a prefetch_factor of 2 and 4 reader threads. Setting prefetch_factor to 4 caused 16 GB of growth, and setting it to 8 caused 32 GB of growth, so the growth clearly correlates with the number of prefetches: it is 1 GB per prefetch/reader_thread combo, or 1 GB per sample on the pin_memory queue. If I hack pin_memory=False there is no noticeable growth in the parent process at all, although each step is about 0.04 seconds slower.
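
Putting the numbers above side by side:

    # Expected in-flight sample memory vs. what I actually observe (values from this thread).
    avg_sample_gb   = 148 / 1024   # average .npz file is ~148 MB
    num_workers     = 4            # reader_threads
    prefetch_factor = 2            # torch default when prefetch_size is 0
    batch_size      = 4

    samples_in_flight = prefetch_factor * num_workers + batch_size  # 12 samples
    expected_gb = samples_in_flight * avg_sample_gb                 # ~1.7 GB
    observed_gb = prefetch_factor * num_workers * 1.0               # ~1 GB per prefetch slot = 8 GB
    print(f"expected ~{expected_gb:.1f} GB of growth, observed ~{observed_gb:.0f} GB")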

I know your big cluster is all Intel accelerators; is there anywhere you could do a run on a node with CUDA accelerators, just to see whether you see the problem there, which would imply a problem in the CUDA code? Something like a minimum of 1 GB being allocated each time a sample is added to the pin_memory queue would give the observed amount of growth. Given that there is no growth when pin_memory is false, it must have something to do with the pin_memory code, perhaps the CUDA code.

@hariharan-devarajan (Collaborator) commented:

Does the 1 GB also correspond to the number of read threads you use? What is the relation between the number of read threads and memory?

I know the prefetch factor is not per process but per read thread.

@krehm (Contributor, Author) commented Feb 28, 2024

If I double the number of reader threads, the growth doubles. Or if I double the prefetch_factor, the growth doubles. So the growth corresponds to the number of samples in the pin_memory queue at any point in time.

@hariharan-devarajan (Collaborator) commented Feb 28, 2024 via email
