Incorrect reporting of memory utilisation #141

Open
@david-waterworth

Describe the bug
I'm running into issues with batch transform due to what I assume is an OOM condition. The main problem appears to be that, as far as I can see, there's no way to explicitly configure the batch_size for a batch transform.

Instead, the batch_size appears to be controlled by MaxPayloadInMB, which has a minimum of 1. I added logging to my predict_fn and observe that I'm receiving a mix of batches: some contain 1000 examples and some contain 10k+ examples. The huge batches are pretty much 1MB in size. I have no idea where the batches of 1000 come from (I'm wondering if it's splitting the last batch, which is smaller than the 1MB payload).
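One way to bound peak memory regardless of how many examples arrive in a single payload is to split the incoming batch inside the handler. The following is a minimal sketch in plain Python, not something the toolkit provides: `iter_chunks`, `predict_in_chunks`, and the `predict_chunk` callable are hypothetical names introduced here for illustration, and the chunk size of 1000 is an arbitrary choice.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` holding at most `chunk_size` elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def predict_in_chunks(
    items: Sequence[T],
    predict_chunk: Callable[[Sequence[T]], Sequence[R]],
    chunk_size: int = 1000,
) -> List[R]:
    """Run `predict_chunk` on bounded slices and concatenate the results in order.

    `predict_chunk` is a hypothetical callable that stands in for the real
    model call; only one slice is passed to it at a time.
    """
    results: List[R] = []
    for chunk in iter_chunks(items, chunk_size):
        results.extend(predict_chunk(chunk))
    return results


# Example with a stand-in "model" that doubles each input value.
outputs = predict_in_chunks(list(range(2500)), lambda chunk: [x * 2 for x in chunk])
print(len(outputs))
```

Assuming the examples reach predict_fn as a sequence, the model call inside it could be wrapped as `predict_in_chunks(data, run_model)`, where `run_model` is whatever function performs inference on one slice. This does not change the payload size that batch transform sends; it only limits how much is processed at once.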

The issue is that the large batches occasionally seem to crash the worker. I suspect an out-of-memory condition (the obvious workaround is to pick a machine with more memory). In the logs the maximum utilisation appears to be around 50%, but on closer inspection that metric looks wrong: in the example below, MemoryUsed = 3537.828125 MB and MemoryAvailable = 3843.3515625 MB, which would give roughly 92%, yet MemoryUtilization is reported as 50%.

Expected behavior
MemoryUtilization = 100.0 * MemoryUsed / MemoryAvailable

Screenshots or logs

2023-03-22T12:53:27.708+11:00 | 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:3843.3515625|#Level:Host|#hostname:4a73e96743e7,timestamp:1679450006
2023-03-22T12:53:27.708+11:00 | 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:3537.828125|#Level:Host|#hostname:4a73e96743e7,timestamp:1679450006
2023-03-22T12:53:27.708+11:00 | 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:50.0|#Level:Host|#hostname:4a73e96743e7,timestamp:1679450006
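To make the discrepancy easy to reproduce, here is a small sketch that parses the three metric lines and compares the reported utilisation with the formula given under "Expected behavior". The regular expression is an assumption based only on the lines shown in this issue, and the formula is the one proposed here; the server may define utilisation differently (for example, relative to total memory).

```python
import re

# Matches the metric portion of a TS_METRICS log line, e.g.
# "TS_METRICS - MemoryUsed.Megabytes:3537.828125|#Level:Host|..."
# The pattern is inferred from the log lines in this issue only.
METRIC_RE = re.compile(
    r"TS_METRICS - (?P<name>\w+)\.(?P<unit>\w+):(?P<value>[0-9.]+)"
)


def parse_metrics(lines):
    """Return {metric_name: value} for every line that contains a TS_METRICS entry."""
    metrics = {}
    for line in lines:
        match = METRIC_RE.search(line)
        if match:
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


log_lines = [
    "2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - "
    "MemoryAvailable.Megabytes:3843.3515625|#Level:Host|"
    "#hostname:4a73e96743e7,timestamp:1679450006",
    "2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - "
    "MemoryUsed.Megabytes:3537.828125|#Level:Host|"
    "#hostname:4a73e96743e7,timestamp:1679450006",
    "2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - "
    "MemoryUtilization.Percent:50.0|#Level:Host|"
    "#hostname:4a73e96743e7,timestamp:1679450006",
]

metrics = parse_metrics(log_lines)

# Formula from "Expected behavior": 100.0 * MemoryUsed / MemoryAvailable
expected = 100.0 * metrics["MemoryUsed"] / metrics["MemoryAvailable"]
reported = metrics["MemoryUtilization"]

print(f"expected={expected:.2f}% reported={reported:.2f}%")
```

With the values logged above, the formula yields a figure a little over 92%, far from the reported 50.0%.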

System information

  • Toolkit version: pytorch
  • Framework version: 1.13.1
  • Python version: 3.9
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): No
