
How to explain these memory behaviors when using numpy? #31

Closed
Phylliade opened this issue Feb 6, 2020 · 18 comments
Comments

@Phylliade

When running a very simple script based on numpy functions, we can get the following results:

test2.py: % of CPU time = 100.00% out of   3.59s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      |
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [test2.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             |
     3	 |     0.30% |    48.40% |         -80 |       1.03% | x = np.array(range(10**7))
     4	 |     0.59% |    50.72% |           0 |      98.97% | np.array(np.random.uniform(0, 100, size=10**8))
     5	 |           |           |             |             |

How can we get:

  • A negative memory growth for the first line?
  • A null memory growth on the second line?

System info

  • Platform: Mac OS X
  • Python: 3.7 (brew)
  • Numpy: 1.16.4
@emeryberger
Member

emeryberger commented Feb 7, 2020

Thanks for your note and the helpful repro case!

First, note that in your original code there's definitely a fair amount of allocation going on in the second np.array call (line 4) despite the 0 growth, as the 98.97% memory usage reflects. That line accounts for nearly all of the malloc and free activity, but the mallocs and the frees all evened out.

Second, I strongly suspect what's happening is that the first line is allocating a bunch of objects and then freeing them, leaving behind an allocated object that is big, but smaller than the other allocated objects that were freed in the process.

In any event, the results for Scalene (and for memory_profiler, its "competitor" for memory profiling) can be confusing because (a) there's allocation and freeing happening on each line, and (b) there's all the implicit garbage collection that Python is doing. I broke the operations into separate lines (and inserted del to immediately trigger deallocation via reference counting) so we can more clearly see what's going on.

(Note that you have to add the @profile decorator to every function for memory_profiler to do anything; Scalene ignores such directives.)

import numpy as np

@profile
def main():
    r = list(range(10**7))
    x = np.array(r)
    y = np.random.uniform(0, 100, size=10**8)
    a = np.array(y)
    del r
    del x
    del y
    del a

main()

This is with the latest version (you may need to run brew reinstall libscalene and pip install --upgrade --force-reinstall scalene).

% scalene test2.py
test2.py: % of CPU time = 100.00% out of   3.37s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [test2.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |           |           |             |             | @profile
     4	 |           |           |             |             | def main():
     5	 |     0.35% |    10.83% |          16 |       0.98% |     r = list(range(10**7))
     6	 |     0.35% |    24.35% |          80 |       4.90% |     x = np.array(r)
     7	 |     0.35% |    43.37% |         768 |      47.06% |     y = np.random.uniform(0, 100, size=10**8)
     8	 |     0.35% |    15.63% |         768 |      47.06% |     a = np.array(y)
     9	 |     0.35% |     4.08% |         -96 |             |     del r
    10	 |           |           |         -80 |             |     del x
    11	 |           |           |        -768 |             |     del y
    12	 |           |           |        -768 |             |     del a
    13	 |           |           |             |             | 
    14	 |           |           |             |             | main()

Now you can see the allocations, mostly reasonably accounted for. (There's sampling going on, so everything is counted in multiples of 16MB.)

Running with memory_profiler (which can be over 1000x slower than Scalene, so I don't recommend using it in practice for larger programs) gives you more or less similar results (though line 5 is quite different).

Filename: test2.py

Line #    Mem usage    Increment   Line Contents
================================================
     3   45.195 MiB   45.195 MiB   @profile
     4                             def main():
     5  431.469 MiB  386.273 MiB       r = list(range(10**7))
     6  507.766 MiB   76.297 MiB       x = np.array(r)
     7 1270.730 MiB  762.965 MiB       y = np.random.uniform(0, 100, size=10**8)
     8 2033.672 MiB  762.941 MiB       a = np.array(y)
     9 1723.922 MiB    0.000 MiB       del r
    10 1723.922 MiB    0.000 MiB       del x
    11  960.980 MiB    0.000 MiB       del y
    12  198.039 MiB    0.000 MiB       del a

I don't understand the disparity between what memory_profiler and Scalene show for line 5. Looking into it.

Also, my current thinking was that showing negative growth is more useful in understanding what's going on than just printing 0, but maybe just indicating growth (mallocs) and ignoring free would be clearer and perhaps more useful? (Doing so makes it report 16MB for line 5 and 768MB for line 6.) I'd welcome your feedback.

Anyway, here's what memory_profiler says for your original program:

Filename: test2-original.py

Line #    Mem usage    Increment   Line Contents
================================================
     3   45.062 MiB   45.062 MiB   @profile
     4                             def main():
     5  197.781 MiB  152.719 MiB       x = np.array(range(10**7))
     6  197.805 MiB    0.023 MiB       np.array(np.random.uniform(0, 100, size=10**8))

You can see that memory_profiler also reports no growth for line 6, so I think it is subtracting freed memory.

@emeryberger
Member

Thanks once again for posting this. It led me to discover some issues that I have now resolved in the latest version. The most important problem was that Scalene was missing some samples due to lost signals (something I was already accounting for, but not in the case when native code leads to many signals, which all get coalesced). This fix resolves the difference between memory_profiler and Scalene for test2.py, shown above.

test2.py: % of CPU time = 100.00% out of   3.68s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [test2.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |           |           |             |             | @profile
     4	 |           |           |             |             | def main():
     5	 |     0.31% |    11.25% |         405 |      22.10% |     r = list(range(10**7))
     6	 |     0.31% |    22.96% |          77 |       3.72% |     x = np.array(r)
     7	 |     0.31% |    42.46% |         763 |      37.09% |     y = np.random.uniform(0, 100, size=10**8)
     8	 |     0.31% |    15.83% |         763 |      37.09% |     a = np.array(y)
     9	 |     0.31% |     5.97% |        -391 |             |     del r
    10	 |           |           |         -76 |             |     del x
    11	 |           |           |        -763 |             |     del y
    12	 |           |           |        -763 |             |     del a
    13	 |           |           |             |             | 
    14	 |           |           |             |             | main()

For the original program you posted, here are the results:

test2-original.py: % of CPU time = 100.00% out of   4.63s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [test2-original.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |           |           |             |             | @profile
     4	 |           |           |             |             | def main():
     5	 |     0.23% |    53.41% |          90 |      46.24% |     x = np.array(range(10**7))
     6	 |     0.47% |    45.89% |          -0 |      53.76% |     np.array(np.random.uniform(0, 100, size=10**8))
     7	 |           |           |             |             | 
     8	 |           |           |        -839 |             | main()

As you can see, the results for Scalene (for lines 5 and 6) are similar to those from memory_profiler. I actually don't understand why memory_profiler reports a much larger number (152MB). Here I verify the size of the allocated object manually:

>>> import numpy as np
>>> import sys
>>> x = np.array(range(10**7))
>>> sys.getsizeof(x)
80000096
>>> sys.getsizeof(x) / (1024 * 1024)
76.29403686523438

And 90 is a lot closer to 76 than 152. :)

@emeryberger
Member

I am going to close this issue since I believe it has been resolved. I am still considering extending the output to show average mallocs, average frees, and average growth (mallocs - frees). Right now, it only shows the latter, which may be confusing.

@Phylliade
Author

Thanks for the quick support!!

Regarding

Also, my current thinking was that showing negative growth is more useful in understanding what's going on than just printing 0, but maybe just indicating growth (mallocs) and ignoring free would be clearer and perhaps more useful? (Doing so makes it report 16MB for line 5 and 768MB for line 6.)

Having the overall growth is clear to me, but having the option to also see detailed mallocs and frees would be interesting.

About the current issue, I can get perfect results with the following:

test2.py: % of CPU time = 100.00% out of   2.82s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [test2.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |     0.36% |    42.25% |          88 |      46.33% | x = np.array(range(10**7))
     4	 |     0.36% |    41.10% |         763 |      26.85% | z = np.random.uniform(0, 100, size=(10**8))
     5	 |     0.36% |    15.55% |         763 |      26.83% | y = np.array(z)

90MB for an array of 10**7 doubles (line 3), 763MB for an array of 10**8 doubles (line 4), and the same again for a copy (line 5): that all makes sense.
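(That arithmetic is easy to check: a numpy double takes 8 bytes per element, so the expected buffer sizes follow directly. A quick sketch, not from the thread itself:)

```python
import numpy as np

# Each element of a float64 (or default int64) numpy array takes 8 bytes,
# so the expected payload sizes can be computed directly.
small = 10**7 * 8 / 2**20   # 10**7 doubles, in MiB
large = 10**8 * 8 / 2**20   # 10**8 doubles, in MiB
print(f"10**7 doubles: {small:.1f} MiB")   # 76.3 MiB
print(f"10**8 doubles: {large:.1f} MiB")   # 762.9 MiB

# Confirm against numpy's own accounting of the buffer size:
x = np.zeros(10**7)
print(x.nbytes / 2**20)                    # 76.29...
```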

However, when adjusting the script a bit, results get very strange:

test2.py: % of CPU time = 100.00% out of   2.88s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [test2.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |     0.36% |    42.53% |          88 |      46.33% | x = np.array(range(10**7))
     4	 |     0.71% |    56.40% |          -0 |      53.67% | y = np.array(np.random.uniform(0, 100, size=(10**8)))

And even stranger:

test2.py: % of CPU time = 100.00% out of   1.60s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [test2.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |           |           |             |             | #x = np.array(range(10**7))
     4	 |     1.32% |    98.68% |         388 |     100.00% | y = np.array(np.random.uniform(0, 100, size=(10**8)))

When lines 2 and 3 of the first script are combined into a single line:

  • There is zero memory growth if something else runs first (e.g., keeping line 1), even though we saw there really is memory growth (which makes total sense).
  • There is memory growth if nothing else runs first; moreover, that growth does not match the sum of the individual growths.

@emeryberger emeryberger reopened this Feb 16, 2020
@emeryberger
Member

This was super helpful! It revealed a bug. Scalene tracks average memory growth. That means it needs to know how many times each line of code executes. But that's wrong! It really needs to know how many times each bytecode index executes. Using lines of code was leading to inaccuracies when there were lots of allocations or frees on individual lines, as in your example. I have now fixed this (commit coming momentarily). Here's the output for each of your tests.

Test 1:

% DYLD_INSERT_LIBRARIES=./libscalene.dylib PYTHONMALLOC=malloc python3 -m scalene phyllidae/test2-1.py
phyllidae/test2-1.py: % of CPU time = 100.00% out of   4.46s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [phyllidae/test2-1.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |     0.24% |    57.44% |          90 |      46.28% | x = np.array(range(10**7))
     4	 |     0.24% |    30.51% |         763 |      26.87% | z = np.random.uniform(0, 100, size=(10**8))
     5	 |     0.24% |    11.33% |         763 |      26.85% | y = np.array(z)

Test 2:

% DYLD_INSERT_LIBRARIES=./libscalene.dylib PYTHONMALLOC=malloc python3 -m scalene phyllidae/test2-2.py
phyllidae/test2-2.py: % of CPU time = 100.00% out of   4.36s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [phyllidae/test2-2.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |     0.25% |    54.74% |          90 |      46.25% | x = np.array(range(10**7))
     4	 |     0.49% |    44.52% |         763 |      53.75% | y = np.array(np.random.uniform(0, 100, size=(10**8)))

Test 3:

% DYLD_INSERT_LIBRARIES=./libscalene.dylib PYTHONMALLOC=malloc python3 -m scalene phyllidae/test2-3.py
phyllidae/test2-3.py: % of CPU time = 100.00% out of   2.08s.
  	 |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage (%)   | [phyllidae/test2-3.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |           |           |             |             | #x = np.array(range(10**7))
     4	 |     1.11% |    98.89% |         777 |     100.00% | y = np.array(np.random.uniform(0, 100, size=(10**8)))

Thanks again for your help in identifying this problem; please keep me updated on how it works out going forward!

@emeryberger
Member

Fixed with emeryberger@bcaddaf.

@Phylliade
Author

Amazing!

I may have one last question:
In Test 2, we see a memory growth of 763MB for the line y = np.array(np.random.uniform(0, 100, size=(10**8))).
However, when this line is split into two lines in Test 1, we see that the underlying memory growth is twice that: 763MB for np.random.uniform(0, 100, size=(10**8)) (line 4), then another 763MB for the copy (line 5).

My interpretation is that, in Test 2, since the result of the inner np.random.uniform(0, 100, size=(10**8)) is never bound to a Python variable, its allocation shows up as no net growth: the memory is allocated after the start of the bytecode and freed again before its end.
However, as Test 1 shows, there really is an internal allocation of 763MB going on: it's a peak, and it goes untracked because the memory is released before the bytecode finishes.

Do you think there could be a way to track such peaks? When working with big numpy arrays, such internal peaks would be super useful to track, because they can indicate unwanted memory copies. When programming with big numpy arrays (for example in Machine Learning/Deep Learning), these can be a major waste of time and memory.
(I can open a new issue if necessary!)
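(For Python-visible allocations, the standard library's tracemalloc can already report such a peak alongside the current usage. This is a minimal sketch of the idea, not Scalene's mechanism; it assumes a numpy version recent enough to register its buffers with Python's tracing allocator:)

```python
import tracemalloc
import numpy as np

tracemalloc.start()

# Same shape as Test 2: np.random.uniform builds a temporary array,
# np.array copies it, and the temporary is then freed.
y = np.array(np.random.uniform(0, 100, size=10**7))

# get_traced_memory() returns (current, peak) in bytes since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 2**20:.0f} MiB")  # roughly one ~76 MiB array
print(f"peak:    {peak / 2**20:.0f} MiB")     # roughly two: the copy plus the temporary
```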

BTW, the new memory usage graph is awesome!

@emeryberger
Member

Scalene intercepts all memory allocations (from Python and native code). What's happening is that np.array is in fact making a copy (adding 763MB to the pre-existing 763MB from invoking np.random.uniform). It then deletes the old array. The result is a net 763MB growth. (There's no need to invoke np.array, since np.random.uniform already returns an array, but numpy is happy to make you a new array anyway.) I verified this by not subtracting deallocations. I am not really sure what the best way is to show this info -- it's not really a peak, but I can see that it's informative to see the actual allocations vs. deallocations, rather than just their difference. I wonder whether getting rid of the usage % column would make sense, replacing it with two fields: "avg allocated" and "avg net growth"?
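(The "np.array makes a redundant copy" point can be checked with numpy itself: np.asarray returns its input unchanged when it is already an ndarray, while np.array copies by default. A small sketch:)

```python
import numpy as np

z = np.random.uniform(0, 100, size=10**6)

copy = np.array(z)      # copies by default, even though z is already an ndarray
alias = np.asarray(z)   # no copy: the input is returned as-is

print(np.shares_memory(z, copy))   # False: a second buffer was allocated
print(np.shares_memory(z, alias))  # True: same buffer, no memory growth
print(alias is z)                  # True
```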

@emeryberger
Member

Another possibility is to work in sparklines for each line of code...

@Phylliade
Author

Having avg allocated and avg net growth seems like a good idea!

@emeryberger
Member

So scrap the usage %? (Just verifying.) Also, would per-line memory usage sparklines help (if that's even possible)?

@Phylliade
Author

Actually, having sparklines for each line of code would be a must.
If that's not possible, the averages would be sufficient. (In any case, the averages are useful as per-line statistics, with or without sparklines.)

@emeryberger
Member

I guess what I am asking is do you find the usage % info of value? I see it as potentially useful as it shows where the most allocation activity is.

@emeryberger
Member

Just committed a version with memory usage as per-line sparklines (I wonder if printing a % beside it would be of value). Take a look!
emeryberger@037dbd2

@emeryberger emeryberger reopened this Mar 3, 2020
@emeryberger
Member

Memory usage: ▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅ (max: 1542.64MB)
phylliade/test2-3.py: % of CPU time =  89.31% out of   2.42s.
  	     |     CPU % |     CPU % | Avg memory  | Memory      | 
  Line	 |  (Python) |  (native) | growth (MB) | usage       | [phylliade/test2-3.py]
--------------------------------------------------------------------------------
     1	 |           |           |             |             | import numpy as np
     2	 |           |           |             |             | 
     3	 |           |           |             |             | @profile
     4	 |           |           |             |             | def main():
     5	 |           |           |             |             |     #x = np.array(range(10**7))
     6	 |     0.83% |    88.48% |         777 | ▅▁▅         |     y = np.array(np.random.uniform(0, 100, size=(10**8)))
     7	 |           |           |             |             | 
     8	 |           |           |             |             | main()
     9	 |           |           |             |             | 

In this example, you can see the increase and decrease over time, which to me is pretty suggestive that there is copying going on (from an old, reclaimed object -- corresponding to the dip -- into the newly allocated one).

@Phylliade
Author

Amazing!!

For me, the current metrics (growth, % of use) together with the sparklines are enough to interpret what's going on.
I'll start with the %, and if I see something unexpected, I'll look at the sparklines for an explanation.

Regarding the sparkline of the 6th line, shouldn't it show "increase, increase, decrease" (i.e., allocation of the first array, allocation of the second array, deallocation of the first array), instead of "increase, decrease, increase"?

@emeryberger
Member

Great catch. This took a surprisingly long time to unwind.

Memory usage: ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅█▅ (max: 1540.58MB)
phylliade/test2-3.py: % of CPU time =  88.77% out of   2.15s.
       |    CPU % |    CPU % |  Net  | Memory                    | 
  Line | (Python) | (native) |  (MB) | usage                     | [phylliade/test2-3.py]
--------------------------------------------------------------------------------
     1 |          |          |       |                           | import numpy as np
     2 |          |          |       |                           | 
     3 |          |          |       |                           | @profile
     4 |          |          |       |                           | def main():
     5 |          |          |       |                           |     #x = np.array(range(10**7))
     6 |    0.93% |   87.84% |   777 | ▁▁▁▁▁▁▅█▅          (100%) |     y = np.array(np.random.uniform(0, 100, size=(10**8)))
     7 |          |          |       |                           | 
     8 |          |          |       |                           | main()
     9 |          |          |       |                           | 

@Phylliade
Author

Nice, that's really awesome!

Closing.

(A last detail: on my terminal emulator, the sparklines touch each other.)

But this time it's closed on my side ;)
