Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BENCH: Minor fixes in SciPy benchmarks #745

Merged
merged 14 commits into from
Jan 20, 2016

Conversation

jakirkham
Copy link
Contributor

  • Allocate a normal zeroed array instead of a random one.
  • Write results into the array.
  • Drop unneeded semicolons
  • Use environment Python.
  • Fix PEP8 issues.
  • Use Fortran arrays.

@jakirkham jakirkham force-pushed the minor_fix_scipy_prof branch from f5599b0 to 75993ff Compare January 19, 2016 17:32
@jakirkham jakirkham changed the title BENCH: In SciPy benchmarks, drop allocation of C in advance BENCH: Minor fixes in SciPy benchmarks Jan 19, 2016
@xianyi
Copy link
Collaborator

xianyi commented Jan 19, 2016

I am not the expert in Python. I think the preallocation C is for reducing the overhead in timing BLAS function.
@wernsaar any ideas?

@jakirkham
Copy link
Contributor Author

I am nearly certain that SciPy is still doing this allocation for you. In an event, if C has an array beforehand we are not writing into it here. It will simply be deallocated and a new array will be allocated for the return result.

@xianyi
Copy link
Collaborator

xianyi commented Jan 19, 2016

@jakirkham , thank you for the answer.

@jakirkham
Copy link
Contributor Author

If we wanted to reuse C, we might need something like dsyrk(1.0, A, c=C, overwrite_c=True) for instance. However, that appears not to work (maybe a bug).

@jakirkham
Copy link
Contributor Author

Python variables could be thought of like C pointers with some reference counting attached and possible garbage collection. All Python variables are pointers to Python objects. So, if I assign a new value to a Python variable, it merely points to a new Python object. The old underlying Python object that was pointed to is unaffected by this assignment.

@jakirkham
Copy link
Contributor Author

The only other points I would add are C is allocated before timing begins. So, it isn't factoring into the timing. Also, as C is being filled with random values this is a bit slow. One would prefer numpy.empty to allocate an array that has not been zeroed. If one wants to overwrite that array, replace C = with C[...] =.

@zerothi
Copy link

zerothi commented Jan 19, 2016

However, C gets allocated over and over.
As will be clear from below, empty shouldn't be used as it does it like C
(empty is a reservation, not allocation).

The scipy implementation recognizes if the variable is pre-allocated.
However, ONLY if doing array assignments.

What should be done is:

C = np.zeros([N,N],np.float32)
C[:,:] = blas.ssyrk(...)

You can check the memory profiling using these small scripts:

import numpy as np
from scipy.linalg import blas
from memory_profiler import profile

N = 1000
A = np.random.randn(N,N).astype('float64')

@Profile
def blas1():
C = np.zeros([N,N],np.float64,order='F')
C = blas.dsyrk(1.0,A)
del C

@Profile
def blas2():
C = blas.dsyrk(1.0,A)
del C

@Profile
def blas3():
C = np.zeros([N,N],np.float64,order='F')
C[:,:] = blas.dsyrk(1.0,A)
del C

blas1()
blas1()
blas2()
blas2()
blas3()
blas3()

print("Mem: {} MB".format(N2*8./10242))

Then use the memory_profiler module to check how it allocates/de-allocates.
Cleary method 3 is the best in terms of memory. Only on the first call will
it do some allocation in the blas library.

Hence to get the best timing, one should:

  1. Pre-allocate all arrays
  2. Do a pre-initialization call to the BLAS routine
  3. Perform stress test, WITH array assignments

2016-01-19 18:54 GMT+01:00 jakirkham notifications@github.com:

The only other points I would add are C is allocated before timing
begins. So, it isn't factoring into the timing. Also, as C is being
filled with random values this is a bit slow. One would prefer numpy.empty
to allocate an array that has not been zeroed. If one wants to overwrite
that array, replace C = with C[...] =.


Reply to this email directly or view it on GitHub
#745 (comment).

Kind regards Nick

@jakirkham
Copy link
Contributor Author

I am not following by what you mean in terms of reservation vs. allocation here, @zerothi. Can you please clarify? In particular, why would the two functions below be any different?

@profile
def blas3():
    C = np.zeros([N,N],np.float64,order='F')
    C[:,:] = blas.dsyrk(1.0, A)
    del C

@profile
def blas4():
    C = np.empty([N,N],np.float64,order='F')
    C[:,:] = blas.dsyrk(1.0, A)
    del C 

@zerothi
Copy link

zerothi commented Jan 19, 2016

When you do:
np.empty(..)
all numpy is doing is reserving a memory block to be later used.
Only when you start assigning values to the variable will the actual allocation take place.

Hence, if you do:

C = np.empty(...)
tic
C[:,:] = 1.
toc

you will effectively also time the allocation of the variable. Essentially...

Take this code:

C = np.empty([N,N],np.float64,order='F')
C[:,:] = 1.
del C

the memory profiler yields:

Line #    Mem usage    Increment   Line Contents
================================================
    27     47.9 MiB      0.0 MiB       C = np.empty([N,N],np.float64,order='F')
    28     55.4 MiB      7.5 MiB       C[:,:] = 1.
    29     55.4 MiB      0.0 MiB       del C

However, if you used zeros you would get:

Line #    Mem usage    Increment   Line Contents
================================================
    27     55.1 MiB      7.5 MiB       C = np.zeros([N,N],np.float64,order='F')
    28     55.1 MiB      0.0 MiB       C[:,:] = 1.
    29     55.1 MiB      0.0 MiB       del C

Empty has it usages, but I would not use it for benchmarks.

@jakirkham
Copy link
Contributor Author

Interesting, so if I do the following...

>>> C = np.empty([N,N],np.float64,order='F')
>>> print(C)

Does this result in allocation when printing C?

@zerothi
Copy link

zerothi commented Jan 19, 2016

There are corner cases I am not fully sure about. In this case it depends on what print does to C.

It also depends on the allocation routines that is linked in to Python/numpy.
So with Python it is a bit difficult to know exactly what happens. :)

@jakirkham
Copy link
Contributor Author

After some initial issues with memory_profiler, it seems this is the answer.

Line #    Mem usage    Increment   Line Contents
================================================
     4     34.4 MiB      0.0 MiB   def test():
     5     34.4 MiB      0.0 MiB       N = 100
     6     34.4 MiB      0.0 MiB       C = np.empty([N,N],np.float64,order='F')
     7     34.6 MiB      0.2 MiB       print(C)

In short, allocation does not occur until the print. I had always assumed that the allocation had already occurred because print would show an array with non-zeroed values. Thanks for helping me realize that empty is only reserving data, @zerothi.

@jakirkham jakirkham force-pushed the minor_fix_scipy_prof branch from 78926a8 to e8cef82 Compare January 19, 2016 18:58
@jakirkham jakirkham force-pushed the minor_fix_scipy_prof branch from e8cef82 to 1153459 Compare January 19, 2016 19:00
@zerothi
Copy link

zerothi commented Jan 19, 2016

Oh, remark that your PR should still do C[:,:] = blas.... calls, otherwise it will de-allocate/allocate C on every call.

@jakirkham
Copy link
Contributor Author

@zerothi, do you know why this dsyrk(1.0, A, c=C, overwrite_c=True) doesn't work? Based on what I have seen in general BLAS documentation and the little bit in SciPy, I would figure this would write the result in C, but it doesn't. I tried this dsyrk(1.0, A, 0.0, C, 0, 0, 1) for good measure, but it doesn't seem to help either.

@zerothi
Copy link

zerothi commented Jan 19, 2016

No, I do not know why. You could perhaps create an issue with scipy? :)
Besides, the other way is (in my opinion) more pythonic:
C[:,:] = blas.*syrk....

@jakirkham
Copy link
Contributor Author

Added ( scipy/scipy#5738 ). Will change. Though I think it would be desirable to benefit from this option.

@jakirkham
Copy link
Contributor Author

Turns out overwrite_c will work, but only on Fortran arrays.

@jakirkham
Copy link
Contributor Author

How does this look?

@wernsaar
Copy link
Contributor

Hi,

dsyrk is defined as:

C := alpha_A'_A + beta*C

I use the benchmark scripts to compare the performance against
the benchmarks written C.

If you don't preallocate and initialize the matrix C, then you will have
malloc and page faults between the start and stop time and page faults
are very expensive on some platforms.

Best regards
Werner

On 01/19/2016 06:35 PM, Zhang Xianyi wrote:

I am not the expert in Python. I think the preallocation C is for
reducing the overhead in timing BLAS function.
@wernsaar https://github.com/wernsaar any ideas?


Reply to this email directly or view it on GitHub
#745 (comment).

@jakirkham
Copy link
Contributor Author

Ok, so, we are still initializing the array in advance. It is just a bit more cleaned up. Also, we are making sure this array is passed in as C so that memory is available to be written into.

BTW, have you seen this ( #730 ), @wernsaar? We are kind of at a loss for why this weird behavior is happening.

@wernsaar
Copy link
Contributor

Hi,

could you please benchmark sgemm and dgemm with scipy on your platform.
I want to know, if there is a similar behavior.

Best regards

Werner

On 01/20/2016 02:53 PM, jakirkham wrote:

Ok, so, we are still initializing the array in advance. It is just a
bit more cleaned up. Also, we are making sure this array is passed in
as |C| so that memory is available to be written into.

BTW, have you seen this ( #730
#730 ), @wernsaar
https://github.com/wernsaar? We are kind of at a loss for why this
weird behavior is happening.


Reply to this email directly or view it on GitHub
#745 (comment).

@jakirkham
Copy link
Contributor Author

So, I did, but I don't see similar behavior. It is the second graph in the issue description. Or is there another way that you would like me to benchmark these?

xianyi added a commit that referenced this pull request Jan 20, 2016
BENCH: Minor fixes in SciPy benchmarks
@xianyi xianyi merged commit fa3018c into OpenMathLib:develop Jan 20, 2016
@jakirkham jakirkham deleted the minor_fix_scipy_prof branch January 20, 2016 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants