
@debasish83
Contributor

Memory optimizations are done to bring the optimize.linear.NNLS runtime closer to mllib NNLS, and the optimize.proximal.QuadraticMinimizer default closer to blas.dposv.

NNLS: iterator pattern cleaned for speed, in-place gemv added, initialState API provided for state reuse.
PowerMethod: specialized on DenseMatrix and DenseVector for speed.
QuadraticMinimizer: iterator pattern cleaned for speed, memory optimization to bring runtime close to dposv.
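For illustration, a minimal sketch of the allocation-free pattern above, delegating to netlib-java's dgemv the way breeze does; the State shape and the initialState/gradient signatures are hypothetical stand-ins, not this PR's actual API:

import breeze.linalg.{DenseMatrix, DenseVector}
import com.github.fommil.netlib.BLAS.{getInstance => blas}

// Hypothetical workspace holding everything the inner loop needs, allocated once.
case class State(x: DenseVector[Double], grad: DenseVector[Double])

def initialState(n: Int): State =
  State(DenseVector.zeros[Double](n), DenseVector.zeros[Double](n))

// One gradient evaluation with no allocation: grad := gram * x + q, computed by
// an in-place dgemv (y := alpha*A*x + beta*y) into the preallocated buffer.
// Assumes zero-offset, unit-stride vectors.
def gradient(gram: DenseMatrix[Double], q: DenseVector[Double], s: State): Unit = {
  System.arraycopy(q.data, 0, s.grad.data, 0, q.length)
  blas.dgemv("N", gram.rows, gram.cols, 1.0, gram.data, gram.rows,
    s.x.data, 1, 1.0, s.grad.data, 1)
}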

@debasish83
Contributor Author

I will take a closer look at comparisons with the ML Cholesky solver tomorrow...the first iteration is always slow, and I am not quite sure why...

@debasish83
Contributor Author

By the way, sorry for turning the code into C-style code, but I had to make sure no memory is allocated in the whole algorithm except through NNLS.initialize and QuadraticMinimizer.initialize, since we are compared against native BLAS dposv :-)

When you review, please let me know if you see any additional memory allocation in the algorithm's inner loop (the iterations), since I am using a lot of breeze overloaded functions and might have missed things...

Maybe there are ways to optimize initialize further so that the first iteration also comes close to the mllib numbers...
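For example, breeze's operator sugar can hide allocations that the in-place forms avoid (an illustrative sketch, not code from this PR):

import breeze.linalg.DenseVector

val x = DenseVector.rand(1000)
val y = DenseVector.rand(1000)

val z = x + y // allocates a fresh result vector on every call
x += y        // in-place: mutates x, nothing allocated inside a loop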

@dlwh
Member

I believe this is not what you want: it changes the operation. An alphanumeric infix method like dot has lower precedence than +, so 3 dot 2 + 3 parses as 3.dot(2 + 3).

scala> implicit class Foo(x: Int) { def dot(y: Int) = x * y }
defined class Foo

scala> 3 dot 2 + 3
res1: Int = 15

scala> 3.dot(2) + 3
res2: Int = 9

@debasish83
Contributor Author

My bad...fixed it

@debasish83
Contributor Author

Any memory-specific optimization? Most of the time I run QuadraticMinimizer first, so hotspot warm-up is an issue...I will try the same run after warming the JVM tomorrow and finish up the API changes in the AM.

@dlwh
Member

dlwh commented Mar 23, 2015

Remember it's not just that the JVM is warming up: it has to warm up for each individual class. In general, methods aren't fully optimized until they're called about 10,000 times. So you should really run the algorithm several times, then time it.


@debasish83
Contributor Author

I added an API to provide the upper-triangular gram matrix, and with that the runtime of the first iteration also dropped...I think QuadraticMinimizer should be able to replace the ML CholeskySolver now...
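A sketch of what such an entry point could look like, going through netlib-java's dposv; the solveUpper name and signature are hypothetical. With uplo = "U", dposv never reads the lower triangle, so the caller only has to fill the upper half:

import com.github.fommil.netlib.LAPACK.{getInstance => lapack}
import org.netlib.util.intW

// Solve gram * x = q for symmetric positive-definite gram supplied as a
// column-major primitive array with only its upper triangle populated.
def solveUpper(gram: Array[Double], q: Array[Double], n: Int): Array[Double] = {
  val x = q.clone()    // dposv overwrites the RHS with the solution
  val a = gram.clone() // ...and overwrites A with its Cholesky factor
  val info = new intW(0)
  lapack.dposv("U", n, 1, a, n, x, n, info)
  require(info.`val` == 0, s"dposv failed: info=${info.`val`}")
  x
}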

…x provided as primitive array for supporting normal equations
@debasish83
Contributor Author

The first-iteration issue is consistent across both NNLS and QuadraticMinimizer...Out of curiosity, I looked at the code, and both mllib and Breeze back their matrix and vector workspaces with Array[Double]...so I am really not clear why there is a 2X difference only in the initial iterations....Is it due to the overhead from traits that show up in DenseVector/DenseMatrix?

During the solve things are clean, so I don't think there are cases where native-memory BLAS is faster than QuadraticMinimizer handing its memory to LAPACK to work on...

@dlwh
Member

dlwh commented Mar 24, 2015

Using a DenseVector for the first time incurs a lot of overhead: operators are populated into the multimethod maps, lots of interfaces are loaded, etc. You should really only clock these after running the exact same code path at least once. E.g.

def time(N: Int)(x: => Unit): Long = {
  // Warm up: run the exact same code path N times so the JIT fully optimizes it
  for (i <- 0 until N) x

  val in = System.currentTimeMillis()
  x
  val out = System.currentTimeMillis()
  out - in // elapsed millis for a single post-warmup run
}
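So, with a hypothetical qm.minimize(h, q) standing in for the solver call, something like time(10000) { qm.minimize(h, q) } reports the cost of one post-warmup run.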


@dlwh
Member

dlwh commented Mar 25, 2015

What's the status here? Can I merge this? I really want to release a fix for the SparseVector bug.

@debasish83
Contributor Author

I am OK with this...moving from DenseVector/DenseMatrix to Array would make the code ugly.

dlwh added a commit that referenced this pull request Mar 25, 2015
@dlwh merged commit 7be3895 into scalanlp:master Mar 25, 2015
@debasish83
Contributor Author

Please let me know when you cut 0.11.2...

@dlwh
Member

dlwh commented Mar 25, 2015

tonight

