WIP: ENH: implement gradient approximation #60
base: main
Conversation
My first guess was that we were building onto the same graph over and over again, such as would be the case if we kept modifying X in place, e.g. `X = X[...]`. But we don't seem to be doing this.

The next thing I would try would be profiling using the single-threaded scheduler and a tool like snakeviz. Learning what is taking up all of the time would probably be pretty informative. If this isn't familiar to you then you might want to look at @jcrist's SciPy 2017 talk on profiling in Dask. I can probably take a closer look at this early next week. I'm excited to see work in this direction.
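For reference, a minimal sketch of that profiling workflow, assuming a recent Dask where `dask.config.set(scheduler="synchronous")` selects the single-threaded scheduler; `slow_expression` is a hypothetical stand-in for whatever dask expression is slow:

```python
import cProfile

import dask
import dask.array as da

# Hypothetical stand-in for the slow computation being investigated.
X = da.random.random((2_000, 10), chunks=(200, 10))
slow_expression = X.dot(X.T).sum()

# Run single-threaded so cProfile sees the real work rather than thread-pool overhead.
with dask.config.set(scheduler="synchronous"):
    cProfile.run("slow_expression.compute()", "profile.prof")

# Then inspect the profile in a browser:
#   $ snakeviz profile.prof
```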
Turns out indexing is required to form the batch randomly each iteration (this method relies on …). It might be faster to index with slices after randomly reshuffling. That is, something like

```python
for k in iterations:
    if k % reshuffle_rate == 0:
        X_keep = shuffle(X_keep)  # likewise for y
    i = np.random.choice(n - batch_size)
    X = X_keep[i:i + batch_size]  # likewise for y
```

I also tried using masked arrays from dask/dask#2301, but they're not the right tool: we will still have to index with some boolean array to get the batch.
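For reference, a runnable NumPy sketch of that reshuffle-then-slice idea (the names `n`, `batch_size`, and `reshuffle_rate` are illustrative assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 1_000, 5
batch_size, reshuffle_rate, iterations = 32, 10, 100

X_keep = rng.randn(n, d)
y_keep = rng.randn(n)

for k in range(iterations):
    if k % reshuffle_rate == 0:
        perm = rng.permutation(n)            # occasionally reshuffle the rows
        X_keep, y_keep = X_keep[perm], y_keep[perm]
    i = rng.randint(n - batch_size)          # contiguous slices avoid fancy indexing
    X_batch = X_keep[i:i + batch_size]
    y_batch = y_keep[i:i + batch_size]
```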
D'oh – I was even at that talk!
I think that ideally we would resolve this with something like the following:

```python
In [1]: import dask.array as da

In [2]: X = da.ones((100, 5), chunks=(10, 5))

In [3]: y = da.ones((100,), chunks=(10,))

In [4]: index = da.random.random(size=y.shape, chunks=y.chunks) > 0.9

In [5]: X2 = X[index, :]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-9e9ddb5053ea> in <module>()
----> 1 X2 = X[index, :]

/home/mrocklin/workspace/dask/dask/array/core.py in __getitem__(self, index)
   1205
   1206         if any(isinstance(i, Array) for i in index):
-> 1207             raise NotImplementedError("Indexing with a dask Array")
   1208
   1209         from .slicing import normalize_index

NotImplementedError: Indexing with a dask Array
```

Unfortunately, as you can see, this isn't implemented in dask.array (but this could be changed easily). I think that this will appropriately avoid the costs that you're seeing and will also be somewhat intuitive.
I'll come up with a temporary workaround.
dask_glm/algorithms.py
Outdated

```python
batch = list(i[:int(batch_size)])
X, y = keep['X'][batch], keep['y'][batch]
Xbeta = X.dot(beta)
func = loglike(Xbeta, y)
```
I wonder if we can downsample the array without creating an explicit list of indices locally. This probably doesn't matter much when doing things on a single machine but may matter on a distributed system.
If we can index with a boolean dask.array of increasing density then that might work more nicely, though I'm not sure.
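A hedged sketch of that idea, assuming a Dask version where indexing with a 1-d boolean dask array is implemented (it wasn't at the time of this thread); the density schedule is an illustrative assumption:

```python
import dask.array as da

X = da.ones((100, 5), chunks=(10, 5))
y = da.ones((100,), chunks=(10,))

for k in range(5):
    frac = min(1.0, 0.1 * 2 ** k)                          # mask density grows each pass
    mask = da.random.random(size=y.shape, chunks=y.chunks) < frac
    X_batch, y_batch = X[mask], y[mask]                    # same mask keeps rows of X and y aligned
    # ... compute the gradient estimate on (X_batch, y_batch) here ...
```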
Sorry, this comment was hanging around from a while ago. Just submitted now.
We can bypass some of this by just operating on each chunk independently. This example might be of use.

```python
In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: X = da.ones((100, 5), chunks=(10, 5))

In [4]: y = da.ones((100,), chunks=(10,))

In [5]: def downsample(chunk, frac):
   ...:     ind = np.random.random(size=len(chunk)) < frac
   ...:     return chunk[ind, ...]
   ...:

In [6]: X2 = X.map_blocks(downsample, frac=0.1, chunks=(np.nan, X.shape[1]), dtype=X.dtype)

In [7]: y2 = y.map_blocks(downsample, frac=0.1, chunks=(np.nan,), dtype=y.dtype)

In [8]: y2.compute()
Out[8]: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
```

Looking again at your implementation with …
We are downsampling the array (not with some probability, but to a known size). The implementation is old; I've pushed some more commits. These implement the idea in #60 (comment).
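For reference, a hedged sketch of downsampling to a known size with an explicit index list, mirroring the diff above (the names `X_keep`, `y_keep`, and `batch_size` are assumptions):

```python
import numpy as np
import dask.array as da

X_keep = da.ones((100, 5), chunks=(10, 5))
y_keep = da.ones((100,), chunks=(10,))
batch_size = 20

i = np.random.permutation(X_keep.shape[0])   # shuffle all row indices locally
batch = list(i[:batch_size])                 # exactly batch_size of them
X, y = X_keep[batch], y_keep[batch]          # same indices keep the rows of X and y aligned
```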
I am running into an issue with the line search; I've had cases where it stops too early. I'm going to investigate this more. In Numerical Optimization, they describe this same backtracking line search and say …

Currently …
@stsievert -- so I don't remember that choice being particularly special. I also started with that book and used 1e-4, but don't remember a specific reason for changing it. Did you experiment with both Logistic and Normal models? Could be that the effect is larger for one vs the other.
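For context, a minimal generic sketch of the backtracking (Armijo) line search under discussion, with the sufficient-decrease constant `c1 = 1e-4`; this is an illustration, not the PR's code:

```python
import numpy as np

def backtracking_line_search(f, grad, x, p, t=1.0, c1=1e-4, rho=0.5, max_iter=50):
    """Shrink step size t until f(x + t*p) <= f(x) + c1 * t * grad(x).dot(p)."""
    fx = f(x)
    slope = grad(x).dot(p)          # directional derivative; negative for a descent direction
    for _ in range(max_iter):
        if f(x + t * p) <= fx + c1 * t * slope:
            break
        t *= rho                    # backtrack
    return t

# Tiny usage example on a quadratic, stepping along the negative gradient:
f = lambda x: 0.5 * x.dot(x)
grad = lambda x: x
x0 = np.array([3.0, -4.0])
step = backtracking_line_search(f, grad, x0, -grad(x0))
```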
Optimization can take less time if the gradient is approximated using a subset of the examples. This approach is detailed in "Hybrid deterministic-stochastic methods for data fitting", but more examples are needed to approximate the gradient as the solution is approached. This PR is motivated by "Gradient diversity empowers distributed machine learning".
This approximates the hybrid approach described above in section 4 (with the batch size growing as in section 5). In section 4, they describe using an Armijo line search and a scaled direction via L-BFGS. This implementation uses the line search, but with gradient descent rather than L-BFGS.
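A rough, self-contained sketch of the growing-batch idea (not the PR's `gradient_descent`; the least-squares gradient, step size, and growth rate are illustrative assumptions):

```python
import numpy as np

def approx_gradient_descent(X, y, step=0.1, batch_size=32, growth=1.1, max_iter=200):
    """Gradient descent where the gradient is estimated on a growing random subset."""
    n, d = X.shape
    beta = np.zeros(d)
    rng = np.random.RandomState(0)
    for k in range(max_iter):
        m = min(n, int(batch_size * growth ** k))   # batch size grows toward the full data set
        idx = rng.choice(n, size=m, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T.dot(Xb.dot(beta) - yb) / m      # least-squares gradient on the batch
        beta = beta - step * grad
    return beta

# Tiny usage example on synthetic least-squares data:
rng = np.random.RandomState(0)
X = rng.randn(500, 5)
beta_star = rng.randn(5)
y = X.dot(beta_star)
beta_hat = approx_gradient_descent(X, y)
```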
I implemented this in the function `gradient_descent` (which they describe). In the paper, they apply their scaling to L-BFGS and Newton (see section 4.1 and equation 1.6).

Timing results:
Times are shown in seconds. Pointwise loss is the objective function, and relative error is `LA.norm(beta_hat - beta_star) / LA.norm(beta_star)`, which is perhaps a truer measure of convergence.

Work to do:
Gist of timing notebook: https://gist.github.com/stsievert/31f34ee76d7c31bd68acb96ebe8f11ef