WIP: ENH: implement gradient approximation #60

Open · wants to merge 4 commits into main

Conversation

@stsievert (Member) commented Sep 1, 2017

Optimization can take less time if the gradient is approximated using a subset of the examples. The approach is detailed in "Hybrid deterministic-stochastic methods for data fitting"; more examples are needed to approximate the gradient well as the solution is approached. This PR is motivated by "Gradient diversity empowers distributed machine learning".

This approximates the hybrid approach described in section 4 of that paper (with the batch size growing as in section 5). In section 4, the authors describe an Armijo line search with a search direction scaled via L-BFGS. This implementation keeps the line search but uses plain gradient descent.

I implemented this in the function gradient_descent. In the paper, the authors apply their scaling to L-BFGS and Newton's method (see section 4.1 and equation 1.6).
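
For intuition, here is a minimal NumPy sketch of the idea, using a toy squared-loss objective rather than dask-glm's GLM families (all names and defaults below are illustrative, not this PR's actual API):

import numpy as np

def loss(beta, X, y):
    # Toy squared-loss objective (illustrative; not dask-glm's loglike).
    r = X.dot(beta) - y
    return 0.5 * r.dot(r) / len(y)

def gradient(beta, X, y):
    return X.T.dot(X.dot(beta) - y) / len(y)

def approx_gradient_descent(X, y, n_iter=100, batch_size=10,
                            grow_rate=1.1, armijo_mult=1e-4, shrink=0.5):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        # Approximate the gradient on a random subset of the examples.
        idx = np.random.permutation(n)[:int(batch_size)]
        Xb, yb = X[idx], y[idx]
        g = gradient(beta, Xb, yb)
        # Backtracking (Armijo) line search on the same subset.
        step, f0 = 1.0, loss(beta, Xb, yb)
        while loss(beta - step * g, Xb, yb) > f0 - armijo_mult * step * g.dot(g):
            step *= shrink
        beta = beta - step * g
        # Grow the batch as the solution is approached (section 5).
        batch_size = min(n, batch_size * grow_rate)
    return beta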

Timing results:

[Two plots: pointwise loss vs. time and relative error vs. time]

Time is shown in seconds. Pointwise loss is the objective function, and relative error is LA.norm(beta_hat - beta_star) / LA.norm(beta_star), which is perhaps a truer measure of convergence.

Work to do:

  • Look into approximating the gradient in other optimization algorithms
  • Profile and optimize
  • Tune defaults: how should the batch size grow, and what should the initial batch size be?

Gist of timing notebook: https://gist.github.com/stsievert/31f34ee76d7c31bd68acb96ebe8f11ef

@stsievert (Member, Author) commented:

I believe re-chunking X is an issue because the batch size grows linearly; see the graph below:

[timing graph omitted]

I believe the code responsible for this is:

i = np.random.permutation(n).astype(int)   # random ordering of all n examples
batch = list(i[:int(batch_size)])          # first batch_size indices form the batch
X, y = keep['X'][batch], keep['y'][batch]  # fancy-index the dask arrays

Is there any way to optimize this?

@mrocklin (Member) commented Sep 1, 2017

My first guess was that we were building onto the same graph over and over again, such as would be the case if we kept modifying X in place

for ...:
    X = X[...]

But we don't seem to be doing this.

The next thing I would try would be profiling using the single-threaded scheduler and a tool like snakeviz. Learning what is taking up all of the time would probably be pretty informative. If this isn't familiar to you then you might want to look at @jcrist's SciPy 2017 talk on profiling in Dask.
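
For example, something like the following (a sketch; the single-threaded scheduler option has been spelled differently across dask versions, this is the modern form):

import dask

# Run single-threaded so cProfile/snakeviz attribute time to the actual
# work rather than to scheduler threads.
with dask.config.set(scheduler='synchronous'):
    result.compute()  # `result` stands in for whatever dask object is being timed

# Then, from the shell:
#   python -m cProfile -o out.prof script.py
#   snakeviz out.prof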

I can probably take a closer look at this early next week. I'm excited to see work in this direction.

@stsievert (Member, Author) commented:

It turns out Array.__getitem__ is the bottleneck, taking about 75% of the time when the batch size is large (at least if I'm interpreting snakeviz correctly). This remains significant whether indexing with NumPy or dask arrays, and most of the time is spent at slicing.py#L58.

Indexing is required to form the batch randomly each iteration. This method relies on X[batch] when batch is a set of random indices, or something that produces the same result.

It might be faster to index with slices after randomly reshuffling. That is, something like

for k in range(n_iterations):
    if k % reshuffle_rate == 0:
        X_keep = shuffle(X_keep)  # random row permutation; likewise for y
    i = np.random.choice(n - batch_size)  # random start for a contiguous slice
    X = X_keep[i:i + batch_size]  # slice instead of fancy indexing; likewise for y

If reshuffle_rate > 1, this violates an assumption used to prove the convergence of mini-batch SGD (samples are not drawn i.i.d.). I don't think this would be an issue in practice, though. I'll try to implement this in the next couple of days.

I also tried using masked arrays from dask/dask#2301, but they're not the right tool: we would still have to index with some boolean array to get the batch, via X[~X.mask].

Quoting @mrocklin: "look at @jcrist's SciPy 2017 talk on profiling in Dask"

D'oh – I was even at that talk!

@mrocklin (Member) commented Sep 2, 2017

I think that ideally we would resolve this with something like the following:

In [1]: import dask.array as da

In [2]: X = da.ones((100, 5), chunks=(10, 5))

In [3]: y = da.ones((100,), chunks=(10,))

In [4]: index = da.random.random(size=y.shape, chunks=y.chunks) > 0.9

In [5]: X2 = X[index, :]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-9e9ddb5053ea> in <module>()
----> 1 X2 = X[index, :]

/home/mrocklin/workspace/dask/dask/array/core.py in __getitem__(self, index)
   1205 
   1206         if any(isinstance(i, Array) for i in index):
-> 1207             raise NotImplementedError("Indexing with a dask Array")
   1208 
   1209         from .slicing import normalize_index

NotImplementedError: Indexing with a dask Array

Unfortunately, as you can see, this isn't implemented in dask.array (but this could be changed easily).

I think that this will appropriately avoid the costs that you're seeing and will also be somewhat intuitive.

@mrocklin (Member) commented Sep 2, 2017

I'll come up with a temporary workaround.

batch = list(i[:int(batch_size)])
X, y = keep['X'][batch], keep['y'][batch]
Xbeta = X.dot(beta)
func = loglike(Xbeta, y)
@mrocklin (Member) commented on the diff:
I wonder if we can downsample the array without creating an explicit list of indices locally. This probably doesn't matter much when doing things on a single machine but may matter on a distributed system.

If we can index with a boolean dask.array of increasing density then that might work more nicely, though I'm not sure.
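
A minimal sketch of that idea (a hypothetical helper; the mask is built chunk-by-chunk, so no explicit index list is materialized on a single machine):

import dask.array as da

def batch_mask(y, frac):
    # Boolean dask array that keeps each row with probability `frac`;
    # raising `frac` over iterations increases the batch density.
    return da.random.random(size=y.shape, chunks=y.chunks) < frac

Though, as the traceback above shows, indexing a dask array with another dask array wasn't implemented at the time.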

@mrocklin (Member) commented on the diff:
Sorry, this comment was hanging around from a while ago. Just submitted it now.

@mrocklin (Member) commented Sep 2, 2017

We can bypass some of this by just operating on each chunk independently. This example might be of use.

In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: X = da.ones((100, 5), chunks=(10, 5))

In [4]: y = da.ones((100,), chunks=(10,))

In [5]: def downsample(chunk, frac):
   ...:     ind = np.random.random(size=len(chunk)) < frac
   ...:     return chunk[ind, ...]
   ...: 

In [6]: X2 = X.map_blocks(downsample, frac=0.1, chunks=(np.nan, X.shape[1]), dtype=X.dtype)

In [7]: y2 = y.map_blocks(downsample, frac=0.1, chunks=(np.nan,), dtype=y.dtype)

In [8]: y2.compute()
Out[8]: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

Looking again at your implementation with np.random.permutation though I now realize that I may not fully understand what your objectives are. Are we downsampling here or is your desired transformation more complex?

@stsievert (Member, Author) commented:

X.map_blocks looks perfect. One small issue is that the batch size is random (not deterministic), but this shouldn't be a huge blocker.

Quoting @mrocklin: "Are we downsampling here or is your desired transformation more complex?"

We are downsampling the array, not with some probability but to a known size. That implementation is old; I've pushed some more commits that implement the idea in #60 (comment).
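
For a known batch size, a hypothetical variant of the downsample function above could draw an exact number of rows from each chunk (exact up to per-chunk rounding):

import numpy as np

def downsample_exact(chunk, frac):
    # Keep exactly round(frac * len(chunk)) rows, sampled without
    # replacement, so the overall batch size is known up front.
    k = int(round(frac * len(chunk)))
    ind = np.random.choice(len(chunk), size=k, replace=False)
    return chunk[ind, ...]

This would drop into the same map_blocks call as above.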

@stsievert (Member, Author) commented:

I am running into an issue with the line search; I've had cases where it stops too early. I'm going to investigate this more. In Numerical Optimization, they describe this same backtracking line search and say:

Inequality (3.4) [the same condition used in dask-glm.compute_stepsize] is sometimes called the Armijo condition ... In practice, c1 [aka armijoMult] is chosen to be quite small, say c1 = 10^-4.

Currently armijoMult == 0.1, and setting armijoMult == 1e-4 didn't affect speed in my test case (though I'm not sure how much I trust my test). It looks like @moody-marlin originally implemented this in bfcf708 -- could you detail your choice of armijoMult == 0.1?
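
For reference, the sufficient-decrease condition being discussed is f(x_k + alpha * p_k) <= f(x_k) + c1 * alpha * grad_f(x_k)^T p_k, where p_k is the search direction (here -grad_f(x_k)) and backtracking shrinks alpha until the inequality holds.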

@cicdw (Collaborator) commented Sep 15, 2017

@stsievert -- I don't remember that choice being particularly special. I also started with that book and used 1e-4, but I don't remember a specific reason for changing it. Did you experiment with both Logistic and Normal models? It could be that the effect is larger for one than the other.

Base automatically changed from master to main February 10, 2021 01:06