WIP: ENH: implement gradient approximation #60
base: main
Conversation
My first guess was that we were building onto the same graph over and over again, such as would be the case if we kept modifying X in place, e.g. `X = X[...]`. But we don't seem to be doing this.

The next thing I would try would be profiling using the single-threaded scheduler and a tool like snakeviz. Learning what is taking up all of the time would probably be pretty informative. If this isn't familiar to you then you might want to look at @jcrist's SciPy 2017 talk on profiling in Dask. I can probably take a closer look at this early next week. I'm excited to see work in this direction.
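For reference, a minimal sketch of that profiling workflow, assuming a recent Dask where `dask.config.set(scheduler="synchronous")` selects the single-threaded scheduler; `slow_expression` is a hypothetical stand-in for whatever dask expression is slow:

```python
import cProfile

import dask
import dask.array as da

# Hypothetical stand-in for the slow computation being investigated.
X = da.random.random((2_000, 10), chunks=(200, 10))
slow_expression = X.dot(X.T).sum()

# Run single-threaded so cProfile sees the real work rather than thread-pool overhead.
with dask.config.set(scheduler="synchronous"):
    cProfile.run("slow_expression.compute()", "profile.prof")

# Then inspect the profile in a browser:
#   $ snakeviz profile.prof
```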
Turns out indexing is required to form the batch randomly each iteration (this method relies on …). It might be faster to index with slices after randomly reshuffling. That is, something like

```python
for k in iterations:
    if k % reshuffle_rate == 0:
        X_keep = shuffle(X_keep)  # likewise for y
    i = np.random.choice(n - batch_size)
    X = X_keep[i:i + batch_size]  # likewise for y
```

I also tried using masked arrays from dask/dask#2301, but they're not the right tool: we will still have to index with some boolean array to get the batch.
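For reference, a runnable NumPy sketch of that reshuffle-then-slice idea (the names `n`, `batch_size`, and `reshuffle_rate` are illustrative assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 1_000, 5
batch_size, reshuffle_rate, iterations = 32, 10, 100

X_keep = rng.randn(n, d)
y_keep = rng.randn(n)

for k in range(iterations):
    if k % reshuffle_rate == 0:
        perm = rng.permutation(n)            # occasionally reshuffle the rows
        X_keep, y_keep = X_keep[perm], y_keep[perm]
    i = rng.randint(n - batch_size)          # contiguous slices avoid fancy indexing
    X_batch = X_keep[i:i + batch_size]
    y_batch = y_keep[i:i + batch_size]
```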
D'oh – I was even at that talk!
I think that ideally we would resolve this with something like the following:

```python
In [1]: import dask.array as da

In [2]: X = da.ones((100, 5), chunks=(10, 5))

In [3]: y = da.ones((100,), chunks=(10,))

In [4]: index = da.random.random(size=y.shape, chunks=y.chunks) > 0.9

In [5]: X2 = X[index, :]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-9e9ddb5053ea> in <module>()
----> 1 X2 = X[index, :]

/home/mrocklin/workspace/dask/dask/array/core.py in __getitem__(self, index)
   1205
   1206         if any(isinstance(i, Array) for i in index):
-> 1207             raise NotImplementedError("Indexing with a dask Array")
   1208
   1209         from .slicing import normalize_index

NotImplementedError: Indexing with a dask Array
```

Unfortunately, as you can see, this isn't implemented in dask.array (but this could be changed easily). I think that this will appropriately avoid the costs that you're seeing and will also be somewhat intuitive.
I'll come up with a temporary workaround.
dask_glm/algorithms.py
Outdated

```python
batch = list(i[:int(batch_size)])
X, y = keep['X'][batch], keep['y'][batch]
Xbeta = X.dot(beta)
func = loglike(Xbeta, y)
```
I wonder if we can downsample the array without creating an explicit list of indices locally. This probably doesn't matter much when doing things on a single machine but may matter on a distributed system.
If we can index with a boolean dask.array of increasing density then that might work more nicely, though I'm not sure.
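A hedged sketch of that idea, assuming a Dask version where indexing with a 1-d boolean dask array is implemented (it wasn't at the time of this thread); the density schedule is an illustrative assumption:

```python
import dask.array as da

X = da.ones((100, 5), chunks=(10, 5))
y = da.ones((100,), chunks=(10,))

for k in range(5):
    frac = min(1.0, 0.1 * 2 ** k)                          # mask density grows each pass
    mask = da.random.random(size=y.shape, chunks=y.chunks) < frac
    X_batch, y_batch = X[mask], y[mask]                    # same mask keeps rows of X and y aligned
    # ... compute the gradient estimate on (X_batch, y_batch) here ...
```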
Sorry, this comment was hanging around from a while ago. Just submitted now.
We can bypass some of this by just operating on each chunk independently. This example might be of use.

```python
In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: X = da.ones((100, 5), chunks=(10, 5))

In [4]: y = da.ones((100,), chunks=(10,))

In [5]: def downsample(chunk, frac):
   ...:     ind = np.random.random(size=len(chunk)) < frac
   ...:     return chunk[ind, ...]
   ...:

In [6]: X2 = X.map_blocks(downsample, frac=0.1, chunks=(np.nan, X.shape[1]), dtype=X.dtype)

In [7]: y2 = y.map_blocks(downsample, frac=0.1, chunks=(np.nan,), dtype=y.dtype)

In [8]: y2.compute()
Out[8]: array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
```

Looking again at your implementation with …
We are downsampling the array (not with some probability, but to a known size). The implementation is old; I've pushed some more commits. These implement the idea in #60 (comment).
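For reference, a hedged sketch of downsampling to a known size with an explicit index list, mirroring the diff above (the names `X_keep`, `y_keep`, and `batch_size` are assumptions):

```python
import numpy as np
import dask.array as da

X_keep = da.ones((100, 5), chunks=(10, 5))
y_keep = da.ones((100,), chunks=(10,))
batch_size = 20

i = np.random.permutation(X_keep.shape[0])   # shuffle all row indices locally
batch = list(i[:batch_size])                 # exactly batch_size of them
X, y = X_keep[batch], y_keep[batch]          # same indices keep the rows of X and y aligned
```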
I am running into an issue with the line search; I've had cases where it stops too early. I'm going to investigate this more. In Numerical Optimization, they describe this same backtracking line search and say …

Currently …
@stsievert -- so I don't remember that choice being particularly special. I also started with that book and used 1e-4, but don't remember a specific reason for changing it. Did you experiment with both Logistic and Normal models? Could be that the effect is larger for one vs the other.
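For context, a minimal generic sketch of the backtracking (Armijo) line search under discussion, with the sufficient-decrease constant `c1 = 1e-4`; this is an illustration, not the PR's code:

```python
import numpy as np

def backtracking_line_search(f, grad, x, p, t=1.0, c1=1e-4, rho=0.5, max_iter=50):
    """Shrink step size t until f(x + t*p) <= f(x) + c1 * t * grad(x).dot(p)."""
    fx = f(x)
    slope = grad(x).dot(p)          # directional derivative; negative for a descent direction
    for _ in range(max_iter):
        if f(x + t * p) <= fx + c1 * t * slope:
            break
        t *= rho                    # backtrack
    return t

# Tiny usage example on a quadratic, stepping along the negative gradient:
f = lambda x: 0.5 * x.dot(x)
grad = lambda x: x
x0 = np.array([3.0, -4.0])
step = backtracking_line_search(f, grad, x0, -grad(x0))
```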
Optimization can take less time if the gradient is approximated using a subset of the examples. This approach is detailed in "Hybrid deterministic-stochastic methods for data fitting", but more examples are needed to approximate the gradient as the solution is approached. This PR is motivated by "Gradient diversity empowers distributed machine learning".
This approximates the hybrid approach described above in section 4 (with the batch size growing as in section 5). In section 4, they describe using an Armijo line search and a scaled direction via L-BFGS. This implementation uses the line search, but with gradient descent rather than L-BFGS.
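A rough, self-contained sketch of the growing-batch idea (not the PR's `gradient_descent`; the least-squares gradient, step size, and growth rate are illustrative assumptions):

```python
import numpy as np

def approx_gradient_descent(X, y, step=0.1, batch_size=32, growth=1.1, max_iter=200):
    """Gradient descent where the gradient is estimated on a growing random subset."""
    n, d = X.shape
    beta = np.zeros(d)
    rng = np.random.RandomState(0)
    for k in range(max_iter):
        m = min(n, int(batch_size * growth ** k))   # batch size grows toward the full data set
        idx = rng.choice(n, size=m, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T.dot(Xb.dot(beta) - yb) / m      # least-squares gradient on the batch
        beta = beta - step * grad
    return beta

# Tiny usage example on synthetic least-squares data:
rng = np.random.RandomState(0)
X = rng.randn(500, 5)
beta_star = rng.randn(5)
y = X.dot(beta_star)
beta_hat = approx_gradient_descent(X, y)
```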
I implemented this in the function `gradient_descent` (which they describe). In the paper, they apply their scaling to L-BFGS and Newton (see section 4.1 and equation 1.6).

Timing results:
Times are shown in seconds. Pointwise loss is the objective function, and relative error is `LA.norm(beta_hat - beta_star) / LA.norm(beta_star)`, which is perhaps a truer measure of convergence.

Work to do:
Gist of timing notebook: https://gist.github.com/stsievert/31f34ee76d7c31bd68acb96ebe8f11ef