support sample weights #29
There seem to be some issues with the tests that use the tornado loop. Is this somehow connected to the fix mentioned in the test? |
The tests also seem to be failing on master. My guess is that something
upstream changed. I'll take a look.
…On Fri, Oct 12, 2018 at 7:22 AM tomlaube ***@***.***> wrote:
There seems to be some issues with the tests using tornado loop, is it
somehow connected to the fix mentions in test?
|
This error goes away if I remove this line from
If memory serves @TomAugspurger added this because xgboost didn't play nicely if it was started and stopped repeatedly within a process. I tried finding a version of dask-distributed that worked with these tests and wasn't able to find anything. Perhaps the pytest-xdist package changed? I'm at a loss here. |
I think that ideally it would be nice to find out what we're doing that makes XGBoost sad when we run many tests sequentially in the same process (by eliminating that line in setup.cfg entirely) |
I tried in dmlc/xgboost#3656 and dmlc/xgboost#2796 but didn't make any progress really. |
I've tried reproducing dmlc/xgboost#3656 on several versions of xgboost (0.80, 0.72.1, 0.7.1, 0.7.post4) and I cannot reproduce it (running 4.15.0-36-generic). So what I would suggest is to remove the line from setup.cfg and add a version bound of xgboost>=0.7. Regarding dmlc/xgboost#2796, the internal state of the rabit engine is not directly visible via the Python bindings, so you cannot really tell whether init has already been called. |
I also have an unrelated question: is there a plan to merge dask-xgboost into dask-ml, so that LabelEncoder is part of the code and the conversion of labels to ints doesn't need to be handled externally? |
If you're able to resolve this either here or in a separate PR that would be very welcome. Pinning above version 0.7 sounds fine to me. |
I've provided a fix using monkey patching and fixtures, where I use a context manager to manage the lifetime of the rabit instance. All tests seem to pass; only the code formatting check failed for some reason. |
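For readers following along, the idea of tying the rabit lifetime to a context manager can be sketched roughly as below. This is not the code from the PR: `rabit_session` and `FakeRabit` are hypothetical names, and the init/finalize callables are passed in rather than imported so the sketch runs without xgboost installed. In a real fixture they would presumably be the `xgb.rabit.init` and `xgb.rabit.finalize` functions that dask-xgboost calls on the workers.

```python
from contextlib import contextmanager


@contextmanager
def rabit_session(init, finalize, args=()):
    """Call ``init`` on entry and ``finalize`` on exit, even if the body raises."""
    init(list(args))
    try:
        yield
    finally:
        finalize()


class FakeRabit:
    """Hypothetical stand-in for the real engine; it only records call order."""

    def __init__(self):
        self.calls = []

    def init(self, args):
        self.calls.append("init")

    def finalize(self):
        self.calls.append("finalize")


fake = FakeRabit()
with rabit_session(fake.init, fake.finalize):
    fake.calls.append("train")  # the body of a test would run here
```

Because finalize sits in a `finally` clause, a failing test body still releases the engine before the next test starts.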
Interesting. I'm curious, is this something that we should change within the dask-xgboost code itself instead of within tests? When should we be calling init/finalize (I'm not very familiar with rabit) |
I don't think that's practical, mostly because rabit assumes gang allocation rather than the incremental allocation that dask does. Imagine that in the constructor you init rabit on all of the workers and in `__del__` you call finalize. If a worker joins in the middle, the task will fail on that node, since it was never initialized. I assume there are better ways to do this in dask, like adding a RabitService to each of the workers, or via the actors API (which would cause problems with passing actor handles and the actors' own lifetime, same as in ray: https://ray.readthedocs.io/en/latest/actors.html#current-actor-limitations) |
Some more questions about Rabit (sorry for going off topic): When you say "gang allocation" I assume that you mean that they all have to arrive at the same time. If so, how do they know when their peers have arrived? We don't tell them how many to expect.
This might be a decent approach: https://distributed.dask.org/en/latest/api.html#distributed.Client.register_worker_callbacks But from what you say above it sounds like we don't want to do this, that calling Lets consider the following situation:
Is there some set of init and finalize calls that makes this workflow feasible? I'm happy to figure out the infrastructure on the dask side to let us call init/finalize at arbitrary points. I genuinely don't know when to call them to be safe, though. |
Yes, by gang allocation I mean that all the workers need to run at the same time. And we do pass the number of workers, it's right here: dask-xgboost/dask_xgboost/core.py Line 40 in 4661c8a
You just don't give it to the rabit instances but to the tracker, which in turn gives you the env that you use to init rabit on each worker, right here: dask-xgboost/dask_xgboost/core.py Line 85 in 4661c8a
The transition from incremental to gang allocation is usually addressed by some synchronization barrier, as for example in Spark's Project Hydrogen https://vimeo.com/274267107 So, technically, the correct solution would be to change the dask scheduler to support this sort of barrier, which would be used to init all the workers, do the work, and finalize. |
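To make the barrier idea concrete, here is a minimal sketch using plain threads and `threading.Barrier` from the standard library. It is only an illustration of the init / work / finalize phasing described above, not a proposal for how the dask scheduler would implement it; `N_WORKERS`, `worker`, and `record` are names made up for the example.

```python
import threading

N_WORKERS = 3
barrier = threading.Barrier(N_WORKERS)  # releases only when all parties arrive
log = []
log_lock = threading.Lock()


def record(event):
    with log_lock:
        log.append(event)


def worker(rank):
    barrier.wait()  # nobody initializes until all peers have arrived
    record(("init", rank))
    barrier.wait()  # nobody trains until every peer is initialized
    record(("train", rank))
    barrier.wait()  # nobody finalizes until every peer has trained
    record(("finalize", rank))


threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Order of ranks within a phase may vary, but the phases never interleave.
phases = [event for event, _ in log]
```

The barrier resets itself after each release, so the same object can gate all three phases.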
Ah! Indeed we do! Thank you for the pointers. It's been a while since I looked at that code. Barriers are pretty easy to do. I suspect that we already have all of the mechanisms we need today. It looks like the tracker on the scheduler side quits after all the workers shut down, which I assume happens during the finalize call. Given this, I'm curious why things failed before. We start a tracker on the scheduler, init a bunch of workers, train, finalize a bunch of workers, presumably the tracker finishes up. What stops us from starting a new tracker and init-ing again? |
That's a good point. But looking at the tracker code more closely, you can see that the join method itself just tries to join another thread in a loop: dask-xgboost/dask_xgboost/tracker.py Line 354 in 4661c8a
so theoretically it should kill itself. Is there some way we can track this? Is there a way to register things at the scheduler and then at the end of train see if they actually shut down? |
Isn't the problem actually much simpler? We run two tasks concurrently, on the same thread, that both call rabit init, which can be called only once per thread. Looking just at the C++ code, the init is thread local so it cannot be two threads. I admit that I don't know much about tornado and the ioloop, but it seems that the order is like so: coroutine1 gets allocated on Thread-1, calls init and yields the execution. In other words, the problem could be the traditional anti-pattern of using thread-local storage with a thread pool. |
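The generic anti-pattern mentioned above can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration, not a diagnosis of the dask-xgboost failure: `init` and `finalize` are toy functions that mimic a once-per-thread engine, and the pool is limited to one thread so that the second task is guaranteed to reuse the thread the first one ran on.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_engine = threading.local()  # per-thread state, like a thread-local engine


def init():
    if getattr(_engine, "initialized", False):
        raise RuntimeError("init called twice on the same thread")
    _engine.initialized = True


def finalize():
    _engine.initialized = False


def task_without_finalize():
    init()  # leaves the pooled thread marked as initialized
    return "ok"


def task_with_finalize():
    init()
    try:
        return "ok"
    finally:
        finalize()  # resets the thread-local state before the thread is reused


with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(task_without_finalize).result()
    try:
        second = pool.submit(task_without_finalize).result()
    except RuntimeError as exc:
        second = str(exc)

with ThreadPoolExecutor(max_workers=1) as pool:
    clean = [pool.submit(task_with_finalize).result() for _ in range(2)]
```

The first pool fails on its second task because the state from the earlier task is still attached to the reused thread; the second pool succeeds because each task cleans up after itself.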
Yes, if the functions passed to the
diff --git a/dask_xgboost/core.py b/dask_xgboost/core.py
index 6bf29d7..c843a00 100644
--- a/dask_xgboost/core.py
+++ b/dask_xgboost/core.py
@@ -34,7 +34,7 @@ def parse_host_port(address):
     return host, port
-def start_tracker(host, n_workers):
+def start_tracker(host, n_workers, dask_scheduler=None):
     """ Start Rabit tracker """
     env = {'DMLC_NUM_WORKER': n_workers}
     rabit = RabitTracker(hostIP=host, nslave=n_workers)
@@ -45,6 +45,7 @@ def start_tracker(host, n_workers):
     thread = Thread(target=rabit.join)
     thread.daemon = True
     thread.start()
+    dask_scheduler.xgboost_thread = thread
     return env
@@ -155,6 +156,13 @@ def _train(client, params, data, labels, dmatrix_kwargs={}, **kwargs):
     num_class = params.get("num_class")
     if num_class:
         result.set_attr(num_class=str(num_class))
+
+    def wait_on_tracker_thread(dask_scheduler):
+        dask_scheduler.xgboost_thread.join()
+        del dask_scheduler.xgboost_thread
+
+    yield client.run_on_scheduler(wait_on_tracker_thread)
+
     raise gen.Return(result)
You could also add other operations in the
When I add this diff my tests fail and pass as before. |
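One possible variation on the tracking idea, sketched here with plain threads only: join the stored thread with a timeout and report whether it actually exited, so that a tracker that never shuts down shows up as a `False` result instead of a hang. `FakeScheduler`, `start_tracker`, and `tracker_finished` are hypothetical names for this example, and the `Event` stands in for "all workers have finalized"; none of this is taken from the dask or dask-xgboost code.

```python
import threading


class FakeScheduler:
    """Hypothetical stand-in for the scheduler object; it only holds the thread."""


def start_tracker(scheduler, stop_event):
    # The thread blocks until stop_event is set, like a tracker waiting on workers.
    thread = threading.Thread(target=stop_event.wait, daemon=True)
    thread.start()
    scheduler.xgboost_thread = thread


def tracker_finished(scheduler, timeout=1.0):
    """Return True if the tracker thread exited within ``timeout`` seconds."""
    thread = scheduler.xgboost_thread
    thread.join(timeout)
    return not thread.is_alive()


scheduler = FakeScheduler()
stop = threading.Event()
start_tracker(scheduler, stop)

still_running = not tracker_finished(scheduler, timeout=0.1)
stop.set()  # simulates all workers having finalized
finished = tracker_finished(scheduler, timeout=1.0)
```

A check like this could be run at the end of training to log whether the tracker really went away before the next one is started.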
I believe that we only call init on the workers within tasks. These are always run in separate threads outside of the tornado event loop. The tornado event loop handles communication and administrative work while the thread pool handles all user code. |
@mrocklin was your patch in
diff --git a/.circleci/config.yml b/.circleci/config.yml
index f1463079..72faf516 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -16,7 +16,7 @@ jobs:
conda config --add channels conda-forge
conda create -q -n test-environment python=${PYTHON}
source activate test-environment
- conda install -q coverage flake8 pytest pytest-cov pytest-xdist numpy pandas xgboost dask distributed scikit-learn sparse scipy
+ conda install -q coverage flake8 pytest pytest-cov numpy pandas xgboost dask distributed scikit-learn sparse scipy
pip install -e .
conda list test-environment
- run:
diff --git a/dask_xgboost/core.py b/dask_xgboost/core.py
index 6bf29d78..c843a000 100644
--- a/dask_xgboost/core.py
+++ b/dask_xgboost/core.py
@@ -34,7 +34,7 @@ def parse_host_port(address):
     return host, port
-def start_tracker(host, n_workers):
+def start_tracker(host, n_workers, dask_scheduler=None):
     """ Start Rabit tracker """
     env = {'DMLC_NUM_WORKER': n_workers}
     rabit = RabitTracker(hostIP=host, nslave=n_workers)
@@ -45,6 +45,7 @@ def start_tracker(host, n_workers):
     thread = Thread(target=rabit.join)
     thread.daemon = True
     thread.start()
+    dask_scheduler.xgboost_thread = thread
     return env
@@ -155,6 +156,13 @@ def _train(client, params, data, labels, dmatrix_kwargs={}, **kwargs):
     num_class = params.get("num_class")
     if num_class:
         result.set_attr(num_class=str(num_class))
+
+    def wait_on_tracker_thread(dask_scheduler):
+        dask_scheduler.xgboost_thread.join()
+        del dask_scheduler.xgboost_thread
+
+    yield client.run_on_scheduler(wait_on_tracker_thread)
+
     raise gen.Return(result)
diff --git a/setup.cfg b/setup.cfg
index 2348f495..11894603 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -5,4 +5,4 @@ universal=1
exclude = tests/data,docs,benchmarks,scripts
[tool:pytest]
-addopts = -rsx -v -n 1 --boxed
+addopts = -rsx -v
|
I'm not sure that the patch I provided ever fixed the problem. I think it was intended to start folks on how to track things. I don't recall much here though. |
Gotcha. Do you have thoughts on the mocking changes to the test runner? I'd
prefer to not mock if possible, but won't have time to
revisit the failures till later in the week.
…On Wed, Nov 7, 2018 at 8:25 AM Matthew Rocklin ***@***.***> wrote:
I'm not sure that the patch I provided ever fixed the problem. I think it
was intended to start folks on how to track things. I don't recall much
here though.
|
To be honest I haven't thought much about the mocking changes |
Basic support for sample weights. Please don't merge yet.