Training's binary logloss increasing in some iterations in voting-parallel setting #4414
Comments
cc @shiyu1994
@freetz-tiplu Thanks for using LightGBM. What's the size of your data and how many features are there? Could you please turn up
@shiyu1994 Sorry for the late reply.
@freetz-tiplu Could you provide an example, even if the error does not occur every time? I can run the example multiple times to catch the error. It would be very helpful for us to identify the problem. Thank you!
I had an example, but when I converted my data to CSV files and tested it with the C++ code, the problem did not occur.
I created an example, but the error does not occur every time.
Python example:

import lightgbm as lgb
import pandas as pd
import numpy as np
import lightgbm.callback as lgb_callback
import scipy.special
import dask.distributed
"""
How to start:
1. Start dask-scheduler:
> dask-scheduler --host 127.0.0.1
2. Start both dask-workers:
First terminal (workerA):
> dask-worker tcp://127.0.0.1:8786 --name workerA --nthreads 1 --memory-limit="10 GiB"
Second terminal (workerB):
> dask-worker tcp://127.0.0.1:8786 --name workerB --nthreads 1 --memory-limit="10 GiB"
3. Start the script using Python
"""
######################################Helper####################################
def check_metrics(worker_met):
scores = worker_met['train']['logsumexp_logloss_from_raw']
last_value = np.inf
for i, value in enumerate(scores):
if value > last_value:
print(f"Spike in iteration {i}: last_value: {last_value}, value: {value}, diff: {value - last_value}")
last_value = value
######################################Callbacks####################################
def logsumexp_logoss(y_true, raw_score):
y_true = y_true.astype(float)
raw_score = raw_score.astype(float)
return np.mean(np.array([
scipy.special.logsumexp([0, -s]) if y else scipy.special.logsumexp([0, s])
for y, s in zip(y_true, raw_score)]))
def record_evaluation_logsumexp(eval_result, evals):
eval_result.clear()
def callback(env: lgb_callback.CallbackEnv):
bst = env.model
for lgbdata, eval_name in evals:
raw_prediction_score = bst.predict(lgbdata.get_data(), raw_score=True)
y_true = lgbdata.get_label().astype(bool)
score = logsumexp_logoss(y_true, raw_prediction_score)
eval_result.setdefault(eval_name, {})
eval_result[eval_name].setdefault('logsumexp_logloss_from_raw', []).append(score)
callback.order = 30
return callback
######################################Dask distributed functions####################################
def set_data_path(p, local_listen_port):
worker = dask.distributed.get_worker()
worker._data_path = p
worker._local_listen_port = local_listen_port
print(f"Data path set: {p}")
def train_on_workers(lgb_params):
worker = dask.distributed.get_worker()
worker_name = worker.name
print(f"Train on worker {worker_name}")
lgb_params['local_listen_port'] = worker._local_listen_port
# load from CSV
train = pd.read_csv(worker._data_path)
y = train.iloc[:, 0].values
x = train.iloc[:, 1:].values
train_data = lgb.Dataset(x, label=y)
train_sets = [train_data]
train_names = ['train']
metrics = {}
callbacks = [record_evaluation_logsumexp(metrics, list(zip(train_sets, train_names)))]
bst = lgb.train(lgb_params, train_set=train_data, callbacks=callbacks, verbose_eval=3)
return metrics
######################################Main####################################
train_A = "path/to//train27kA.csv" # TOCHANGE
train_B = "path/to//train27kB.csv" # TOCHANGE
dask_scheduler = "127.0.0.1:8786"
worker_lgb_ports = {
'workerA': 12345,
'workerB': 12346
}
dask_client = dask.distributed.Client(dask_scheduler)
workers_info = dask_client.scheduler_info()["workers"]
worker_addresses = {worker_details["name"]: worker_address
for worker_address, worker_details in workers_info.items()}
paths_set = []
for worker_name, worker_address in worker_addresses.items():
if 'A' in worker_name:
p = train_A
local_listen_port = worker_lgb_ports['workerA']
else:
p = train_B
local_listen_port = worker_lgb_ports['workerB']
paths_set.append(
        dask_client.submit(set_data_path, p=p, local_listen_port=local_listen_port, workers=worker_address, pure=False))
dask_client.gather(paths_set)
machines = "127.0.0.1:12346,127.0.0.1:12345"
lgb_params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'binary_logloss',
'metric_freq': 1,
'is_training_metric': True,
# deterministic
'deterministic': True,
'seed': 1,
'data_random_seed': 0,
'force_row_wise': True,
# data and tree
'num_trees': 60,
'num_leaves': 31,
'num_threads': 1,
'tree_learner': 'voting',
'top_k': 20,
# machines
'num_machines': 2,
'machines': machines,
'local_listen_port': 12346,
'device_type': 'cpu',
'verbose': 3,
}
metrics_fut = [dask_client.submit(train_on_workers, lgb_params=lgb_params, workers=worker_address, pure=False)
for worker_address in worker_addresses.values()]
metrics = dask_client.gather(metrics_fut)
print("Check first metrics")
check_metrics(metrics[0])
print("\nCheck second metrics")
check_metrics(metrics[1])

The result I get when I execute this example looks like this:

When testing the data using the LightGBM executable only, I used the following configurations:
trainA/B.conf:
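(The contents of the .conf files did not survive the copy into this post. Purely as a hedged sketch, a trainA.conf mirroring the parameters from the Python example above could look roughly like the following; the data path is a placeholder and this is not the file that was actually used:)

# hypothetical reconstruction for illustration only -- not the actual config file
task = train
objective = binary
metric = binary_logloss
is_training_metric = true
boosting_type = gbdt
tree_learner = voting
top_k = 20
num_trees = 60
num_leaves = 31
num_threads = 1
deterministic = true
seed = 1
data_random_seed = 0
force_row_wise = true
num_machines = 2
machines = 127.0.0.1:12346,127.0.0.1:12345
local_listen_port = 12345
# placeholder path
data = path/to/train27kA.csv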
Also, the logloss differs slightly in general between the Python example and the C++ example starting from the first iteration. Maybe this can help you as well.
Has this been fixed by #4542? If not, we should reopen this issue.
Oh no, I don't think so! Maybe something in the language I used in that issue led to this being closed automatically.
@freetz-tiplu The assumption of voting parallel is that there are enough data samples per node (machine), so that the locally best features are likely to be the globally best features. Therefore, when there are not enough local samples, it is hard to say what will happen.
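(As a rough illustration of this point only, and not LightGBM's actual voting-parallel code, here is a small simulation in which every simulated machine ranks features on its local shard with a crude gain proxy and votes for its top_k; all names and numbers in it are made up:)

# Rough illustration only -- not LightGBM's voting-parallel implementation.
# Each "machine" ranks features on its local shard alone and votes for its
# top_k; with few local rows the local ranking often misses the globally
# best features, which is the assumption described above.
import numpy as np

rng = np.random.default_rng(1)
n_features, top_k = 30, 5
beta = np.zeros(n_features)
beta[:10] = np.linspace(1.0, 0.1, 10)   # features 0..4 carry the strongest signal
true_top_k = set(range(top_k))

def local_top_k(n_rows):
    # crude stand-in for split gain: absolute correlation with the label
    x = rng.normal(size=(n_rows, n_features))
    y = (x @ beta + rng.normal(size=n_rows) > 0).astype(float)
    gains = np.abs([np.corrcoef(x[:, j], y)[0, 1] for j in range(n_features)])
    return set(np.argsort(gains)[-top_k:])

for n_rows in (100, 50_000):
    hits = sum(local_top_k(n_rows) == true_top_k for _ in range(20))
    print(f"{n_rows:>6} rows per machine: local top-{top_k} matched the "
          f"truly best {top_k} features in {hits}/20 simulated shards")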
@guolinke I don't think that this should be a problem in my case. Or are you suggesting that this might be the reason why the error still occurred in my example mentioned above?
@freetz-tiplu Thanks a lot for preparing the reproducible example. I'm back to look into this.
I'm experiencing this issue with lightgbm
I can't tell from the discussion whether this issue was fixed, either by #4542 or by #5153. I see that the issue is open, so perhaps it's unresolved.
Description
Some time ago I encountered the problem that, when I did not set min_data_in_leaf to a value higher than the default, the training's binary logloss would increase in some iterations. Spikes would occur that varied in size: some small, but some really big, as shown in the example.
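(To make that concrete, the workaround amounts to something like the following; the dictionary name and the value 100 are only illustrative placeholders, not the exact settings I used. The LightGBM default for min_data_in_leaf is 20.)

# Illustrative sketch only: raising min_data_in_leaf above its default of 20
# avoided the spikes; the value 100 is a placeholder, not the exact setting used.
lgb_params_workaround = dict(lgb_params)          # lgb_params as in the example above
lgb_params_workaround['min_data_in_leaf'] = 100   # placeholder value > default (20)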
I always used gbdt. The spikes only occurred in a distributed setting, and I could observe them for data- and voting-parallel training. I did not test it in a feature-parallel setting. After some testing I could see that the source of the error had to lie in LightGBM's C++ code, but I could not find the specific location. Then I saw issue #4026. I observed that in my trees the leaf values became really high as well in the iterations where spikes occurred. Installing the fix #4185 solved the problem for the data-parallel case and improved the results for voting-parallel, but small spikes would still occur in the voting-parallel case. This issue is mainly there to point out that #4026 is not yet fully resolved for the voting-parallel case. Perhaps the increase in binary_logloss has another cause in the voting-parallel case as well.

Reproducible example
I did not manage to create a reproducible example, as the error does not happen every time.
Environment info
LightGBM version or commit hash:
first version: 3.1.1.99
version with fix: 3.2.1.99
Command(s) you used to install LightGBM