Dataset construction uses all threads on the machine #5124

ivannp · 2022-04-03T17:31:58Z

Description

Passing nthreads to the lightgbm.Dataset constructor (via the params argument) doesn't seem to be taken into account: construct() appears to use all cores on the machine during some phases. I would expect construct() to be bounded by the maximum number of threads specified.

Reproducible example

Loading large dataset via a hand-crafted Sequence object.

Environment info

LightGBM version or commit hash: 3.2.1

jameslamb · 2022-04-04T00:40:09Z

Thanks for using LightGBM! We need some more information from you before we can help.

Are you able to provide a minimal, reproducible example that demonstrates this behavior?
- "Loading large dataset via a hand-crafted Sequence object" is not sufficient information for maintainers here to understand what you did and offer a suggestion without significant guessing.
Can you please provide some of the other information that was requested in the issue template when you clicked "new issue"? Like:
- what programming language are you using?
- how did you install LightGBM?
Can you try to install the latest version of LightGBM from source in this repo, or at least the latest released version (v3.3.2), and let us know if you still see this behavior?

StrikerRUS · 2022-04-05T00:06:48Z

I think this issue and #4598 have the same root cause.

jameslamb · 2022-04-10T06:40:25Z

Investigating #4598, I found substantial evidence that passing num_threads through Dataset parameters should correctly result in changing the number of threads used in Dataset construction: #4598 (comment).

I really think we need a reproducible example to be able to investigate this report further. Otherwise, solving this conclusively will require significant research and guessing to try to figure out what combination of parameters, LightGBM version, and Python code reproduces this behavior.

ivannp · 2022-04-23T13:49:26Z

#4598 seems to investigate whether or not parallelism is enabled. The intended claim of this issue is that during some stages of the dataset construction ALL threads on the machine are used, ignoring the actual num_threads. The dataset doesn't matter much, it's the behavior of parallelism. At best, I can provide you with a screenshot of htop during the dataset construction.
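A numeric alternative to an htop screenshot is to compare CPU time against wall-clock time around the call. The sketch below is only an illustration under assumptions: the helper names cpu_to_wall_ratio and measure_construct are hypothetical and not part of LightGBM. The ratio is an average over the whole call, so a short phase that briefly saturates every core can still be hidden behind a low number.

```python
import time


def cpu_to_wall_ratio(fn):
    """Run fn() once; return (wall_seconds, cpu_seconds, cpu_seconds / wall_seconds).

    time.process_time() sums user + system CPU time over all threads of the
    current process, so a ratio near N suggests about N cores were busy on average.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else float("nan")
    return wall, cpu, ratio


def measure_construct(X, y, num_threads):
    """Hypothetical helper: measure Dataset construction for a given num_threads."""
    import lightgbm as lgb  # imported lazily so cpu_to_wall_ratio stays dependency-free

    ds = lgb.Dataset(X, label=y, params={"verbose": -1, "num_threads": num_threads})
    return cpu_to_wall_ratio(ds.construct)
```

If measure_construct(X, y, num_threads=4) reported a ratio well above 4 on a machine with more cores, that would support the claim in this issue; a ratio at or below 4 would be consistent with the setting being respected on average.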

jameslamb · 2024-04-23T03:23:30Z

I believe this is fixed in newer versions of LightGBM. Specifically, I think that #6226 fixed this.

I got a c5a.4xlarge EC2 instance on AWS tonight (16 vCPUs).

Built LightGBM like this:

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
sh build-python.sh bdist_wheel install

Created a fairly expensive Dataset construction task:

- 1 million rows (matching the 1_000_000 used in make-data.py)
- 100 features
- no limit on histogram bin sizes (min_data_in_bin = 1)
- up to 10,000 bins per feature

cat << EOF > make-data.py
import numpy as np

X = np.random.random(size=(1_000_000, 100))
y = np.random.random(size=(X.shape[0],))
np.save("X.npy", X)
np.save("y.npy", y)
EOF

python ./make-data.py

cat << EOF > check-multithreading.py
import os
import time

import lightgbm as lgb
import numpy as np

X = np.load("X.npy")
y = np.load("y.npy")
ds = lgb.Dataset(
    X,
    y,
    params={
        "verbose": -1,
        "min_data_in_bin": 1,
        "max_bin": 10000
    }
)
tic = time.time()
ds.construct()
toc = time.time()
num_threads = os.environ.get("OMP_NUM_THREADS", None)
print(f"threads: {num_threads} | execution time (s): {round(toc - tic, 3)}")
EOF

Tested with OMP_NUM_THREADS=1...

OMP_NUM_THREADS=1 \
    python ./check-multithreading.py
# threads: 1 | execution time (s): 22.849

... and OMP_NUM_THREADS=4 (there are 16 total vCPUs available)

OMP_NUM_THREADS=4 \
    python ./check-multithreading.py
# threads: 4 | execution time (s): 6.156

... and with OMP_NUM_THREADS not set at all

unset OMP_NUM_THREADS

python ./check-multithreading.py
# threads: None | execution time (s): 2.396
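One way to read these three timings is as speedup relative to the single-threaded run. The arithmetic below is a small sketch that assumes the run with OMP_NUM_THREADS unset used all 16 vCPUs, which is the usual OpenMP default but was not measured directly here.

```python
# Execution times (seconds) reported above, keyed by thread count.
# Assumption: the run with OMP_NUM_THREADS unset used all 16 vCPUs.
timings = {1: 22.849, 4: 6.156, 16: 2.396}

baseline = timings[1]
results = {}
for threads, seconds in timings.items():
    speedup = baseline / seconds      # how many times faster than 1 thread
    efficiency = speedup / threads    # fraction of ideal linear scaling
    results[threads] = (speedup, efficiency)
    print(f"threads: {threads:>2} | speedup: {speedup:.2f}x | efficiency: {efficiency:.0%}")
```

Construction time drops as the thread limit rises and is slowest when limited to one thread, which is what would be expected if the limit is being respected.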

For completeness, I repeated this same exercise with the environment variable OMP_NUM_THREADS unset and passing different values to the Dataset parameter num_threads, and found the same thing.
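A sketch of what that num_threads variant could look like follows. It is not the script that was actually run: the helper names dataset_params, time_construct, and main are hypothetical, and it assumes the X.npy and y.npy files produced by make-data.py are in the working directory.

```python
import time


def dataset_params(num_threads):
    """Same Dataset parameters as check-multithreading.py, plus num_threads."""
    return {
        "verbose": -1,
        "min_data_in_bin": 1,
        "max_bin": 10000,
        "num_threads": num_threads,
    }


def time_construct(X, y, num_threads):
    """Return the seconds spent in Dataset.construct() for a given num_threads."""
    import lightgbm as lgb  # imported lazily; only needed when timing for real

    ds = lgb.Dataset(X, label=y, params=dataset_params(num_threads))
    tic = time.perf_counter()
    ds.construct()
    return time.perf_counter() - tic


def main():
    import numpy as np

    X = np.load("X.npy")
    y = np.load("y.npy")
    for num_threads in (1, 4, 16):
        seconds = time_construct(X, y, num_threads)
        print(f"num_threads: {num_threads} | execution time (s): {round(seconds, 3)}")
```

Calling main() with OMP_NUM_THREADS unset times each thread count in turn within one process. Running each setting in a fresh process, as in the shell commands above, is the safer comparison if there is any doubt about state carried over between constructions.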

jameslamb added the awaiting response label Apr 4, 2022

StrikerRUS added bug and removed awaiting response labels Apr 5, 2022

StrikerRUS added the awaiting response label Apr 10, 2022

jameslamb mentioned this issue Apr 14, 2022
- [RFC] 4.0.0 Release #5153 (Closed)

github-actions bot removed the awaiting response label Apr 23, 2022

This was referenced Oct 8, 2023
- factor out uses of omp_get_num_threads() and omp_get_max_threads() outside of OpenMP wrapper #6133 (Merged)
- set explicit number of threads in every OpenMP parallel region #6135 (Merged)

jameslamb mentioned this issue Dec 5, 2023
- [R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) #6226 (Merged)

jameslamb closed this as completed Apr 23, 2024
