
[Fix #30] Fix multithread race condition on global variable _factory #33

Open
wants to merge 1 commit into master
Conversation


@danilo-augusto commented May 23, 2017

A lock is used to avoid a race condition on the global variable `_factory`: as soon as the line `_factory = DetectorFactory()` was executed, the other threads skipped past the `if _factory is None` check and went through `create` and `_create_detector` until they hit the line `if not self.langlist`.

Since the thread that created the global DetectorFactory had not necessarily had time to populate the language profiles yet, the other threads kept raising LangDetectException until langlist was no longer empty (which is not good either: at that point it may contain just one language...).

Since the global variable exists to avoid reloading the profiles again and again (which can take some time), I'm leaving it global, but a lock now ensures that other threads only ever see the fully loaded object, instead of a yet-to-be-populated langlist.

Owner

@Mimino666 left a comment


Thanks for the catch.
I think using threading.local (https://docs.python.org/2/library/threading.html#threading.local) would be more appropriate, since the rest of the Factory code is not thread-safe either.
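For comparison, a minimal sketch of the threading.local alternative suggested here (again with simplified stand-in names, not langdetect's actual code): each thread lazily builds and caches its own private factory, so no locking is needed, at the cost of one copy per thread.

```python
import threading

_local = threading.local()

class DetectorFactory:
    def __init__(self):
        # each thread pays the profile-loading cost itself
        self.langlist = ['en', 'es', 'fr']

def get_factory():
    # the first call in a given thread creates that thread's private
    # factory; later calls in the same thread reuse it
    if getattr(_local, 'factory', None) is None:
        _local.factory = DetectorFactory()
    return _local.factory
```

Within one thread, repeated calls return the same object; a different thread gets its own independent copy, which is exactly the memory/startup trade-off debated below.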

@danilo-augusto
Author

Hi,

Sorry for the (enormous) delay. I've been very busy lately.
I understand your point: you don't want to turn the code into a mix of thread-safe and non-thread-safe parts.

However, the two main functions (detect and detect_langs) provided in the Basic Usage instructions have the same problem, which is currently the biggest obstacle to using multiple threads: they rely on global variables that are modified inside functions/classes (which requires the `global` statement).

So, if you run `$ grep -r 'global' .` inside the repo, you will find two uses of it: one for `_factory` and another for `_messages`, inside utils/messages.py, which I indeed missed in my commit. This opens the door to race conditions again, since messages is imported by ngram, which is imported by detector, which in turn is imported by detector_factory. 😛

However, I still think a global lock should be used, this time on both `_messages` and `_factory`:
if we use a threading.local() to give each thread its own `_factory`, every one of them has to consume (read, preprocess and load into memory) the language profiles itself. The profiles folder weighs `$ du -sh langdetect/profiles` = 2.2M, so with 20 threads we use 44M of memory instead of 2.2M. My C++ soul is not ok with this 😢.

And you could say: 44M is nothing for today's RAM. However, there is a second and more annoying problem: the initial delay to load all the profiles into RAM. As you know, Python has a GIL, and since this loading is pure CPU work, the programmer effectively waits N_THREADS * TIME_TO_LOAD_PROFILES (I tested), because there is no gain in putting one thread to sleep while waking up another (no HTTP requests to wait for, etc.).

With only 10 threads, this time is around 0.5 s using the global lock (my initial solution), but more than 5 seconds using threading.local! And the latter grows to about 11 seconds with 20 threads, and so on. (cf. screenshots below.)
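The effect is easy to reproduce with a toy CPU-bound "profile load" (a busy-wait standing in for the real parsing; the durations are illustrative, not langdetect's): under the GIL, N threads each doing the work take roughly N times as long as N threads sharing one lock-protected initialization.

```python
import threading
import time

def load_profiles():
    # stand-in for the CPU-bound profile parsing (GIL-bound, nothing to overlap)
    end = time.monotonic() + 0.05
    while time.monotonic() < end:
        pass

N_THREADS = 4

# every thread loads its own copy (the threading.local scenario):
# the busy-loops cannot overlap under the GIL, so total wall time
# approaches N_THREADS * single-load time
t0 = time.monotonic()
workers = [threading.Thread(target=load_profiles) for _ in range(N_THREADS)]
for t in workers:
    t.start()
for t in workers:
    t.join()
per_thread = time.monotonic() - t0

# one shared load behind a lock (the global-lock scenario):
# the first thread does the work, the rest just check a flag
lock = threading.Lock()
loaded = False

def load_once():
    global loaded
    with lock:
        if not loaded:
            load_profiles()
            loaded = True

t0 = time.monotonic()
workers = [threading.Thread(target=load_once) for _ in range(N_THREADS)]
for t in workers:
    t.start()
for t in workers:
    t.join()
shared = time.monotonic() - t0

print(per_thread, shared)
```

On any machine the per-thread variant should take close to N_THREADS times longer than the shared one, matching the timings reported above.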

Also, the variable owned by each thread is forgotten when its job finishes, so there is no reuse of previously loaded profiles if the threads are killed and relaunched: think of a program which, instead of keeping daemon threads alive permanently, launches a pool of threads (to speed things up) in the callback of a rare event, to spare resources. The threading.local solution would force the programmer to keep the threads alive and busy-waiting in order to respond quickly. My C++ soul is sad again.

So I'm in favor of using threading.Lock(), but maybe I'm not seeing the disadvantages of it. What do you think?

And thanks for the repo. :)

NB: In the screenshots below I'm running the global-lock solution right after starting ipython and importing the module, so I'm not reusing an already-initialized global variable (i.e. it's a fair comparison!). Subsequent calls with the global lock make the time drop by more than half (0.19 s instead of 0.5 s), so my C++ soul is at peace. 😉
[screen shot 2017-07-07 at 14 31 46]
[screen shot 2017-07-07 at 14 42 35]

@GasparPizarro

GasparPizarro commented Jan 29, 2019

Will this be merged in the near future?

In the meantime, maybe this can be of use:

# With external lock
from langdetect import detect
import threading

from time import time

t0 = time()

the_lock = threading.Lock()

def worker():
    for i in range(5):
        # serialize calls to detect(): whichever thread gets the lock
        # first initializes the factory safely
        with the_lock:
            print(threading.current_thread().name, detect("Uno, dos, tres, catorce."), (time() - t0))

threads = []

for i in range(10):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

# wait for all workers to finish
for t in threads:
    t.join()


# Initializing the factory beforehand
from langdetect import detect
from langdetect.detector_factory import init_factory
import threading

from time import time

t0 = time()

# load the language profiles once, in the main thread,
# before any worker thread calls detect()
init_factory()

def worker():
    for i in range(5):
        print(threading.current_thread().name, detect("Uno, dos, tres, catorce"), (time() - t0))

threads = []

for i in range(10):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

# wait for all workers to finish
for t in threads:
    t.join()

@danilo-augusto
Author

danilo-augusto commented Jan 30, 2019

Sadly, it seems the author has abandoned this repo. In that case it would be thoughtful of him to share control over it; otherwise the open PRs and issues will remain unanswered forever. Regretfully this is a very common problem in open source...
On a side note, people who want this functionality could also use other repositories.
I know of polyglot for example: https://polyglot.readthedocs.io/en/latest/Detection.html.

@StephennFernandes

@danilo-augusto Hey, I'm still facing the same issue you raised. I've been using detect() on my Hugging Face datasets via datasets.map() in a multithreaded setup, and I'm hitting the same errors.

@StephennFernandes

Any update on langdetect?
