
[Fix #30] Fix multithread race condition on global variable _factory #33

Open
wants to merge 1 commit into master
Conversation


@danilo-augusto commented May 23, 2017

A lock is used to avoid a race condition on the global variable `_factory`: as soon as the line `_factory = DetectorFactory()` was executed, the other threads skipped past the `if _factory is None` check and went through `create` and `_create_detector` until they hit the line `if not self.langlist`.

Since the thread that created the global DetectorFactory had not necessarily had time to populate the language profiles yet, the other threads kept raising LangDetectException until langlist was no longer empty (which is not good either: at that point it may contain just one language...).

Since the global variable exists to avoid reloading the profiles again and again (which can take some time), I'm leaving it global, but a lock now ensures that other threads only ever see the fully loaded object, instead of a yet-to-be-populated langlist.

Owner

@Mimino666 left a comment


Thanks for the catch.
I think using threading.local (https://docs.python.org/2/library/threading.html#threading.local) would be more appropriate, since the rest of the Factory code is not thread-safe either.
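For comparison, a minimal sketch of the threading.local alternative suggested here (again with simplified stand-in names, not langdetect's actual code): each thread lazily builds and caches its own private factory, so no locking is needed, at the cost of one copy per thread.

```python
import threading

_local = threading.local()

class DetectorFactory:
    def __init__(self):
        # each thread pays the profile-loading cost itself
        self.langlist = ['en', 'es', 'fr']

def get_factory():
    # the first call in a given thread creates that thread's private
    # factory; later calls in the same thread reuse it
    if getattr(_local, 'factory', None) is None:
        _local.factory = DetectorFactory()
    return _local.factory
```

Within one thread, repeated calls return the same object; a different thread gets its own independent copy, which is exactly the memory/startup trade-off debated below.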

@danilo-augusto
Author

Hi,

Sorry for the (enormous) delay. I've been very busy lately.
I understand your point: you don't want to turn the code into a mix of thread-safe and non-thread-safe parts.

However, the two main functions (detect and detect_langs) provided in the Basic Usage instructions have the same problem, which is currently the biggest obstacle to using multiple threads: they rely on global variables that are modified inside functions/classes (which requires the `global` statement).

So, if you run `$ grep -r 'global' .` inside the repo, you will find two uses of it: one for `_factory` and another for `_messages`, inside utils/messages.py, which I indeed missed in my commit. This opens the door to race conditions again, since messages is imported by ngram, which is imported by detector, which in turn is imported by detector_factory. 😛

However, I still think a global lock should be used, this time on both `_messages` and `_factory`:
if we use a threading.local() to give each thread its own `_factory`, every one of them has to consume (read, preprocess and load into memory) the language profiles itself. The profiles folder weighs `$ du -sh langdetect/profiles` = 2.2M, so with 20 threads we use 44M of memory instead of 2.2M. My C++ soul is not ok with this 😢.

And you could say: 44M is nothing for today's RAM. However, there is a second and more annoying problem: the initial delay to load all the profiles into RAM. As you know, Python has a GIL, and since this loading is pure CPU work, the programmer effectively waits N_THREADS * TIME_TO_LOAD_PROFILES (I tested), because there is no gain in putting one thread to sleep while waking up another (no HTTP requests to wait for, etc.).

With only 10 threads, this time is around 0.5 s using the global lock (my initial solution), but more than 5 seconds using threading.local! And the latter grows to about 11 seconds with 20 threads, and so on. (cf. screenshots below.)
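The effect is easy to reproduce with a toy CPU-bound "profile load" (a busy-wait standing in for the real parsing; the durations are illustrative, not langdetect's): under the GIL, N threads each doing the work take roughly N times as long as N threads sharing one lock-protected initialization.

```python
import threading
import time

def load_profiles():
    # stand-in for the CPU-bound profile parsing (GIL-bound, nothing to overlap)
    end = time.monotonic() + 0.05
    while time.monotonic() < end:
        pass

N_THREADS = 4

# every thread loads its own copy (the threading.local scenario):
# the busy-loops cannot overlap under the GIL, so total wall time
# approaches N_THREADS * single-load time
t0 = time.monotonic()
workers = [threading.Thread(target=load_profiles) for _ in range(N_THREADS)]
for t in workers:
    t.start()
for t in workers:
    t.join()
per_thread = time.monotonic() - t0

# one shared load behind a lock (the global-lock scenario):
# the first thread does the work, the rest just check a flag
lock = threading.Lock()
loaded = False

def load_once():
    global loaded
    with lock:
        if not loaded:
            load_profiles()
            loaded = True

t0 = time.monotonic()
workers = [threading.Thread(target=load_once) for _ in range(N_THREADS)]
for t in workers:
    t.start()
for t in workers:
    t.join()
shared = time.monotonic() - t0

print(per_thread, shared)
```

On any machine the per-thread variant should take close to N_THREADS times longer than the shared one, matching the timings reported above.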

Also, the variable owned by each thread is forgotten when its job finishes, so there is no reuse of previously loaded profiles if the threads are killed and relaunched: think of a program which, instead of keeping daemon threads alive permanently, launches a pool of threads (to speed things up) in the callback of a rare event, to spare resources. The threading.local solution would force the programmer to keep the threads alive and busy-waiting in order to respond quickly. My C++ soul is sad again.

So I'm in favor of using threading.Lock(), but maybe I'm not seeing the disadvantages of it. What do you think?

And thanks for the repo. :)

NB: In the screenshots below I'm running the global-lock solution right after starting ipython and importing the module, so I'm not reusing an already-initialized global variable (i.e. it's a fair comparison!). Subsequent calls with the global lock make the time drop by more than half (0.19 s instead of 0.5 s), so my C++ soul is at peace. 😉
[screen shot 2017-07-07 at 14 31 46]
[screen shot 2017-07-07 at 14 42 35]

@GasparPizarro

GasparPizarro commented Jan 29, 2019

Will this be merged in the near future?

In the meantime, maybe this can be of use:

# With external lock
from langdetect import detect
import threading

from time import time

t0 = time()

the_lock = threading.Lock()

def worker():
    for i in range(5):
        # serialize calls to detect(): whichever thread gets the lock
        # first initializes the factory safely
        with the_lock:
            print(threading.current_thread().name, detect("Uno, dos, tres, catorce."), (time() - t0))

threads = []

for i in range(10):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

# wait for all workers to finish
for t in threads:
    t.join()


# Initializing the factory beforehand
from langdetect import detect
from langdetect.detector_factory import init_factory
import threading

from time import time

t0 = time()

# load the language profiles once, in the main thread,
# before any worker thread calls detect()
init_factory()

def worker():
    for i in range(5):
        print(threading.current_thread().name, detect("Uno, dos, tres, catorce"), (time() - t0))

threads = []

for i in range(10):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

# wait for all workers to finish
for t in threads:
    t.join()

@danilo-augusto
Author

danilo-augusto commented Jan 30, 2019

Sadly, it seems the author has abandoned this repo. In that case it would be thoughtful of him to share control over it; otherwise the open PRs and issues will remain unanswered forever. Regretfully this is a very common problem in open source...
On a side note, people who want this functionality could also use other repositories.
I know of polyglot for example: https://polyglot.readthedocs.io/en/latest/Detection.html.

@StephennFernandes

@danilo-augusto Hey, I'm still facing the same issue you raised. I've been using detect() on my Hugging Face datasets via datasets.map() in a multithreaded setup, and I'm hitting the same errors.

@StephennFernandes

Any update on langdetect?
