Fix GeneratorEnqueuer Multithreading on Windows (and Linux...) #8662
Conversation
At this point, what's the use case of using multiple threads for a generator? Is there any speedup doing multithreading, since the GIL would kill any concurrency anyway? Shouldn't we just limit the … I cannot speak for the Windows case since I do not own one.
Good questions, Frédéric. This is really just a bug fix and is mostly about feature parity (not performance improvements) between Linux and Windows. I don't aim to support new use cases/scenarios either. If you need to know what my specific use case is, I have a personal interest in getting this project to work on Windows. Perhaps your questions point to a larger interface design issue. I agree with you on the limitations imposed by the GIL. It seems to me that your valid concerns really apply to all platforms, and I wasn't attempting to address them with this fix. As of today, multi-process and multi-threaded generators are simply broken on Windows. And, by broken, I do mean code execution crashes (you'll see there are several bugs that have been reported over time). Being a huge fan of Keras, I don't want to be forced to move to a different high-level deep learning library (Gluon?) because crashes on Windows are simply tolerated and bugs don't get fixed on the platform I have to support. I will leave it to you to come up with what I'm sure will be good answers to the larger Keras API issue.
setattr(e, '__traceback__', None)
elif not hasattr(e, '__traceback__'):
    setattr(e, '__traceback__', sys.exc_info()[2])
traceback.print_exc()
Why remove the traceback when using threads?
This is a question better asked to @de-vri-es who made commit 4a58b178073f0ba3b166220f7ebd7d56149bfb20
The exception is put in the queue and rethrown in the main thread. Printing the traceback twice is no good. Also, if someone decides to rethrow the exception you don't want the tracebacks printed at all.
The only reason I kept the print in the multiprocessing=True case is that the traceback can't be pickled, and can't be put in an inter-process queue.
Unconditionally printing the traceback is not a good idea in my opinion.
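The pickling limitation mentioned here is easy to demonstrate; a minimal sketch (not Keras code):

```python
import pickle
import sys

# Capture a real traceback object.
try:
    1 / 0
except ZeroDivisionError:
    tb = sys.exc_info()[2]

# Traceback objects cannot be pickled, which is why they cannot be
# shipped through an inter-process queue alongside the exception.
try:
    pickle.dumps(tb)
    picklable = True
except TypeError:
    picklable = False

print(picklable)  # False
```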
I meant "if someone decides to catch the exception you don't want the traceback printed at all". I shouldn't do this from my phone, apparently...
It looks like you undid my work in a merge conflict here.
I don't believe so. The test for multiprocessing=True has been taken out of the while loop for clarity (there is now one code path for multiprocessing and one for multithreading). The diff displayed by GitHub appears misleading, however. @de-vri-es, would you mind looking at the files side by side and telling me if you agree? I would greatly appreciate it.
Oh sorry, I didn't see the full diff (I really shouldn't have continued on my phone). I see the if/else
is still there but much earlier. Looks good =]
keras/utils/data_utils.py
Outdated
self._stop_event = threading.Event()

for _ in range(workers):
    if self._use_multiprocessing:
        # Reset random seed else all children processes
        # share the same seed
        np.random.seed(self.seed)
        thread = multiprocessing.Process(target=data_generator_task)
        thread = multiprocessing.Process(target=self.data_generator_task, args=())
remove unused keyword args
Fixed in latest commit
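For context on why moving the local function onto the class fixes Windows: `multiprocessing` there uses the spawn start method, which must pickle the process target, and locally defined functions cannot be pickled. A minimal sketch (hypothetical names, not the Keras source):

```python
import pickle

def start():
    # Mimics the old Keras code: a function defined inside start().
    def data_generator_task():
        pass
    return data_generator_task

# This is what spawn-based multiprocessing must do with the target.
try:
    pickle.dumps(start())
    ok = True
except (pickle.PicklingError, AttributeError):
    ok = False

print(ok)  # False: local functions cannot be pickled
```

A bound method of a module-level class does not have this problem, which is the essence of the fix.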
tests/test_multiprocessing.py
Outdated
import pytest
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense
from keras.utils.test_utils import keras_test

gsteps_per_epoch = 100
constants should be upper case
Fixed in latest commit. Apologies for ignoring PEP 8 errors and warnings in the previous commit. This has been taken care of as well.
Your PR is fine; if multiprocessing doesn't work on Windows, we shouldn't try to support it.
Thanks, Frédéric.
keras/utils/data_utils.py
Outdated
break
except Exception as e:
    # Can't pick tracebacks.
    # As a compromise, print the traceback and pickle None instead.
This comment was only valid for the multiprocessing branch. It can be removed here.
The real underlying issue here seems to be that Python generators simply aren't suitable for multi-threading or multiprocessing. In the case of multiprocessing you'll always get each worker generating the same sequence, and in the case of multi-threading there can't really be any parallelism. An interface which implies to do something impossible without actually doing it does more harm than good, I think. How about adding deprecation warnings to using generators with more than 1 worker and eventually removing support altogether? If there is a valid use case we should come up with an interface which does actually work. Sadly, that probably means no Python generators, nice as they are.
There is a warning: here
generator_output = next(self._generator)
self.queue.put((True, generator_output))
else:
    time.sleep(self.wait_time)
Sleeping with a lock held seems bad. Shouldn't it be sufficient to guard this line with the lock:
generator_output = next(self._generator)
In my (admittedly limited) experience with generators, the thread is mostly waiting on getting an available slot in the queue rather than on getting samples from the generator. I guess this is all very dependent on your use case scenario.
With the above in mind, I believe it is much better to sleep than to busy-wait on the queue no longer being full (that's essentially what the self.queue.qsize() < self.max_queue_size test is about).
Of course, a much better fix would have the lock replaced with a condition variable and signalling with the queue as well as the generator. I assumed you decided not to go that way in the first place because you wanted to make it easy on developers to write generators. Or, perhaps, because of the limitations of multithreading in Python, you decided there was absolutely no point in making it much harder on developers by requiring them to adhere to a much more sophisticated synchronization mechanism for a feature that is so clobbered by the GIL anyway...
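For illustration, the condition-variable alternative mentioned here could look roughly like the following sketch (hypothetical `BoundedProducer` class, not the Keras implementation): the producer waits on the condition instead of sleeping, and is woken when the consumer frees a slot.

```python
import threading
from collections import deque

class BoundedProducer:
    """Sketch: a Condition replaces sleep-while-holding-the-lock.
    Hypothetical class for illustration only, not Keras code."""

    def __init__(self, generator, max_queue_size):
        self.generator = generator
        self.max_queue_size = max_queue_size
        self.queue = deque()
        self.cond = threading.Condition()
        self.stopped = False

    def produce_one(self):
        with self.cond:
            # Wait (without busy-waiting) until there is a free slot.
            while len(self.queue) >= self.max_queue_size and not self.stopped:
                self.cond.wait()
            if self.stopped:
                return False
            self.queue.append(next(self.generator))
            self.cond.notify_all()  # wake consumers waiting for data
            return True

    def consume_one(self):
        with self.cond:
            while not self.queue:
                self.cond.wait()
            item = self.queue.popleft()
            self.cond.notify_all()  # wake producers waiting for a slot
            return item

# Single-threaded demo of the handshake.
p = BoundedProducer(iter(range(5)), max_queue_size=2)
p.produce_one()
p.produce_one()
print(p.consume_one())  # 0
```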
Well, it's not really a busy wait since it's mostly sleeping. However, I have to agree that there isn't that much to be gained by holding the lock for a shorter time.
I think it would only matter if putting the object in the queue were a time-consuming action (which it isn't for the threading case).
I was also bothered by the sleep inside the lock and thought to correct it. Even though the code can be refactored to avoid it, we CAN'T lock only the generator_output = next(self._generator) line. This will not work. I thought to put a comment here, in case someone tries to do this in the future.
This PR gives the queue a fixed size. If we don't protect both the iterator and the addition to the queue, we can end up with threads being stuck while trying to add to the queue; and because the stop() operation joins and waits for the threads to finish, this would deadlock. To avoid sleeping in a lock, the function needs to be heavily refactored and, to echo @philferriere, might not be 100% worth it.
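The point about protecting both steps can be sketched like this (a simplified, hypothetical version of the worker loop, not the PR's exact code): the free-slot check, the next() call, and the put() all happen under one lock, so a worker never blocks inside put() on a full queue.

```python
import queue
import threading
import time

def data_generator_task(generator, out_queue, lock, max_queue_size,
                        stop_event, wait_time=0.01):
    # Simplified, hypothetical worker loop: the queue-space check, the
    # next() call and the put() are guarded by one lock, so no worker
    # can block inside put() on a full queue (which would hang a join()).
    while not stop_event.is_set():
        with lock:
            if out_queue.qsize() < max_queue_size:
                try:
                    item = next(generator)
                except StopIteration:
                    break
                out_queue.put(item)
                continue
        time.sleep(wait_time)  # queue full: back off outside the lock

# Single-threaded demo: drain a 3-item generator into the queue.
q = queue.Queue()
data_generator_task(iter(range(3)), q, threading.Lock(), 10, threading.Event())
print(q.qsize())  # 3
```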
In the future, I'll just remove this use case. I suspect having workers > 1 and use_multiprocessing == False doesn't give any real speedup.
I'll do some profiling and post my findings here.
Also, we should try to mimic the OrderedEnqueuer, so this class will get heavily refactored soon anyway (by the end of August).
With use_multiprocessing=False, how do you avoid ValueError: generator already executing without adding a Lock and therefore losing all speedup?
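The error in question is easy to reproduce; a minimal sketch with two threads sharing one plain Python generator (the sleep is only there to make the overlap reliable):

```python
import threading
import time

def slow_gen():
    while True:
        time.sleep(0.3)  # simulate work inside the generator frame
        yield 1

g = slow_gen()
errors = []

def worker():
    try:
        next(g)
    except ValueError as e:  # "generator already executing"
        errors.append(str(e))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# One thread enters the generator frame; the other one's next() call
# arrives while it is still inside and raises ValueError.
print(errors)
```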
Let me provide some context and then move to benchmarks. Here are the key Keras releases for this discussion:
2.0.5: All Generator classes for Images used Iterator logic.
2.0.6: Sequences introduced.
2.1.3: Locks introduced in GeneratorEnqueuer
Before the introduction of Sequences (<=2.0.5), all Generator classes in Keras implemented `next()`. They were also thread-safe, and as a result the `GeneratorEnqueuer` did not require any locks. On the other hand, Python generator methods did not work well, as they were throwing `ValueError: generator already executing`. This was fixed by @philferriere in this PR and it was released in 2.1.3.
When I talk about thread-safe generators, I'm not referring to Python generator methods but to Generator classes like those used by Keras before 2.0.5. These classes used an Iterator and made sure, by using minimal locking, that all the heavy computations are done outside the lock. Since C libs can release the GIL, this gave speed improvements. When Sequences were introduced, many of the classes were rewritten and instead of `next()` they implemented `__getitem__()`. Sequences are better when concurrency is necessary; nevertheless, not all use cases can implement "random-access" logic, yet they can still be thread-safe. Thus having the ability to build thread-safe generator classes is still useful IMO. The lock introduced in 2.1.3 stops those valid cases from using proper parallelism.
Below I provide a snippet that can be run with and without the lock (just comment out line 650). To produce something more than a toy example, I use the `ImageDataGenerator` class with a realistic configuration to load real images (note that I am aware that ImageDataGenerator will return a Sequence, but that has no effect here; for an Iterator-based implementation you can take the older version of the class from Keras 2.0.5 and get the same results. What we need here is a thread-safe "iterator" with heavy computations in the non-locking parts.).
Snippet:

import time
from keras.preprocessing import image
from keras.utils.data_utils import GeneratorEnqueuer

it = image.ImageDataGenerator().flow_from_directory('/path/to/images', target_size=(224, 224), batch_size=512)
reader = GeneratorEnqueuer(it, use_multiprocessing=False)
reader.start(workers=16, max_queue_size=16)
g = reader.get()

start = time.time()
n = 100
for i in range(n):
    x = next(g)
total_time = time.time() - start
print('Total time %f sec; Average time: %f sec' % (total_time, total_time / n))
WITH LOCK: Total time 161.514681 sec; Average time: 1.615147 sec
WITHOUT LOCK: Total time 32.234057 sec; Average time: 0.322341 sec
That's 5 times faster, provided that the underlying iterator is thread-safe. As you can see, use_multiprocessing=False and workers>1 can be quite useful because many C libraries release the GIL. Unfortunately the lock introduced above, even though it admittedly fixes the case of Python generators, makes it impossible for thread-safe iterators to work fast. Note that this benchmark does not conflict with yours. If I run the same code without the Enqueuer, its time is the same as with the Enqueuer and the lock (exactly what you said earlier).
In my opinion, removing the lock makes sense (or maybe checking the instance type and/or the existence of specific methods to understand if the lock is necessary?), because if you are using non-thread-safe Python generators you should not use workers>1. If you use thread-safe iterators though, it absolutely makes sense to use parallelism, especially if you implement them with minimal locking. Finally, I want to mention that this is not a theoretical discussion. I started investigating this problem because we observed GPU starvation and 4-5 times slower execution times on real-world applications after Keras 2.1.3. Our problem is solved by removing the lock and making a couple of minor changes to the queue configuration.
If that's interesting to you, I can draft a PR so that you can check it out and decide if it's something you want to use.
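A minimal sketch of what a thread-safe iterator with minimal locking can look like (hypothetical class, not Keras code): only the index bookkeeping is guarded, while the heavy transform runs outside the lock, so GIL-releasing C libraries can overlap.

```python
import threading

class ThreadSafeIterator:
    """Hypothetical iterator safe for workers > 1: only the index
    bookkeeping is locked; the heavy work runs outside the lock."""

    def __init__(self, data, transform):
        self.data = data
        self.transform = transform  # heavy, ideally GIL-releasing work
        self.lock = threading.Lock()
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:  # critical section: claim the next index only
            idx = self.i % len(self.data)
            self.i += 1
        # Heavy computation happens unlocked, so threads can overlap here.
        return self.transform(self.data[idx])

    next = __next__  # Python 2 spelling, matching the era of this PR

# Toy usage: cycles over the data, applying the transform.
it = ThreadSafeIterator([1, 2, 3], lambda x: x * 10)
print(next(it), next(it))  # 10 20
```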
Please draft a PR. I think we should keep a UX-friendly way of handling Python generators (i.e. a good, clear message stating that they should set workers=1).
Sounds good. I'll put something together over the weekend and I'll tag you to get your feedback.
In my opinion, removing the lock makes sense (or maybe checking the instance type and/or the existence of specific methods to understand if the lock is necessary?), because if you are using non-thread-safe Python generators you should not use workers>1.
Yeah, I think it's a good idea to add some specific member variable to indicate that a generator is safe for running multi-threaded so it can be detected at runtime. That could then trigger a warning and enable the locking to serialize the `next()` calls with `workers > 1`.
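One possible shape for that runtime check (the `thread_safe` attribute and `next_sample` helper are assumed conventions for illustration, not an existing Keras API):

```python
import threading
import warnings

def next_sample(gen, lock):
    # Assumed convention: a generator-like object opts in to unlocked,
    # parallel next() calls by exposing `thread_safe = True`.
    if getattr(gen, 'thread_safe', False):
        return next(gen)  # no lock: real parallelism is possible
    warnings.warn('generator is not marked thread-safe; '
                  'serializing next() (consider workers=1)')
    with lock:  # plain Python generators must be serialized
        return next(gen)

lock = threading.Lock()
plain = iter(range(3))
print(next_sample(plain, lock))  # 0, with a warning
```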
Yeah, but only in the |
@Dref360 @de-vri-es, please let me know if you need anything else from me. If you approve this PR, may I suggest adding a small note to Docs » Models » Model (functional API) for the three generator methods (fit_generator, evaluate_generator, predict_generator)? Something along the lines of: Using a generator with Thank you both for your help with this! |
Looks good to me. I do think it makes sense to update the API of |
Thanks for the PR!
keras/utils/data_utils.py
Outdated
except StopIteration:
    break
except Exception as e:
    # Can't pick tracebacks.
pickle
Fixed
keras/utils/data_utils.py
Outdated
""" | ||
|
||
def data_generator_task(): | ||
def data_generator_task(self): |
This should be a private method, I believe
Fixed
keras/utils/data_utils.py
Outdated
""" | ||
|
||
def data_generator_task(): | ||
def __data_generator_task(self): |
You only need one leading underscore to make a method private.
Per Python's class module documentation:
“Private” instance variables that cannot be accessed except from inside an object don’t exist in Python. However, there is a convention that is followed by most Python code: a name prefixed with an underscore (e.g. _spam) should be treated as a non-public part of the API (whether it is a function, a method or a data member). It should be considered an implementation detail and subject to change without notice.
Since there is a valid use-case for class-private members (namely to avoid name clashes of names with names defined by subclasses), there is limited support for such a mechanism, called name mangling. Any identifier of the form __spam (at least two leading underscores, at most one trailing underscore) is textually replaced with _classname__spam, where classname is the current class name with leading underscore(s) stripped. This mangling is done without regard to the syntactic position of the identifier, as long as it occurs within the definition of a class.
The one-underscore approach assumes developers abide by a coding convention and little else. The two-underscore approach uses name mangling to enforce some level of privacy (easy to break, sure, but at least it is trying). Is the "by convention" approach the only one you need me to follow?
I just want to make sure. Thanks, @fchollet!
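The mangling behavior in question, for reference (toy class, nothing to do with the PR's code):

```python
class Enqueuer:
    def _task(self):
        return "convention-private"

    def __task(self):  # mangled to _Enqueuer__task
        return "name-mangled"

e = Enqueuer()
print(e._task())             # accessible: just a naming convention
print(e._Enqueuer__task())   # the mangled name is still reachable
print(hasattr(e, '__task'))  # False: the plain name does not exist
```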
Yes, all private methods in the Keras codebase use a single leading underscore. Thanks!
LGTM, thank you!
The GIL kills parallelism, not concurrency.
Just adding this so people aren't confused.
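To illustrate the distinction: blocking calls such as sleep (and many C-library routines) release the GIL, so threads make progress concurrently even though Python bytecode never runs in parallel. A small sketch:

```python
import threading
import time

def io_bound():
    time.sleep(0.3)  # blocking call: releases the GIL while waiting

start = time.time()
threads = [threading.Thread(target=io_bound) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Four 0.3 s waits overlap: the total is ~0.3 s, not 1.2 s.
print('%.2f s' % elapsed)
```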
That's a little nitpicky, but also, the distinction between multitasking and multithreading as you describe it is quite arbitrary. Nowadays, a "task" tends to refer to a concept from either your language runtime or a library, and doesn't necessarily map to parallel execution. Multithreading, on the other hand, is an execution model, possibly for "tasks", that actually runs in parallel in practically any language other than Python since the days of multi-core CPUs (~2005).
Modified Files
keras/utils/data_utils.py
tests/test_multiprocessing.py
Issues with GeneratorEnqueuer Multithreading on Windows (and Linux...)
Open bugs:
#6582
#5071
Stale but legit bugs:
#3962
#5510
Any attempt to use multithreading or multiprocessing on Windows will result in an
`AttributeError: Can't pickle local object 'GeneratorEnqueuer.start.<locals>.data_generator_task'`
error in the `multiprocessing` package, as shown below. Converting the `data_generator_task()` local function to a `GeneratorEnqueuer` class method fixes the issue.

Fixing the error above, however, doesn't fix a more general problem with `multiprocessing` on Windows. Indeed, on Windows, `multiprocessing` cannot marshal objects that contain generators across process boundaries. Attempting to do so will systematically generate a `TypeError: can't pickle generator objects` error, as shown here.

Our two-pronged `data_utils.py` fix for these issues is the following:

1. Convert the `data_generator_task()` local function to a `GeneratorEnqueuer` class method.
2. Raise a `ValueError` exception instead if `use_multiprocessing` is set to True, and suggest an alternative in the error message, such as using multithreading.

BUT, the current code in `data_utils.py` does not work properly in multithreading mode either. Since calls to the generator's `next()` function are not serialized, execution systematically results in a `ValueError: generator already executing` (both on Windows and Linux!). See the notes below on `test_multiprocessing.py` to see why this seldom shows up on Linux.

Our `data_utils.py` fix for this additional issue is the following:

1. Serialize calls to `generator_output = next(generator)` using a threading lock.
2. Enforce `max_queue_size`. Right now, it is not properly initialized and grows indefinitely on all platforms!

Yes, we are aware of Python's limited multithreading abilities when it comes to the global interpreter lock (GIL), as discussed here. We are also of the opinion that degraded performance is a better alternative to catastrophic failure in execution (as is the case right now). Without a fix, `GeneratorEnqueuer` is unusable on Windows.

If the PR for `data_utils.py` is rejected, please consider using our modified version of `test_multiprocessing.py`. The current version is woefully inadequate at catching multithreading and multiprocessing bugs or stressing the `GeneratorEnqueuer` queueing mechanism. Using our updated version, you will see that the same multithreading issues pop up immediately on Linux as well.

Our fixes for `test_multiprocessing.py` to bring out threading issues and stress the queue are the following:

1. … (`use_multiprocessing` set to True, then False)
2. … (`use_multiprocessing` set to True, then False)
3. … (`use_multiprocessing` set to True, then False)

Code to Repro Multiprocessing Bugs and Test Fix
Simply run `test_multiprocessing.py`. Below, we show execution of this code (with and without the fix) in four different configurations:

dlwin36tf140kerasmaster (Python 3.6 on Windows 10 with Tensorflow 1.4)
- Execution without the fix
- Execution with the fix
- Steps to recreate test environment

dlwin27cntk23kerasmaster (Python 2.7 on Windows 10 with CNTK 2.3)
- Execution without the fix
- Execution with the fix
- Steps to recreate test environment

dlubu36tf140kerasmaster (Python 3.6 on Ubuntu 16.04 with Tensorflow 1.4)
- Execution without the fix
- Execution with the fix

dlubu27cntk23kerasmaster (Python 2.7 on Ubuntu 16.04 with CNTK 2.3)
- Execution without the fix
- Execution with the fix
- Steps to recreate test environment