Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataLayer singleton bugfix #348

Merged
merged 2 commits into from
Apr 22, 2014

Conversation

jeffdonahue
Copy link
Contributor

I believe this fully fixes the apparent issue with the RNG mentioned in #335, (which it turns out my PR #336 did nothing to address, but was still nonetheless an improvement).

Basically, the DataLayer prefetch thread has no knowledge of the main Caffe thread's Caffe singleton, but it was making a call to Caffe::phase(), which resulted in the DataLayer constructing its own brand new Caffe singleton instance, and then, my theory at least is that somehow the destruction of this singleton instance when the thread exits isn't correctly handled and interferes with main thread memory? Not entirely sure about that explanation, but I do know that after at least 25 runs of the imagenet architecture with this change, I have not seen a segfault (vs. current dev which segfaults 25-75% of the time).

I also thought I'd uncovered a long-standing DataLayer bug as it seemed like the Caffe::phase() call in the prefetch thread should always return the default phase (TRAIN), but I added debug printouts inside the prefetch thread and ran from dev and it turns out that somehow the prefetch thread's phase is already being set correctly, but I'm pretty sure I have no idea how...

@@ -53,7 +53,7 @@ void* DataLayerPrefetch(void* layer_pointer) {
CHECK(data.size()) << "Image cropping only support uint8 data";
int h_off, w_off;
// We only do random crop when we do training.
if (Caffe::phase() == Caffe::TRAIN) {
if (layer->phase_ == Caffe::TRAIN) {
// NOLINT_NEXT_LINE(runtime/threadsafe_fn)
h_off = rand() % (height - crop_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is related to anything, but rand() is not thread safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true - I had planned to followup (either tonight or tomorrow probably) with another PR which removes all uses of rand throughout the codebase (by giving the prefetch thread its own private RNG object). Thanks for the comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the rand() followup-to-be. I remember it coming up now and then so it'll be nice to have it squared away.

@shelhamer
Copy link
Member

This has been quite the chase. I'd make some claim about shared pointers, threading, and freeing but honestly I'm not sure exactly how this worked out either.

However, this does fix the issue. Thanks Jeff!

shelhamer added a commit that referenced this pull request Apr 22, 2014
Fix singleton call in data layers

Note previous crash reports were not in fact due to RNG, although they led to improvements of the RNG code.
@shelhamer shelhamer merged commit aeae177 into BVLC:dev Apr 22, 2014
@shelhamer
Copy link
Member

single shot.

@jeffdonahue jeffdonahue deleted the datalayer-singleton-bugfix branch April 22, 2014 18:37
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Fix singleton call in data layers

Note previous crash reports were not in fact due to RNG, although they led to improvements of the RNG code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants