
Crash on exit with OSX #1991

Closed
ThiagoIze opened this issue Aug 2, 2018 · 2 comments

@ThiagoIze (Collaborator)
While upgrading to the master branch, I came across a very rare crash (sometimes it takes tens of thousands of runs before it triggers) that happened on shutdown on my macOS High Sierra machine:

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

This is a bug in OIIO. The following patch, which adds only printouts and delays (all perfectly valid code), makes the crash happen on almost every run:

diff --git a/src/libutil/thread.cpp b/src/libutil/thread.cpp
index ca2d8024..fe758703 100644
--- a/src/libutil/thread.cpp
+++ b/src/libutil/thread.cpp
@@ -126,7 +126,9 @@ public:
 
     // the destructor waits for all the functions in the queue to be finished
     ~Impl () {
+        printf("~thread_pool::Impl called\n");
         this->stop(true);
+        printf("~thread_pool::Impl manual destruction finished. Automatic destructors happen now\n");
     }
 
     // get the number of running threads in the pool
@@ -300,8 +302,11 @@ private:
                         isPop = this->q.pop(_f);
                 }
                 // the queue is empty here, wait for the next command
+                printf("set_thread: the queue is empty here, wait for the next command\n");
+                sleep(1); // wait for mutex to be destroyed
                 std::unique_lock<std::mutex> lock(this->mutex);
                 ++this->nWaiting;
+                printf("this->cv.wait\n"); // we made it through with no crashes!
                 this->cv.wait(lock, [this, &_f, &isPop, &_flag](){ isPop = this->q.pop(_f); return isPop || this->isDone || _flag; });
                 --this->nWaiting;
                 if (!isPop)
@@ -322,6 +327,9 @@ private:
     std::atomic<bool> isStop;
     std::atomic<int> nWaiting;  // how many threads are waiting
     int m_size {0};             // Number of threads in the queue
+    struct Foo{ ~Foo() {
+       printf("mutex was destructed. pausing ~thread_pool::Impl() for 1s!\n"); sleep(1); }
+    } this_gets_destroyed_after_mutex;
     std::mutex mutex;
     std::condition_variable cv;

The bug here is that during shutdown, the thread pool class is destroyed while detached worker threads are still using its member variables. The patch above widens the race window: the worker thread delays for 1s before locking the mutex, and the destructor pauses for 1s after the mutex member has been destroyed (the extra `Foo` member is declared before the mutex, so it is destroyed after it). By the time the worker finally tries to lock the mutex, the mutex has almost certainly been destroyed, and macOS throws the error.

We could wrap every use of the mutex in a try/catch block, and that does help a lot with this crash. Even then, though, I worry we might still get other crashes from using a mutex, or other state, that has already been destroyed; and other operating systems, or future versions of macOS, could start to fail in the same way. So I think a proper fix would be better.

I wonder if this could be related to #1572 and #1795?

@ThiagoIze (Collaborator, Author)

Another important bit of info is that this crash happens only when we do OIIO::attribute ("threads", 1);, because that causes the existing threads to be detached, and detached threads are not waited on when we exit. I think the following might fix it:

diff --git a/src/libutil/thread.cpp b/src/libutil/thread.cpp
index c1da23d8..bb10bf9b 100644
--- a/src/libutil/thread.cpp
+++ b/src/libutil/thread.cpp
@@ -159,7 +159,8 @@ public:
             else {  // the number of threads is decreased
                 for (int i = oldNThreads - 1; i >= nThreads; --i) {
                     *this->flags[i] = true;  // this thread will finish
-                    this->threads[i]->detach();
+                    this->terminating_threads.push_back(std::move(this->threads[i]));
+                    this->threads.erase(this->threads.begin() + i);
                 }
                 {
                     // stop the detached threads that were waiting
@@ -218,10 +219,15 @@ public:
             if (thread->joinable())
                 thread->join();
         }
+        for (auto& thread : this->terminating_threads) {  // wait for the terminated threads to finish
+            if (thread->joinable())
+                thread->join();
+        }
         // if there were no threads in the pool but some functors in the queue, the functors are not deleted by the threads
         // therefore delete them here
         this->clear_queue();
         this->threads.clear();
+        this->terminating_threads.clear();
         this->flags.clear();
     }
 
@@ -317,6 +323,7 @@ private:
     void init() { this->nWaiting = 0; this->isStop = false; this->isDone = false; }
 
     std::vector<std::unique_ptr<std::thread>> threads;
+    std::vector<std::unique_ptr<std::thread>> terminating_threads;
     std::vector<std::shared_ptr<std::atomic<bool>>> flags;
     mutable Queue<std::function<void(int id)> *> q;
     std::atomic<bool> isDone;

If it looks good I can try to make a pull request.

@lgritz (Collaborator) commented Sep 24, 2018

Fixed by #2013

lgritz closed this as completed Sep 24, 2018