
Instacrash when training fluid.mlpregressor~ #314

Open
rconstanzo opened this issue Jul 6, 2022 · 5 comments

Comments

@rconstanzo

I don't have exact steps to reproduce as I was doing a few things at once (basically fit-ing a fluid.mlpregressor~ in a defer loop while also fluid.robustscale~-ing stuff elsewhere), but the crash report has a bunch of fluid stuff in it, so I figured I'd post it here in case it adds some insight.

flucoma crash.txt

@tremblap
Member

tremblap commented Oct 9, 2024

hmm, that is hard to pinpoint... let me know if you find a way to reproduce, especially if it's not in Max. I know @AlexHarker has a pending fix for a lot of memory race condition problems in the threading code, but I'll need you as a power user to test it with me :)

@rconstanzo
Author

This hasn't really popped up as a recurring thing, so I'll keep my eyes peeled.

Feel free to close this (if you want), as there are no concrete steps to reproduce.

@tremblap
Member

tremblap commented Oct 9, 2024

nah, keep it open, but keep an eye on it :)

@AlexHarker
Contributor

The crash is in what I assume is the scheduler (timer) thread.
I think it's a C++ exception on a weak pointer, and it happens, I think, in NRTSharedInstanceAdaptor::lookup().

At a quick and not very detailed look, I am not really sure that this method is threadsafe: it accesses the count of an std::unordered_map without any protection of that object, and then assumes that it can access the value(s) in that slot/bin without protection immediately afterwards (which, in concurrent computing land, doesn't actually mean immediately, nor does it mean the data has not mutated in the meantime).

Hard to be sure, but the most obvious explanation is that one thread looks up the dataset while another deletes it. The count returns 1 or more in the first part of lookup() in the lookup thread, then the deletion occurs in the other thread, and then the access to the weak_ptr back in the lookup thread is bogus. This might be fixed by what I am due to do at some point externally, but I'd argue (given that shared pointers are in use here, implying thread safety) that lookup() itself should be thread-safe. That would then cover usage in other environments.
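
For illustration only, here is a rough sketch of the kind of mutex-guarded registry that would close that window; the SharedRegistry name and its members are invented for this example and are not the actual NRTSharedInstanceAdaptor code.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch of a shared-instance registry keyed by name.
// Both the map access and the weak_ptr promotion happen under the same
// lock, so a concurrent remove() cannot invalidate the bucket between
// the "is it there?" check and the "give me the pointer" step.
template <typename T>
class SharedRegistry {
public:
  void add(const std::string& name, std::shared_ptr<T> instance) {
    std::lock_guard<std::mutex> guard(mMutex);
    mInstances[name] = instance;  // stored as weak_ptr
  }

  void remove(const std::string& name) {
    std::lock_guard<std::mutex> guard(mMutex);
    mInstances.erase(name);  // shared_ptrs already handed out stay valid
  }

  // Thread-safe lookup: returns nullptr if the name is gone or the
  // underlying object has expired, instead of throwing on a dead weak_ptr.
  std::shared_ptr<T> lookup(const std::string& name) {
    std::lock_guard<std::mutex> guard(mMutex);
    auto it = mInstances.find(name);
    if (it == mInstances.end()) return nullptr;
    return it->second.lock();
  }

private:
  std::mutex mMutex;
  std::unordered_map<std::string, std::weak_ptr<T>> mInstances;
};
```

If the extra locking on the audio/scheduler side turned out to be a real cost, a reader/writer lock (std::shared_mutex) for lookups versus add/remove would be one way to keep reads cheap, but that's a separate trade-off from the correctness point above.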

@tremblap
Member

this reminds me of last summer when you had a fix for all of this, but it might cost performance. Maybe when you have a bit of time you and I can sit down around this and you can revive the PR (which you had on your local branch), which I could action fast, I promise.
