Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/dont crash on invalid homogeneous batch dataset #690

Merged
merged 3 commits into from
Aug 13, 2024

Conversation

mgovers
Copy link
Member

@mgovers mgovers commented Aug 12, 2024

The issue

Non-caught exceptions caused by real-world update scenarios thrown by a worker thread in parallel batch calculations cause fatal crashes.

Considerations

After investigation related to exceptions during thread creation/execution, the following findings were made:

  • It is possible that the main thread throws an error when creating a thread
    • If this happens, stack unwinding is done
    • then a joinable std::thread will be destructed
    • this causes std::terminate()
    • This is OK because it is of the highest severity and the PowerGridModel instance cannot continue with its current tasks
      • Decision: intended behavior. the users' program crashes and we do not catch this to throw a PowerGridError
  • It is possible that a worker thread raises a exception
    • Option 1: in a scenario
      • this should be handled by the scenario runner and instead a PowerGridBatchError should be raised
      • This is intended behavior
    • Option 2: during creation of the thread
      • Option 2a: during copy of MainModel, e.g.
        • This points at a severe issue like running out of memory
        • This causes std::terminate()
        • This is intended behavior (see above)
      • Option 2b: during other caching mechanisms
        • This may be caused by a real-life scenario (see below)
        • This causes std::terminate() (fatal crash).
        • This is a bug: instead, it should cause an actual PowerGridError
        • This is solved by this PR

Preconditions

The above-mentioned bug solved in this PR arises when all of the following conditions are satisfied:

  1. There is a batch calculation
  2. Running in parallel with at least 2 threads (threading=0 or threading=x where x >= 2)
  3. With at least 2 scenarios
  4. With homogeneous update data (all scenarios touch the same IDs)
  5. And at least 1 ID in the update data is not present in the input data

These conditions trigger the following contributing factors:

  • The 5th condition triggers the exception
  • The 4th condition triggers throwing the exception in the setup part instead of in a scenario
  • The 1st, 2nd and 3rd condition triggers throwing the exception in a worker thread

The solution

The caching in which the exception was moved from the worker-thread setup part to all-scenario combined setup. This prevents non-handled exceptions from being raised in the worker thread

mgovers added 2 commits August 9, 2024 11:44
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
@mgovers mgovers added the bug Something isn't working label Aug 12, 2024
@mgovers mgovers requested a review from TonyXiang8787 August 12, 2024 11:38
@mgovers mgovers self-assigned this Aug 12, 2024
Signed-off-by: Martijn Govers <Martijn.Govers@Alliander.com>
Copy link

@TonyXiang8787 TonyXiang8787 added this pull request to the merge queue Aug 13, 2024
Merged via the queue into main with commit cd18769 Aug 13, 2024
26 checks passed
@TonyXiang8787 TonyXiang8787 deleted the feature/join-threads-on-failure branch August 13, 2024 07:40
@mgovers mgovers mentioned this pull request Nov 5, 2024
27 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants