Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

soroban-rpc: add generic panic handing #856

Merged
merged 9 commits into from
Aug 23, 2023

Conversation

tsachiherman
Copy link
Contributor

@tsachiherman tsachiherman commented Aug 11, 2023

What

We've recently received a crash log from one of our partners ( see blow ).
The goal of this PR is to ensure that the next time we get a log, it would be actionable.

Why

The panic itself wasn't reported to the log. That would prevent us from handling these issues.

Known limitations

Adding the monitored to each of the go routines seems to be laborious. It does add some potential interesting features such as resiliency to panics.

Test output

time="2023-08-11T15:21:14.771-04:00" level=warning msg="panicing root function 'github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util.IndirectPanicingFunctionC'" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="panic({0x103122b20, 0x1033fd260})" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="/usr/local/go/src/runtime/panic.go:890 +0x230" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util.PanicingFunctionA(0x0)" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="/Users/tsachi/go/src/github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util/monitoredRoutine_test.go:53 +0x4" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util.IndirectPanicingFunctionB()" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="/Users/tsachi/go/src/github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util/monitoredRoutine_test.go:57 +0x20" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util.IndirectPanicingFunctionC()" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="/Users/tsachi/go/src/github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util/monitoredRoutine_test.go:61 +0x1c" pid=16895
time="2023-08-11T15:21:14.771-04:00" level=warning msg="github.com/stellar/soroban-tools/cmd/soroban-rpc/internal/util.MonitoredRoutine.func1()" pid=16895
PASS
Process 16895 has exited with status 0
Detaching

Received Log

time="2023-08-11T17:29:04.281Z" level=info msg="Tx: applying ledger 343258 (classic phase: 1 component(s): [{discounted txs:1, ops:1, base_fee:100}], soroban phase: 1 component(s): [{discounted txs:1, ops:1, base_fee:100}])" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.283Z" level=error msg="Tx: Exception while applying operations (fullHash = 976358c7a45be23d687ff62a05f8cc4acc6b942a6e50b2d50922eaaca762538a" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.283Z" level=info msg=", contentsHash = 029871781e6e2510617b7949f4b109e27cd60554456924be6a3dd90deaa619f8" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.283Z" level=info msg="): Key already exists" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.293Z" level=info msg="default: Application destructing" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.295Z" level=info msg="default: Application destroyed" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.415Z" level=error msg="default: Got an exception: Local node's ledger corrupted during close" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.415Z" level=error msg="default: Please report this bug along with this log file if this was not expected" pid=1 subservice=stellar-core
time="2023-08-11T17:29:04.423Z" level=error msg="could not run ingestion. Retrying" error="error starting prepare range: opening subprocess: error running stellar-core: error runing stellar-core catchup: exit status 1" pid=1
time="2023-08-11T17:29:05.424Z" level=info msg="Applying txmeta for ledger 343296" pid=1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xee8624]

@tsachiherman tsachiherman self-assigned this Aug 11, 2023
Copy link
Contributor

@sreuland sreuland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice integration!

@tsachiherman tsachiherman marked this pull request as ready for review August 11, 2023 21:47
@2opremio
Copy link
Contributor

I don't think it's kosher to use camelCase in filenames

@tsachiherman
Copy link
Contributor Author

I don't think it's kosher to use camelCase in filenames

I'm ok changing that. would monitored_routine.go be ok ?

@2opremio
Copy link
Contributor

I don't think it's kosher to use underscores either, I think, but we are already going it at different places. See https://stackoverflow.com/a/39625565/1914440

@tsachiherman
Copy link
Contributor Author

I don't think it's kosher to use underscores either, I think, but we are already going it at different places. See https://stackoverflow.com/a/39625565/1914440

The link you've provided suggeted that

- smallallcupspackagename
|- veryLongFileName.go
|- veryLongFileName_test.go
|- veryLongFileName_amd64_windows.go

are legitimate. I'm open to changing the file names, but please be more specific - how would you like to call the file ?

@2opremio
Copy link
Contributor

Two questions:

  1. How does Horizon do this? I don't recall seeing anything like MonitoredRoutine.
  2. Why wasn't the panic included in the logs?

Also, if we were to use logs for the panics, I am not a fan of reporting the panic across multiple log lines.

Finally I am not a fan of the MonitoredRoutine (it's too generic). PanicCatcher or PanicWrapper is more to the point.

@2opremio
Copy link
Contributor

are legitimate. I'm open to changing the file names, but please be more specific - how would you like to call the file ?

monitoredroutine.go . Although, please take into account the naming suggestions above.

@2opremio
Copy link
Contributor

2opremio commented Aug 14, 2023

Lastly, have you seen https://github.com/gregwebs/go-recovery ?

It may not be worth using such small library but it's ergonomics/API are a bit nicer.

@tsachiherman
Copy link
Contributor Author

Lastly, have you seen https://github.com/gregwebs/go-recovery ?

It may not be worth using such small library but it's ergonomics/API are a but nicer.

I think that this library would give us less than what this PR does. With this PR:

  1. we can control of "what happen when panic happens".
  2. we can control where the stack trace would go to.
  3. we have control of whether make it a benign failure and potentially update a metric or let the process fail
  4. we have the option to default to classic "exit process".

The technique used in that library is the same that is being used in this PR, so I really don't see much value in using it.

@tsachiherman
Copy link
Contributor Author

Two questions:

  1. How does Horizon do this? I don't recall seeing anything like MonitoredRoutine.
  2. Why wasn't the panic included in the logs?

Also, if we were to use logs for the panics, I am not a fan of reporting the panic across multiple log lines.

Finally I am not a fan of the MonitoredRoutine (it's too generic). PanicCatcher or PanicWrapper is more to the point.

The issue is that we're not the ones that run the process - and we can't "force" third parties to work in a specific way.
If you'll look carefully, you'll see that the entire provided output was json formatted - with the exception of the last line.
That "tells" me that they had the information from the log collected correctly, and apparently they were able to catch the first line of the stderr.

This PR would ensure that we have the output on both the stderr as well as the log output. That way, the information is guaranteed to be "there".

Copy link
Contributor

@paulbellamy paulbellamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall having a utility to standardize panic/recover workflows seems like a good idea. I think the thing holding this PR back is the API.

I see a few ideas to experiment with to get a nicer API here:

  1. we could try renaming the struct (panicgroup?) or even using the package name as part of the name: recovery.Go(cfg, func()), or panicgroup.Group type thing
  2. We could try breaking up the different handlers into different tools: one for logging, one for exiting, then offering a way to compose them.
  3. We could draw stricter boundaries around what this struct does, to trim down the options and make the interactions more clear. E.g. do we need both Log and LogPanicsToStdErr?
  4. We could have a valid zero-value for MonitoredRoutine, letting you do: Routine{}.Go(func(){}). Not sure if this would help?
  5. something else entirely?
// instead of a thing you pass your func to, maybe
// a thing you use within your func:
func() {
  defer recovery.Handler(
    recovery.Log(logger),
    recovery.ExitProcess(),
  )
  doRisky()
}()

cmd/soroban-rpc/internal/ingest/service.go Outdated Show resolved Hide resolved
cmd/soroban-rpc/internal/util/monitoredroutine.go Outdated Show resolved Hide resolved
cmd/soroban-rpc/internal/util/monitoredroutine.go Outdated Show resolved Hide resolved
@tsachiherman
Copy link
Contributor Author

looking onto the feedback here, I think that it's not 100% clear where we take this PR. It's also won't makes a ton of difference exactly how it's implemented. I'll try to adopt some of Paul's suggestions:

  1. Change the external naming into PanicGroup.
  2. Create two global variables that contains the "defaults" for critical and non-critical panic groups.
  3. use <panic_group>.With to create concrete group ( with a logger, counter, etc )
  4. use .Go that would receive the async function as an argument.

@tsachiherman tsachiherman merged commit 3758c2e into main Aug 23, 2023
20 of 21 checks passed
@tsachiherman tsachiherman deleted the tsachi/genericPanicHandling branch August 23, 2023 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants