-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: pebbleIterator panics too eagerly on Close() #84396
Comments
Hi @msbutler, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Looking at the Pebble code, it seems like it might return the previous iterator error when calling So if we're seeing an error during iteration, then cockroach/pkg/storage/pebble_iterator.go Lines 895 to 898 in c58c0aa
@jbowens Is this new behavior? It seems problematic, because any iteration error will essentially cause a CRDB panic when used together with |
@erikgrinaker — no, I'm pretty sure that's the way it's always been: https://github.com/cockroachdb/cockroach/blame/c58c0aac82480c069d71717bac1df317798317fd/pkg/storage/pebble_iterator.go#L897 It does seem problematic that an ephemeral I/O iteration error causes a CockroachDB panic, although ephemeral I/O errors also cause a panic in some compaction code paths (eg, cockroachdb/pebble#1794). There are times where pebble iterators will surface an error due to misuse (eg, SeekPrefixGE bounds checks), in which a panic is appropriate. Also, as long as we don't have graceful corruption recovery (#67568), panicking on sstable checksum errors surfaced by iteration is also appropriate. We can choose to not panic, or panic only on errors not known to be ephemeral, but what do we do with the ephemeral error? The current storage package interface doesn't provide a means to differentiate between an iterator exhausted due to an error, versus an iterator exhausted successfully. If we don't panic, it's not sufficient to just I'm assuming we're running into this only now because only now we're using a |
Hmm, I forgot that |
Don't panic if Iterator.Close returns a non-nil error. Iteration errors should already be surfaced through (*pebble.Iterator).Error and (storage.EngineIterator).Valid. Panicking may crash the node for retriable, ephemeral I/O errors, especially when reading from sstables over the network. In a follow up, we should make sure we're still appropriately panicking for non-retriable errors. Fix cockroachdb#84396. Release note (bug fix): fix bug where an ephemeral I/O error could crash a node.
84449: storage: don't panic on Iterator.Close r=nicktrav,erikgrinaker a=jbowens Don't panic if Iterator.Close returns a non-nil error. Iteration errors should already be surfaced through (*pebble.Iterator).Error and (storage.EngineIterator).Valid. Panicking may crash the node for retriable, ephemeral I/O errors, especially when reading from sstables over the network. In a follow up, we should make sure we're still appropriately panicking for non-retriable errors. Fix #84396. Release note (bug fix): fix bug where an ephemeral I/O error could crash a node. Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Is it possible we're wrapping
where's the networked implementation of |
In the PR I reverted, restore called The basic logic is:
Perhaps the new PebbleIterator doesn't play nice with the |
In cockroachdb#84449, the behavior of the `pebbleIterator` was altered to ignore all errors when closing an iterator. This was done to avoid panic-ing when the iterator encounters an ephemeral (i.e. retriable) error that remains pinned to the iterator until it is closed (see cockroachdb#84396 for the motivation). Partially address the TODO added in cockroachdb#84449 by panic-ing on errors that are known to be fatal (such as corruption). Addressing the TODO in its entirety would likely require enumerating known ephemeral errors and allowing such errors to be ignored on close, as well as implementing cockroachdb/pebble#1811 to avoid pinning errors to the iterator even after they have been handled. Fix cockroachdb#84479. Release note: None.
84590: *: upgrade to go 1.18.4 r=knz a=rickystewart * [ ] Adjust the Pebble tests to run in new version. * [x] Adjust version in Docker image ([source](./builder/Dockerfile)). * [x] Adjust version in the TeamCity agent image ([setup script](./packer/teamcity-agent.sh)) * [ ] Rebuild and push the Docker image (following [Basic Process](#basic-process)) * [x] Download ALL the archives (`.tar.gz`, `.zip`) for the new Go version from https://golang.org/dl/ and mirror them in the `public-bazel-artifacts` bucket in the `Bazel artifacts` project in GCP (sub-directory `go`, next to the other Go SDK's). * [x] Bump the version in `WORKSPACE` under `go_download_sdk`. You may need to bump [rules_go](https://github.com/bazelbuild/rules_go/releases). Also edit the filenames listed in `sdks` and update all the hashes to match what you mirrored in the step above. * [x] Run `./dev generate bazel` to refresh `distdir_files.bzl`, then `bazel fetch `@distdir//:archives`` to ensure you've updated all hashes to the correct value. * [ ] Bump the version in `builder.sh` accordingly ([source](./builder.sh#L6)). * [x] Bump the version in `go-version-check.sh` ([source](./go-version-check.sh)), unless bumping to a new patch release. * [ ] Bump the go version in `go.mod`. You may also need to rerun `make vendor_rebuild` if vendoring has changed. * [x] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh)). * [x] Replace other mentions of the older version of go (grep for `golang:<old_version>` and `go<old_version>`). * [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects. * [ ] Ask the Developer Infrastructure team to deploy new TeamCity agent images according to [packer/README.md](./packer/README.md) Release note (build change): Upgrade to golang 1.18.4 84723: opt: build UDF expressions r=mgartner a=mgartner #### opt: set opt tester search path to empty `OptTester` now sets its `SemaContext`'s `SearchPath` to `EmptySearchPath`, instead of `nil`, to avoid nil pointer exceptions when resolving unknown functions. Release note: None #### opt: build UDF expressions This commit adds basic support for building UDFs in optbuilder. Only scalar, nullary (arity of zero) functions with a single statement in the body are supported. Support for more types of UDFs will follow in future commits. Note that this commit does not add support for execution of UDFs, only building them within an optimizer expression. Release note: None 84913: storage: panic on iterator close when encountering corruption r=jbowens a=nicktrav In #84449, the behavior of the `pebbleIterator` was altered to ignore all errors when closing an iterator. This was done to avoid panic-ing when the iterator encounters an ephemeral (i.e. retriable) error that remains pinned to the iterator until it is closed (see #84396 for the motivation). Partially address the TODO added in #84449 by panic-ing on errors that are known to be fatal (such as corruption). Addressing the TODO in its entirety would likely require enumerating known ephemeral errors and allowing such errors to be ignored on close, as well as implementing cockroachdb/pebble#1811 to avoid pinning errors to the iterator even after they have been handled. Fix #84479. Release note: None. 84917: outliers: rename to insights r=matthewtodd a=matthewtodd The UI design that includes outliers also features other insights about sql execution. We will expand this outliers subsystem to support making those insights as well (probably with more detectors, changing the signature / return type of `isOutlier`). Renaming now is the first step in that direction. Release note: None 84987: kvserver: send range GC requests when point requests are absent r=aliher1911 a=aliher1911 Previously range GC requests were not sent if point GC keys were not requested as well. This patch fixes the problem by checking that any of parameters could be specified. Release note: None 85011: sql: remove a reference to issue 77733 r=ajwerner a=knz Fixes #77733. Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Nick Travers <travers@cockroachlabs.com> Co-authored-by: Matthew Todd <todd@cockroachlabs.com> Co-authored-by: Oleg Afanasyev <oleg@cockroachlabs.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
After hooking the new pebbleIterator into the restore processor (#83984), a few restore roachtests on master have been failing (#84240, #84237), because the pebbleIterator panics when it attempts to close, for example:
(see full example panic stack trace in comment).
In the failed tests and in repros, the panic reasons have been
EOF
,Context cancelled
, andnil
, which at a high level, makes me wonder if all this panicking is necessary. To get a better understanding of when the iterator panics, I ran therestore/nodeShutdown/coordinator
roachtest on my gce-worker using master + a commit that adds more logging related to the SSTs in the iterator and issues prints instead of panics inpebbleIterator.destroy()
.I can explain and reproduce the
context cancelled
panic. When the restore fails for some independent reason, context cancellations circulates throughout the goroutines, leading to pebbleIterator.Valid() returning!ok
and acontext cancelled
error. This then leads todefer.Close()
, surfacing thecontext cancelled
error duringdestroy()
, leading to a panic. To repro this, stress therestore/nodeShutdown/coordinator
roachtest above, potentially with my augmented commit.I'm not sure what causes the EOF or nil based panics.
@erikgrinaker
Jira issue: CRDB-17633
Epic CRDB-2624
The text was updated successfully, but these errors were encountered: