jobs: TestRegistryLifecycle failed #68315
jobs.TestRegistryLifecycle failed with artifacts on master @ c995342ead51e08f8ed1155de4218d30a00d86d2:
Fatal error:
Stack:
Log preceding fatal error
Reproduce
To reproduce, try:

    make stressrace TESTS=TestRegistryLifecycle PKG=./pkg/jobs TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:
jobs.TestRegistryLifecycle failed with artifacts on master @ 65d81716a22a38873a679b4fd5f1e019280725d5:
Fatal error:
Stack:
Log preceding fatal error
Reproduce
To reproduce, try:

    make stressrace TESTS=TestRegistryLifecycle PKG=./pkg/jobs TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:
This is #67386, I'm pretty certain.
Yep, I have a fix in the works. Just trying to get the test passing under stress.
Previously, TraceCollector.StartIter would return a nil object if we failed to resolve node liveness during initialization. This led to an NPE. This change now returns a TraceCollector instance with the error field set to the appropriate error, so that the validity check on the iterator can correctly handle this scenario.

This change also reworks the dump-trace-on-job-cancellation test. Job cancellation semantics under stress are slightly nondeterministic in terms of how many times execution of OnFailOrCancel is resumed, which makes it hard to coordinate when to check and how many trace files to expect.

Fixes: cockroachdb#68315

Release note: None
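For illustration, here is a minimal Go sketch of the pattern this fix describes: return a non-nil iterator that carries the error, so the caller's validity check handles the failure instead of panicking on nil. All names below are hypothetical stand-ins, not the actual TraceCollector code.

```go
package main

import (
	"errors"
	"fmt"
)

// iterator is a stand-in for the iterator that TraceCollector.StartIter returns.
type iterator struct {
	err  error
	rows []string
	pos  int
}

// startIter mirrors the fixed behavior: on an initialization failure it
// returns an iterator with the error field set, never a nil pointer.
func startIter(resolveLiveness func() ([]string, error)) *iterator {
	rows, err := resolveLiveness()
	if err != nil {
		return &iterator{err: err} // non-nil, so callers cannot hit an NPE
	}
	return &iterator{rows: rows}
}

// Valid is the validity check; it reports false when the iterator carries
// an initialization error or is exhausted.
func (it *iterator) Valid() bool { return it.err == nil && it.pos < len(it.rows) }

// Error exposes the initialization error, if any.
func (it *iterator) Error() error { return it.err }

func main() {
	it := startIter(func() ([]string, error) {
		return nil, errors.New("could not resolve node liveness")
	})
	if !it.Valid() {
		fmt.Println("iterator invalid:", it.Error()) // handled gracefully, no panic
	}
}
```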
jobs.TestRegistryLifecycle failed with artifacts on master @ eef03a46f2e43ff70485dadf7d9ad445db05cab4:
Fatal error:
Stack:
Log preceding fatal error
Reproduce
To reproduce, try:

    make stressrace TESTS=TestRegistryLifecycle PKG=./pkg/jobs TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:
Refs: cockroachdb#68315

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
66362: kv,storage: use a BytesMonitor to track memory allocations for scans r=sumeerbhola a=sumeerbhola

The goal is to track memory allocations for:

- non-local SQL=>KV requests: this can happen with joins, multi-tenant clusters, and if ranges move between DistSQL planning and execution.
- local SQL=>KV requests for the first request by a fetcher: in this case the fetcher reserves a modest 1KB, which can be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are tracked. The memory is released when returning from executeReadOnlyBatchWithServersideRefreshes, since the chain of returns will end up in gRPC response processing and we can't hook into where that memory is released. This should still help for some cases of OOMs, and give some signal of memory overload that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is wrapped in a narrower ScannerMemoryMonitor that is passed via the EvalContext interface. The other alternative would be for the engine to have a BytesMonitor at initialization time that it can use to construct a BoundAccount for each MVCC scan, and pass it back via MVCCScanResult. This would mean multiple BoundAccounts for a batch (since we don't want to release memory until all the requests in the batch are processed), and would be harder to extend to track additional request types compared to embedding in EvalContext. (A sketch of this BoundAccount accounting pattern follows this comment.)

The rootSQLMonitor is reused for this memory allocation tracking. This tracking is always done for non-local requests, and for the first request by a fetcher for a local request. This avoids double-counting: the first request issued by a SQL fetcher only reserves 1KB, but subsequent ones have already reserved what was returned in the first response. So there is room to tighten this if we knew what had been reserved on the local client (there are complications because the batch may have been split to send to different nodes, only one of which was local). The AdmissionHeader.SourceLocation field is used to mark local requests and is set in rpc.internalClientAdapter. The first request is marked using the AdmissionHeader.NoMemoryReservedAtSource bit.

Informs #19721

Release note (ops change): The memory pool used for SQL is now also used to cover KV memory used for scans.

67451: colexec: optimize Bytes sorting with abbreviated values r=mgartner a=mgartner

#### colexec: add benchmark for sorting UUIDs

Release note: None

#### colexec: optimize Bytes sorting with abbreviated values

This commit introduces an optimization for sorting `Bytes` vectors. While sorting, a significant portion of time is spent performing comparisons between two datums. These comparisons are optimized by creating an abbreviated `uint64` value for each `[]byte` that represents up to the first eight bytes in the slice. The abbreviated value is used for a comparison fast path that avoids comparing all bytes in a slice in some cases. This technique is used by Postgres[1][2] and Pebble[3]. (A sketch of the abbreviated-key comparison also follows this comment.)

Abbreviating values to a `uint64` helps optimize comparisons for datums that are larger than the size of the system's native pointer (64 bits). Given datums `a` and `b`, the following properties must hold true for the respective abbreviated values `abbrA` and `abbrB`:

- `abbrA > abbrB` iff `a > b`
- `abbrA < abbrB` iff `a < b`
- If `abbrA == abbrB`, it is unknown whether `a` is greater than, less than, or equal to `b`. A full comparison of `a` and `b` is required.

Comparing the abbreviated values first improves comparison performance for two reasons. First, comparing two `uint64`s is fast; it is a single CPU instruction. Any datum larger than 64 bits requires multiple instructions for comparison. For example, comparing two `[]byte`s requires iterating over each byte until non-equal bytes are found. Second, CPU caches are used more efficiently because the abbreviated values of a vector are packed into a `[]uint64` of contiguous memory. This increases the chance that two datums can be compared by only fetching information from CPU caches rather than main memory.

In the benchmarks below, `rows` is the number of rows being sorted, `cols` is the number of sort columns, and `constAbbrPct` is the percentage of rows for which the abbreviated values are the same. The benchmarks show that in the best case, when the abbreviated values are all unique, sort throughput has increased by up to ~45%. In the worst case, where all values share the same abbreviated value, throughput has decreased by as much as ~8.5%. This is due to the extra overhead of creating abbreviated values and comparing them, only to fall back to a full comparison every time.

    name                                             old speed      new speed      delta
    SortUUID/rows=2048/cols=1/constAbbrPct=0-16      38.0MB/s ± 2%  55.3MB/s ± 3%  +45.30%
    SortUUID/rows=2048/cols=1/constAbbrPct=50-16     36.2MB/s ± 3%  43.0MB/s ± 4%  +18.82%
    SortUUID/rows=2048/cols=1/constAbbrPct=75-16     36.0MB/s ± 4%  37.1MB/s ± 4%  ~
    SortUUID/rows=2048/cols=1/constAbbrPct=90-16     36.8MB/s ± 1%  36.2MB/s ± 3%  ~
    SortUUID/rows=2048/cols=1/constAbbrPct=100-16    37.0MB/s ± 1%  33.8MB/s ± 5%  -8.66%
    SortUUID/rows=2048/cols=2/constAbbrPct=0-16      60.3MB/s ± 1%  74.0MB/s ± 3%  +22.85%
    SortUUID/rows=2048/cols=2/constAbbrPct=50-16     58.7MB/s ± 1%  63.7MB/s ± 1%  +8.54%
    SortUUID/rows=2048/cols=2/constAbbrPct=75-16     58.3MB/s ± 1%  56.6MB/s ± 6%  ~
    SortUUID/rows=2048/cols=2/constAbbrPct=90-16     58.7MB/s ± 1%  54.2MB/s ± 3%  -7.69%
    SortUUID/rows=2048/cols=2/constAbbrPct=100-16    59.3MB/s ± 1%  56.1MB/s ± 4%  -5.30%
    SortUUID/rows=16384/cols=1/constAbbrPct=0-16     30.7MB/s ± 2%  41.9MB/s ± 3%  +36.24%
    SortUUID/rows=16384/cols=1/constAbbrPct=50-16    30.6MB/s ± 1%  32.0MB/s ± 8%  ~
    SortUUID/rows=16384/cols=1/constAbbrPct=75-16    30.2MB/s ± 2%  30.9MB/s ± 6%  ~
    SortUUID/rows=16384/cols=1/constAbbrPct=90-16    30.4MB/s ± 2%  27.6MB/s ± 3%  -9.29%
    SortUUID/rows=16384/cols=1/constAbbrPct=100-16   30.6MB/s ± 2%  28.5MB/s ± 5%  -6.98%
    SortUUID/rows=16384/cols=2/constAbbrPct=0-16     49.5MB/s ± 2%  59.0MB/s ± 2%  +19.13%
    SortUUID/rows=16384/cols=2/constAbbrPct=50-16    48.6MB/s ± 1%  49.5MB/s ± 3%  +1.84%
    SortUUID/rows=16384/cols=2/constAbbrPct=75-16    47.3MB/s ± 2%  47.3MB/s ± 2%  ~
    SortUUID/rows=16384/cols=2/constAbbrPct=90-16    48.5MB/s ± 2%  45.2MB/s ± 1%  -6.65%
    SortUUID/rows=16384/cols=2/constAbbrPct=100-16   48.9MB/s ± 1%  44.4MB/s ± 2%  -9.13%
    SortUUID/rows=262144/cols=1/constAbbrPct=0-16    25.4MB/s ± 2%  35.7MB/s ± 3%  +40.53%
    SortUUID/rows=262144/cols=1/constAbbrPct=50-16   24.9MB/s ± 3%  28.8MB/s ± 3%  +15.44%
    SortUUID/rows=262144/cols=1/constAbbrPct=75-16   25.4MB/s ± 2%  25.7MB/s ± 5%  ~
    SortUUID/rows=262144/cols=1/constAbbrPct=90-16   25.4MB/s ± 3%  24.7MB/s ± 2%  -3.00%
    SortUUID/rows=262144/cols=1/constAbbrPct=100-16  25.1MB/s ± 3%  23.7MB/s ± 2%  -5.73%
    SortUUID/rows=262144/cols=2/constAbbrPct=0-16    37.5MB/s ± 2%  43.7MB/s ± 5%  +16.65%
    SortUUID/rows=262144/cols=2/constAbbrPct=50-16   37.5MB/s ± 1%  37.8MB/s ± 7%  ~
    SortUUID/rows=262144/cols=2/constAbbrPct=75-16   37.1MB/s ± 4%  36.4MB/s ± 1%  ~
    SortUUID/rows=262144/cols=2/constAbbrPct=90-16   36.9MB/s ± 5%  34.6MB/s ± 7%  -6.26%
    SortUUID/rows=262144/cols=2/constAbbrPct=100-16  37.2MB/s ± 3%  34.0MB/s ± 2%  -8.53%

I experimented with several mitigations to the regressions, but did not have much luck. Postgres uses HyperLogLog while building abbreviated values to track their cardinality and abort the optimization if the cardinality is too low. I attempted this, but the overhead of HyperLogLog negated the benefits of the optimization entirely. Using a `map[uint64]struct{}` to track the cardinality of abbreviated values was the most promising. It was able to reduce the worst regressions at the expense of reducing the speedup of the best-case benchmarks.

These mechanisms for aborting the optimization when abbreviated value cardinality is low are essentially a trade-off between maximizing sort performance when their cardinality is high and minimizing the overhead of generating and comparing abbreviated values when their cardinality is low. It's worth future investigation to try to find a low-overhead method of aborting, and to be more thoughtful about the trade-offs.

[1] https://brandur.org/sortsupport
[2] http://pgeoghegan.blogspot.com/2015/01/abbreviated-keys-exploiting-locality-to.html
[3] https://github.com/cockroachdb/pebble/blob/ea60b4722cca21fa0d3c2749c788e1ac76981f7d/internal/base/comparer.go#L28-L36

Release note (performance improvement): Sort performance has been improved when sorting columns of type STRING, BYTES, or UUID.

67606: sql: add sql shell telemetry counter r=rafiss,knz,arulajmani a=e-mbrown

Fixes #62208

This change allows us to keep track of connections made using our internal SQL shell.

Release note: None

68427: jobs: skip TestRegistryLifecycle r=adityamaru a=jbowens

Refs: #68315

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
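To make the accounting flow in #66362 concrete, here is a toy Go sketch of the monitor/account pattern it describes: a shared budget (the root monitor), a per-consumer account that grows as a scan buffers kv pairs, and a single release point at the end of batch evaluation. The types and methods below are simplified stand-ins, not CockroachDB's actual util/mon API.

```go
package main

import (
	"errors"
	"fmt"
)

// bytesMonitor tracks usage against a shared budget, like a root SQL
// memory monitor.
type bytesMonitor struct {
	used, limit int64
}

// boundAccount tracks one consumer's reservations against the monitor.
type boundAccount struct {
	mon      *bytesMonitor
	reserved int64
}

// grow reserves n more bytes for the scan, failing when the shared budget
// is exhausted.
func (a *boundAccount) grow(n int64) error {
	if a.mon.used+n > a.mon.limit {
		return errors.New("memory budget exceeded")
	}
	a.mon.used += n
	a.reserved += n
	return nil
}

// clear releases everything this account reserved, e.g. at the single point
// where batch evaluation returns and the response is handed off.
func (a *boundAccount) clear() {
	a.mon.used -= a.reserved
	a.reserved = 0
}

func main() {
	mon := &bytesMonitor{limit: 1 << 20} // 1 MiB root budget
	acc := &boundAccount{mon: mon}

	// A scan reserves memory for each chunk of kv pairs it buffers.
	for i := 0; i < 3; i++ {
		if err := acc.grow(64 << 10); err != nil {
			fmt.Println("scan aborted:", err)
			return
		}
	}
	fmt.Printf("reserved %d of %d bytes\n", mon.used, mon.limit)
	acc.clear() // released once, at the end of batch evaluation
	fmt.Printf("after release: %d bytes\n", mon.used)
}
```

Releasing at one well-defined point mirrors the PR's design choice: it over-holds memory slightly, but avoids hooking into gRPC response processing to find where the buffers are actually freed.

And here is a self-contained Go sketch of the abbreviated-key technique from #67451: pack up to the first eight bytes of each value into a big-endian uint64, compare those first, and fall back to a full byte comparison only on ties. This illustrates the ordering properties listed above; it is not the colexec implementation.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// abbreviate packs up to the first eight bytes of b into a big-endian
// uint64, zero-padding short slices. Order is preserved: abbr(a) < abbr(b)
// implies a < b; a tie means the first eight bytes match, so a full
// comparison is required.
func abbreviate(b []byte) uint64 {
	var v uint64
	for i := 0; i < 8; i++ {
		v <<= 8
		if i < len(b) {
			v |= uint64(b[i])
		}
	}
	return v
}

// byAbbr keeps the values and their abbreviated keys aligned while sorting.
type byAbbr struct {
	data [][]byte
	abbr []uint64 // contiguous, cache-friendly fast-path keys
}

func (s byAbbr) Len() int { return len(s.data) }

func (s byAbbr) Swap(i, j int) {
	s.data[i], s.data[j] = s.data[j], s.data[i]
	s.abbr[i], s.abbr[j] = s.abbr[j], s.abbr[i]
}

func (s byAbbr) Less(i, j int) bool {
	if s.abbr[i] != s.abbr[j] {
		return s.abbr[i] < s.abbr[j] // fast path: one uint64 comparison
	}
	return bytes.Compare(s.data[i], s.data[j]) < 0 // tie: full comparison
}

func main() {
	data := [][]byte{
		[]byte("banana"), []byte("applesauce"), []byte("apple"), []byte("cherry"),
	}
	abbr := make([]uint64, len(data))
	for i, d := range data {
		abbr[i] = abbreviate(d)
	}
	sort.Sort(byAbbr{data: data, abbr: abbr})
	fmt.Printf("%s\n", data) // [apple applesauce banana cherry]
}
```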
@adityamaru it's pretty bad that this test remains skipped. Can you please re-enable it, or let me know if you're not going to?
Previously, TraceCollector.StartIter would return a nil object if we failed to resolve node liveness during initialization. This led to an NPE. This change now returns a TraceCollector instance with the error field set to the appropriate error, so that the validity check on the iterator can correctly handle this scenario.

This change also reworks the dump-trace-on-job-cancellation test. Job cancellation semantics under stress are slightly nondeterministic in terms of how many times execution of OnFailOrCancel is resumed, which makes it hard to coordinate when to check and how many trace files to expect.

Fixes: cockroachdb#68315

Release note: None

Release justification: low risk, high benefit changes to existing functionality
68374: jobs: fix race which permitted concurrent execution r=pbardea,ajwerner a=adityamaru

The race was that we'd unregister a job from the `adoptedJobs` map while processing pause or cancel requests. This is problematic because the job also unregisters itself in such cases. However, because the job was already unregistered, the entry which the now-canceled job removes may in fact correspond to a new execution. We remove this hazard by centralizing the location where the entry is actually removed; the cancelation/pause processing now only cancels the context of the relevant execution (see the sketch after this comment).

Fixes #68593

Release note: None

Release justification: bug fixes and low-risk updates to new functionality

Previously, TraceCollector.StartIter would return a nil object if we failed to resolve node liveness during initialization. This led to an NPE. This change now returns a TraceCollector instance with the error field set to the appropriate error, so that the validity check on the iterator can correctly handle this scenario.

This change also reworks the dump-trace-on-job-cancellation test. Job cancellation semantics under stress are slightly nondeterministic in terms of how many times the execution of OnFailOrCancel is resumed, which makes it hard to coordinate when to check and how many trace files to expect.

Fixes: #68315

Release note: None

Release justification: low risk, high benefit changes to existing functionality

Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
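A toy Go sketch of the centralized-removal pattern described in #68374: cancel and pause requests only cancel the execution's context, and the map entry is deleted in exactly one place, guarded by an identity check so a finished execution can never remove a newer execution's entry. The registry and execution types here are illustrative, not the real jobs.Registry API.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// execution represents one running instance of a job.
type execution struct {
	cancel context.CancelFunc
}

// registry is a stand-in for the jobs registry with its adoptedJobs map.
type registry struct {
	mu          sync.Mutex
	adoptedJobs map[int64]*execution
}

// cancelJob processes a cancel/pause request. It no longer deletes the map
// entry; it only cancels the context of the current execution.
func (r *registry) cancelJob(id int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if exec, ok := r.adoptedJobs[id]; ok {
		exec.cancel()
	}
}

// unregister is the single place an entry is removed. The identity check
// ensures an exiting execution cannot delete an entry that now belongs to a
// newer execution of the same job.
func (r *registry) unregister(id int64, exec *execution) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if cur, ok := r.adoptedJobs[id]; ok && cur == exec {
		delete(r.adoptedJobs, id)
	}
}

func main() {
	r := &registry{adoptedJobs: map[int64]*execution{}}
	ctx, cancel := context.WithCancel(context.Background())
	exec := &execution{cancel: cancel}
	r.adoptedJobs[1] = exec

	r.cancelJob(1) // only cancels the context
	<-ctx.Done()
	fmt.Println("execution observed cancellation:", ctx.Err())
	r.unregister(1, exec) // removal happens in exactly one place
	fmt.Println("remaining adopted jobs:", len(r.adoptedJobs))
}
```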