
Single node has a very high CPU utilization when attempting a split of a range, causing range to be inaccessible. #43106

Closed
ricardocrdb opened this issue Dec 10, 2019 · 26 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. O-community Originated from the community

Comments

@ricardocrdb

A user running 2.1.6 is hitting an issue where a single node in a 5-node cluster is experiencing nearly 100% CPU usage on 12 cores. The logs show the following error repeating for nearly 6 hours:

[n1,split,s1,r14507/42:/Table/123/1/"RandomStuff..] split at key /Table/123/1/"RandomStuff here" failed: context deadline exceeded

The range report for r14507 shows everything as expected, with the node in question also being the leaseholder. An output of all goroutines from the Debugging page of the AdminUI shows that a high number of goroutines are in rocksDBIterator.Seek. A core dump of the cockroach process has been created and is available here for further analysis.

Of note, when creating the core dump, the cockroach process was paused long enough for the leaseholder to be moved to a new node, which allowed the range to become available again. The user has confirmed that previously inaccessible rows are now accessible. The cockroach process on the node is still running at near-max CPU usage as of the creation of this issue.

@ricardocrdb ricardocrdb added O-community Originated from the community A-kv-distribution Relating to rebalancing and leasing. labels Dec 10, 2019
@ricardocrdb
Author

CCing @ajwerner @petermattis

@RoachietheSupportRoach
Collaborator

Zendesk ticket #4246 has been linked to this issue.

@ajkr
Contributor

ajkr commented Dec 11, 2019

2.1.9 has multiple infinite loop fixes: https://www.cockroachlabs.com/docs/releases/v2.1.9.html. One for compaction and one for reverse scan. Note there is yet another infinite loop reverse scan bug that is not present in any 2.1 patch release: #35505.

@petermattis
Collaborator

petermattis commented Dec 11, 2019 via email

@ajkr
Contributor

ajkr commented Dec 11, 2019

Got it. These are the stack traces I'm seeing.

Thread 29 (LWP 315581):
#0  0x00007fbeb90c1600 in ?? ()
#1  0x000000000209058d in rocksdb::Slice::compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/../rocksdb/include/rocksdb/slice.h:223
#2  cockroach::DBComparator::Compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/comparator.cc:29
#3  0x00000000021759b2 in rocksdb::InternalKeyComparator::Compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/dbformat.cc:156
#4  0x00000000021aea42 in operator() () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:33
#5  _M_upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_tree.h:1686
#6  upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_tree.h:1111
#7  upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_map.h:1195
#8  rocksdb::CollapsedRangeDelMap::GetTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:339
#9  0x00000000021ac45a in rocksdb::RangeDelAggregator::GetTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:602
#10 0x000000000222ceec in rocksdb::BlockBasedTableIterator::InitRangeTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:2133
#11 0x000000000222f900 in rocksdb::BlockBasedTableIterator::FindKeyForward () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:2039
#12 0x000000000222fc51 in rocksdb::BlockBasedTableIterator::SeekToFirst () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:1877
#13 0x00000000021c16a6 in rocksdb::IteratorWrapper::SeekToFirst () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:69
#14 SkipEmptyFileForward () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:685
#15 0x000000000224dd5d in rocksdb::IteratorWrapper::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:63
#16 rocksdb::MergingIterator::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/merging_iterator.cc:112
#17 0x000000000216d886 in rocksdb::DBIter::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_iter.cc:1165
#18 0x000000000208e038 in Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/batch.cc:155
#19 0x0000000002092d5a in DBIterSeek () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/db.cc:518
#20 0x0000000001ff1733 in _cgo_77181d5e6ea5_Cfunc_DBIterSeek (v=0xc467b10468) at cgo-gcc-prolog:646

I am not sure this rules out an MVCC scan loop, as presumably that is a caller of DBIterSeek. But I am still looking and don't have evidence to rule it in either.

@ajkr
Contributor

ajkr commented Dec 11, 2019

Actually, it does rule it out, as an MVCC scan would show up in the C++ stack trace, sorry.

@petermattis
Collaborator

petermattis commented Dec 11, 2019 via email

@petermattis
Collaborator

Here is a list of the threads from the core file:
threads.txt. I stripped out all of the sleeping threads. The count of threads in DBIterSeek (33) matches up precisely with the count from the goroutine dump. Note that the goroutine dump does indicate one goroutine in _Cfunc_DBIterSeek, but that goroutine has yet to make the transition to cgo.

It looks like there is a pattern of DBIterSeek -> ... -> rocksdb::BlockBasedTableIterator::SeekToFirst. I'm going to see if I can determine the sstable file numbers involved.

@ajkr
Contributor

ajkr commented Dec 11, 2019

Based on some potentially error-prone steps, I suspect the offending file is an L5 file, "2326537.sst".

Another notable thing is the cached range tombstone (BlockBasedTableIterator::range_tombstone_) appears to have the same start and end key. Maybe the skipping optimization gets stuck in this case.

@petermattis
Collaborator

petermattis commented Dec 11, 2019 via email

@ajkr
Contributor

ajkr commented Dec 11, 2019

It isn't recommended, but I rebuilt v2.1.6 with -g instead of -g1 (which does not include local variables), and the debug info looks good to me. I picked a few of the stacks stuck in SeekToFirst(), switched to a frame in BlockBasedTableIterator, then printed the file number. It was the same for the few cases I checked:

(gdb) p this->file_meta_->fd.packed_number_and_path_id & kFileNumberMask
$20 = 2326537
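
For reference, this is roughly what that gdb expression computes; a stand-alone sketch, assuming the FileDescriptor layout where the low bits of packed_number_and_path_id hold the file number and the high bits hold the path id (the exact mask value is copied from my reading of the sources, so treat it as an assumption):

// Hypothetical stand-alone decoder mirroring rocksdb::FileDescriptor.
// Assumption: kFileNumberMask = 0x3FFFFFFFFFFFFFFF, low bits = file number,
// high bits = path id.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kFileNumberMask = 0x3FFFFFFFFFFFFFFF;

uint64_t FileNumber(uint64_t packed) { return packed & kFileNumberMask; }

uint32_t PathId(uint64_t packed) {
  return static_cast<uint32_t>(packed / (kFileNumberMask + 1));
}

int main() {
  const uint64_t packed = 2327355;  // example packed_number_and_path_id
  std::printf("file number=%llu path id=%u\n",
              static_cast<unsigned long long>(FileNumber(packed)),
              PathId(packed));  // small packed values decode to path id 0
}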

@petermattis
Collaborator

petermattis commented Dec 11, 2019 via email

@ajkr
Contributor

ajkr commented Dec 11, 2019

Yes, it was done using the builder image. Here's something else interesting: the empty range tombstone originates from two points in the collapsed map with the same user key, where one has kMaxSequenceNumber and the other has seqnum zero. Also, the two points map to the same seqnum (elem[i]->right), which suggests they came from the same logical tombstone.

I used the std::map pretty printers from https://sourceware.org/ml/gdb/2008-02/msg00064/stl-views.gdb and ran this in a stack frame of rocksdb::CollapsedRangeDelMap:

(gdb) pmap &rep_ ParsedInternalKey SequenceNumber
...
elem[9]->left: $51 = {
  user_key = {
    data_ = 0x7fbdce6e5f20 "X",
    size_ = 76
  },
  sequence = 72057594037927935,
  type = rocksdb::kMaxValue
}
elem[9]->right: $52 = 326417547197
elem[10]->left: $53 = {
  user_key = {
    data_ = 0x7fbdce6e48a0 "X",
    size_ = 76
  },
  sequence = 0,
  type = rocksdb::kMaxValue
}
elem[10]->right: $54 = 326417547197

User keys have been replaced with "X" for privacy. I verified they are the same before replacement.
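
That dump is easier to read with a toy model of the collapsed map. This is a heavily simplified stand-alone sketch (illustrative names, not the RocksDB code): the map orders internal-key start points and maps each to the seqnum covering keys up to the next point, and GetTombstone() is an upper_bound lookup, as in the stack trace earlier in this thread. With the two entries above it hands back the partial tombstone [X#kMaxSeqnum, X#0)#n:

#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using SequenceNumber = uint64_t;
constexpr SequenceNumber kMaxSequenceNumber = (1ull << 56) - 1;  // 72057594037927935

// Internal keys order by user key ascending, then seqnum descending.
struct Point {
  std::string user_key;
  SequenceNumber seq;
  bool operator<(const Point& o) const {
    if (user_key != o.user_key) return user_key < o.user_key;
    return seq > o.seq;  // newer versions sort first
  }
};

struct Tombstone {
  Point start, end;
  SequenceNumber seq;  // 0 means "not covered"
};

// Start point -> seqnum covering keys up to the next start point.
using CollapsedMap = std::map<Point, SequenceNumber>;

// Simplified GetTombstone(): upper_bound finds the next point; the previous
// point supplies the covering seqnum (interior lookups only).
Tombstone GetTombstone(const CollapsedMap& m, const Point& key) {
  auto next = m.upper_bound(key);
  assert(next != m.begin() && next != m.end());
  auto cur = std::prev(next);
  return Tombstone{cur->first, next->first, cur->second};
}

int main() {
  const SequenceNumber n = 326417547197;  // elem[9]->right / elem[10]->right
  CollapsedMap m = {
      {{"X", kMaxSequenceNumber}, n},  // elem[9]
      {{"X", 0}, n},                   // elem[10]
      {{"Y", kMaxSequenceNumber}, 0},  // a later point ending coverage
  };
  // A lookup at any point version of X (seqnum below kMaxSequenceNumber)
  // returns the partial tombstone [X#kMaxSeqnum, X#0)#n.
  Tombstone t = GetTombstone(m, {"X", 12345});
  assert(t.start.seq == kMaxSequenceNumber && t.end.seq == 0 && t.seq == n);
}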

@ajkr
Contributor

ajkr commented Dec 11, 2019

I think the PartialRangeTombstone that is returned according to the above two points is non-empty, actually. It covers the single point key "X". My feeling is the fix should be for BlockBasedTableIterator::FindKeyForward() to properly handle seeking to the end key of a range tombstone whose user key equals the current key.

@ajkr
Contributor

ajkr commented Dec 12, 2019

I looked into how we got in a situation where the collapsed range tombstone map can return a PartialRangeTombstone with an end_key_ whose user key equals a point key at a lower level, and whose seqnum is smaller than that point key. That is a necessary condition to enter the infinite loop I'm thinking of, and is satisfied by the evidence in the core dump.

The point key we noted earlier that scan gets stuck at is "X" in "2326537.sst" at L5. Further investigation shows the covering range tombstone is three levels earlier (i.e., L2) and split across files "2632281.sst" and "2327340.sst". Note the split point is the same "X" as in L5 where scan is stuck.

(gdb) p flevel_[-3].files[3]
$158 = {
  fd = {
    table_reader = 0x0,
    packed_number_and_path_id = 2327355,
    file_size = 2632281
  },
  file_metadata = 0x7fbe06f324c0,
  smallest_key = {
    data_ = 0x7fbea983e4c0 "W",
    size_ = 84
  },
  largest_key = {
    data_ = 0x7fbea983e514 "X",
    size_ = 84
  }
}
(gdb) p flevel_[-3].files[4]
$159 = {
  fd = {
    table_reader = 0x0,
    packed_number_and_path_id = 2327340,
    file_size = 8444279
  },
  file_metadata = 0x7fbdff3a2600,
  smallest_key = {
    data_ = 0x7fbea9874000 "X",
    size_ = 84
  },
  largest_key = {
    data_ = 0x7fbea9874054 "Y",
    size_ = 84
  }
}

Now, we need to figure out why the collapsed map contains an internal key with seqnum smaller than a point key at a lower level. Knowing the latter of the above two files has a range tombstone spanning beyond its earliest key, we can look at that table's smallest key to see how its tombstones would be truncated.

(gdb) p (uint64_t)flevel_[-3].files[4].smallest_key.data_[76]
$162 = 15

The above output indicates the table's smallest key has type 15 (range delete) and seqnum zero. This is highly unusual. Typically a file's smallest key corresponds to its first point key, which should have a non-zero seqnum at an upper level. We would not expect the file to be extended left by range tombstones. It means truncation of our range tombstone produces "X@0" to be inserted into the collapsed map. And the "X" point key at L5 certainly has a higher seqnum than zero. So the necessary condition for the infinite loop is satisfied.
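
For anyone following the byte arithmetic: an internal key is the user key followed by an 8-byte little-endian trailer that packs (seqnum << 8) | type, so the 84-byte smallest_key has a 76-byte user key and byte 76 is the low trailer byte, i.e. the value type. A small self-contained sketch of that layout (the trailer packing and the type-15 constant are my reading of RocksDB's dbformat, so treat them as assumptions):

#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Assumed layout: internal_key = user_key + little-endian fixed64 trailer,
// where trailer = (seqnum << 8) | value_type, and type 15 is range deletion.
constexpr uint8_t kTypeRangeDeletion = 0xF;

std::string MakeInternalKey(const std::string& user_key, uint64_t seq,
                            uint8_t type) {
  const uint64_t trailer = (seq << 8) | type;
  std::string k = user_key;
  char buf[8];
  std::memcpy(buf, &trailer, sizeof(buf));  // little-endian on x86
  k.append(buf, sizeof(buf));
  return k;
}

int main() {
  // A 76-byte user key plus the 8-byte trailer gives the 84-byte
  // smallest_key seen in the file metadata above.
  std::string ikey = MakeInternalKey(std::string(76, 'X'), /*seq=*/0,
                                     kTypeRangeDeletion);
  assert(ikey.size() == 84);
  // Byte 76 is the low trailer byte, i.e. the value type: 15.
  assert(static_cast<unsigned char>(ikey[76]) == 15);
}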

Another unusual thing about the above two files is, although they have overlapping endpoints, they do not have consecutive file numbers. In fact the latter's file number is smaller than the former's file number. This points to subcompaction, which indeed is enabled in v2.1.6. Additionally, L2 appears to be the base level, and L0->Lbase is the case where subcompaction would happen.

So, without a complete understanding/repro, my feeling so far is (1) the unusual database state is related to subcompactions; and (2) although the state is unusual, it doesn't look like any corruption happened, so our seek-to-end-key optimization should handle this case.

@petermattis
Collaborator

Excellent sleuthing, @ajkr. Definitely looks like a partial tombstone with the same start/end user-key would cause a problem. Something that was confusing when I was verifying this is that we seek to tombstone_internal_end_key(). Naively I was assuming that would seek past the tombstone's end key, but tombstone_internal_end_key() actually returns a key with kMaxSequenceNumber.

Seems like we could just skip this optimization if parsed.user_key == range_tombstone_.end_key().user_key. Does that sound right?
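
Roughly something like the following check, as a minimal stand-alone sketch (simplified names, with std::string_view standing in for Slice; not the actual patch):

#include <cassert>
#include <string_view>

// Minimal sketch of the guard proposed above: only take the
// seek-to-tombstone-end shortcut when the tombstone's end user key differs
// from the current user key; otherwise fall back to the normal Next() path.
bool CanSkipToTombstoneEnd(std::string_view current_user_key,
                           std::string_view tombstone_end_user_key) {
  return current_user_key != tombstone_end_user_key;
}

int main() {
  assert(!CanSkipToTombstoneEnd("X", "X"));  // the stuck case from this issue
  assert(CanSkipToTombstoneEnd("X", "Y"));   // normal case: skipping is safe
}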

Does the same problem exist for FindKeyBackward? I think no because of the construction of tombstone_internal_start_key().

@petermattis
Collaborator

Looks like this bug is present in 19.1.x as well. The code in question is very different in 19.2, and this particular optimization is not present.

@ajkr
Contributor

ajkr commented Dec 12, 2019

Seems like we could just skip this optimization if parsed.user_key == range_tombstone_.end_key().user_key. Does that sound right?

Yes, it sounds right. Or tombstone_internal_end_key() could return the actual end_key_ and we could seek there - I am not sure why we seek elsewhere. But skipping the optimization sounds safer.

By the way, the condition for the infinite loop turned out to be slightly different from a PartialRangeTombstone with the same start/end user keys. I spent a while trying to repro it in a test case before realizing it. The condition is like this:

  • Collapsed range-del map has points A#i and B#j mapping to seqnum n (n is the tombstone seqnum whereas i and j are for limiting its span). A and B are variables representing user keys, and A <= B.
  • An SST file has point key B#k, where k > j. Additionally that file's max seqnum is less than n, which is necessary to enable the seek optimization for that file.
  • When iterator is positioned at B#k, it calls GetTombstone() and receives the partial tombstone [A#i, B#j)#n. (Note without k > j it would return a later tombstone and we might be safe.)
  • Now the iterator seeks to a forged end key which is B#kMaxSeqnum. It lands back on B#k and repeats.

The second bullet point makes it hard to repro. Requiring the file's max seqnum to be less than n means that file must be in a lower level than our range tombstone. Meanwhile, k > j requires having a point key in that file that is newer than a point in the collapsed map corresponding to our range tombstone. That is very difficult, as the points in the collapsed map either have max seqnum or are file endpoints where the tombstone was split. The only case where I can find a file endpoint with a lower seqnum than a point key in a lower level is subcompactions, where a range tombstone extending a file to the left causes it to be assigned a smallest key with seqnum zero.
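
To make that concrete, here is a toy stand-alone simulation of those bullets (a simplified model, not the iterator code); it just shows that a seek to the forged target B#kMaxSeqnum lands back on B#k, so the iterator never advances:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using SequenceNumber = uint64_t;
constexpr SequenceNumber kMaxSequenceNumber = (1ull << 56) - 1;

// Internal keys order by user key ascending, then seqnum descending.
struct Key {
  std::string user_key;
  SequenceNumber seq;
  bool operator<(const Key& o) const {
    if (user_key != o.user_key) return user_key < o.user_key;
    return seq > o.seq;  // newer versions sort first
  }
};

int main() {
  const SequenceNumber n = 900;  // range tombstone seqnum (B#j maps to n, j = 0)
  const SequenceNumber k = 50;   // point key seqnum in the file, k > j
  // File contents; its max seqnum (70) is below n, which is what enables
  // the seek-to-tombstone-end optimization for this file.
  std::vector<Key> file = {{"A", 70}, {"B", k}, {"C", 60}};
  std::sort(file.begin(), file.end());

  for (int pass = 0; pass < 3; ++pass) {
    if (n > k) {  // the tombstone is newer than the point key, so try to skip
      // The forged seek target is B#kMaxSeqnum rather than the map's B#j.
      const Key target = {"B", kMaxSequenceNumber};
      auto it = std::lower_bound(file.begin(), file.end(), target);
      std::cout << "seek lands on " << it->user_key << "#" << it->seq << "\n";
      // Prints "seek lands on B#50" every time: the seek never advances past
      // B#k, so the real iterator would loop forever.
    }
  }
}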

@petermattis
Collaborator

Yes, it sounds right. Or tombstone_internal_end_key() could return the actual end_key_ and we could seek there - I am not sure why we seek elsewhere. But skipping the optimization sounds safer.

The comment in tombstone_internal_end_key() scares me from touching that seqnum magic.

The second bullet point makes it hard to repro. Requiring the file's max seqnum to be less than n means that file must be in a lower level than our range tombstone. Meanwhile, k > j requires having a point key in that file that is newer than a point in the collapsed map corresponding to our range tombstone. That is very difficult, as the points in the collapsed map either have max seqnum or are file endpoints where the tombstone was split. The only case where I can find a file endpoint with a lower seqnum than a point key in a lower level is subcompactions, where a range tombstone extending a file to the left causes it to be assigned a smallest key with seqnum zero.

Can you repro with a test directly on BlockBasedTableIterator? That is, construct an sstable with the desired point keys, then construct a BlockBasedTableIterator and pass it a RangeDelAggregator populated with the necessary range tombstones.

@ajkr
Contributor

ajkr commented Dec 12, 2019

Well, I was able to write a test of the condition using subcompaction: ajkr/rocksdb@6b55f57. Sadly, the result is reappearing keys under the subcompaction split point rather than an infinite loop.

@petermattis
Collaborator

Sadly, the result is reappearing keys under the subcompaction split point rather than an infinite loop.

Can you elaborate on this? Did the test find another bug?

@ajkr
Contributor

ajkr commented Dec 13, 2019

Yes. Well we already knew there are bugs with DeleteRange+subcompactions -- it just materialized one.

Let me explain by printing the PartialRangeTombstones that are seen during the DB scan at the end of that test. Here, inf means infinity, i.e., unbounded.

[inf, key000100#72057594037927935)#900
[key000100#72057594037927935, key000100#0)#0
[key000100#0, key000300#72057594037927935)#900
[key000300#72057594037927935, key000300#0)#0
[key000300#0, inf)#900

The subcompaction split points were key000100 and key000300. Notice how the PartialRangeTombstones returned exactly at split point X follow the form [X#kMaxSeqnum, X#0)#0. This means X is not covered in lower levels, while it should be, as the range tombstone spanned over it before it was cut by subcompactions.

This happens because the last file in the subcompaction left of the split point has largest key X#kMaxSeqnum, while the first file in the subcompaction right of the split point has smallest key X#0. It is a loss of power that is harmful to readers, and such tombstones are not restored to their full power during compaction.

@ajkr
Contributor

ajkr commented Dec 13, 2019

Yes. Well we already knew there are bugs with DeleteRange+subcompactions -- it just materialized one.

It's fixed by facebook/rocksdb@64aabc9. We don't need to backport as we already disabled subcompactions in v2.1.

For the original bug, I am still trying to understand what conditions caused the collapsed range-del map to contain a [X#kMaxSeqnum, X#0)#n where n > 0 and X is a subcompaction split point.

@petermattis
Collaborator

It's fixed by facebook/rocksdb@64aabc9. We don't need to backport as we already disabled subcompactions in v2.1.

Ack. Glad you tracked that down.

For the original bug, I am still trying to understand what conditions caused the collapsed range-del map to contain a [X#kMaxSeqnum, X#0)#n where n > 0 and X is a subcompaction split point.

In parallel, it would be good to fix the BlockBasedTableIterator bug. I think that infinite loop condition can be tested with a targeted test of BlockBasedTableIterator and a manually constructed RangeDelAggregator.

@ajkr
Contributor

ajkr commented Dec 16, 2019

For the original bug, I am still trying to understand what conditions caused the collapsed range-del map to contain a [X#kMaxSeqnum, X#0)#n where n > 0 and X is a subcompaction split point.

In parallel, it would be good to fix the BlockBasedTableIterator bug. I think that infinite loop condition can be tested with a targeted test of BlockBasedTableIterator and a manually constructed RangeDelAggregator.

Will do. Here is the test: ajkr/rocksdb@c46e860. While writing it I became even more surprised by the presence of X#kMaxSeqnum: n and X#0: n consecutively in the collapsed map. The testing showed consecutive points at the same sequence number should be collapsed into a single point, i.e., only X#kMaxSeqnum: n should be present.

@ajkr
Contributor

ajkr commented Dec 17, 2019

Fixed by #43253 and #43255.

We still don't fully understand how a collapsed range-del map contains two adjacent points with identical seqnum. It might be worth looking into later.

@ajkr ajkr closed this as completed Dec 17, 2019