Suggestion for improving Set performance and utility #122

Closed

cardbot opened this issue Jul 25, 2017 · 2 comments


cardbot commented Jul 25, 2017

Disclaimer: I know what I'm about to suggest is quite challenging and may not fit with your plans for Badger. But I want to record my thoughts on this, in case you haven't yet considered an approach like this. Or maybe you have, and have found it to be flawed...

How it is

Currently, the various Set methods have some drawbacks:

  1. CAS requests require multiple traversals of the LSM tree (see the sketch after this list):
    a. Get the current value of some key, including its CAS (traversal #1)
    b. Put an updated value, which then calls get in order to check the CAS (traversal #2)
    c. Put then traverses the skiplist to find the insertion location (traversal #3)

  2. CAS counters are uint16 values, which may not be what some applications want or need. It would be much better if applications could define their own CAS values, or omit them entirely to save space.

  3. All mutating requests are funneled to a single goroutine, which executes the requests in series. This means that heavy write loads can only take advantage of a single CPU.
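
To make drawback #1 concrete, here is a minimal sketch of the read-modify-write loop the current API forces on callers. The interface below is a hypothetical approximation of the existing KV methods; exact names and signatures are assumptions for illustration only.

// casKV is a hypothetical stand-in for the current KV API, where Get
// returns a value plus its uint16 CAS counter and CompareAndSet fails
// if the counter has changed in the meantime.
type casKV interface {
	Get(key []byte) (val []byte, casCounter uint16)
	CompareAndSet(key, val []byte, casCounter uint16) error
}

func incrementWithCAS(kv casKV, key []byte) {
	for {
		// Traversal #1: read the current value and its CAS counter.
		val, cas := kv.Get(key)
		newVal := []byte{1}
		if len(val) > 0 {
			newVal[0] = val[0] + 1
		}
		// CompareAndSet re-reads the key to verify the counter
		// (traversal #2), then traverses the skiplist to insert
		// (traversal #3).
		if err := kv.CompareAndSet(key, newVal, cas); err == nil {
			return
		}
		// Counter mismatch: another writer raced us, so retry.
	}
}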

How it could be

Add additional mutating methods to iterators rather than keeping them separate on the KV interface:

type Iterator interface {
	// Existing methods.
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	SeekToFirst()
	Seek(key []byte) bool
	Close()

	// Add creates a new key/value record if it does not yet exist,
	// positions the iterator on it, and returns true. If the record
	// already exists, then Add positions the iterator on the most
	// current value and returns false.
	Add(key []byte, val []byte) bool

	// Set updates the value of the current iteration record if it has
	// not been updated or deleted since iterating or seeking to it,
	// and returns true. If the record has been updated, then Set
	// positions the iterator on the most current value and returns
	// false. If the record has been deleted, then Set positions the
	// iterator on the next record and returns false.
	Set(val []byte) bool

	// Delete marks the current iteration record as deleted from the
	// store if it has not been updated since iterating or seeking to
	// it, and returns true. If the record has been updated, then
	// Delete positions the iterator on the most current value and
	// returns false. If the record has already been deleted, then
	// Delete positions the iterator on the next record and returns
	// false.
	Delete() bool
}

Here's how to use the API to atomically create and increment a value when there are concurrent writers:

func incrementCounter(key []byte) {
	iter := kv.NewIterator()
	defer iter.Close()

	// If the counter does not yet exist, then create it with an
	// initial value of 1.
	if iter.Add(key, []byte{1}) {
		return
	}

	// Increment the value in a loop since other writers may
	// be racing to increment as well.
	for {
		newVal := iter.Value()[0] + 1
		if iter.Set([]byte{newVal}) {
			break
		}
	}
}

Here's how to use the API with a custom int64 CAS value:

func compareAndSet(key, val []byte, casCheck int64) bool {
	iter := kv.NewIterator()
	defer iter.Close()

	newCas := time.Now().UnixNano()
	encoded := append(encodeInt64(newCas), val...)
	if iter.Add(key, encoded) {
		return true
	}

	oldCas := decodeInt64(iter.Value())
	if oldCas != casCheck {
		return false
	}

	return iter.Set(encoded)
}
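
For completeness, the encodeInt64/decodeInt64 helpers above are not part of any proposed API; a minimal sketch using encoding/binary might look like this:

import "encoding/binary"

// encodeInt64 encodes a CAS value as 8 big-endian bytes, suitable for
// prefixing onto the application value.
func encodeInt64(v int64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, uint64(v))
	return buf
}

// decodeInt64 reads the CAS value back from the first 8 bytes of an
// encoded record, ignoring the trailing application value.
func decodeInt64(b []byte) int64 {
	return int64(binary.BigEndian.Uint64(b[:8]))
}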

What's nice about integrating the mutating operations into the iterator is that it gives callers a great deal of control over the CAS semantics, without sacrificing efficiency. In the common case, the atomic increment operation only needs to traverse the LSM tree once (though the iterator would need to be implemented properly to enable that). And the compareAndSet function shows that applications can maintain their own CAS value using whatever mechanism they choose.

To make all this as efficient as possible, consider doing away with the single writer goroutine. All tree traversal can take place without locking. Then, when Add/Set/Delete methods are called on the iterator, a lock can be acquired while checking whether any conflicting updates have occurred, as well as to make the actual update (see the sketch below). In common cases, this can be done very quickly. For example, in most cases the mutable skiplist is already positioned correctly to do the Add/Set/Delete. No further traversal is necessary - just a quick check to make sure that the skiplist is still mutable. By not traversing inside locks and allowing many concurrent writes, a heavy "put" load should be able to scale well across CPUs. This would definitely be some tricky code to write, but it would bring some very nice benefits.
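
A minimal sketch of that locking discipline, with all names hypothetical (a real skiplist would likely use finer-grained or lock-free synchronization): traversal happens without locks, and the mutation takes a short critical section that validates a per-record version before applying the write.

import "sync"

// node is a stand-in for a skiplist entry; version is bumped on every
// update so a writer can cheaply detect conflicting mutations.
type node struct {
	val     []byte
	version uint64
	deleted bool
}

type store struct {
	mu sync.Mutex // guards mutations only; traversal is lock-free
}

// setIfUnchanged is the Iterator.Set fast path: apply the update only
// if nothing has touched the record since the caller observed
// seenVersion during its lock-free traversal.
func (s *store) setIfUnchanged(n *node, seenVersion uint64, val []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n.version != seenVersion || n.deleted {
		return false // conflicting update or delete; caller must re-seek
	}
	n.val = val
	n.version++
	return true
}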


srh commented Jul 26, 2017

> Get the current value of some key, including its CAS (traversal #1)

We don't actually get the "value" of the key -- (*KV)::get() just returns the pointer to the value (unless it's small enough to be inlined in the LSM tree). Then fillItem could be used to get the actual value, as you can see in capital-G Get().


manishrjain commented Jul 27, 2017

The biggest thing to answer here is, what problem are you trying to solve? The complexity of the solution must match the complexity of the problem that's being solved.

I think the main takeaway I get here is to do concurrent writes. The problem there is that we have to maintain a write-ahead log, and file writes are by default serial. If all writes to disk are serial, then their application to RAM (the LSM tree), whether that happens serially or not, is irrelevant.
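
As a rough illustration of why the log serializes writes (this is a hypothetical shape, not Badger's actual code), mutations can be funneled through a channel to one goroutine that appends to the WAL in arrival order:

import "os"

type writeReq struct {
	key, val []byte
	done     chan error
}

// encodeEntry is a hypothetical helper; a real log entry would add
// lengths, checksums, and a sync policy.
func encodeEntry(key, val []byte) []byte {
	return append(append([]byte{}, key...), val...)
}

// walWriter is the single writer goroutine: appends must happen one
// at a time because the WAL is a single file whose ordering defines
// recovery semantics.
func walWriter(reqs <-chan writeReq, wal *os.File) {
	for r := range reqs {
		_, err := wal.Write(encodeEntry(r.key, r.val))
		r.done <- err
	}
}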

On top of that, if the same key is updated twice and we apply those updates concurrently, then it's possible that the latter update gets applied before the former, which would result in data loss. So, to apply them concurrently, we'd have to maintain a dependency tree, which further complicates things.

So, concurrent writes won't result in a performance gain, because the writes to disk have to be serialized. This doesn't even consider the fact that we're bound by the efficiency of compactions, not by application to the memtable in the LSM tree.

But what we have done is allow concurrent reads while writes are going on, and that gives almost all the performance benefit you'd need.

To do atomic increments, the CAS operations would be your friend. They were designed for exactly that purpose.

The idea of optional CAS has some merit, but it requires more care from the user's perspective. They must remember to always do CAS Sets for the same key. A single normal Set would no longer have a CAS counter, which would render any future CAS operations on that key sensitive to failure. So, it's not a clearly better solution than providing a CAS counter by default and letting the user decide when and how they want to use it.
