fix(sort): Make sort consistent for indexed and without indexed predicates #7241

ahsanbarkati · 2021-01-04T11:44:01Z

This PR makes sort results consistent for predicates with and without an index. Previously, for a predicate with an index, we dropped the nodes that don't have the predicate used for sorting, while we retained them for a predicate without an index.

The nulls were dropped in the case of sortWithIndex because, to calculate the result, we took the intersection of the list given for sorting with the index of the predicate used for sorting; hence nodes without that predicate were dropped. In this change, we keep track of the nodes that have null sort predicates and append them at the end of the result, as required by pagination.
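To illustrate the intended behavior, here is a minimal sketch rather than the code in worker/sort.go. It assumes a hypothetical in-memory map `vals` standing in for the sort predicate's index, and a hypothetical helper `sortWithNulls`. It also assumes that nodes without the predicate go to the end for both orderasc and orderdesc, and that a negative `first` means "no limit".

```go
package main

import "sort"

// sortWithNulls is a hypothetical helper, not part of Dgraph. It orders uids
// by their value in vals (a stand-in for the sort predicate's index), appends
// the uids that have no value at the end, and then applies offset/first.
func sortWithNulls(uids []uint64, vals map[uint64]int, desc bool, offset, first int) []uint64 {
	var present, nulls []uint64
	for _, uid := range uids {
		if _, ok := vals[uid]; ok {
			present = append(present, uid)
		} else {
			// Keep track of nodes that lack the sort predicate instead of dropping them.
			nulls = append(nulls, uid)
		}
	}

	// Sort only the uids that actually have a value.
	sort.SliceStable(present, func(i, j int) bool {
		a, b := vals[present[i]], vals[present[j]]
		if desc {
			return a > b
		}
		return a < b
	})

	// Nulls go after all sorted values (assumed here for both directions).
	out := append(present, nulls...)

	// Apply pagination over the combined list.
	if offset < 0 {
		offset = 0
	}
	if offset > len(out) {
		offset = len(out)
	}
	out = out[offset:]
	if first >= 0 && first < len(out) {
		out = out[:first]
	}
	return out
}
```

With this shape, a page that runs past the last sorted value is filled from the null list, so the indexed and non-indexed paths can return the same UIDs for the same offset and first.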

Some benchmarks:
The dataset contains 1 million UIDs, half of which have the predicate used for sorting. Times are in nanoseconds.

No Index:
Query                                   Time
orderasc, first:100 ->                  7092526995
orderasc, offset:100000, first:100 ->   6809554104
orderdesc, first:100 ->                 6981851622
orderdesc, offset:100000, first:100 ->  6862242806

Index:
orderasc, first:100 ->                  1421648793
orderasc, offset:100000, first:100 ->   2044206032
orderdesc, first:100 ->                 1577750157
orderdesc, offset:100000, first:100 ->  1956177772

The results of the same queries on Master:

No Index:
orderasc, first:100 ->                  6881212989
orderasc, offset:100000, first:100 ->   7543954929
orderdesc, first:100 ->                 7327315514
orderdesc, offset:100000, first:100 ->  7262242806

Index:
orderasc, first:100 ->                  1410905825
orderasc, offset:100000, first:100 ->   1890659562
orderdesc, first:100 ->                 1404256406
orderdesc, offset:100000, first:100 ->  2106243131

vvbalaji-dgraph · 2021-01-07T04:34:31Z

@danielmai: this change is expected to produce consistent results while retaining the performance of the sorted index.

vmrajas

Reviewable status: 0 of 5 files reviewed, 3 unresolved discussions (waiting on @ahsanbarkati, @manishrjain, @pawanrawal, @vmrajas, and @vvbalaji-dgraph)

worker/sort.go, line 185 at r5 (raw file):

	span.Annotate(nil, "sortWithIndex")

	maxCount := 0

Can you add a comment on what maxCount is going to store?

worker/sort.go, line 287 at r5 (raw file):

			token := k.Term
			if !order.Desc {
				maxCount = int(ts.Count)

Can you add a comment on why this is done? I know that this would become a rather large comment, but it would help in understanding the code later.

worker/sort.go, line 606 at r5 (raw file):

// intersectBucket intersects every UID list in the UID matrix with the
// indexed bucket.

Can you also add a comment on the significance of the count parameter?

pawanrawal

Reviewable status: 0 of 5 files reviewed, 10 unresolved discussions (waiting on @ahsanbarkati, @manishrjain, and @vvbalaji-dgraph)

query/query1_test.go, line 1963 at r6 (raw file):

}

func TestSortNull2(t *testing.T) {

All of these could have been a single table-driven test, because they test the same thing: null behavior with various values of first and offset.

query/query1_test.go, line 2095 at r6 (raw file):

	query := `{
me(func: uid(61, 62, 63, 64, 65, 66, 67, 68, 69, 70), orderdesc: pred, first: 2) {

How about a test with offset:5 and first:5, and another one with offset:9 and first:5?

query/query2_test.go, line 977 at r6 (raw file):

			"data": {
				"q": [{
					"name_lang_index@de": "öffnen",

Are these the null value ones?

worker/sort.go, line 195 at r6 (raw file):

		var emptySkippedList pb.List
		out[i].ulist = &emptyList
		out[i].skippedUids = &emptySkippedList

Just define it inline like

out[i].skippedUids = &pb.List{}

worker/sort.go, line 325 at r6 (raw file):

		for _, uid := range ul.Uids {
			if _, ok := present[uid]; !ok {
				nullPreds = append(nullPreds, uid)

So this nullPreds is common across all the lists? Shouldn't it be reset for each uid list?

worker/sort.go, line 329 at r6 (raw file):

		}

		requiredCount := int(ts.Count) - len(r.UidMatrix[i].Uids)

remainingCount would be a better name.

worker/sort.go, line 769 at r6 (raw file):

				val.Value = nil
				nullsList = append(nullsList, uid)
				nullVals = append(nullVals, []types.Val{val})

What does val hold? Is it just a null? Would the null be returned? I don't see null being returned in the query tests.

ahsanbarkati

Reviewable status: 0 of 5 files reviewed, 10 unresolved discussions (waiting on @ahsanbarkati, @manishrjain, @pawanrawal, and @vvbalaji-dgraph)

query/query1_test.go, line 1963 at r6 (raw file):