Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for cursors and limits on LookupResources API calls #1296

Merged
merged 21 commits into from
May 22, 2023

Conversation

josephschorr
Copy link
Member

@josephschorr josephschorr commented May 3, 2023

Note
This PR has changes for the ReachableResources and LookupResources dispatch

This change adds support for (optional) cursors and limits on LookupResource API calls, changing them to be fully streaming, which should significantly reduce or even remove the pause when loading very large sets of resources

This is accomplished by changing ReachableResources into a non-parallelized, ordered (internally defined) and cursored API and having LookupResources invoke it in a streaming manner, passing the cursor down into it.

This PR also has a number of small tech debt cleanups, including renaming Lookup -> LookupResources everywhere

Fixes #43

@josephschorr josephschorr requested a review from a team May 3, 2023 20:19
@github-actions github-actions bot added area/api v1 Affects the v1 API area/dependencies Affects dependencies area/dispatch Affects dispatching of requests area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) labels May 3, 2023
@josephschorr josephschorr marked this pull request as draft May 3, 2023 20:22
@josephschorr josephschorr force-pushed the cursored-lookup-resources branch from a9f14f3 to 37a17dd Compare May 3, 2023 20:26
@josephschorr josephschorr marked this pull request as ready for review May 3, 2023 20:51
Copy link
Contributor

@vroldanbet vroldanbet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first batch of comments, I'm still reviewing 😅

Please note that I'm going commit by commit, so some of my questions may not apply in subsequent commits

internal/graph/cursors.go Show resolved Hide resolved
internal/graph/cursors.go Outdated Show resolved Hide resolved
internal/graph/cursors.go Outdated Show resolved Hide resolved
internal/graph/cursors.go Outdated Show resolved Hide resolved
proto/internal/dispatch/v1/dispatch.proto Show resolved Hide resolved
})
}
ci := newCursorInformation(req.OptionalCursor)
return withSingletonInCursor(ci, "same-type",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

introduce constant SECTION_SAME_TYPE

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the section names are only used within the code once, I don't know if I see great value in const-ing them

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought they could be documented at the top and help the reader understand the different sections involved in a reachability graph call

// outgoingCursorSections are the sections to be added to the outgoing *partial* cursor.
// It is the responsibility of the *caller* to append together the incoming cursors to form
// the final cursor.
outgoingCursorSections []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these strings will be allocated over and over during reachability dispatch or if the go compiler will be clever enough to internalize them, but I'm thinking we could provide nicer type checking and reduce allocations and runtime-memory at request time if we create a Section enum type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that sections are be "arbitrary" in their nesting: how would we predefine a type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me if im wrong but this slice essentially encodes a "header value" and "N subsequent values" for each section. How is that something we cannot abstract into a type and avoid using strings. It seemed like all sections had predefined names, but the nesting was indeed arbitrary. Do I miss something?

internal/graph/reachableresources.go Outdated Show resolved Hide resolved
internal/graph/reachableresources.go Outdated Show resolved Hide resolved

isFirstIteration := true
for index, item := range items {
if index < afterIndex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time understanding why cursorInformation has the structure it has. I'm probably missing something since this PR is massive and I'm going commit by commit, but I keep asking myself why the choice of modeling sections and values with a slice. I recall you mentioning maps wouldn't work because we don't have a guarantee of the order, which is very important for the cursor, but why couldn't create a struct field for each section, and give the cursor a more type-safe shape. This kind of index contortions and jumping around is what preoccupies me, if we get anything wrong... we could be skipping elements during the iteration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the arbitrary nesting of cursor sections. As an example, entrypoint #0 might contain a data access underneath it, while entrypoint #1 contains a direct match return, and entrypoint #2 follows a TTU that leads to yet more entrypoints. The chain of sections is different for each path of code being followed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you say you can't model it as something like this?

type section struct {
  name string
  values []string
  children []section
}

}
return a
}
var queryLimit uint64 = 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is part of this PR to address or something for a follow up, but wouldn't this hardcoded limit means that we would be potentially retrieving more elements that the LookupResources API asked for?
Say you call with limit 10, you want 10 elements at a time, but the ReverseQueryRelationships call is retrieving a larger page (100), more than needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct but its necessary for efficiency reasons

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How's that more efficient? Can you elaborate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you're not returning these results, you're returning the results computed from them. As an example, say you ask for 10 resources... we might have to follow 50 resources here to compute those 10, so its better to look them up in bulk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a fair point. I guess this is another of those points where we would fare better by keeping in-memory stats of say, the p99 size of relationships returned that yielded a positive result. not much we can do about the negative result. Should I open an issue?

internal/graph/cursors.go Show resolved Hide resolved
@@ -360,21 +362,28 @@ func (crr *ConcurrentReachableResources) redispatchOrReport(
if foundResourceType.Namespace == parentRequest.ResourceRelation.Namespace &&
foundResourceType.Relation == parentRequest.ResourceRelation.Relation {
return parentStream.Publish(&v1.DispatchReachableResourcesResponse{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit tangential, but while reviewing the semantics of optimized revisions, I think I found that we don't have any test for this scenario right here, which if I understood correctly is reachability is invoked over the last entry point along the current path.

This is possibly a dumb question: something I noted was the optimized entrypoints might sometimes return empty even though there are entry points via the full entry point methods, (test) case in point here. Can you help me understand why it is ok to not dispatch the "non-optimized" reachability graph when optimized revisions is empty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optimized entrypoints will never return empty if there is at least one valid entrypoint in the unoptimized version. optimized entrypoints is different because it only follows a single child of the intersection, rather than all of them, because if we find a resource on one path of the intersection, it must be on all others for it to be valid

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspected that was the case but was confused by a test-case that has optimized revision returning empty

internal/graph/reachableresources.go Outdated Show resolved Hide resolved
}
return a
}
var queryLimit uint64 = 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How's that more efficient? Can you elaborate?

rsm := newResourcesSubjectMap(resourceType)
var lastTpl *core.RelationTuple
for tpl := it.Next(); tpl != nil; tpl = it.Next() {
if it.Err() != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err := it.Err(); err != nil {
  return nil, err
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this vs just calling it.Err again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's idiomatic go?

internal/dispatch/graph/reachableresources_test.go Outdated Show resolved Hide resolved
internal/graph/reachableresources.go Outdated Show resolved Hide resolved
internal/dispatch/graph/reachableresources_test.go Outdated Show resolved Hide resolved
internal/dispatch/graph/reachableresources_test.go Outdated Show resolved Hide resolved
internal/namespace/reachabilitygraph.go Show resolved Hide resolved
@@ -225,7 +236,7 @@ func (crr *ConcurrentReachableResources) chunkedRedispatch(
resourceType *core.RelationReference,
handler func(ctx context.Context, ci cursorInformation, resources dispatchableResourcesSubjectMap) error,
) error {
return withQueryInCursor(ci, "query-rels",
return withDatastoreCursorInCursor(ci, "query-rels",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I'm reiterating myself on some concepts, but I'm going commit by commit, so some assumptions do not hold from one commit to another.

Even though this commit is introducing the concept of limits, we are deliberately not using it in withDatastoreCursorInCursor. I think that's probably ok because we might need to redispatch anyway, which gives you a subset, but I wonder if, based on the limit, we can follow a heuristic to follow a proportional amount of relationships from ReverseQueryRelationship. If your limit is, say, 1K (since we cannot do more than that because of proto validation), we could issue up to 10 DB roundtrips because we hardcode query limit as 100.

}

// -1 means that the handler has been completed.
return next(ci.mustWithOutgoingSection(name, "-1"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's extract that -1 and turn it into a constant with its semantics documented.

internal/namespace/reachabilitygraph.go Show resolved Hide resolved
internal/graph/limits.go Show resolved Hide resolved
internal/graph/resourcesubjectsmap.go Show resolved Hide resolved
internal/graph/reachableresources.go Outdated Show resolved Hide resolved
Namespace: prometheusNamespace,
Subsystem: prometheusSubsystem,
Name: "lookup_total",
Name: "lookup_resources_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename of prometheus metrics is going to break anybody out there that had created Grafana dashboards for SpiceDB... including us 😅

Given this is a pretty big change and we'd need to carefully monitor how this affects our production systems during the rollout, can we keep the rename for all the things except prometheus, and we do the prometheus rename in a follow up, in an isolated deployment?

I think it's going to cause problems to folks out there monitoring SpiceDB. Is there perhaps a way to do aliasing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmph.... maybe we should alias it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any reference to aliasing in prometheus, and duplicating all these metrics can easily increase the costs on the observability stack in large enough deployments. I'm not sure honestly, the most conservative would be to leave it as it is, everything else would be properly named. Or we can decide to not honor metrics names, which can be painful for the community.

@josephschorr josephschorr force-pushed the cursored-lookup-resources branch from db9bd85 to a563da0 Compare May 11, 2023 19:57
internal/dispatch/caching/caching.go Show resolved Hide resolved
internal/dispatch/graph/lookupresources_test.go Outdated Show resolved Hide resolved
proto/internal/impl/v1/impl.proto Outdated Show resolved Hide resolved
internal/dispatch/keys/computed_test.go Show resolved Hide resolved
internal/dispatch/keys/hasher_common.go Outdated Show resolved Hide resolved
internal/services/v1/hash.go Show resolved Hide resolved
internal/graph/reachableresources.go Show resolved Hide resolved

func TestDecode(t *testing.T) {
for _, testCase := range []struct {
format string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: name. It took me a few more seconds than I would have liked to realize this was a test case name.

@josephschorr josephschorr force-pushed the cursored-lookup-resources branch from a563da0 to 246ca8b Compare May 17, 2023 18:17
@josephschorr
Copy link
Member Author

Updated

vroldanbet
vroldanbet previously approved these changes May 19, 2023
Copy link
Contributor

@vroldanbet vroldanbet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your patience with such a long review! ✨

Namespace: prometheusNamespace,
Subsystem: prometheusSubsystem,
Name: "lookup_total",
Name: "lookup_resources_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any reference to aliasing in prometheus, and duplicating all these metrics can easily increase the costs on the observability stack in large enough deployments. I'm not sure honestly, the most conservative would be to leave it as it is, everything else would be properly named. Or we can decide to not honor metrics names, which can be painful for the community.

// outgoingCursorSections are the sections to be added to the outgoing *partial* cursor.
// It is the responsibility of the *caller* to append together the incoming cursors to form
// the final cursor.
outgoingCursorSections []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me if im wrong but this slice essentially encodes a "header value" and "N subsequent values" for each section. How is that something we cannot abstract into a type and avoid using strings. It seemed like all sections had predefined names, but the nesting was indeed arbitrary. Do I miss something?


isFirstIteration := true
for index, item := range items {
if index < afterIndex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you say you can't model it as something like this?

type section struct {
  name string
  values []string
  children []section
}

internal/namespace/reachabilitygraph.go Show resolved Hide resolved
Comment on lines 340 to 358
errG.Go(func() error {
return namespace.CheckNamespaceAndRelation(
checksCtx,
req.Subject.Object.ObjectType,
normalizeSubjectRelation(req.Subject),
true,
req.ResourceObjectType,
req.Permission,
false,
ds,
)
})
errG.Go(func() error {
return namespace.CheckNamespaceAndRelation(
ctx,
req.ResourceObjectType,
req.Permission,
false,
req.Subject.Object.ObjectType,
stringz.DefaultEmpty(req.Subject.OptionalRelation, tuple.Ellipsis),
true,
ds,
)
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened #1333

internal/dispatch/keys/hasher_common.go Outdated Show resolved Hide resolved
@@ -36,6 +33,10 @@ type ValidatedLookupResourcesRequest struct {
Revision datastore.Revision
}

// reachableResourcesLimit is a limit set on the reachable resources calls to ensure caching
// stores smaller chunks.
const reachableResourcesLimit = 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you implying by omission that caching is not a problem and that different RR response sizes are reusable regardless?

Actually, no: we're not forwarding the limit because of the filtering aspect: a request for 100 resources from LR might need 1000 from reachable resources, because 900 are checked and are not returned. This is, to be fair, a corner case and in most cases the limits should be fairly evenly matched, but it is possible, so we run a (small) risk of having to make extra dispatch requests to reachable resources in that scenario. Thoughts?

I think this once again one of those situations where the system should adapt dynamically keeping stats around, and issue the limit that yields less amount of dispatch requests. For the time being this is fine, but let's open follow up issues because it will be forgotten in the ReachableResources ocean 🙏🏻


default:
return spiceerrors.MustBugf("unknown check result status for reachable resources")
[]string{reachableResource.ResourceId},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's open a follow up issue please 🙏🏻

Comment on lines +144 to +149
if errors.Is(err, context.Canceled) {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, got it 👍🏻

internal/graph/reachableresources.go Show resolved Hide resolved
@josephschorr
Copy link
Member Author

so you say you can't model it as something like this?

type section struct {
  name string
  values []string
  children []section
}

Not all sections have values. We could define a Section sub-type message in the protocol buffer, and then have it be:

message CursorSection {
  string name = 1;
  value_oneof {
     string str_value = 2;
     uint64 numeric_value = 3;
  }
}

That would work, but it wouldn't add a great deal to the type safety.

@vroldanbet Let me know if you'd like me to do ^

@vroldanbet
Copy link
Contributor

That would work, but it wouldn't add a great deal to the type safety.
@vroldanbet Let me know if you'd like me to do ^

Let's leave as it is 👍🏻

vroldanbet
vroldanbet previously approved these changes May 22, 2023
@josephschorr josephschorr force-pushed the cursored-lookup-resources branch from e7aff85 to 776745b Compare May 22, 2023 15:45
@josephschorr josephschorr merged commit a81b79c into authzed:main May 22, 2023
@josephschorr josephschorr deleted the cursored-lookup-resources branch May 22, 2023 16:02
@github-actions github-actions bot locked and limited conversation to collaborators May 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/api v1 Affects the v1 API area/dispatch Affects dispatching of requests area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dispatching Lookup should stream results
2 participants