Add support for cursors and limits on LookupResources API calls #1296
Conversation
Force-pushed from a9f14f3 to 37a17dd
first batch of comments, I'm still reviewing 😅
Please note that I'm going commit by commit, so some of my questions may not apply in subsequent commits
internal/graph/reachableresources.go
Outdated
	})
}
ci := newCursorInformation(req.OptionalCursor)
return withSingletonInCursor(ci, "same-type",
introduce constant SECTION_SAME_TYPE
As the section names are only used within the code once, I don't know if I see great value in const-ing them
I thought they could be documented at the top and help the reader understand the different sections involved in a reachability graph call
// outgoingCursorSections are the sections to be added to the outgoing *partial* cursor.
// It is the responsibility of the *caller* to append together the incoming cursors to form
// the final cursor.
outgoingCursorSections []string
I'm not sure if these strings will be allocated over and over during reachability dispatch or if the Go compiler will be clever enough to intern them, but I'm thinking we could provide nicer type checking and reduce allocations and runtime memory at request time if we create a Section enum type.
The problem is that sections can be "arbitrary" in their nesting: how would we predefine a type?
Correct me if I'm wrong, but this slice essentially encodes a "header value" and "N subsequent values" for each section. Why is that something we cannot abstract into a type to avoid using strings? It seemed like all sections had predefined names, but the nesting was indeed arbitrary. Am I missing something?
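For illustration, a minimal sketch of the kind of typed section names being suggested; the type and constant names here are hypothetical, not what the PR uses:

// cursorSection is a hypothetical named string type for cursor section
// identifiers, giving a bit of compile-time checking over raw string literals.
type cursorSection string

// Example section names, mirroring strings that appear in this PR; shown only
// to illustrate the suggestion.
const (
	sectionSameType  cursorSection = "same-type"
	sectionQueryRels cursorSection = "query-rels"
)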
isFirstIteration := true
for index, item := range items {
	if index < afterIndex {
I'm having a hard time understanding why cursorInformation has the structure it has. I'm probably missing something since this PR is massive and I'm going commit by commit, but I keep asking myself why sections and values are modeled with a slice. I recall you mentioning maps wouldn't work because we don't have a guarantee of the order, which is very important for the cursor, but why couldn't we create a struct field for each section and give the cursor a more type-safe shape? This kind of index contortion and jumping around is what concerns me: if we get anything wrong, we could be skipping elements during the iteration.
Because of the arbitrary nesting of cursor sections. As an example, entrypoint #0 might contain a data access underneath it, while entrypoint #1 contains a direct match return, and entrypoint #2 follows a TTU that leads to yet more entrypoints. The chain of sections is different for each path of code being followed.
so you say you can't model it as something like this?
type section struct {
	name     string
	values   []string
	children []section
}
	}
	return a
}
var queryLimit uint64 = 100
I'm not sure if this is something to address in this PR or in a follow-up, but wouldn't this hardcoded limit mean that we could potentially retrieve more elements than the LookupResources API asked for? Say you call with limit 10: you want 10 elements at a time, but the ReverseQueryRelationships call is retrieving a larger page (100), more than needed.
Correct, but it's necessary for efficiency reasons.
How's that more efficient? Can you elaborate?
Because you're not returning these results; you're returning the results computed from them. As an example, say you ask for 10 resources... we might have to follow 50 resources here to compute those 10, so it's better to look them up in bulk.
That's a fair point. I guess this is another of those places where we would fare better by keeping in-memory stats of, say, the p99 size of relationship pages that yielded a positive result; not much we can do about the negative results. Should I open an issue?
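As a rough sketch of the trade-off being discussed above, under the assumption of hypothetical helpers (fetchPage standing in for a paged ReverseQueryRelationships call and verify standing in for the downstream filtering): fetching candidates in pages larger than the caller's limit amortizes datastore round trips, because many candidates are discarded before they count toward the limit.

// collectVerified keeps fetching candidate relationships in pages of pageSize
// until `limit` verified results are found. The page size is deliberately
// larger than the caller's limit because verification discards candidates.
func collectVerified(limit, pageSize int, fetchPage func(offset, count int) []string, verify func(string) bool) []string {
	results := make([]string, 0, limit)
	for offset := 0; len(results) < limit; offset += pageSize {
		page := fetchPage(offset, pageSize)
		if len(page) == 0 {
			break // no more candidates to consider
		}
		for _, candidate := range page {
			// A page often yields far fewer verified results than its size,
			// which is why small pages would mean many more round trips.
			if verify(candidate) {
				results = append(results, candidate)
				if len(results) == limit {
					break
				}
			}
		}
	}
	return results
}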
internal/graph/reachableresources.go
Outdated
@@ -360,21 +362,28 @@ func (crr *ConcurrentReachableResources) redispatchOrReport(
	if foundResourceType.Namespace == parentRequest.ResourceRelation.Namespace &&
		foundResourceType.Relation == parentRequest.ResourceRelation.Relation {
		return parentStream.Publish(&v1.DispatchReachableResourcesResponse{
A bit tangential, but while reviewing the semantics of optimized entrypoints, I think I found that we don't have any test for this scenario right here, which, if I understood correctly, is where reachability is invoked over the last entry point along the current path.
This is possibly a dumb question: something I noted was that the optimized entrypoints might sometimes return empty even though there are entry points via the full entry point methods, (test) case in point here. Can you help me understand why it is ok not to dispatch the "non-optimized" reachability graph when the optimized entrypoints are empty?
Optimized entrypoints will never return empty if there is at least one valid entrypoint in the unoptimized version. Optimized entrypoints are different because they only follow a single child of an intersection, rather than all of them: if we find a resource on one path of the intersection, it must be on all other paths for it to be valid.
I suspected that was the case but was confused by a test case where the optimized entrypoints return empty.
rsm := newResourcesSubjectMap(resourceType)
var lastTpl *core.RelationTuple
for tpl := it.Next(); tpl != nil; tpl = it.Next() {
	if it.Err() != nil {
if err := it.Err(); err != nil {
	return nil, err
}
Why this vs just calling it.Err again?
it's idiomatic go?
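For context, a sketch of the loop above with the combined form, using a simplified, hypothetical iterator interface rather than the real datastore iterator:

// tupleIterator is a simplified stand-in for the datastore relationship iterator.
type tupleIterator interface {
	Next() *string
	Err() error
}

// drain walks the iterator, checking the error state once per step with the
// idiomatic `if err := it.Err(); err != nil` form discussed above.
func drain(it tupleIterator) error {
	for tpl := it.Next(); tpl != nil; tpl = it.Next() {
		if err := it.Err(); err != nil {
			return err
		}
		_ = tpl // process the tuple here
	}
	// A nil result from Next can also indicate a terminal error.
	return it.Err()
}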
@@ -225,7 +236,7 @@ func (crr *ConcurrentReachableResources) chunkedRedispatch(
	resourceType *core.RelationReference,
	handler func(ctx context.Context, ci cursorInformation, resources dispatchableResourcesSubjectMap) error,
) error {
-	return withQueryInCursor(ci, "query-rels",
+	return withDatastoreCursorInCursor(ci, "query-rels",
Sorry if I'm repeating myself on some concepts, but I'm going commit by commit, so some assumptions do not hold from one commit to another.
Even though this commit is introducing the concept of limits, we are deliberately not using it in withDatastoreCursorInCursor. I think that's probably ok because we might need to redispatch anyway, which gives you a subset, but I wonder if, based on the limit, we could use a heuristic to fetch a proportional number of relationships from ReverseQueryRelationships. If your limit is, say, 1K (since we cannot do more than that because of proto validation), we could issue up to 10 DB round trips because we hardcode the query limit as 100.
}

// -1 means that the handler has been completed.
return next(ci.mustWithOutgoingSection(name, "-1"))
Let's extract that -1 and turn it into a constant with its semantics documented.
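A minimal sketch of what that could look like; the constant name is invented here, not necessarily what landed:

// completedSectionValue is the sentinel cursor value recorded for a section
// once its handler has fully completed, so a resumed call can skip it.
const completedSectionValue = "-1"

// e.g. return next(ci.mustWithOutgoingSection(name, completedSectionValue))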
	Namespace: prometheusNamespace,
	Subsystem: prometheusSubsystem,
-	Name:      "lookup_total",
+	Name:      "lookup_resources_total",
The rename of the Prometheus metrics is going to break anybody out there who has created Grafana dashboards for SpiceDB... including us 😅
Given this is a pretty big change and we'd need to carefully monitor how it affects our production systems during the rollout, can we keep the rename for everything except Prometheus, and do the Prometheus rename in a follow-up, in an isolated deployment?
I think it's going to cause problems for folks out there monitoring SpiceDB. Is there perhaps a way to do aliasing?
Hrmph.... maybe we should alias it?
I couldn't find any reference to aliasing in Prometheus, and duplicating all these metrics can easily increase the cost of the observability stack in large enough deployments. I'm not sure, honestly; the most conservative option would be to leave this one as it is, with everything else properly renamed. Or we can decide not to honor the old metric names, which can be painful for the community.
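Prometheus has no built-in metric aliasing, but one way to keep old dashboards working during a deprecation window, at the cost of the duplicated series mentioned above, would be to register both names and increment them together. A hedged sketch, not what this PR does; the package, namespace, and subsystem values are placeholders:

package dispatch

import "github.com/prometheus/client_golang/prometheus"

var (
	// Old name, kept temporarily so existing dashboards keep working.
	lookupTotalDeprecated = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "spicedb",
		Subsystem: "dispatch",
		Name:      "lookup_total",
		Help:      "Deprecated: use lookup_resources_total instead.",
	})
	// New, properly named metric.
	lookupResourcesTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "spicedb",
		Subsystem: "dispatch",
		Name:      "lookup_resources_total",
		Help:      "Number of LookupResources dispatches.",
	})
)

func init() {
	prometheus.MustRegister(lookupTotalDeprecated, lookupResourcesTotal)
}

// recordLookup increments both counters while the old name is still published.
func recordLookup() {
	lookupTotalDeprecated.Inc()
	lookupResourcesTotal.Inc()
}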
Force-pushed from db9bd85 to a563da0
pkg/cursor/cursor_test.go
Outdated
func TestDecode(t *testing.T) {
	for _, testCase := range []struct {
		format string
nit: the field name. It took me a few more seconds than I would have liked to realize this field holds the test case name.
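A small sketch of the shape the nit is asking for, with an explicit name field used as the subtest name; illustrative only, not the actual test in pkg/cursor:

import "testing"

func TestDecode(t *testing.T) {
	for _, tc := range []struct {
		name    string // surfaced by t.Run in failure output
		input   string
		wantErr bool
	}{
		{name: "empty cursor", input: "", wantErr: true},
		{name: "well-formed cursor", input: "c29tZS1jdXJzb3I", wantErr: false},
	} {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			// The real test would call the decode function here; this sketch
			// only shows the naming pattern.
			_ = tc.input
		})
	}
}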
Force-pushed from a563da0 to 246ca8b
Updated
LGTM, thanks for your patience with such a long review! ✨
internal/services/v1/permissions.go
Outdated
errG.Go(func() error {
	return namespace.CheckNamespaceAndRelation(
		checksCtx,
		req.Subject.Object.ObjectType,
		normalizeSubjectRelation(req.Subject),
		true,
		req.ResourceObjectType,
		req.Permission,
		false,
		ds,
	)
})
errG.Go(func() error {
	return namespace.CheckNamespaceAndRelation(
		ctx,
		req.ResourceObjectType,
		req.Permission,
		false,
		req.Subject.Object.ObjectType,
		stringz.DefaultEmpty(req.Subject.OptionalRelation, tuple.Ellipsis),
		true,
		ds,
	)
})
opened #1333
@@ -36,6 +33,10 @@ type ValidatedLookupResourcesRequest struct {
	Revision datastore.Revision
}

// reachableResourcesLimit is a limit set on the reachable resources calls to ensure caching
// stores smaller chunks.
const reachableResourcesLimit = 1000
Are you implying by omission that caching is not a problem and that differently sized RR responses are reusable regardless?
Actually, no: we're not forwarding the limit because of the filtering aspect. A request for 100 resources from LR might need 1000 from reachable resources, because 900 are checked and not returned. This is, to be fair, a corner case, and in most cases the limits should be fairly evenly matched, but it is possible, so we run a (small) risk of having to make extra dispatch requests to reachable resources in that scenario. Thoughts?
I think this is once again one of those situations where the system should adapt dynamically by keeping stats around and issuing the limit that yields the fewest dispatch requests. For the time being this is fine, but let's open follow-up issues because it will be forgotten in the ReachableResources ocean 🙏🏻
default:
	return spiceerrors.MustBugf("unknown check result status for reachable resources")
[]string{reachableResource.ResourceId},
Let's open a follow up issue please 🙏🏻
if errors.Is(err, context.Canceled) {
	return nil
}
ah, got it 👍🏻
so you say you can't model it as something like this?
Not all sections have values. We could define a
That would work, but it wouldn't add a great deal to the type safety. @vroldanbet Let me know if you'd like me to do ^
Let's leave it as it is 👍🏻
Force-pushed from 246ca8b to 0509cda
Also adds a test to ensure cursor hashes are stable
Force-pushed from e7aff85 to 776745b
This change adds support for (optional) cursors and limits on LookupResources API calls, changing them to be fully streaming, which should significantly reduce or even remove the pause when loading very large sets of resources.
This is accomplished by changing ReachableResources into a non-parallelized, ordered (internally defined) and cursored API and having LookupResources invoke it in a streaming manner, passing the cursor down into it.
This PR also includes a number of small tech-debt cleanups, including renaming Lookup -> LookupResources everywhere.
Fixes #43
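To make the new knobs concrete, here is a hedged sketch of paging through LookupResources with the authzed-go client; the field names (OptionalLimit, OptionalCursor, AfterResultCursor) reflect my understanding of the v1 API added alongside this work, so verify them against the proto definitions before relying on them.

package example

import (
	"context"
	"fmt"
	"io"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
)

// lookupAllResources pages through LookupResources using a limit and a cursor,
// resuming from the last received cursor until a short page signals the end.
func lookupAllResources(ctx context.Context, client *authzed.Client) error {
	const pageSize = 100
	var cursor *v1.Cursor
	for {
		stream, err := client.LookupResources(ctx, &v1.LookupResourcesRequest{
			ResourceObjectType: "document",
			Permission:         "view",
			Subject: &v1.SubjectReference{
				Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "tom"},
			},
			OptionalLimit:  pageSize,
			OptionalCursor: cursor,
		})
		if err != nil {
			return err
		}

		received := 0
		for {
			resp, err := stream.Recv()
			if err == io.EOF {
				break
			}
			if err != nil {
				return err
			}
			fmt.Println(resp.ResourceObjectId)
			cursor = resp.AfterResultCursor // resume point for the next page
			received++
		}
		if received < pageSize {
			return nil // a short page means there is nothing left
		}
	}
}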