Relationships selected in SQL-based datastores now elide columns that have static values #2096

Open. josephschorr wants to merge 11 commits into main from rel-struct-sql

Conversation

josephschorr (Member) commented Oct 21, 2024

This vastly reduces data over the wire, as well as deserialization time and memory usage

Fixes #59
Fixes #1527

@github-actions github-actions bot added area/datastore Affects the storage system area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) labels Oct 21, 2024
@josephschorr josephschorr force-pushed the rel-struct-sql branch 3 times, most recently from d6104da to 44f2cbb Compare October 22, 2024 16:33
@josephschorr josephschorr marked this pull request as ready for review October 22, 2024 16:49
@josephschorr josephschorr requested a review from a team October 22, 2024 16:49
@josephschorr josephschorr changed the title Relationships selected in SQL-based datastores should elide columns that have static values Relationships selected in SQL-based datastores now elide columns that have static values Oct 22, 2024
tstirrat15 previously approved these changes Oct 22, 2024
Contributor

@tstirrat15 left a comment

See comments. What I saw looked good but I wouldn't mind having another set of eyes on it.

}

// QueryRelationships queries relationships for the given query and transaction.
func QueryRelationships[R Rows, C ~map[string]any](ctx context.Context, queryInfo QueryInfo, sqlStatement string, args []any, span trace.Span, tx Querier[R], withIntegrity bool) (datastore.RelationshipIterator, error) {
Contributor

What does the ~ mean in this context?

Member Author

~T means the set of all types whose underlying type is T

Member

If you did something like type MyType map[string]any, doing ~map[string]any will also accept that and not just explicitly map[string]any

Member Author

Which is exactly what we do
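
To illustrate the `~` discussion above, here is a minimal sketch. The named type `caveatContext` and the helper `keyCount` are hypothetical and are not taken from this PR; they only show that a `~map[string]any` constraint accepts both the plain map type and a named type defined on top of it.

```go
package main

import "fmt"

// caveatContext is a hypothetical named type whose underlying type is
// map[string]any. It stands in for whatever named map type the caller uses.
type caveatContext map[string]any

// keyCount accepts any type whose underlying type is map[string]any,
// because of the ~ in the constraint. Without the ~, only the exact type
// map[string]any would satisfy the constraint and caveatContext would be
// rejected at compile time.
func keyCount[C ~map[string]any](c C) int {
	return len(c)
}

func main() {
	plain := map[string]any{"a": 1}
	named := caveatContext{"a": 1, "b": 2}
	fmt.Println(keyCount(plain), keyCount(named))
}
```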

Comment on lines -197 to -202
colIntegrityKeyID,
colIntegrityHash,
colTimestamp,
Contributor

For myself: this was moved into extraFields on the common.SchemaInformation struct.

Comment on lines 428 to 441
type wrappedTX struct {
tx querier
}
Contributor

What role does this play?

Member Author

It translates the interface

ColCaveatName string
ColCaveatContext string
PaginationFilterType PaginationFilterType
PlaceholderFolder sq.PlaceholderFormat
Contributor

Folder?

Contributor

Also is this in here because this allows you to push more logic up into the common sql logic?

Member Author

Yes

Comment on lines +189 to +198
var resourceObjectType string
var resourceObjectID string
var relation string
Contributor

Does it mean anything that these declarations moved out of the loop?

Member Author

It means they are on the heap, which isn't good, but it also means we don't need to recalculate them every iteration, which is good

@@ -1573,10 +1573,7 @@ func ConcurrentWriteSerializationTest(t *testing.T, tester DatastoreTester) {
<-waitToFinish
return err
})
if err != nil {
panic(err)
Contributor

I take it this is unnecessary when we're using MustBugF?

Member Author

This is in a test, so I changed it to just return the error

@@ -562,17 +654,57 @@ func (tqs QueryExecutor) ExecuteQuery(
limit = *queryOpts.Limit
}

toExecute := query.limit(limit)
if limit < math.MaxInt64 {
query = query.limit(limit)
Contributor

Is this conditional just to reduce what's going over the wire?

Member Author

Basically, yes

@josephschorr
Member Author

Updated

@josephschorr josephschorr force-pushed the rel-struct-sql branch 2 times, most recently from b45f0a7 to 65b4e07 Compare October 28, 2024 15:47
@github-actions github-actions bot added the area/dispatch Affects dispatching of requests label Oct 28, 2024
@josephschorr josephschorr force-pushed the rel-struct-sql branch 2 times, most recently from aaad075 to 334d8ea Compare October 28, 2024 15:58
@github-actions github-actions bot added area/CLI Affects the command line area/api v1 Affects the v1 API labels Nov 5, 2024
@josephschorr josephschorr requested a review from a team as a code owner November 22, 2024 18:21
@josephschorr
Member Author

Rebased


// StaticValueOrAddColumnForSelect adds a column to the list of columns to select if the value
// is not static, otherwise it sets the value to the static value.
func StaticValueOrAddColumnForSelect(colsToSelect []any, queryInfo QueryInfo, colName string, field *string) []any {
Contributor

why are these []any? aren't columns always []string?

Member Author

They are actually string references, and it returns an []any because that's what rows.Scan requires.

This isn't adding the static value to the slice; it's adding the reference to the string value for collecting the non-static value
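
A minimal sketch of the behavior described above, under stated assumptions: `queryInfo`, `staticValueOrAddColumn`, and `fakeScan` are simplified, hypothetical stand-ins and not the PR's actual code. When a column is known to have a static value, the value is written straight into the field; otherwise the pointer to the field is appended so that a Scan-style call can fill it from the row.

```go
package main

import "fmt"

// queryInfo is a hypothetical stand-in for the PR's QueryInfo. It records
// which columns were filtered down to a single, statically known value.
type queryInfo struct {
	staticValues map[string]string
}

// staticValueOrAddColumn sets *field to the static value when one is known
// and selects nothing for that column. Otherwise it appends the pointer
// itself, which is why the slice is []any and not []string.
func staticValueOrAddColumn(colsToSelect []any, qi queryInfo, colName string, field *string) []any {
	if v, ok := qi.staticValues[colName]; ok {
		*field = v
		return colsToSelect
	}
	return append(colsToSelect, field)
}

// fakeScan imitates the shape of rows.Scan(dest ...any) for *string
// destinations only, so the sketch runs without a database.
func fakeScan(row []string, dest ...any) error {
	if len(row) != len(dest) {
		return fmt.Errorf("row has %d values but %d destinations were given", len(row), len(dest))
	}
	for i, d := range dest {
		p, ok := d.(*string)
		if !ok {
			return fmt.Errorf("destination %d is not a *string", i)
		}
		*p = row[i]
	}
	return nil
}

func main() {
	qi := queryInfo{staticValues: map[string]string{"namespace": "document"}}

	var namespace, objectID string
	cols := staticValueOrAddColumn(nil, qi, "namespace", &namespace)
	cols = staticValueOrAddColumn(cols, qi, "object_id", &objectID)

	// Only object_id is fetched; namespace was filled from the static value.
	if err := fakeScan([]string{"readme"}, cols...); err != nil {
		panic(err)
	}
	fmt.Println(namespace, objectID, len(cols))
}
```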

// and injects additional proxies for validation at test time.
// NOTE: These additional proxies are not performant for use in production (but then,
// neither is memdb)
func NewMemDBDatastoreForTesting(
Contributor

Can we keep this in the memdb package? This just feels hard to find.

Member Author

I didn't want the memdb package to have requirements on any of the common SQL code, since it is an odd mixture


// WithAdditionalFilter returns a new SchemaQueryFilterer with an additional filter applied to the query.
func (sqf SchemaQueryFilterer) WithAdditionalFilter(filter func(original sq.SelectBuilder) sq.SelectBuilder) SchemaQueryFilterer {
return SchemaQueryFilterer{
Contributor

Instead of this, what about a Clone() method (with a test), and then modify the clone? Then you can't accidentally omit a field in these With methods.

Member Author

Sure. This is also why I want to have the linter check these, but I'll add a Clone
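
The Clone suggestion could look roughly like the sketch below. The `filterer` type and its fields are hypothetical simplifications of SchemaQueryFilterer, not the real struct; the point is only that a With method built on a clone cannot forget to carry a field over.

```go
package main

import "fmt"

// filterer is a hypothetical, simplified stand-in for SchemaQueryFilterer.
type filterer struct {
	table   string
	filters []string
	limit   int
}

// clone copies every field by value, so fields added later are carried
// over automatically. The slice is copied explicitly so the original and
// the clone never share backing storage.
func (f filterer) clone() filterer {
	c := f
	c.filters = append([]string(nil), f.filters...)
	return c
}

// withAdditionalFilter modifies a clone instead of rebuilding the struct
// field by field.
func (f filterer) withAdditionalFilter(expr string) filterer {
	c := f.clone()
	c.filters = append(c.filters, expr)
	return c
}

func main() {
	base := filterer{table: "relation_tuple", filters: []string{"namespace = ?"}, limit: 10}
	derived := base.withAdditionalFilter("object_id = ?")
	fmt.Println(len(base.filters), len(derived.filters), derived.limit)
}
```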


colsToSelect = append(colsToSelect, &expiration)

if withIntegrity {
Contributor

wouldn't it make sense to determine this from QueryInfo instead of a separate boolean?

Member Author

They are injected at different places. To add it to QueryInfo, we'd need a Clone or a WithIntegrity(bool) on QueryInfo. I'm happy to make that change, if you like

}

// QueryRelationships queries relationships for the given query and transaction.
func QueryRelationships[R Rows, C ~map[string]any](ctx context.Context, queryInfo QueryInfo, sqlStatement string, args []any, span trace.Span, tx Querier[R], withIntegrity bool) (datastore.RelationshipIterator, error) {
Contributor

I think it would be simpler to have a QueryRelationshipsNoElision (or whatever name) instead of all of the queryInfo.Schema.ColumnOptimization == ColumnOptimizationOptionNone checks throughout. Then the query constructors can just pick the implementation on startup and not check at runtime.

Member Author

This was all intended to be temporary until we end the experiment and remove the flag. I can change it to use a different query function as you suggest, but I'm not sure it's worth the effort for a temp flag

@@ -580,19 +702,73 @@ func (tqs QueryExecutor) ExecuteQuery(
limit = *queryOpts.Limit
}

toExecute := query.limit(limit)
if limit < math.MaxInt64 {
query = query.limit(limit)
Contributor

what's the impact of no longer always having a limit clause?

Member Author

As far as I can tell, nothing. It just means sending fewer bytes over the wire
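
A small sketch of the conditional limit discussed in this thread. `buildSelect` is a hypothetical helper that uses plain string building instead of the PR's query builder; it only shows that a limit of math.MaxInt64 is treated as "no limit", so the LIMIT clause is left out of the statement.

```go
package main

import (
	"fmt"
	"math"
)

// buildSelect is a hypothetical helper. It appends a LIMIT clause only
// when the caller asked for a real bound; math.MaxInt64 is the sentinel
// for "unbounded" and produces no clause at all.
func buildSelect(table string, limit uint64) string {
	stmt := "SELECT * FROM " + table
	if limit < math.MaxInt64 {
		stmt += fmt.Sprintf(" LIMIT %d", limit)
	}
	return stmt
}

func main() {
	fmt.Println(buildSelect("relation_tuple", 100))
	fmt.Println(buildSelect("relation_tuple", math.MaxInt64))
}
```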

// Set the column names to select.
columnNamesToSelect := make([]string, 0, 8+len(query.extraFields))

columnNamesToSelect = checkColumn(columnNamesToSelect, query.schema.ColumnOptimization, query.filteringColumnTracker, query.schema.ColNamespace)
Contributor

this logic is essentially replicating the logic in QueryRelationships - there's no way to unify those two things?

Contributor

ditto for spanner

Member Author

Not easily. Spanner is what makes it hard: it needs its own executor code (it can't use the one in relationships.go because it handles types for caveats and expiration differently), but by putting the column selection here, we do share at least that portion

Contributor

Can column selection be factored out and called from both places? this seems easy to get out of sync.

Member Author

Yes, we could push it down to the use sites, but it would be fairly similar.

Out of sync is fairly evident: every single read test immediately fails

@@ -417,7 +435,24 @@ type querier interface {
QueryContext(context.Context, string, ...interface{}) (*sql.Rows, error)
}

func newMySQLExecutor(tx querier) common.ExecuteQueryFunc {
type wrappedTX struct {
Contributor

why is it wrapped? can we have a better name here?

Member Author

It rewrites it to match the interface expected by the common executor. Any suggestions? wrappedForCommon?

Contributor

yeah anything that will indicate the reason for the wrapping, like withCommonTx or asCommonTx or commonTx
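
For readers unfamiliar with the wrapping being discussed, here is a rough sketch of the adapter idea. Every name in it (`driverRows`, `driverQuerier`, `commonQuerier`, `asCommonTx`, `fakeTx`, `runDemo`) is hypothetical; the real interfaces in the PR differ. It only shows a thin struct that holds the driver-specific transaction and exposes the method shape the shared executor expects.

```go
package main

import (
	"context"
	"fmt"
)

// driverRows is a hypothetical stand-in for a driver-specific result type.
type driverRows struct{ values []string }

// driverQuerier is the hypothetical shape offered by the driver's transaction.
type driverQuerier interface {
	QueryContext(ctx context.Context, query string, args ...any) (*driverRows, error)
}

// commonQuerier is the hypothetical shape the shared executor code expects.
type commonQuerier interface {
	QueryFunc(ctx context.Context, f func(rows *driverRows) error, query string, args ...any) error
}

// asCommonTx adapts a driverQuerier to commonQuerier. The name states why
// the wrapper exists, as suggested in the thread above.
type asCommonTx struct{ tx driverQuerier }

func (a asCommonTx) QueryFunc(ctx context.Context, f func(rows *driverRows) error, query string, args ...any) error {
	rows, err := a.tx.QueryContext(ctx, query, args...)
	if err != nil {
		return err
	}
	return f(rows)
}

// fakeTx is a test double that echoes the query back as its only row.
type fakeTx struct{}

func (fakeTx) QueryContext(_ context.Context, query string, _ ...any) (*driverRows, error) {
	return &driverRows{values: []string{query}}, nil
}

// runDemo sends one query through the adapter and returns what it saw.
func runDemo() ([]string, error) {
	var q commonQuerier = asCommonTx{tx: fakeTx{}}
	var seen []string
	err := q.QueryFunc(context.Background(), func(rows *driverRows) error {
		seen = rows.values
		return nil
	}, "SELECT 1")
	return seen, err
}

func main() {
	seen, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println(seen)
}
```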

}

toExecute.queryBuilder = toExecute.queryBuilder.From(from)
columnNamesToSelect = append(columnNamesToSelect, b.Schema.ColExpiration)
Contributor

should there be a way to skip expiration fields if we know the rel can't have expiration?

Member Author

I was going to do it in a follow-up PR, but I'm happy to do so here. Let me know which you prefer

@josephschorr josephschorr force-pushed the rel-struct-sql branch 10 times, most recently from 71d25bc to c6dc6d0 Compare December 19, 2024 20:01
columns that have static values

This vastly reduces data over the wire, as well as deserialization time and memory usage
This will allow us to centrally register additional datastore validation that only runs at test time
This validation test acts as a proxy in the memdb testing datastore and validates that the column elision code (which *isn't* used in memdb) matches the static fields to the values returned for all relationships loaded
This moves the behavior out of Spanner datastore and into a common lib where possible
…sabled or the relationship cannot be marked as expiring
Also adds a datastore test to ensure the constructed cursor operates as expected
Labels
area/api v1 Affects the v1 API area/CLI Affects the command line area/datastore Affects the storage system area/dispatch Affects dispatching of requests area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)
4 participants