Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: session-scoped temporary objects do not work with tenant clusters #67401

Closed
andy-kimball opened this issue Jul 9, 2021 · 3 comments · Fixed by #69486 or #70129
Closed

sql: session-scoped temporary objects do not work with tenant clusters #67401

andy-kimball opened this issue Jul 9, 2021 · 3 comments · Fixed by #69486 or #70129
Assignees
Labels
A-multitenancy Related to multi-tenancy C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Milestone

Comments

@andy-kimball
Copy link
Contributor

Repro:

export COCKROACH=./cockroach

# Create certificates.
export CERTSDIR=$HOME/.cockroach-certs
export COCKROACH_CA_KEY=$CERTSDIR/ca.key
$COCKROACH cert create-ca
$COCKROACH cert create-client root
$COCKROACH cert create-node 127.0.0.1 localhost
$COCKROACH mt cert create-tenant-client 123 --certs-dir=${HOME}/.cockroach-certs

# Start KV process.
$COCKROACH start-single-node --host 127.0.0.1

# Create tenant
$COCKROACH sql --host 127.0.0.1 -e 'select crdb_internal.create_tenant(123);'

# Start SQL tenant process.
COCKROACH_TRUST_CLIENT_PROVIDED_SQL_REMOTE_ADDR=true $COCKROACH mt start-sql --tenant-id 123 --kv-addrs 127.0.0.1:26257 --http-addr 127.0.0.1:8081 --sql-addr 127.0.0.1:36257

Notice that already there's a warning in the log, despite not having created any temporary objects:

W210709 00:19:08.472581 195 sql/temporary_schema.go:601  [-] 2  failed to clean temp objects: isMeta1Leaseholder is not available to secondary tenants

Now create a temporary table in the SQL CLI:

$COCKROACH sql --port 36257

SET experimental_enable_temp_tables = 'on';
CREATE TEMP TABLE t (x INT PRIMARY KEY, y INT);

While still in the CLI, kill the SQL tenant process with ctrl-C. Then start it up again. Back in the SQL CLI:

SHOW TABLES;

EXPECTED: The temp table should be gone, since it was tied to the previous session, which got terminated.
ACTUAL: The temp table is still showing:

root@:36257/defaultdb> show tables;
             schema_name            | table_name | type  | owner | estimated_row_count | locality
------------------------------------+------------+-------+-------+---------------------+-----------
  pg_temp_1625790426654356000_10001 | t          | table | root  |                   0 | NULL

Even if we decide not to fix this right now, we need to at least disable temporary object creation in tenant clusters and stop spamming the logs (logs are filled with the warning message even when temp objects are not ever used). That change needs to be back-ported to 21.1.x.

Also, here's @ajwerner's commentary from Slack:

Oh, that’s not good. Interesting that it’s new.
This needs some tracking. Seems like we didn’t think through temporary objects on tenants.
If a node crashes or is shut down with an open session that created temporary objects, nothing will ever clean them up.
There’s different levels of solutions. I think what I’d do is just get rid of the leasing component. It’s idemponent work. That being said, the code also makes RPCs to the other instances to find out which sessions are live which is also busted on tenants.
We’re working on the sessions listing stuff (I think). At that point, it’s just about deciding whether we care about the leasing. For that I think there are two simple solutions:
No leasing, just jitter a long loop
Use the fact that you have the lowest instance ID you know about like being the leaseholder of meta1.
All in all, it is non-trivial, but it is small.

@andy-kimball andy-kimball added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 9, 2021
@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Jul 9, 2021
@andy-kimball andy-kimball added this to the 21.1 milestone Jul 9, 2021
@fqazi fqazi self-assigned this Aug 25, 2021
@ajwerner
Copy link
Contributor

Alright, so this should proceed in two phases. Both of which should only affect secondary tenants, making it okay to be working on it at this point.

  1. Just have the tenant pod always try the cleanup. This will not work if there is more than one tenant but it's something we can backport and it should be tiny. Skip this check if you're not the system tenant and put a big comment saying that it's fine because this instance is the only instance as of writing. You'll probably also need to plumb around some other dependencies

// We only want to perform the cleanup if we are holding the meta1 lease.
// This ensures only one server can perform the job at a time.
isLeaseholder, err := c.isMeta1LeaseholderFunc(ctx, c.db.Clock().NowAsClockTimestamp())

  1. Make it work with multiple sql pods. We're going to need to wire up the RPCs to check for session liveness, which I think should be doable now.

Then we need to orchestrate which node is going to do the work. There's a few options here. One is to not coordinate but just run the loop rarely and with a lot of jitter. I think that that would be totally totally fine. I think it's my preferred approach. Any contention should sort itself out.

If we did want to coordinate we could use a singleton job. See

The second one went further and created a scheduled job to do the cleanup. That could be cool. I don't think it's important.

@ajwerner
Copy link
Contributor

So, in conclusion, the work item of 1. is to hook up the temporary cleaner for secondary tenants and make sure it works with an assumption that it's the only tenant. Then, for 2. we hook up the RPCs to ensure that we don't trash data that a live session is using. For hooking up those RPCs we're going to want to be rather careful that we have a complete view of the set of nodes. We may need to add some new uncached accessor for the set of instances. That set of instances is exposed by

// AddressResolver exposes API for retrieving the instance address and all live instances for a tenant.
type AddressResolver interface {
// GetInstance returns the InstanceInfo for a pod given the instance ID.
// Returns a NonExistentInstanceError if an instance with the given ID
// does not exist.
GetInstance(context.Context, base.SQLInstanceID) (InstanceInfo, error)
.

fqazi added a commit to fqazi/cockroach that referenced this issue Aug 27, 2021
Fixes: cockroachdb#67401

Previously, the temporary table clean up did not execute for
tenants. This was inadequate that temporary tables would last
longer then the life span of user sessions. To address this,
this patch adds support for cleaning up on a single tenant
pod. Specifically removing checks for meta1 lease when under
a tenant, and support for listing sessions.

Release justification: low risk and fixes a tenant related bug
Release note (bug fix): Temporary tables were not properly cleaned up for
tenants.
fqazi added a commit to fqazi/cockroach that referenced this issue Aug 30, 2021
Fixes: cockroachdb#67401

Previously, the temporary table clean up did not execute for
tenants. This was inadequate that temporary tables would last
longer then the life span of user sessions. To address this,
this patch adds support for cleaning up on a single tenant
pod. Specifically removing checks for meta1 lease when under
a tenant, and support for listing sessions.

Release justification: low risk and fixes a tenant related bug
Release note (bug fix): Temporary tables were not properly cleaned up for
tenants.
@RaduBerinde RaduBerinde added the A-multitenancy Related to multi-tenancy label Aug 31, 2021
craig bot pushed a commit that referenced this issue Sep 10, 2021
69486:     sql: support for temporary table clean up for tenants r=fqazi a=fqazi

Fixes: #67401
Previously, the temporary table clean-up did not execute for
tenants. This was inadequate that temporary tables would last
longer than the life span of user sessions. To address this,
this patch adds support for cleaning up on a single tenant
pod. Specifically removing checks for meta1 lease when under
a tenant, and support for listing sessions.

Release justification: low risk and fixes a tenant-related bug
Release note (bug fix): Temporary tables were not properly cleaned up for
tenants.

69789: sql: drop database cascade can fail resolving schemas r=fqazi a=fqazi

Fixes: #69713

Previously, drop database cascade would drop the schema first,
and then drop any objects under those schemas after. This was
inadequate because as we look up any objects under the schemas,
we may need to resolve types, which will may lead to a look up
on the schema. If the schema is dropped then we will fail while
resolving any types. To address this, this patch drops the objects
under the schema first followed by the database.

Release justification: Low risk bug fix for drop database cascade
Release note (bug fix): Drop database cascade can fail while resolving
a schema in a certain scenarios with the following error:
 "ERROR: error resolving referenced table ID <ID>: descriptor is
  being dropped"

69975: sql: fix interaction between stmt bundles and tracing r=yuzefovich a=yuzefovich

Previously, we wouldn't generate the bundle if the verbose tracing was
already enabled on the cluster because we wouldn't call
`instrumentationHelper.Finish` where we actually generate the bundle.
This would result in empty responses for `EXPLAIN ANALYZE (DEBUG)` as
well as the requests for stmt diagnostics being stuck in "waiting"
state.

Fixes: #69398.

Release note (bug fix): Previously, if the tracing
(`sql.trace.txn.enable_threshold` cluster setting) was enabled on the
cluster, the statement diagnostics collection (`EXPLAIN ANALYZE
(DEBUG)`) wouldn't work. This is now fixed.

Release justification: low-risk fix to a long-standing bug.

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@craig craig bot closed this as completed in 4e76a9f Sep 10, 2021
fqazi added a commit to fqazi/cockroach that referenced this issue Sep 13, 2021
Fixes: cockroachdb#67401

Previously, the temporary table clean up did not execute for
tenants. This was inadequate that temporary tables would last
longer then the life span of user sessions. To address this,
this patch adds support for cleaning up on a single tenant
pod. Specifically removing checks for meta1 lease when under
a tenant, and support for listing sessions.

Release justification: low risk and fixes a tenant related bug
Release note (bug fix): Temporary tables were not properly cleaned up for
tenants.
@ajwerner ajwerner added branch-release-21.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 13, 2021
@ajwerner
Copy link
Contributor

Reopening for backport.

@ajwerner ajwerner reopened this Sep 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-multitenancy Related to multi-tenancy C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
5 participants