
stats: break Gossip dependency #47925

Closed
tbg opened this issue Apr 22, 2020 · 19 comments · Fixed by #62563
Labels
A-multitenancy Related to multi-tenancy T-sql-queries SQL Queries Team X-anchored-telemetry The issue number is anchored by telemetry references.

Comments

@tbg (Member)

tbg commented Apr 22, 2020

Table stats uses gossip to notify the nodes about new statistics:

func NewTableStatisticsCache(

Multi-tenancy Phase 2 won't have gossip available. On the plus side, there will be only a single SQL server running on behalf of each tenant, so Gossip is not actually necessary: you just need to notify the running process.

Post Phase 2, when we consider multiple SQL servers for a given tenant, we will need a proper substitute for Gossip. For now, the work here is to use Gossip only when it's available, but to unconditionally notify the local server. This suggests that the statistics cache should not know about gossip but should instead operate on a suitable notifier abstraction.
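A minimal sketch of that notifier abstraction (all type and function names below are hypothetical, not actual CockroachDB APIs): the stats cache depends only on an interface, the local server is always notified directly, and gossip becomes one optional implementation that can be absent in a tenant SQL server.

```go
package main

import "fmt"

// TableStatsNotifier is a hypothetical abstraction: the statistics
// cache depends on this interface instead of on gossip directly.
type TableStatsNotifier interface {
	// NotifyStatsChanged signals that new statistics are available
	// for the given table.
	NotifyStatsChanged(tableID int64)
}

// localNotifier refreshes the cache on the local server; always used.
type localNotifier struct {
	refresh func(tableID int64)
}

func (n *localNotifier) NotifyStatsChanged(tableID int64) {
	n.refresh(tableID)
}

// compositeNotifier always notifies the local server and additionally
// broadcasts via gossip when gossip is available (it is nil in the
// single-SQL-server multi-tenant case).
type compositeNotifier struct {
	local  TableStatsNotifier
	gossip TableStatsNotifier // nil when gossip is unavailable
}

func (n *compositeNotifier) NotifyStatsChanged(tableID int64) {
	n.local.NotifyStatsChanged(tableID)
	if n.gossip != nil {
		n.gossip.NotifyStatsChanged(tableID)
	}
}

func main() {
	var refreshed []int64
	c := &compositeNotifier{
		local: &localNotifier{refresh: func(id int64) { refreshed = append(refreshed, id) }},
		// gossip left nil: the tenant SQL server case.
	}
	c.NotifyStatsChanged(53)
	fmt.Println(refreshed) // [53]
}
```

With this shape, the cache works identically whether gossip exists or not; only the wiring at server startup changes.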

Tentatively marking as a blocker since I think we'll want to avoid disabling stats, even for the small workloads we expect to support in the beginning.

tbg added the A-multitenancy (Related to multi-tenancy) and A-multitenancy-blocker labels Apr 22, 2020
tbg added a commit to tbg/cockroach that referenced this issue May 5, 2020
Deprecated() calls are blocking phase 2, but the ones
made optional here were not.

Touches cockroachdb#47900.
Touches cockroachdb#47925.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 7, 2020
@tbg (Member, Author)

tbg commented Jul 8, 2020

@RaduBerinde I feel like you told me before but can you write down what the consequences would be of not doing this. Would stats just arrive on the node with a delay?

@RaduBerinde (Member)

Currently they won't be updated at all without gossip (except on the local node where they are generated). There is no "polling" in place for table stats.

@RaduBerinde (Member)

For Phase 2, I believe things would just work without Gossip, as we also notify the local stats cache directly (this was to guarantee in tests that the next statement would use new stats).

@asubiotto (Contributor)

I'm trying to run the distsql_automatic_stats logic test in a multitenant configuration but it seems to fail to show any stats for the first SHOW STATISTICS statement (even after a 45s retry):

# Disable automatic stats
statement ok
SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false

statement ok
CREATE TABLE data (a INT, b INT, c FLOAT, d DECIMAL, PRIMARY KEY (a, b, c), INDEX d_idx (d))

# Enable automatic stats
statement ok
SET CLUSTER SETTING sql.stats.automatic_collection.enabled = true

# Generate all combinations of values 1 to 10.
statement ok
INSERT INTO data SELECT a, b, c::FLOAT, NULL FROM
   generate_series(1, 10) AS a(a),
   generate_series(1, 10) AS b(b),
   generate_series(1, 10) AS c(c)

query TTIII colnames,rowsort,retry
SELECT statistics_name, column_names, row_count, distinct_count, null_count
FROM [SHOW STATISTICS FOR TABLE data] ORDER BY column_names::STRING, created
----
statistics_name  column_names  row_count  distinct_count  null_count
__auto__         {a,b,c}       1000       1000            0
__auto__         {a,b}         1000       100             0
__auto__         {a}           1000       10              0
__auto__         {b}           1000       10              0
__auto__         {c}           1000       10              0
__auto__         {d}           1000       1               1000

I wondered whether it had anything to do with disabling auto stats on the backing KV cluster (since setting a cluster setting on a tenant only affects the in-memory setting), but enabling it there doesn't make a difference.

To reproduce, paste the above logic test into a __test file and run `make test PKG=./pkg/ccl/logictestccl TESTS=TestTenantLogic//distsql_automatic_stats`. @RaduBerinde do you mind taking a look?

@RaduBerinde (Member)

The logs show this error:

failed to create statistics on table 53: create-stats: unexpected concurrency for a flow that was forced to be planned locally

I guess the stats plan can't be fused into a single processor. @andreimatei I think you ran into this at some point and fixed something, do you remember?

@andreimatei (Contributor)

I think maybe this is what #52621 is trying to address.
Is there a good reason why these stats processors can't be fused?

@RaduBerinde (Member)

@asubiotto I don't understand why you expect this to work when there is a known issue that you have a PR out for. Or is the problem still happening with your PR?

@asubiotto (Contributor)

I guess the stats plan can't be fused into a single processor. @andreimatei I think you ran into this at some point and fixed something, do you remember?

Whoops, I did not see the logs. I just posted here because I saw no stats and the expectation was that this should work. #52621 should fix this; sorry for the noise.

@tbg (Member, Author)

tbg commented Sep 4, 2020

Should be fixed now then, right? Can we confirm, add the test, and close?

@asubiotto (Contributor)

distsql_automatic_stats runs with 3node-tenant.

knz reopened this Sep 11, 2020
@knz (Contributor)

knz commented Sep 11, 2020

The INJECT STATS statement still uses gossip in a way that refers to this issue.

knz added the X-anchored-telemetry (The issue number is anchored by telemetry references.) label Sep 11, 2020
@knz (Contributor)

knz commented Sep 11, 2020

CREATE STATS too

@rytaft (Collaborator)

rytaft commented Sep 14, 2020

So just to confirm, the remaining work here is to remove all references to Gossip in stats for post-Phase 2, in which tenants may have multiple SQL servers? What is the timeline for this?

Is the suggested solution to use polling? Are there other DB features that have moved from Gossip to polling? Or have they used something else?

craig bot pushed a commit that referenced this issue Sep 14, 2020
54256: sql: make multi-tenancy errors report to telemetry r=asubiotto a=knz

Fixes #48164. 

Issues referenced, I also added the X-anchored-telemetry label to them on github (we keep those issues open on github until their reference in the code is removed):

#54254 
#54255 
#54250 
#54251 
#54252 
#49854 
#48274
#47150
#47971
#47970
#47900 
#47925 


Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@knz (Contributor)

knz commented Sep 14, 2020

There is a new subsystem alongside gossip called "gossip subscriptions":
the SQL server can register for certain gossip updates, and the KV server will push updates over the subscription stream.
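The subscription pattern knz describes can be sketched roughly as follows (all names here are hypothetical, not the actual kvtenant API): the SQL server registers interest in a key prefix, and the KV side pushes matching updates over a per-subscriber stream.

```go
package main

import "fmt"

// GossipUpdate is one pushed key/value pair; hypothetical type.
type GossipUpdate struct {
	Key   string
	Value []byte
}

// subscriptionServer stands in for the KV side: it fans out updates
// to subscribers that registered for a given key prefix.
type subscriptionServer struct {
	subs map[string][]chan GossipUpdate
}

func newSubscriptionServer() *subscriptionServer {
	return &subscriptionServer{subs: map[string][]chan GossipUpdate{}}
}

// Subscribe registers the caller (e.g. a tenant SQL server) for
// updates on keys with the given prefix and returns its stream.
func (s *subscriptionServer) Subscribe(prefix string) <-chan GossipUpdate {
	ch := make(chan GossipUpdate, 16)
	s.subs[prefix] = append(s.subs[prefix], ch)
	return ch
}

// Publish pushes an update to every subscription whose prefix matches.
func (s *subscriptionServer) Publish(u GossipUpdate) {
	for prefix, chans := range s.subs {
		if len(u.Key) >= len(prefix) && u.Key[:len(prefix)] == prefix {
			for _, ch := range chans {
				ch <- u
			}
		}
	}
}

func main() {
	srv := newSubscriptionServer()
	updates := srv.Subscribe("table-stats/")
	srv.Publish(GossipUpdate{Key: "table-stats/53"})
	u := <-updates
	fmt.Println(u.Key) // table-stats/53
}
```

The key difference from plain gossip is directionality: the tenant process never participates in the gossip network itself, it only consumes a filtered push stream from KV.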

@rytaft (Collaborator)

rytaft commented Sep 14, 2020

Got it - thanks!

@RaduBerinde (Member)

It looks like the best option is to set up a RangeFeed on the entire table_statistics table and look at the modified keys to see which tables need a refresh. There is a similar use case in sql/catalog/lease/lease.go.
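A rough sketch of that approach (hypothetical names; not the actual rangefeed API, and in the real system the table ID would be decoded from the row's key): each event for a modified row of system.table_statistics invalidates the cache entry for the corresponding table, so the next read refreshes it.

```go
package main

import "fmt"

// statsRowEvent models a rangefeed event for one modified row of
// system.table_statistics, with the table ID already decoded from
// the row's key. All names here are hypothetical.
type statsRowEvent struct {
	TableID int64
}

// statsCache caches statistics per table and invalidates entries
// when the rangefeed reports a modified row for that table.
type statsCache struct {
	valid map[int64]bool
}

func (c *statsCache) onRangefeedEvent(ev statsRowEvent) {
	// Marking the entry stale forces the next reader to re-read
	// system.table_statistics for this table.
	delete(c.valid, ev.TableID)
}

func main() {
	c := &statsCache{valid: map[int64]bool{53: true, 54: true}}
	c.onRangefeedEvent(statsRowEvent{TableID: 53})
	fmt.Println(c.valid[53], c.valid[54]) // false true
}
```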

@RaduBerinde (Member)

Note: @ajwerner is working on a library for setting up rangefeeds, which we should use once it's ready (#58361).

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Mar 25, 2021
We currently use gossip to keep table stats caches updated when new
statistics become available. This mechanism is not available in
multi-tenant configurations (currently we only support single SQL
nodes in that configuration and we refresh the cache on the same node
out of band).

This commit changes this mechanism to use a range feed that listens
for updates to system.table_statistics. This is a cleaner mechanism as
it doesn't require any out-of-band coordination.

The range feed events are on a row-by-row basis; a statistics
collection typically inserts multiple rows. We handle this by
remembering the timestamp of the table_statistics row that triggered
the last update. In addition, we remember information about the last
event to avoid unnecessary overhead.

Note that we continue to send gossip updates when new statistics are
available, to keep things working for mixed-version clusters during
upgrade.

Fixes cockroachdb#47925.

Release note: None
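The timestamp-based coalescing described in the commit message above can be sketched like this (hypothetical names; a real implementation would use hlc.Timestamp and the actual rangefeed events): a single stats collection inserts several table_statistics rows that share a write timestamp, and remembering the timestamp that triggered the last refresh per table turns that burst of per-row events into one refresh.

```go
package main

import "fmt"

// rowEvent is a rangefeed event for one row of table_statistics; a
// single stats collection inserts several rows sharing one write
// timestamp. All names here are hypothetical.
type rowEvent struct {
	TableID   int64
	Timestamp int64
}

// refresher coalesces per-row events: it remembers the timestamp that
// triggered the last refresh of each table and skips events at or
// below it.
type refresher struct {
	lastRefreshed map[int64]int64 // tableID -> last triggering timestamp
	refreshes     int
}

func (r *refresher) onEvent(ev rowEvent) {
	if ev.Timestamp <= r.lastRefreshed[ev.TableID] {
		return // already refreshed for this (or a newer) collection
	}
	r.lastRefreshed[ev.TableID] = ev.Timestamp
	r.refreshes++ // stands in for kicking off a cache refresh
}

func main() {
	r := &refresher{lastRefreshed: map[int64]int64{}}
	// Six rows from one stats collection at ts=100: one refresh.
	for i := 0; i < 6; i++ {
		r.onEvent(rowEvent{TableID: 53, Timestamp: 100})
	}
	fmt.Println(r.refreshes) // 1
}
```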
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Mar 25, 2021
craig bot pushed a commit that referenced this issue Mar 28, 2021
62563: sql: use a range feed for refreshing the table stats cache r=RaduBerinde a=RaduBerinde

Fixes #47925.

Release note: None

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
craig bot closed this as completed in 8905b9b Mar 28, 2021
@RaduBerinde (Member)

Keeping open until we have something backported.

RaduBerinde reopened this Mar 29, 2021
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Apr 29, 2021
This is a modified backport of cockroachdb#62563 (description below). The
difference is that we keep both mechanisms and only use the range
feed when gossip is not available (i.e. multi-tenant). This avoids
potential regressions caused by the new mechanism.

---

We currently use gossip to keep table stats caches updated when new
statistics become available. This mechanism is not available in
multi-tenant configurations (currently we only support single SQL
nodes in that configuration and we refresh the cache on the same node
out of band).

This commit changes this mechanism to use a range feed that listens
for updates to system.table_statistics. This is a cleaner mechanism as
it doesn't require any out-of-band coordination.

The range feed events are on a row-by-row basis; a statistics
collection typically inserts multiple rows. We handle this by
remembering the timestamp of the table_statistics row that triggered
the last update. In addition, we remember information about the last
event to avoid unnecessary overhead.

Note that we continue to send gossip updates when new statistics are
available, to keep things working for mixed-version clusters during
upgrade.

Fixes cockroachdb#47925.

Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue May 13, 2021
jlinder added the T-sql-queries (SQL Queries Team) label Jun 16, 2021
@ajwerner (Contributor)

The backport merged in #64360.
