Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Separated Statistics [2/7ish] #5889

Merged
merged 30 commits into from
Aug 28, 2019
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
d7675e7
Add schema for Separated Statistics
reivilibre Aug 20, 2019
80a1c6e
Add storage function for storing stats deltas
reivilibre Aug 20, 2019
1819563
Ack, isort!
reivilibre Aug 20, 2019
b5573c0
Update synapse/storage/stats.py
reivilibre Aug 20, 2019
4a97eef
Update synapse/storage/stats.py
reivilibre Aug 20, 2019
6a19f7e
Add room and user statistics documentation.
reivilibre Aug 20, 2019
981c6cf
Sanitise accepted fields in `_update_stats_delta_txn`
reivilibre Aug 20, 2019
977310e
Clarify `_update_stats_delta_txn`
reivilibre Aug 20, 2019
eafa8d3
Unify name of 'stats regenerator' in schema comments.
reivilibre Aug 20, 2019
18a4c03
Remove needless defaults.
reivilibre Aug 20, 2019
7b657f1
Simplify table structure
reivilibre Aug 22, 2019
e8fc180
Fix up SQL schema delta
reivilibre Aug 22, 2019
79252d1
Fix up historical stats support.
reivilibre Aug 22, 2019
c3d2bf2
Allow schema deltas to be engine-specific
reivilibre Aug 27, 2019
1ecd1a6
Use engine-specific delta SQL files rather than delta written in Python.
reivilibre Aug 27, 2019
5043ef8
Merge branch 'rei/rss_target' into rei/rss_inc2
reivilibre Aug 27, 2019
4b7bf2e
Apply suggestions from code review
reivilibre Aug 27, 2019
81c5289
Clarify `_update_stats_delta_txn` by adding code comments and kwargs.
reivilibre Aug 27, 2019
544ba2c
Apply minor suggestions from review
reivilibre Aug 27, 2019
a6c1020
Lock tables in upsert fall-backs.
reivilibre Aug 27, 2019
736ac58
Code formatting (Black)
reivilibre Aug 27, 2019
09cbc3a
Switch to milliseconds in room/user stats for consistency.
reivilibre Aug 27, 2019
c775f31
Don't include the room & user stats docs in this PR.
reivilibre Aug 27, 2019
491eaf0
Remove obsolete `OldCollectionRequired` as old collection is obsolete.
reivilibre Aug 27, 2019
11c4e50
Rename `room_state` table to `room_stats_state`
reivilibre Aug 27, 2019
62b1250
Update `_purge_room_txn` to take account of separated stats tables
reivilibre Aug 27, 2019
324f21b
Fix logic error.
reivilibre Aug 27, 2019
1af7866
Clean up code with improved naming and hoist around functions.
reivilibre Aug 27, 2019
b9f1adc
Update synapse/storage/stats.py
reivilibre Aug 28, 2019
a344ad3
Code formatting (Black)
reivilibre Aug 28, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
146 changes: 146 additions & 0 deletions docs/room_and_user_statistics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,146 @@
Room and User Statistics
========================

Synapse maintains room and user statistics (as well as a cache of room state),
in various tables.

These can be used for administrative purposes but are also used when generating
the public room directory. If these tables get stale or out of sync (possibly
after database corruption), you may wish to regenerate them.


# Synapse Administrator Documentation

## Various SQL scripts that you may find useful

### Delete stats, including historical stats

```sql
DELETE FROM room_stats_current;
DELETE FROM room_stats_historical;
DELETE FROM user_stats_current;
DELETE FROM user_stats_historical;
```

### Regenerate stats (all subjects)

```sql
BEGIN;
DELETE FROM stats_incremental_position;
INSERT INTO stats_incremental_position (
state_delta_stream_id,
total_events_min_stream_ordering,
total_events_max_stream_ordering,
is_background_contract
) VALUES (NULL, NULL, NULL, FALSE), (NULL, NULL, NULL, TRUE);
COMMIT;

DELETE FROM room_stats_current;
DELETE FROM user_stats_current;
```

then follow the steps below for **'Regenerate stats (missing subjects only)'**

### Regenerate stats (missing subjects only)

```sql
-- Set up staging tables
-- we depend on current_state_events_membership because this is used
-- in our counting.
INSERT INTO background_updates (update_name, progress_json) VALUES
('populate_stats_prepare', '{}', 'current_state_events_membership');

-- Run through each room and update stats
INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES
('populate_stats_process_rooms', '{}', 'populate_stats_prepare');

-- Run through each user and update stats.
INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES
('populate_stats_process_users', '{}', 'populate_stats_process_rooms');

-- Clean up staging tables
INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES
('populate_stats_cleanup', '{}', 'populate_stats_process_users');
```

then **restart Synapse**.


# Synapse Developer Documentation

## High-Level Concepts

### Definitions

* **subject**: Something we are tracking stats about – currently a room or user.
* **current row**: An entry for a subject in the appropriate current statistics
table. Each subject can have only one.
* **historical row**: An entry for a subject in the appropriate historical
statistics table. Each subject can have any number of these.

### Overview

Stats are maintained as time series. There are two kinds of column:

* absolute columns – where the value is correct for the time given by `end_ts`
in the stats row. (Imagine a line graph for these values)
* per-slice columns – where the value corresponds to how many of the occurrences
occurred within the time slice given by `(end_ts − bucket_size)…end_ts`
or `start_ts…end_ts`. (Imagine a histogram for these values)

Currently, only absolute columns are in use.

Stats are maintained in two tables (for each type): current and historical.

Current stats correspond to the present values. Each subject can only have one
entry.

Historical stats correspond to values in the past. Subjects may have multiple
entries.

## Concepts around the management of stats

### current rows

#### dirty current rows

Current rows can be **dirty**, which means that they have changed since the
latest historical row for the same subject.
**Dirty** current rows possess an end timestamp, `end_ts`.

#### old current rows and old collection

When a (necessarily dirty) current row has an `end_ts` in the past, it is said
to be **old**.
Old current rows must be copied into a historical row, and cleared of their dirty
status, before further statistics can be tracked for that subject.
The process which does this is referred to as **old collection**.

#### incomplete current rows

There are also **incomplete** current rows, which are current rows that do not
contain a full count yet – this is because they are waiting for the stats
regenerator to give them an initial count. Incomplete current rows DO NOT contain
correct and up-to-date values. As such, *incomplete rows are not old-collected*.
Instead, old incomplete rows will be extended so they are no longer old.

### historical rows

Historical rows can always be considered to be valid for the time slice and
end time specified. (This, of course, assumes a lack of defects in the code
to track the statistics, and assumes integrity of the database).

Even still, there are two considerations that we may need to bear in mind:

* historical rows will not exist for every time slice – they will be omitted
if there were no changes. In this case, the following assumptions can be
made to interpolate/recreate missing rows:
- absolute fields have the same values as in the preceding row
- per-slice fields are zero (`0`)
* historical rows will not be retained forever – rows older than a configurable
time will be purged.

#### purge

The purging of historical rows is not yet implemented.

115 changes: 115 additions & 0 deletions synapse/storage/schema/delta/56/stats_separated1.sql
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,118 @@ DELETE FROM background_updates WHERE update_name IN (
'populate_stats_process_rooms',
'populate_stats_cleanup'
);

----- Create tables for our version of room stats.

-- single-row table to track position of incremental updates
reivilibre marked this conversation as resolved.
Show resolved Hide resolved
CREATE TABLE IF NOT EXISTS stats_incremental_position (
-- the stream_id of the last-processed state delta
state_delta_stream_id BIGINT,

-- the stream_ordering of the last-processed backfilled event
-- (this is negative)
total_events_min_stream_ordering BIGINT,

-- the stream_ordering of the last-processed normally-created event
-- (this is positive)
total_events_max_stream_ordering BIGINT,

-- If true, this represents the contract agreed upon by the stats
-- regenerator.
-- If false, this is suitable for use by the delta/incremental processor.
is_background_contract BOOLEAN NOT NULL PRIMARY KEY
);

-- insert a null row and make sure it is the only one.
DELETE FROM stats_incremental_position;
INSERT INTO stats_incremental_position (
state_delta_stream_id,
total_events_min_stream_ordering,
total_events_max_stream_ordering,
is_background_contract
) VALUES (NULL, NULL, NULL, (0 = 1)), (NULL, NULL, NULL, (1 = 1));

-- represents PRESENT room statistics for a room
CREATE TABLE IF NOT EXISTS room_stats_current (
room_id TEXT NOT NULL PRIMARY KEY,

-- These starts cover the time from start_ts...end_ts (in seconds).
-- Note that end_ts is quantised, and start_ts usually so.
reivilibre marked this conversation as resolved.
Show resolved Hide resolved
start_ts BIGINT,
end_ts BIGINT,

current_state_events INT NOT NULL,
total_events INT NOT NULL,
joined_members INT NOT NULL,
invited_members INT NOT NULL,
left_members INT NOT NULL,
banned_members INT NOT NULL,

-- If initial stats regen is still to be performed: NULL
-- If initial stats regen has been performed: the maximum delta stream
-- position that this row takes into account.
completed_delta_stream_id BIGINT,

CONSTRAINT timestamp_nullity_equality CHECK ((start_ts IS NULL) = (end_ts IS NULL))
);


-- represents HISTORICAL room statistics for a room
CREATE TABLE IF NOT EXISTS room_stats_historical (
room_id TEXT NOT NULL,
-- These stats cover the time from (end_ts - bucket_size)...end_ts (in seconds).
-- Note that end_ts is quantised, and start_ts usually so.
end_ts BIGINT NOT NULL,
bucket_size INT NOT NULL,

current_state_events INT NOT NULL,
total_events INT NOT NULL,
joined_members INT NOT NULL,
invited_members INT NOT NULL,
left_members INT NOT NULL,
banned_members INT NOT NULL,

PRIMARY KEY (room_id, end_ts)
);

-- We use this index to speed up deletion of ancient room stats.
CREATE INDEX IF NOT EXISTS room_stats_historical_end_ts ON room_stats_historical (end_ts);

-- We don't need an index on (room_id, end_ts) because PRIMARY KEY sorts that
-- out for us. (We would want it to review stats for a particular room.)


-- represents PRESENT statistics for a user
CREATE TABLE IF NOT EXISTS user_stats_current (
user_id TEXT NOT NULL PRIMARY KEY,

-- The timestamp that represents the start of the
start_ts BIGINT,
end_ts BIGINT,

public_rooms INT NOT NULL,
private_rooms INT NOT NULL,

-- If initial stats regen is still to be performed: NULL
-- If initial stats regen has been performed: the maximum delta stream
-- position that this row takes into account.
completed_delta_stream_id BIGINT
);

-- represents HISTORICAL statistics for a user
CREATE TABLE IF NOT EXISTS user_stats_historical (
user_id TEXT NOT NULL,
end_ts BIGINT NOT NULL,
bucket_size INT NOT NULL,

public_rooms INT NOT NULL,
private_rooms INT NOT NULL,

PRIMARY KEY (user_id, end_ts)
);

-- We use this index to speed up deletion of ancient user stats.
CREATE INDEX IF NOT EXISTS user_stats_historical_end_ts ON user_stats_historical (end_ts);

-- We don't need an index on (user_id, end_ts) because PRIMARY KEY sorts that
-- out for us. (We would want it to review stats for a particular user.)
87 changes: 87 additions & 0 deletions synapse/storage/schema/delta/56/stats_separated2.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# -*- coding: utf-8 -*-
# Copyright 2019 The Matrix.org Foundation C.I.C.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This schema delta will be run after 'stats_separated1.sql' due to lexicographic
# ordering. Note that it MUST be so.
from synapse.storage.engines import PostgresEngine, Sqlite3Engine


def _run_create_generic(stats_type, cursor, database_engine):
"""
Creates the pertinent (partial, if supported) indices for one kind of stats.
Args:
stats_type: "room" or "user" - the type of stats
cursor: Database Cursor
database_engine: Database Engine
"""
if isinstance(database_engine, Sqlite3Engine):
# even though SQLite >= 3.8 can support partial indices, we won't enable
# them, in case the SQLite database may be later used on another system.
# It's also the case that SQLite is only likely to be used in small
# deployments or testing, where the optimisations gained by use of a
# partial index are not a big concern.
cursor.execute(
"""
CREATE INDEX IF NOT EXISTS %s_stats_current_dirty
ON %s_stats_current (end_ts);
"""
% (stats_type, stats_type)
)
cursor.execute(
"""
CREATE INDEX IF NOT EXISTS %s_stats_not_complete
ON %s_stats_current (completed_delta_stream_id, %s_id);
"""
% (stats_type, stats_type, stats_type)
)
elif isinstance(database_engine, PostgresEngine):
# This partial index helps us with finding dirty stats rows
cursor.execute(
"""
CREATE INDEX IF NOT EXISTS %s_stats_current_dirty
ON %s_stats_current (end_ts)
WHERE end_ts IS NOT NULL;
"""
% (stats_type, stats_type)
)
# This partial index helps us with old collection
cursor.execute(
"""
CREATE INDEX IF NOT EXISTS %s_stats_not_complete
ON %s_stats_current (%s_id)
WHERE completed_delta_stream_id IS NULL;
"""
% (stats_type, stats_type, stats_type)
)
else:
raise NotImplementedError("Unknown database engine.")


def run_create(cursor, database_engine):
"""
This function is called as part of the schema delta.
It will create indices - partial, if supported - for the new 'separated'
room & user statistics.
"""
_run_create_generic("room", cursor, database_engine)
_run_create_generic("user", cursor, database_engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be a lot clearer to just have two schema files *.sql.postges and *.sql.sqlite and list the create index clauses. This code generation is a lot longer than the eight lines of sql it generates :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be – but afaict there isn't a mechanism to do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh booo, its only implemented for full schemas and not deltas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikjohnston With #5911, this should be resolved, I hope?



def run_upgrade(cur, database_engine, config):
"""
This function is run on a database upgrade (of a non-empty database).
We have no need to do anything specific here.
"""
pass
Loading