Cassandra performance improvements #94

Closed
michaelsembwever opened this issue May 8, 2017 · 9 comments

michaelsembwever (Member) commented May 8, 2017

Relevant to #85

Observations:

  • frequent full table scans on the cluster table,
  • UI pages hit REST endpoints like /repair_run/state, causing repeated selects every 10 seconds on repair_run, repair_unit, and repair_run_by_cluster,
  • too fine a granularity of tables: little or no utilisation of clustering keys and common partition keys,
  • manual denormalisation rather than secondary indexes,
  • a large volume of CQL requests (selects and inserts) during the creation of a repair run (before activation).

The REST endpoint /repair_run/state repeats the following every 10 seconds…

SELECT * FROM cluster;
 1-> SELECT * FROM repair_run_by_cluster WHERE cluster_name = ?;
   N-> SELECT * FROM repair_run WHERE id = ?; 
   N-> SELECT * FROM repair_unit WHERE id = ?;
     1-> SELECT * FROM repair_segment_by_run_id WHERE run_id = ?;
     1-> SELECT * FROM repair_segment WHERE id = ?;

During the creation of a repair run (non-incremental, with 1467 segments), ~14k requests were logged.
A distribution of these requests is as follows:

   8235  SELECT * FROM repair_segment WHERE id = ?;
   1467  UPDATE repair_id SET id=N WHERE id_type = 'repair_segment' IF id = N;
   1467  SELECT id FROM repair_id WHERE id_type = 'repair_segment';
   1467  INSERT INTO repair_segment(id, repair_unit_id, run_id, start_token, end_token, state, coordinator_host, start_time, end_time, fail_count) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?);
   1467  INSERT INTO repair_id (id_type, id) VALUES('repair_segment', N) IF NOT EXISTS;
     14  SELECT * FROM cluster;
     11  BEGIN BATCH INSERT INTO repair_segment_by_run_id(run_id, segment_id) VALUES(?, ?); INSERT INTO repair_segment_by_run_id(run_id, segment_id) VALUES(?, ?); INSERT INTO repair_segment_by_run_id(run_id, segment_id) VALUES(?, ?); INSERT INTO repair_segment_by_run_id(run_id, segment_id) VALUES(?, ?); INSERT INTO repair_segment_by_run_id(run_id, segment_id) VALUES(?, ?); INSERT INTO repair_segment_by_run_id(run_id, segment_id) VALUES(?, ?); INSERT INTO repair_segment_by_run_id(run_id, segment_id) VALUE... [truncated output]
      8  SELECT * FROM repair_run_by_cluster WHERE cluster_name = ?;
      7  SELECT * FROM repair_unit WHERE id = ?;
      7  SELECT * FROM repair_segment_by_run_id WHERE run_id = ?;
      7  SELECT * FROM repair_run WHERE id = ?;
      4  BEGIN BATCH APPLY BATCH;
      2  SELECT * FROM cluster WHERE name = ?;
      1  UPDATE repair_id SET id=N WHERE id_type = 'repair_unit' IF id = N;
      1  UPDATE repair_id SET id=N WHERE id_type = 'repair_run' IF id = N;
      1  SELECT id FROM repair_id WHERE id_type = 'repair_unit';
      1  SELECT id FROM repair_id WHERE id_type = 'repair_run';
      1  SELECT * FROM repair_unit;
      1  SELECT * FROM repair_schedule;
      1  INSERT INTO repair_unit(id, cluster_name, keyspace_name, column_families, incremental_repair) VALUES(?, ?, ?, ?, ?);
      1  INSERT INTO repair_run(id, cluster_name, repair_unit_id, cause, owner, state, creation_time, start_time, end_time, pause_time, intensity, last_event, segment_count, repair_parallelism) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);
      1  INSERT INTO repair_id (id_type, id) VALUES('repair_unit', N) IF NOT EXISTS;
      1  INSERT INTO repair_id (id_type, id) VALUES('repair_run', N) IF NOT EXISTS;
      1  BEGIN BATCH INSERT INTO repair_run(id, cluster_name, repair_unit_id, cause, owner, state, creation_time, start_time, end_time, pause_time, intensity, last_event, segment_count, repair_parallelism) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?); INSERT INTO repair_run_by_cluster(cluster_name, id) values(?, ?); INSERT INTO repair_run_by_unit(repair_unit_id, id) values(?, ?); APPLY BATCH;

What stands out above is …

  • 1.4k UPDATE repair_id SET id=N WHERE id_type = 'repair_segment' IF id = N;, and
  • 1.4k INSERT INTO repair_id (id_type, id) VALUES('repair_segment', N) IF NOT EXISTS;.

These appear to be identical statements, the latter as a LWT.
Neither of them is used as a prepared statement in CassandraStorage. (The 8.2k select requests against the same table are not prepared statements either.)

Otherwise, the multiple inserts and reads on repair_segment could be collapsed by making the segment id a clustering key within the repair_run table. Furthermore, a small row cache (since reads against the one repair_id table are particularly hot) would have a significant impact.


Possibilities:

  • row cache on cluster and repair_run_by_cluster (kinda expecting page cache to be doing enough already…)
  • small row cache on repair_run
  • LCS on repair_run (…actually on all the tables); a sketch of these table options follows below
  • prepared statements against repair_id
  • cache a floor repairId value in CassandraStorage.getNewRepairId(..)
  • configurable UI refresh rate/limit
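
For the caching and compaction items above, a minimal sketch of the table options (the rows_per_partition values are illustrative placeholders, not tuned recommendations):

-- enable a small row cache and switch to LCS
-- (row caching also requires row_cache_size_in_mb > 0 in cassandra.yaml)
ALTER TABLE repair_run
  WITH caching = {'keys': 'ALL', 'rows_per_partition': '10'}
  AND compaction = {'class': 'LeveledCompactionStrategy'};

ALTER TABLE cluster
  WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
  AND compaction = {'class': 'LeveledCompactionStrategy'};

ALTER TABLE repair_run_by_cluster
  WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
  AND compaction = {'class': 'LeveledCompactionStrategy'};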

Maybe there's a possibility of collapsing queries by collapsing tables, making repair_segment a clustering key within repair_run.

In fact this could collapse three tables (repair_segment, repair_run, and repair_segment_by_run_id) into one table like:

CREATE TABLE IF NOT EXISTS repair_run (
  id                  bigint,
  cluster_name        text static,
  repair_unit_id      bigint static,
  cause               text static,
  owner               text static,
  state               text static,
  creation_time       timestamp static,
  start_time          timestamp static,
  end_time            timestamp static,
  pause_time          timestamp static,
  intensity           double static,
  last_event          text static,
  segment_count       int static,
  repair_parallelism  text static,
  segment_id          bigint,
  start_token         varint,
  end_token           varint,
  segment_state       int,
  coordinator_host    text,
  segment_start_time  timestamp,
  segment_end_time    timestamp,
  fail_count          int,
  PRIMARY KEY (id, segment_id)
);

This would then reduce the regular requests to:

SELECT * FROM cluster;
 1-> SELECT * FROM repair_run_by_cluster WHERE cluster_name = ?;
   N-> SELECT * FROM repair_run WHERE id = ?; 
   N-> SELECT * FROM repair_unit WHERE id = ?;
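
With the collapsed table, the run-level (static) columns and the per-segment rows live in the same partition, so both views become single-partition reads. A minimal sketch, assuming the column names in the proposal above:

-- run-level view: static columns come back with any row of the partition
SELECT cluster_name, repair_unit_id, state, creation_time, start_time,
       end_time, intensity, segment_count, repair_parallelism
FROM repair_run WHERE id = ? LIMIT 1;

-- segment view: all segments of a run, still one partition read
SELECT segment_id, start_token, end_token, segment_state,
       coordinator_host, segment_start_time, segment_end_time, fail_count
FROM repair_run WHERE id = ?;
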
michaelsembwever added a commit that referenced this issue May 9, 2017
… cache a floor for the sequence number to reduce lookups.

ref: #94
michaelsembwever added a commit that referenced this issue May 9, 2017
… cache a floor for the sequence number to reduce lookups.

ref: #94
michaelsembwever added a commit that referenced this issue May 9, 2017
… cache a floor for the sequence number to reduce lookups.

ref:
 - #94
 - #95
michaelsembwever (Member Author)

prepared statements against repair_id
cache a floor repairId value in CassandraStorage.getNewRepairId(..)

The pull request in #95 addresses these two possibilities.

Tested against a non-incremental repair with 1467 segments, it reduced the number of CQL requests (before activation of the repair) by over 8k.

@michaelsembwever michaelsembwever changed the title WIP – Cassandra performance improvements Cassandra performance improvements May 9, 2017
michaelsembwever (Member Author)

  • row cache on cluster and repair_run_by_cluster (kinda expecting page cache to be doing enough already…)
  • small row cache on repair_run
  • LCS on repair_run (…actually on all the tables)

The pull request in #96 addresses these three possibilities.

michaelsembwever added a commit that referenced this issue May 9, 2017
adejanovski pushed a commit that referenced this issue May 9, 2017
… cache a floor for the sequence number to reduce lookups. (#95)

ref:
 - #94
 - #95
adejanovski (Contributor) commented May 9, 2017

The collapsed repair_run table looks interesting, and we would have to test partition sizes in the case of big clusters using 256 vnodes.

Just like you did for the generation of ids, we could hold a local cache of repair segments, removing finished ones and keeping the rest. Before processing a segment we could query the database to check that it's up to date and move on accordingly. This would bring it down to a single query against a single partition most of the time, with an overhead only on init.
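
For illustration, that per-segment check could be a single-row read within a single partition (a sketch assuming the collapsed repair_run schema proposed above):

-- check one segment's current state before processing it
SELECT segment_state, coordinator_host, fail_count
FROM repair_run WHERE id = ? AND segment_id = ?;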

How does that sound @michaelsembwever ?

@michaelsembwever michaelsembwever self-assigned this May 9, 2017
michaelsembwever (Member Author) commented May 9, 2017

Just like you did for the generation of ids, we could hold a local cache of repair segments, removing finished ones and keeping the rest.

Adding more "local caches" would be a band-aid over the design of the persistence layer, which has too small a granularity.

I'd rather see IStorage re-assessed.

Using the constraint that one IStorage object works against an owned shard, this could simplify a number of things:

  • the fault-tolerant design would be a lot simpler (leader election minimised due to separated shards of data),
  • writes to the database would be minimised (state held in IStorage, no sequence number in the db, etc).

This would lose some durability of the data if the Reaper process died, but I think this is irrelevant: Reaper can either restore the transient state, or, in situations like an un-activated repair run, not care about durability.

michaelsembwever (Member Author)

The collapsed repair_run table looks interesting, and we would have to test partition sizes in the case of big clusters using 256 vnodes.

What's the largest number of segments we've seen?
So long as we can stay under a million segments, I think we'd be in a better place.
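
As a rough, illustrative estimate only: the tablehistograms output later in this thread shows repair_segment partitions of roughly 372 bytes each, so a million segments collapsed into a single repair_run partition would be on the order of 1,000,000 × ~372 bytes ≈ 350–400 MB, well beyond the commonly cited guideline of keeping partitions around 100 MB or less.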

michaelsembwever (Member Author) commented May 11, 2017

prepared statements against repair_id
cache a floor repairId value in CassandraStorage.getNewRepairId(..)

The pull request in #95 addresses these two possibilities.

Tested against a non-incremental repair with 1467 segments, it reduced the number of CQL requests (before activation of the repair) by over 8k.

This has been further addressed (and superseded) by #99

The sequence number was a PostgreSQL (SQL) optimisation that imposed itself upon the code's design.
But there is no constraint in the codebase that IDs must be sequential.

By replacing all IDs in the codebase with UUIDs, we

  • isolate sequence generation to PostgreSQL as an internal implementation detail, and
  • permit C* to generate and use time-based UUIDs.

This reduces those 14k CQL requests to an even lower number, as the repair_id table no longer exists.
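
For illustration only, a minimal sketch of time-based UUID keys at the CQL level, using a hypothetical repair_run_sketch table (the real migrated schema is not shown here, and the ids can equally be generated client-side by the driver):

-- hypothetical sketch: timeuuid keys, generated server-side with now()
CREATE TABLE IF NOT EXISTS repair_run_sketch (
  id          timeuuid,
  segment_id  timeuuid,
  state       text static,
  fail_count  int,
  PRIMARY KEY (id, segment_id)
);

INSERT INTO repair_run_sketch (id, segment_id, state, fail_count)
VALUES (now(), now(), 'NOT_STARTED', 0);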

michaelsembwever added a commit that referenced this issue May 12, 2017
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
michaelsembwever added a commit that referenced this issue May 13, 2017
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101
michaelsembwever added a commit that referenced this issue May 13, 2017
…repair_run table.

Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102
michaelsembwever (Member Author)

In fact this could collapse three tables (repair_segment, repair_run, and repair_segment_by_run_id) into one table like:

CREATE TABLE IF NOT EXISTS repair_run (
  id                  bigint,
  cluster_name        text static,
  repair_unit_id      bigint static,
  cause               text static,
  owner               text static,
  state               text static,
  creation_time       timestamp static,
  start_time          timestamp static,
  end_time            timestamp static,
  pause_time          timestamp static,
  intensity           double static,
  last_event          text static,
  segment_count       int static,
  repair_parallelism  text static,
  segment_id          bigint,
  start_token         varint,
  end_token           varint,
  segment_state       int,
  coordinator_host    text,
  segment_start_time  timestamp,
  segment_end_time    timestamp,
  fail_count          int,
  PRIMARY KEY (id, segment_id)
);

This has been addressed in #102

adejanovski (Contributor) commented May 23, 2017

@michaelsembwever, I've tested with the row cache activated on the repair_segment table, and we almost exclusively hit the cache:

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             0.00             20.50             29.52               372                10
75%             0.00             24.60             29.52               372                10
95%             0.00             24.60             42.51               372                10
98%             0.00             24.60             51.01               372                10
99%             0.00             24.60             88.15               372                10
Min             0.00             11.87              9.89               311                 9
Max             1.00             24.60          30130.99               372                10

Nodetool info:

nodetool info
ID                     : 504625e7-6bf7-4419-b6a6-466831d3a0ca
Gossip active          : true
Thrift active          : false
Native Transport active: true
Load                   : 13.83 GB
Generation No          : 1495555543
Uptime (seconds)       : 11042
Heap Memory (MB)       : 2072.83 / 3804.00
Off Heap Memory (MB)   : 120.84
Data Center            : datacenter1
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 18230, size 2.58 MB, capacity 100 MB, 4385 hits, 22753 requests, 0.193 recent hit rate, 14400 save period in seconds
Row Cache              : entries 2265, size 2.21 KB, capacity 200 MB, 1789547 hits, 1795750 requests, 0.997 recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 50 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds

michaelsembwever added a commit that referenced this issue May 23, 2017
…repair_run table.

Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102
michaelsembwever added a commit that referenced this issue May 28, 2017
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101
michaelsembwever added a commit that referenced this issue May 28, 2017
…repair_run table.

Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102
michaelsembwever added a commit that referenced this issue May 28, 2017
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101
adejanovski pushed a commit that referenced this issue May 29, 2017
…repair_run table. (#102)

Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102
adejanovski pushed a commit that referenced this issue May 29, 2017
ref:
 - #99
 - #94

Makes the schema changes in a separate migration step, so that data in the repair_unit and repair_schedule tables can be migrated over.

ref: #99 (comment)

Move the new schema migration to 003 as 002 already exists in master

Recover Cassandra migration 002

Fix incorrect type used to get incremental repair value during schema migration
adejanovski pushed a commit that referenced this issue May 29, 2017
ref:
 - #99
 - #94

Makes the schema changes in a separate migration step, so that data in the repair_unit and repair_schedule tables can be migrated over.

ref: #99 (comment)

Move the new schema migration to 003 as 002 already exists in master

Recover Cassandra migration 002

Fix incorrect type used to get incremental repair value during schema migration
adejanovski pushed a commit that referenced this issue May 29, 2017
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101
adejanovski pushed a commit that referenced this issue May 29, 2017
* Cassandra performance: Replace sequence ids with time-based UUIDs

ref:
 - #99
 - #94

* Simplify the creation of repair runs and their segments.
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101

* SQUASH ME
 Makes the schema changes in a separate migration step, so that data in the repair_unit and repair_schedule tables can be migrated over.

ref: #99 (comment)

* Fix file names and 002 migration file
adejanovski pushed a commit that referenced this issue May 29, 2017
…repair_run table.

Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102
michaelsembwever added a commit that referenced this issue May 31, 2017
  Makes the schema changes in a separate migration step, so that data in the repair_unit and repair_schedule tables can be migrated over.

ref:
 - #99
 - #94
 - #99 (comment)
michaelsembwever added a commit that referenced this issue May 31, 2017
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101
michaelsembwever added a commit that referenced this issue May 31, 2017
…repair_run table.

Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102
adejanovski added a commit that referenced this issue Jun 1, 2017
* Cassandra performance: Replace sequence ids with time-based UUIDs
  Makes the schema changes in a separate migration step, so that data in the repair_unit and repair_schedule tables can be migrated over.

ref:
 - #99
 - #94
 - #99 (comment)

* Simplify the creation of repair runs and their segments.
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101

* In CassandraStorage implement segments as clustering keys within the repair_run table.
Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102

* Fix number of parallel repair computation
Downgrade to Dropwizard 1.0.7 and Guava 19.0 to fix dependency issues
Make repair manager schedule cycle configurable (was 30s hardcoded)

ref: #108

* In CassandraStorage replace the table scan on `repair_run` with an async break-down of per-cluster run-throughs of known run IDs.

 ref: #105
adejanovski (Contributor)

Performance issues were fixed with release 0.6.0. Closing.

adejanovski pushed a commit that referenced this issue Jun 26, 2017
  Makes the schema changes in a separate migration step, so that data in the repair_unit and repair_schedule tables can be migrated over.

ref:
 - #99
 - #94
 - #99 (comment)
adejanovski pushed a commit that referenced this issue Jun 26, 2017
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101
adejanovski pushed a commit that referenced this issue Jun 26, 2017
…repair_run table.

Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102
michaelsembwever added a commit that referenced this issue Jun 27, 2017
… cache a floor for the sequence number to reduce lookups. (#95)

ref:
 - #94
 - #95
michaelsembwever added a commit that referenced this issue Jun 27, 2017
michaelsembwever pushed a commit that referenced this issue Jun 27, 2017
* Cassandra performance: Replace sequence ids with time-based UUIDs
  Makes the schema changes in a separate migration step, so that data in the repair_unit and repair_schedule tables can be migrated over.

ref:
 - #99
 - #94
 - #99 (comment)

* Simplify the creation of repair runs and their segments.
 Repair runs and their segments are one unit of work in concept and the persistence layer should be designed accordingly.
 Previously they were separated because the concern of sequence generation for IDs was exposed in the code. This is now encapsulated within storage implementations.
 This work allows the CassandraStorage to implement segments as clustering keys within the repair_run table.

ref:
 - #94
 - #101

* In CassandraStorage implement segments as clustering keys within the repair_run table.
Change required in IStorage so as to identify a segment by both runId and segmentId.

ref:
 - #94
 - #102

* Fix number of parallel repair computation
Downgrade to Dropwizard 1.0.7 and Guava 19.0 to fix dependency issues
Make repair manager schedule cycle configurable (was 30s hardcoded)

ref: #108

* In CassandraStorage replace the table scan on `repair_run` with an async break-down of per-cluster run-throughs of known run IDs.

 ref: #105