-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Clustered Index #4841
Comments
I am trying to understand this feature correctly. Is it a change so that the handle_id may be some other key or column, other than the primary key? |
Clustering your data determines how the data is laid out on disk. Several databases use different terminology to describe the same thing. Let me explain it by example. Suppose you have a schema with columns
You can have secondary indexes to help you know where to look for data on disk, but if you want 1000 rows for a single In MySQL with InnoDB, the PRIMARY KEY means it is the clustering key, and there's only one clustering key. In MySQL/derivatives with TokuDB as storage engine you can have multiple clustering keys. That would make TokuDB effectively duplicate the table for you in different orders for each additional clustering key. Both TokuDB and InnoDB do a little bit of trade-off to not hurt INSERT performance too much. Consolidation of data before it's actually written to the data files, not fully filled pages and a bit of fragmentation is the result, but in practice it's still a huge benefit in performance for the SELECTs for specific PostgreSQL has even another approach: it does not order data on disk in a specific order. It just appends new INSERTS at the end of the data file, until you tell it to cluster it (e.g. CLUSTER command or Cassandra works similarly to MySQL with regard to the user/schema with a primary key, but under the hood more like PostgreSQL with later compaction in the background to cluster data efficiently (but that's in fact quite complex, and this sentence is not very accurate once you know the details). To get back to TiDB case; it would be very much beneficial to be able to control how data is laid out on disk and to be able to make a trade-off in INSERT performance to gain certain SELECT performance with it. |
https://github.com/pingcap/tidb/projects/45 Workload Estimation600 |
Need to add a sub-task for TiFlash Clustered Index. |
Description
Currently, TiDB arranges table's row data by handle id, for example we have a table whose schema is
There will be two key-value pairs for each row data:
The handle id is a uint64 generated by TiDB. There are some disadvantages for this type of data arrangement:
data
by primary key, TiDB will trigger two read operations, one is to fetch the related handle id by primary key, and the other is to fetch thedata
by the handle id. This will increase the latency of SELECT statements.In the clustered index, we arrange row data by primary key, in the above example, there will be only one key-value pair for each row data, this will the disadvantages mentioned above:
Task list
TiFlash is column-oriented storage, it only stores specified tables' row data, used to speedup AP queries on these tables. After TiDB supports the clustered-index feature, TiFlash also should recognize it.
Progress Tracking: https://github.com/pingcap/tidb/projects/45
Category
Feature
Value
The clustered index can speed up some queries. Its usage can refer to:
Workload Estimation
600
Time
GanttStart: 2020-06-01
GanttDue: 2020-10-30
GanttProgress: 100%
Weekly Report
https://docs.google.com/document/d/1zeCni1g-BfGsi7B_qWSbSJGXFiWrDaUsqic14gXygk8/edit?usp=sharing
The text was updated successfully, but these errors were encountered: