-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
Describe the bug
The specific meaning of this error can be found in #3270. And in PR #3271, we try to alleviate this problem by introducing a new BE config cumulative_compaction_skip_window_seconds.
However, this config does not completely solve this problem, because in the following two cases, we can not accurately set the config value.
- Query with Join
In the query with Join operation, the query engine will first perform a Build operation on the right table, and wait until the right table is built before Open the left table for data reading. The operation of obtaining the rowset reader according to the version only starts when it is Open. If the build phase is very time-consuming, it will cause the Open on the left to be executed long after the query starts, and the version that should be read at this time may has been merged. The time of the build phase cannot be estimated, so it is difficult to set a reasonable value for the config cumulative_compaction_skip_window_seconds.
- Transaction publish delay
Each BE merges its own tablet separately. Only the published version will be merged. The Publish here only represents the publish of BE's own tablet, not the corresponding successful publish of the load transaction. This causes the following problems, for example:
1. Assume that the version corresponding to a load transaction is 10.
2. In this load transaction, there are three replicas of a tablet, one of which is published successfully at 10:00, and the other two replicas are published successfully at 10:10. So the finish time of the entire load transaction is 10:10. In other words, users can query the 10 version of the data from 10:10. And because there is already one replica published at 10:00. That replica may have received the 11 version of the publish task at 10:02. Then at 10:05, a compaction is performed to form the [0-11] version. Then when 10:10, the user uses the 10 version to query, that replica will report an error "-230".
The above two situations can be alleviated by increasing the cumulative_compaction_skip_window_seconds configuration, but
1. The larger the config may cause the version backlog, and the compaction is not timely.
2. This is just an empirical value and cannot completely solve this problem.
Resolution
The essence of the problem is that before we read the specified version, the specified version has been merged. We can control the timing of compaction by recording the current largest read version in the tablet.
In the tablet, the current latest read version is recorded in field latest_read_version. This version is set every time the tablet is read, and only the largest version currently read is recorded.
When compaction selects a version for merging, only data whose version is less than or equal to latest_read_version is selected. This will ensure that if a version(or a version larger than it), has not been read, the version will not be compacted.
Of course, under this strategy, if there has been no read request, the compaction will not be triggered. So we need to set a threshold to prevent the data from never being merged. We can use cumulative_compaction_skip_window_seconds to configure this threshold. That is, when the threshold time is exceeded, latest_read_version will be ignored and the version merge will start.
This strategy may have problems in the following scenarios: the query frequency is low, such as once every 10 minutes, but the load frequency is high, such as 10 times per minute. In this case, if the cumulative_compaction_skip_window_seconds setting is large, it may accumulate large number of versions, eg, 100 versions. If the setting is small, the -230 error may still be generated. However, in terms of actual operation and maintenance experience, setting cumulative_compaction_skip_window_seconds to 5 minutes can basically solve most problems.