Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-balance big parts in JBOD array #16481

Merged
merged 5 commits into from
Mar 12, 2021
Merged

Conversation

amosbird
Copy link
Collaborator

@amosbird amosbird commented Oct 28, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Introduce a new merge tree setting min_bytes_to_rebalance_partition_over_jbod which allows assigning new parts to different disks of a JBOD volume in a balanced way.

Detailed description / Documentation draft:

JBOD volume is notorious in that only a small subset of disks can be fully utilized for a given query. This PR focuses on a particular common scenario: partition-wise queries. Now big new parts are assigned to different disks of a JBOD volume in a balanced way.
It might be useful for #16300.

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Oct 28, 2020
@amosbird amosbird marked this pull request as draft October 28, 2020 15:25
@amosbird amosbird force-pushed the jbodbalance branch 2 times, most recently from c28def2 to e10faae Compare October 29, 2020 07:32
@alesapin alesapin self-assigned this Nov 20, 2020
@amosbird amosbird marked this pull request as ready for review February 18, 2021 08:52
@amosbird amosbird requested a review from alesapin February 18, 2021 08:56
@amosbird amosbird force-pushed the jbodbalance branch 2 times, most recently from 9bc3ec3 to a36c60c Compare February 18, 2021 10:40
@amosbird
Copy link
Collaborator Author

@azat
Copy link
Collaborator

azat commented Feb 19, 2021

@azat Another uncaught exception from dtor. Please take a look. https://clickhouse-test-reports.s3.yandex.net/16481/5d150cce4778dd14f58dcff67435bdec1efa155b/stress_test_(thread).html#fail1

Here is another patch - #20981

Copy link
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm a little confused, but we just have three types of parts:

  1. Existing parts.
  2. Currently merging source parts that will become outdated (included in 1).
  3. Currently merging parts that will be added as a result of the merge.

For each new merged/fetched part we are trying to find a disk that has a minimum size of data in the same partition. So we just summing sizes by a disk of all existing parts from this same partition excluding parts 2) and adding parts 3). After that, we choose the smallest occupied disk.

The code looks slightly more complicated :)

Btw, how it will work with background moves? Also without tests merge is impossible.


// Pre-define a reasonable minimum size for the JBOD rebalancer
static constexpr size_t MIN_BYTES_TO_REBALANCE_OVER_JBOD = 100 * 1024 * 1024;
if (storage_settings->min_bytes_to_rebalance_partition_over_jbod > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for (const auto & part : parts)
{
if (!storage.currently_submerging_parts.count(part))
LOG_WARNING(log, "currently_submerging_parts is missing parts. JBOD might lose balance");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear message. Which part is missing? What does it mean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


for (const auto & [name, emerging_part] : currently_emerging_parts)
{
// It's possible that the emerging parts are committed and get added twice. Thus a set is used to deduplicate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because currently_emerging_parts is also maintained separately. We do commit before removing parts from it. Thus the previous getDataPartsStateRange(MergeTreeData::DataPartState::Committed) might contain (rarely) the same part.

{
auto name = part->volume->getDisk()->getName();
auto it = disk_occupation.find(name);
if (it != disk_occupation.end())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, it's a LOGICAL_ERROR, isn't it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that parts are stored on different volumes. We ignore them here.

{
it->second += part->getBytesOnDisk();
disk_parts[name].push_back(formatReadableSizeWithBinarySuffix(part->getBytesOnDisk()));
big_parts.insert(part->name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big_not_merging_parts?


// Also include current submerging parts
for (const auto & part : covered_parts)
merging_parts.insert(part->name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why they are already not here? Ho we can have covered parts which are not in currently_submerging_parts? Also why we don't check?

part->isStoredOnDisk() && part->getBytesOnDisk() >= min_bytes_to_rebalance_partition_over_jbod
                && part_info.partition_id == part->info.partition_id

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently_submerging_parts only records parts in previous balanced reservation.

for (const auto & [name, emerging_part] : currently_emerging_parts)
{
// It's possible that the emerging parts are committed and get added twice. Thus a set is used to deduplicate.
if (big_parts.find(name) == big_parts.end())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, no checks -- maybe this part from a different partition or small?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we should check partition here. It will be alway big though.

{
const auto & disks = getStoragePolicy()->getVolume(max_volume_index)->getDisks();
std::map<String, size_t> disk_occupation;
std::map<String, std::vector<String>> disk_parts;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment: "Used only for logging"

if (part->isStoredOnDisk() && part->getBytesOnDisk() >= min_bytes_to_rebalance_partition_over_jbod
&& part_info.partition_id == part->info.partition_id)
{
merging_parts.insert(part->name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big_submerging_parst_from_partition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The set is used to remove some big valid parts out of calculation. It doesn't have currently submerging parts because those are not yet inserted into currently_submerging_parts, which only contains parts that participate in some other balanced activity.

try
{
const auto & disks = getStoragePolicy()->getVolume(max_volume_index)->getDisks();
std::map<String, size_t> disk_occupation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why map and not unordered_map?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disk names are usually sorted.

for _ in range(10):
try:
print("Syncing replica")
node2.query("SYSTEM SYNC REPLICA tbl")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query ignores timeout, which exception do we expect?

Copy link
Collaborator Author

@amosbird amosbird Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added timeout = 10s. We can safely ignore any exceptions. The test checks the final state of disk balance.

"insert into tmp2 select randConstant() % 2, randomPrintableASCII(16) from numbers(50)"
)

time.sleep(2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't.

WriteBufferFromOwnString log_str;
writeCString("\nbalancer: \n", log_str);
for (const auto & [disk_name, per_disk_parts] : disk_parts_for_logging)
writeString(fmt::format(" {}: [{}]\n", disk_name, boost::algorithm::join(per_disk_parts, ", ")), log_str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, fmt::join looks prettier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Copy link
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, quite isolated and useful code. Also, we have a setting that allows disabling this behavior.

src/Storages/MergeTree/MergeTreeSettings.h Outdated Show resolved Hide resolved
@alesapin alesapin merged commit cca05da into ClickHouse:master Mar 12, 2021
@sevirov
Copy link
Contributor

sevirov commented Jul 29, 2021

Internal documentation ticket: DOCSUP-12425

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants