Why does the backup duration sometimes stretch to hours with no errors? #880

Closed
hueiyuan opened this issue Apr 3, 2024 · 9 comments

Comments

@hueiyuan

hueiyuan commented Apr 3, 2024

Description

Hi everyone,
We found that some of our backups finish very quickly, while others take up to 6 hours, and the slow runs show no errors at all. Does anyone have an idea what could explain this?

For context, our table TTL keeps data on local disk for the first day; data older than 1 day and up to 30 days is moved to AWS S3.
clickhouse-backup runs as a sidecar of clickhouse-server, with a resource request of 1 core / 1 GB and a limit of 2 cores / 2 GB.
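
For reference, a minimal sketch of what this kind of tiered TTL table definition typically looks like. The table and column names are hypothetical; only the volume and storage policy names are taken from the storage_policies output shared later in this thread:

-- Hypothetical table illustrating the tiered TTL described above:
-- parts stay on the local volume for 1 day, move to the S3-backed
-- 'cold' volume afterwards, and are deleted after 30 days.
CREATE TABLE otel.logs_example
(
    timestamp DateTime,
    body      String
)
ENGINE = MergeTree
ORDER BY timestamp
TTL timestamp + INTERVAL 1 DAY TO VOLUME 'cold',
    timestamp + INTERVAL 30 DAY DELETE
SETTINGS storage_policy = 'move_from_local_disks_to_s3';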

  • print-config
general:
    remote_storage: s3
    max_file_size: 0
    disable_progress_bar: true
    backups_to_keep_local: 0
    backups_to_keep_remote: 50
    log_level: debug
    allow_empty_backups: false
    download_concurrency: 1
    upload_concurrency: 1
    use_resumable_state: true
    restore_schema_on_cluster: ""
    upload_by_part: true
    download_by_part: true
    restore_database_mapping: {}
    retries_on_failure: 3
    retries_pause: 30s
    watch_interval: 30m
    full_interval: 24h
    watch_backup_name_template: shard{shard}-{type}-{time:20060102150405}
    sharded_operation_mode: ""
    cpu_nice_priority: 15
    io_nice_priority: idle
    retriesduration: 30s
    watchduration: 30m0s
    fullduration: 24h0m0s
clickhouse:
    username: username
    password: password
    host: localhost
    port: 9000
    disk_mapping: {}
    skip_tables:
        - system.*
        - INFORMATION_SCHEMA.*
        - information_schema.*
        - _temporary_and_external_tables.*
    skip_table_engines: []
    timeout: 5m
    freeze_by_part: false
    freeze_by_part_where: ""
    use_embedded_backup_restore: false
    embedded_backup_disk: ""
    backup_mutations: true
    restore_as_attach: false
    check_parts_columns: true
    secure: false
    skip_verify: false
    sync_replicated_tables: false
    log_sql_queries: true
    config_dir: /etc/clickhouse-server/
    restart_command: exec:systemctl restart clickhouse-server
    ignore_not_exists_error_during_freeze: true
    check_replicas_before_attach: true
    tls_key: ""
    tls_cert: ""
    tls_ca: ""
    max_connections: 2
    debug: false
s3:
    access_key: ""
    secret_key: ""
    bucket: ipp-clickhouse-backup-int
    endpoint: ""
    region: us-west-2
    acl: private
    assume_role_arn: arn:aws:iam::xxxx:role/test-rolw
    force_path_style: true
    path: backup/chi-shard-backup
    object_disk_path: tiered-backup
    disable_ssl: false
    compression_level: 1
    compression_format: zstd
    sse: ""
    sse_kms_key_id: ""
    sse_customer_algorithm: ""
    sse_customer_key: ""
    sse_customer_key_md5: ""
    sse_kms_encryption_context: ""
    disable_cert_verification: false
    use_custom_storage_class: false
    storage_class: STANDARD
    custom_storage_class_map: {}
    concurrency: 3
    part_size: 0
    max_parts_count: 2000
    allow_multipart_download: false
    object_labels: {}
    request_payer: ""
    check_sum_algorithm: ""
    debug: true
gcs:
    credentials_file: ""
    credentials_json: ""
    credentials_json_encoded: ""
    bucket: ""
    path: ""
    object_disk_path: ""
    compression_level: 1
    compression_format: tar
    debug: false
    force_http: false
    endpoint: ""
    storage_class: STANDARD
    object_labels: {}
    custom_storage_class_map: {}
    client_pool_size: 6
cos:
    url: ""
    timeout: 2m
    secret_id: ""
    secret_key: ""
    path: ""
    compression_format: tar
    compression_level: 1
    debug: false
api:
    listen: 0.0.0.0:7171
    enable_metrics: true
    enable_pprof: false
    username: ""
    password: ""
    secure: false
    certificate_file: ""
    private_key_file: ""
    ca_cert_file: ""
    ca_key_file: ""
    create_integration_tables: true
    integration_tables_host: ""
    allow_parallel: false
    complete_resumable_after_restart: true
ftp:
    address: ""
    timeout: 2m
    username: ""
    password: ""
    tls: false
    skip_tls_verify: false
    path: ""
    object_disk_path: ""
    compression_format: tar
    compression_level: 1
    concurrency: 6
    debug: false
sftp:
    address: ""
    port: 22
    username: ""
    password: ""
    key: ""
    path: ""
    object_disk_path: ""
    compression_format: tar
    compression_level: 1
    concurrency: 6
    debug: false
azblob:
    endpoint_schema: https
    endpoint_suffix: core.windows.net
    account_name: ""
    account_key: ""
    sas: ""
    use_managed_identity: false
    container: ""
    path: ""
    object_disk_path: ""
    compression_level: 1
    compression_format: tar
    sse_key: ""
    buffer_size: 0
    buffer_count: 3
    max_parts_count: 256
    timeout: 4h
    debug: false
custom:
    upload_command: ""
    download_command: ""
    list_command: ""
    delete_command: ""
    command_timeout: 4h
    commandtimeoutduration: 4h0m0s

Additional questions

  • Will the backup time grow as the data size increases?
  • If a backup stalls or is suspended, will the ClickHouse server itself be interrupted?
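
Since create_integration_tables is enabled in the config above, one way to check how long individual backup commands ran, and whether any of them failed, is to query the integration tables that clickhouse-backup creates inside ClickHouse. A minimal sketch, assuming the default table names (columns vary between versions, hence SELECT *):

-- Recent backup commands with their status and errors.
SELECT * FROM system.backup_actions;

-- Local and remote backups known to clickhouse-backup.
SELECT * FROM system.backup_list;
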
@Slach
Collaborator

Slach commented Apr 3, 2024

Do you have an object disk such as S3 or Azure?

Could you share the output of

SELECT * FROM system.disks
SELECT * FROM system.storage_policies

?

Judging by the non-empty

s3:
 object_disk_path: tiered-backup

it looks like you do.

So this is not a "pause"; it is server-side CopyObject execution, which allows you to restore your data after DROP TABLE ... SYNC / DROP DATABASE ... SYNC.

Try changing /etc/clickhouse-backup/config.yml

general:
  log_level: debug

and share the logs

@hueiyuan
Author

hueiyuan commented Apr 3, 2024

@Slach Thanks for your assistance. The requested output is below:

  • SELECT * FROM system.disks
Row 1:
──────
name:             default
path:             /var/lib/clickhouse/
free_space:       33461338112
total_space:      52521566208
unreserved_space: 33461338112
keep_free_space:  0
type:             local
is_encrypted:     0
is_read_only:     0
is_write_once:    0
is_remote:        0
is_broken:        0
cache_path:

Row 2:
──────
name:             s3_tier_cold
path:             /var/lib/clickhouse/disks/s3_tier_cold/
free_space:       18446744073709551615
total_space:      18446744073709551615
unreserved_space: 18446744073709551615
keep_free_space:  0
type:             s3
is_encrypted:     0
is_read_only:     0
is_write_once:    0
is_remote:        1
is_broken:        0
cache_path:
2 rows in set. Elapsed: 0.002 sec.
  • SELECT * FROM system.storage_policies
Row 1:
──────
policy_name:                default
volume_name:                default
volume_priority:            1
disks:                      ['default']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

Row 2:
──────
policy_name:                move_from_local_disks_to_s3
volume_name:                cold
volume_priority:            1
disks:                      ['s3_tier_cold']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0.1
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

Row 3:
──────
policy_name:                move_from_local_disks_to_s3
volume_name:                hot
volume_priority:            2
disks:                      ['default']
volume_type:                JBOD
max_data_part_size:         0
move_factor:                0.1
prefer_not_to_merge:        0
perform_ttl_move_on_insert: 1
load_balancing:             ROUND_ROBIN

3 rows in set. Elapsed: 0.003 sec.
  • The tiered-backup path in AWS S3 is not empty and contains objects.

@Slach
Collaborator

Slach commented Apr 3, 2024

During backup creation, for all tables with SETTINGS storage_policy='move_from_local_disks_to_s3',
clickhouse-backup will execute s3:CopyObject into the tiered-backup path in your backup bucket.

We will improve the speed of the CopyObject execution for object disk data via incremental backups in v2.5.
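
One way to see which tables this affects is to query system.tables for that storage policy; a minimal sketch, using the policy name from the storage_policies output above:

-- Tables whose storage policy can place parts on the S3-backed disk;
-- their object-disk parts are copied via server-side s3:CopyObject
-- during backup creation.
SELECT database, name, storage_policy
FROM system.tables
WHERE storage_policy = 'move_from_local_disks_to_s3';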

Slach closed this as completed Apr 3, 2024
@Slach
Collaborator

Slach commented Apr 3, 2024

check

SELECT 
  count() AS parts, database, 
  uniqExact(table) AS tables, active, disk_name, 
  formatReadableSize(sum(bytes_on_disk)) 
FROM system.parts 
GROUP BY database, active, disk_name 
FORMAT Vertical

@hueiyuan
Author

hueiyuan commented Apr 3, 2024

@Slach
Thanks for the answer. Could you confirm the ETA for v2.5?

@hueiyuan
Author

hueiyuan commented Apr 3, 2024

check

SELECT 
  count() AS parts, database, 
  uniqExact(table) AS tables, active, disk_name, 
  formatReadableSize(sum(bytes_on_disk)) 
FROM system.parts 
GROUP BY database, active, disk_name 
FORMAT Vertical

@Slach For your information:

Row 1:
──────
parts:                                  38
database:                               otel
tables:                                 4
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 1.86 GiB

Row 2:
──────
parts:                                  10462
database:                               otel
tables:                                 3
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 2.98 GiB

Row 3:
──────
parts:                                  439
database:                               system
tables:                                 5
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 128.27 MiB

Row 4:
──────
parts:                                  218
database:                               otel
tables:                                 4
active:                                 1
disk_name:                              s3_tier_cold
formatReadableSize(sum(bytes_on_disk)): 37.44 GiB

Row 5:
──────
parts:                                  234
database:                               system
tables:                                 7
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 7.48 GiB

5 rows in set. Elapsed: 0.023 sec. Processed 11.39 thousand rows, 721.45 KB (488.22 thousand rows/s., 30.92 MB/s.)
Peak memory usage: 15.72 KiB.

--> But this is only our dev environment data size.

For reference, here are our production sizes:

Row 1:
──────
parts:                                  204
database:                               otel
tables:                                 5
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 334.00 GiB

Row 2:
──────
parts:                                  11862
database:                               otel
tables:                                 3
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 3.19 GiB

Row 3:
──────
parts:                                  571
database:                               system
tables:                                 7
active:                                 0
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 275.28 MiB

Row 4:
──────
parts:                                  220
database:                               otel
tables:                                 3
active:                                 1
disk_name:                              s3_tier_cold
formatReadableSize(sum(bytes_on_disk)): 444.90 GiB

Row 5:
──────
parts:                                  343
database:                               system
tables:                                 7
active:                                 1
disk_name:                              default
formatReadableSize(sum(bytes_on_disk)): 11.09 GiB

5 rows in set. Elapsed: 0.023 sec. Processed 13.20 thousand rows, 822.72 KB (565.34 thousand rows/s., 35.24 MB/s.)
Peak memory usage: 20.98 KiB.

@Slach
Collaborator

Slach commented Apr 3, 2024

@Slach Thanks for the answer. Could you confirm the ETA for v2.5?

Subscribe to #843 and watch the progress there.

@hueiyuan
Author

hueiyuan commented Apr 8, 2024

@Slach
Can v2.5 automatically restart the watch CLI after the watch has stopped because of errors?

@Slach
Collaborator

Slach commented Apr 9, 2024

Can v2.5 automatically restart the watch CLI after the watch has stopped because of errors?

It resolves the issue of reconnecting to clickhouse-server.
But if backups keep failing for longer than the time a full backup is allowed to take within the full watch period,
then the watch command sequence will stop,
because you need to sort out your configuration before the watch can continue.
Maybe we should change this behavior,
but please create a new issue for that case.
