feat(query): Reduce serialisation memory usage when spilling to local disk #16580

Merged
merged 12 commits into from
Oct 14, 2024

Conversation

forsaken628
Collaborator

@forsaken628 forsaken628 commented Oct 9, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Changes:

  1. Blocks are now serialized directly into the aligned DIO buffer, instead of being serialized first and then copied into it.
  2. The buffer is freed incrementally as the file is written, instead of all at once after the entire file has been written.
  3. Spill files are compressed with LZ4, which dramatically reduces I/O.
  4. A new setting, enable_dio, controls direct I/O; disabling it makes spill files use buffered I/O for reads and writes.
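
A minimal sketch of the idea behind changes 1 and 2, not the code in this PR: records are written straight into a page-aligned staging buffer, and the buffer is flushed and reused each time it fills rather than after the whole file is assembled. The class name `AlignedSpillWriter`, the 4096-byte alignment, and the length-prefixed framing are all assumptions made for illustration; an ordinary temporary file stands in for a file opened with direct I/O.

```python
import mmap
import struct
import tempfile

ALIGN = 4096  # assumed alignment; direct I/O usually needs block-size multiples


class AlignedSpillWriter:
    """Stage length-prefixed records in a page-aligned buffer and flush as it fills."""

    def __init__(self, file, capacity=4 * ALIGN):
        assert capacity % ALIGN == 0
        self.file = file
        self.buf = mmap.mmap(-1, capacity)  # anonymous mappings start on a page boundary
        self.capacity = capacity
        self.pos = 0      # bytes currently staged in the buffer
        self.flushed = 0  # bytes already handed to the file

    def write_record(self, payload):
        # Header and payload go straight into the aligned buffer;
        # no complete serialized copy of the record is built first.
        self._write(struct.pack("<I", len(payload)))
        self._write(payload)

    def _write(self, data):
        view = memoryview(data)
        while len(view) > 0:
            n = min(len(view), self.capacity - self.pos)
            self.buf[self.pos:self.pos + n] = view[:n]
            self.pos += n
            view = view[n:]
            if self.pos == self.capacity:
                self._flush(self.capacity)  # buffer is reused before the file is complete

    def _flush(self, n):
        # Python copies the slice here; a real DIO path would pass the aligned memory itself.
        self.file.write(self.buf[:n])
        self.flushed += n
        self.pos = 0

    def finish(self):
        if self.pos:
            padded = -(-self.pos // ALIGN) * ALIGN  # round up to the alignment
            self.buf[self.pos:padded] = bytes(padded - self.pos)  # zero padding
            self._flush(padded)
        self.buf.close()
        return self.flushed


records = [(b"row-%d;" % i) * 10 for i in range(1000)]
with tempfile.TemporaryFile() as f:
    writer = AlignedSpillWriter(f)
    for rec in records:
        writer.write_record(rec)
    total_written = writer.finish()
    f.seek(0)
    spilled = f.read()

raw_size = sum(4 + len(r) for r in records)
first_len = struct.unpack_from("<I", spilled)[0]
print(total_written, raw_size, first_len)
```

Peak staging memory in this sketch is bounded by the buffer capacity, independent of the spill file size, which is the property the second change is after.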

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Oct 9, 2024
@forsaken628 forsaken628 mentioned this pull request Oct 9, 2024
4 tasks
Signed-off-by: coldWater <forsaken628@gmail.com>
Signed-off-by: coldWater <forsaken628@gmail.com>
Signed-off-by: coldWater <forsaken628@gmail.com>
Signed-off-by: coldWater <forsaken628@gmail.com>
Signed-off-by: coldWater <forsaken628@gmail.com>
@forsaken628 forsaken628 added the ci-cloud Build docker image for cloud test label Oct 10, 2024
Contributor

Docker Image for PR

  • tag: pr-16580-2c9bf0a-1728535521

note: this image tag is only available for internal use,
please check the internal doc for more details.

@forsaken628
Collaborator Author

forsaken628 commented Oct 10, 2024

benchmark:

dataset: tpch sf100

settings:

set max_memory_usage = 16*1024*1024*1024;
set window_partition_spilling_memory_ratio = 30;

sql

EXPLAIN ANALYZE SELECT
    l_orderkey,
    l_partkey,
    l_quantity,
    l_extendedprice,
    ROW_NUMBER() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS row_num,
    RANK() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS rank_num
FROM
    lineitem ignore_result;

remote only

set window_partition_spilling_to_disk_bytes_limit = 0; 

        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 542.237127119s
        ├── wait time: 175.743771795s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers remote spilled by write: 112
        ├── bytes remote spilled by write: 26.70 GiB
        ├── remote spilled time by write: 206.274s

        ├── numbers remote spilled by read: 1520
        ├── bytes remote spilled by read: 26.70 GiB
        ├── remote spilled time by read: 96.865s

local only

set window_partition_spilling_to_disk_bytes_limit = 30*1024*1024*1024;

        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 399.754689619s
        ├── wait time: 136.395287921s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers local spilled by write: 127
        ├── bytes local spilled by write: 26.71 GiB
        ├── local spilled time by write: 97.54s

        ├── numbers local spilled by read: 1776
        ├── bytes local spilled by read: 26.71 GiB
        ├── local spilled time by read: 38.516s

mix

set window_partition_spilling_to_disk_bytes_limit = 10*1024*1024*1024;

        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 467.401265027s
        ├── wait time: 187.011393311s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers remote spilled by write: 73
        ├── bytes remote spilled by write: 16.77 GiB
        ├── remote spilled time by write: 101.024s

        ├── numbers remote spilled by read: 1153
        ├── bytes remote spilled by read: 16.77 GiB
        ├── remote spilled time by read: 45.963s

        ├── numbers local spilled by write: 39
        ├── bytes local spilled by write: 9.94 GiB
        ├── local spilled time by write: 76.073s

        ├── numbers local spilled by read: 383
        ├── bytes local spilled by read: 9.94 GiB
        ├── local spilled time by read: 21.642s
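
The three runs above differ only in window_partition_spilling_to_disk_bytes_limit: 0 gives remote-only spilling, and a 10 GiB limit gives roughly 10 GiB of local spill with the rest going remote. A hypothetical sketch of a routing rule that is consistent with that behaviour follows; `SpillRouter` and `route` are illustrative names, and the per-write check is an assumption, not the logic implemented in this PR.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class SpillRouter:
    """Hypothetical policy: spill locally until the byte limit is used up."""

    local_limit_bytes: int
    local_used_bytes: int = 0

    def choose(self, size):
        # A write goes to local disk only if it still fits under the limit.
        if self.local_used_bytes + size <= self.local_limit_bytes:
            self.local_used_bytes += size
            return "local"
        return "remote"


def route(limit_bytes, sizes):
    router = SpillRouter(limit_bytes)
    return [router.choose(s) for s in sizes]


mixed = route(10 * GIB, [3 * GIB] * 5)   # limit reached after three writes
remote_only = route(0, [3 * GIB] * 3)    # a limit of 0 sends everything remote
print(mixed)
print(remote_only)
```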

Local spill is very sensitive to disk I/O performance. The local test environment is a dynamically expanding VHDX on an SSD; as soon as free disk space runs low, overall performance drops drastically and the query takes about 5 times as long as before.

cloud dev (updated)

create warehouse 'local-spill' warehouse_size='small' with version='pr-16580-2c9bf0a-1728535521' cache_size=400;

        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 156.483316215s
        ├── wait time: 632.100383231s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers remote spilled by write: 73
        ├── bytes remote spilled by write: 16.62 GiB
        ├── remote spilled time by write: 355.896s

        ├── numbers remote spilled by read: 1153
        ├── bytes remote spilled by read: 16.62 GiB
        ├── remote spilled time by read: 185.933s

        ├── numbers local spilled by write: 39
        ├── bytes local spilled by write: 9.99 GiB
        ├── local spilled time by write: 85.412s

        ├── numbers local spilled by read: 383
        ├── bytes local spilled by read: 9.99 GiB
        ├── local spilled time by read: 23.979s

The cloud dev environment behaves oddly:

  1. Disk I/O performance is very poor.
  2. There were a few crashes, which I guess are still OOMs; a lower thread count would help. (An unexplained failure that I could not reproduce.)
  3. Only 8 GB can be written, even with a very large disk. (fixed)

I have not encountered any of these in local testing.

@forsaken628 forsaken628 marked this pull request as ready for review October 10, 2024 12:08
Signed-off-by: coldWater <forsaken628@gmail.com>
Signed-off-by: coldWater <forsaken628@gmail.com>
@forsaken628 forsaken628 added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Oct 11, 2024
Contributor

Docker Image for PR

  • tag: pr-16580-efea741-1728633047

note: this image tag is only available for internal use,
please check the internal doc for more details.

@forsaken628
Collaborator Author

forsaken628 commented Oct 11, 2024

Disabling DIO in the cloud (update: buffered I/O for writes, DIO for reads):

set enable_dio = 0;
set window_partition_spilling_to_disk_bytes_limit = 30*1024*1024*1024;

        ├── estimated rows: 600037902.00
        ├── cpu time: 143.7673667s
        ├── wait time: 2137.871630537s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers remote spilled by write: 80
        ├── bytes remote spilled by write: 18.21 GiB
        ├── remote spilled time by write: 417.852s

        ├── numbers remote spilled by read: 1264
        ├── bytes remote spilled by read: 18.21 GiB
        ├── remote spilled time by read: 303.683s

        ├── numbers local spilled by write: 32
        ├── bytes local spilled by write: 8.40 GiB
        ├── local spilled time by write: 6.322s

        ├── numbers local spilled by read: 272
        ├── bytes local spilled by read: 8.40 GiB
        ├── local spilled time by read: 1428.324s

After turning off DIO, the write time dropped, but the read time increased dramatically. This is probably determined by the underlying implementation of block storage in the cloud environment.

@Dousir9
Member

Dousir9 commented Oct 11, 2024

After disabling DIO, the local spill write speed is about 30 times that of the remote spill, maybe because the data is written to the page cache.
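
The 30x figure can be checked from the numbers reported in the enable_dio = 0 run. This is a rough calculation that assumes the "spilled time" counters can be divided directly into the spilled bytes; if those times are summed across threads, the absolute throughputs are understated, though the ratio is still indicative.

```python
def throughput_gib_per_s(gib, seconds):
    """Average throughput over the whole run, in GiB/s."""
    return gib / seconds


# Figures copied from the run with enable_dio = 0 (buffered writes, DIO reads).
local_write = throughput_gib_per_s(8.40, 6.322)
remote_write = throughput_gib_per_s(18.21, 417.852)
local_read = throughput_gib_per_s(8.40, 1428.324)
remote_read = throughput_gib_per_s(18.21, 303.683)

write_ratio = local_write / remote_write
read_ratio = remote_read / local_read

print(f"local write:  {local_write:.3f} GiB/s")
print(f"remote write: {remote_write:.3f} GiB/s")
print(f"local/remote write ratio: {write_ratio:.1f}x")
print(f"remote/local read ratio:  {read_ratio:.1f}x")
```

The same run shows the opposite picture for reads: local reads are about ten times slower than remote reads, which matches the author's observation that read time increased dramatically.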

@Dousir9
Member

Dousir9 commented Oct 11, 2024

As for the problem that only 8 GB of data can be spilled locally in the cloud environment: changing cache_size may not actually increase the available space of the cache directory, because when I set cache_size to 5, 8 GB of data was still spilled locally.
Screenshot 2024-10-11 17 10 07

@Dousir9
Member

Dousir9 commented Oct 11, 2024

Since changing cache_size does not increase the available disk space in the cloud environment, we cannot yet conclude that disk throughput on the cloud is low.
We need to find out why the available disk space on the cloud has not increased.

Signed-off-by: coldWater <forsaken628@gmail.com>
@forsaken628 forsaken628 added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Oct 11, 2024
Signed-off-by: coldWater <forsaken628@gmail.com>
Contributor

Docker Image for PR

  • tag: pr-16580-b8c0389-1728642497

note: this image tag is only available for internal use,
please check the internal doc for more details.

@forsaken628
Collaborator Author

Buffered I/O for both reads and writes:

        ├── estimated rows: 600037902.00
        ├── cpu time: 143.221267069s
        ├── wait time: 1477.216299294s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers remote spilled by write: 87
        ├── bytes remote spilled by write: 19.83 GiB
        ├── remote spilled time by write: 506.082s

        ├── numbers remote spilled by read: 1362
        ├── bytes remote spilled by read: 19.83 GiB
        ├── remote spilled time by read: 335.667s

        ├── numbers local spilled by write: 25
        ├── bytes local spilled by write: 6.60 GiB
        ├── local spilled time by write: 5.409s

        ├── numbers local spilled by read: 160
        ├── bytes local spilled by read: 6.60 GiB
        ├── local spilled time by read: 644.779s

Signed-off-by: coldWater <forsaken628@gmail.com>
Signed-off-by: coldWater <forsaken628@gmail.com>
@forsaken628 forsaken628 added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Oct 12, 2024
Contributor

Docker Image for PR

  • tag: pr-16580-5f09210-1728752404

note: this image tag is only available for internal use,
please check the internal doc for more details.

Signed-off-by: coldWater <forsaken628@gmail.com>
@forsaken628 forsaken628 added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Oct 13, 2024
Contributor

Docker Image for PR

  • tag: pr-16580-d904435-1728797635

note: this image tag is only available for internal use,
please check the internal doc for more details.

@forsaken628
Collaborator Author

pr-16580-d904435-1728797635

remote spill

        ├── estimated rows: 600037902.00
        ├── cpu time: 186.200647759s
        ├── wait time: 361.996422691s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers remote spilled by write: 128
        ├── bytes remote spilled by write: 8.45 GiB
        ├── remote spilled time by write: 212.578s

        ├── numbers remote spilled by read: 1792
        ├── bytes remote spilled by read: 8.45 GiB
        ├── remote spilled time by read: 131.925s

in 62.413 sec.

local spill dio

        ├── estimated rows: 600037902.00
        ├── cpu time: 166.382398539s
        ├── wait time: 82.413800073s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers local spilled by write: 112
        ├── bytes local spilled by write: 8.41 GiB
        ├── local spilled time by write: 37.682s

        ├── numbers local spilled by read: 1506
        ├── bytes local spilled by read: 8.41 GiB
        ├── local spilled time by read: 16.454s

in 28.618 sec.

local spill buffer io

        ├── estimated rows: 600037902.00
        ├── cpu time: 185.377149301s
        ├── wait time: 54.782024041s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB

        ├── numbers local spilled by write: 112
        ├── bytes local spilled by write: 8.41 GiB
        ├── local spilled time by write: 35.4s

        ├── numbers local spilled by read: 1536
        ├── bytes local spilled by read: 8.41 GiB
        ├── local spilled time by read: 5.226s

in 28.460 sec.
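
A short comparison of the three final runs, using only the figures reported above. Two assumptions are made here: the "in N sec." lines are read as elapsed query time, and the drop in local spilled bytes from 26.71 GiB (the earlier local-only run) to 8.41 GiB is attributed to the LZ4 compression, which the thread does not state explicitly.

```python
# Figures copied from the pr-16580-d904435-1728797635 runs.
runs = {
    "remote spill":          {"gib": 8.45, "write_s": 212.578, "read_s": 131.925, "wall_s": 62.413},
    "local spill dio":       {"gib": 8.41, "write_s": 37.682,  "read_s": 16.454,  "wall_s": 28.618},
    "local spill buffer io": {"gib": 8.41, "write_s": 35.4,    "read_s": 5.226,   "wall_s": 28.460},
}

baseline = runs["remote spill"]["wall_s"]
speedup = {name: baseline / r["wall_s"] for name, r in runs.items()}

# Local spilled bytes before (earlier local-only run) and after this image.
size_reduction = 26.71 / 8.41

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x vs remote spill")
print(f"spilled bytes reduced by about {size_reduction:.2f}x")
```

On these numbers, both local modes finish in well under half the time of remote-only spilling, and DIO and buffered I/O are within a fraction of a second of each other end to end.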

Member

@Dousir9 Dousir9 left a comment

LGTM !

@Dousir9 Dousir9 added this pull request to the merge queue Oct 14, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 14, 2024
@forsaken628 forsaken628 added this pull request to the merge queue Oct 14, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Oct 14, 2024
@BohuTANG BohuTANG merged commit 7548f99 into databendlabs:main Oct 14, 2024
83 checks passed
@forsaken628 forsaken628 deleted the spill-writer branch October 14, 2024 14:21
Labels
ci-cloud Build docker image for cloud test pr-feature this PR introduces a new feature to the codebase

4 participants