[WIP] Graphite Disk Usage Calculation #261

Open — cloudbehl opened this issue Nov 14, 2017 · 10 comments

@cloudbehl (Member) commented Nov 14, 2017

Graphite Disk Usage Calculation

Whisper storage utilization

Per data point: 12 bytes
Per metric: 12 bytes * number of data points
So for 60s:180d retention: (60 * 24 * 180 data points) * 12 bytes = 3,110,400 bytes (~2.97 MB)
For 10s:180d retention: (6 * 60 * 24 * 180 data points) * 12 bytes = 18,662,400 bytes (~17.8 MB)

The calculations below are based on Tendrl’s default storage retention policy, under which all metrics are stored as data points at a 60-second interval for 180 days.
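
The following minimal sketch (not a Tendrl utility, just a sanity check of the arithmetic above) reproduces the per-metric figures, assuming a flat 12 bytes per data point and ignoring the small per-file Whisper header:

```python
# Approximate size of one Whisper metric file for a single-archive
# retention policy, assuming 12 bytes per data point (header ignored).

def whisper_size_bytes(interval_seconds: int, retention_days: int) -> int:
    points = (86400 // interval_seconds) * retention_days  # points retained
    return points * 12

if __name__ == "__main__":
    for interval in (60, 10):
        size = whisper_size_bytes(interval, 180)
        print(f"{interval}s:180d -> {size} bytes (~{size / 2**20:.2f} MB)")
```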

There are currently two trees to enable grafana navigation:

  1. Cluster -> Volume -> Node -> Brick -> Block Device

  2. Cluster -> Node -> Brick -> Block Device

Cluster -> Volume -> Node -> Brick -> Block Device

This tree contains all the cluster-specific information for Volumes, Nodes, Bricks and Block Devices. It does NOT contain node-specific information; nodes appear here only with cluster-related data, such as rebalance information.

Block Device
Size on disk: 37,325,481 bytes (~36 MB)
Structure:
├── disk_octets
│ ├── read.wsp
│ └── write.wsp
├── disk_ops
│ ├── read.wsp
│ └── write.wsp
├── disk_time
│ ├── read.wsp
│ └── write.wsp
├── mount_utilization
│ ├── percent_used.wsp
│ ├── total.wsp
│ └── used.wsp
└── utilization
├── percent_used.wsp
├── total.wsp
└── used.wsp

Brick
Size on disk: 102,648,857 bytes (~98 MB per brick) + (number of devices * 36 MB per device)
Structure:
├── connections_count.wsp
├── device
│ └── vda
│ ├── disk_octets
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_ops
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_time
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── mount_utilization
│ │ ├── percent_used.wsp
│ │ ├── total.wsp
│ │ └── used.wsp
│ └── utilization
│ ├── percent_used.wsp
│ ├── total.wsp
│ └── used.wsp
├── entry_ops.wsp
├── fop
│ ├── GETXATTR
│ │ ├── hits.wsp
│ │ ├── latencyAvg.wsp
│ │ ├── latencyMax.wsp
│ │ └── latencyMin.wsp
│ ├── LOOKUP
│ │ ├── hits.wsp
│ │ ├── latencyAvg.wsp
│ │ ├── latencyMax.wsp
│ │ └── latencyMin.wsp
│ ├── OPENDIR
│ │ ├── hits.wsp
│ │ ├── latencyAvg.wsp
│ │ ├── latencyMax.wsp
│ │ └── latencyMin.wsp
│ └── READDIR
│ ├── hits.wsp
│ ├── latencyAvg.wsp
│ ├── latencyMax.wsp
│ └── latencyMin.wsp
├── healed_cnt.wsp
├── heal_failed_cnt.wsp
├── inode_ops.wsp
├── inode_utilization
│ ├── gauge-total.wsp
│ ├── gauge-used.wsp
│ └── percent-percent_bytes.wsp
├── iops
│ ├── gauge-read.wsp
│ └── gauge-write.wsp
├── lock_ops.wsp
├── read_write_ops.wsp
├── split_brain_cnt.wsp
├── status.wsp
└── utilization
├── gauge-total.wsp
├── gauge-used.wsp
└── percent-percent_bytes.wsp

Node
Size on disk: 12,441,712 bytes (~12 MB per host) + (number of bricks * 98 MB per brick) + (number of devices * 36 MB per device)
Structure:
├── bricks
│ └── |root|gluster_bricks|vol1_b2
│ ├── connections_count.wsp
│ ├── device
│ │ └── vda
│ │ ├── disk_octets
│ │ │ ├── read.wsp
│ │ │ └── write.wsp
│ │ ├── disk_ops
│ │ │ ├── read.wsp
│ │ │ └── write.wsp
│ │ ├── disk_time
│ │ │ ├── read.wsp
│ │ │ └── write.wsp
│ │ ├── mount_utilization
│ │ │ ├── percent_used.wsp
│ │ │ ├── total.wsp
│ │ │ └── used.wsp
│ │ └── utilization
│ │ ├── percent_used.wsp
│ │ ├── total.wsp
│ │ └── used.wsp
│ ├── inode_utilization
│ │ ├── gauge-total.wsp
│ │ ├── gauge-used.wsp
│ │ └── percent-percent_bytes.wsp
│ ├── status.wsp
│ └── utilization
│ ├── gauge-total.wsp
│ ├── gauge-used.wsp
│ └── percent-percent_bytes.wsp
├── rebalance_bytes.wsp
├── rebalance_failures.wsp
├── rebalance_files.wsp
└── rebalance_skipped.wsp

Volume
Size on disk: 46,656,545 bytes (~44.5 MB per volume) + (number of hosts * 12 MB per host) + (number of bricks * 98 MB per brick) + (number of devices * 36 MB per device)
Structure:
├── brick_count
│ ├── down.wsp
│ ├── total.wsp
│ └── up.wsp
├── geo_rep_session
│ ├── down.wsp
│ ├── partial.wsp
│ ├── total.wsp
│ └── up.wsp
├── nodes
│ ├── dhcp43-54_lab_eng_blr_redhat_com
│ │ ├── bricks
│ │ │ └── |root|gluster_bricks|vol1_b2
│ │ │ ├── connections_count.wsp
│ │ │ ├── device
│ │ │ │ └── vda
│ │ │ │ ├── disk_octets
│ │ │ │ │ ├── read.wsp
│ │ │ │ │ └── write.wsp
│ │ │ │ ├── disk_ops
│ │ │ │ │ ├── read.wsp
│ │ │ │ │ └── write.wsp
│ │ │ │ ├── disk_time
│ │ │ │ │ ├── read.wsp
│ │ │ │ │ └── write.wsp
│ │ │ │ ├── mount_utilization
│ │ │ │ │ ├── percent_used.wsp
│ │ │ │ │ ├── total.wsp
│ │ │ │ │ └── used.wsp
│ │ │ │ └── utilization
│ │ │ │ ├── percent_used.wsp
│ │ │ │ ├── total.wsp
│ │ │ │ └── used.wsp
│ │ │ ├── inode_utilization
│ │ │ │ ├── gauge-total.wsp
│ │ │ │ ├── gauge-used.wsp
│ │ │ │ └── percent-percent_bytes.wsp
│ │ │ ├── status.wsp
│ │ │ └── utilization
│ │ │ ├── gauge-total.wsp
│ │ │ ├── gauge-used.wsp
│ │ │ └── percent-percent_bytes.wsp
│ │ ├── rebalance_bytes.wsp
│ │ ├── rebalance_failures.wsp
│ │ ├── rebalance_files.wsp
│ │ └── rebalance_skipped.wsp
│ └── dhcp43-83_lab_eng_blr_redhat_com
│ ├── rebalance_bytes.wsp
│ ├── rebalance_failures.wsp
│ ├── rebalance_files.wsp
│ └── rebalance_skipped.wsp
├── pcnt_used.wsp
├── rebal_status.wsp
├── snap_count.wsp
├── state.wsp
├── status.wsp
├── subvol_count.wsp
├── usable_capacity.wsp
└── used_capacity.wsp

Cluster -> Node -> Brick -> Block Device

This tree contains all the cluster specific information for Nodes, Bricks and Block Devices.

Block Device
Size on disk: 37,325,481 bytes (~36 MB)
Structure:
├── disk_octets
│ ├── read.wsp
│ └── write.wsp
├── disk_ops
│ ├── read.wsp
│ └── write.wsp
├── disk_time
│ ├── read.wsp
│ └── write.wsp
├── mount_utilization
│ ├── percent_used.wsp
│ ├── total.wsp
│ └── used.wsp
└── utilization
├── percent_used.wsp
├── total.wsp
└── used.wsp

Brick - Without file operations
Size on disk: 40,435,965 bytes (~39 MB per brick) + (number of devices * 36 MB per device)
├── device
│ └── vda
│ ├── disk_octets
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_ops
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_time
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── mount_utilization
│ │ ├── percent_used.wsp
│ │ ├── total.wsp
│ │ └── used.wsp
│ └── utilization
│ ├── percent_used.wsp
│ ├── total.wsp
│ └── used.wsp
├── entry_ops.wsp
├── inode_ops.wsp
├── inode_utilization
│ ├── gauge-total.wsp
│ ├── gauge-used.wsp
│ └── percent-percent_bytes.wsp
├── iops
│ ├── gauge-read.wsp
│ └── gauge-write.wsp
├── lock_ops.wsp
├── read_write_ops.wsp
├── status.wsp
└── utilization
├── gauge-total.wsp
├── gauge-used.wsp
└── percent-percent_bytes.wsp

Brick - With file operations
Size on disk: 90,203,242 bytes (~86 MB per brick) + (number of devices * 36 MB per device)
├── device
│ └── vda
│ ├── disk_octets
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_ops
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_time
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── mount_utilization
│ │ ├── percent_used.wsp
│ │ ├── total.wsp
│ │ └── used.wsp
│ └── utilization
│ ├── percent_used.wsp
│ ├── total.wsp
│ └── used.wsp
├── entry_ops.wsp
├── fop
│ ├── GETXATTR
│ │ ├── hits.wsp
│ │ ├── latencyAvg.wsp
│ │ ├── latencyMax.wsp
│ │ └── latencyMin.wsp
│ ├── LOOKUP
│ │ ├── hits.wsp
│ │ ├── latencyAvg.wsp
│ │ ├── latencyMax.wsp
│ │ └── latencyMin.wsp
│ ├── OPENDIR
│ │ ├── hits.wsp
│ │ ├── latencyAvg.wsp
│ │ ├── latencyMax.wsp
│ │ └── latencyMin.wsp
│ └── READDIR
│ ├── hits.wsp
│ ├── latencyAvg.wsp
│ ├── latencyMax.wsp
│ └── latencyMin.wsp
├── inode_ops.wsp
├── inode_utilization
│ ├── gauge-total.wsp
│ ├── gauge-used.wsp
│ └── percent-percent_bytes.wsp
├── iops
│ ├── gauge-read.wsp
│ └── gauge-write.wsp
├── lock_ops.wsp
├── read_write_ops.wsp
├── status.wsp
└── utilization
├── gauge-total.wsp
├── gauge-used.wsp
└── percent-percent_bytes.wsp

Node

Size on disk: 401,282,895 bytes (~382 MB per host) + (number of LVM disks * 24 MB per disk) + (number of virtual disks * 30 MB per disk) + (number of bricks * 86 MB per brick) + (number of devices * 36 MB per device)
.
├── aggregation-memory-sum
│ └── memory.wsp
├── aggregation-swap-sum
│ └── swap.wsp
├── brick_count
│ ├── down.wsp
│ ├── total.wsp
│ └── up.wsp
├── bricks
│ ├── |root|bricks|v1
│ │ ├── device
│ │ │ └── vda
│ │ │ ├── disk_octets
│ │ │ │ ├── read.wsp
│ │ │ │ └── write.wsp
│ │ │ ├── disk_ops
│ │ │ │ ├── read.wsp
│ │ │ │ └── write.wsp
│ │ │ ├── disk_time
│ │ │ │ ├── read.wsp
│ │ │ │ └── write.wsp
│ │ │ ├── mount_utilization
│ │ │ │ ├── percent_used.wsp
│ │ │ │ ├── total.wsp
│ │ │ │ └── used.wsp
│ │ │ └── utilization
│ │ │ ├── percent_used.wsp
│ │ │ ├── total.wsp
│ │ │ └── used.wsp
│ │ ├── entry_ops.wsp
│ │ ├── inode_ops.wsp
│ │ ├── inode_utilization
│ │ │ ├── gauge-total.wsp
│ │ │ ├── gauge-used.wsp
│ │ │ └── percent-percent_bytes.wsp
│ │ ├── iops
│ │ │ ├── gauge-read.wsp
│ │ │ └── gauge-write.wsp
│ │ ├── lock_ops.wsp
│ │ ├── read_write_ops.wsp
│ │ ├── status.wsp
│ │ └── utilization
│ │ ├── gauge-total.wsp
│ │ ├── gauge-used.wsp
│ │ └── percent-percent_bytes.wsp
├── cpu
│ ├── percent-idle.wsp
│ ├── percent-interrupt.wsp
│ ├── percent-nice.wsp
│ ├── percent-softirq.wsp
│ ├── percent-steal.wsp
│ ├── percent-system.wsp
│ ├── percent-user.wsp
│ └── percent-wait.wsp
├── df-boot
│ ├── df_complex-free.wsp
│ ├── df_complex-reserved.wsp
│ ├── df_complex-used.wsp
│ ├── df_inodes-free.wsp
│ ├── df_inodes-reserved.wsp
│ ├── df_inodes-used.wsp
│ ├── percent_bytes-free.wsp
│ ├── percent_bytes-reserved.wsp
│ ├── percent_bytes-used.wsp
│ ├── percent_inodes-free.wsp
│ ├── percent_inodes-reserved.wsp
│ └── percent_inodes-used.wsp
├── df-dev
│ ├── df_complex-free.wsp
│ ├── df_complex-reserved.wsp
│ ├── df_complex-used.wsp
│ ├── df_inodes-free.wsp
│ ├── df_inodes-reserved.wsp
│ ├── df_inodes-used.wsp
│ ├── percent_bytes-free.wsp
│ ├── percent_bytes-reserved.wsp
│ ├── percent_bytes-used.wsp
│ ├── percent_inodes-free.wsp
│ ├── percent_inodes-reserved.wsp
│ └── percent_inodes-used.wsp
├── df-dev-shm
│ ├── df_complex-free.wsp
│ ├── df_complex-reserved.wsp
│ ├── df_complex-used.wsp
│ ├── df_inodes-free.wsp
│ ├── df_inodes-reserved.wsp
│ ├── df_inodes-used.wsp
│ ├── percent_bytes-free.wsp
│ ├── percent_bytes-reserved.wsp
│ ├── percent_bytes-used.wsp
│ ├── percent_inodes-free.wsp
│ ├── percent_inodes-reserved.wsp
│ └── percent_inodes-used.wsp
├── df-root
│ ├── df_complex-free.wsp
│ ├── df_complex-reserved.wsp
│ ├── df_complex-used.wsp
│ ├── df_inodes-free.wsp
│ ├── df_inodes-reserved.wsp
│ ├── df_inodes-used.wsp
│ ├── percent_bytes-free.wsp
│ ├── percent_bytes-reserved.wsp
│ ├── percent_bytes-used.wsp
│ ├── percent_inodes-free.wsp
│ ├── percent_inodes-reserved.wsp
│ └── percent_inodes-used.wsp
├── df-run
│ ├── df_complex-free.wsp
│ ├── df_complex-reserved.wsp
│ ├── df_complex-used.wsp
│ ├── df_inodes-free.wsp
│ ├── df_inodes-reserved.wsp
│ ├── df_inodes-used.wsp
│ ├── percent_bytes-free.wsp
│ ├── percent_bytes-reserved.wsp
│ ├── percent_bytes-used.wsp
│ ├── percent_inodes-free.wsp
│ ├── percent_inodes-reserved.wsp
│ └── percent_inodes-used.wsp
├── df-run-user-0
│ ├── df_complex-free.wsp
│ ├── df_complex-reserved.wsp
│ ├── df_complex-used.wsp
│ ├── df_inodes-free.wsp
│ ├── df_inodes-reserved.wsp
│ ├── df_inodes-used.wsp
│ ├── percent_bytes-free.wsp
│ ├── percent_bytes-reserved.wsp
│ ├── percent_bytes-used.wsp
│ ├── percent_inodes-free.wsp
│ ├── percent_inodes-reserved.wsp
│ └── percent_inodes-used.wsp
├── df-sys-fs-cgroup
│ ├── df_complex-free.wsp
│ ├── df_complex-reserved.wsp
│ ├── df_complex-used.wsp
│ ├── df_inodes-free.wsp
│ ├── df_inodes-reserved.wsp
│ ├── df_inodes-used.wsp
│ ├── percent_bytes-free.wsp
│ ├── percent_bytes-reserved.wsp
│ ├── percent_bytes-used.wsp
│ ├── percent_inodes-free.wsp
│ ├── percent_inodes-reserved.wsp
│ └── percent_inodes-used.wsp
├── disk-dm-0
│ ├── disk_io_time
│ │ ├── io_time.wsp
│ │ └── weighted_io_time.wsp
│ ├── disk_octets
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_ops
│ │ ├── read.wsp
│ │ └── write.wsp
│ └── disk_time
│ ├── read.wsp
│ └── write.wsp
├── disk-vda
│ ├── disk_io_time
│ │ ├── io_time.wsp
│ │ └── weighted_io_time.wsp
│ ├── disk_merged
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_octets
│ │ ├── read.wsp
│ │ └── write.wsp
│ ├── disk_ops
│ │ ├── read.wsp
│ │ └── write.wsp
│ └── disk_time
│ ├── read.wsp
│ └── write.wsp
├── interface-eth0
│ ├── if_dropped
│ │ ├── rx.wsp
│ │ └── tx.wsp
│ ├── if_errors
│ │ ├── rx.wsp
│ │ └── tx.wsp
│ ├── if_octets
│ │ ├── rx.wsp
│ │ └── tx.wsp
│ └── if_packets
│ ├── rx.wsp
│ └── tx.wsp
├── memory
│ ├── memory-buffered.wsp
│ ├── memory-cached.wsp
│ ├── memory-free.wsp
│ ├── memory-slab_recl.wsp
│ ├── memory-slab_unrecl.wsp
│ ├── memory-used.wsp
│ ├── percent-buffered.wsp
│ ├── percent-cached.wsp
│ ├── percent-free.wsp
│ ├── percent-slab_recl.wsp
│ ├── percent-slab_unrecl.wsp
│ └── percent-used.wsp
├── ping
│ ├── ping-10_70_42_151.wsp
│ ├── ping_droprate-10_70_42_151.wsp
│ └── ping_stddev-10_70_42_151.wsp
├── status.wsp
└── swap
├── percent-cached.wsp
├── percent-free.wsp
├── percent-used.wsp
├── swap-cached.wsp
├── swap-free.wsp
├── swap_io-in.wsp
├── swap_io-out.wsp
└── swap-used.wsp

Single cluster (approximate utilization of a cluster)

Size on disk: 49,767,242 bytes (~48 MB per cluster) + (number of hosts * ~382 MB per host) + (number of LVM disks * 24 MB per disk) + (number of virtual disks * 30 MB per disk) + (number of bricks * 86 MB per brick) + (number of devices * 36 MB per device) + (number of volumes * ~44.5 MB per volume) + (number of hosts * 12 MB per host) + (number of bricks * 98 MB per brick) + (number of devices * 36 MB per device)
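
For reference, here is a minimal sketch that encodes the single-cluster formula above directly (the function name and example inputs are illustrative only; the per-object figures are the rounded MB values measured in this issue):

```python
# Encodes the single-cluster formula above. All figures are MB for the
# default 60s:180d retention; bricks and devices are counted once per tree,
# exactly as in the formula (hence the two separate brick/device terms).

def cluster_graphite_usage_mb(hosts, lvm_disks, virtual_disks,
                              bricks, devices, volumes):
    cluster_base = 48
    node_tree = (hosts * 382 + lvm_disks * 24 + virtual_disks * 30
                 + bricks * 86 + devices * 36)
    volume_tree = (volumes * 44.5 + hosts * 12
                   + bricks * 98 + devices * 36)
    return cluster_base + node_tree + volume_tree

# Hypothetical example: 6 hosts, one LVM and one virtual disk per host,
# 2 bricks and 1 device per host, 3 volumes.
print(cluster_graphite_usage_mb(hosts=6, lvm_disks=6, virtual_disks=6,
                                bricks=12, devices=6, volumes=3))
```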

@cloudbehl (Member Author)

@r0h4n @brainfunked @shtripat Please review.
@Tendrl/tendrl-qe Please verify that the disk usage for a single cluster is approximately the value calculated via this formula. Please let me know if you require any help.

@julienlim (Member) commented Feb 21, 2018

@ltrilety @r0h4n @nthomas-redhat @Tendrl/qe @jjkabrown1

Have we done any Graphite disk usage estimates for the following sizes? (I tried to come up with three "sizes" based on some typical deployments.)

Small

  • 4-6 node cluster
  • 2-4 volumes in the cluster
  • 2-4 bricks per node

Medium

  • 8-12 node cluster
  • 4-6 volumes in the cluster
  • 6-8 bricks per node

Large

  • 24-36 node cluster
  • 6-8 volumes in the cluster
  • 12-36 bricks per node

In Tendrl/commons#819 @ltrilety mentioned "1 day of metrics for 6 gluster servers [Small] takes about 10G."

@nthomas-redhat (Contributor)

@julienlim, the formula for the calculation is already provided in #261 (comment), which is as follows:

Size on disk: 49,767,242 bytes (~48 MB per cluster) + (number of hosts * ~382 MB per host) + (number of LVM disks * 24 MB per disk) + (number of virtual disks * 30 MB per disk) + (number of bricks * 86 MB per brick) + (number of devices * 36 MB per device) + (number of volumes * ~44.5 MB per volume) + (number of hosts * 12 MB per host) + (number of bricks * 98 MB per brick) + (number of devices * 36 MB per device)

The size may vary depending on the number of disks, LVMs, etc., so I would recommend calculating this on a per-deployment basis. What do you think?

@ltrilety

@nthomas-redhat @julienlim @shtripat It's great that we have a formula for the 180-day period, but from what I see we should simplify it, as it's not easy to read. We could use the deployments in #261 (comment) and provide some numbers.
Of course, first we have to decide how long we keep this metrics data. Moreover, don't forget to account for the free space needed for un-managing.
One more thing: those typical deployments look a little strange to me; for example, medium ends at 12 nodes but large begins at 24, so there's a question of where a 20-node cluster belongs, and so on. Anyway, the idea is there.

@julienlim (Member) commented Feb 22, 2018

@nthomas-redhat @shtripat @ltrilety @jjkabrown1

It's good to have a formula for the 180-day period, but we'll need to adjust it according to the retention policies.

That being said, this formula is too cumbersome for someone to calculate by hand. We need to provide an easy-to-use calculator (think Ceph's pgcalc or some kind of spreadsheet) where the user can input some numbers (e.g. number of nodes in the cluster, number of clusters, number of volumes, number of bricks, and how long to retain data) and it provides an estimate.

@ltrilety As to the deployment sizes, I took a first stab at coming up with something, and it does need further discussion and tweaking. Suggestions?
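
Along the lines of the calculator proposed above, here is a minimal sketch (hypothetical names and inputs, not an existing Tendrl tool): it combines the rounded 180-day per-object figures from this issue across both Graphite trees and scales them linearly with the retention period, since Whisper file size grows linearly with the number of retained data points. The LVM and virtual-disk terms are left out for brevity.

```python
# Hypothetical sizing calculator: per-object figures (MB) are the rounded
# 180-day measurements from this issue, combined across both Graphite trees
# and scaled linearly with the requested retention period.
# (LVM and virtual-disk terms are omitted here for brevity.)

PER_OBJECT_MB_180D = {
    "cluster": 48,       # per-cluster metrics
    "host": 382 + 12,    # node tree + per-host entry in the volume tree
    "brick": 86 + 98,    # brick in the node tree + brick in the volume tree
    "device": 36 + 36,   # block device appears in both trees
    "volume": 44.5,
}

def estimate_gb(clusters, hosts, bricks, devices, volumes, retention_days=180):
    counts = {"cluster": clusters, "host": hosts, "brick": bricks,
              "device": devices, "volume": volumes}
    mb = sum(PER_OBJECT_MB_180D[key] * count for key, count in counts.items())
    return mb * (retention_days / 180) / 1024  # MB -> GB

# Hypothetical "Small" deployment: 1 cluster, 6 nodes, 3 volumes,
# 2 bricks and 1 device per node, default 180-day retention.
print(round(estimate_gb(clusters=1, hosts=6, bricks=12, devices=6, volumes=3), 1))
```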

@nthomas-redhat (Contributor)

@cloudbehl @r0h4n, let us sync up and put together guidelines for possible standard configurations.

@cloudbehl (Member Author)

@nthomas-redhat ack!

@r0h4n (Contributor) commented Apr 9, 2018

@nthomas-redhat Please provide the change in Graphite disk size requirements for the scenarios below:

  • User adds a storage node and manages it under Tendrl
  • User adds a brick, volume, or object under a gluster cluster that is managed by Tendrl

@ltrilety

A note for the assessment: don't forget that un-manage affects the sizing, as it takes all the data from Graphite and saves it under the /usr/share/tendrl/graphite/archive path. That raises several questions:

  1. How many un-manage archives will be allowed? Each one means another multiple of the current size figure.
  2. Does this mean we will also require some space on the main disk? We require a dedicated disk for graphite/carbon, but the archive, as it stands, is located on the "main" disk.

@nthomas-redhat (Contributor)

For standard cluster sizes, please see below:

Small Configuration

Up to 8 nodes
6-8 volumes per cluster
Number of bricks: 2-3 per node for replicated volumes with RAID 6, and 12-36 per node for EC volumes

Recommendation:
200 GB of free space per cluster for this configuration

Medium Configuration

9-16 nodes
6-8 volumes per cluster
Number of bricks: 2-3 per node for replicated volumes with RAID 6, and 12-36 per node for EC volumes

Recommendation:
350 GB of free space per cluster for this configuration

Large Configuration

17-24 nodes
6-8 volumes per cluster
Number of bricks: 2-3 per node for replicated volumes with RAID 6, and 12-36 per node for EC volumes

Recommendation:
500 GB of free space per cluster for this configuration
