This is the current standing proposal for implementing a cache for dysk.
Terms
Upstream: Azure Storage/Page Blob.
Ephemeral disk: an SSD directly attached to the VM (from the host). I/O to this device is not network I/O; rather, it goes through the host hypervisor to the host disk.
Motivation
The cost of a network call to Azure Storage is extremely high. While elevators/schedulers and the like provide some much-needed relief by merging multiple i/o requests, they don't eliminate the need for an upstream call for every i/o. Tools such as the page cache and the vfs/inode cache provide additional relief for short-cycled repeatable reads, but they don't help much with completely random-access disks.
In order to provide higher i/o performance than what upstream provides, we need a local cache solution that offers a larger cache pool on a medium that does not require upstream calls. On Azure VMs this medium is the ephemeral directly attached SSD (also referred to as the resource disk).
The original idea was to leave dm-cache configuration + setup to the user, but that turned out to be too much of a burden on the user, specifically when disks are scheduled via Kubernetes.
Below are the options for implementing such a cache.
In all proposals the user experience will remain the same. The user is expected to use dyskctl to interact with the dysk module for dysk management.
option 1: lvm + dm-cache
dm-cache is an old and battle-tested feature of the kernel, available since the 3.x versions. While having lvm as a front end is not as old as dm-cache itself, lvm is quite stable and supports the needed scenarios.
Scenario:
A dyskctl command to convert the ssd, or a partition of it, into a cache pool.
Additional dyskctl arguments on mount to enable cache.
If the user enables cache without performing step #1, we transparently enable the ssd as a cache pool.
The advantage of this approach is that we offload a lot of complexity to the battle-tested implementation of dm-cache. The cli implementation will become a bit more complex because of call-outs to lv* commands via exec, but I believe they are within manageable complexity.
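As a rough illustration only (not a specification of dyskctl's actual behavior), the call-out sequence could look like the C sketch below. The device paths, VG/LV names and sizes are assumptions made up for the example; the real values would come from dyskctl's arguments and the discovered ephemeral disk.

```c
/*
 * Hypothetical sketch of the lv* call-outs step #1 could perform to turn an
 * ssd partition into a dm-cache pool fronting a dysk. All paths, names and
 * sizes below are illustrative assumptions.
 */
#include <stdio.h>
#include <stdlib.h>

/* run one lvm command via the shell and bail out on failure */
static void run(const char *cmd)
{
	fprintf(stderr, "exec: %s\n", cmd);
	if (system(cmd) != 0) {
		fprintf(stderr, "failed: %s\n", cmd);
		exit(1);
	}
}

int main(void)
{
	/* both devices become PVs in one VG so dm-cache can pair them */
	run("pvcreate /dev/dysk0 /dev/sdb1");
	run("vgcreate dyskvg /dev/dysk0 /dev/sdb1");

	/* origin LV on the dysk PV, cache pool on the ssd PV
	 * (sizes are placeholders; real sizes come from the devices) */
	run("lvcreate -n origin -L 100G dyskvg /dev/dysk0");
	run("lvcreate --type cache-pool -n ssdpool -L 30G dyskvg /dev/sdb1");

	/* attach the pool; /dev/dyskvg/origin is what the user mounts */
	run("lvconvert --type cache --cachepool dyskvg/ssdpool dyskvg/origin");
	return 0;
}
```

In dyskctl these would be exec calls with proper error handling and cleanup rather than shell strings, but the lvm command flow would be roughly the same.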
The disadvantages/complexities of this are:
Finding the ephemeral disk on VMs: the ephemeral disk is auto-mounted at a different location and named differently on each distro (see the sketch after this list).
The mount and unmount operations of dyskctl will need to be able to map pools<->dysks without maintaining any state, a principle currently followed by the entire stack.
Inability to implement read-ahead for faster sequential reads, which would improve multiple workloads such as index scans and log reads.
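To make the first disadvantage concrete, a naive probe for the resource disk could look like the sketch below. The candidate paths are assumptions (udev symlinks created by the Azure agent or cloud-init on some images); the real list varies per distro and agent version, which is exactly the complexity this point is about.

```c
/* Illustrative only: probe a few locations where Azure tooling is known,
 * on some images, to expose the ephemeral (resource) disk. */
#include <stdio.h>
#include <unistd.h>

static const char *candidates[] = {
	"/dev/disk/azure/resource",        /* udev symlink on some images (assumed) */
	"/dev/disk/cloud/azure_resource",  /* cloud-init udev symlink (assumed)     */
	NULL,
};

/* return the first candidate path that exists, or NULL if none do */
static const char *find_resource_disk(void)
{
	for (int i = 0; candidates[i]; i++)
		if (access(candidates[i], F_OK) == 0)
			return candidates[i];
	return NULL;
}

int main(void)
{
	const char *dev = find_resource_disk();
	printf("resource disk: %s\n", dev ? dev : "not found");
	return dev ? 0 : 1;
}
```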
option 2: custom implementation
The idea is that the dysk kernel module will implement the cache as follows (a rough sketch follows the list):
All read I/Os will be aligned to PAGE_SIZE (4K on x86-64).
All read I/Os will be the original aligned size + PAGE_SIZE to enable read-ahead.
The cache implementation delegates all the work to the current worker pattern implemented in dysk.
The cache implementation sits on top of an in-memory rbtree; each node has a reference to a page allocated via alloc_page.
Hot cache consists of pages that are new and/or frequently accessed. The hot cache is kept in memory (100M for a small cache, 250M for a large cache; the option is selected by the user via the cli).
Cache compaction happens at two levels:
Level 1: demote infrequently accessed pages to the on-disk cold cache when the hot cache becomes too big.
Level 2: delete pages from the disk when the cold cache becomes too big.
Pages can be promoted from cold cache to hot cache when:
An incoming write operation targets them (promote, wait for the write to complete, leave in hot cache).
They become increasingly accessed.
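A minimal, hypothetical sketch of the pieces named above (kernel-style C): the rbtree node holding a page from alloc_page, and the read alignment with one page of read-ahead. Names are made up, and locking, the read-path lookup and the cold-cache side are omitted; this is not the actual dysk code.

```c
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* one cached page of the dysk, keyed by its page index on the disk */
struct dysk_cache_node {
	struct rb_node  rb;     /* lives in the per-dysk rbtree          */
	pgoff_t         index;  /* page index within the dysk            */
	struct page    *page;   /* backing memory from alloc_page()      */
	unsigned long   hits;   /* access count, drives promote/demote   */
	bool            cold;   /* true once demoted to the on-disk file */
};

/* expand a read to PAGE_SIZE boundaries and add one page of read-ahead */
static void dysk_align_read(u64 off, u32 len, u64 *a_off, u32 *a_len)
{
	u64 start = round_down(off, PAGE_SIZE);
	u64 end   = round_up(off + len, PAGE_SIZE) + PAGE_SIZE; /* +1 page */

	*a_off = start;
	*a_len = (u32)(end - start);
}

/* find-or-insert the node for @index; returns NULL on allocation failure */
static struct dysk_cache_node *dysk_cache_get(struct rb_root *root, pgoff_t index)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;
	struct dysk_cache_node *n;

	while (*link) {
		n = rb_entry(*link, struct dysk_cache_node, rb);
		parent = *link;
		if (index < n->index)
			link = &(*link)->rb_left;
		else if (index > n->index)
			link = &(*link)->rb_right;
		else {
			n->hits++;      /* already cached: bump its frequency */
			return n;
		}
	}

	n = kzalloc(sizeof(*n), GFP_KERNEL);
	if (!n)
		return NULL;
	n->index = index;
	n->page  = alloc_page(GFP_KERNEL);
	if (!n->page) {
		kfree(n);
		return NULL;
	}
	rb_link_node(&n->rb, parent, link);
	rb_insert_color(&n->rb, root);
	return n;
}
```

Compaction level 1 would then walk this tree demoting the lowest-hit nodes to the cold-cache file when the in-memory total crosses the 100M/250M limit, and level 2 would trim the cold-cache file itself.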
Option 2 has the same cli user experience as option 1.
disadvantages:
Increased complexity in dysk implementation.
File i/o in kernel space, which is usually considered a "do not do" in the kernel (see the sketch after this list).
Finding a way to do memory-mapped file i/o in kernel space, which does not look possible based on initial analysis.
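To make the file-i/o concern concrete, demoting one hot page into a cold-cache file on the ephemeral disk could look roughly like the sketch below (hypothetical; the path, the kernel_write/kmap pattern and the missing error handling are assumptions for illustration, not dysk code):

```c
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/mm.h>

/* write one cached page into the cold-cache file at its page-aligned offset */
static int dysk_demote_page(struct file *cold_file, struct page *page, pgoff_t index)
{
	loff_t pos = (loff_t)index << PAGE_SHIFT;
	void *kaddr = kmap(page);   /* get a kernel address for the page's memory */
	ssize_t written;

	written = kernel_write(cold_file, kaddr, PAGE_SIZE, &pos);
	kunmap(page);

	return written == PAGE_SIZE ? 0 : -EIO;
}

/* the cold-cache file would be opened once, e.g. on the resource disk;
 * the path is an assumption for illustration */
static struct file *dysk_open_cold_cache(const char *path)
{
	return filp_open(path, O_RDWR | O_CREAT | O_LARGEFILE, 0600);
}
```

filp_open returns an ERR_PTR on failure, and this write would have to be pushed through the same worker pattern dysk already uses so it never runs in the i/o completion path; that offloading is part of the increased complexity noted above.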
Advantages
The cli is simpler, since all the work is done in kernel space (although the general preference is to move complexity to userspace, not into the kernel).
Flexibility, as i/o read-ahead(s) can be customized for particular workloads later.