cache proposal #30

Open
khenidak opened this issue Mar 11, 2018 · 0 comments
This is the current standing proposal for implementing a cache for dysk.

Terms

  • Upstream: Azure Storage/Page Blob.
  • Ephemeral disk: a directly attached ssd on the VM (from the host). i/o to this device is not network i/o; it goes through the host hypervisor to the host disk.

Motivation

The cost of a network call to Azure Storage is extremely high. While elevators/schedulers and the like provide some much needed relief by stitching together multiple i/o requests, they don't eliminate the need for an upstream call for every i/o. Tools such as the page cache and vfs/inode cache provide additional relief for short-cycled, repeatable reads, but they don't provide much help for completely random-access disks.

In order to provide higher i/o than what is provided by upstream, we need a local cache solution that provides a larger cache pool on a medium that does not require upstream calls. On Azure VMs this medium is the ephemeral directly attached ssd (also referred to as the resource disk).

The original idea was to leave dm-cache configuration + setup to the user, but that turned out to be too much of a burden on the user, especially when disks are scheduled via Kubernetes.

Below are the options for implementing such a cache.

In all proposals the user experience will remain the same: the user is expected to use dyskctl to interact with the dysk module for dysk management.

Option 1: lvm + dm-cache

dm-cache is an old and battle-tested feature of the kernel, available since the 3.x versions. While having lvm as a front end is not as old as dm-cache itself, lvm is quite stable and will support the needed scenarios.

Scenario:

  1. A dyskctl command to convert the ssd, or a partition of the ssd, into a cache pool.
  2. Additional dyskctl arguments for mount to enable cache.
  3. If the user enables cache without step 1, we transparently enable the ssd as a cache pool.

We will need to find a solution for multiple volumes using the [same cache pool](https://www.redhat.com/archives/dm-devel/2013-July/msg00113.html).

The advantage of this approach is that we offload a lot of complexity to the battle-tested implementation of dm-cache. The cli implementation will become a bit more complex because of the callouts to lv* commands via exec, but I believe they are within manageable complexity.
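
To make the exec callouts concrete, here is a minimal sketch of what the lv* sequence could look like, assuming the ephemeral ssd partition is /dev/sdb1 and the dysk-backed data LV already exists as dyskvg/dyskdata. All names, sizes, and flag choices are illustrative only; the real dyskctl would also need error handling and idempotency checks.

```c
/* Hypothetical sketch: shelling out to lv* tools to turn the ephemeral ssd
 * partition into a cache pool for a dysk-backed LV.  Device/VG/LV names are
 * placeholders.  dm-cache requires the cache pool and the origin LV to live
 * in the same VG, which is why the ssd is added to dyskvg first. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command via fork/execvp and return its exit status. */
static int run(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
    /* 1. Initialize the ssd partition as a physical volume. */
    char *pvcreate[]  = { "pvcreate", "/dev/sdb1", NULL };
    /* 2. Add it to the VG that already holds the dysk data LV. */
    char *vgextend[]  = { "vgextend", "dyskvg", "/dev/sdb1", NULL };
    /* 3. Create a cache-pool LV on the ssd. */
    char *lvcreate[]  = { "lvcreate", "--type", "cache-pool", "-L", "10G",
                          "-n", "dyskcache", "dyskvg", "/dev/sdb1", NULL };
    /* 4. Attach the cache pool to the dysk-backed origin LV. */
    char *lvconvert[] = { "lvconvert", "-y", "--type", "cache",
                          "--cachepool", "dyskvg/dyskcache",
                          "dyskvg/dyskdata", NULL };

    if (run(pvcreate) != 0 || run(vgextend) != 0 ||
        run(lvcreate) != 0 || run(lvconvert) != 0) {
        fprintf(stderr, "cache pool setup failed\n");
        return 1;
    }
    return 0;
}
```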

The disadvantages/complexities of this are:

  1. Finding the ephemeral disk on VMs: the ephemeral disk is auto-mounted at a different location and named differently on each distro (see the sketch after this list).
  2. The mount and unmount operations of dyskctl will need to be able to find pool<->dysk mappings without maintaining any state, a principle currently used by the entire stack.
  3. Inability to implement read-ahead allowing faster sequential reads, which improves multiple workloads such as index scans and log reads.
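
For point 1, a rough user-space sketch of one possible detection approach, assuming the resource disk can be recognized by the DATALOSS_WARNING_README.txt marker file Azure places on it; the marker name and how reliable this is across distros/images would need to be verified.

```c
/* Sketch: locate the Azure ephemeral/resource disk by scanning /proc/mounts
 * and looking for the DATALOSS_WARNING_README.txt marker file.  No mount
 * point is hardcoded since it differs per distro. */
#include <limits.h>
#include <mntent.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 0 and fills 'dev' with the backing device on success, -1 otherwise. */
static int find_resource_disk(char *dev, size_t devlen)
{
    FILE *mounts = setmntent("/proc/mounts", "r");
    if (!mounts)
        return -1;

    struct mntent *ent;
    int found = -1;
    while ((ent = getmntent(mounts)) != NULL) {
        char marker[PATH_MAX];
        snprintf(marker, sizeof(marker), "%s/DATALOSS_WARNING_README.txt",
                 ent->mnt_dir);
        if (access(marker, F_OK) == 0) {
            snprintf(dev, devlen, "%s", ent->mnt_fsname);
            found = 0;
            break;
        }
    }
    endmntent(mounts);
    return found;
}

int main(void)
{
    char dev[256];
    if (find_resource_disk(dev, sizeof(dev)) == 0)
        printf("ephemeral disk device: %s\n", dev);
    else
        fprintf(stderr, "ephemeral disk not found\n");
    return 0;
}
```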

Option 2: custom implementation

The idea is that the dysk kernel module will implement the cache as follows (see the sketch after this list):

  1. All read I/Os will be aligned to PAGE_SIZE (4K on x86-64).
  2. All read I/Os will be the original aligned size + PAGE_SIZE to enable read-ahead.
  3. The cache implementation delegates all the work to the current worker pattern implemented in dysk.
  4. The cache implementation sits on top of an in-memory rbtree; each node has a reference to a page allocated via alloc_page.
  5. Hot cache consists of pages that are new and/or frequently accessed. The hot cache is kept in memory (100M for a small cache, 250M for a large cache; the option is selected by the user via the cli).
  6. Cache compaction has two levels:
  • Level 1: demote infrequently accessed pages to the on-disk cold cache when the hot cache becomes too big.
  • Level 2: delete pages from the disk when the cold cache becomes too big.
  7. Pages can be promoted from cold cache to hot cache on:
  • An incoming write operation (promote, wait for the write to complete, leave in hot cache).
  • Becoming increasingly frequently accessed.
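
For illustration, a rough kernel-side sketch of the structures described above: a hot-cache rbtree keyed by page index, one alloc_page'd page per node, an access counter for compaction decisions, and PAGE_SIZE alignment/read-ahead helpers. All identifiers (dysk_cache, dysk_cache_node, etc.) are hypothetical, and locking, the worker integration, cold-cache file backing, and eviction are omitted.

```c
/* Illustrative only: an rbtree-based hot cache keyed by page-aligned offset,
 * one alloc_page'd page per node, with a hit counter used by compaction to
 * decide demotion to the cold cache.  Not the actual dysk implementation. */
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/types.h>

struct dysk_cache_node {
	struct rb_node rb;        /* linkage into the hot-cache tree         */
	u64 index;                /* byte offset / PAGE_SIZE                 */
	struct page *page;        /* data page, allocated via alloc_page()   */
	unsigned int hits;        /* access frequency, drives demotion       */
};

struct dysk_cache {
	struct rb_root hot;       /* in-memory hot cache                     */
	size_t hot_bytes;         /* current size, capped at 100M or 250M    */
};

/* Read alignment: round the request to PAGE_SIZE boundaries and extend by
 * one extra page to get the read-ahead described in the proposal. */
static inline u64 dysk_align_start(u64 offset)
{
	return round_down(offset, PAGE_SIZE);
}

static inline u64 dysk_align_len(u64 offset, u64 len)
{
	return round_up(offset + len, PAGE_SIZE) - dysk_align_start(offset)
	       + PAGE_SIZE; /* +1 page of read-ahead */
}

static struct dysk_cache_node *dysk_cache_lookup(struct dysk_cache *c, u64 index)
{
	struct rb_node *n = c->hot.rb_node;

	while (n) {
		struct dysk_cache_node *node =
			rb_entry(n, struct dysk_cache_node, rb);

		if (index < node->index)
			n = n->rb_left;
		else if (index > node->index)
			n = n->rb_right;
		else {
			node->hits++; /* keep frequently used pages hot */
			return node;
		}
	}
	return NULL;
}

/* Caller is expected to lookup() first; duplicate indices are not handled. */
static int dysk_cache_insert(struct dysk_cache *c, u64 index)
{
	struct rb_node **link = &c->hot.rb_node, *parent = NULL;
	struct dysk_cache_node *node;

	node = kzalloc(sizeof(*node), GFP_KERNEL);
	if (!node)
		return -ENOMEM;
	node->index = index;
	node->page = alloc_page(GFP_KERNEL);
	if (!node->page) {
		kfree(node);
		return -ENOMEM;
	}

	while (*link) {
		struct dysk_cache_node *cur =
			rb_entry(*link, struct dysk_cache_node, rb);

		parent = *link;
		link = (index < cur->index) ? &(*link)->rb_left
					    : &(*link)->rb_right;
	}
	rb_link_node(&node->rb, parent, link);
	rb_insert_color(&node->rb, &c->hot);

	/* Compaction would trigger when this exceeds the user-selected size. */
	c->hot_bytes += PAGE_SIZE;
	return 0;
}
```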

Option 2 has the same cli user experience as option 1.

Disadvantages:

  1. Increased complexity in the dysk implementation.
  2. File i/o in the kernel, which is generally something you should not do in the kernel.
  3. Finding a way to do memory-mapped files in kernel space, which does not look possible based on initial analysis.

Advantages

  1. The cli is simpler; all the work is done in kernel space (although the general preference is to move complexity to userspace, not to kernel space).
  2. Flexibility, as i/o read-ahead(s) can later be customized per workload.