From ec78d987d4aa09c625f45cf743aad48787704695 Mon Sep 17 00:00:00 2001
From: Swatid
Date: Fri, 29 Jul 2022 17:31:09 +0530
Subject: [PATCH 1/3] Uploading markdown files to doc folder

Signed-off-by: Swatid
---
 doc/HLD-OF-Motr-LNet-Transport.md    | 444 +++++++++++++++++
 doc/HLD-Version-Numbers.md           |   4 +-
 doc/HLD-of-FDMI.md                   | 464 +++++++++++++++++
 doc/HLD-of-FOL.md                    | 170 +++++++
 doc/HLD-of-Metadata-Backend.md       |  66 +++
 doc/HLD-of-Motr-Caching.md           | 276 +++++++++++
 doc/HLD-of-Motr-Lostore.md           | 115 +++++
 doc/HLD-of-Motr-Network-Benchmark.md | 163 ++++++
 doc/HLD-of-Motr-Object-Index.md      | 192 +++++++
 doc/HLD-of-Motr-Spiel-API.md         | 715 +++++++++++++++++++++++++++
 doc/HLD-of-SNS-Repair.md             | 704 ++++++++++++++++++++++++++
 doc/HLD-of-SNS-client.md             | 185 +++++++
 doc/HLD-of-distributed-indexing.md   | 190 +++++++
 doc/ISC-Service-User-Guide           | 358 ++++++++++++++
 doc/Motr-Epochs-HLD.md               |  87 ++++
 15 files changed, 4131 insertions(+), 2 deletions(-)
 create mode 100644 doc/HLD-OF-Motr-LNet-Transport.md
 create mode 100644 doc/HLD-of-FDMI.md
 create mode 100644 doc/HLD-of-FOL.md
 create mode 100644 doc/HLD-of-Metadata-Backend.md
 create mode 100644 doc/HLD-of-Motr-Caching.md
 create mode 100644 doc/HLD-of-Motr-Lostore.md
 create mode 100644 doc/HLD-of-Motr-Network-Benchmark.md
 create mode 100644 doc/HLD-of-Motr-Object-Index.md
 create mode 100644 doc/HLD-of-Motr-Spiel-API.md
 create mode 100644 doc/HLD-of-SNS-Repair.md
 create mode 100644 doc/HLD-of-SNS-client.md
 create mode 100644 doc/HLD-of-distributed-indexing.md
 create mode 100644 doc/ISC-Service-User-Guide
 create mode 100644 doc/Motr-Epochs-HLD.md

diff --git a/doc/HLD-OF-Motr-LNet-Transport.md b/doc/HLD-OF-Motr-LNet-Transport.md
new file mode 100644
index 00000000000..ee407defc9c
--- /dev/null
+++ b/doc/HLD-OF-Motr-LNet-Transport.md
@@ -0,0 +1,444 @@

# HLD OF Motr LNet Transport
This document presents a high level design (HLD) of the Motr LNet Transport. The main purposes of this document are:
1. to be inspected by Motr architects and peer designers to ascertain that the high level design is aligned with Motr architecture and other designs, and contains no defects,
2. to be a source of material for Active Reviews of Intermediate Design (ARID) and detailed level design (DLD) of the same component,
3. to serve as a design reference document.

## Introduction
The scope of this HLD includes the net.lnet-user and net.lnet-kernel tasks described in [1]. Portions of the design are influenced by [4].

## Definitions
* **Network Buffer**: This term is used to refer to a struct m0_net_buffer. The word “buffer”, if used by itself, will be qualified by its context - it may not always refer to a network buffer.
* **Network Buffer Vector**: This term is used to refer to the struct m0_bufvec that is embedded in a network buffer. The related term, “I/O vector”, if used, will be qualified by its context - it may not always refer to a network buffer vector.
* **Event queue, EQ**: A data structure used to receive LNet events. Associated with an MD.
* **LNet**: The Lustre Networking module. It implements version 3.2 of the Portals Message Passing Interface, and provides access to a number of different transport protocols including InfiniBand and TCP over Ethernet.
* **LNet address**: This is composed of (NID, PID, Portal Number, Match bits, offset).
The NID specifies a network interface end point on a host, the PID identifies a process on that host, the Portal Number identifies an opening in the address space of that process, the Match Bits identify a memory region in that opening, and the offset identifies a relative position in that memory region.
* **LNet API**: The LNet Application Programming Interface. This API is provided in the kernel and implicitly defines a PID with value LUSTRE_SRV_LNET_PID, representing the kernel. Also see ULA.
* **LNetNetworkIdentifierString**: The external string representation of an LNet Network Identifier (NID). It is typically expressed as a string of the form “Address@InterfaceType[Number]”, where the number is used if there are multiple instances of the type, or if there are plans to configure more than one interface in the future, e.g. “10.67.75.100@o2ib0”.
* **Match bits**: An unsigned 64 bit integer used to identify a memory region within the address space defined by a Portal. Every local memory region that will be remotely accessed through a Portal must either be matched exactly by the remote request, or wildcard matched after masking off specific bits specified on the local side when configuring the memory region.
* **Memory Descriptor, MD**: An LNet data structure identifying a memory region and an EQ.
* **Match Entry, ME**: An LNet data structure identifying an MD and a set of match criteria, including Match bits. Associated with a portal.
* **NID, lnet_nid_t**: Network Identifier portion of an LNet address, identifying a network end point. There can be multiple NIDs defined for a host, one per network interface on the host that is used by LNet. A NID is represented internally by an unsigned 64 bit integer, with the upper 32 bits identifying the network and the lower 32 bits the address. The network portion itself is composed of an upper 16 bit network interface type, with the lower 16 bits identifying an instance of that type. See LNetNetworkIdentifierString for the external representation of a NID.
* **PID, lnet_pid_t**: Process identifier portion of an LNet address. This is represented internally by a 32 bit unsigned integer. LNet assigns the kernel a PID of LUSTRE_SRV_LNET_PID (12345) when the module gets configured. This should not be confused with the operating system process identifier space, which is unrelated.
* **Portal Number**: This is an unsigned integer that identifies an opening in a process address space. The process can associate multiple memory regions with the portal, each region identified by a unique set of Match bits. LNet allows up to MAX_PORTALS portals per process (64 with the Lustre 2.0 release).
* **Portals Message Passing Interface**: An RDMA based specification that supports direct access to application memory. LNet adheres to version 3.2 of the specification.
* **RDMA**: Remote Direct Memory Access, a capability that allows data to be transferred directly to or from the memory of a remote host without involving its processor.
* **ULA**: A port of the LNet API to user space, that communicates with LNet in the kernel using a private device driver. User space processes still share the same portal number space with the kernel, though their PIDs can be different. Event processing using the user space library is relatively expensive compared to direct kernel use of the LNet API, as an ioctl call is required to transfer each LNet event to user space. The user space LNet library is protected by the GNU Public License. ULA makes modifications to the LNet module in the kernel that have not yet, at the time of Lustre 2.0, been merged into the mainstream Lustre source repository. The changes are fully compatible with existing usage.
The ULA code is currently in a Motr repository module.
* **LNet Transport End Point Address**: The design defines an LNet transport end point address to be a 4-tuple string in the format “LNetNetworkIdentifierString : PID : PortalNumber : TransferMachineIdentifier”. The TransferMachineIdentifier serves to distinguish between transfer machines sharing the same NID, PID and PortalNumber. The LNet Transport End Point Addresses concurrently in use on a host are distinct.
* **Mapped memory page**: A memory page (struct page) that has been pinned in memory using the get_user_pages subroutine.
* **Receive Network Buffer Pool**: This is a pool of network buffers, shared between several transfer machines. This common pool reduces the fragmentation of the cache of receive buffers in a network domain that would arise were each transfer machine to be individually provisioned with receive buffers. The actual staging and management of network buffers in the pool is provided through the [r.m0.net.network-buffer-pool] dependency.
* **Transfer Machine Identifier**: This is an unsigned integer that is a component of the end point address of a transfer machine. The number identifies a unique instance of a transfer machine in the set of addresses that use the same 3-tuple of NID, PID and Portal Number. The transfer machine identifier is related to a portion of the Match bits address space in an LNet address - i.e. it is used in the ME associated with the receive queue of the transfer machine.

Refer to [3], [5] and to net/net.h in the Motr source tree for additional terms and definitions.

## Requirements
* [r.m0.net.rdma] Remote DMA is supported. [2]
* [r.m0.net.ib] Infiniband is supported. [2]
* [r.m0.net.xprt.lnet.kernel] Create an LNet transport in the kernel. [1]
* [r.m0.net.xprt.lnet.user] Create an LNet transport for user space. [1]
* [r.m0.net.xprt.lnet.user.multi-process] Multiple user space processes can concurrently use the LNet transport. [1]
* [r.m0.net.xprt.lnet.user.no-gpl] Do not get tainted with the use of GPL interfaces in the user space implementation. [1]
* [r.m0.net.xprt.lnet.user.min-syscalls] Minimize the number of system calls required by the user space transport. [1]
* [r.m0.net.xprt.lnet.min-buffer-vm-setup] Minimize the amount of virtual memory setup required for network buffers in the user space transport. [1]
* [r.m0.net.xprt.lnet.processor-affinity] Provide optimizations based on processor affinity.
* [r.m0.net.buffer-event-delivery-control] Provide control over the detection and delivery of network buffer events.
* [r.m0.net.xprt.lnet.buffer-registration] Provide support for hardware optimization through buffer pre-registration.
* [r.m0.net.xprt.auto-provisioned-receive-buffer-pool] Provide support for a pool of network buffers from which transfer machines can automatically be provisioned with receive buffers. Multiple transfer machines can share the same pool, but each transfer machine is only associated with a single pool. There can be multiple pools in a network domain, but a pool cannot span multiple network domains.

## Design Highlights
The following figure shows the components of the proposed design and the usage relationships between it and other related components:

![image](./Images/LNET.PNG)

* The design provides an LNet based transport for the Motr Network Layer that co-exists with the concurrent use of LNet by Lustre. In the figure, the transport is labelled m0_lnet_u in user space and m0_lnet_k in the kernel.
* The user space transport does not use ULA, to avoid GPL tainting. Instead it uses a proprietary device driver, labelled m0_lnet_dd in the figure, to communicate with the kernel transport module through private interfaces.
* Each transfer machine is assigned an end point address that directly identifies the NID, PID and Portal Number portion of an LNet address, and a transfer machine identifier. The design will support multiple transfer machines for a given 3-tuple of NID, PID and Portal Number. It is the responsibility of higher level software to make network address assignments to Motr components such as servers and command line utilities, and to determine how clients are provided these addresses.
* The design provides transport independent support to automatically provision the receive queues of transfer machines on demand, from pools of unused, registered, network buffers. This results in greater utilization of receive buffers, as fragmentation of the available buffer space is reduced by delaying the commitment of attaching a buffer to specific transfer machines.
* The design supports the reception of multiple messages into a single network buffer. Events will be delivered for each message serially.
* The design addresses the overhead of communication between user space and kernel space. In particular, shared memory is used as much as possible, and, where possible, each context switch covers more than one operation or event.
* The design allows an application to specify processor affinity for a transfer machine.
* The design allows an application to control how and when buffer event delivery takes place. This is of particular interest to the user space request handler.

## Functional Specification
The design follows the existing specification of the Motr Network module described in net/net.h and [5] for the most part. See the Logical Specification for reasons behind the features described in the functional specification.

### LNet Transfer Machine End Point Address
The Motr LNet transport defines the following 4-tuple end point address format for transfer machines:

* NetworkIdentifierString : PID : PortalNumber : TransferMachineIdentifier

where the NetworkIdentifierString (a NID string), the PID and the Portal Number are as defined in an LNet Address. The TransferMachineIdentifier is defined in the definition section.

Every Motr service request handler, client and utility program needs a set of unique end point addresses. This requirement is not unique to the LNet transport: an end point address in general follows the pattern

* TransportAddress : TransferMachineIdentifier

with the transfer machine identifier component further qualifying the transport address portion, resulting in a unique end point address per transfer machine. The existing bulk emulation transports use the same pattern, though they use a 2-tuple transport address and call the transfer machine identifier component a “service id” [5]. Furthermore, there is a strong relationship between a TransferMachineIdentifier and a FOP state machine locality [6] which needs further investigation. These issues are beyond the scope of this document and are captured in the [r.m0.net.xprt.lnet.address-assignment] dependency.

The TransferMachineIdentifier is represented in an LNet ME by a portion of the higher order Match bits that form a complete LNet address. See Mapping of Endpoint Address to LNet Address for details.

All fields in the end point address must be specified.
For example:

* 10.72.49.14@o2ib0:12345:31:0
* 192.168.96.128@tcp1:12345:32:0

The implementation should provide support to make it easy to dynamically assign an available transfer machine identifier, by specifying a * (asterisk) character as the transfer machine identifier component of the end point address passed to the m0_net_tm_start subroutine:

* 10.72.49.14@o2ib0:12345:31:*

If the call succeeds, the real address assigned can be recovered from the transfer machine’s ntm_ep field. This is captured in refinement [r.m0.net.xprt.lnet.dynamic-address-assignment].

#### Transport Variable
The design requires the implementation to expose the following variable in user and kernel space through the header file net/lnet.h:

* extern struct m0_net_xprt m0_lnet_xprt;

The variable represents the LNet transport module, and its address should be passed to the m0_net_domain_init() subroutine to create a network domain that uses this transport. This is captured in the refinement [r.m0.net.xprt.lnet.transport-variable].

#### Support for automatic provisioning from receive buffer pools

The design includes support for the use of pools of network buffers that will be used to receive messages from one or more transfer machines associated with each pool. This results in greater utilization of receive buffers, as fragmentation is reduced by delaying the commitment of attaching a buffer to specific transfer machines. It also enables transfer machines to perform on-demand, minimal, policy-based provisioning of their receive queues. This support is transport independent, and hence can apply to the earlier bulk emulation transports in addition to the LNet transport.

The design uses the struct m0_net_buffer_pool object to group network buffers into a pool. New APIs will be added to associate a network buffer pool with a transfer machine and to control the number of buffers the transfer machine will auto-provision from the pool; additional fields will be added to the transfer machine and network buffer data structures.

The m0_net_tm_pool_attach() subroutine assigns the transfer machine a buffer pool in the same domain. A buffer pool can only be attached before the transfer machine is started. A given buffer pool can be attached to more than one transfer machine, but each transfer machine can only have an association with a single buffer pool. The life span of the buffer pool must exceed that of all associated transfer machines. Once a buffer pool has been attached to a transfer machine, the transfer machine implementation will obtain network buffers from the pool to populate its M0_NET_QT_MSG_RECV queue on an as-needed basis [r.m0.net.xprt.support-for-auto-provisioned-receive-queue].

The application provided buffer operation completion callbacks are defined by the callbacks argument of the attach subroutine - only the receive queue callback is used in this case. When the application callback is invoked upon receipt of a message, it is up to the application callback to determine whether to return the network buffer to the pool (identified by the network buffer’s nb_pool field) or not. The application should make sure that network buffers with the M0_NET_BUF_QUEUED flag set are not released back to the pool - this flag would be set in situations where there is sufficient space left in the network buffer for additional messages. See Requesting multiple message delivery in a single network buffer for details.
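
This callback contract can be made concrete with a short sketch. It is illustrative only: the callback table layout follows the conventions of net/net.h, but the buffer_pool_put() helper is a hypothetical stand-in for the pool's put operation, and the callback would be supplied through the callbacks argument of m0_net_tm_pool_attach() before m0_net_tm_start() is invoked:

```
/* Illustrative sketch, not the final API. */
#include "net/net.h"

static void buffer_pool_put(struct m0_net_buffer_pool *pool,
                            struct m0_net_buffer *nb); /* hypothetical helper */

/* Receive queue callback for pool-provisioned buffers. */
static void msg_recv_cb(const struct m0_net_buffer_event *ev)
{
	struct m0_net_buffer *nb = ev->nbe_buffer;

	/* ... consume the message delivered by this event ... */

	if (!(nb->nb_flags & M0_NET_BUF_QUEUED)) {
		/* The transport has dequeued the buffer: return it to
		 * its pool so that transfer machines can be provisioned
		 * from it again. */
		buffer_pool_put(nb->nb_pool, nb);
	}
	/* Otherwise the buffer remains queued and may receive further
	 * messages; it must not be released back to the pool. */
}

/* Only the receive queue entry is used by a pool-attached transfer machine. */
static const struct m0_net_buffer_callbacks recv_pool_cb = {
	.nbc_cb = {
		[M0_NET_QT_MSG_RECV] = msg_recv_cb,
	},
};
```
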
When a transfer machine is stopped or fails, receive buffers that have been provisioned from a buffer pool will be put back into that pool by the time the state change event is delivered.

The m0_net_domain_buffer_pool_not_empty() subroutine should be used, directly or indirectly, as the “not-empty” callback of a network buffer pool. We recommend direct use of this callback - i.e. the buffer pool is dedicated to receive buffer provisioning purposes only.

Mixing automatic provisioning and manual provisioning in a given transfer machine is not recommended, mainly because the application would have to support two buffer release mechanisms for the automatic and manually provisioned network buffers, which may get confusing. See Automatic provisioning of receive buffers for details on how automatic provisioning works.

#### Requesting multiple message delivery in a single network buffer

The design extends the semantics of the existing Motr network interfaces to support delivery of multiple messages into a single network buffer. This requires the following changes:

* New fields in the network buffer to indicate a minimum remaining size threshold and a maximum message count.
* A documented change in behavior in the M0_NET_QT_MSG_RECV callback.

The API will add the following fields to struct m0_net_buffer:

```
struct m0_net_buffer {
	…
	m0_bcount_t nb_min_receive_size;
	uint32_t    nb_max_receive_msgs;
};
```

These values are only applicable to network buffers on the M0_NET_QT_MSG_RECV queue. If the transport supports this feature, then the network buffer is reused if possible, provided there is at least nb_min_receive_size space left in the network buffer vector embedded in this network buffer after a message is received. A zero value for nb_min_receive_size is not allowed. At most nb_max_receive_msgs messages are permitted in the buffer.

The M0_NET_QT_MSG_RECV queue callback handler semantics are modified to not clear the M0_NET_BUF_QUEUED flag if the network buffer has been reused. Applications should not attempt to add the network buffer to a queue or de-register it until an event arrives with this flag unset.

See Support for multiple message delivery in a single network buffer.

#### Specifying processor affinity for a transfer machine

The design provides an API for the higher level application to associate the internal threads used by a transfer machine with a set of processors. In particular, the API guarantees that buffer and transfer machine callbacks will be made only on the processors specified.

```
#include "lib/processor.h"

...

int m0_net_tm_confine(struct m0_net_transfer_mc *tm, const struct m0_bitmap *processors);
```
Support for this interface is transport specific, and availability may also vary between user space and kernel space. If used, it should be called before the transfer machine is started. See Processor affinity for transfer machines for further detail.

#### Controlling network buffer event delivery

The design provides the following APIs for the higher level application to control when network buffer event delivery takes place and which thread is used for the buffer event callback.
+``` +void m0_net_buffer_event_deliver_all(struct m0_net_transfer_mc *tm); +int m0_net_buffer_event_deliver_synchronously(struct m0_net_transfer_mc *tm); +bool m0_net_buffer_event_pending(struct m0_net_transfer_mc *tm); +void m0_net_buffer_event_notify(struct m0_net_transfer_mc *tm, struct m0_chan *chan); +``` +See Request handler control of network buffer event delivery for the proposed usage. + +The m0_net_buffer_event_deliver_synchronously() subroutine must be invoked before starting the transfer machine, to disable the automatic asynchronous delivery of network buffer events on a transport provided thread. Instead, the application should periodically check for the presence of network buffer events with the m0_net_buffer_event_pending() subroutine and if any are present, cause them to get delivered by invoking the m0_net_buffer_event_deliver_all() subroutine. Buffer events will be delivered on the same thread making the subroutine call, using the existing buffer callback mechanism. If no buffer events are present, the application can use the non-blocking m0_net_buffer_event_notify() subroutine to request notification of the arrival of the next buffer event on a wait channel; the application can then proceed to block itself by waiting on this and possibly other channels for events of interest. + +This support will not be made available in existing bulk emulation transports, but the new APIs will not indicate error if invoked for these transports. Instead, asynchronous network buffer event delivery is always enabled and these new APIs will never signal the presence of buffer events for these transports. This allows a smooth transition from the bulk emulation transports to the LNet transport. + +#### Additional Interfaces +The design permits the implementation to expose additional interfaces if necessary, as long as their usage is optional. In particular, interfaces to extract or compare the network interface component in an end point address would be useful to the Motr request handler setup code. Other interfaces may be required for configurable parameters controlling internal resource consumption limits. + +#### Support for multiple message delivery in a single network buffer + +The implementation will provide support for this feature by using the LNet max_size field in a memory descriptor (MD). + +The implementation should de-queue the receive network buffer when LNet unlinks the MD associated with the network buffer vector memory. The implementation must ensure that there is a mechanism to indicate that the M0_NET_BUF_QUEUED flag should not be cleared by the m0_net_buffer_event_post() subroutine under these circumstances. This is captured in refinement [r.m0.net.xprt.lnet.multiple-messages-in-buffer]. + +#### Automatic provisioning of receive buffers + +The design supports policy based automatic provisioning of network buffers to the receive queues of transfer machines from a buffer pool associated with the transfer machine. This support is independent of the transport being used, and hence can apply to the earlier bulk emulation transports as well. + +A detailed description of a buffer pool object itself is beyond the scope of this document, and is covered by the [r.m0.net.network-buffer-pool] dependency, but briefly, a buffer pool has the following significant characteristics: + +* It is associated with a single network domain. +* It contains a collection of unused, registered network buffers from the associated network domain. 
* It provides non-blocking operations to obtain a network buffer from the pool, and to return a network buffer to the pool.
* It provides a “not-empty” callback to notify when buffers are added to the pool.
* It offers policies to enforce certain disciplines like the size and number of network buffers.

The rest of this section refers to the data structures and subroutines described in the functional specification section, Support for auto-provisioning from receive buffer pools.

The m0_net_tm_pool_attach() subroutine is used, prior to starting a transfer machine, to associate it with a network buffer pool. This buffer pool is assumed to exist until the transfer machine is finalized. When the transfer machine is started, an attempt is made to fill the M0_NET_QT_MSG_RECV queue with a minimum number of network buffers from the pool. The network buffers will have their nb_callbacks value set from the transfer machine’s ntm_recv_pool_callbacks value.

The advantages of using a common pool to provision the receive buffers of multiple transfer machines diminish as the minimum receive queue length of a transfer machine increases. This is because as this length increases, more network buffers need to be assigned (“pinned”) to specific transfer machines, fragmenting the total available receive network buffer space. The best utilization of total receive network buffer space is achieved by using a minimum receive queue length of 1 in all the transfer machines; however, this could result in messages getting dropped in the time it takes to provision a new network buffer when the first gets filled. The default minimum receive queue length value is set to 2, a reasonably balanced compromise value; it can be modified with the m0_net_tm_pool_length_set() subroutine if desired.

Transports automatically dequeue receive buffers when they get filled; notification of the completion of the buffer operation is sent by the transport with the m0_net_buffer_event_post() subroutine. This subroutine will be extended to get more network buffers from the associated pool and add them to the transfer machine’s receive queue using the internal in-tm-mutex equivalent of the m0_net_buffer_add subroutine, if the length of the transfer machine’s receive queue is below the value of ntm_recv_queue_min_length. The re-provisioning attempt is made prior to invoking the application callback to deliver the buffer event so as to minimize the amount of time the receive queue is below its minimum value.

The application has a critical role to play in returning a network buffer to its pool. If this is not done, it is possible for the pool to get exhausted and messages to get lost. This responsibility is no different from normal non-pool operation, where the application has to re-queue the receive network buffer. The application should note that when multiple message delivery is enabled in a receive buffer, the buffer flags should be examined to determine if the buffer has been dequeued.

It is possible for the pool to have no network buffers available when the m0_net_buffer_event_post() subroutine is invoked. This means that a transfer machine receive queue length can drop below its configured minimum, and there has to be a mechanism available to remedy this when buffers become available once again. Fortunately, the pool provides a callback on a “not-empty” condition.
The application is responsible for arranging that the m0_net_domain_buffer_pool_not_empty() subroutine is invoked from the pool’s “not-empty” callback. When invoked in response to the “not-empty” condition, this callback will trigger an attempt to provision the transfer machines of the network domain associated with this pool, until their receive queues have reached their minimum length. While doing so, care should be taken that minimal work is actually done on the pool callback - the pool get operation in particular should not be done. Additionally, care should be taken to avoid obtaining the transfer machine’s lock in this arbitrary thread context, as doing so would reduce the efficacy of the transfer machine’s processor affinity. See Concurrency control for more detail on the serialization model used during automatic provisioning and the use of the ntm_recv_pool_deficit atomic variable.

The use of a receive pool is optional, but if attached to a transfer machine, the association lasts the life span of the transfer machine. When a transfer machine is stopped or fails, receive buffers from (any) buffer pools will be put back into their pool. This will be done by the m0_net_tm_event_post() subroutine before delivering the state change event to the application or signalling on the transfer machine’s channel.

There is no reason why automatic and manual provisioning cannot co-exist. It is not desirable to mix the two, mainly because the application has to handle two different buffer release schemes - transport level semantics of the transfer machine are not affected by the use of automatic provisioning.

#### Future LNet buffer registration support

The implementation can support hardware optimizations available at buffer registration time, when made available in future revisions of the LNet API. In particular, Infiniband hardware internally registers a vector (translating a virtual memory address to a “bus address”) and produces a cookie, identifying the vector. It is this vector registration capability that was the original reason to introduce m0_net_buf_register(), as separate from m0_net_buf_add(), in the Network API.

#### Processor affinity for transfer machines

The API allows an application to associate the internal threads used by a transfer machine with a set of processors. This must be done using the m0_net_tm_confine() subroutine before the transfer machine is started. Support for this interface is transport specific, and availability may also vary between user space and kernel space. The API should return an error if not supported.

The design assumes that the m0_thread_confine() subroutine from “lib/thread.h” will be used to implement this support. The implementation will need to define an additional transport operation to convey this request to the transport.

The API provides the m0_net_tm_colour_set() subroutine for the application to associate a “colour” with a transfer machine. This colour is used when automatically provisioning network buffers to the receive queue from a buffer pool. The application can also use this association explicitly when provisioning network buffers for the transfer machine in other buffer pool use cases. The colour value can be fetched with the m0_net_tm_colour_get() subroutine.

#### Synchronous network buffer event delivery

The design provides support for an advanced application (like the Request handler) to control when buffer events are delivered.
This gives the application greater control over thread scheduling and enables it to co-ordinate network usage with that of other objects, allowing for better locality of reference. This is illustrated in the Request handler control of network buffer event delivery use case. The feature will be implemented with the [r.m0.net.synchronous-buffer-event-delivery] refinement.

If this feature is used, then the implementation should not deliver buffer events until requested, and should do so only on the thread invoking the m0_net_buffer_event_deliver_all() subroutine - i.e. network buffer event delivery is done synchronously under application control. This subroutine effectively invokes the m0_net_buffer_event_post() subroutine for each pending buffer event. It is not an error if no events are present when this subroutine is called; this addresses a known race condition described in Concurrency control.

The m0_net_buffer_event_pending() subroutine should not perform any context switching operation if possible. It may be impossible to avoid the use of a serialization primitive while doing so, but proper usage by the application will considerably reduce the possibility of a context switch when the transfer machine is operated in this fashion.

The notification of the presence of a buffer event must be delivered asynchronously to the invocation of the non-blocking m0_net_buffer_event_notify() subroutine. The implementation must use a background thread for the task; presumably the application will confine this thread to the desired set of processors with the m0_net_tm_confine() subroutine. The context switching impact is low, because the application would not have invoked the m0_net_buffer_event_notify() subroutine unless it had no work to do. The subroutine should arrange for the background thread to block until the arrival of the next buffer event (if need be) and then signal on the specified channel. No further attempt should be made to signal on the channel until the next call to the m0_net_buffer_event_notify() subroutine - the implementation can determine the disposition of the thread after the channel is signalled.

#### Efficient communication between user and kernel spaces

The implementation shall use the following strategies to reduce the communication overhead between user and kernel space:

* Use shared memory as much as possible instead of copying data.
* The LNet event processing must be done in the kernel.
* Calls from user space to the kernel should combine as many operations as possible.
* Use atomic variables for serialization if possible. Dependency [r.m0.lib.atomic.interoperable-kernel-user-support].
* Resource consumption to support these communication mechanisms should be bounded and configurable through the user space process.
* Minimize context switches. This is captured in refinement [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm].

As an example, consider using a producer-consumer pattern with circular queues to both initiate network buffer operations and deliver events. These circular queues are allocated in shared memory and queue position indices (not pointers) are managed via atomic operations. Minimal data is actually copied between user and kernel space - only notification of production. Multiple operations can be processed per transition across the user-kernel boundary.

* The user space transport uses a classical producer-consumer pattern to queue pending operations with its counterpart in the kernel.
The user space operation dispatcher will add as many pending operations as possible from its pending buffer operation queue to the circular queue for network buffer operations that it shares with its counterpart in the kernel, the operations processor. As part of this step, the network buffer vector for the network buffer operation will be copied to the shared circular queue, which minimizes the payload of the notification ioctl call that follows. Once it has drained its pending operations queue or filled the circular buffer, the operation dispatcher will then notify the operations processor in the kernel, via an ioctl, that there are items to process in the shared circular queue. The operations processor will schedule these operations in the context of the ioctl call itself, recovering and mapping each network buffer vector into kernel space. The actual payload of the ioctl call itself is minimal, as all the operational data is in the shared circular queue.
* A similar producer-consumer pattern is used in the reverse direction to send network buffer completion events from the kernel to user space. The event processor in user space has a thread blocked in an ioctl call, waiting for notification on the availability of buffer operation completion events in the shared circular event queue. When the call returns with an indication of available events, the event processor dequeues and delivers each event from the circular queue until the queue is empty. The cycle then continues with the event processor once again blocking on the same kernel ioctl call. The minor race condition implicit in the temporal separation between the test that the circular queue is empty and the ioctl call to wait is easily overcome by the ioctl call returning immediately if the circular queue is not empty. In the kernel, the event dispatcher arranges for such a blocking ioctl call to unblock after it has added events to the circular queue. It is up to the implementation to ensure that there are always sufficient slots available in the circular queue so that events do not get dropped; this is reasonably predictable, being a function of the number of pending buffer operations and the permitted reuse of receive buffers.

This is illustrated in the following figure:

![image](./Images/LNET1.PNG)
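
To make the queue mechanics concrete, the following is a minimal single-producer, single-consumer sketch of one such shared circular queue. The structure, names and slot type are invented for illustration, and C11 atomics stand in for the interoperable Motr atomics the design depends on; only the indices are manipulated atomically, and a slot is fully written before the producer index is advanced.

```
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CQ_NR_SLOTS 64                /* illustrative; one slot stays unused */

struct op_descriptor {                /* stands in for the real payload */
	uint64_t od_cookie;
};

struct cqueue {
	atomic_uint_fast32_t cq_prod; /* next slot to fill (producer only) */
	atomic_uint_fast32_t cq_cons; /* next slot to drain (consumer only) */
	struct op_descriptor cq_slot[CQ_NR_SLOTS];
};

/* Producer side: returns false when full, prompting notification/retry. */
static bool cq_put(struct cqueue *q, const struct op_descriptor *op)
{
	uint_fast32_t prod = atomic_load(&q->cq_prod);
	uint_fast32_t next = (prod + 1) % CQ_NR_SLOTS;

	if (next == atomic_load(&q->cq_cons))
		return false;
	q->cq_slot[prod] = *op;       /* fill the slot before publishing it */
	atomic_store(&q->cq_prod, next);
	return true;
}

/* Consumer side: returns false when empty, prompting a blocking wait. */
static bool cq_get(struct cqueue *q, struct op_descriptor *op)
{
	uint_fast32_t cons = atomic_load(&q->cq_cons);

	if (cons == atomic_load(&q->cq_prod))
		return false;
	*op = q->cq_slot[cons];
	atomic_store(&q->cq_cons, (cons + 1) % CQ_NR_SLOTS);
	return true;
}
```

A failed cq_get() followed by the blocking wait ioctl is exactly the race noted above; it is closed by having the wait ioctl return immediately whenever the queue is no longer empty.
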

### Conformance
* [i.m0.net.rdma] LNet supports RDMA and the feature is exposed through the Motr network bulk interfaces.
* [i.m0.net.ib] LNet supports Infiniband.
* [i.m0.net.xprt.lnet.kernel] The design provides a kernel transport.
* [i.m0.net.xprt.lnet.user] The design provides a user space transport.
* [i.m0.net.xprt.lnet.user.multi-process] The design allows multiple concurrent user space processes to use LNet.
* [i.m0.net.xprt.lnet.user.no-gpl] The design avoids using user space GPL interfaces.
* [i.m0.net.xprt.lnet.user.min-syscalls] The [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm] refinement will address this.
* [i.m0.net.xprt.lnet.min-buffer-vm-setup] During buffer registration user memory pages get pinned in the kernel.
* [i.m0.net.xprt.lnet.processor-affinity] LNet currently provides no processor affinity support. The [r.m0.net.xprt.lnet.processor-affinity] refinement will provide higher layers the ability to associate transfer machine threads with processors.
* [i.m0.net.buffer-event-delivery-control] The [r.m0.net.synchronous-buffer-event-delivery] refinement will provide this feature.
* [i.m0.net.xprt.lnet.buffer-registration] The API supports buffer pre-registration before use. Any hardware optimizations possible at this time can be utilized when available through the LNet API. See Future LNet buffer registration support.
* [i.m0.net.xprt.auto-provisioned-receive-buffer-pool] The design provides transport independent support to automatically provision the receive queues of transfer machines on demand, from pools of unused, registered, network buffers.

### Dependencies
* [r.lnet.preconfigured] The design assumes that LNet modules and associated LNDs are pre-configured on a host.
* [r.m0.lib.atomic.interoperable-kernel-user-support] The design assumes that the Motr library’s support for atomic operations is interoperable across the kernel and user space boundaries when using shared memory.
* [r.m0.net.xprt.lnet.address-assignment] The design assumes that the assignment of LNet transport addresses to Motr components is made elsewhere. Note the constraint that all addresses must use a PID value of 12345, and a Portal Number that does not clash with existing usage (Lustre and Cray). It is recommended that all Motr servers be assigned low (values close to 0) transfer machine identifier values. In addition, it is recommended that some set of such addresses be reserved for Motr tools that are relatively short lived - they will dynamically get transfer machine identifiers at run time. These two recommendations reduce the chance of a collision between Motr server transfer machine identifiers and dynamic transfer machine identifiers. Another aspect to consider is the possible alignment of FOP state machine localities [6] with transfer machine identifiers.
* [r.m0.net.network-buffer-pool] Support for a pool of network buffers involving no higher level interfaces than the network module itself. There can be multiple pools in a network domain, but a pool cannot span multiple network domains. Non-blocking interfaces are available to get and put network buffers, and a callback to signal the availability of buffers is provided. This design benefits considerably from a “coloured” variant of the get operation, one that will preferentially return the most recently used buffer that was last associated with a specific transfer machine, or, if none such is found, a buffer which has no previous transfer machine association, or, if none such is found, the least recently used buffer from the pool, if any.

Supporting this variant efficiently may require a more sophisticated internal organization of the buffer pool than is possible with a simple linked list; however, a simple ordered linked list could suffice if coupled with a slightly more sophisticated selection mechanism than “head-of-the-list”. Note that buffers have no transfer machine affinity until first used, and that the nb_tm field of the buffer can be used to determine the last transfer machine association when the buffer is put back into the pool. Here are some possible approaches:

* Add buffers with no affinity to the tail of the list, and push returned buffers to the head of the list. This approach allows for a simple O(n) worst case selection algorithm with possibly less average overhead (n is the average number of buffers in the free list). A linear search from the head of the list will break off when a buffer of the correct affinity is found, or a buffer with no affinity is found, or else the buffer at the tail of the list is selected, meeting the requirements mentioned above.
In steady state, assuming an even load over the transfer machines, a default minimum queue length of 2, and a receive buffer processing rate that keeps up with the receive buffer consumption rate, there would only be one network buffer per transfer machine in the free list, and hence the number of list elements to traverse would be proportional to the number of transfer machines in use. In reality, there may be more than one buffer affiliated with a given transfer machine to account for the occasional traffic burst. A periodic sweep of the list to clear the buffer affiliation after some minimum time in the free list (reflecting the fact that the value of such affinity reduces with time spent in the buffer pool) would remove such extra buffers over time, and serve to maintain the average level of efficiency of the selection algorithm. The nb_add_time field of the buffer could be used for this purpose, and the sweep itself could be piggybacked into any get or put call, based upon some time interval. Because of the sorting order, the sweep can stop when it finds the first un-affiliated buffer or the first buffer within the minimum time bound.
* A further refinement of the above would be to maintain two linked lists, one for un-affiliated buffers and one for affiliated buffers. If the search of the affiliated list is not successful, then the head of the unaffiliated list is chosen. A key aspect of this variant is that returned buffers get added to the tail of the affiliated list. This will increase the likelihood that a get operation would find an affiliated buffer toward the head of the affiliated list, because automatic re-provisioning by a transfer machine takes place before the network buffer completion callback is made, and hence before the application gets to process and return the network buffer to the pool. The sweep starts from the head of the affiliated list, moving buffers to the unaffiliated list, until it finds a buffer that is within the minimum time bound. (A sketch of this variant appears after this list.)
* Better than O(n) search (closer to O(1)) can be accomplished with more complex data structures and algorithms. Essentially it will require maintaining a per transfer machine list somewhere. The pool can only learn of the existence of a new transfer machine when the put operation is invoked, and will have to be told when the transfer machine is stopped. If the per transfer machine list is anchored in the pool, then the set of such anchors must be dynamically extensible. The alternative of anchoring the list in the transfer machine itself has pros and cons; it would work very well for the receive buffer queue, but does not extend to support other buffer pools for arbitrary purposes. In other words, it is possible to create an optimal 2-level pool (a per transfer machine pool in the data structure itself, with a shared backing store buffer pool) dedicated to receive network buffer processing, but not a generalized solution. Such a pool would exhibit excellent locality of reference but would be more complex because high water thresholds would have to be maintained to return buffers back to the global pool.
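
The following sketch illustrates the get operation of the two-list variant. The list structures and names are invented for illustration and do not represent the actual buffer pool interfaces; the periodic affinity sweep is omitted.

```
#include <stddef.h>
#include <stdint.h>

struct pool_buf {
	struct pool_buf *pb_next;
	uint32_t         pb_colour;      /* last transfer machine association */
};

struct pool {
	struct pool_buf *p_affiliated;   /* returned buffers, added at the tail */
	struct pool_buf *p_unaffiliated; /* never-used or swept buffers */
};

static struct pool_buf *list_unlink(struct pool_buf **link)
{
	struct pool_buf *pb = *link;

	*link = pb->pb_next;
	pb->pb_next = NULL;
	return pb;
}

static struct pool_buf *pool_get(struct pool *p, uint32_t colour)
{
	struct pool_buf **link;

	/* First preference: a buffer last used by this transfer machine. */
	for (link = &p->p_affiliated; *link != NULL; link = &(*link)->pb_next)
		if ((*link)->pb_colour == colour)
			return list_unlink(link);
	/* Second preference: a buffer with no previous association. */
	if (p->p_unaffiliated != NULL)
		return list_unlink(&p->p_unaffiliated);
	/* Last resort: the head of the affiliated list, which is the
	 * least recently returned buffer. */
	if (p->p_affiliated != NULL)
		return list_unlink(&p->p_affiliated);
	return NULL;                     /* the pool is empty */
}
```
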

### Security Model
No security model is defined; the new transport inherits whatever security model LNet provides today.

### Refinement
* [r.m0.net.xprt.lnet.transport-variable]
  * The implementation shall name the transport variable as specified in this document.
* [r.m0.net.xprt.lnet.end-point-address]
  * The implementation should support the mapping of end point address to LNet address as described in Mapping of Endpoint Address to LNet Address, including the reservation of a portion of the match bit space in which to encode the transfer machine identifier.
* [r.m0.net.xprt.support-for-auto-provisioned-receive-queue]
  * The implementation should follow the strategy outlined in Automatic provisioning of receive buffers. It should also follow the serialization model outlined in Concurrency control.
* [r.m0.net.xprt.lnet.multiple-messages-in-buffer]
  * Add a nb_min_receive_size field to struct m0_net_buffer.
  * Document the behavioral change of the receive message callback.
  * Provide a mechanism for the transport to indicate that the M0_NET_BUF_QUEUED flag should not be cleared by the m0_net_buffer_event_post() subroutine.
  * Modify all existing usage to set the nb_min_receive_size field to the buffer length.
* [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm]
  * The implementation should follow the strategies recommended in Efficient communication between user and kernel spaces, including the creation of a private device driver to facilitate such communication.
* [r.m0.net.xprt.lnet.cleanup-on-process-termination]
  * The implementation should release all kernel resources held by a process using the LNet transport when that process terminates.
* [r.m0.net.xprt.lnet.dynamic-address-assignment]
  * The implementation may support dynamic assignment of transfer machine identifiers using the strategy outlined in Mapping of Endpoint Address to LNet Address. We recommend that the implementation dynamically assign transfer machine identifiers from higher numbers downward to reduce the chance of conflicting with well-known transfer machine identifiers.
* [r.m0.net.xprt.lnet.processor-affinity]
  * The implementation must provide support for this feature, as outlined in Processor affinity for transfer machines. The implementation will need to define an additional transport operation to convey this request to the transport. Availability may vary by kernel or user space.
* [r.m0.net.synchronous-buffer-event-delivery]
  * The implementation must provide support for this feature as outlined in Controlling network buffer event delivery and Synchronous network buffer event delivery.

### State
A network buffer used to receive messages may be used to deliver multiple messages if its nb_min_receive_size field is non-zero. Such a network buffer may still be queued when the buffer event signifying a received message is delivered.

When a transfer machine stops or fails, all network buffers associated with buffer pools should be put back into their pool. The atomic variable ntm_recv_pool_deficit, used to count the number of network buffers needed, should be set to zero. This should be done before notification of the state change is made.

Transfer machines now either support automatic asynchronous buffer event delivery on a transport thread (the default), or can be configured to synchronously deliver buffer events on an application thread. The two modes of operation are mutually exclusive and must be established before starting the transfer machine.

#### State Invariants
User space buffers pin memory pages in the kernel when registered. Hence, registered user space buffers must be associated with a set of kernel struct page pointers to the referenced memory.
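
For illustration, pinning one segment of a registered user space buffer might look roughly as follows in the kernel. The sketch uses the get_user_pages() signature of Lustre 2.0 era kernels; the exact signature and locking vary across kernel versions, and each pinned page must eventually be released (e.g. with put_page()) when the buffer is de-registered.

```
#include <linux/mm.h>
#include <linux/sched.h>

/* Pin npages of the current process's memory starting at addr. */
static int pin_user_segment(unsigned long addr, int npages,
                            struct page **pages)
{
	int rc;

	down_read(&current->mm->mmap_sem);
	rc = get_user_pages(current, current->mm, addr, npages,
			    1 /* write */, 0 /* force */, pages, NULL);
	up_read(&current->mm->mmap_sem);

	/* rc is the number of pages actually pinned, or a -errno value. */
	return rc;
}
```
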
The invariants of the transfer machine and network buffer objects should capture the fact that if a pool is associated with these objects, then the pool is in the same network domain. The transfer machine invariant, in particular, should ensure that the value of the atomic variable ntm_recv_pool_deficit is zero when the transfer machine is in an inoperable state.

See the refinement [r.m0.net.xprt.support-for-auto-provisioned-receive-queue].

#### Concurrency Control
The LNet transport module is sandwiched between the asynchronous Motr network API above and the asynchronous LNet API below. It must plan on operating within the serialization models of both these components. In addition, significant use is made of the kernel’s memory management interfaces, which have their own serialization model. The use of a device driver to facilitate user space to kernel communication must also be addressed.

The implementation mechanism chosen will further govern the serialization model in the kernel. The choice of the number of EQs will control how much inherent independent concurrency is possible. For example, sharing of EQs across transfer machines or for different network buffer queues could require greater concurrency control than the use of dedicated EQs per network buffer queue per transfer machine.

Serialization of the kernel transport is anticipated to be relatively straightforward, with safeguards required for network buffer queues.

Serialization between user and kernel space should take the form of shared memory circular queues co-ordinated with atomic indices. A producer-consumer model should be used, with opposite roles assigned to the kernel and user space process; appropriate notification of change should be made through the device driver. Separate circular queues should be used for buffer operations (user to kernel) and event delivery (kernel to user). [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm]

Automatic provisioning can only be enabled before a transfer machine is started. Once enabled, it cannot be disabled. Thus, provisioning operations are implicitly protected by the state of the transfer machine - the “not-empty” callback subroutine will never fail to find its transfer machine, though it should take care to examine the state before performing any provisioning. The life span of a network buffer pool must exceed that of the transfer machines that use the pool. The life span of a network domain must exceed that of associated network buffer pools.

Automatic provisioning of receive network buffers from the receive buffer pool takes place either through the m0_net_buffer_event_post() subroutine, or when triggered by the receive buffer pool’s “not-empty” callback via the m0_net_domain_buffer_pool_not_empty() subroutine. Two important conditions should be met while provisioning:

* Minimize processing on the pool callback: The buffer pool maintains its own independent lock domain; it invokes the m0_net_domain_buffer_pool_not_empty() subroutine (provided for use as the not-empty callback) while holding its lock. The callback is invoked on the stack of the caller who used the put operation on the pool. It is essential, therefore, that the not-empty callback perform minimal work - it should only trigger an attempt to reprovision transfer machines, not do the provisioning.
* Minimize interference with the processor affinity of the transfer machine: Ideally, the transfer machine is only referenced on a single processor, resulting in a strong likelihood that its data structures are in the cache of that processor. Provisioning transfer machines requires iteration over a list, and if the transfer machine lock has to be obtained for each, it could adversely impact such caching. We provided the atomic variable ntm_recv_pool_deficit, with a count of the number of network buffers to provision, so that this lock is obtained only when the transfer machine really needs to be provisioned, and not for every invocation of the buffer pool callback. The transfer machine invariant will enforce that the value of this atomic is 0 when the transfer machine is not in an operable state.

Actual provisioning should be done on a domain private thread awoken for this purpose. A transfer machine needs provisioning if it is in the started state, it is associated with the pool, and its receive queue length is less than the configured minimum (determined via an atomic variable as outlined above). To provision, the thread will obtain network buffers from the pool with the get() operation, and add them to the receive queue of the transfer machine with the internal equivalent of the m0_net_buffer_add subroutine that assumes the transfer machine is locked.

The design requires that receive buffers obtained from buffer pools be put back to their pools when a transfer machine is stopped or fails, prior to notifying the higher level application of the change in state. This action will be done in the m0_net_tm_event_post() subroutine, before invoking the state change callback. The subroutine obtains the transfer machine mutex, and hence has the same degree of serialization as that used in automatic provisioning.

The synchronous delivery of network buffer events utilizes the transfer machine lock internally, when needed. The lock must not be held in the m0_net_buffer_event_deliver_all() subroutine across calls to the m0_net_buffer_event_post() subroutine.

In the use case described in Request handler control of network buffer event delivery there is a possibility that the application could wake up for reasons other than the arrival of a network buffer event, and once more test for the presence of network buffer events even while the background thread is making a similar test. It is possible that the application could consume all events and once more make a request for future notification while the semaphore count in its wait channel is non-zero. In this case it would return immediately, find no additional network events and repeat the request; the m0_net_buffer_event_deliver_all() subroutine will not return an error if no events are present.
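
A sketch of such a request handler loop is shown below. The m0_chan/m0_clink plumbing, initialization and the handler's other work sources are abbreviated assumptions for illustration; m0_net_buffer_event_deliver_synchronously() is assumed to have been called before the transfer machine was started.

```
#include "net/net.h"
#include "lib/chan.h"

static struct m0_chan handler_chan;  /* handler wait channel (init elided) */
static bool           shutting_down;

static void handler_loop(struct m0_net_transfer_mc *tm)
{
	struct m0_clink clink;

	m0_clink_init(&clink, NULL);
	m0_clink_add_lock(&handler_chan, &clink);

	while (!shutting_down) {
		if (m0_net_buffer_event_pending(tm)) {
			/* Buffer callbacks run on this thread. */
			m0_net_buffer_event_deliver_all(tm);
		} else {
			/* Ask for a signal on the channel when the next
			 * event arrives, then wait; the wait may also be
			 * ended by other work sources, which is harmless. */
			m0_net_buffer_event_notify(tm, &handler_chan);
			m0_chan_wait(&clink);
		}
	}
	m0_clink_del_lock(&clink);
	m0_clink_fini(&clink);
}
```
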

### Scenarios
A Motr component, whether it is a kernel file system client, server, or tool, uses the following pattern for multiple-message reception into a single network buffer.

1. The component creates and starts one or more transfer machines, identifying the actual end points of the transfer machines.
2. The component provisions network buffers to be used for receipt of unsolicited messages. The method differs based on whether a buffer pool is used or not.
   i. When a buffer pool is used, these steps are performed.
      a. The network buffers are provisioned, with nb_min_receive_size set to allow multiple delivery of messages. The network buffers are added to a buffer pool.
      b. The buffer pool is registered with a network domain and associated with one or more transfer machines. Internally, the transfer machines will get buffers from the pool and add them to their M0_NET_QT_MSG_RECV queues.
   ii. When a buffer pool is not used, these steps are performed.
      a. Network buffers are provisioned with nb_min_receive_size set to allow multiple delivery of messages.
      b. The network buffers are registered with the network domain and added to a transfer machine M0_NET_QT_MSG_RECV queue.
3. When a message is received, two sub-cases are possible as part of processing the message. It is the responsibility of the component itself to coordinate between these two sub-cases.
   i. When a message is received and the M0_NET_BUF_QUEUED flag is set in the network buffer, then the component does not re-enqueue the network buffer, as there is still space remaining in the buffer for additional messages.
   ii. When a message is received and the M0_NET_BUF_QUEUED flag is not set in the network buffer, then the component takes one of two paths, depending on whether a buffer pool is in use or not.
      a. When a buffer pool is in use, the component puts the buffer back in the buffer pool so it can be re-used.
      b. When a buffer pool is not in use, the component may re-enqueue the network buffer after processing is complete, as there is no space remaining in the buffer for additional messages.

#### Sending non-bulk messages from Motr components

A Motr component, whether a user-space server, user-space tool or kernel file system client, uses the following pattern to use the LNet transport to send messages to another component. Memory for send queues can be allocated once, or the send buffer can be built up dynamically from serialized data and references to existing memory.

1. The component optionally allocates memory to one or more m0_net_buffer objects and registers those objects with the network layer. These network buffers are a pool of message send buffers.
2. To send a message, the component uses one of two strategies.
   i. The component selects one of the buffers previously allocated and serializes the message data into that buffer.
   ii. The component builds up a fresh m0_net_buffer object out of memory pages newly allocated and references to other memory (to avoid copies), and registers the resulting object with the network layer.
3. The component enqueues the message for transmission.
4. When a buffer operation completes, it uses one of two strategies, corresponding to the earlier approach.
   i. If the component used previously allocated buffers, it returns the buffer to the pool of send buffers.
   ii. If the component built up the buffer from partly serialized and partly referenced data, it de-registers the buffer and de-provisions the memory.

#### Kernel space bulk buffer access from file system clients

A Motr file system client uses the following pattern to use the LNet transport to initiate passive bulk transfers with Motr servers. Memory for bulk queues will come from user space memory. The user space memory is not controlled by Motr; it is used as a result of system calls, e.g. read() and write(). A condensed code sketch follows the steps below.

1. The client populates a network buffer from mapped user pages, registers this buffer with the network layer and enqueues the buffer for transmission.
2. When a buffer operation completes, the client will de-register the network buffer and de-provision the memory assigned.
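
In outline, the enqueue step might look as follows. This is a sketch under stated assumptions: client_bulk_enqueue() is a hypothetical helper, error paths and the population of the buffer vector from the mapped pages are elided, and the queue type shown assumes the server side performs the matching active transfer.

```
#include "net/net.h"

/* Sketch: queue a client buffer for a passive bulk transfer. nb->nb_buffer
 * is assumed to have been populated from mapped user pages already. */
static int client_bulk_enqueue(struct m0_net_domain      *dom,
                               struct m0_net_transfer_mc *tm,
                               struct m0_net_buffer      *nb)
{
	int rc;

	rc = m0_net_buffer_register(nb, dom);
	if (rc != 0)
		return rc;

	/* For a write() the client is the passive data source; a read()
	 * would use M0_NET_QT_PASSIVE_BULK_RECV instead. */
	nb->nb_qtype = M0_NET_QT_PASSIVE_BULK_SEND;
	rc = m0_net_buffer_add(nb, tm);
	if (rc != 0)
		m0_net_buffer_deregister(nb, dom);

	/* On completion, the callback de-registers the buffer and the
	 * mapped pages are unpinned. */
	return rc;
}
```
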
+
+#### User space bulk buffer access from Motr servers
+
+A Motr server uses the following pattern to use the LNet transport to initiate active bulk transfers to other Motr components.
+
+1. The server establishes a network buffer pool. The server allocates a set of network buffers provisioned with memory and registers them with the network domain.
+2. To perform a bulk operation, the server gets a network buffer from the network buffer pool, populates the memory with the data to send in the case of an active send, and enqueues the network buffer for transmission.
+3. When a network buffer operation completes, the network buffer can be returned to the pool of network buffers.
+
+#### User space bulk buffer access from Motr tools
+
+A Motr tool uses the following pattern to use the LNet transport to initiate passive bulk transfers to Motr server components:
+
+1. The tool should use an end point address that is not assigned to any Motr server or file system client. It should use a dynamic address to achieve this.
+2. To perform a bulk operation, the tool provisions a network buffer. The tool then registers this buffer and enqueues the buffer for transmission.
+3. When a buffer operation completes, the buffer can be de-registered and the memory can be de-provisioned.
+
+#### Obtaining dynamic addresses for Motr tools
+
+A Motr tool is a relatively short-lived process, typically a command-line invocation of a program to communicate with a Motr server. One cannot assign fixed addresses to such tools, as the failure of a human-interactive program because of the existence of another executing instance of the same program is generally considered unacceptable behavior, and one that precludes the creation of scriptable tools.
+
+Instead, all tools could be assigned a shared combination of NID, PID, and Portal Number, and at run time the tool process can dynamically assign unique addresses to itself by creating a transfer machine with a wildcard transfer machine identifier. This is captured in refinement [r.m0.net.xprt.lnet.dynamic-address-assignment] and Mapping of Endpoint Address to LNet Address. Dependency: [r.m0.net.xprt.lnet.address-assignment].
+
+#### Request handler control of network buffer event delivery
+
+The user space Motr request handler operates within a locality domain that includes, among other things, a processor, a transfer machine, a set of FOMs in execution, and handlers to create new FOMs for FOPs. The request handler attempts to perform all I/O operations asynchronously, using a single handler thread, to minimize the thread context switching overhead.
+
+### Failures
+One failure situation that must be explicitly addressed is the termination of the user space process that uses the LNet transport. All resources consumed by this process must be released in the kernel. In particular, where shared memory is used, the implementation design should take into account the accessibility of this shared memory at this time. Refinement: [r.m0.net.xprt.lnet.cleanup-on-process-termination]
+
+### Analysis
+The number of LNet-based transfer machines that can be created on a host is constrained by the number of LNet portals not assigned to Lustre or other consumers such as Cray. In Lustre 2.0, the number of unassigned portal numbers is 30.
+
+In terms of performance, the design is no more scalable than LNet itself. The design does not impose a very high overhead in communicating between user space and the kernel, and uses considerably more efficient event processing than ULA.
+
+### Other
+We had some concerns and questions regarding the serialization model used by LNet, and whether using multiple portals is more efficient than sharing a single portal. The feedback we received indicates that LNet uses relatively coarse locking internally, with no appreciable difference in performance for these cases. There may be improvements in the future, but that is not guaranteed; the suggestion was to use multiple portals if possible, but that also raises concerns about the restricted portal space left available in LNet (around 30 unused portals) and the fact that all LNet users share the same portal space [4].
+
+### Rationale
+One important design choice was to use a custom driver rather than ULA, or a re-implementation of the ULA. The primary reason for not using the ULA directly is that it is covered by the GPL, which would limit the licensing choices for Motr overall. It would have been possible to implement our own ULA-like driver and library. After that, a user-level LNet transport would still be required on top of this ULA-like driver. However, Motr does not require the full set of possible functions and use cases supported by LNet. Implementing a custom driver, tailored to the Motr net bulk transport, means that only the functionality required by Motr must be supported. The driver can also be optimized specifically for the Motr use cases, without concern for other users. For these reasons, a re-implementation of the ULA was not pursued.
+
+Certain LNet implementation idiosyncrasies also impact the design. We call out the following, in particular:
+
+* The portal number space is huge, but the implementation supports just the values 0-63 [4].
+* Only messages addressed to PID 12345 get delivered. This is despite the fact that LNet allows us to specify any address in the LNetGet, LNetPut and LNetMEAttach subroutines.
+* ME matches are constrained to either all network interfaces or to those matching a single NID, i.e., a set of NIDs cannot be specified.
+* No processor affinity support.
+
+Combined, this translates to LNet only supporting a single PID (12345) with up to 64 portals, out of which about half (34, actually) seem to be in use by Lustre and other clients. Looking at this another way: discounting the NID component of an external LNet address, out of the remaining 64 bits (32-bit PID and 32-bit Portal Number), only about 5 bits are available for Motr use! This forced the design to extend its external end point address to cover a portion of the match bit space, represented by the Transfer Machine Identifier.
+
+Additional information on current LNet behavior can be found in [4].
+
+### Deployment
+Motr’s use of LNet must co-exist with simultaneous use of LNet by Lustre on the same host.
+
+#### Network
+LNet must be set up using existing tools and interfaces provided by Lustre. Dependency: [r.lnet.preconfigured].
+
+LNet transfer machine end point addresses are statically assigned to Motr runtime components through the central configuration database. The specification requires that the implementation use a disjoint set of portals from Lustre, primarily because of limitations in the LNet implementation. See Rationale for details.
+
+#### Core
+This specification will benefit if Lustre is distributed with a larger value of MAX_PORTALS than the current value of 64 in Lustre 2.0.
+
+#### Installation
+LNet is capable of running without Lustre, but currently is distributed only through Lustre packages.
It is not in the scope of this document to require changes to this situation, but it would be beneficial to pure Motr servers (non-Lustre) to have LNet distributed in packages independent of Lustre. + +### References +* [1] T1 Task Definitions +* [2] Mero Summary Requirements Table +* [3] m0 Glossary +* [4] m0LNet Preliminary Design Questions +* [5] RPC Bulk Transfer Task Plan +* [6] HLD of the FOP state machine diff --git a/doc/HLD-Version-Numbers.md b/doc/HLD-Version-Numbers.md index 56cd4df4c55..c2f83bab753 100644 --- a/doc/HLD-Version-Numbers.md +++ b/doc/HLD-Version-Numbers.md @@ -13,7 +13,7 @@ A version number is stored together with the file system state whose version it ## Definitions See the Glossary for general M0 definitions and HLD of FOL for the definitions of file system operation, update, and lsn. The following additional definitions are required: -- For the present design, it is assumed that a file system update acts on units (r.dtx.units). For example, a typical meta-data update acts on one or more "inodes" and a typical data update acts on inodes and data blocks. Inodes, data blocks, directory entries, etc. are all examples of units. It is further assumed that units involved in an update are unambiguously identified (r.dtx.units.identify) and that a complete file system state is a disjoint union of states comprising units. (there are consistent relationships between units, e.g., the inode nlink counter must be consistent with the contents of directories in the name-space). +- For the present design, it is assumed that a file system update acts on units (r.dtx.units). For example, a typical meta-data update acts on one or more "inodes" and a typical data update acts on inodes and data blocks. Inodes, data blocks, directory entries, etc. are all examples of units. It is further assumed that units involved in an update are unambiguously identified (r.dtx.units.identify) and that a complete file system state is a disjoint union of states comprising units. (Of course, there are consistent relationships between units, e.g., the inode nlink counter must be consistent with the contents of directories in the name-space). - It is guaranteed that operations (updates and queries) against a given unit are serializable in the face of concurrent requests issued by the file system users. This means that the observable (through query requests) unit state looks as if updates of the unit were executed serially in some order. Note that the ordering of updates is further constrained by the distributed transaction management considerations which are outside the scope of this document. - A unit version number is an additional piece of information attached to the unit. A version number is drawn from some linearly ordered domain. A version number changes on every update of the unit state in such a way that the ordering of unit states in the serial history can be deduced by comparing version numbers associated with the corresponding states. 
@@ -27,7 +27,7 @@ See the Glossary for general M0 definitions and HLD of FOL for the definitions o
 
 ## Design Highlights
 
-In the presence of caching, requirements [r.verno.resource] and [r.verno.fol] are seemingly contradictory: if two caching client nodes assigned (as allowed by [r.verno.resource]) version numbers to two independent units, then after re-integration of units to their common primary server, the version numbers must refer to primary fol, but clients cannot produce such references without extremely inefficient serialization of all accesses to the units on the server.
+In the presence of caching, requirements [r.verno.resource] and [r.verno.fol] are seemingly contradictory: if two caching client nodes assigned (as allowed by [r.verno.resource]) version numbers to two independent units, then after re-integration of units to their common master server, the version numbers must refer to the master's fol, but clients cannot produce such references without extremely inefficient serialization of all accesses to the units on the server.
 
 To deal with that, a version number is made compound: it consists of two components:
 
diff --git a/doc/HLD-of-FDMI.md b/doc/HLD-of-FDMI.md
new file mode 100644
index 00000000000..62a018d50ff
--- /dev/null
+++ b/doc/HLD-of-FDMI.md
@@ -0,0 +1,464 @@
+# HLD of FDMI
+This document presents a High-Level Design (HLD) of Motr’s FDMI interface.
+
+### Introduction
+This document specifies the design of the Motr FDMI interface. FDMI is a part of the Motr product and provides an interface for the Motr plugins. It horizontally extends the features and capabilities of the system. The intended audience for this document includes product architects, developers, and QA engineers.
+
+### Definitions
+* FDMI: File Data Manipulation Interface
+* FDMI source
+* FDMI plugin
+* FDMI source dock
+* FDMI plugin dock
+* FDMI record
+* FDMI record type
+* FDMI filter
+
+
+## Overview
+Motr is a storage core capable of deployment for a wide range of large-scale storage regimes, from cloud and enterprise systems to exascale HPC installations. FDMI is a part of the Motr core, providing an interface for plugin implementation. FDMI is built around the core and allows for horizontally extending the features and capabilities of the system in a scalable and reliable manner.
+
+## Architecture
+This section provides architectural information, including but not limited to:
+1. Common design strategy, including:
+   * General approach to the decomposition
+   * Chosen architecture style and template, if any
+2. Key aspects and considerations that affect the other designs
+
+### FDMI Position in Overall Motr Core Design
+FDMI is an interface to allow the Motr Core to scale horizontally. The scaling includes two aspects:
+* Core expansion in the aspect of adding core data processing abilities, covering growing data volumes as well as transformation into alternative representations. This expansion is provided by introducing FDMI plugins. The initial design implies that FOL records are the only data that plugins can process.
+* Core expansion in the aspect of adding new types of data that the core can feed to plugins. This sort of expansion is provided by introducing FDMI sources. The initial design implies that the FOL record is the only source data type that Motr Core provides.
+
+The FDMI plugin is linked with the Motr Core to make use of the corresponding FDMI interfaces. It runs as a part of the Motr instance or service and provides various capabilities (data storing, etc.).
The purpose of the FDMI plugin is to receive notifications from the Motr Core on events of interest to the plugin, and to further post-process the received events to produce some additional classes of service that the Core is currently not able to provide.
+
+The FDMI source is a part of a Motr instance, linked with the appropriate FDMI interfaces and allowing connection to additional data providers.
+
+Considering the amount of data the Motr Core operates on, it is obvious that a plugin typically requires a substantially reduced bulk of data to be routed to it for post-processing. The subscription mechanism provides this data reduction, to particular data types and to conditions met at runtime. The subscription mechanism is based on a set of filters that the plugin registers in the Motr Filter Database during its initialization.
+
+The source, in its turn, refreshes its subset of filters against the database. The subset selects filters from the overall set based on the data types the source feeds to the FDMI, as well as on the operations with the data that the source supports.
+
+### FDMI Roles
+The FDMI consists of APIs that implement particular roles as per the FDMI use cases.
+The roles are:
+* plugin dock, responsible for:
+  * Plugin registration in the FDMI instance
+  * Filter registration in the Motr Filter Database
+  * Listening to notifications coming over RPC
+  * Payload processing
+  * Self-diagnostic (TBD)
+* Source dock (FDMI service), responsible for:
+  * Source registration
+  * Retrieving/refreshing the filter set for the source
+  * Input data filtration
+  * Deciding on and posting notifications to filter subscribers over Motr RPC
+  * Deferred input data release
+  * Self-diagnostic (TBD)
+
+![Image](./Images/FDMI_Dock_Architecture.png)
+
+### FDMI plugin dock
+
+**Initialization**
+
+![Image](./Images/FDMI_plugin_dock_Plugin_initialization.png)
+
+The application starts with getting a private FDMI plugin dock API allowing it to start communicating with the dock.
+Further initialization consists of registering a number of filters in the filterd database. Every filter instance is given, by the plugin creator, a filter ID unique across the whole system.
+On filter registration, the plugin dock checks the filter semantics. If the filter appears invalid, the registration process stops.
+
+**NB**:
+Although the filter check is performed at registration time, there can still be run-time errors during the filter condition check. The criteria for filter correctness will be defined later. If a filter is treated as incorrect by the FDMI source dock, the corresponding ADDB record is posted and, optionally, HA is informed.
+
+
+**Data Processing**
+![Image](./Images/FDMI_plugin_dock_Data_Processing.png)
+
+The remote FDMI instance having the source dock role provides the data payload via the RPC channel. The RPC sink calls back the local FDMI instance having the plugin dock role. The latter resolves the filter ID to a plugin callback and calls it, passing the data to the plugin instance.
+It may take some time for the plugin to do the post-processing and decide if the FDMI record can be released. Meanwhile, the plugin instructs the FDMI to notify the corresponding source to allow the particular FDMI record to be released.
+
+**Filter “active” status**
+![image](./Images/FDMI_Controlling_filter_status.png)
+
+The filter active status is used to enable or disable the filter in the database. When the active status of a filter changes, all the registered sources are notified.
If the filter active status is set to false (the filter is disabled), it is ignored by sources.
+The application plugin can change the filter active status by sending the enable filter or disable filter command for an already registered filter:
+* The initial value of the filter active status is specified during filter registration.
+* To enable/disable the filter, the application sends an enable filter or disable filter request to the filter service. The filter ID is specified as a parameter.
+
+**De-initialization**
+![image](./Images/FDMI_plugin_dock_Plugin_deinitialization.png)
+
+The plugin initiates de-initialization by calling the local FDMI. The latter deregisters the plugin’s filter set with the filterd service. After confirmation, it deregisters the associated plugin’s callback function.
+All the registered sources are notified about changes in the filter set, if any occurred as the result of the plugin coming off.
+
+### FDMI source dock
+**Initialization**
+
+![image](./Images/FDMI_source_dock_Source_initialization.png)
+
+* TBD where to validate, on the source side or inside the FDMI.
+
+The FDMI source dock does not need explicit registration in filterd. Each FDMI source dock on start requests the filter list from filterd and stores it locally.
+
+In order to notify the FDMI source dock about ongoing changes in the filter data set, the Resource Manager’s lock mechanism is used. Filter change notification: TBD. On a read operation, the FDMI source acquires a read lock for the filterd database. On a filter metadata change, each instance holding a read lock is notified.
+
+On receiving a filter metadata change notification, the FDMI source dock re-requests the filter data set.
+On receiving each new filter, the FDMI source dock parses it, checks it for consistency, and stores an in-memory representation suitable for calculations.
+
+As an optimization, an execution plan could be built for every added filter and kept along with it. As an option, the execution plan can be built on every data filtering action, trading CPU cycles for memory consumption.
+
+
+**Input Data Filtering**
+![image](./Images/FDMI_Input_Data_Filtering.png)
+
+*In case of an RPC channel failure, the input data reference counter has to be decremented. See Deferred Input Data Release.
+**The RPC mechanism is responsible for reliable data delivery, and is expected to do its best to re-send data that appears to be stuck in the channel. In the same way, it is responsible for freeing RPC items once the connection is found broken.
+Steps (***) and (****) are needed to lock the data entry during internal FDMI record processing, to make sure that the source does not dispose of it before the FDMI engine evaluates all filters. Step (***), in particular, increases the counter for each record the FDMI sends out; the matching decrement operation is not displayed on this diagram and is discussed later.
+
+When input data identified by an FDMI record ID goes to the source, the latter calls the local FDMI instance with the data. Once the data has arrived, the FDMI starts iterating through the local filter set.
+
+According to its in-memory representation (execution plan), each filter is traversed node by node, and for every node a predicate result is calculated by the appropriate source callback.
+
+**NB**:
+It is expected that the source will be provided with operand definitions only. Inside the callback, the source is going to extract the corresponding operand as per the description passed in. The predicate result is calculated as per the extracted and transformed data.
+
+Note how the FDMI handles the tree: all the operations are evaluated by the FDMI engine, and only getting the atomic values from the FDMI record payload is delegated to the source.
+
+When traversing is completed, the FDMI engine calculates the final Boolean result for the filter tree and decides whether to put the serialized input data onto the RPC for the plugin associated with the filter.
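+
+The traversal just described can be sketched in a few lines of C. The node layout, opcode set, and callback type below are hypothetical illustrations of the execution-plan idea, not definitions from this document; the only point taken from the text is that the engine evaluates all operations itself and delegates only the extraction of atomic record values to the source callback.
+
+```c
+#include <stdint.h>
+
+enum node_kind { NK_OP, NK_CONST, NK_RECORD_FIELD };
+
+/* Hypothetical execution-plan node. */
+struct filter_node {
+        enum node_kind            fn_kind;
+        int                       fn_opcode;    /* NK_OP only           */
+        const struct filter_node *fn_left;
+        const struct filter_node *fn_right;
+        int64_t                   fn_const;     /* NK_CONST only        */
+        int                       fn_field_id;  /* NK_RECORD_FIELD only */
+};
+
+/* Source callback extracting an operand from the record payload. */
+typedef int64_t (*field_get_cb)(const void *rec, int field_id);
+
+static int64_t eval(const struct filter_node *n, const void *rec,
+                    field_get_cb get)
+{
+        int64_t l, r;
+
+        switch (n->fn_kind) {
+        case NK_CONST:
+                return n->fn_const;
+        case NK_RECORD_FIELD:
+                return get(rec, n->fn_field_id);  /* delegated to source */
+        case NK_OP:
+                l = eval(n->fn_left, rec, get);
+                r = eval(n->fn_right, rec, get);
+                switch (n->fn_opcode) {
+                case '&': return l && r;
+                case '|': return l || r;
+                case '=': return l == r;
+                case '<': return l <  r;
+                default:  return 0;  /* unknown op: treat as mismatch */
+                }
+        }
+        return 0;
+}
+```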
+
+#### Deferred Input Data Release
+![image](./Images/FDMI_source_dock_Deferred_input_data_release.png)
+
+The input data may need to remain preserved in the source until the moment when the plugin no longer needs it. Preservation implies protection from deletion and modification. The data processing inside the plugin is, in general, an asynchronous process, and the plugin is expected to notify the corresponding source, allowing it to release the data. The message comes from the plugin to the FDMI instance hosting the corresponding source.
+
+### FDMI Service Found Dead
+When interaction between Motr services results in a timeout exceeding a pre-configured value, the non-responding service needs to be announced dead across the whole system. First of all, the confc client is notified by HA about the service not responding and announced dead. After being marked dead in the confc cache, the service has to be reported by HA to filterd as well.
+
+
+## Interfaces
+1. FDMI service
+2. FDMI source registration
+3. FDMI source implementation guideline
+4. FDMI record
+5. FDMI record post
+6. FDMI source dock FOM
+    i. Normal workflow
+    ii. FDMI source: filters set support
+    iii. Corner cases (plugin node dead)
+7. FilterD
+8. FilterC
+9. FDMI plugin registration
+10. FDMI plugin dock FOM
+11. FDMI plugin implementation guideline
+
+## FDMI Service
+![image](./Images/FDMI_Service_Startup.png)
+
+The FDMI service runs as a part of a Motr instance. The FDMI service stores context data for both the FDMI source dock and the FDMI plugin dock. The FDMI service is initialized and started on Motr instance start-up; the FDMI source dock and FDMI plugin dock are both unconditionally initialized on service start.
+
+**TBD**:
+
+Later, the docks can be managed separately, and a specific API may be provided for this purpose.
+
+
+### FDMI source registration
+![image](./Images/FDMI_source_registration.png)
+
+The main task of an FDMI source instance is to post FDMI records of a specific type to the FDMI source dock for further analysis. Only one FDMI source instance with a specific type should be registered: the FDMI record type uniquely identifies the FDMI source instance. A list of FDMI record types:
+
+* FOL record type
+* ADDB record type
+* TBD
+
+The FDMI source instance provides the following interface functions for the FDMI source dock to handle the FDMI records:
+* Test filter condition
+* Increase/decrease record reference counter
+* Xcode functions
+
+On FDMI source registration, all its internals are initialized and saved as FDMI generic source context data. A pointer to the FDMI source instance is passed to the FDMI source dock and saved in a list. In its turn, the FDMI source dock provides back to the FDMI source instance an interface function to perform the FDMI record posting. The FDMI generic source context stores the following:
+
+* FDMI record type
+* FDMI generic source interface
+* FDMI source dock interface
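+
+A possible C rendering of this context is sketched below. The struct and field names are hypothetical (the document does not define them); the operations simply mirror the interface functions and context fields listed above.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical shape of the generic source interface. */
+struct fdmi_src_ops {
+        /* Test filter condition against a posted record. */
+        bool (*fso_filter_test)(const void *rec, const void *filter);
+        /* Increase/decrease the record reference counter. */
+        void (*fso_ref_get)(void *rec);
+        void (*fso_ref_put)(void *rec);
+        /* Xcode function: serialize the record for sending over RPC. */
+        int  (*fso_encode)(const void *rec, void *buf, size_t buflen);
+};
+
+/* Hypothetical generic source context. */
+struct fdmi_src_ctx {
+        int                        fsc_rec_type;  /* FDMI record type      */
+        const struct fdmi_src_ops *fsc_ops;       /* generic source i/face */
+        /* Source dock interface: the posting function handed back to
+         * the source at registration time. */
+        int                      (*fsc_post)(void *rec, uint64_t *rec_id);
+};
+```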
+
+### FDMI source implementation guideline
+The FDMI source implementation depends on the data domain. A specific FDMI source type stores:
+* FDMI generic source interface
+* FDMI specific source context data (source private data)
+
+Currently, the FDMI FOL source is implemented as the first FDMI source. The FDMI FOL source provides the ability for detailed FOL data analysis. As per the generic FOL record knowledge, the test filter condition function implemented by the FOL source checks FOP data: the FOL operation code and a pointer to the FOL record specific data.
+
+For the FOL record specific data handling, an FDMI FOL record type is declared and registered for each specific FOL record type. For example, the write operation FOL record, the set attributes FOL record, etc.
+
+The FDMI FOL record type context stores the following:
+* FOL operation code
+* FOL record interface
+
+The FOL record interface functions are aware of the particular FOL record structure and provide basic primitives to access the data:
+
+* Check condition
+
+On FDMI FOL record type registration, all its internals are initialized and saved as FDMI FOL record context data. A pointer to the FDMI FOL record type is stored in a list in the FDMI specific source context data.
+
+### FDMI Record Post
+![image](./Images/FDMI_source_dock_Source_Feed.png)
+
+The source starts with locally locking the data to be fed to the FDMI interface, then it calls the FDMI post API. On the FDMI side, a new FDMI record (data container) is created with a new record ID, and the posted data gets packed into the record. The record is queued for further processing to the FDMI FOM queue, and the FDMI record ID is returned to the source.
+
+To process further callbacks from the FDMI about particular data (such as the original record), the source establishes the relation between the returned FDMI record ID and the original record identifier.
+
+**NB**:
+The source is responsible for the initial record locking (incrementing the ref counter), but the FDMI is responsible for further record release.
+
+### FDMI Source Dock FOM
+The FDMI source dock FOM implements the main control flow for the FDMI source dock:
+* Takes out posted FDMI records
+* Examines filters
+* Sends notifications to FDMI plugins
+* Analyzes FDMI plugin responses
+
+
+**Normal workflow**
+The FDMI source dock FOM remains in an idle state if no FDMI record is posted (the FDMI record queue is empty). If any FDMI record is posted, the FOM switches to the busy state to take the FDMI record out of the queue and start the analysis.
+
+Before examining a record against all the filters, the FOM requests the filter list from filterc. On getting the filter list, the FOM iterates through the filter list and examines the filters one by one. If the number of filters is large, a possible option is to limit the number of filters examined in one FOM invocation, to avoid task blocking.
+
+To examine a filter, the FOM builds a filter execution plan. The filter execution plan is a tree structure, with expressions specified in its nodes.
+
+Each expression is described by an elementary operation to execute and one or two operands. An operand may be a constant value, an already calculated result of a previous condition check, or an FDMI record specific field value.
+
+The FDMI source dock calculates all expressions by itself. If some of the operands are FDMI record specific field values, the source dock executes a callback provided by the FDMI source to get the operand value.
+
+Also, the set of operations supported during filter execution by the FDMI source dock can be extended.
So the FDMI source can add new operation codes and corresponding handlers to support processing data types specific to the FDMI source. Operation overloading is not supported: if the FDMI source wants to define multiplication for some “non-standard” type, it should add a new operation and a handler for that operation.
+
+If no filter shows a match for an FDMI record, the record is released. To inform the FDMI source that this record is no longer needed by the FDMI system, the FDMI generic source interface function “decrease record reference counter” is used.
+
+If one or more filters match the FDMI record, the record is scheduled to be sent to the particular FDMI node(s). If several filters matched, the following operations are performed to optimize the data flow:
+* Send the FDMI record only once to a particular FDMI node. The filter provides the RPC endpoint to communicate with.
+* Specify a list of matched filters, including only filters that are related to the node.
+* On receipt, the FDMI plugin dock is responsible for dispatching received FDMI records and passing them to plugins according to the specified matched filter list.
+
+![Image](./Images/FDMI_source_dock_FDMI_FOM.png)
+
+In order to manage the FDMI record I/O operations, the following information should be stored as the FDMI source dock context information:
+* The sent FDMI record, stored in an FDMI source dock communication context
+* The relation between the destination filter ID and the FDMI record ID being sent to the specified filter ID
+    * A map may be used in this case
+    * This information is needed to handle the corner case “Motr instance running “FDMI plugin dock” death” – see below.
+
+The FDMI record is serialized and sent using the FDMI generic source interface Xcode functions.
+
+To send the FDMI record, its reference counter is increased: the FDMI generic source interface function “increase record reference counter” is used.
+
+The FDMI source dock also increments its internal FDMI record reference counter for the sent FDMI record, once for each send operation.
+
+On the FDMI record receipt, the FDMI plugin dock should answer with a reply understood as a data delivery acknowledgement. The data acknowledgement should be sent as soon as possible – no blocking operations are allowed.
+
+![image](./Images/FDMI_source_dock_Release_Request_from_Plugin.png)
+
+On the received data acknowledgement, the internal FDMI record reference counter for the FDMI record is decremented. If the internal reference counter becomes 0, the FDMI record is removed from the FDMI source dock communication context.
+
+After the FDMI record is handled by all involved plugins, the FDMI plugin dock should send an FDMI record release request to the FDMI record originator (the FDMI source dock). On receiving this request, the FDMI source dock removes the appropriate pair from its context and informs the FDMI source that the record is released; the FDMI generic source interface function “decrease record reference counter” is used for this purpose. If the FDMI source reference counter for a particular FDMI record becomes 0, the FDMI source may release this FDMI record.
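+
+The two levels of reference counting described above can be sketched as follows. All names are hypothetical and the source's counter is shown mirrored in the dock context purely as a simplification; the sketch only restates the protocol: one internal dock counter incremented per send and decremented per delivery acknowledgement, and one source counter decremented on release requests.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical per-record context in the source dock. */
+struct rec_ctx {
+        uint64_t rc_rec_id;
+        int      rc_dock_refs;  /* one per outstanding send          */
+        int      rc_src_refs;   /* mirrors the source's ref counter  */
+};
+
+/* Hypothetical helpers. */
+extern void remove_from_comm_context(struct rec_ctx *r);
+extern void source_release_record(uint64_t rec_id);
+
+static void on_record_sent(struct rec_ctx *r)
+{
+        r->rc_src_refs++;       /* "increase record reference counter" */
+        r->rc_dock_refs++;      /* internal counter, once per send     */
+}
+
+static void on_delivery_ack(struct rec_ctx *r)
+{
+        if (--r->rc_dock_refs == 0)
+                remove_from_comm_context(r);
+}
+
+static void on_release_request(struct rec_ctx *r)
+{
+        /* "decrease record reference counter"; at zero the source
+         * may finally release the record. */
+        if (--r->rc_src_refs == 0)
+                source_release_record(r->rc_rec_id);
+}
+```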

+ Note: What value should be returned if “Test filter condition” cannot evaluate a particular filter? “Record mismatch” (a legal return code) or some error return code?

+
+**Filter set support**
+
+FilterC is responsible for storing a local copy of the filter database and for maintaining its consistency. By request, FilterC returns a set of filters related to the specified FDMI record type. The filter set request/response operation is simple and cheap to execute, because a pointer to local storage is returned. It allows the FDMI source dock to re-request the filter set from FilterC every time it needs it, without any resource over-usage. No additional actions need to be done by the FDMI source dock to maintain filter set consistency.
+
+**Corner cases**
+
+Special handling should be applied for the following corner cases:
+* Motr instance running “FDMI plugin dock” death
+* FDMI filter is disabled
+
+The death of a Motr instance running an FDMI plugin dock may cause two cases:
+
+* An RPC error while sending the FDMI record to the FDMI plugin dock. No data acknowledgement is received.
+* No FDMI record release request is received from the FDMI plugin dock.
+
+
+![Image](./Images/FDMI_source_dock_On_Plugin_Node_Dead.png)
+
+
+If an RPC error appears while sending the FDMI record to the FDMI plugin dock, the FDMI source dock should decrement the internal FDMI record reference counter and the FDMI source specific reference counter, following the general logic described above. In this case all the FDMI record context information is stored in the communication context; it makes it obvious how to fill in the parameters for the interface function calls.
+
+The case where no FDMI record release request is received from the FDMI plugin dock is not detected by the FDMI source dock explicitly. This case may cause FDMI records to be stored on the source for an unpredictable time period. It depends on the FDMI source domain: it may store FDMI records permanently, until receiving confirmation from the plugin of the FDMI record handling. Possible ways to avoid the described issue:
+
+* Based on some internal criteria, the FDMI source resets the reference counter information and re-posts the FDMI record.
+* The FDMI source dock receives a notification of the death of the node running the FDMI plugin dock. The notification is sent by HA.
+
+In the latter case, the FDMI source dock should inform the FDMI source to release all the FDMI records that were sent to plugins hosted on the dead node. To do this, the context information stored as the relation between the destination filter ID and the FDMI record ID is used: all the filters related to the dead node may be determined by EP address. The same handling that is done for an “FDMI record release request” should be done in this case for all the FDMI records bound to the specified filter IDs.
+
+![Image](./Images/FDMI_source_dock_Deferred_input_data_releases.png)
+
+The FDMI filter may be disabled by the plugin itself or by some third party (administrator, HA, etc.). On the filter state change (disabling the filter), a signal is sent to the FDMI source dock. Upon receiving this signal, the FDMI source dock iterates through the stored map and checks each filter status. If a filter status is found to be disabled, the same handling that is done for an “FDMI record release request” should be done for all the FDMI records bound to the specified filter ID.
+
+### FilterD
+The FDMI plugin creates a filter to specify the criteria for FDMI records. The FDMI filter service (filterD) maintains a central database of FDMI filters available in the Motr cluster. There is only one (possibly duplicated) Motr instance with the filterD service in the whole Motr cluster. The filterD provides users read/write access to its database via RPC requests.
+
+The filterD service starts as a part of the Motr instance chosen for this purpose. The address of the filterD service endpoint is stored in the confd database. The filterD database is empty after startup.
+
+The filterD database is protected by a distributed read/write lock. When the filterD database needs to be changed, the filterD service acquires an exclusive write lock from the Resource Manager (RM), thus invalidating all read locks held by database readers. This mechanism is used to notify readers about filterD database changes, forcing them to re-read the database content afterwards.
+
+There are two types of filterD users:
+* FDMI plugin dock
+* FDMI filter client (filterC)
+
+An FDMI filter description stored in the database contains the following fields:
+* Filter ID
+* Filter conditions stored in serialized form
+* RPC endpoint of the FDMI plugin dock that registered the filter
+* Node on which the FDMI plugin dock that registered the filter is running
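+
+A compact C rendering of such an entry is sketched below; the names are hypothetical, the first four fields mirror the list above, and the activity flag and record-type grouping correspond to the activate/deactivate requests and the grouping described next.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical filterD database entry. */
+struct filterd_entry {
+        uint64_t    fde_filter_id[2];  /* 128-bit filter ID            */
+        void       *fde_cond;          /* serialized filter conditions */
+        size_t      fde_cond_len;
+        const char *fde_plugin_ep;     /* plugin dock RPC endpoint     */
+        const char *fde_node;          /* node running that dock       */
+        bool        fde_active;        /* set by (de)activate requests */
+        int         fde_rec_type;      /* FDMI record type ID grouping */
+};
+```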
+
+The FDMI plugin dock can issue the following requests:
+* Add a filter with the provided description
+* Remove a filter by filter ID
+* Activate a filter by filter ID
+* Deactivate a filter by filter ID
+* Remove all filters by FDMI plugin dock RPC endpoint.
+
+There are also other events that cause the deactivation of some filters in the database:
+* HA notification about a node death
+
+Filters stored in the database are grouped by the FDMI record type ID they are intended for.
+
+FilterD clients can issue the following queries to filterD:
+* Get all FDMI record type IDs known to filterD
+* Get all FDMI filters registered for a specific FDMI record type ID
+
+**NB**:
+
+The initial implementation of filterD will be based on confd. In fact, two types of conf objects will be added to the confd database:
+* Directory of FDMI record type IDs
+* Directory of filter descriptions for a specific FDMI record type ID.
+
+This implementation makes the handling of HA notifications on filterD impossible, because confd doesn’t track the HA statuses of conf objects.
+
+### FilterC
+FilterC is a part of a Motr instance that locally caches filters obtained from filterD. The FDMI source dock initializes the FilterC service at its startup.
+
+Also, FilterC has a channel in its context which is signaled when some filter state is changed from enabled to disabled.
+
+FilterC achieves local cache consistency with the filterD database content by using the distributed read/write lock mechanism. A filterD database change is the only reason for a FilterC local cache update. HA notifications about filter or node death are ignored by FilterC.
+
+**NB**:
+
+The initial implementation of FilterC will be based on confc. So confc will cache filter descriptions locally. In that case, the implementation of the FilterC channel for signaling disabled filters is quite problematic.
+
+### FDMI Plugin Registration
+![Image](./Images/FDMI_plugin_dock_Plugin_Startup.png)
+
+* Filter ID:
+  * Expected to be 128 bits long
+  * The filter ID is provided by the plugin creator
+  * Providing filter ID uniqueness is the responsibility of the plugin creator
+  * The filter ID may reuse the m0_fid structure declaration
+
+
+**TBD**:
+
+A situation is possible when the plugin is notified with a filter ID that was already announced inactive, if the change did not reach the source by the moment the notification was emitted. Should the ID be passed to the plugin by the FDMI?
+
+Another thing to consider: what should the FDMI do in case the filter ID arriving in a notification is unknown to the node, i.e., no match to any locally registered filter rule is encountered?
+
+A complementary case occurs when the plugin has just been fed with an FDMI record and has not yet instructed the FDMI to release it. Instead, it declares the corresponding filter instance to be de-activated. The current approach implies that the plugin is responsible for properly issuing release commands once it has been fed with an FDMI record, disregarding the filter activation aspect.
+
+
+### FDMI Plugin Dock FOM
+![Image](./Images/FDMI_plugin_dock_On_FDMI_Record.png)
+
+A received FDMI record goes directly to the plugin dock’s FOM. At this time, a new session re-using the incoming RPC connection needs to be created and stored in the communication context, associated with the FDMI record ID. Immediately at this step, an RPC reply is sent confirming FDMI record delivery.
+
+Per filter ID, the corresponding plugin is called, feeding it with the FDMI data, the FDMI record ID, and the filter ID specific to the plugin. Every successful plugin feed results in incrementing the FDMI record ID reference counter. When done with the IDs, the FOM needs to test whether at least a single feed succeeded. In case it was not successful, i.e., there was not a single active filter encountered, or the plugins never confirmed FDMI record acceptance, the FDMI record has to be released immediately.
+
+The plugin decides on its own when to report the original FDMI record to be released by the source. It calls the plugin dock about releasing a particular record identified by the FDMI record ID. In the context of the call, the FDMI record reference counter is decremented locally, and in case the reference counter gets to 0, the corresponding source is called via RPC to release the record (see Normal workflow, FDMI Source Dock: Release Request from plugin).
+
+
+### FDMI Plugin Implementation Guideline
+The main logic behind the use of an FDMI plugin is a subscription to some events in sources that comply with the conditions described in the filters that the plugin registers at its start. In case some source record matches at least one filter, the source-originated record is routed to the corresponding plugin.
+
+#### Plugin responsibilities
+**During the standard initialization workflow, the plugin**:
+
+* Obtains the private plugin dock callback interface
+* Registers a set of filters, where the filter definition:
+  * Identifies the FDMI record type to be watched
+  * Provides the plugin callback interface
+  * Provides a description of the condition(s) the source record must meet to invoke a notification.
+
+
+**NB**:
+
+The condition description syntax must follow the source functionality completely and unambiguously. A source of the type specified by the filter description must understand every elementary construct of the condition definition. This way, the evolution of the filter definition syntax is going to be driven by the evolution of the source functionality.
+
+**NB**:
+
+The source is responsible for the validation of the filter definition. This may result in deactivating filters that violate the syntax rules the particular source supports. The facts of syntax violation ideally must become known, in some way, to the Motr cloud admin staff.
+
+* Starts the subscription by activating the registered filters (see the sketch below). Opaquely for the plugin, the filter set is propagated among the Motr nodes running the FDMI source dock role, which enables source record filtering and notifications.
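+
+A condensed sketch of this initialization flow follows. The document does not define the plugin dock API, so every name here (struct plugin_dock, plugin_dock_register_filter(), plugin_dock_filter_enable(), FDMI_REC_TYPE_FOL, and the condition string) is hypothetical; only the use of m0_fid for the filter ID is suggested by the text above, and it is shown in a simplified stand-in form.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+struct m0_fid { uint64_t f_container, f_key; };  /* simplified stand-in */
+struct plugin_dock;                              /* hypothetical handle */
+
+typedef int (*plugin_cb_t)(uint64_t rec_id, const void *data, size_t len,
+                           const struct m0_fid *filter_id);
+
+/* Hypothetical plugin dock calls. */
+extern int plugin_dock_register_filter(struct plugin_dock *dock,
+                                       const struct m0_fid *filter_id,
+                                       int rec_type, const char *cond,
+                                       plugin_cb_t cb);
+extern int plugin_dock_filter_enable(struct plugin_dock *dock,
+                                     const struct m0_fid *filter_id);
+
+enum { FDMI_REC_TYPE_FOL = 1 };                  /* hypothetical */
+
+static int my_cb(uint64_t rec_id, const void *data, size_t len,
+                 const struct m0_fid *filter_id)
+{
+        /* Track rec_id, process the data asynchronously, and
+         * eventually ask the dock to release the record, as the
+         * guideline requires. */
+        return 0;
+}
+
+static int my_plugin_init(struct plugin_dock *dock)
+{
+        /* Uniqueness of the 128-bit filter ID is the creator's duty. */
+        struct m0_fid fid = { .f_container = 0x70, .f_key = 1 };
+        int           rc;
+
+        rc = plugin_dock_register_filter(dock, &fid, FDMI_REC_TYPE_FOL,
+                                         "op == CREATE" /* placeholder */,
+                                         my_cb);
+        if (rc != 0)
+                return rc;
+        /* Start the subscription by activating the filter. */
+        return plugin_dock_filter_enable(dock, &fid);
+}
+```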
+
+During an active subscription, the workflow looks like the following:
+* The plugin is called back with:
+  * The FDMI record ID
+  * The FDMI record data
+  * The filter ID indicating the filter that signaled during the original source record processing
+
+* The plugin must keep track of the FDMI record (identified by the FDMI record ID, globally unique across the Motr cloud) during its internal processing.
+* The plugin must return from the callback as quickly as possible, to not block other callback interfaces from being called. Plugin writers must take into account the fact that several plugins may be registered simultaneously, and therefore must do their best to provide smooth cooperation among those.
+* However, the plugin is allowed to take as much time as required for the FDMI record processing. During the entire processing, the FDMI record remains locked in its source.
+* When done with the record, the plugin is responsible for the record release.
+* The plugin is allowed to activate/deactivate any subset of its registered filters. The decision making is entirely on the plugin’s side.
+* In the same way, the plugin is allowed to de-register and quit any time it wants. The decision making is again entirely on the plugin’s side. After de-registering itself, the plugin is not allowed to call the private FDMI plugin dock API for filter activation/deactivation or FDMI record release. The said actions become available only after registering the filter set another time.
+
+
+## Implementation Plan
+
+Phase 1
+1. Implement FDMI service
+2. FDMI source dock
+    i. FDMI source dock API
+    ii. Generic FDMI records handling (check against filters, send matched records (only one recipient is supported))
+    iii. Handle FDMI records deferred release
+3. FOL as FDMI source support
+    i. Generic FDMI source support
+    ii. Limited FOL data analysis (operation code only)
+4. Filters
+    i. Simplified filters storing and propagation (use confd, confc)
+    ii. Static filter configuration
+    iii. Limited filtering operation set
+    iv. Generic filters execution
+5. FDMI plugin dock
+    i. FDMI plugin dock API
+    ii. Generic FDMI records handling (receive records, pass records to target filter)
+6. Sample echo plugin
+
+
+Backlog
+1. Filters
+    i. FilterD, FilterC
+    ii. Full filtering operation set
+    iii. Register/deregister filter command
+    iv. Enable/disable filter command
+    v. Filter sanity check
+    vi. Query language to describe filters
+2. FDMI source dock
+    i. Multiple localities support
+    ii. Filters validation
+    iii. FDMI kernel mode support
+    iv. Support several concurrent RPC connections to clients (FDMI plugin docks)
+3. FDMI plugin dock
+    i. Filter management (register/enable/disable)
+4. HA support (node/filter is dead) in:
+    i. FDMI source dock
+    ii. FDMI plugin dock
+    iii. Filters subsystem
+5. FOL as FDMI source support
+    i. FOL data full analysis support
+    ii. Transactions support (rollback/roll-forward)
+6. ADDB diagnostic support in both FDMI source dock and plugin dock
+7. ADDB as FDMI source support
diff --git a/doc/HLD-of-FOL.md b/doc/HLD-of-FOL.md
new file mode 100644
index 00000000000..28b73ffa190
--- /dev/null
+++ b/doc/HLD-of-FOL.md
@@ -0,0 +1,170 @@
+# High-Level Design of a File Operation Log
+This document provides a High-Level Design **(HLD)** of a File Operation Log **(FOL)** of the Motr M0 core. The main purposes of this document are:
+1. To be inspected by M0 architects and peer designers to ensure that the HLD is aligned with M0 architecture and other designs and contains no defects.
+2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and Detailed Level Design **(DLD)** of the same component.
+3. To serve as a design reference document.
+
+The intended audience of this document consists of M0 customers, architects, designers, and developers.
+
+## Introduction
+A FOL is a central M0 data structure, maintained by every node where the M0 core is deployed, and serves multiple goals:
+- It is used by a node database component to implement local transactions through WAL logging.
+- It is used by DTM to implement distributed transactions. DTM uses FOL for multiple purposes internally:
+  - On a node where a transaction originates: for replay;
+  - On a node where a transaction update is executed: for undo, redo, and for replay (sending redo requests to recovering nodes);
+  - To determine when a transaction becomes stable;
+- It is used by a cache pressure handler to determine what cached updates have to be re-integrated into upward caches;
+- It is used by FDMI to feed file system updates to FOL consumers asynchronously;
+- More generally, a FOL is used by various components (snapshots, addb, etc.) to consistently reconstruct the system state at a certain moment in the (logical) past.
+
+Roughly speaking, a FOL is a partially ordered collection of FOL records, each corresponding to (part of) a consistent modification of the file system state. A FOL record contains information determining the durability of the modification (how many volatile and persistent copies it has and where, etc.) and dependencies between modifications, among other things. When a client node has to modify a file system state to serve a system call from a user, it places a record in its (possibly volatile) FOL. The record keeps track of the operation state: has it been re-integrated to servers, has it been committed on the servers, etc. A server, on receiving a request to execute an update on a client's behalf, inserts a record describing the request into its FOL. Eventually, the FOL is purged to reclaim storage, culling some of the records.
+
+## Definitions
+- a (file system) operation is a modification of a file system state preserving file system consistency (i.e., when applied to a file system in a consistent state it produces a consistent state). There is a limited repertoire of operation types: mkdir, link, create, write, truncate, etc. The M0 core maintains serializability of operation execution;
+- an update (of an operation) is a sub-modification of a file system state that modifies the state on a single node only. For example, a typical write operation against a RAID-6 striped file includes updates that modify data blocks on a server A and updates that modify parity blocks on a server B;
+- an operation or update undo is a reversal of state modification, restoring the original state. An operation can be undone only when the parts of the state it modifies are compatible with the operation having been executed. Similarly, an operation or update redo is modifying state in the "forward" direction, possibly after an undo;
+- the recovery is a distributed process of restoring file system consistency after a failure.
The M0 core implements recovery in terms of undoing and coherently redoing individual updates;
+- an update (or more generally, a piece of a file system state) is persistent when it is recorded on persistent storage, where persistent storage is defined as storage whose contents survive a reset;
+- an operation (or more generally, a piece of a file system state) is durable when it has enough persistent copies to survive any failure. Note that even if a record of a durable operation exists after a failure, the recovery might decide to undo the operation to preserve overall system consistency. Also, note that the notion of failure is configuration-dependent;
+- an operation is stable when it is guaranteed that the system state would be consistent with this operation having been executed. A stable operation is always durable. Additionally, the M0 core guarantees that the recovery would never undo the operation. The system can renege on the stability guarantee only in the face of a catastrophe or a system failure;
+- updates U and V are conflicting or non-commutative if the final system state after U and V are executed depends on the relative order of their execution (note that the system state includes information and result codes returned to the user applications); otherwise, the updates are commutative. The later of the two non-commutative updates depends on the earlier one. The earlier one is a pre-requisite of the later one (this ordering is well-defined due to serializability);
+- to maintain operation serializability, all conflicting updates of a given file system object must be serialized. An object version is a number associated with a storage object, with the property that for any two conflicting updates U (a pre-requisite) and V (a dependent update) modifying the object, the object version at the moment of U's execution is less than at the moment of V's execution;
+- a FOL of a node is a sequence of records, each describing an update carried out on the node, together with information identifying the operations these updates are parts of, the file system objects that were updated and their versions, and containing enough data to undo or redo the updates and to determine operation dependencies;
+
+- a record in a node FOL is uniquely identified by a Log Sequence Number (LSN). Log sequence numbers have two crucial properties:
+  - a FOL record can be found efficiently (i.e., without FOL scanning) given its LSN, and
+  - for any pair of conflicting updates recorded in the FOL, the LSN of the pre-requisite is less than that of the dependent update.
+
+

+ Note: This property implies that the LSN data type has an infinite range and hence is unimplementable in practice. In practice, the property holds only for two conflicting updates sufficiently close in logical time, where the precise closeness condition is defined by the FOL pruning algorithm. The same applies to object versions.

+ +

+ Note: It would be nice to refine the terminology to distinguish between an operation description (i.e., the intent to carry it out) and its actual execution. This would make the description of dependencies and recovery less obscure, at the expense of some additional complexity.

+ +## Requirements + +- `[R.FOL.EVERY-NODE]`: every node where M0 core is deployed maintains FOL; +- `[R.FOL.LOCAL-TXN]`: a node FOL is used to implement local transactional containers +- `[R.FOL]`: A File Operations Log is maintained by M0; +- `[R.FOL.VARIABILITY]`: FOL supports various system configurations. FOL is maintained by every M0 back-end. FOL stores enough information to efficiently find modifications to the file system state that has to be propagated through the caching graph, and to construct network-optimal messages carrying these updates. A FOL can be maintained in volatile or persistent transactional storage; +- `[R.FOL.LSN]`: A FOL record is identified by an LSN. There is a compact identifier (LSN) with which a log record can be identified and efficiently located; +- `[R.FOL.CONSISTENCY]`: A FOL record describes a storage operation. A FOL record describes a complete storage operation, that is, a change to a storage system state that preserves state consistency; +- `[R.FOL.IDEMPOTENCY]`: A FOL record application is idempotent. A FOL record contains enough information to detect that operation is already applied to the state, guaranteeing EOS (Exactly Once Semantics); +- `[R.FOL.ORDERING]`: A FOL records are applied in order. A FOL record contains enough information to detect when all necessary pre-requisite state changes have been applied; +- `[R.FOL.DEPENDENCIES]`: Operation dependencies can be discovered through FOL. FOL contains enough information to determine dependencies between operations; +- `[R.FOL.DIX]`: FOL supports DIX; +- `[R.FOL.SNS]`: FOL supports SNS; +- `[R.FOL.REINT]`: FOL can be used for cache reintegration. FOL contains enough information to find out what has to be re-integrated; +- `[R.FOL.PRUNE]`: FOL can be pruned. A mechanism exists to determine what portions of FOL can be re-claimed; +- `[R.FOL.REPLAY]`: FOL records can be replayed; +- `[R.FOL.REDO]`: FOL can be used for redo-only recovery; +- `[R.FOL.UNDO]`: FOL can be used for undo-redo recovery; +- `[R.FOL.EPOCHS]`: FOL records for a given epoch can be found efficiently; +- `[R.FOL.CONSUME.SYNC]`: storage applications can process FOL records synchronously; +- `[R.FOL.CONSUME.ASYNC]`: storage applications can process FOL records asynchronously; +- `[R.FOL.CONSUME.RESUME]`: a storage application can be resumed after a failure; +- `[R.FOL.ADDB]`: FOL is integrated with ADDB. ADDB records matching a given FOL record can be found efficiently; +- `[R.FOL.FILE]`: FOL records pertaining to a given file (-set) can be found efficiently. + +## Design Highlights +A FOL record is identified by its LSN. LSN is defined and selected as to be able to encode various partial orders imposed on FOL records by the requirements. + +## Functional Specification +The FOL manager exports two interfaces: +- main interface used by the request handler. Through this interface FOL records can be added to the FOL and the FOL can be forced (i.e., made persistent up to a certain record); +- auxiliary interfaces, used for FOL pruning and querying. + +## Logical Specification + +### Overview +FOL is stored in a transactional container [1] populated with records indexed [2] by LSN. An LSN is used to refer to a point in FOL from other meta-data tables (epochs table, object index, sessions table, etc.). 
To make such references more flexible, a FOL, in addition to genuine records corresponding to updates, might contain pseudo-records marking points of interest in the FOL to which other file system tables might want to refer (for example, an epoch boundary, a snapshot origin, a new server secret key, etc.). By abuse of terminology, such pseudo-records will be called FOL records too. Similarly, as part of the redo-recovery implementation, DTM might populate a node FOL with records describing updates to be performed on other nodes.
+
+[1] [R.BACK-END.TRANSACTIONAL] ST
+[2] [R.BACK-END.INDEXING] ST
+
+### Record Structure
+A FOL record, added via the main FOL interface, contains the following:
+- an operation opcode, identifying the type of file system operation;
+- an LSN;
+- information sufficient to undo and redo the update described by the record, including:
+  - for each file system object affected by the update, its identity (a fid) and its object version, identifying the state of the object in which the update can be applied;
+  - any additional operation type-dependent information (file names, attributes, etc.) necessary to execute or roll back the update;
+- information sufficient to identify other updates of the same operation (if any) and their state. For the present design specification it's enough to posit that this can be done by means of some opaque identifier;
+- for each object modified by the update, a reference (in the form of an lsn) to the record of the previous update to this object (null if the update is object creation). This reference is called the prev-lsn reference;
+- distributed transaction management data, including the epoch this update and its operation are parts of;
+- liveness state: a number of outstanding references to this record.
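+
+The record fields listed above map naturally onto a structure like the following. The layout and names are hypothetical (the document defines the content, not the encoding), and struct m0_fid is shown in a simplified stand-in form.
+
+```c
+#include <stdint.h>
+
+struct m0_fid { uint64_t f_container, f_key; };  /* simplified stand-in */
+
+/* One affected object: identity, version, prev-lsn reference. */
+struct fol_obj_ref {
+        struct m0_fid for_fid;       /* object identity (fid)            */
+        uint64_t      for_version;   /* version the update applies to    */
+        uint64_t      for_prev_lsn;  /* prev-lsn reference; 0 = creation */
+};
+
+/* Hypothetical in-memory shape of a FOL record. */
+struct fol_rec {
+        uint32_t            fr_opcode;   /* file system operation type    */
+        uint64_t            fr_lsn;      /* this record's LSN             */
+        uint32_t            fr_nr_objs;
+        struct fol_obj_ref *fr_objs;     /* objects touched by the update */
+        void               *fr_op_data;  /* type-dependent undo/redo data */
+        uint64_t            fr_op_id;    /* opaque operation identifier   */
+        uint64_t            fr_epoch;    /* DTM epoch                     */
+        uint32_t            fr_refs;     /* liveness: outstanding refs    */
+};
+```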
+
+### Liveness and Pruning
+A node FOL must be prunable, if only to function correctly on a node without persistent storage. At the same time, a variety of sub-systems, both from the M0 core and outside of it, might want to refer to FOL records. To make pruning possible and flexible, each FOL record is augmented with a reference counter, counting all outstanding references to the record. A record can be pruned if its reference count drops to 0, together with the reference counters of all earlier (in the lsn sense) unpruned records in the FOL.
+
+### Conformance
+- `[R.FOL.EVERY-NODE]`: on nodes with persistent storage, M0 core runs in the user space and the FOL is stored in a database table. On a node without persistent storage, or where M0 core runs in the kernel space, the FOL is stored in a memory-only index. The database and the memory-only index provide the same external interface, making FOL code portable;
+- `[R.FOL.LOCAL-TXN]`: the request handler inserts a record into the FOL table in the context of the same transaction where the update is executed. This guarantees the WAL property of the FOL;
+- `[R.FOL]`: vacuous;
+- `[R.FOL.VARIABILITY]`: FOL records contain enough information to determine where to forward updates to;
+- `[R.FOL.LSN]`: explicitly by design;
+- `[R.FOL.CONSISTENCY]`: explicitly by design;
+- `[R.FOL.IDEMPOTENCY]`: object versions stored in every FOL record are used to implement EOS;
+- `[R.FOL.ORDERING]`: object versions and LSN are used to implement ordering;
+- `[R.FOL.DEPENDENCIES]`: object versions and epoch numbers are used to track operation dependencies;
+- `[R.FOL.DIX]`: the distinction between operation and update makes multi-server operations possible;
+- `[R.FOL.SNS]`: same as for `[R.FOL.DIX]`;
+- `[R.FOL.REINT]`: the cache pressure manager on a node keeps a reference to the last re-integrated record using the auxiliary FOL interface;
+- `[R.FOL.PRUNE]`: explicitly by design;
+- `[R.FOL.REPLAY]`: the same as `[R.FOL.REINT]`: a client keeps a reference to the earliest FOL record that might require a replay. Liveness rules guarantee that all later records are present in the FOL;
+- `[R.FOL.REDO]`: by design, a FOL record contains enough information for an update redo. See DTM documentation for details;
+- `[R.FOL.UNDO]`: by design, a FOL record contains enough information for an update undo. See DTM documentation for details;
+- `[R.FOL.EPOCHS]`: an epoch table contains references (LSN) to FOL (pseudo-)records marking epoch boundaries;
+- `[R.FOL.CONSUME.SYNC]`: the request handler feeds a FOL record to registered synchronous consumers in the same local transaction context where the record is inserted and where the operation is executed;
+- `[R.FOL.CONSUME.ASYNC]`: asynchronous FOL consumers receive batches of FOL records from multiple nodes and consume them in the context of the distributed transactions of which these records are parts;
+- `[R.FOL.CONSUME.RESUME]`: the same mechanism is used for resumption of FOL consumption as for re-integration and replay: a reference to the last consumed FOL record is updated transactionally with consumption;
+- `[R.FOL.ADDB]`: see ADDB documentation for details;
+- `[R.FOL.FILE]`: an object index table, enumerating all files and file sets for the node, contains references to the latest FOL record for the file (or file-set). By following the previous operation LSN references, the history of modifications of a given file can be recovered.
+
+
+### Dependencies
+- back-end:
+  - `[R.BACK-END.TRANSACTIONAL] ST`: back-end supports local transactions so that FOL could be populated atomically with other tables.
+  - `[R.BACK-END.INDEXING] ST`: back-end supports containers with records indexed by a key.
+
+### Security Model
+The FOL manager by itself does not deal with security issues. It trusts its callers (request handler, DTM, etc.) to carry out the necessary authentication and authorization checks before manipulating FOL records. The FOL stores some security information as part of its records.
+
+### Refinement
+The FOL is organized as a single indexed table containing records with LSN as a primary key. The structure of an individual record is as outlined above. The detailed main FOL interface is straightforward. FOL navigation and querying in the auxiliary interface are based on a FOL cursor.
+
+## State
+FOL introduces no extra state.
+
+## Use Cases
+### Scenarios
+
+FOL QAS list is included here by reference.
+
+### Failures
+Failure of the underlying storage container in which FOL is stored is treated as any storage failure. All other FOL-related failures are handled by DTM.
+
+## Analysis
+
+### Other
+An alternative design is to store FOL in a special data structure, instead of a standard indexed container. For example, FOL can be stored in an append-only flat file with the starting offset of a record serving as its lsn. The perceived advantage of this solution is avoiding the overhead of full-fledged indexing (b-tree). Indeed, general-purpose indexing is not needed, because records with an lsn less than the maximal one used in the past are never inserted into the FOL (aren't they?).
+
+Yet another possible design is to use db4 extensible logging to store FOL records directly in a db4 transactional log. The advantage of this is that forcing FOL up to a specific record becomes possible (and easy to implement), and the overhead of indexing is again avoided. On the other hand, it is not clear how to deal with pruning.
+
+### Rationale
+The simplest solution first.
+
+## References
+- [0] FOL QAS
+- [1] FOL architecture view packet
+- [2] FOL overview
+- [3] WAL
+- [4] Summary requirements table
+- [5] M0 glossary
+- [6] HLD of the request handler
diff --git a/doc/HLD-of-Metadata-Backend.md b/doc/HLD-of-Metadata-Backend.md
new file mode 100644
index 00000000000..344653b91c1
--- /dev/null
+++ b/doc/HLD-of-Metadata-Backend.md
@@ -0,0 +1,66 @@
+# HLD of Metadata Backend
+This document presents a high level design **(HLD)** of the meta-data back-end for Motr.
+The main purposes of this document are:
+ 1. To be inspected by Motr architects and peer designers to ascertain that high level design is aligned with Motr architecture and other designs, and contains no defects.
+ 2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component.
+ 3. To serve as a design reference document. The intended audience of this document consists of Motr customers, architects, designers, and developers.
+
+
+## Introduction
+Meta-data back-end (BE) is a module presenting an interface for transactional local meta-data storage. BE users manipulate and access meta-data structures in memory. BE maps this memory to persistent storage. Users group meta-data updates in transactions. BE guarantees that transactions are atomic in the face of process failures.
+
+BE provides support for a few frequently used data structures: doubly linked list, B-tree, and extent map.
+
+
+## Dependencies
+- a storage object *(stob)* is a container for unstructured data, accessible through the `m0_stob` interface. BE uses stobs to store meta-data on a persistent store. BE accesses the persistent store only through the `m0_stob` interface and assumes that every completed stob write survives any node failure. It is up to a stob implementation to guarantee this.
+- a segment is a stob mapped to an extent in process address space. Each address in the extent uniquely corresponds to the offset in the stob and vice versa. The stob is divided into blocks of fixed size. The memory extent is divided into pages of fixed size. Page size is a multiple of the block size (it follows that stob size is a multiple of page size). At a given moment in time, some pages are up-to-date (their contents are the same as of the corresponding stob blocks) and some are dirty (their contents were modified relative to the stob blocks). In the initial implementation, all pages are up-to-date when the segment is opened. In later versions, pages will be loaded dynamically on demand. The memory extent to which a segment is mapped is called segment memory.
+- a region is an extent within segment memory. A (meta-data) update is a modification of some region.
+- a transaction is a collection of updates. The user adds an update to a transaction by capturing the update's region. The user explicitly closes a transaction. BE guarantees that a closed transaction is atomic with respect to process crashes that happen after the transaction close call returns. That is, after such a crash, either all or none of the transaction updates will be present in the segment memory when the segment is opened next time. If a process crashes before a transaction closes, BE guarantees that none of the transaction updates will be present in the segment memory.
+- a credit is a measure of a group of updates. A credit is a pair (nr, size), where nr is the number of updates and size is the total size in bytes of modified regions.
+
+## Requirements
+
+* `R.M0.MDSTORE.NUMA`: allocator respects NUMA topology.
+* `R.M0.REQH.10M`: performance goal of 10M transactions per second on a 16-core system with a battery-backed memory.
+* `R.M0.MDSTORE.LOOKUP`: Lookup of a value by key is supported.
+* `R.M0.MDSTORE.ITERATE`: Iteration through records is supported.
+* `R.M0.MDSTORE.CAN-GROW`: The linear size of the address space can grow dynamically.
+* `R.M0.MDSTORE.SPARSE-PROVISIONING`: including pre-allocation.
+* `R.M0.MDSTORE.COMPACT`, `R.M0.MDSTORE.DEFRAGMENT`: used container space can be compacted and de-fragmented.
+* `R.M0.MDSTORE.FSCK`: scavenger is supported.
+* `R.M0.MDSTORE.PERSISTENT-MEMORY`: The log and dirty pages are (optionally) in a persistent memory.
+* `R.M0.MDSTORE.SEGMENT-SERVER-REMOTE`: backing containers can be either local or remote.
+* `R.M0.MDSTORE.ADDRESS-MAPPING-OFFSETS`: offset structure friendly to container migration and merging.
+* `R.M0.MDSTORE.SNAPSHOTS`: snapshots are supported.
+* `R.M0.MDSTORE.SLABS-ON-VOLUMES`: slab-based space allocator.
+* `R.M0.MDSTORE.SEGMENT-LAYOUT`: Any object layout for a meta-data segment is supported.
+* `R.M0.MDSTORE.DATA.MDKEY`: Data objects carry a meta-data key for sorting (like the reiser4 key assignment does).
+* `R.M0.MDSTORE.RECOVERY-SIMPLER`: There is a possibility of doing a recovery twice. There is also a possibility to use either object-level mirroring or logical transaction mirroring.
+* `R.M0.MDSTORE.CRYPTOGRAPHY`: optionally meta-data records are encrypted.
+* `R.M0.MDSTORE.PROXY`: proxy meta-data server is supported. A client and a server are almost identical.
+
+## Design Highlights
+BE transaction engine uses write-ahead redo-only logging. Concurrency control is delegated to BE users.
+
+## Functional Specification
+BE provides an interface to make in-memory structures transactionally persistent. A user opens a (previously created) segment. An area of virtual address space is allocated to the segment. The user then reads and writes the memory in this area, by using BE-provided interfaces together with normal memory access operations. When the memory address is read for the first time, its contents are loaded from the segment (the initial BE implementation loads the entire segment stob in memory when the segment is opened). Modifications to segment memory are grouped in transactions. After a transaction is closed, BE asynchronously writes updated memory to the segment stob.
+
+When a segment is closed (perhaps implicitly as a result of a failure) and re-opened again, the same virtual address space area is allocated to it. This guarantees that it is safe to store pointers to segment memory in segment memory. Because of this property, a user can place in segment memory in-memory structures relying on pointers: linked lists, trees, hash tables, strings, etc. Some in-memory structures, notably locks, are meaningless on storage, but for simplicity (to avoid allocation and maintenance of a separate set of volatile-only objects), can nevertheless be placed in the segment. When such a structure is modified (e.g., a lock is taken or released), the modification is not captured in any transaction and, hence, is not written to the segment stob.
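+As an illustration of the update cycle described above (prepare credits, open, modify, capture, close), consider the following minimal sketch. The `be_tx_*` and `be_seg` names are simplified stand-ins for illustration, not the actual BE interface.
+
+```
+/* Sketch of a transactional update of a counter in segment memory. */
+#include <stdint.h>
+
+struct be_seg;
+struct be_tx;
+
+extern void be_tx_init(struct be_tx *tx, struct be_seg *seg);
+extern void be_tx_prep(struct be_tx *tx, uint32_t nr, uint64_t size);
+extern void be_tx_open(struct be_tx *tx);
+extern void be_tx_capture(struct be_tx *tx, void *addr, uint64_t size);
+extern void be_tx_close(struct be_tx *tx);
+
+void counter_increment(struct be_tx *tx, struct be_seg *seg,
+                       uint64_t *counter /* lives in segment memory */)
+{
+        be_tx_init(tx, seg);
+        /* Declare the credit: one update of sizeof(*counter) bytes. */
+        be_tx_prep(tx, 1, sizeof *counter);
+        be_tx_open(tx);
+
+        *counter += 1;                               /* plain memory store */
+        be_tx_capture(tx, counter, sizeof *counter); /* add update to tx  */
+
+        /* After close returns, the transaction is atomic with respect to
+         * process crashes; it becomes persistent asynchronously. */
+        be_tx_close(tx);
+}
+```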
+
+BE-exported objects (domain, segment, region, transaction, linked list, and b-tree) support the Motr non-blocking server architecture.
+
+## Use Cases
+### Scenarios
+
+|Scenario | Description |
+|---------|-------------|
+|Scenario | `[usecase.component.name]` |
+|Relevant quality attributes| [e.g., fault tolerance, scalability, usability, re-usability]|
+|Stimulus| [an incoming event that triggers the use case]|
+|Stimulus source | [system or external world entity that caused the stimulus]|
+|Environment | [part of the system involved in the scenario]|
+|Artifact | [change to the system produced by the stimulus]|
+|Response | [how the component responds to the system change]|
+|Response measure |[qualitative and (preferably) quantitative measures of response that must be maintained]|
+|Questions and Answers | [open questions raised by the scenario and their answers]|
diff --git a/doc/HLD-of-Motr-Caching.md b/doc/HLD-of-Motr-Caching.md
new file mode 100644
index 00000000000..1ceb5cb7aca
--- /dev/null
+++ b/doc/HLD-of-Motr-Caching.md
@@ -0,0 +1,276 @@
+# High level design of Motr configuration caching
+This document presents a high-level design **(HLD)** of configuration caching for the Motr M0 core.
+The main purposes of this document are:
+ 1. To be inspected by M0 architects and peer designers to ascertain that high-level design is aligned with M0 architecture and other designs, and contains no defects.
+ 2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component.
+ 3. To serve as a design reference document.
+
+The intended audience of this document consists of M0 customers, architects, designers, and developers.
+
+## Introduction
+Configuration information of a Motr cluster (node data, device data, filesystem tuning parameters, etc.[1]) is stored in a database, which is maintained by a dedicated management service --- the configuration server. Other services and clients access configuration information by using the API of the configuration client library.
+
+Configuration caching provides a higher-level interface to the configuration database, convenient to use by upper layers. The implementation maintains data structures in memory, fetching them from the configuration server if necessary. Configuration caches are maintained in management-, metadata- and io- services, and in clients.
+
+## Definitions
+- Motr configuration is part of M0 cluster meta-data.
+- Configuration database is a central repository of M0 configuration.
+- Confd (configuration server) is a management service that provides configuration clients with information obtained from the configuration database.
+- Confc (configuration client library, configuration client) is a library that provides configuration consumers with interfaces to query M0 configuration.
+- Configuration consumer is any software that uses the confc API to access M0 configuration.
+- Configuration cache is configuration data stored in a node’s memory.
Confc library maintains such a cache and provides configuration consumers with access to its data. Confd also uses a configuration cache for faster retrieval of information requested by configuration clients.
+- Configuration object is a data structure that contains configuration information. There are several types of configuration objects: profile, service, node, etc.
+
+## Requirements
+- `[r.conf.async]` Configuration querying should be a non-blocking operation.
+- `[r.conf.cache]` Configuration information is stored (cached) in management-, metadata-, and io- services, and on clients.
+- `[r.conf.cache.in-memory]` Implementation maintains configuration structures in memory, fetching them from the management service if necessary.
+- `[r.conf.cache.resource-management]` Configuration caches should be integrated with the resource manager[4].
+
+## Design Highlights
+This design assumes Motr configuration to be read-only. Implementation of writable configuration is postponed.
+
+A typical use case is when a client (confc) requests configuration from a server (confd). The latter most likely already has all configuration in memory: even a large configuration database is very small compared with other meta-data.
+
+A simplistic variant of the configuration server always loads the entire database into the cache. This considerably simplifies the locking model (reader-writer) and configuration request processing.
+
+
+## Functional Specification
+Configuration of a Motr cluster is stored in the configuration database. Motr services and filesystem clients --- configuration consumers --- have no access to this database. To work with configuration information, they use the API and data structures provided by the confc library.
+
+Confcs obtain configuration data from the configuration server (confd); only the latter is supposed to work with the configuration database directly.
+
+A configuration consumer accesses configuration information, which is stored in the configuration cache. The cache is maintained by the confc library. If the data needed by a consumer is not cached, confc will fetch this data from confd.
+
+Confd has its own configuration cache. If the data requested by a confc is missing from this cache, confd gets the information from the configuration database.
+
+### Configuration Data Model
+The configuration database consists of tables, each table being a set of {key, value} pairs. The schema of the configuration database is documented in [2]. Confd and confc organize configuration data as a directed acyclic graph (DAG) with vertices being configuration objects and edges being relations between objects.
+The profile object is the root of configuration data provided by confc. To access other configuration objects, a consumer follows the links (relations), “descending” from the profile object.
+
+```
+profile
+ \_ filesystem
+     \_ service
+         \_ node
+             \_ nic
+             \_ storage device
+                 \_ partition
+```
+
+
+The relation is a pointer from one configuration object to another configuration object or a collection of objects. In the former case, it is a one-to-one relation; in the latter case, it is one-to-many.
+
+Some relations are explicitly specified in corresponding records of the configuration database (e.g., a record of the “profiles” table contains the name of the filesystem associated with this profile).
Other relations are deduced by the confd (e.g., the list of services that belong to a given filesystem is obtained by scanning the “services” table and selecting entries with a particular value of the ‘filesystem’ field).
+
+|Configuration object | Relations specified in the DB record| Relations deduced by scanning other DB tables|
+|---------------------|-------------------------------------|---------------|
+|profile| .filesystem| --|
+|filesystem |-- |.services|
+|service| .filesystem, .node| --|
+|node |-- |.services, .nics, .sdevs|
+|nic| .node |--|
+|storage device (sdev)| .node |.partitions|
+|partition |.storage_device| --|
+
+A relation is a downlink if its destination is located further from the root of the configuration DAG than the origin. A relation is an uplink if its destination is closer to the root than the origin. A configuration object is a stub if its status `(.*_obj.co_status subfield)` is not equal to M0_CS_READY. Stubs contain no meaningful configuration data apart from the object’s type and key. A configuration object is pinned if its reference counter `(.*_obj.co_nrefs subfield)` is non-zero. When a configuration consumer wants to use an object, it pins it to protect the existence of the object in the cache. Pinning of an object makes the confc library request a corresponding distributed lock (resource) from the resource manager.
+
+### Path
+Imagine a sequence of downlinks and keys for which the following is true:
+- the first element (if any) is a downlink;
+- a one-to-many downlink is either followed by a key or is the last element;
+- a key is preceded by a one-to-many downlink.
+
+Such a sequence R and a configuration object X represent a path to configuration object Y if Y can be reached by starting at X and following all relations from R sequentially. The object X is called the path origin; elements of R are path components.
+
+The confc library uses the m0_confc_path data structure to represent a path. The members of this structure are:
+- p_origin --- path origin (struct m0_conf_obj*). NULL for an absolute path;
+- p_comps --- array of components. A component is either a downlink, represented by the type of the target object (enum m0_conf_objtype), or a key.
+
+Examples:
+- `{ NULL, [FILESYSTEM, SERVICE, “foo”, NODE, NIC, “bar”] }` --- absolute path **(origin = NULL)** to the NIC with key “bar” of the node that hosts service “foo”;
+- `{ node_obj, [SDEV, “baz”, PARTITION] }` --- relative path to a list of partitions that belong to the “baz” storage device of a given node.
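+The following minimal sketch shows how a consumer might resolve the first example above using the synchronous interface described in the next section; `path_build()` is a hypothetical helper that fills an `m0_confc_path` from its components.
+
+```
+/* Sketch: open the NIC "bar" of the node hosting service "foo".
+ * path_build() is a hypothetical variadic helper. */
+struct m0_confc_path  path;
+struct m0_conf_obj   *nic;
+
+path_build(&path, /* origin */ NULL,
+           FILESYSTEM, SERVICE, "foo", NODE, NIC, "bar", NULL);
+
+nic = m0_confc_open_sync(&path);
+if (nic != NULL) {
+        /* ... use the configuration object ... */
+        m0_confc_close(nic);
+}
+```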
+
+### Subroutines
+- `m0_confc_init()`
+  Initializes the configuration client and creates the root configuration object.
+
+  Arguments:
+  - profile --- name of the profile to be used by this confc;
+  - confd_addr --- address of the confd endpoint;
+  - sm_group --- state machine group (struct m0_sm_group*) that will be associated with the configuration cache.
+
+
+- `m0_confc_open()`
+  Requests an asynchronous opening of a configuration object. Initiates retrieval of configuration data from the confd, if the data needed to fulfill this request is missing from the configuration cache.
+
+  Arguments:
+  - path --- path to the configuration object. The caller must guarantee the existence and immutability of the path until the state machine, embedded in the ctx argument, terminates or fails;
+  - ctx --- fetch context (struct m0_confc_fetchctx*) containing:
+    - state machine (struct m0_sm);
+    - FOP (struct m0_fop);
+    - asynchronous system trap (struct m0_sm_ast) that will be posted to confc’s state machine group when a response from confd arrives;
+    - resulting pointer (void*) that will be set to the address of the requested configuration object iff the state machine terminates successfully. Otherwise, the value is NULL;
+    - errno.
+
+
+- `m0_confc_open_sync()`
+
+  Synchronous variant of `m0_confc_open()`. Returns a pointer to the requested configuration object or NULL in case of error.
+
+  Argument: path --- path to the configuration object.
+
+- `m0_confc_close()`
+
+  Closes a configuration object opened with `m0_confc_open()` or `m0_confc_open_sync()`.
+
+- `m0_confc_diropen()`
+
+  Requests an asynchronous opening of a collection of configuration objects. Initiates retrieval of configuration data from the confd, if the data needed to fulfill this request is missing from the configuration cache.
+
+  Arguments:
+  - path --- path to the collection of configuration objects. The caller must guarantee the existence and immutability of the path until the state machine, embedded in the ctx argument, terminates or fails;
+  - ctx --- fetch context (the structure is described above). Its ‘resulting pointer’ member will be set to a non-NULL opaque value iff the state machine terminates successfully. This value is an argument for the `m0_confc_dirnext()` and `m0_confc_dirclose()` functions.
+
+
+- `m0_confc_diropen_sync()`
+
+  Synchronous variant of `m0_confc_diropen()`. Returns an opaque pointer to be passed to `m0_confc_dirnext()`. Returns NULL in case of error.
+
+- `m0_confc_dirnext()`
+
+  Returns the next element in a collection of configuration objects.
+  Argument: dir --- opaque pointer obtained from a fetch context (see the ctx argument of `m0_confc_diropen()`).
+
+- `m0_confc_dirclose()`
+
+  Closes a collection of configuration objects opened with `m0_confc_diropen()` or `m0_confc_diropen_sync()`.
+
+  Argument: dir --- opaque pointer obtained from a fetch context.
+
+- `m0_confc_fini()`
+
+  Terminating routine: destroys the configuration cache, freeing allocated memory.
+
+### FOP Types
+Confc requests configuration information by sending an m0_conf_fetch FOP to confd. This FOP contains the path to the requested configuration object/directory. Note that the path in the FOP may be shorter than the path originally specified in the `m0_confc_*open*()` call: if some of the objects are already present in the confc cache, there is no reason to re-fetch them from confd.
+
+Confd replies to m0_conf_fetch with an m0_conf_fetch_resp FOP, containing:
+
+- status of the retrieval operation (0 = success, -Exxx = failure);
+- array (SEQUENCE in .ff terms) of configuration object descriptors.
+
+If the last path component, specified in m0_conf_fetch, denotes a directory (i.e., a collection of configuration objects), then confd’s reply must include descriptors of all the configuration objects of this directory. For example, if a path targets a collection of partitions, then m0_conf_fetch_resp should describe every partition of the targeted collection.
+
+Note that in the future, configuration data will be transferred from confd to confc using RPC bulk interfaces.
The current implementation embeds configuration information in a response FOP and uses encoding and decoding functions generated by fop2c from the .ff description.
+
+## Logical Specification
+The confd service is parametrized by the address of the endpoint associated with this service and the path to the configuration database.
+The confd state machine, created by the request handler[3], obtains the requested configuration (either from the configuration cache or from the configuration database; in the latter case the data gets added to the cache), generates a response FOP `(m0_conf_fetch_resp)` populated with configuration information, and sends it back to confc.
+
+Confc is parametrized by the name of the profile, the address of the confd endpoint, and a state machine group. When the `m0_confc_fetch()` function is called, confc checks whether all of the requested configuration data is available in the cache. If something is missing, confc sends a request to the confd. When the response arrives, confc updates the configuration cache.
+
+
+### Conformance
+- `[i.conf.async]`: `m0_confc_fetch()` is an asynchronous non-blocking call. The confc state machine keeps the information about the progress of the query.
+- `[i.conf.cache]`: confc components, used by Motr services and filesystem clients, maintain the caches of configuration data.
+- `[i.conf.cache.in-memory]`: a confc obtains configuration from the confd and builds an in-memory data structure in the address space of the configuration consumer.
+- `[i.conf.cache.resource-management]`: before pinning a configuration object or modifying the configuration cache, the configuration module (confd or confc) requests an appropriate resource from the resource manager.
+
+### Dependencies
+- `[configuration.schema]` The confd is aware of the schema of the configuration database. The actual schema is defined by another task[2] and is beyond the scope of this document.
+- `[confc.startup]` Parameters, needed by the confc startup procedure, are provided by calling code (presumably the request handler or the service startup logic).
+
+### Security Model
+No security model is defined.
+
+### Refinement
+- `[r.conf.confc.kernel]` Confc library must be implemented for the kernel.
+- `[r.conf.confc.user]` Confc library must be implemented for user space.
+- `[r.conf.confd]` Confd service must be implemented for user space.
+- `[r.conf.cache.data-model]` The implementation should organize configuration information as outlined in the sections above. The same data structures should be used for confc and confd caches, if possible. Configuration structures must be kept in memory.
+- `[r.conf.cache.pinning]` Pinning of an object protects the existence of this object in the cache. Pinned objects can be moved from the stub condition to “ready”.
+- `[r.conf.cache.unique-objects]` Configuration cache must not contain multiple objects with the same identity (the identity of a configuration object is a tuple of type and key).
+
+## State
+Confd and confc modules define state machines for asynchronous non-blocking processing of configuration requests. State diagrams of such machines are shown below.
+
+### States, events, transitions
+While a confd state machine is in the CHECK_SERIALIZE state, it keeps the configuration cache locked. The LOAD state unlocks the cache, fetches missing objects from the configuration database, and falls back to **CHECK_SERIALIZE**. After a configuration object is successfully loaded from the database, its status is set to M0_CS_READY and its channel `(m0_conf_obj::co_chan)` is broadcast.
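+For illustration, a consumer blocked on a stub might wait for this broadcast roughly as follows; `chan_wait()` stands in for the actual channel-wait primitive, so this is a sketch rather than the real API.
+
+```
+/* Sketch: wait until a configuration object leaves the stub state.
+ * chan_wait() is an illustrative stand-in. */
+static void conf_obj_wait_ready(struct m0_conf_obj *obj)
+{
+        while (obj->co_status != M0_CS_READY)
+                chan_wait(&obj->co_chan); /* woken by the broadcast above */
+}
+```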
+
+Configuration cache is associated with a state machine group **(m0_sm_group)**. While a confc state machine is in the CHECK state, it keeps the state machine group locked.
+
+The GROW_CACHE state releases the state machine lock and performs the following actions for every configuration descriptor decoded from confd’s response `(m0_conf_fetch_resp)`:
+
+- lock the state machine group;
+- make sure that the target object of every relation, mentioned by the descriptor, is present in the cache (stubs are created for absent target objects);
+- if an object with the described identity (type and key) already exists in the cache and its status is M0_CS_READY, then compare the existing object with the one received from confd, reporting inequality by means of the ADDB API;
+- if an object with the described identity already exists and is a stub, then fill it with the received configuration data, change its status to **M0_CS_READY**, and announce the status update on the object’s channel;
+- unlock the state machine group.
+
+### State Variants
+If a confc state machine in the GROW_CACHE state, while trying to add an object, finds that an object with this key already exists, then either the existing object is a stub or the new and existing objects are equal.
+
+This invariant is also applicable to the LOAD state of a confd state machine.
+
+### Concurrency Control
+#### Confd
+Several confd state machines (FOMs) can work with the configuration cache --- read from it and add new objects to it --- concurrently.
+
+A confd state machine keeps the configuration cache locked while examining its completeness (i.e., checking whether it has enough data to fulfill confc’s request) and serializing configuration data into a response FOP. Another confd state machine is unable to enter the CHECK_SERIALIZE state until the mutex, embedded into the configuration cache data structure `(m0_conf_cache::cc_lock)`, is released.
+
+The same lock is used to prevent concurrent modifications to the configuration DAG. A confd state machine must hold `m0_conf_cache::cc_lock` to add a new configuration object to the cache or fill a stub object with configuration data.
+
+
+#### Confc
+The configuration cache at the confc side is shared by configuration consumers that read from it and confc state machines that traverse the cache and add new objects to it. Consumers and state machines can work with the configuration cache concurrently.
+
+The implementation of confc serializes state transitions by means of the state machine group associated with the configuration cache. It is up to the upper layers to decide whether this group will be used exclusively for confc or for some other state machines as well.
+
+The state machine group is locked by a confc state machine running in the CHECK state. The lock `(m0_sm_group::s_lock)` is also acquired by a confc state machine for the time interval needed to append a new configuration object to the cache.
+
+Configuration consumers must pin the objects they are using (and not forget to unpin them afterward, of course). Pinned configuration objects cannot be deleted by the confc library, but those that have zero references may be deleted at any moment (e.g., when another consumer thread calls `m0_confc_fini()`).
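+A sketch of this pin/unpin discipline, assuming (for illustration only) that opening an object pins it and closing it releases the pin:
+
+```
+/* Sketch only: the pairing of open/close with pin/unpin is assumed. */
+struct m0_conf_obj *obj = m0_confc_open_sync(&path); /* pins the object */
+if (obj != NULL) {
+        consume_configuration(obj);  /* cannot be evicted while pinned */
+        m0_confc_close(obj);         /* unpins; object may now be freed */
+}
+```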
+
+## Analysis
+### Scalability
+The present design assumes that there is exactly one configuration server per cluster. The problem is that, however fast a confd is, the amount of data it can serve per fixed time is limited (the same goes for the number of requests it can process). Given a big enough number of confcs willing to obtain configuration simultaneously (e.g., when a cluster is starting), RPC timeouts are inevitable.
+
+One of the possible solutions (which will probably be employed anyway, in one form or another) is to make configuration consumers keep on re-fetching the configuration data until a request succeeds or a maximum number of retries is reached. Another solution is to replicate the configuration database and have as many confds as there are replicas. Both solutions are beyond the scope of this document.
+
+Of course, once a confc has configuration data cached, it will not need to fetch the same data from confd again (at least not until the configuration is updated, which is not going to happen in this incarnation of the M0 configuration system).
+
+
+### Other
+Rejected:
+- `m0_confc_lookup()` function that returns a configuration record given the name of the configuration database table and the record’s key;
+
+- YAML for configuration data serialization.
+
+(Rationale: YAML is not convenient to parse in the kernel. It is much simpler to use encoding and decoding functions generated by fop2c from the .ff description.)
+
+Postponed:
+- writable configuration;
+- configuration auto-discovery;
+- using RPC bulk interfaces for transmitting configuration data.
+
+A pointer to the root configuration object cannot be embedded in the m0_reqh structure, because filesystem clients do not have a request handler (yet).
+
+
+### Installation
+- The configuration database is created with the ‘yaml2db’ utility (the product of the ‘configuration.devenum’ component).
+- Confd is a management service, started by the request handler. It is installed together with Motr.
+- Confc is a library that configuration consumers link with.
+
+
+#### References
+- [1] Configuration one-pager
+- [2] HLD of configuration.schema
+- [3] HLD of request handler
+- [4] HLD of resource management interfaces
+- [5] configuration.caching drafts
diff --git a/doc/HLD-of-Motr-Lostore.md b/doc/HLD-of-Motr-Lostore.md
new file mode 100644
index 00000000000..5043219d01b
--- /dev/null
+++ b/doc/HLD-of-Motr-Lostore.md
@@ -0,0 +1,115 @@
+# High level design of a Motr lostore module
+This document presents a high level design **(HLD)** of a lower store (lostore) module of Motr core.
+The main purposes of this document are:
+1. To be inspected by M0 architects and peer designers to ascertain that high level design is aligned with M0 architecture and other designs, and contains no defects.
+2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component.
+3. To serve as a design reference document.
+
+
+The intended audience of this document consists of M0 customers, architects, designers, and developers.
+
+
+## Introduction
+- A table is a collection of pairs, each consisting of a key and a record. Keys and records of pairs in a given table have the same structure as defined by the table type. "Container" might be a better term for a collection of pairs (compare with various "container libraries"), but this term is already used by Motr;
+- records consist of fields and some of the fields can be pointers to pairs in (the same or another) table;
+- a table is ordered if a total ordering is defined on possible keys of its records.
For an ordered table, an interface is defined to iterate over existing keys in order;
+- a sack is a special type of table, where the key carries no semantic information and acts purely as an opaque record identifier, called the record address. Addresses are assigned to records by the lostore module;
+- a table is persistent if its contents survive a certain class of failures, viz. "power failures". A table can be created persistent or non-persistent (volatile);
+- tables are stored in segments. A segment is an array of pages, backed by either persistent or volatile storage. Each page can be assigned to a particular table. A table can be stored in multiple segments and a segment can store pages belonging to multiple tables. Assignment of pages to tables is done by the lostore module;
+- updates to one or more tables can be grouped in a transaction, which is a set of updates atomic with respect to a failure (from the same class as used in the definition of the persistent table). By abuse of terminology, an update of a volatile table can also be said to belong to a transaction;
+- after a transaction is opened, updates can be added to the transaction, until the transaction is closed by either committing or aborting. Updates added to an aborted transaction are reverted (rolled back or undone). In the absence of failures, a committed transaction eventually becomes persistent, which means that its updates will survive any further power failures. On recovery from a power failure, committed, but not yet persistent, transactions are rolled back by the lostore implementation;
+- a function call is blocking if before return it waits for
+  - a completion of a network communication, or
+  - a completion of a storage transfer operation, or
+  - a long-term synchronisation event, where the class of long-term events is to be defined later.
+
+Otherwise, a function call is called non-blocking.
+
+## Requirements
+- `R.M0.MDSTORE.BACKEND.VARIABILITY`: Supports various implementations: db5 and RVM.
+- `R.M0.MDSTORE.SCHEMA.EXPLICIT`: Entities and their relations are explicitly documented. "Foreign key" following access functions provided.
+- `R.M0.MDSTORE.SCHEMA.STABLE`: Resistant against upgrades and interoperable.
+- `R.M0.MDSTORE.SCHEMA.EXTENSIBLE`: Schema can be gradually updated over time without breaking interoperability.
+- `R.M0.MDSTORE.SCHEMA.EFFICIENT`: Locality of reference translates into storage locality. Within blocks (for flash) and across larger extents (for rotating drives).
+- `R.M0.MDSTORE.PERSISTENT-VOLATILE`: Both volatile and persistent entities are supported.
+- `R.M0.MDSTORE.ADVANCED-FEATURES`: renaming symlinks, changelog, parent pointers.
+- `R.M0.REQH.DEPENDENCIES`: `mkdir a; touch a/b`.
+- `R.M0.DTM.LOCAL-DISTRIBUTED`: the same mechanism is used for distributed transactions and local transactions on multiple cores.
+- `R.M0.MDSTORE.PARTIAL-TXN-WRITEOUT`: transactions can be written out partially (requires optional undo logging support).
+- `R.M0.MDSTORE.NUMA`: allocator respects NUMA topology.
+- `R.M0.REQH.10M`: performance goal of 10M transactions per second on a 16-core system with a battery-backed memory.
+- `R.M0.MDSTORE.LOOKUP`: Lookup of a value by key is supported.
+- `R.M0.MDSTORE.ITERATE`: Iteration through records is supported.
+- `R.M0.MDSTORE.CAN-GROW`: The linear size of the address space can grow dynamically.
+- `R.M0.MDSTORE.SPARSE-PROVISIONING`: including pre-allocation.
+- `R.M0.MDSTORE.COMPACT`, `R.M0.MDSTORE.DEFRAGMENT`: used container space can be compacted and de-fragmented.
+- `R.M0.MDSTORE.FSCK`: scavenger is supported.
+- `R.M0.MDSTORE.PERSISTENT-MEMORY`: The log and dirty pages are (optionally) in a persistent memory.
+- `R.M0.MDSTORE.SEGMENT-SERVER-REMOTE`: backing containers can be either local or remote.
+- `R.M0.MDSTORE.ADDRESS-MAPPING-OFFSETS`: offset structure friendly to container migration and merging.
+- `R.M0.MDSTORE.SNAPSHOTS`: snapshots are supported.
+- `R.M0.MDSTORE.SLABS-ON-VOLUMES`: slab-based space allocator.
+- `R.M0.MDSTORE.SEGMENT-LAYOUT`: Any object layout for a meta-data segment is supported.
+- `R.M0.MDSTORE.DATA.MDKEY`: Data objects carry a meta-data key for sorting (like the reiser4 key assignment does).
+- `R.M0.MDSTORE.RECOVERY-SIMPLER`: There is a possibility of doing a recovery twice. There is also a possibility to use either object-level mirroring or logical transaction mirroring.
+- `R.M0.MDSTORE.CRYPTOGRAPHY`: optionally meta-data records are encrypted.
+- `R.M0.MDSTORE.PROXY`: proxy meta-data server is supported. A client and a server are almost identical.
+
+## Design highlights
+The key problem of lostore interface design is to accommodate different implementations, viz., a db5-based one and an RVM-based one. To address this problem, keys, records, tables and their relationships are carefully defined in a way that allows different underlying implementations without impairing efficiency.
+
+lostore transactions provide very weak guarantees, compared with the typical **ACID** transactions:
+
+- no isolation: the lostore transaction engine does not guarantee transaction serializability. The user has to implement any concurrency control measure necessary. The reason for this decision is that transactions used by Motr are much shorter and smaller than typical transactions in a general-purpose RDBMS, and a user is better equipped to implement a locking protocol that guarantees consistency and deadlock freedom. Note, however, that lostore transactions can be aborted, and this restricts the class of usable locking protocols;
+- no durability: lostore transactions are made durable asynchronously, see the Definitions section above.
+
+The requirement of non-blocking access to tables implies that access is implemented as a state machine, rather than a function call. The same state machine is used to iterate over tables.
+
+## Functional Specification
+mdstore and iostore use lostore to access a database, where meta-data is kept. Meta-data are organised according to a meta-data schema, which is defined as a set of lostore tables referencing each other together with consistency constraints.
+
+The lostore public interface consists of the following major types of entities:
+- table type: defines common characteristics of all tables of this type, including:
+  - structure of keys and records,
+  - optional key ordering,
+  - usage hints (e.g., how large is the table? Should it be implemented as a b-tree or a hash table?)
+- table: defines table attributes, such as persistence, name;
+- segment: segment attributes such as volatility or persistency. A segment can be local or remote;
+- transaction: transaction objects can be created (opened) and closed (committed or aborted). Persistence notification can be optionally delivered when the transaction becomes persistent;
+- table operation, table iterator: a state machine encoding the state of a table operation;
+- domain: a collection of tables sharing an underlying database implementation. A transaction is confined to a single domain.
+
+### Tables
+For a table type, a user has to define operations to encode and decode records and keys (unless the table is a sack, in which case key encoding and decoding functions are provided automatically) and an optional key comparison function. It's expected that encoding and decoding functions will often be generated automatically in a way similar to fop encoding and decoding functions.
+
+The following functions are defined on tables:
+- create: create a new instance of a table type. The table created can be either persistent or volatile, as specified by the user. The backing segment can be optionally specified;
+- open: open an existing table by name;
+- destroy: destroy a table;
+- insert: insert a pair into a table;
+- lookup: find a pair given its key;
+- delete: delete a pair with a given key;
+- update: replace the pair's record with a new value;
+- next: move to the pair with the next key;
+- follow: follow a pointer to a record.
+
+### Transactions
+The following operations are defined for transactions:
+- open: start a new transaction. Transaction flags can be specified, e.g., whether the transaction can be aborted, or whether persistence notification is needed;
+- add: add an update to the transaction. This is internally called as part of any table update operation (insert, update, delete);
+- commit: close the transaction;
+- abort: close the transaction and roll it back;
+- force: indicate that the transaction should be made persistent as soon as possible.
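+The following sketch shows how the table and transaction operations above are intended to compose; the `lostore_*` names are illustrative, not the actual entry points.
+
+```
+/* Sketch: insert a pair into a table under a transaction. */
+struct lostore_domain;
+struct lostore_table;
+struct lostore_tx;
+
+extern struct lostore_tx *lostore_tx_open(struct lostore_domain *dom);
+extern int  lostore_tx_commit(struct lostore_tx *tx);
+extern void lostore_tx_abort(struct lostore_tx *tx);
+extern int  lostore_table_insert(struct lostore_table *t,
+                                 struct lostore_tx *tx,
+                                 const void *key, const void *rec);
+
+int example_insert(struct lostore_domain *dom, struct lostore_table *t,
+                   const void *key, const void *rec)
+{
+        /* A transaction is confined to a single domain. */
+        struct lostore_tx *tx = lostore_tx_open(dom);
+        int                rc;
+
+        /* insert internally adds the update to the transaction. */
+        rc = lostore_table_insert(t, tx, key, rec);
+        if (rc == 0)
+                rc = lostore_tx_commit(tx);
+        else
+                lostore_tx_abort(tx); /* roll the update back */
+        return rc;
+}
+```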
+
+### Segments
+The following operations are defined for segments:
+- create a segment backed up by a storage object (note that because the meta-data describing the storage object are accessible through lostore, the boot-strapping issues have to be addressed);
+- create a segment backed up by a remote storage object;
+- destroy a segment, no pages of which are assigned to tables.
+
+## Logical Specification
+Internally, a lostore domain belongs to one of the following types:
+- a db5 domain, where tables are implemented as db5 databases and transactions as db5 transactions;
+- an rvm domain, where tables are implemented as hash tables in the RVM segments and transactions as RVM transactions;
+- a light domain, where tables are implemented as linked lists in memory and transaction calls are ignored.
diff --git a/doc/HLD-of-Motr-Network-Benchmark.md b/doc/HLD-of-Motr-Network-Benchmark.md
new file mode 100644
index 00000000000..9eb200460b8
--- /dev/null
+++ b/doc/HLD-of-Motr-Network-Benchmark.md
@@ -0,0 +1,163 @@
+# High level design of Motr Network Benchmark
+This document presents a high level design **(HLD)** of Motr Network Benchmark.
+The main purposes of this document are:
+1. To be inspected by M0 architects and peer designers to ascertain that high level design is aligned with M0 architecture and other designs, and contains no defects.
+2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component.
+3. To serve as a design reference document.
+
+The intended audience of this document consists of M0 customers, architects, designers, and developers.
+
+## Introduction
+Motr network benchmark is designed to test the network subsystem of Motr and network connections between nodes that are running Motr.
+
+## Definitions
+- m0_net_test: a network benchmark framework that uses the Motr API;
+- test client: active side of a test, e.g. when A pings B, A is the client;
+- test server: passive side of a test, e.g. when A pings B, B is the server;
+- test node: a test client or server;
+- test group: a set of test nodes;
+- test console: a node where test commands are issued;
+- test message: unit of data exchange between test nodes;
+- test statistics: summary information about test messages for some time;
+- bulk test: a test that uses bulk data transfer;
+- ping test: a test that uses short message transfer;
+- command line: command line parameters for a user-mode program or kernel module parameters for a kernel module.
+
+## Requirements
+- `[r.m0.net.self-test.statistics]`: should be able to gather statistics from all nodes;
+- `[r.m0.net.self-test.statistics.live]`: display live aggregate statistics when the test is still running;
+- `[r.m0.net.self-test.test.ping]`: simple ping/pong to test connectivity, and measure latency;
+- `[r.m0.net.self-test.test.bulk]`: bulk message read/write;
+- `[r.m0.net.self-test.test.bulk.integrity.no-check]`: for pure link saturation tests, or stress tests. This should be the default;
+- `[r.m0.net.self-test.test.duration.simple]`: end-user should be able to specify how long a test should run, by loop;
+- `[r.m0.net.self-test.kernel]`: test nodes must be able to run in kernel space.
+
+
+## Design Highlights
+- Bootstrapping: before the test console can issue commands to a test node, m0_net_test must be running on that node as a kernel module; pdsh can be used to load the kernel module.
+- Statistics collecting: the kernel module creates a file in the /proc filesystem, which can be accessed from user space. This file contains aggregate statistics for a node.
+- Live statistics collecting: pdsh is used for live statistics.
+
+
+## Functional Specification
+- `[r.m0.net.self-test.test.ping]`
+  - Test client sends the desired number of test messages to a test server with the desired size, type, etc. The test client waits for replies (test messages) from the test server and collects statistics for sent/received messages;
+  - Test server waits for messages from a test client, then sends messages back to the test client. The test server collects message statistics too;
+  - Message RTTs need to be measured;
+- `[r.m0.net.self-test.test.bulk]`
+  - Test client is a passive bulk sender/receiver;
+  - Test server is an active bulk sender/receiver;
+  - Message RTTs and bandwidth from each test client to each corresponding test server in both directions need to be measured;
+- Test console sends commands to load/unload the kernel module (implementation of the **m0_net_test** test client/test server), obtains statistics from every node, and generates aggregate statistics.
+
+## Logical specification
+
+### Kernel module specification
+
+#### start/finish
+After the kernel module is loaded, all statistics are reset. Then the module determines whether it is a test client or a test server and acts accordingly. A test client starts sending test messages immediately. All statistical data remain accessible until the kernel module is unloaded.
+
+#### test console
+The test console uses pdsh to load/unload the kernel module and gather statistics from every node's procfs. pdsh can be configured to use a different network if needed.
+
+#### test duration
+The end-user should be able to specify how long a test should run, as a loop count. The test client checks command line parameters to determine the number of test messages.
+
+### Test Message
+
+#### message structure
+Every message contains a timestamp and a sequence number, which are set and checked on the test client and must be preserved on the test server. The timestamp is used to measure latency and the sequence number is used to identify message loss.
+
+#### message integrity
+Currently, test message integrity is checked neither on the test client nor on the test server.
+
+#### measuring message latency
+Every test message will include a timestamp, which is set and tested on the test client. When the test client receives a test message reply, it will update round-trip time statistics: minimum, maximum, average, and standard deviation. Lost messages aren’t included in these statistics.
+
+The test client will keep statistics for all test servers with which it communicated.
+
+#### measuring messages bandwidth
+- messages bandwidth can be measured only in the bulk test. It is assumed that all messages in the ping test have zero sizes, and all messages in the bulk test have specified sizes;
+- messages bandwidth statistics are kept separately for the “from node” and “to node” directions on the test server; only the total bandwidth is measured on the test client;
+- messages bandwidth is the ratio of the total messages size (in the corresponding direction) and the time from the start time to the finish time in the corresponding direction;
+- on the test client, the start time is the time just before sending the network buffer descriptor to the test server (for the first bulk transfer);
+- on the test server, the start time in the “to node” direction is the time when the first bulk transfer request was received from the test client, and in the “from node” direction it is the time just before the bulk transfer request is sent to the test client for the first time;
+- the finish time is the time when the last bulk transfer (in the corresponding direction) is finished or it is considered that there was a message loss;
+- ideally, time in the “from test client to test server” direction must be measured from Tc0 to Ts4, and time in the “from test server to test client” direction must be measured from Ts5 to Tc8. But in the real world, we can only measure the time between Tc(i) and Tc(j) or between Ts(i) and Ts(j). Therefore there will always be some errors and differences between test client statistics and test server statistics;
+- the absolute value of the error is O(Ts1 - Tc0) (for the first message) + O(abs(Tc8 - Ts8)) (for the last message);
+- as the number of messages increases, the relative error will tend to zero.
+
+#### messages concurrency
+Message concurrency is implemented with a semaphore on the test client, whose initial value is the number of concurrent operations. One thread downs this semaphore and sends a message to the test server in a loop, and another thread ups this semaphore when a reply message is received or the message is considered lost.
+
+#### messages loss
+Message loss is determined using timeouts.
+
+#### message frequency
+Measure how many messages can be sent in a time interval.
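+For illustration, the per-server RTT bookkeeping described above could be kept as follows; field and function names are illustrative, not the actual implementation.
+
+```
+/* Sketch of round-trip time statistics: min, max, average, stddev. */
+#include <math.h>
+#include <stdint.h>
+
+struct rtt_stats {
+        uint64_t nr;          /* replies counted; lost messages excluded */
+        double   min, max;    /* extreme RTTs, in seconds */
+        double   sum, sum_sq; /* for average and standard deviation */
+};
+
+void rtt_stats_update(struct rtt_stats *st, double rtt)
+{
+        if (st->nr == 0 || rtt < st->min)
+                st->min = rtt;
+        if (st->nr == 0 || rtt > st->max)
+                st->max = rtt;
+        st->nr++;
+        st->sum    += rtt;
+        st->sum_sq += rtt * rtt;
+}
+
+double rtt_stats_avg(const struct rtt_stats *st)
+{
+        return st->nr == 0 ? 0.0 : st->sum / st->nr;
+}
+
+double rtt_stats_stddev(const struct rtt_stats *st)
+{
+        double avg = rtt_stats_avg(st);
+
+        /* Sample standard deviation: needs at least two samples. */
+        return st->nr < 2 ? 0.0 :
+                sqrt((st->sum_sq - st->nr * avg * avg) / (st->nr - 1));
+}
+```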
+### Bulk Test
+#### test client
+The test client allocates a set of network buffers, used to receive replies from test servers. Then the test client sends bulk data messages (as a passive bulk sender) to all test servers specified in the command line. After that, the test client will wait for the bulk transfer (as a passive bulk receiver) from the test server. Test clients can perform more than one concurrent send/receive to the same server.
+
+#### test server
+The test server allocates a set of network buffers and then waits for a message from clients as an active bulk receiver. When the bulk data arrives, the test server will send it back to the test client as an active bulk sender.
+
+### Ping test
+#### test server
+The test server waits for incoming test messages and simply sends them back.
+
+#### test client
+The test client sends test messages to the server and waits for reply messages. If a reply message isn't received within a timeout, then it is considered that the message is lost.
+
+### Conformance
+- `[i.m0.net.self-test.statistics]`: statistics from all nodes can be collected on the test console;
+- `[i.m0.net.self-test.statistics.live]`: statistics from all nodes can be collected on the test console at any time during the test;
+- `[i.m0.net.self-test.test.ping]`: latency is automatically measured for all messages;
+- `[i.m0.net.self-test.test.bulk]`: implemented using messages with additional bulk data;
+- `[i.m0.net.self-test.test.bulk.integrity.no-check]`: the additional data of bulk messages isn't checked;
+- `[i.m0.net.self-test.test.duration.simple]`: the end-user can specify how long a test should run, by loop;
+- `[i.m0.net.self-test.kernel]`: the test client/server is implemented as a kernel module.
+
+## Use Cases
+
+### Scenarios
+Scenario 1
+
+|Scenario | Description |
+|---------|-------------------------|
+|Scenario | [usecase.net-test.test] |
+|Relevant quality attributes| usability|
+|Stimulus |user starts the test|
+|Stimulus source | user |
+|Environment |network benchmark |
+|Artifact |test started and completed|
+|Response |benchmark statistics produced|
+|Response measure| statistics are consistent|
+
+
+### Failures
+
+#### network failure
+Message loss is determined by timeout. If an unexpected message arrives, it is rejected. If some node isn't accessible from the console, it is assumed that all messages associated with this node have been lost.
+
+#### test node failure
+If the test node isn't accessible at the beginning of the test, a network failure is assumed. Otherwise, the test console will try to reach it every time it reaches other nodes.
+
+## Analysis
+### Scalability
+
+#### network buffer sharing
+A single buffer (except the timestamp/sequence number fields in the test message) can be shared between all bulk send/receive operations.
+
+#### statistics gathering
+For a few tens of nodes pdsh can be used; scalability issues do not exist at this scale.
+
+### Rationale
+pdsh was chosen as an instrument for starting/stopping/gathering purposes because of the few tens of nodes in the test. At a larger scale, something else must be used.
+
+## References
+- [0] Motr Summary Requirements Table
+- [1] HLD of Motr LNet Transport
+- [2] Parallel Distributed Shell
+- [3] Round-trip time (RTT)
diff --git a/doc/HLD-of-Motr-Object-Index.md b/doc/HLD-of-Motr-Object-Index.md
new file mode 100644
index 00000000000..959e6f6c9fb
--- /dev/null
+++ b/doc/HLD-of-Motr-Object-Index.md
@@ -0,0 +1,192 @@
+# High-Level Design of a Motr Object Index
+This document provides a High-Level Design **(HLD)** of an Object Index for the Motr M0 core. The main purposes of this document are:
+- To be inspected by M0 architects and peer designers to ensure that HLD is aligned with M0 architecture and other designs and contains no defects.
+- To be a source of material for Active Reviews of Intermediate Design **(ARID)** and Detailed Level Design **(DLD)** of the same component.
+- To serve as a design reference document.
+
+The intended audience of this document consists of M0 customers, architects, designers, and developers.
+
+## Introduction
+The Object Index performs the function of a metadata layer on top of the M0 storage objects. The M0 storage object (stob) is a flat address space where one can read or write with block size granularity. The stobs have no metadata associated with them. To use stobs as files or components of files, aka stripes, additional metadata must be associated with them:
+- namespace information: parent object id, name, links
+- file attributes: owner/mode/group, size, m/a/ctime, acls
+- fol reference information: log sequence number (lsn), version counter
+
+Metadata must be associated with both component objects (cobs) and global objects. Global objects would be files striped across multiple component objects. Ideally, global objects and component objects should reuse the same metadata design (a cob can be treated as a gob with a local layout).
+
+## Definitions
+- A storage object (stob) is a basic M0 data structure containing raw data.
+- A component object (cob) is a component (stripe) of a file, referencing a single storage object and containing metadata describing the object.
+- A global object (gob) is an object describing a striped file, by referring to a collection of component objects.
+
+## Requirements
+- `[R.M0.BACK-END.OBJECT-INDEX]`: an object index allows the back-end to locate an object by its fid
+- `[R.M0.BACK-END.INDEXING]`: back-end has mechanisms to build metadata indices
+- `[R.M0.LAYOUT.BY-REFERENCE]`: file layouts are stored in file attributes by reference
+- `[R.M0.BACK-END.FAST-STAT]`: back-end data structures are optimized to make the stat(2) call fast
+- `[R.M0.DIR.READDIR.ATTR]`: readdir should be able to return file attributes without additional IO
+- `[R.M0.FOL.UNDO]`: FOL can be used for undo-redo recovery
+- `[R.M0.CACHE.MD]`: metadata caching is supported
+
+## Design Highlights
+- The file operation log will reference particular versions of cobs (or gobs). The version information enables undo and redo of file operations.
+- cob metadata will be stored in database tables.
+- The database tables will be stored persistently in a metadata container.
+- There may be multiple cob domains with distinct tables inside a single container.
+
+## Functional Specification
+Cob code:
+- provides access to file metadata via fid lookup
+- provides access to file metadata via namespace lookup
+- organizes metadata for efficient filesystem usage (esp. stat() calls)
+- allows creation and destruction of component objects
+- facilitates metadata modification under a user-provided transaction
+
+## Logical Specification
+### Structures
+
+Three database tables are used to capture cob metadata:
+- object-index table
+  - key is {child_fid, link_index} pair
+  - record is {parent_fid, filename}
+- namespace table
+  - key is {parent_fid, filename}
+  - record is {child_fid, nlink, attrs}
+  - if nlink > 0, attrs = {size, mactime, omg_ref, nlink}; else attrs = {}
+  - multiple keys may point to the same record for hard links, if the database can support this. Otherwise, we store the attrs in one of the records only (link number 0). This leads to a long sequence of operations to delete a hard link, but it is straightforward.
+- fileattr_basic table
+  - key is {child_fid}
+  - record is {layout_ref, version, lsn, acl} (version and lsn are updated at every fop involving this fid)
+
+link_index is an ordinal number distinguishing between hard links of the same fid. E.g., file a/b with fid 3 has a hard link c/d. In the object index table, the key {3,0} refers to a/b, and {3,1} refers to c/d.
+
+omg_ref and layout_ref refer to common owner/mode/group settings and layout definitions; these will frequently be cached in memory and referenced by cobs in a many-to-one manner. The exact specification of these is beyond the scope of this document.
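+For illustration, the keys and records above could be declared as follows; these are sketches only, not the actual on-disk formats (fids, for instance, are reduced to a single integer here).
+
+```
+/* Illustrative sketch of the table key/record layouts. */
+#include <stdint.h>
+
+struct oi_key {               /* object-index table key */
+        uint64_t child_fid;
+        uint32_t link_index;  /* ordinal number of the hard link */
+};
+
+struct oi_rec {               /* object-index table record */
+        uint64_t parent_fid;
+        char     filename[256];
+};
+
+struct ns_key {               /* namespace table key */
+        uint64_t parent_fid;
+        char     filename[256];
+};
+
+struct ns_rec {               /* namespace table record */
+        uint64_t child_fid;
+        uint32_t nlink;
+        /* attrs: meaningful only when nlink > 0 */
+        uint64_t size;
+        uint64_t mactime;
+        uint64_t omg_ref;     /* shared owner/mode/group settings */
+};
+```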
- fileattr_basic table
  - key is {child_fid}
  - record is {layout_ref, version, lsn, acl} (version and lsn are updated at every fop involving this fid)

link_index is an ordinal number distinguishing between hard links of the same fid. E.g., file a/b with fid 3 has a hard link c/d. In the object-index table, the key {3,0} refers to a/b, and {3,1} refers to c/d.

omg_ref and layout_ref refer to common owner/mode/group settings and layout definitions; these will frequently be cached in memory and referenced by cobs in a many-to-one manner. The exact specification of these is beyond the scope of this document.

References to the database tables are stored in a cob_domain in-memory structure. The database contents are stored persistently in a metadata container.

There may be multiple cob_domains within a metadata container, but the usual case will be one cob_domain per container. A cob_domain may be identified by an ordinal index inside a container. The list of domains will be created at container ingest.

```c
struct m0_cob_domain {
        cob_domain_id  cd_id;                 /* domain identifier */
        m0_list_link   cd_domain_linkage;
        m0_dbenv      *cd_dbenv;
        m0_table      *cd_obj_idx_table;
        m0_table      *cd_namespace_table;
        m0_table      *cd_fileattr_basic_table;
        m0_addb_ctx    cd_addb;
};
```

An m0_cob is an in-memory structure, instantiated by the method cob_find and populated as needed from the above database tables. The m0_cob may be cached and should be protected by a lock.

```c
struct m0_cob {
        fid                        co_fid;
        m0_ref                     co_ref;     /* refcounter for caching cobs */
        struct m0_stob            *co_stob;    /* underlying storage object   */
        m0_fol_obj_ref             co_lsn;
        u64                        co_version;
        struct namespace_rec      *co_ns_rec;
        struct fileattr_basic_rec *co_fab_rec;
        struct object_index_rec   *co_oi_rec;  /* pfid, filename */
};
```

The *_rec members are pointers to the records from the database tables. These records may or may not be populated at various stages of a cob's life. The co_stob reference is also likely to remain unset, as metadata operations will not frequently affect the underlying storage object; indeed, the storage object is likely to live on a different node.

### Usage
m0_cob_domain methods locate the database tables associated with a container; these methods are called during container discovery/setup.
m0_cob methods are used to create, find, and destroy in-memory and on-disk cobs. These might be:
- cob_locate: find an object by its fid, using the object-index table.
- cob_lookup: find an object via a namespace lookup (namespace table).
- cob_create: add a new cob to the cob_domain namespace.
- cob_remove: remove the object from the namespace.
- cob_get/put: take and release references on the cob. At the last put, the cob may be destroyed.

m0_cob_domain methods are limited to initial setup and cleanup functions and are called during container setup/cleanup.

Simple mapping functions from the fid to stob:so_id and to the cob_domain:cd_id are assumed to be available.

### Conformance
- `[I.M0.BACK-END.OBJECT-INDEX]`: the object-index table facilitates lookup by fid
- `[I.M0.BACK-END.INDEXING]`: new namespace entries are added to the db table
- `[I.M0.LAYOUT.BY-REFERENCE]`: layouts are referenced by layout ID in the fileattr_basic table
- `[I.M0.BACK-END.FAST-STAT]`: stat data is stored adjacent to the namespace record in the namespace table
- `[I.M0.DIR.READDIR.ATTR]`: the namespace table contains attrs
- `[I.M0.FOL.UNDO]`: versions and lsn's are stored with metadata for recovery
- `[I.M0.CACHE.MD]`: m0_cob is refcounted and locked

### Dependencies
- `[R.M0.FID.UNIQUE]`: uses; fids can be used to uniquely identify a stob
- `[R.M0.CONTAINER.FID]`: uses; fids identify the cob_domain via the container
- `[R.M0.LAYOUT.LAYID]`: uses; a reference is stored in the fileattr_basic table

## Use Cases

### Scenarios

|Scenario 1|QA.schema.op|
|----------|------------|
|Relevant quality attributes|variability, reusability, flexibility, modifiability|
|Stimulus source| a file system operation request originating from a protocol translator, native M0 client, or storage application|
|Environment| normal operation|
|Artifact| a series of schema accesses|
|Response| The metadata back-end contains enough information to handle file system operation requests. This information includes:<br>• standard file attributes as defined by POSIX, including access control related information;<br>• a description of the file system name-space, including directory structure, hard links, and symbolic links;<br>• references to remote parts of the file system namespace;<br>• file data allocation information|
|Response Measure| |
|Questions and issues| |


|Scenario 2|QA.schema.stat|
|----------|--------------|
|Relevant quality attributes| usability|
|Stimulus| a stat(2) request arrives in a Request Handler|
|Stimulus source| a user application|
|Environment| normal operation|
|Artifact| a back-end query to locate the file and fetch its basic attributes|
|Response| The schema must be structured so that stat(2) processing can be done quickly, without extra index lookups and associated storage accesses|
|Response Measure|• the average number of schema operations necessary to complete stat(2) processing;<br>• the average number of storage accesses during stat(2) processing|
|Questions and issues| |


|Scenario 3| QA.schema.duplicates|
|----------|---------------------|
|Relevant quality attributes| usability|
|Stimulus| a new file is created|
|Stimulus source| protocol translator, native M0 client, or storage application|
|Environment| normal operation|
|Artifact| records describing the new file are inserted in various schema indices|
|Response| Records must be small. The schema must exploit the fact that in a typical file system, certain sets of file attributes have far fewer distinct values than combinatorially possible. Such sets of attributes are stored by reference, rather than by duplicating the same values in multiple records. Examples of such attribute sets are:<br>• {file owner, file group, permission bits}<br>• {access control list}<br>• {file layout formula}|
|Response Measure|• the average size of data added to the indices as a result of file creation;<br>• attribute and attribute-set sharing ratio|
|Questions and issues| |


|Scenario 5|QA.schema.index|
|----------|---------------|
|Relevant quality attributes| variability, extensibility, re-usability|
|Stimulus| a storage application wants to maintain an additional metadata index|
|Stimulus source| storage application|
|Environment| normal operation|
|Artifact| index creation operation|
|Response| the schema allows dynamic index creation|
|Response Measure| |
|Questions and issues| |

This is OBSOLETED content.

diff --git a/doc/HLD-of-Motr-Spiel-API.md b/doc/HLD-of-Motr-Spiel-API.md
new file mode 100644
index 00000000000..0172987ec04
--- /dev/null
+++ b/doc/HLD-of-Motr-Spiel-API.md
@@ -0,0 +1,715 @@
# High Level Design of Motr Spiel API
This document presents a High-Level Design (HLD) of the Motr Spiel API. The main purposes of this document are:
- To be inspected by Motr architects and peer designers to ascertain that the HLD is aligned with Motr architecture and other designs, and contains no defects
- To be a source of material for Active Reviews of Intermediate Design (ARID) and Detailed Level Design (DLD) of the same component
- To serve as a design reference document

The intended audience of this document consists of Motr customers, architects, designers, and developers.

## Introduction
## 1. Definitions
* CMI: Spiel Configuration Management Interface
* CI: Spiel Command Interface

## 2. Requirements
Spiel requirements:
* `[r.m0.spiel]` The Spiel library implements an API that allows controlling cluster elements' state/subordination/etc. opaquely for the API user.
* `[r.m0.spiel.concurrency-and-consistency]` Every CMI API call results in originating a Spiel transaction, to:
  * allow concurrent calls originated by several API users;
  * allow keeping the cluster configuration database consistent in a concurrent environment.
* `[r.m0.spiel.transaction]` Currently, a transaction is expected to:
  * compose the complete configuration database from scratch, as per the information from the Spiel supplementary functionality (HALON/SSPL);
  * post the new configuration to all instances of confd databases currently known in the Spiel environment;
  * result in the newly introduced configuration change propagating across all cluster nodes, leaving:
    * confd database instances eventually consistent;
    * confc caches eventually matching the stable confd database version.
* `[r.m0.spiel.conf-db-abstraction]` From the API user's point of view, the configuration database is an abstract element tree, where the aspects of physical database replication are hidden from the API user and resolved at the appropriate lower levels and/or with appropriate mechanisms.
* `[r.m0.spiel.element-id]` A database element is identified by an ID unique across the entire cluster:
  * This implies that the Spiel API must resolve the element ID to the node endpoint where the action is to be executed, based on information initially provided as CLI parameters and later updated from recent configuration transactions.
  * The API user remains unaware of any transport-level details.
* `[r.m0.spiel.element-id-generation]` The Spiel API user is responsible for correct element ID generation:
  * the ID is unique (see [r.m0.spiel.element-id]) across the cluster;
  * a particular element ID embeds the element type ID.
* `[r.m0.spiel.element-types]` Element types to be supported so far:
  * Regular:
    * Filesystem
    * Node
    * Process
    * Service
    * Device
    * Pool
    * Rack
    * Enclosure
    * Controller
    * Pool version
  * V-objects:
    * Rack_v
    * Enclosure_v
    * Controller_v
* `[r.m0.spiel.conf-actions]` Currently, the actions to be supported, per element, are:
  * Add
  * Delete
* `[r.m0.spiel.conf-validation]` The validity of a configuration created by a series of Spiel calls is to be tested only outside of Spiel, e.g. on the confd side, but not in the Spiel calls themselves.
  * However, some level of trivial on-the-run validation is required. For example, an embedded m0_fid type must match the processed call semantically, and similar aspects.
* `[r.m0.spiel.environment]` Spiel must always make sure it runs on a consistently composed list of confd servers, i.e., the passed list of confd endpoints is identical to the list of confd servers in every existing database replica.
* `[r.m0.spiel.cmd-iface]` The Spiel API allows changing the cluster state via special operation requests to a dedicated Motr service. The Spiel API user is unaware of any underlying transport details.
* `[r.m0.spiel.cmd-iface-sync]` The Spiel command interface is synchronous.
* `[r.m0.spiel.cmd-iface-ids]` A command action is applicable only to objects available in the configuration database and identified by the element ID.
* `[r.m0.spiel.cmd-explicit]` A configuration change does not imply any automatic command execution. Any required command must be issued explicitly.
* `[r.m0.spiel.supported-commands]` Commands supported so far:
  * Service
    * Initialize
    * Start
    * Stop
    * Health
    * Quiesce
  * Device
    * Attach
    * Detach
    * Format
  * Process
    * Stop
    * Reconfigure
    * Health
    * Quiesce
    * List services
  * Pool
    * Start rebalance
    * Quiesce rebalance
    * Start repair
    * Quiesce repair

Supplementary requirements:
* Resource Manager:
  * `[r.m0.rm.rw-lock]` An additional type of shared resource, the Read/Write Lock, needs to be introduced:
    * The read lock must be acquired to let a confc instance access/update its local cache.
    * The write lock must be acquired to start updating the configuration database instances in a consistent and non-conflicting manner.
    * Write lock acquisition invalidates already granted read locks and this way forces confc instances to initiate version re-election and update their local caches.
  * `[r.m0.rm.failover]` If the RM service serving the configuration RW lock fails, a new RM service should be chosen to serve the configuration RW lock resource.
* Configuration database:
  * `[r.m0.conf.replicas]` The cluster must run more than one database instance (replica) available at any moment, each served by a single configuration server.
  * `[r.m0.conf.consistency]` Configuration database replicas at any given moment are allowed to be inconsistent, under the condition that one particular version number reaches a quorum across the cluster.
  * `[r.m0.conf.quorum]` The number of configuration servers running simultaneously in the cluster must satisfy

    Nconfd = Q + A,

    where Q is the quorum number to be reached during version negotiation, and A is a number satisfying A < Q.

    Note: The simplest case is A = Q - 1, which gives Nconfd = 2Q - 1.
* `[r.m0.conf.transaction]` A configuration database version needs to be distributed to all known confd servers in a transactional manner. The Spiel client must be able to distribute the version in the normal way, i.e. reaching quorum on the cluster, as well as in a forcible way, when the version is uploaded to as many confd servers as possible at the moment.
In the latter case, the forcible version upload must allow multiple attempts for the same transaction dataset.
* `[r.m0.conf.attributes]` A configuration database version needs to be identified by the following attributes:
  * version number, incremented from update to update;
  * transaction ID, generated uniquely across the entire cluster;
  * epoch (to be decided later).
* Configuration server (confd):
  * `[r.m0.confd.update-protocol]` A two-phase database update protocol, LOAD (long but non-blocking) preceding FLIP (short and blocking), needs to be supported to minimize the blocking effect on the entire cluster.
* Configuration client (confc):
  * `[r.m0.confc.rw-lock]` The new lock is adopted by the confc implementation where applicable.
  * `[r.m0.confc.quorum]` The client needs to poll all known configuration servers to find out the configuration version reaching quorum Q in the cluster right now. The quorum version must be the only one used in the process of serving configuration requests from Motr modules.

## 3. Design highlights
Spiel roughly falls into two types of interfaces:
* Configuration Management Interface (CMI)
* Command Interface (CI)

Logically, these interfaces provide different modes of operation. A series of CMI calls must explicitly form a transaction. CI calls do not require any transaction mechanism.

## 4. Functional specification

### 4.1 Configuration database
The configuration database format remains as is, but a new configuration path must be introduced to represent the current database attributes:
* version number
* quorum Q
* epoch (??)

TBD: A possible option would be storing the attributes as metadata, i.e. not in the configuration database but somewhere outside it. In this case, an extra call to confd must be introduced to let version attributes be communicated between confc and confd.

**Configuration change**

The Spiel client is able to compose a new database from scratch, using information on the cluster configuration stored/collected outside of Motr.

> **Note:** Collecting configuration information is outside of Spiel's responsibility.

`[i.m0.conf.transaction]` The Spiel library guides the entire process and completes it in a transactional manner. The Spiel client is responsible for explicitly opening the transaction, composing the new configuration database tree, and committing the transaction. The transaction commit, opaquely for the Spiel client, does all the required communication, version negotiation, and distribution (see the sketch below).

The same opened transaction may be committed either the normal way or forcibly. The normal way implies the transaction dataset is successfully uploaded to a number of confd servers reaching the quorum or greater. The forcible transaction commit always succeeds, no matter how many uploads succeeded; the same transaction dataset can be committed repeatedly, to eventually upload it to all known confd servers in the environment.

![IMAGE](./Images/SpielAPI01.png)

TBD: Potential issues identified so far:
* Read lock invalidation is too slow or fails with some credit borrower
  * This may block write lock acquisition
* Write lock is acquired but never released, due to an unrecoverable communication failure or Spiel client death
  * This may lock out the whole cluster, as no confc is going to be able to complete reading configuration data
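The commit rule above can be stated compactly. The following self-contained C sketch is an illustration only - the names (`tx_model`, `quorum`, `tx_commit`, `tx_commit_forced`) are hypothetical and are not the Motr Spiel API; it merely models when a normal commit succeeds and why a forced commit may be retried:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the Spiel transaction commit rule (not Motr code). */
struct tx_model {
	int nr_confd;   /* number of known confd replicas              */
	int nr_loaded;  /* replicas that stored the new version (LOAD) */
};

/* Smallest Q such that Q > Nconfd - Q, i.e. a strict majority. */
static int quorum(int nr_confd)
{
	return nr_confd / 2 + 1;
}

/* Normal commit: succeeds only when a quorum of LOADs succeeded. */
static bool tx_commit(const struct tx_model *tx)
{
	return tx->nr_loaded >= quorum(tx->nr_confd);
}

/*
 * Forcible commit: "succeeds" as long as at least one replica took the
 * dataset; it may be re-run with the same transaction dataset to push
 * the version to the replicas that are still missing it.
 */
static bool tx_commit_forced(const struct tx_model *tx)
{
	return tx->nr_loaded > 0;
}

int main(void)
{
	struct tx_model tx = { .nr_confd = 5, .nr_loaded = 2 };

	printf("normal commit: %s\n", tx_commit(&tx) ? "ok" : "no quorum");
	printf("forced commit: %s\n", tx_commit_forced(&tx) ? "ok" : "fail");
	return 0;
}
```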

> **Note:** `[i.m0.spiel.environment]` To comply with the `[r.m0.spiel.environment]` requirement, the transaction must end with Spiel making sure its confd list was initialised identical to the list of confd contained in the newly distributed quorum version; or the transaction must make sure, at the very beginning, that it starts with a list of confd identical to the latest known one.

Otherwise, the requirement seems excessive and should not be imposed at all, having only the Spiel confd initialisation list to rely on.

### 4.2 Configuration client

**Cache refresh**

![IMAGE](./Images/SpielAPI02.png)

Roughly, refreshing consists of gathering the current version numbers of all the known confd servers currently running, deciding which version number reaches the quorum, and then loading the requested configuration data from the confd server currently assigned to be active.

> **Note:** Read lock acquisition is not a standalone call; in accordance with the current RM design, it is hidden inside every GET operation issued to the confc cache. The diagram just highlights the fact that lock acquisition is mandatory for the operation to complete.

**Version re-election**

Re-election comes into effect either on the client's start, or when a read lock revocation is detected.

![IMAGE](./Images/SpielAPI03.png)

The configuration client emits asynchronous requests to all currently known confd servers, aiming to collect their configuration database version numbers. When some version reaches a quorum, it becomes the working (active) version until the next re-election (a minimal sketch of the election step follows below).

TBD: In case the required quorum is not reached, the node appears inoperable, having no active confc context to communicate over. This sort of ambiguity must be solved somehow, or reporting to HA may take place as a fallback policy.

Note: Ideally, the process of version re-election must end with the rconfc updating its list of known confd, taking the one from the newly elected version Vq, as the list may have changed.

TBD: The quorum number Q must be re-calculated as well, either immediately or on the next re-election start.

**RCONFC - Redundant configuration client**

To work in a cluster with multiple configuration servers, the consumer (a Motr module) makes use of the redundant configuration client (rconfc). The client, aware of the multiple confd instances, is able to poll all the confd servers, find out the version numbers of the configuration databases the servers run, and decide on the version to be used for reading. The decision is made when some version reaches a quorum. The rconfc carries out the whole process of version election, providing communication with the confd instances.

When the appropriate version is elected, the rconfc acquires the read lock from the RM. Successful acquisition of the lock indicates that no configuration change is currently in progress. The lock remains granted until the next configuration change is initiated by some Spiel client.
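As a minimal sketch of the election step under the stated quorum rule, the fragment below picks the version that reaches Q = N/2 + 1 among the polled replies. It is self-contained and hypothetical; `version_elect` is not the rconfc implementation:

```c
#include <stdio.h>

/*
 * Return the configuration version reaching quorum among the versions
 * reported by the n known confd servers, or 0 when no quorum exists.
 */
static unsigned long version_elect(const unsigned long *ver, int n)
{
	int q = n / 2 + 1;      /* strict majority */

	for (int i = 0; i < n; ++i) {
		int count = 0;

		for (int j = 0; j < n; ++j)
			count += ver[j] == ver[i];
		if (count >= q)
			return ver[i];
	}
	return 0;
}

int main(void)
{
	unsigned long polled[] = { 7, 7, 6, 7, 6 };  /* 5 confd replies */

	printf("elected version: %lu\n", version_elect(polled, 5));  /* 7 */
	return 0;
}
```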

> **Note:** The rconfc provides no special API for reading configuration data. Instead, it exposes a standard confc instance the consumer deals with in the standard way. When rconfc initialisation succeeds, the confc instance is properly set up and connected to one of the confd servers running the elected version. When the read lock is acquired, the rconfc sets up its confc by connecting it to a confd instance running the version that reached quorum. Rconfc remains in control of all confc context initialisations, blocking them when the read lock is being revoked.

With a confc context initialised, the consumer is allowed to conduct reading operations until the context finalisation. Rconfc does not interfere with the reading itself. When a confc cache cleanup is needed, rconfc waits for all configuration objects to be properly closed and all configuration contexts to be detached from the corresponding confc instance. During this waiting, rconfc prevents all new contexts from initialising and, therefore, from attaching to the controlled confc (a minimal model of this gating is sketched below).
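The gating can be pictured with a tiny POSIX-threads analogy. This is not the rconfc code - `gate`, `gate_enter()` and `gate_set()` are hypothetical names - it only models "new contexts block while the read lock is being revoked":

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate that confc context initialisation passes through. */
struct gate {
	pthread_mutex_t mtx;
	pthread_cond_t  cond;
	bool            open;
};

static struct gate conf_gate = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, true
};

/* Called on confc context initialisation: blocks while the gate is closed. */
static void gate_enter(struct gate *g)
{
	pthread_mutex_lock(&g->mtx);
	while (!g->open)
		pthread_cond_wait(&g->cond, &g->mtx);
	pthread_mutex_unlock(&g->mtx);
}

/* Called by the "rconfc": close on read-lock revocation, open on unlock. */
static void gate_set(struct gate *g, bool open)
{
	pthread_mutex_lock(&g->mtx);
	g->open = open;
	if (open)
		pthread_cond_broadcast(&g->cond);
	pthread_mutex_unlock(&g->mtx);
}
```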

> **Note:** The confc instances participating in version election are never blocked by rconfc.

**RCONFC Initialisation**

![IMAGE](./Images/SpielAPI04.png)

The RCONFC initialisation starts with the allocation of internal structures, mainly the quorum calculation context and the read lock context. The read lock is requested on the next step and, once it is successfully acquired, version election follows. On this step, all known confd servers are polled for their configuration version number. Based on the replied values, a decision is made on the version number the consumer is going to read from in subsequent operations. With this version number, a list of active confd servers is built, and the confc instance, hosted and exposed by the rconfc instance, is connected to one of those servers. Once successfully connected, it is ready for reading configuration data.

**Normal Processing**

![IMAGE](Images/SpielAPI05.png)

After successful rconfc initialisation, the consumer (a Motr module) operates with the configuration client exposed from the rconfc. The client reads the requested path in the usual way, starting with initialising a configuration context. Internally, the operation starts with the confc asking the rconfc whether a read operation is allowed at the moment; when the rconfc is not locked, the context is initialised, and the consumer goes on to read the configuration path from the confc's cache. In case the path is not in the cache, the corresponding confd is requested for the data.

When the rconfc is locked for reading, due to a read lock revocation caused by a configuration change and/or new version election, the configuration context initialisation is blocked until the moment of rconfc unlock. Multiple configuration context initialisations done simultaneously on the same confc will all wait for the same rconfc unlock.

**Fault tolerance**

Robustness is based on configuration redundancy and is provided by the ability of rconfc to switch its confc dynamically among confd instances without any need to invalidate currently cached data, because each time the switch is done to a confd running the same configuration version.

![IMAGE](./Images/SpielAPI06.png)

A read operation from the selected confd may fail in some cases. In this case, the rconfc must drive the configuration client through the list of confd addresses that point to the same quorum version (Vq) of the configuration database, trying the respective confd servers one by one until a success or the end of the list (a failover loop of this kind is sketched below).

In the latter case, the call returns failure to the Motr module that originated the call.
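Such a failover loop might look as follows. This is a hedged sketch: `confd_read()` is a stand-in for the real confc request path (assumed to return 0 on success and a negative errno value on failure), and the endpoint list is the Vq list mentioned above:

```c
#include <stddef.h>
#include <errno.h>

/* Stand-in for the real confc read request; an assumption, not Motr API. */
extern int confd_read(const char *endpoint, const char *path);

/* Try every confd holding the quorum version Vq until one read succeeds. */
static int read_with_failover(const char *const *eps, size_t nr,
			      const char *path)
{
	int rc = -ENOENT;

	for (size_t i = 0; i < nr; ++i) {
		rc = confd_read(eps[i], path);
		if (rc == 0)
			break;	/* served by eps[i]; stop trying */
	}
	return rc;	/* 0 on success, last error if all servers failed */
}
```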

> **Note:** It is up to the module to choose a scenario to properly react to the failure. It might decide to wait and retry the call, or notify HA about the failure(s) and do nothing, or initiate the caller's Motr node shutdown, etc.

**Resilience against top-level RM death**

**Requirements:**

As long as the RCONFC workflow relies on proper handling (getting/releasing) of the Conf Read Lock between the client and the remote RM, in case of the remote RM's death there must be a way for the RCONFC instance to:
* Detect the fact of RM death
* Properly handle (drop in a timely fashion) all previously borrowed resource credits on the client side
* Initiate the RCONFC restart routine
* In the course of the latter, discover the endpoint of the top-level online RM newly designated by HA, as well as refresh the set of CONFD services currently expected to be up and running
* Based on the endpoints obtained from the cluster, conduct the RCONFC restart and re-borrow the required credits

Assumptions:
* Two-tier RM architecture is in effect. (is it? does it matter???)
* Conf database:
  * A profile includes a set of CONFD and top-level RM services sharing the same endpoints with similar services in all other profiles.
  * The point is: no matter what profile a client runs in, the set of CONFD + RM endpoints remains the same, and the same online top-level RM endpoint is announced among the profiles' clients.
* RM service:
  * No matter local or remote, it is aware of the HA notification mechanism, and smart enough to handle on its own any node death that may affect the balance of credits/loans.
    * Therefore, the RM client side is responsible for a limited number of aspects to take into consideration, and must not care about the remote service side.
  * Running top-level, it is present in configuration on every node running a CONFD service, this way making the cluster run a redundant set of top-level RM services, all of which remain in transient state but one, chosen by HA and promoted to online state.
  * It serves simultaneously all requests for all known resource types.
    * This may be a subject for change in a future design, where an RM service may be configured to serve requests for some particular resource type(s) only.
* RM client:
  * Initially is unaware of any CONFD/RM endpoints.
  * On (re)initialisation, is responsible for discovering from the cluster:
    * the set of CONFD endpoints to connect its conf clients to;
    * the current online RM, guaranteeing that this only RM is to be used until getting an explicit death notification.
  * Is expected to be aware of the HA notification mechanism and to subscribe to HA notifications whenever required.
* HA service:
  * Is responsible for delivering nodes' HA status to every part constituting the Motr cluster. This implies that the clients are notified the same way as the servers.
  * Is responsible for detecting the death of the current top-level RM and designating, in place of the dead one, a new one from the rest of the transient RMs.
  * Notifies all known, currently reachable, cluster nodes about the death of a particular process as well as all its services.
    * It is up to HA what logic stands behind the notion of a 'dead process'.
    * This aspect may be a subject for future change, as a conflict between 'dead service' and 'dead process' may cause some ambiguity in understanding and evaluating the HA state of a particular conf object. The simplest resolution of the ambiguity might be considering a process dead when it has at least one dead service.
  * The process FID is included into the notification vector.
    * This may require the current HA state update mechanism to be revised in the aspect of updating the process conf object status.
  * The process' service FIDs are included into the notification vector.
  * Is responsible for announcing a new epoch, to disable processing of requests that came from the past.

**Questions, points for consideration:**
* Should a transient RM care about its current HA state and actively reject client requests?
* Should a formerly online RM care about its current HA state and start actively rejecting client requests once announced dead?
  * Note: HA may decide to announce dead a node that actually is alive but suffers from degradation in performance or network environment.
* Conf database versions are expected to be governed based on the same set of CONFD service endpoints, no matter what profile conf clients run in. So should it be with the top-level RM service used for Conf Read-Write Locks. But this scheme may appear inapplicable when talking about other resource types.
* So the question of a resource-type-dependent RM service/endpoint still requires additional consideration.
* So does the question of the relationship between the RM service and the conf profile.

**RCONFC initialisation details**

The HA session is established when module setup takes place. The session is kept in a globally accessible HA context, along with the list of confc instances. The client is allowed to add an arbitrary confc instance to the list, and the HA acceptance routine then applies, in the standard way, the received change vector to every instance in the list.

When RCONFC starts, it obtains the already established session to HA and sends a request for information about the current set of CONFD servers and the top-level RM server. The returned information is expected to include the mentioned servers' endpoints as well as their fids.

RCONFC stores the server endpoints internally, to use them in subsequent calls. The received service fids are put into a special 'phoney' confc instance, and this confc is added to the HA context. RCONFC subscribes to all confc objects placed into this confc instance's cache.

> **Note:** The confc cache is to be filled with the fids of interest artificially, not in the standard way. The reason is that at this moment RCONFC has no quorum yet and, therefore, does not know which CONFD server to read from.

The local RM is to use the said confc instance for its owner's subscription as well.

With the subscription successfully done, RCONFC continues its start routine with acquiring the read lock and starting version election.

![IMAGE](Images/SpielAPI07.png)

**RCONFC behavior on top-level RM death**

When an HA notification about top-level RM death comes, the local RM performs the standard processing for the creditor's death case. This implies that RCONFC's resource conflict callback is called, which makes RCONFC put the held read lock. At that point, RCONFC detects that its credit owner object got to its final state, and after that RCONFC invokes the 'start' routine, i.e. queries HA about the current CONFD and top-level RM servers, re-subscribes in accordance with the most recent information, and goes on with read lock acquisition and configuration version election.

![IMAGE](Images/SpielAPI08.png)

### 4.3 Configuration server
The configuration server keeps operating as before, i.e. on a locally stored configuration database snapshot. However, the server is now able to update the database file when explicitly instructed to do so.

The configuration files are placed on the server side as files in the IO STOB domain: the folder is the STOB domain folder, the current configuration file is identified by FID, and the file carries two version numbers - old and new - plus the transaction ID.

The update procedure is implemented by introducing a two-phase protocol.

**Phase 1: LOAD**

The command semantics imply that the new database file is just stored locally, without any interference with the currently active version.

![IMAGE](Images/SpielAPI09.png)

where:

* Vb: base version number, the number recently read by confc when reaching quorum
* Vn: new version number, the number generated by Spiel on transaction start, expected to be Vmax + 1 (the maximum number reported by confc contexts, plus one)
* Vcurrent: current version number stored on confd, normally equal to Vb

A minimal database validation is required before saving the database to the local file. During this command execution, the server is able to serve any read requests from clients.

TBD: Any dead-end situations, like running out of free drive space, etc., are to be resolved in some way, most probably by reporting to HA about being unable to operate any further, and possibly by confd leaving the cluster.

**Phase 2: FLIP**

The command explicitly instructs the server to find a previously loaded particular version and put it in effect.

![IMAGE](Images/SpielAPI10.png)

### 4.4 Resource Manager
The Resource Manager is used to control access to the contents of the confd configuration database replicas.

There are mainly three types of configuration database users that operate concurrently:
* confc
* confd
* spiel client

The confc makes read-only requests to the confd. The confd maintains the configuration database and handles read/write requests to it. The spiel client issues both read and write requests to the confd.

In order to serialize access to the confd configuration database, a new read/write lock (RW lock) resource type is introduced. One RW lock is used to protect access to all confd configuration databases.

**RW Lock**

The distributed RW lock is implemented over the Motr resource manager. The distributed RW lock allows concurrent access for read-only operations, while write operations require exclusive access. The confc acquires the read lock and holds it to protect its local database cache. The confd does not acquire the RW lock by itself; confd clients acquire the RW lock instead. The spiel client acquires the read lock on transaction opening to reach the quorum, and acquires the write lock in order to accomplish the FLIP operation. A local-lock analogy of these rules is sketched below.
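As a local analogy of the sharing semantics (an illustration only, not the distributed RM implementation), a POSIX read-write lock behaves the same way: readers hold the lock shared, while FLIP takes it exclusively:

```c
#include <pthread.h>

static pthread_rwlock_t conf_lock = PTHREAD_RWLOCK_INITIALIZER;

/* confc (and spiel transaction opening): shared read lock. */
static void confc_read_cached(void)
{
	pthread_rwlock_rdlock(&conf_lock);
	/* ... serve configuration reads from the local cache ... */
	pthread_rwlock_unlock(&conf_lock);
}

/* spiel FLIP: exclusive write lock, invalidating/blocking the readers. */
static void spiel_flip(void)
{
	pthread_rwlock_wrlock(&conf_lock);
	/* ... put the previously LOADed version in effect ... */
	pthread_rwlock_unlock(&conf_lock);
}
```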
**RM Service Failover**

The RM service controls access to the configuration database, which is essential for the proper functioning of the whole cluster, and thus becomes a single point of failure. The following algorithm addresses this issue:
1. There are Nrms RM services starting automatically at cluster startup, with Nrms = Nconfd.
2. Only one RM service is active at any point in time.
3. There is no synchronisation between RM services. While one is active, tracking credits for the RW lock, the others are idle and do nothing.
4. HA is responsible for making the decision about active RM service failure.
5. When the active RM service fails, HA selects a new active RM service and notifies all nodes about the switch to the new RM service.
6. On reception of the "RM service switch" HA notification, every node drops its cached credits for the configuration RW lock and reacquires them using the new RM address.

The HA decision about RM service failure is based on notifications sent by Motr nodes encountering RPC timeouts during interaction with the RM service. The decision about the next RM instance to be used is made by HA. This decision is provided through the general HA object state notification interface. In order to switch to the new instance consistently across the cluster, RM service states are maintained the following way:

1. Initially, the states of all but one RM instances are set to TRANSIENT, the remaining instance being ONLINE.
2. Then, each state transition notification fop that brings an RM service to the FAILED state simultaneously moves another RM instance to the ONLINE state, maintaining the invariant that no more than one RM instance is ONLINE at any time.

If all RMs are FAILED, then it is a system failure.

Since the currently active RM address should be known prior to accessing the configuration database, it is provided explicitly to configuration database users (a command-line parameter for m0d, an initialisation parameter for spiel).

Spiel is initialised with the active RM service address and then maintains this address internally, changing it on HA updates if necessary.

HA guarantees that it does not change the configuration database until all notifications about switching to the new RM service are replied to. That prevents the situation when the configuration database changes and some clients work with it through the new RM service while others work through the RM service considered failed. If some node does not reply to the switch notification, it should be considered FAILED by HA.

### 4.5 Command Interface

**Common Part**

The command interface provides the ability to change the cluster state by applying actions to the objects stored in the configuration database. The command interface performs actions by sending operation requests (FOPs) to a dedicated service on a remote node and waiting for the reply synchronously. The dedicated service's RPC endpoint is determined using the information present in the configuration database. The start-stop service (SSS) plays the role of the dedicated service in the current implementation.

![image](./Images/SpielAPI11.png)

> **Note:** Some commands return not only a status, but some other information as well. For example, the "list services" command for a process object returns the status of the command and the list of services.

**Device Commands**

The device commands are attach, detach, and format. The target service for these commands is the SSS service.

The current device context is loaded at Motr instance start, or before a LOAD/FLIP command. Device commands change the current status of the disk in the pool machine.

**Attach**

Find the disk item in the current pool machine and set the disk status to Online using the pool machine API (the state transit function). If the state transition succeeds, create the STOB if it does not exist yet.

**Detach**

Find the disk item in the current pool machine and set the disk status to Offline using the pool machine API (the state transit function). If the state transition succeeds, free the STOB.

**Format**

**Process Reconfig Command**

The reconfig command applies the Motr process configuration parameters stored in the configuration database. There are two parameters for now: memory limits and the processor core mask.

Memory limits can be applied easily using the setrlimit Linux API call in m0_init, before the initialisation of all subsystems.

The core mask specifies on which processors (cores) localities should be running. The core mask is applied through restarting localities. Localities are a part of the FOM domain, which is initialised during Motr initialisation (m0_init). So, in order to restart localities, the whole Motr instance should be re-initialised. That involves stopping all running services, Motr instance reinitialisation (m0_fini/m0_init), and starting the basic services again. The Motr library user is responsible for proper Motr reinitialisation. The reconfig command will be supported only by the m0d Motr user for now. In order to notify m0d that Motr should be re-initialised, a UNIX signal is used. After reception of this signal, m0d finalizes the Motr instance and starts it again.

Note that a unit test can't check process reconfiguration, since the UT framework behaves differently than m0d.

The reply to the process reconfig command is sent to the spiel user before the actual reconfiguration is done. This is because during the reconfiguration the RPC infrastructure for sending the reply is lost. Therefore, the reply is sent before the running services are stopped.

![image](./Images/SpielAPI12.png)

**Pool commands**

**Pool repair start command**

This command starts SNS repair processes on the nodes related to the pool.

The algorithm is:
1. Find all nodes related to the pool in the cluster configuration. A configuration node has a pointer to the appropriate configuration pool; therefore, the nodes can be found by the pool FID.
2. Find the SNS repair services that belong to those nodes. The endpoints of an SNS repair service and the corresponding ioservice are the same; thus, it suffices to find the endpoints of the ioservices.
3. Send a FOP with the REPAIR opcode to every service.
4. Once the FOP is received, the SNS repair service sends the reply FOP immediately and starts the repair process. The Spiel client is able to check the status of the running repair process with the "pool repair status" command.
5. Return a success result code to the Spiel client if every service replies with a success result code, or an error code if one replies with an error code.

![image](./Images/SpielAPI13.png)

**Pool rebalance start command**

This command starts SNS rebalance processes on the nodes related to the pool. The algorithm is similar to SNS repair:
1. Find all nodes related to the pool in the cluster configuration. A configuration node has a pointer to the appropriate configuration pool; therefore, the nodes can be found by the pool FID.
2. Find the SNS rebalance services that belong to those nodes.
The endpoints of an SNS rebalance service and the corresponding ioservice are the same; thus, it suffices to find the endpoints of the ioservices.
3. Send a FOP with the REBALANCE opcode to every service.
4. Once the FOP is received, the SNS rebalance service sends the reply FOP immediately and starts the rebalance process. The Spiel client is able to check the status of the running rebalance process with the "pool rebalance status" command.
5. Return a success result code to the Spiel client if every service replies with a success result code, or an error code if one replies with an error code.

![image](./Images/SpielAPI14.png)

**Pool repair quiesce command**

This command pauses the SNS repair processes on the nodes related to the pool.

> **Note:** Currently, the functionality of SNS repair pause or resume is not implemented. Therefore, the Spiel function returns -ENOSYS.

**Pool rebalance quiesce command**

This command pauses the SNS rebalance processes on the nodes related to the pool.

> **Note:** Currently, the functionality of SNS rebalance pause or resume is not implemented. Therefore, the Spiel function returns -ENOSYS.

**Pool repair continue command**

This command resumes an SNS repair process which was paused on the nodes related to the pool.

The algorithm is:
1. Find all nodes related to the pool in the cluster configuration. A configuration node has a pointer to the appropriate configuration pool; therefore, the nodes can be found by the pool FID.
2. Find the SNS repair services that belong to those nodes. The endpoints of an SNS repair service and the corresponding ioservice are the same; thus, it suffices to find the endpoints of the ioservices.
3. Send a FOP with the CONTINUE opcode to every service.
4. Once the FOP is received, the SNS repair service sends the reply FOP immediately and resumes the repair process.
5. Return a success result code to the Spiel client if every service replies with a success result code, or an error code if one replies with an error code.

![image](./Images/SpielAPI15.png)

**Pool rebalance continue command**

This command resumes an SNS rebalance process which was paused on the nodes related to the pool, with SNS rebalance services implied instead of SNS repair ones.

**Pool repair status command**

This command polls the progress of the current repair process on the nodes related to the pool. The SNS service reply consists of two values:

* the state of the current repair process;
* the progress, as a percentage or as copied bytes/total bytes, or an error code if the repair failed.

SNS repair may be in the following states: IDLE, STARTED, PAUSED, FAILED.

* The service is considered IDLE if no repair is running at the moment.
* It is STARTED if a repair is running.
* It is PAUSED if the repair was paused.
* It is FAILED if an error occurred during the repair.

The state diagram for the repair status:

![image](./Images/SpielAPI16.png)

The algorithm is:
1. Find all nodes related to the pool in the cluster configuration. A configuration node has a pointer to the appropriate configuration pool; therefore, the nodes can be found by the pool FID.
2. Find the SNS repair services that belong to those nodes. The endpoints of an SNS repair service and the corresponding ioservice are the same; thus, it suffices to find the endpoints of the ioservices.
3. Send a FOP with the STATUS opcode to every service.
4. Return a success result code to the Spiel client if every service replies with the progress of the current repair process, if one is running.

![image](./Images/SpielAPI17.png)

**Pool rebalance status command**

This command polls the progress of the current rebalance process on the nodes related to the pool. The algorithm is the same as for the SNS repair status, with SNS rebalance services implied instead of SNS repair ones.

**File System Commands**

This section describes commands specific to the FS object.

**Get FS free/total space**

This command is intended to report the free/total space the distributed file system provides across all its nodes. The counters are of uint64_t size. So far, only ioservice and mdservice are going to report the space counters. The counters are used as they are at the moment of querying; no transaction of any sort is implied.

To obtain the space sizes, the spiel user always has to make an explicit call, as no size-change notification mechanism is expected so far.
Schematically, the execution goes as follows:

* The Spiel user calls the API entry, providing it with the FS FID.
* Spiel finds the FS object with this FID in the configuration profile and iterates through the FS node objects.
* Per node, it iterates through the process objects.
* Per process:
  * Spiel makes a call for the process health.
  * On the corresponding SSS service side:
    * It iterates through the m0 list of BE domain segments.
    * Per segment, information is obtained from the segment's BE allocator object:
      * total size
      * free size
    * The free/total sizes are added to the respective accumulator values.
    * If the IO service is up and running, it iterates through the m0 list of storage devices.
      * Per device object:
        * The stob domain object the device operates on is fetched, and information is extracted from the domain's allocator object:
          * free blocks
          * block size
          * total size
        * The total size is added to the respective accumulator value unconditionally.
        * The free size is added to the respective accumulator value only in case the storage device is found in the list of pool devices stored in the m0 pool machine state object, and the corresponding pool device state is "on-line".
    * The final accumulator values are included into the service status reply.
  * The call returns the process status reply, including the free/total size values.
  * Spiel accumulates the size values collected per process call.
* The resultant free/total size values are returned to the Spiel user.

### 4.6 System Tests
This section defines scenarios for system tests of the Spiel functionality. The tests are to cover two major areas:
* distributed confd management;
* control commands.

**Confd#1: normal case**
* All confd instances get the same initial DB.
* Start all Motr nodes.
* Validate that all confc consumers are connected to confd and have loaded the same DB version.

**Confd#2: split versions, with quorum**
* The first N confd instances get an initial DB with version v1.
* The remaining N + 1 confd instances get an initial DB with version v2.
* Start all Motr nodes.
* Validate that all confc consumers are connected to confd and have loaded DB version v2.

**Confd#3: start with broken quorum and then restore it**
* The first N confd instances get an initial DB with version v1.
* The next N confd instances get an initial DB with version v2.
* The remaining 1 confd instance gets an initial DB with version v3.

> **Note:** There is no "winner", no quorum in this setup.

* Start all Motr nodes.
* Expected behavior:
  * no crashes, but
  * the cluster is not functioning.

> **Note:** This behavior needs discussion; as of now, the behavior for this situation is not defined.

* Use a spiel client to apply a new DB version, v4.
  * Expected to succeed.
* Validate that all confc consumers are now unlocked, services are started, and the Motr cluster is ready to process requests (probably the easiest way would be to feed some I/O requests to Motr and make sure they succeed).

**Confd#4: concurrent DB updates**

* Launch N spiel clients (processes or threads).
* Each client, in a loop, tries to upload a new DB version to the confd.
  * This has to happen simultaneously over all clients, with minimal delays, in an attempt to make them send the actual LOAD/FLIP commands at the same time.
  * Run the loop for some prolonged time, to increase the chances of getting a conflict.

Note: This is a non-deterministic test, but there does not seem to be a way within the current system test framework to make this test deterministic.

* Use an (N+1)-th spiel client to monitor the confd state. Expectations are:
  * confd remains "in quorum" - meaning the algorithm is stable under load;
  * the confd version increases - meaning that at least some of the parallel clients succeed in updating the DB, and no permanent dead-lock situations occur.

**Cmd#1: control commands**
* Start the Motr cluster.
* Use spiel commands to validate the health of all services and processes.
* Restart M services.
* Reconfig N processes. (Make sure that there are untouched services, untouched processes, restarted services within untouched processes and vice versa, and restarted services within reconfigured processes.)
* Use spiel commands to validate the health of all services and processes.
* Perform some I/O on the Motr fs, and make sure it all succeeds (thus validating that the cluster is truly alive, and all services and processes are truly OK).
* Use spiel commands to validate the health of all services and processes.

**Cmd#2: FS stats commands**
* Start the Motr cluster.
* Use spiel commands to validate the health of all services and processes.
* Test IO operation effects:
  * Get the file system stats.
  * Create a new file, inflate it by a predefined number of bytes.
  * Get the file system stats; make sure the free space decreased and the total space remained.
  * Delete the new file.
  * Get the file system stats; make sure the free space returned to the original value and the total space remained.
* Test file system repair effects:
  * Provoke a file system error.
  * Start the repair process.
  * While repairing, get the file system stats; make sure the free space decreased.
  * When the repair is completed, get the file system stats; make sure the free space recovered.
* Test reconfigure effects:
  * Detach some device; make sure the free space decreased and the total space decreased.
  * Attach the device back; make sure the free space recovered and the total space recovered.
  * Format the device.
  * When formatted, make sure the free space increased.
* Stop the Motr cluster.

## 5. Logical specification

### 5.1 Conformance

`[i.m0.spiel]`
The Spiel library purpose, behavior patterns, and mechanisms are explicitly and extensively described by the current design document.

`[i.m0.spiel.concurrency-and-consistency]`
The concurrent and consistent configuration change distribution is described in the Configuration change, Cache refresh, and Version re-election sections.

`[i.m0.spiel.transaction]`, `[i.m0.conf.transaction]`
The transaction basic flow is described in the Configuration change section.

`[i.m0.spiel.conf-db-abstraction]`
The configuration database abstraction approach is described in the Normal Processing and Fault tolerance sections.
`[i.m0.spiel.element-id-generation]`
The element ID is provided by the Spiel API user by design.

`[i.m0.spiel.element-types]`
The supported element types coverage is a subject for the DLD.

`[i.m0.spiel.conf-actions]`
The supported configuration actions are a subject for the DLD.

`[i.m0.spiel.conf-validation]`
The configuration validation is described in the Phase 1: LOAD section.

`[i.m0.spiel.environment]`
See the final note in the Configuration change section.

`[i.m0.spiel.cmd-iface]`
The command interface control flow is described in the Command Interface section. Implementation details will be covered by the command interface DLD.

`[i.m0.spiel.cmd-iface-sync]`
The Spiel command interface is synchronous. See the Command Interface section.

`[i.m0.spiel.cmd-iface-ids]`
Every command interface API function has an element ID as a parameter.

`[i.m0.spiel.cmd-explicit]`
A configuration database change does not trigger any command execution in the cluster (starting services, attaching new devices, etc.). These commands are issued explicitly by the Spiel API user. See the Phase 1: LOAD and Phase 2: FLIP sections, where no automatic command execution occurs on a configuration change.

`[i.m0.spiel.supported-commands]`
The supported command list is a subject for the DLD.

`[i.m0.rm.rw-lock]`
The RW lock usage is described in the Cache refresh and Version re-election sections.

`[i.m0.rm.failover]`
Failover is provided by the means described in the RM Service Failover section.

`[i.m0.conf.replicas]`
The use of configuration database replicas is described in the Cache refresh, Version re-election, Phase 1: LOAD, and Phase 2: FLIP sections.

`[i.m0.conf.consistency]`
The issue of providing consistent use of potentially inconsistent replicas is described in the Cache refresh, Version re-election, Phase 1: LOAD, and Phase 2: FLIP sections.

`[i.m0.conf.delivery]`
TBD

`[i.m0.conf.quorum]`
The process of reaching quorum is described in the Cache refresh and Version re-election sections.

`[i.m0.conf.attributes]`
The use of configuration database attributes is described in the Cache refresh and Version re-election sections.

`[i.m0.confd.update-protocol]`
The two-phase update protocol is described in the Phase 1: LOAD and Phase 2: FLIP sections.

`[i.m0.confc.rw-lock]`

<..>

`[i.m0.confc.quorum]`
The process of reaching quorum in the configuration client is described in the Version re-election section.

## Use cases

### 6.1 Scenarios

|Scenario| [usecase.component.name] |
|--------|---------------------------|
|Relevant quality attributes| [e.g., fault tolerance, scalability, usability, re-usability] |
|Stimulus| [an incoming event that triggers the use case] |
|Stimulus source| [system or external world entity that caused the stimulus]|
|Environment| [part of the system involved in the scenario] |
|Artifact| [change to the system produced by the stimulus] |
|Response| [how the component responds to the system change] |
|Response measure| [qualitative and (preferably) quantitative measures of response that must be maintained] |
|Questions and issues| |

diff --git a/doc/HLD-of-SNS-Repair.md b/doc/HLD-of-SNS-Repair.md
new file mode 100644
index 00000000000..180e97a3f79
--- /dev/null
+++ b/doc/HLD-of-SNS-Repair.md
@@ -0,0 +1,704 @@
# High Level Design of SNS Repair
This document provides a High-Level Design (HLD) of SNS repair for Motr.
The main purposes of this document are:
* To be inspected by Motr architects and peer designers to make sure that the HLD is aligned with Motr architecture and other designs, and contains no defects
* To be a source of material for Active Reviews of Intermediate Design (ARID) and Detailed Level Design (DLD) of the same component
* To serve as a design reference document

The intended audience of this document consists of Motr customers, architects, designers, and developers.

## Introduction
Redundant striping is a proven technology to achieve higher throughput and data availability. Server Network Striping (SNS) applies this technology to network devices and achieves goals similar to those of a local RAID. In the case of storage and/or server failure, SNS repair reconstructs the lost data from the surviving data reliably and quickly, without a major impact on the production systems. This document presents the HLD of SNS repair.

## Definitions
Repair is a scalable mechanism to reconstruct data or meta-data in a redundant striping pool. Redundant striping stores a cluster-wide object in a collection of components, according to a particular striping pattern.

The following terms are used to discuss and describe repair:
* Storage devices: attached to data servers.
* Storage objects: provide access to storage device contents using a linear namespace associated with an object.
* Container objects: some of the objects are containers, capable of storing other objects. Containers are organized into a hierarchy.
* A cluster-wide object is an array of bytes, together with accompanying meta-data, stored in the containers and accessed using read and write operations, as well as, potentially, other operations of a POSIX-like interface. The object's linear namespace is called the cluster-wide object data, and the index in this array is called an offset. A cluster-wide object can appear as a file if it is visible in the file system namespace.
* Fid: the identifier that uniquely identifies a cluster-wide object.
* A cluster-wide object layout is a map used to determine the location of a particular element of the object's byte array on storage. The present discussion is about striping layouts, which will simply be called layouts; there are other layouts, for encryption, deduplication, etc., that look different. In addition to the cluster-wide object data, a cluster-wide object layout specifies where and how redundancy information is stored.
* A cluster-wide object layout specifies the location of its data or redundancy information as a pair (component-id, component-offset). The component-id is the fid of a component stored in a certain container. (Layouts are also introduced on the component fids of a given cluster-wide object, but these are not important for the present specification.)
* A component is an object similar to a cluster-wide object: an array of bytes (the component data, indexed by component offset) plus some meta-data.
* A component has allocation data: the meta-data specifying the location of the component data within the container.
* The cluster-wide object layouts used in this document are piecewise linear mappings, in the sense that for every layout there exists a positive integer S, called a (layout) striping unit size, such that the layout maps any extent [p*S, (p + 1)*S - 1] onto some extent [q*S, (q + 1)*S - 1] of some target component.
Such an extent in a target is called a striping unit of the layout. The layout mapping within a striping unit is increasing (this characterizes it uniquely).
+* In addition to placing object data, striping layouts define the contents and placement of the redundancy information used to recreate lost data. The simplest example of redundancy information is given by RAID5 parity. In general, redundancy information is some form of check-sum over the data.
+* A striping unit to which cluster-wide object or component data are mapped is called a data unit.
+* A striping unit to which redundancy information is mapped is called a parity unit (this standard term will be used even though the redundancy information might be something other than parity).
+* A striping layout belongs to a striping pattern (N+K)/G if it stores K parity units with redundancy information for every N data units, and the units are stored in G containers. Typically, G is equal to the number of storage devices in the pool. Where G is not important or is clear from the context, one talks about the N+K striping pattern (which coincides with the standard RAID terminology). A sketch of the offset-to-unit mapping implied by these definitions is given at the end of this section.
+* A parity group is a collection of data units and their parity units. We only consider layouts where the data units of a parity group are contiguous in the source. We do consider layouts where the units of a parity group are not contiguous in the target (parity declustering). Layouts of the N+K pattern allow data in a parity group to be reconstructed when no more than K units of the parity group are missing.
+* This specification does not consider the meta-data objects associated with layouts. For completeness: meta-data object layouts are described in terms of meta-data keys (rather than byte offsets) and are also based on redundancy, typically in the form of mirroring. Replicas in such mirrors are not necessarily byte-wise copies of each other, but they are key-wise copies.
+* Components of a cluster-wide object are normally located on the servers of the same pool. However, during a migration, for example, a cluster-wide object can have a more complex layout with components scattered across multiple pools.
+* A layout is said to intersect (at the moment) with a storage device or a data server if it maps any data to the device or to any device currently attached to the server, respectively.
+* A pool is a collection of storage, communication, and computational resources (server nodes, storage devices, and network interconnects) configured to provide IO services with certain fault-tolerance characteristics. Specifically, cluster-wide objects are stored in the pool with striping layouts whose striping patterns guarantee that data remain accessible after a certain number of server and storage device failures. Additionally, pools guarantee (using the SNS repair described in this specification) that a failure is repaired within a certain time.
+
+Examples of striping patterns:
+* K = 0: RAID0, striping without redundancy
+* N = K = 1: RAID1, mirroring
+* K = 1 < N: RAID5
+* K = 2 < N: RAID6
+
+A cluster-wide object layout owns its target components. That is, no two cluster-wide objects store data or redundancy information in the same component object.
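+
+To make the piecewise-linear mapping and the (N+K)/G pattern concrete, here is a minimal illustrative sketch of the offset-to-unit arithmetic. This is not Motr's layout code; the names (`struct pattern`, `offset_to_unit`) are hypothetical, and parity declustering is ignored.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical (N+K)/G striping pattern descriptor. */
+struct pattern {
+        uint64_t S;  /* striping unit size, bytes     */
+        uint32_t N;  /* data units per parity group   */
+        uint32_t K;  /* parity units per parity group */
+        uint32_t G;  /* number of target containers   */
+};
+
+/*
+ * Map a byte offset in cluster-wide object data to (parity group,
+ * data unit within the group, offset within the unit).
+ */
+static void offset_to_unit(const struct pattern *p, uint64_t off,
+                           uint64_t *group, uint32_t *unit, uint64_t *within)
+{
+        uint64_t dunit = off / p->S;        /* global data unit index */
+
+        *group  = dunit / p->N;
+        *unit   = (uint32_t)(dunit % p->N); /* 0..N-1 are data units; parity
+                                             * units N..N+K-1 hold redundancy */
+        *within = off % p->S;               /* mapping increases within a unit */
+}
+```
+
+For example, with S = 4096 and a 4+2 pattern, offset 40960 is data unit 10, i.e., unit 2 of parity group 2.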
+## Requirements
+These requirements are already listed in the SNS repair requirements analysis document:
+* [r.sns.repair.triggers] Repair can be triggered by storage, network, or node failure.
+* [r.sns.repair.code-reuse] Repair of a network array and a local array is done by the same code.
+* [r.sns.repair.layout-update] Repair changes layouts as data are reconstructed.
+* [r.sns.repair.client-io] Client IO continues while repair is going on.
+* [r.sns.repair.io-copy-reuse] The following variability points are implemented as a policy applied to the same underlying IO copying and re-structuring mechanism:
+  * Do client writes target the same data that is being repaired, or are client writes directed elsewhere?
+  * Does the repair reconstruct only a sub-set of the data (for example, data missing due to a failure) or all data in the array?
+
+The following use cases are covered by the same IO restructuring mechanism:
+
+| | same layout |separate layouts |
+|------|--------------|-----------------|
+|missing data| in-place repair |NBA|
+|all data| migration, replication| snapshot taking|
+
+Here "same layout" means that client IO continues to the source layouts while data restructuring is in progress, and "separate layouts" means that client IO is re-directed to a new layout at the moment when data restructuring starts.
+
+"Missing data" means that only a portion of the source data is copied into a target, and "all data" means that all the data in the source layouts are copied.
+
+While data restructuring is in progress, the affected objects have composite layouts that reflect which parts of the object's linear namespace have already been restructured. Due to possible ongoing client IO against an object, such a composite layout can have a structure more complex than "old layout up to a certain point, new layout after".
+
+* [r.sns.repair.priority] Containers can be assigned a repair priority specifying in what order they are to be repaired. This allows critical cluster-wide objects (meta-data indices, cluster configuration database, etc.) to be restored quickly, and reduces the damage of a potential double failure.
+* [r.sns.repair.degraded] The pool state machine is in degraded mode during the repair. Individual layouts are moved out of degraded mode as they are reconstructed.
+* [r.sns.repair.c4] Repair is controllable by advanced C4 settings: it can be paused or aborted, and its IO priority can be changed. Repair reports its progress to C4.
+* [r.sns.repair.addb] Repair should produce ADDB records of its actions.
+* [r.sns.repair.device-oriented] Repair uses a device-oriented repair algorithm.
+* [r.sns.repair.failure.transient] Repair survives transient node and network failures.
+* [r.sns.repair.failure.permanent] Repair handles permanent failures gracefully.
+* [r.sns.repair.used-only] Repair should not reconstruct unused (free) parts of the failed storage.
+
+## Design highlights
+The SNS repair implementation is structured as a composition of two sub-systems:
+* Generic data restructuring engine (copy machine): a copy machine is a scalable distributed mechanism to restructure data in multiple ways (copying, moving, re-striping, reconstructing, encrypting, compressing, re-integrating, etc.). It can be used in a variety of scenarios, some enumerated in the following text.
+* SNS-specific part: the SNS-specific part of repair interacts with the sources of repair-relevant events (failures, recoveries, administrative commands, client IO requests). It constructs a copy machine suitable for SNS repair and controls its execution.
+
+The following topics deserve attention:
+* All issues and questions mentioned in the requirements analysis document must be addressed.
+* The pool state machine must be specified precisely.
+* The repair state machine must be specified precisely.
+* Handling of transient and permanent failures during repair must be specified precisely.
+* Interaction between repair and layouts must be specified.
+* Definitions must be made precise.
+* Details of iteration over objects must be specified.
+* Details of interaction between repair and DTM must be specified.
+* Redundancy other than N+1 (N+K, K > 1) must be regarded as a default configuration.
+* Multiple failures, and repair in the presence of multiple failures, must be considered systematically.
+* Repair and re-balancing must be clearly distinguished.
+* Reclaim of distributed spare space must be addressed (this is done in the separate Distributed Spare design documentation).
+* Locking optimizations must be considered.
+
+## Functional specification
+
+### 5.1. Overview
+When a failure is detected, the system decides whether to do an SNS repair. An SNS repair can simultaneously read data from multiple storage devices, aggregate the data, transfer it over the network, and place it into distributed spare space; the entire process can utilize the full bandwidth of the system resources. If another failure happens during this process, the repair is either reconfigured with new parameters and restarted, or fails gracefully.
+
+### 5.2. Failure type
+* Transient failure: a transient failure includes a short network partition or a node crash followed by a reboot. Formally, a transient failure is a failure that was healed before the system decided to declare the failure permanent. The RPC and networking layer handles transient network failures transparently (by resending). The DTM handles transient node failures (by recovery). Data or meta-data stored on the media drive are not damaged.
+* Permanent failure: a permanent failure means permanent damage to media drives, with no way to physically recover the data from the drive. The data have to be reconstructed from redundant information on surviving drives or restored from archival backups.
+* For the purposes of SNS repair, we only talk about permanent failures of storage devices or nodes. C4 and/or the SNS repair manager can distinguish the two types of failures from each other.
+* Failure detection is done by various components, e.g., the liveness layer.
+
+### 5.3. Redundancy level
+* A pool using the N+K striping pattern can recover from at most K drive failures: the system reconstructs lost units from the surviving units. K can be selected so that a pool can recover from a given number Kd of device failures and a given number Ks of server failures (assuming a uniform distribution of units across servers).
+* The default configuration will always have K > 1 to ensure the system can tolerate multiple failures.
+* A more detailed discussion can be found in "Reliability Calculations and Redundancy Level" and in the Scalability section below.
+
+### 5.4. Triggers of SNS repair
+* A failure of storage, network, or node is detected by various components (for example, the liveness layer) and reported to the components interested in the failure, including the pool machine and C4. The pool machine decides whether to trigger an SNS repair.
+* Multiple SNS repairs can be running simultaneously.
+
+### 5.5. Control of SNS repair
+* Running and queued SNS repairs can be listed upon query by management tools.
+* The status of an individual SNS repair can be retrieved upon query by management tools: estimated progress, estimated size, estimated time left; queued, running, or completed; etc.
+* An individual SNS repair can be paused/resumed.
+* A fraction of resource usage can be assigned to an individual repair by management tools. These resources include disk bandwidth, network bandwidth, memory, CPU usage, and others. The system has a default value, used when an SNS repair is initiated, which can be changed dynamically by management tools.
+* Resource usage will be reported and collected at some rate. This information will be used to guide future repair activities.
+* A status report will be sent asynchronously to C4 when a repair is started, completed, or failed, or has progressed.
+
+### 5.6. Concurrency & priority
+* To guarantee that a sufficient fraction of system resources is used, we guarantee that:
+  * only a single repair can go on in a given server pool, and
+  * different pools do not compete for resources.
+* Every container has a repair priority. A repair for a failed container has the priority derived from that container.
+
+### 5.7. Client I/O during SNS repair
+* From the client's point of view, client I/O continues to be served while the SNS repair is going on. Some performance degradation may be experienced, but this should not lead to starvation or indefinite delays.
+* Client I/O to surviving containers or servers is handled normally, although the SNS repair agents also read from and write to those containers while the repair is going on.
+* Client I/O to the failed container (or failed server) is either directed to the proper container according to the new layout, or served by retrieving data from the other containers and computing the lost data from the parity/data units. This depends on implementation options discussed later.
+* When the repair is completed, client I/O returns to its normal performance.
+
+### 5.8. Repair throttling
+* The SNS manager can throttle a repair according to system bandwidth and user control. This is done by dynamically changing the fraction of resource usage of an individual repair, or of repairs overall.
+
+### 5.9. Repair logging
+* SNS repair produces ADDB records about its operations and progress. These records include, but are not limited to: the start, pause, resume, or completion of an individual repair; the failure of an individual repair; the progress of an individual repair; the throughput of an individual repair; etc.
+
+### 5.10. Device-oriented repair
+An agent iterates over the components of the affected container, or of all containers that have surviving data/parity units in the parity groups that need to be reconstructed. These data/parity units are read and sent to the proper agent, where the spare space lives, and are used to re-compute the lost data. Please refer to the "HLD of Copy Machine and Agents".
+
+### 5.11. SNS repair and layout
+The SNS manager gets an input set configuration and an output set configuration when the repair is initiated. These input/output sets can be described by some form of layout. The SNS repair reads the data/parity from the devices described by the input set and reconstructs the missing data. In the process of reconstruction, the object layouts affected by the data reconstruction (layouts with data located on the lost storage device or node) are transactionally updated to reflect the changed data placement. Additionally, while the reconstruction is in progress, all affected layouts are switched into a degraded mode so that the clients can continue to access and modify data.
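+
+This flow can be summarized in a short sketch. It is illustrative only: `repair_conf`, `layout_set_degraded()`, and `copy_machine_run()` are hypothetical names, not the actual Motr interfaces, and error handling is omitted.
+
+```c
+struct layout;
+
+extern void layout_set_degraded(struct layout *l);
+extern int  copy_machine_run(struct layout **in, unsigned nr_in,
+                             struct layout **out, unsigned nr_out);
+
+/* Input/output set configurations handed to the SNS manager. */
+struct repair_conf {
+        struct layout **in;    /* surviving data/parity (input set)    */
+        unsigned        nr_in;
+        struct layout **out;   /* distributed spare space (output set) */
+        unsigned        nr_out;
+};
+
+static int repair_start(const struct repair_conf *rc)
+{
+        unsigned i;
+
+        /* Switch affected layouts into degraded mode first, so client
+         * IO can continue while the data are being reconstructed. */
+        for (i = 0; i < rc->nr_in; ++i)
+                layout_set_degraded(rc->in[i]);
+
+        /* Read the input set, reconstruct missing data, write to the
+         * output set; layout updates are applied transactionally. */
+        return copy_machine_run(rc->in, rc->nr_in, rc->out, rc->nr_out);
+}
+```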
+
+Note that the standard mode of operation is the so-called "non-blocking availability" (NBA), where after a failure the client can immediately continue writing new data without any IO degradation. To this end, the client is handed a new layout to which it can write. After this point, the cluster-wide object has a composite layout: some parts of the object's linear name-space are laid out according to the old layout, and other parts (the ones where clients wrote after the failure) according to the new one. In this configuration, clients never write to the old layout while its content is being reconstructed.
+
+The situation where there is client-originated IO against layouts being reconstructed is nevertheless possible, because:
+* reads have to access old data even under the NBA policy, and
+* there are non-repair reconstructions, like migration or replication.
+
+## Logical specification
+Please refer to the "HLD of Copy Machine and Agents" for the logical specification of a copy machine.
+
+### Concurrency control
+Motr will support a variety of concurrency control mechanisms, selected dynamically to optimize resource utilization. Without going into much detail, the following mechanisms are considered for controlling access to cluster-wide object data:
+* A complete file lock is acquired on a meta-data server when the cluster-wide object meta-data are fetched. This works only for the cluster-wide objects visible in a file system name-space (i.e., for files).
+* An extent lock is taken on one of the lock servers. A replicated lock service runs on the pool servers. Every cluster-wide object has an associated locking server where locks on extents of object data are taken. The locking server might be one of the servers where the object data are stored.
+* "Component level locking" is achieved by taking a lock on an extent of object data on the same server where these data are located.
+* Time-stamp-based optimistic concurrency control. See "Scalable Concurrency Control and Recovery for Shared Storage".
+
+Independently of whether a cluster-wide object-level locking model [1] (where the data are protected by locks taken on the cluster-wide object, either extent locks taken in the cluster-wide object byte offset name-space [2] or "whole-file" locks [3]), a component-level locking model, or a time-stamping model is used, locks or time-stamps are served by a potentially replicated locking service running on a set of lock servers (a set that might be equal to the set of servers in the pool). The standard locking protocol as used by file system clients would imply that all locks or time-stamps necessary for processing an aggregation group must be acquired before any processing can be done. This implies a high degree of synchronization between agents processing copy packets from the same aggregation group.
+
+Fortunately, this ordering requirement can be weakened by making every agent take the (same) required lock and assuming that the lock manager recognizes, by comparing transaction identifiers, that lock requests from different agents are part of the same transaction and, hence, are not in conflict [4]. The overhead of locking can be amortized by batching and locking-ahead.
+
+### Pool machine
+The pool machine is a replicated state machine, having replicas on all pool nodes.
Each replica maintains the following state:
+```
+node : array of struct { id    : node identity,
+                         state : enum state };
+device : array of struct { id    : device identity,
+                           state : enum state };
+read-version  : integer;
+write-version : integer;
+```
+where the state is enum { ONLINE, FAILED, OFFLINE, RECOVERING }. It is assumed that there is a function device-node() mapping a device identity to the index in node[] corresponding to the node the device is currently attached to. The elements of the device[] array corresponding to devices attached to non-online nodes are effectively undefined (the state transition function does not depend on them). To avoid mentioning this condition in the following, it is assumed that
+
+device-node(device[i].id).state == ONLINE
+
+for any index i in the device[] array; that is, devices attached to non-online nodes are excised from the state.
+
+State transitions of a pool machine happen when the state is changed on a quorum [5] of replicas. To describe state transitions, the following derived state (that is not necessarily stored on replicas) is introduced:
+
+* nr-nodes : number of elements in the node[] array
+* nr-devices : number of elements in the device[] array
+* nodes-in-state[S] : number of elements in the node[] array with the state field equal to S
+* devices-in-state[S] : number of elements in the device[] array with the state field equal to S
+* nodes-missing = nr-nodes - nodes-in-state[ONLINE]
+* devices-missing = nr-devices - devices-in-state[ONLINE]
+
+In addition to the state described above, a pool is equipped with a "constant" (in the sense that its modifications are beyond the scope of the present design specification) configuration state including:
+* max-node-failures: integer, the number of node failures that the pool tolerates;
+* max-device-failures: integer, the number of storage device failures that the pool tolerates.
+
+A pool is said to be a dud (Data possibly Unavailable or Damaged) when more devices and nodes have failed in it than the pool is configured to tolerate.
+Based on the values of the derived state fields, the pool machine state space is partitioned as follows:
+
+|devices-missing \ nodes-missing | 0 | 1 .. max-node-failures | max-node-failures + 1 .. nr-nodes |
+|--------------------------------|----|------------------------|-----------------------------------|
+|0 | normal | degraded | dud |
+|1 .. max-device-failures | degraded | degraded | dud |
+|max-device-failures + 1 .. nr-devices | dud | dud | dud |
+
+A pool state with nodes-missing = n and devices-missing = k is said to belong to a state class S(n, k); for example, any normal state belongs to the class S(0, 0).
+As part of changing its state, a pool machine interacts with external entities such as layout managers or client caches. During this interaction, multiple failures, delays, and concurrent pool machine state transitions might happen. In general, it is impossible to guarantee that all external states will be updated by the time the pool machine reaches its target state. To deal with this, the pool state contains a version vector, some components of which are increased on any state transition. All external requests to the pool (specifically, IO requests) are tagged with the version vector of the pool state the request issuer knows about. The pool rejects requests with incompatibly stale versions, forcing the issuer to renew its knowledge of the pool state. Separate read and write versions are used to avoid unnecessary rejections; for example, read requests are not invalidated by adding a new device or a new server to the pool. Finer-grained version vectors can be used, if necessary.
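+
+The state-class partition above can be illustrated with a small sketch computing the pool mode from the derived state. The names are hypothetical; this is not the actual Motr pool code:
+
+```c
+enum pool_mode { POOL_NORMAL, POOL_DEGRADED, POOL_DUD };
+
+struct pool_derived_state {
+        unsigned nodes_missing;       /* nr-nodes   - nodes-in-state[ONLINE]   */
+        unsigned devices_missing;     /* nr-devices - devices-in-state[ONLINE] */
+        unsigned max_node_failures;
+        unsigned max_device_failures;
+};
+
+/* Classify a pool state into the partition from the table above. */
+static enum pool_mode pool_mode(const struct pool_derived_state *s)
+{
+        if (s->nodes_missing > s->max_node_failures ||
+            s->devices_missing > s->max_device_failures)
+                return POOL_DUD;      /* data possibly unavailable or damaged */
+        if (s->nodes_missing > 0 || s->devices_missing > 0)
+                return POOL_DEGRADED; /* repair can still reconstruct data    */
+        return POOL_NORMAL;
+}
+```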
+An additional STOPPED state can be introduced for nodes and devices. This state is entered when a node or a device is deliberately and temporarily inactivated, for example, to move a device from one node to another or to re-cable a node as part of preventive maintenance. After a device or a node has stood in the STOPPED state for more than some predefined time, it enters the OFFLINE state. See details in the State section.
+
+### Server state machine
+The persistent server state consists of the server's copy of the pool state.
+On boot, a server contacts a quorum [6] of pool servers (counting itself) and updates its copy of the pool state. If recovery is necessary (unclean shutdown, i.e., the server state as returned by the quorum is not OFFLINE), the server changes the pool state (through the quorum) to register that it is recovering. After the recovery of distributed transactions completes, the server changes the pool state to indicate that the server is now in the ONLINE state (which must have been the server's pre-recovery state). See details in the State section.
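+
+A minimal sketch of this boot sequence follows. It is illustrative only; `pool_quorum_get_state()`, `pool_quorum_set_state()`, and `dtm_recover()` are hypothetical names standing for the quorum query/update and DTM recovery steps described above:
+
+```c
+struct server;
+enum srv_state { SRV_RESTART, SRV_RECOVERING, SRV_ONLINE, SRV_OFFLINE };
+
+extern enum srv_state pool_quorum_get_state(struct server *s);
+extern void           pool_quorum_set_state(struct server *s, enum srv_state st);
+extern void           dtm_recover(struct server *s);
+
+/* Boot-time flow: query a quorum of pool servers (counting this one)
+ * for the pool state; if the shutdown was unclean, register as
+ * recovering and run distributed-transaction recovery; finally report
+ * ONLINE through the quorum. */
+static void server_boot(struct server *s)
+{
+        if (pool_quorum_get_state(s) != SRV_OFFLINE) { /* unclean shutdown */
+                pool_quorum_set_state(s, SRV_RECOVERING);
+                dtm_recover(s);                        /* redo/undo txs */
+        }
+        pool_quorum_set_state(s, SRV_ONLINE);          /* pre-recovery state */
+}
+```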
+### 6.1. Conformance
+* [i.sns.repair.triggers] A pool machine registers with the health layer its interest [7] in hearing about device [8], node [9], and network [10] failures. When the health layer notifies [11] the pool machine about a failure, a state transition happens [12] and repair, if necessary, is triggered.
+* [i.sns.repair.code-reuse] Local RAID repair is a special case of general repair. When a storage device fails in a way that requires only local repair, the pool machine records this failure as in the general case and creates a copy machine to handle the repair. All agents of this machine operate on the same node.
+* [i.sns.repair.layout-update] When a pool state machine enters a non-normal state, it changes its version. A client attempting to do IO on layouts tagged with the old version would have to re-fetch the pool state. Optionally, the layout manager proactively revokes all layouts intersecting [13] with the failed device or node, or the copy machine "enter layout" progress call-back is used to revoke a particular layout. As part of re-fetching layouts, clients learn the updated list of alive nodes and devices. This list is a parameter to the layout [14]. The layout IO engine uses this parameter to do IO in degraded mode [15].
+* [i.sns.repair.client-io] Client IO continues while repair is going on. This is achieved by redirecting the clients to degraded layouts, which allows clients to collaborate with the copy machine in repair. After the copy machine notifies the pool machine of processing progress (through the "leave" progress call-back), the repaired parts of the layout [16] are upgraded.
+* [i.sns.repair.io-copy-reuse] The following table provides the input parameters to the copy machines implementing the required shared functionality:
+
+| |layout setup |aggregation function| transformation function| "enter layout" progress call-back| "leave layout" progress call-back|
+|---|--------------|-------------------|-------------------------|-------------------------------------|------------------------------------|
+|in-place repair| |aggregate striping units| recalculate lost striping units| layout moved into degraded mode| upgrade layout out of degraded mode|
+|NBA repair |original layout moved into degraded mode, new NBA layout created for writes| aggregate striping units| recalculate lost striping units| | update NBA layout|
+|migration| migration layout created| no aggregation| identity| |discard old layout|
+|replication| replication layout created| no aggregation |identity| | nothing|
+|snapshot taking| new layout created for writes| no aggregation| identity| | nothing|
+
+* [i.sns.repair.priority] Containers can be assigned a repair priority specifying in what order they are to be repaired. Prioritization is part of the storage-in agent logic.
+* [i.sns.repair.degraded] The pool state machine is in degraded mode during repair, as described in the pool machine logical specification. Individual layouts are moved out of degraded mode as they are reconstructed. When the copy machine is done with all components of a layout, it signals to the layout manager that the layout can be upgraded (either lazily [17] or by revoking all degraded layouts).
+* [i.sns.repair.c4]
+  * Repair is controllable by advanced C4 settings: it can be paused and its IO priority can be changed. This is guaranteed by dynamically adjustable copy machine resource consumption thresholds.
+  * Repair reports its progress to C4. This is guaranteed by the standard state machine functionality.
+* [i.sns.repair.addb] Repair should produce ADDB records of its actions: this is part of the standard state machine functionality.
+* [i.sns.repair.device-oriented] Repair uses a device-oriented repair algorithm, as described in the "On-line Data Reconstruction in Redundant Disk Arrays" dissertation: this follows from the storage-in agent processing logic.
+* [i.sns.repair.failure.transient] Repair survives transient node and network failures. After the failed node restarts or the network partition heals, distributed transactions, including the repair transactions created by the copy machine, are redone or undone to restore consistency. Due to the construction of repair transactions, the recovery also restores repair to a consistent state from which it can resume.
+* [i.sns.repair.failure.permanent] Repair handles permanent failures gracefully. Repair updates file layouts at transaction boundaries. Together with copy machine state replication, this guarantees that repair can continue in the face of multiple failures.
+* [i.sns.repair.used-only] Repair should not reconstruct unused (free) parts of the failed storage: this is a property of the container-based repair design.
+
+### 6.2. Dependencies
+* Layouts
+  * [r.layout.intersects]: It must be possible to efficiently find all layouts intersecting with a given server or a given storage device.
+  * [r.layout.parameter.dead]: A list of failed servers and devices is a parameter to a layout formula.
+  * [r.layout.degraded-mode]: The layout IO engine does degraded mode IO if directed to do so by the layout parameters.
+  * [r.layout.lazy-invalidation]: A layout can be invalidated lazily, on the next IO request.
+* DTM
+  * [r.fol.record.custom]: Custom FOL record types, with user-defined redo and undo actions, can be defined.
+  * [r.dtm.intercept]: It is possible to execute additional actions in the context of a user-level transaction.
+  * [r.dtm.tid.generate]: Transaction identifiers can be assigned by DTM users.
+* Management tool
+* RPC
+  * [r.rpc.maximal.bulk-size]
+  * [r.network.utilization]: An interface to estimate network utilization.
+  * [r.rpc.pluggable]: It is possible to register a call-back to be called by the RPC layer to process a particular RPC type.
+  * [r.rpc.streaming.bandwidth]: Optimally, streamed RPCs can utilize at least 95% of the raw network bandwidth.
+  * [r.rpc.async]: There is an asynchronous RPC sending interface.
+* Health and liveness layer
+  * [r.health.interest], [r.health.node], [r.health.device], [r.health.network]: It is possible to register interest in certain failure event types (network, node, storage device) for certain system components (e.g., all nodes in a pool).
+  * [r.health.call-back]: The liveness layer invokes a call-back when an event of interest happens.
+  * [r.health.fault-tolerance]: The liveness layer is fault-tolerant. Call-back invocation is carried through node and network failures.
+* DLM
+  * [r.dlm.enqueue.async]: A lock can be enqueued asynchronously.
+  * [r.dlm.logical-locking]: Locks are taken on cluster-wide objects.
+  * [r.dlm.transaction-based]: Lock requests are issued on behalf of transactions. Lock requests made on behalf of the same transaction are never in conflict.
+* Meta-data
+  * [u.md.iterator]: Generic meta-data iterators suitable for input set description.
+  * [u.md.iterator.position]: Meta-data iterators come with an ordered space of possible iteration positions.
+* State machines
+  * [r.machine.addb]: State machines report statistics about their state transitions to ADDB.
+  * [r.machine.persistence]: A state machine can be made persistent and recoverable. The local transaction manager invokes a restart event on persistent state machines after a node reboots.
+  * [r.machine.discoverability]: State machines can be discovered by C4.
+  * [r.machine.queuing]: A state machine has a queue of incoming requests.
+* Containers
+  * [r.container.enumerate]: It is possible to efficiently iterate through the containers stored (at the moment) on a given storage device.
+  * [r.container.migration.call-back]: A container notifies interested parties of its migration events.
+  * [r.container.migration.vote]: Container migration, if possible, includes a voting phase, giving interested parties an opportunity to prepare for the future migration.
+  * [r.container.offset-order]: Container offset order matches the underlying storage device block ordering closely enough to make container-offset-ordered transfers optimal.
+  * [r.container.read-ahead]: Containers do read-ahead.
+  * [r.container.streaming.bandwidth]: Large-chunk streaming container IO can utilize at least 95% of the raw storage device throughput.
+  * [r.container.async]: There is an asynchronous container IO interface.
+* Storage
+  * [r.storage.utilization]: An interface to measure the utilization of a given device over a certain time period.
+  * [r.storage.transfer-size]: An interface to determine the maximal efficient request size of a given storage device.
+  * [r.storage.intercept]: It should be possible to intercept IO requests targeting a given storage device.
+* SNS
+  * [r.sns.trusted-client] (constraint): Only trusted clients can operate on SNS objects.
+* Miscellaneous
+  * [r.processor.utilization]: An interface to measure processor utilization over a certain time period.
+* Quorum
+  * [r.quorum.consensus]: A quorum-based consensus mechanism is needed.
+  * [r.quorum.read]: Read access to quorum decisions is needed.
+
+### 6.3. Security model
+
+#### 6.3.1. Network
+It is assumed that messages exchanged over the network are signed so that the message sender can be established reliably. Under this condition, nodes cannot impersonate each other.
+
+#### 6.3.2. Servers
+The present design provides very little protection against a compromised server. While compromised storage-in or network agents can be detected by using striping redundancy information, there is no way to independently validate the output of a collecting agent or to check that the storage-out agent wrote the right data to the storage. In general, this issue is unavoidable as long as the output set can be non-redundant.
+
+If we restrict ourselves to the situations where the output set is always redundant, quorum-based agreement can be used to deal with malicious servers in the spirit of Practical Byzantine Fault Tolerance. The replicated state machine design of a copy machine lends itself naturally to a quorum-based solution.
+
+The deeper problem is due to servers collaborating in distributed transactions. Given that the transaction identifiers used by the copy machine are generated by a known method, a server can check that the server-to-server requests it receives are from well-formed transactions, and a malicious server cannot cause chaos by initiating malformed transactions. What is harder to counter is a server not sending requests that it must send according to the copying algorithm. We assume that the worst thing that can happen when a server delays or omits certain messages is that the corresponding transaction will eventually be aborted and undone. An unresponsive server is evicted from the cluster, and the pool handles this as a server failure. This still doesn't guarantee progress, because the server might immediately re-join the cluster only to sabotage more transactions.
+
+The systematic solution to such problems is to utilize the redundancy already present in the input set. For example, when a layout with an N+K (where K > 2) striping pattern is repaired after a single failure, the N+K-1 surviving striping units are gathered from each parity group. The collecting agent uses the additional units to check that every received unit matches the redundancy, and uses the majority in case of a mismatch. This guarantees that a single malign server can be detected. RAID-like striping patterns can be generalized from fail-stop failures to Byzantine failures. It seems that, as is typical for agreement protocols, an N+K pattern with K > 2*F would suffice to handle up to F arbitrary failures (including the usual fail-stop failures).
+
+#### 6.3.3. Clients
+In general, the fundamental difference between a server and a client is that the latter cannot be replicated, because it runs arbitrary code outside of Motr control. While the well-formedness of client-supplied transactions and client liveness can be checked with some effort, there is no obvious way to verify that a client calculates redundancy information correctly without sacrificing system performance to a considerable degree.
It is, hence, posited that SNS operations, including client interaction with the repair machinery, can originate only from trusted clients [18].
+
+#### 6.3.4. Others
+The SNS repair interacts with and depends on a variety of core distributed Motr services, including the liveness layer, lock servers, the distributed transaction manager, and the management tool. Security concerns for such services should be addressed generically and are beyond the scope of the present design.
+
+#### 6.3.5. Issues
+
+It is in no way clear that the analysis above is anywhere close to exhaustive. A formal security model is required [19].
+
+### 6.4. Refinement
+* Pool machine:
+  * Device-node function: the mapping between device identifiers and node identifiers is an implicit part of the pool state.
+
+## State
+### 7.1. Pool machine states, events, transitions
+The pool machine states can be classified into state classes S(n, k). "Macroscopic" transitions between state classes are described by the following state machine diagram:
+
+![image](./Images/StateMachine.png)
+
+Here device leave is any event that increases the devices-missing field of the pool machine state: planned device shutdown, device failure, detaching a device from a server, etc. Similarly, device join is any event that decreases devices-missing: addition of a new device to the pool, device startup, etc. The same applies, mutatis mutandis, to the node leave and node join events.
+
+Within each state class, the following "microscopic" state transitions happen:
+
+![image](./Images/MicroscopicState.png)
+
+where join is either node_join or device_join, leave is either node_leave or device_leave, and spare means distributed spare space.
+
+Or, in table form (a code-level reading of this table is sketched after the invariants below):
+
+| |has_spare_space| !has_spare_space| repair_done| rebalance_done| join |leave|
+|---|----------------|------------------|-------------|----------------|-------|-----|
+|choice pseudo-state| repair/spare_grab(); start_repair_machine()| dud| impossible| impossible| impossible| impossible|
+|repair| impossible| impossible| repair_complete| impossible| defer, queue| S(n+a, k+b)/stop_repair()|
+|repair_complete| impossible| impossible| impossible| impossible| rebalance| S(n+a, k+b)|
+|rebalance| impossible| impossible| impossible| S(n-a, k-b)/spare_release()| defer, queue| S(n+a, k+b)/stop_rebalance()|
+
+Recall that a pool state machine is replicated, and its state transition is, in fact, a state transition on a quorum of replicas. Impossible state transitions, happening when a replica receives an unexpected event, are logged and ignored. It is easy to see that every transition out of an S(n, k) state class is either directly caused by a join or leave event or directly preceded by such an event.
+
+Events:
+* storage device failure
+* node failure
+* network failure
+* storage device recovery
+* node recovery
+* network recovery
+* media failure
+* container transition
+* client read
+* client write
+
+### 7.2. Pool machine state invariants
+In a state class S(n, k), the following invariants are maintained (for a replicated machine, a state invariant is a condition that is true on at least some quorum of state replicas):
+* If n <= max-node-failures and k <= max-device-failures, then exactly F(n, k)/F(max-node-failures, max-device-failures) of the pool's (distributed) spare space is busy, where the definition of the F(n, k) function depends on the details of the striping pattern used by the pool (described elsewhere). Otherwise, the pool is a dud and all spare space is busy.
+* (This is not, technically speaking, an invariant.) The version vector part of the pool machine state is updated so that layouts issued before a cross-class state transition can be invalidated if necessary.
+* No repair or rebalancing copy machine is running when a state class is entered or left.
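+
+The microscopic transition table can be read as the following event-handling skeleton. This is an illustrative sketch only: the action names (`spare_grab()`, `start_repair_machine()`, etc.) come from the table, but the surrounding structure is hypothetical and quorum replication is ignored:
+
+```c
+enum pm_state { PM_CHOICE, PM_REPAIR, PM_REPAIR_COMPLETE, PM_REBALANCE, PM_DUD };
+enum pm_event { EV_JOIN, EV_LEAVE, EV_REPAIR_DONE, EV_REBALANCE_DONE };
+
+extern int  has_spare_space(void);
+extern void spare_grab(void);
+extern void spare_release(void);
+extern void start_repair_machine(void);
+extern void stop_repair(void);
+extern void start_rebalance_machine(void);
+extern void stop_rebalance(void);
+extern void defer(enum pm_event ev);
+
+/* One "microscopic" step within a state class S(n, k). Entering the
+ * class starts in PM_CHOICE; EV_LEAVE transitions re-enter the choice
+ * pseudo-state of the new class S(n+a, k+b). */
+static enum pm_state pm_step(enum pm_state st, enum pm_event ev)
+{
+        switch (st) {
+        case PM_CHOICE:
+                if (!has_spare_space())
+                        return PM_DUD;
+                spare_grab();
+                start_repair_machine();
+                return PM_REPAIR;
+        case PM_REPAIR:
+                if (ev == EV_REPAIR_DONE)
+                        return PM_REPAIR_COMPLETE;
+                if (ev == EV_JOIN)
+                        defer(ev);                /* defer, queue    */
+                if (ev == EV_LEAVE) {
+                        stop_repair();            /* -> S(n+a, k+b)  */
+                        return PM_CHOICE;
+                }
+                return st;
+        case PM_REPAIR_COMPLETE:
+                if (ev == EV_JOIN) {
+                        start_rebalance_machine();
+                        return PM_REBALANCE;
+                }
+                return ev == EV_LEAVE ? PM_CHOICE : st;
+        case PM_REBALANCE:
+                if (ev == EV_REBALANCE_DONE) {
+                        spare_release();          /* -> S(n-a, k-b)  */
+                        return PM_CHOICE;
+                }
+                if (ev == EV_JOIN)
+                        defer(ev);
+                if (ev == EV_LEAVE) {
+                        stop_rebalance();
+                        return PM_CHOICE;
+                }
+                return st;
+        default:
+                return st;
+        }
+}
+```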
+### 7.3. Server machine states, events, transitions
+
+State transition diagram:
+
+![image](./Images/StateTransition.png)
+
+Where:
+* The restart state queries the pool quorum (including the server itself) for the pool machine state (including the server state).
+* The notify action notifies the replicated pool machine about changes in the server state or the state of some storage device attached to the server.
+
+In table form:
+
+| |restart | in_pool.recovering | in_pool.online| in_pool.offline|
+|---|--------|----------------------|----------------|------------------|
+|got_state| in_pool.recovering/notify| impossible| impossible| impossible|
+|fail| restart |impossible| impossible |impossible|
+|done |impossible| online/notify| impossible |impossible|
+|off| impossible |impossible |offline/notify| impossible|
+|on |impossible| impossible| impossible| online/notify|
+|IO_req |defer| defer| online/process| offline/ignore|
+|device_join| defer| defer |online/notify| offline/notify|
+|device_leave| defer| defer| online/notify| offline/notify|
+|reset| restart| restart |restart| restart|
+
+### 7.4. Server state machine invariants
+Server state machine: no storage operations in the OFFLINE state.
+
+### 7.5. Concurrency control
+All state machines function according to the Run-To-Completion (RTC) model. In the RTC model, each state transition is executed completely before the next state transition is allowed to start. Queuing [20] is used to defer concurrently incoming events.
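+
+As a small illustration of the RTC model (hypothetical names, not Motr code), a machine's loop dequeues one event at a time and runs its transition to completion before looking at the next event:
+
+```c
+struct event;
+struct event_queue;
+struct machine { int state; struct event_queue *incoming; };
+
+extern struct event *queue_pop(struct event_queue *q);
+extern int           handle(struct machine *m, struct event *ev);
+
+static void rtc_loop(struct machine *m)
+{
+        struct event *ev;
+
+        /* No transition of this machine runs concurrently with another:
+         * handle() completes before the next event is dequeued, so
+         * handlers always see a consistent state. */
+        while ((ev = queue_pop(m->incoming)) != NULL)
+                m->state = handle(m, ev);
+}
+```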
+## Use cases
+### 8.1. Scenarios
+
+|Scenario| usecase.repair.throughput-single|
+|--------|----------------------------------|
+|Business goals |High availability |
+|Relevant quality attributes| Scalability|
+|Stimulus |Repair invocation|
+|Stimulus source| Node, storage or network failure, or administrative action|
+|Environment| Server pool|
+|Artifact| Repair data reconstruction process running on the pool|
+|Response| Repair utilizes hardware efficiently|
+|Response measure| Repair can utilize at least 90 percent of the raw hardware bandwidth of any storage device and any network connection it uses, subject to administrative restrictions. This is achieved by:
          • Streaming IO done by the storage-in and storage-out agents, together with the guarantee that large-chunk streaming container IO can consume at least 95% of raw storage device bandwidth [21]
          • Streaming network transfers done by the network-in and network-out agents, together with the guarantee that optimal network transfers can consume at least 95% of raw network bandwidth [22].
          • The assumption that there are enough processor cycles to reconstruct data from redundant information without the processor becoming a bottleneck.
            A more convincing argument can be made by simulating the repair.|
+|Questions and issues| |
+
+|Scenario |usecase.repair.throughput-total|
+|---------|-------------------------------|
+|Business goals| High availability|
+|Relevant quality attributes| Scalability|
+|Stimulus |Repair invocation|
+|Stimulus source| Node, storage or network failure, or administrative action|
+|Environment |A pool|
+|Artifact| Repair data reconstruction process running on the pool|
+|Response| The repair process utilizes the storage and network bandwidth of as many pool elements as possible, even if some elements have already failed and are not being replaced.|
+|Response measure| The fraction of pool elements participating in the repair, as a function of the number of failed units. This is achieved by distributed parity layouts, which on average spread parity groups uniformly across all devices in a pool.|
+|Questions and issues| |
+
+
+|Scenario |usecase.repair.degradation |
+|---------|---------------------------|
+|Business goals| Maintain an acceptable level of system performance in degraded mode|
+|Relevant quality attributes| Availability|
+|Stimulus| Repair invocation|
+|Stimulus source| Node, storage or network failure, or administrative action|
+|Environment |A pool |
+|Artifact |Repair process competing with ongoing IO requests to the pool|
+|Response| The fraction of the total pool throughput consumed by the repair at any moment in time is limited|
+|Response measure | The fraction of total throughput consumed by the repair at any moment is lower than the specified limit. This is achieved as follows: the repair algorithm throttles itself to consume no more than the fraction of system resources (storage bandwidth, network bandwidth, memory) allocated to it by a system parameter. The following agents throttle themselves according to their respective parameters:
            • Storage-in agent,
            • Storage-out agent,
            • Network-in agent,
            • Network-out agent,
            • Collecting agent
              Additionally, the storage and network IO requests are issued by repair with a certain priority, controllable by a system parameter.| +|Questions and issues| | + +|Scenario| usecase.repair.io-copy-reuse| +|--------|------------------------------| +|Business goals| Flexible deployment| +|Relevant quality attributes| Reusability| +|Stimulus| Local RAID repair | +|Stimulus source| Storage unit failure on a node| +|Environment |Motr node with a failed storage unit| +|Artifact |Local RAID repair| +|Response |Local RAID repair uses the same algorithms and the same data structures as network array repair| +|Response measure| A ratio of code shared between local and network repair. This is achieved by:
              • The same algorithms and data structures are used for parity computation, data reconstruction, resource consumption limitation, etc. |
+|Questions and issues| |
+
+|Scenario |usecase.repair.multiple-failure|
+|---------|-------------------------------|
+|Business goals| The system behaves predictably in any failure scenario. Failures beyond the redundancy level must be unlikely enough over the lifetime of the system (e.g., 5-10 years) to achieve a certain number of 9's in data availability/reliability. The case where less than an entire drive fails beyond the redundancy level (media failure) is considered elsewhere.|
+|Relevant quality attributes| Fault-tolerance|
+|Stimulus |A node, storage, or network failure happens while a repair is going on.|
+|Stimulus source| Hardware or software malfunction|
+|Environment| A pool in degraded mode|
+|Artifact |More units from a certain parity group are erased by the failure than the striping pattern can recover from.|
+|Response |Repair identifies the lost data and communicates the information about the data loss to the interested parties.|
+|Response measure |Data loss is contained and identified. This is achieved by proper pool state machine transitions:
                • When more units from a certain parity group are detected to have failed than the striping pattern can tolerate, the pool machine transitions to dud, and
                • Internal read-version and write-version of the state machine will be increased, and pending client I/O will get an error.| +|Questions and issues| | + +|Scenario| usecase.repair.management| +|--------|---------------------------| +|Business goals| System behavior is controllable by C4 but normally automatic. Will be reported by C4, some parameters are somewhat tunable (in the advanced-advanced-advanced box).| +|Relevant quality attributes| Observability, manageability| +|Stimulus| A control request to repair from a management tool| +|Stimulus source |Management tool| +|Environment| An array in degraded mode| +|Artifact |Management tool can request repair cancellation, and change to repair IO priority.| +|Response |Repair executes control requests. Additionally, repair notifies management tools about state changes: start, stop, double failure, ETA.| +|Response measure| Control requests are handled properly, correctly, and timely. Repair status and events are reported to C4 properly, correctly, and timely. This is achieved by the commander handler in the SNS repair manager and its call-backs to C4.| +|Questions and issues| | + +|Scenario| usecase.repair.migration| +|--------|--------------------------| +|Business goals | | +|Relevant quality attributes| reusability| +|Stimulus| Migration of a file set from one pool to another.| +|Stimulus source| Administrative action or policy decision (For example, space re-balancing).| +|Environment| Normal system operation| +|Artifact| A process to migrate data from the pool starts| +|Response |
                  • Migrate data according to its policy correctly, under the limitation of resource usage.
                  • Data migration re-uses algorithms and data structures of repair| +|Response measure|
                    • Data is migrated correctly.
                     • A ratio of code shared between migration and repair. These are achieved by:
                    • Using the same components and algorithm with repair, but with different integration.| +|Questions and issues | + +|Scenario |usecase.repair.replication | +|---------|---------------------------| +|Business goals| | +|Relevant quality attributes| Reusability| +|Stimulus |Replication of a file set from one pool to another.| +|Stimulus source| Administrative action or policy decision.| +|Environment| Normal file system operation.| +|Artifact |A process to replicate data from the pool.| +|Response| Data replication reuses algorithms and data structures of repair.| +|Response measure |A ratio of code shared between replication and repair. This is achieved by using the same components and algorithm with repair, but with different integration.| +|Questions and issues| + + +|Scenario| usecase.repair.resurrection (optional)| +|--------|----------------------------------------| +|Business goals | +|Relevant quality attributes| fault tolerance| +|Stimulus |A failed storage unit or data server comes back into service.| +|Stimulus source| | +|Environment| A storage pool in a degraded mode.| +|Artifact |Repair detects that reconstructed data are back online.| +|Response |Depending on the fraction of data already reconstructed various policies can be selected:
                      • Abort the repair and copy all data modified since reconstruction back to the original, and restore the layout to its original one.
                      • Abandon original data and continue the repair.| +|Response measure| Less time and resources should be used, regardless to continue the repair or not.| +|Questions and issues |If we choose to restore layouts to their original state, it is a potentially lengthy process (definitely not atomic) and additional failures can happen while it is in progress. It requires scanning already processed layouts and reverting them to their original form, freeing spare space. This is further complicated by the possibility of client IO modifying data stored in the spare space before roll-back starts. So, the resurrection will be marked as an "optional" feature to be implemented later.| + + +|Scenario| usecase.repair.local| +|--------|----------------------| +|Business goals | | +|Relevant quality attributes| Resource usage| +|Stimulus |Local RAID repair starts| +|Stimulus source| Storage unit failure on a node| +|Environment |Motr node with a failed storage unit| +|Artifact |Local RAID repair| +|Response |Local RAID repair uses a copy machine buffer pool to exchange data. No network traffic is needed.| +|Response measure| No network traffic in the repair. This is achieved by running a storage-in agent, collecting an agent, storing out the agent on the same node, and exchanging data through a buffer pool.| +|Questions and issues| | + + +|Scenario |usecase.repair.ADDB| +|---------|-------------------| +|Business goals |Better diagnostics| +|Relevant quality attributes| ADDB| +|Stimulus| Repair| +|Stimulus source| SNS repair| +|Environment| Running SNS repair in Motr | +|Artifact| Diagnostic information are logged in ADDB.| +|Response| SNS repair log status, state transition, progress, etc. in ADDB for better diagnostic in the future.| +|Response measure| The amount of ADDB records is useful for later diagnostic. This is achieved by integrating ADDB infrastructure tightly into SNS repair and producing ADDB records correctly, efficiently, and timely.| +|Questions and issues| | + + +|Scenario |usecase.repair.persistency| +|---------|--------------------------| +|Business goals| Availability| +|Relevant quality attributes| Local and distributed transactions| +|Stimulus |SNS| +|Stimulus source| SNS| +|Environment |Node in Motr| +|Artifact| State machines survive node failures.| +|Response| State machines use the services of local and distributed transaction managers to recover from node failures. After a restart, the persistent state machine receives a restart event, that it can use to recover its lost volatile state.| +|Response measure| State machines survive node failures. This is achieved by using replicated state machine mechanism.| +|Questions and issues| | + + +|Scenario| usecase.repair.priority| +|--------|-------------------------| +|Business goals| availability| +|Relevant quality attributes| container| +|Stimulus| SNS repair initialization| +|Stimulus source| Pool machine| +|Environment |a running SNS repair| +|Artifact| Priority of SNS repair assigned.| +|Response |A priority is set for every container included in the input set, and the SNS repair will be executed by this priority.| +|Response measure| SNS repairs are executed with higher priority first. This is achieved by looping from the highest priority to the lowest one to initiate new repairs.| +|Questions and issues| + +### 8.2. Failures +See Logical specification and State section for failure handling and Security model sub-section for Byzantine failures. + +## 9.1. 
Scalability
+The major input parameters affecting SNS repair behavior are:
+* Number of storage devices attached to a server
+* Number of servers in the pool
+* Storage device bandwidth
+* Storage device capacity
+* Storage device space utilization
+* Server-to-server network bandwidth
+* Processor bandwidth per server
+* Frequency and statistical distribution of client IO to the pool
+* Frequency and distribution of permanent device failures
+* Frequency and distribution of permanent server failures
+* Frequency and distribution of transient server failures (restarts)
+* Mean time to replace a storage device
+* Mean time to replace a server
+* Fraction of storage bandwidth used by repair
+* Fraction of storage bandwidth used by rebalancing
+* Fraction of network bandwidth used by repair
+* Fraction of network bandwidth used by rebalancing
+* Degradation in visible client IO rates
+* Pool striping pattern: (N+K)/G
+
+The major SNS repair behavior metrics affected by the above parameters are:
+* Mean time to repair a storage device
+* Mean time to repair a server
+* Mean time to re-balance to a new storage device
+* Mean time to re-balance to a new server
+
+To keep this section reasonably short, a number of simplifying assumptions, some of which can easily be lifted, are made:
+* A pool consists of ND devices attached to NS servers (the same number of devices on each server)
+* Every cluster-wide object is N+K striped across all servers and devices in the pool using parity de-clustering
+* Device size is SD (bytes)
+* Average device utilization (a fraction of used device space) is U
+* Device bandwidth is BD (bytes/sec)
+* Server-to-server network bandwidth is BS (bytes/sec)
+* Server processor bandwidth is BP (defined as the rate at which RAID6 redundancy codes can be calculated, bytes/sec)
+* The fractions of the respective storage, network, and processor bandwidth dedicated to repair are AS, AN, and AP
+
+Let's consider the steady state of repair in a pool with FS failed servers and FD failed devices (FD includes all devices on failed servers), assuming at most one unit is lost in any parity group. Define GD, GS, GP, and GO as the rates (bytes/sec) at which repair reads data from a device, sends data from a given server, computes redundancy codes, and writes data to a device, respectively.
+
+Each of the NS - FS surviving servers has on average (ND - FD)/(NS - FS) devices attached, and from each of these, data are read at the rate GD. Assuming that none of these data are for "internal consumption" (that is, assuming that no parity group has a spare space unit on a server where it has data or parity units), servers send out all these data, giving
+
+![Image](./Images/Formula-1.PNG)
+
+Every server fans data out to every other surviving server. Hence, every server receives data at the same rate GS. The received data (again assuming no "internal consumption") are processed at the rate GP, giving
+
+GS = GP
+
+Redundancy code calculation produces a byte of output for every N bytes of input. Finally, the reconstructed data are uniformly distributed across all the devices of the server and written out, giving
+
+![Image](./Images/Formula-2.PNG)
+
+The steady-state rates are subject to the following constraints:
+
+![Image](./Images/Formula-3.PNG)
+
+To reconstruct a failed unit in a parity group, N of its N + K - 1 surviving units, scattered across ND - 1 devices, have to be read, meaning that to reconstruct a device, an N/(ND - 1) fraction of the used space on every device in the pool has to be read, giving
+
+![Image](./Images/Formula-4.PNG)
+
+as the mean time to repair a device. To minimize MTTRD, GD has to be maximized. From the equations and inequalities above, the maximal possible value of GD is obviously
+
+![Image](./Images/Formula-5.PNG)
+
+Let's substitute vaguely reasonable data:
+
+|Parameter| Value |Unit |Explanation|
+|---------|-------|-----|-----------|
+|U | 1.0 | | |
+|SD |2.0e12 | bytes| 2TB drive|
+|BS | 4.0e9 |bytes/sec| IB QDR|
+|BP |8.0e9 |bytes/sec| |
+|BD |7.0e7 |bytes/sec| |
+|AS |1 | | |
+|AN |1 | | |
+|AP |1 | | |
+
+For a small configuration with NS = 1, ND = 48:
+* 4+2 striping: GD is 56MB/sec, MTTRD is 3800 sec;
+* 1+1 striping (mirroring): GD is 35MB/sec, MTTRD is 1215 sec.
+
+For a larger configuration with NS = 10, ND = 480:
+* 10+3 striping: GD is 64MB/sec, MTTRD is 787 sec;
+* 1+1 striping: GD is 35MB/sec, MTTRD is 119 sec.
+
+In all cases, storage IO is the bottleneck.
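+
+As a sanity check, the following small illustrative program recomputes these figures from the assumptions above (with U = AS = AN = AP = 1); it is not part of the design. One observation it encodes: the quoted MTTRD values correspond to reading all N + K - 1 surviving units of each parity group from the ND - 1 surviving devices.
+
+```c
+#include <stdio.h>
+
+static void check(unsigned N, unsigned K, unsigned ND)
+{
+        const double SD = 2.0e12;   /* device size, bytes        */
+        const double BD = 7.0e7;    /* device bandwidth, bytes/s */
+
+        /* Repair reads (GD) and reconstructed writes (GD / N) share
+         * each surviving device's bandwidth: GD * (1 + 1/N) = BD.  */
+        double GD = BD * N / (N + 1.0);
+
+        /* Fraction of each surviving device read during the repair. */
+        double frac = (double)(N + K - 1) / (ND - 1);
+        double mttr = frac * SD / GD;
+
+        printf("%u+%u, ND=%u: GD = %.0f MB/s, MTTRD = %.0f s\n",
+               N, K, ND, GD / 1e6, mttr);
+}
+
+int main(void)
+{
+        check(4, 2, 48);    /* -> 56 MB/s, ~3800 s */
+        check(1, 1, 48);    /* -> 35 MB/s, ~1215 s */
+        check(10, 3, 480);  /* -> 64 MB/s, ~787 s  */
+        check(1, 1, 480);   /* -> 35 MB/s, ~119 s  */
+        return 0;
+}
+```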
+### 9.2. Rationale
+* Flow control. Should network-in agents drop incoming copy packets when a node runs out of resource limits?
+* Pipeline-based flow control.
+* A dedicated lock(-ahead) agent can be split out of the storage-in agent for uniformity.
+
+[1] [u.dlm.logical-locking]
+[2] [u.IO.EXTENT-LOCKING] ST
+[3] [u.IO.MD-LOCKING] ST
+[4] [u.dlm.transaction-based]
+[5] [u.quorum.consensus]
+[6] [u.quorum.read]
+[7] [u.health.interest]
+[8] [u.health.device]
+[9] [u.health.node]
+[10] [u.health.network]
+[11] [u.health.call-back]
+[12] [u.health.fault-tolerance]
+[13] [u.layout.intersects]
+[14] [u.layout.parameter.dead]
+[15] [u.layout.degraded-mode]
+[16] [u.LAYOUT.EXTENT] ST
+[17] [u.layout.lazy-invalidation]
+[18] [u.sns.trusted-client]
+[19] [u.SECURITY.FORMAL-MODEL] ST
+[20] [u.machine.queuing]
+[21] [u.container.streaming.bandwidth]
+[22] [u.rpc.streaming.bandwidth]
diff --git a/doc/HLD-of-SNS-client.md b/doc/HLD-of-SNS-client.md
new file mode 100644
index 00000000000..2d9d32b4db9
--- /dev/null
+++ b/doc/HLD-of-SNS-client.md
@@ -0,0 +1,185 @@
+# High-Level Design of an SNS Client Module for M0
+This document provides a High-Level Design **(HLD)** of an SNS Client Module for M0. The main purposes of this document are:
+* To be inspected by M0 architects and peer designers to ensure that the **HLD** is aligned with M0 architecture and other designs, and contains no defects.
+* To be a source of material for Active Reviews of Intermediate Design **(ARID)** and Detailed Level Design **(DLD)** of the same component.
+* To serve as a design reference document.
+
+## Introduction
+The SNS client component interacts with the Linux kernel VFS and VM to translate user application IO operation requests into IO FOPs sent to the servers according to the SNS layout.
+
+## Definitions
+The following terms are used to discuss and describe the SNS client:
+* An IO operation is a read or write call issued by a user space application (or any other entity invoking M0 client services, e.g., a loop block device driver or an NFS server).
+  Truncate is also, technically, an IO operation, but truncates are beyond the scope of this document;
+* network transfer is a process of transmitting operation code and operation parameters, potentially including data pages, to the corresponding data server, and waiting for the server's reply. Network transfer is implemented as a synchronous call to the M0 rpc service;
+* an IO operation consists of (IO) updates, each accessing or modifying file system state on a single server. Updates of the same IO operation are siblings.
+
+## Requirements
+* `[r.sns-client.nocache]`: the Motr client does not cache data because:
+  * this simplifies its structure and implementation;
+  * Motr exports a block device interface, and block device users, including file systems, do caching internally.
+* `[r.sns-client.norecovery]`: recovery from node or network failure is beyond the scope of the present design;
+* `[R.M0.IO.DIRECT]`: direct-IO is supported.
+
+For completeness and to avoid confusion, below are listed non-requirements (i.e., things that are out of the scope of this specification):
+* resource management, including distributed cache coherency and locking;
+* distributed transactions;
+* recovery, replay, resend;
+* advanced layout functionality:
+  * layout formulae;
+  * layouts on file extents;
+  * layout revocation;
+* meta-data operations, including file creation and deletion;
+* security of IO;
+* data integrity, fsck;
+* availability, NBA.
+
+## Design Highlights
+The design of the M0 client will eventually conform to the Conceptual Design [0]. As indicated in the Requirements section, only a small part of the complete conceptual design is considered at the moment. Yet the structure of the client code is chosen so that it can be extended in the future to conform to the conceptual design.
+
+## Functional Specification
+External SNS client interfaces are the standard Linux file_operations and address_space_operations methods. VFS and VM entry points create fops describing the operation to be performed. The fop is forwarded to the request handler. At the present moment, request handler functionality on a client is trivial; in the future, the request handler will be expanded with generic functionality, see the Request Handler HLD [2]. A fop passes through a sequence of state transitions, guided by the availability of resources (such as memory and, in the future, locks) and by the layout IO engine, and is staged in the fop cache until rpcs are formed. The current rpc formation logic is very simple due to the no-cache requirement. Rpcs are sent to their respective destinations. Once replies are received, the fop is destroyed and results are returned to the caller.
+
+## Logical Specification
+
+### fop builder, NRS, and request handler
+A fop, representing an IO operation, is created at the VFS or VM entry point[1]. The fop is then passed to the dummy NRS, which immediately passes it down to the request handler. The request handler uses file meta-data to identify the layout and calls the layout IO engine to proceed with the IO operation.
+
+### Layout Schema
+The layout formula generates a parity de-clustered file layout for a particular file, using the file id (fid) as an identifier[2]. See the Parity De-clustering Algorithm HLD [3] for details. At the moment, **m0t1fs** supports a single file with the fid supplied as a mount option.
+
+### Layout IO Engine
+The layout IO engine takes as input a layout and a fop (including operation parameters such as file offsets and user data pages).
+It creates sub-fops[3] for individual updates using the layout for data[4], based on pool objects[5]. Sub-fops corresponding to the parity units reference temporarily allocated pages[6], released (under the no-cache policy) when processing completes.
+
+### RPC
+The following caching policy is used: fops are accumulated[7] in the staging area while the IO operation is being processed by the request handler and the layout IO engine. Once the operation is processed, the staging area is unplugged[8], fops are converted into rpcs[9], and the rpcs are transferred to their respective target servers. If the IO operation extent is larger than the parity group, multiple sibling updates on a given target server are batched together[10].
+
+### Conformance
+* [1] `[u.fop]` ST
+* [2] `[u.layout.parametrized]` ST
+* [3] `[u.fop.sns]` ST
+* [4] `[u.layout.data]` ST
+* [5] `[u.layout.pools]` ST
+* [6] `[u.lib.allocate-page]`
+* [7] `[u.fop.cache.add]`
+* [8] `[u.fop.cache.unplug]`
+* [9] `[u.fop.rpc.to]`
+* [10] `[u.fop.batching]` ST
+* `[r.sns-client.nocache]`: holds per the caching policy described in the RPC sub-section;
+* `[r.sns-client.norecovery]`: holds obviously;
+* `[r.m0.io.direct]`: no-caching and zero-copy for data together implement **direct-IO**.
+
+### Dependencies
+* layout:
+  * `[u.layout.sns] ST`: server network striping can be expressed as a layout
+  * `[u.layout.data] ST`: layouts for data are supported
+  * `[u.layout.pools] ST`: layouts use server and device pools for object allocation, location, and identification
+  * `[u.layout.parametrized] ST`: layouts have parameters that are substituted to perform actual mapping
+
+* fop:
+  * `[u.fop] ST`: M0 uses File Operation Packets (FOPs)
+  * `[u.fop.rpc.to]`: a fop can be serialized into an rpc
+  * `[u.fop.nrs] ST`: FOPs can be used by NRS
+  * `[u.fop.sns] ST`: FOP supports SNS
+  * `[u.fop.batching] ST`: FOPs can be batched in a batch-FOP
+  * `[u.fop.cache.put]`: fops can be cached
+  * `[u.fop.cache.unplug]`: the fop cache can be de-staged forcibly
+
+* NRS:
+  * `[u.nrs] ST`: the Network Request Scheduler optimizes processing order globally
+
+* misc:
+  * `[u.io.sns] ST`: server network striping is supported
+  * `[u.lib.allocate-page]`: a page allocation interface is present in the library
+
+
+### Security Model
+Security is outside of the scope of the present design.
+
+
+### Refinement
+The detailed level design specification should address the following:
+* concurrency control and liveness rules for fops and layouts;
+* data structures for mapping between a layout and the target objects in the pool;
+* instantiation of a layout formula;
+* the relationship between a fop and its sub-fops: concurrency control, liveness, ownership.
+
+
+## State
+State diagrams are part of the detailed level design specification.
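+As an illustration only, the passage of a fop through the client, as described in the functional specification above, suggests the state progression sketched below; the names are hypothetical, and the authoritative state diagrams belong to the DLD.
+
+```
+/*
+ * Illustrative only: a possible enumeration of the client fop states
+ * implied by the functional specification; the names are hypothetical.
+ */
+enum snsc_fop_state {
+        FOP_CREATED,   /* built at the VFS/VM entry point */
+        FOP_STAGED,    /* held in the fop cache until rpcs are formed */
+        FOP_SENT,      /* rpcs transferred to the target servers */
+        FOP_REPLIED,   /* all sibling replies received */
+        FOP_DESTROYED  /* results returned to the caller */
+};
+```
+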
+## Use Cases
+### Scenarios
+
+|Scenario | Description |
+|---------|-------------------------|
+|Scenario |[usecase.sns-client-read]|
+|Relevant quality attributes| usability|
+|Stimulus| an incoming read operation request from a user-space application|
+|Stimulus source| a user-space application, potentially mediated by a loop-back device driver|
+|Environment| normal client operation|
+|Artifact| call to VFS ->read() entry point|
+|Response |a fop is created, network transmission of operation parameters to all involved data servers is started as specified by the file layout, servers place retrieved data directly in user buffers, and once transmission completes, the fop is destroyed.|
+|Response measure |no data copying in the process|
+
+|Scenario | Description |
+|---------|-------------------------|
+|Scenario |[usecase.sns-client-write]|
+|Relevant quality attributes| usability|
+|Stimulus| an incoming write operation request from a user-space application|
+|Stimulus source |a user-space application, potentially mediated by a loop-back device driver|
+|Environment | normal client operation|
+|Artifact |call to VFS ->write() entry point|
+|Response |a fop is created, network transmission of operation parameters to all involved data servers is started as specified by the file layout, data are transmitted directly from user buffers, and once transmission completes, the fop is destroyed.|
+|Response measure |no data copying in the process|
+
+
+## Analysis
+### Scalability
+No scalability issues are expected in this component. Relatively few resources (processor cycles, memory) are consumed per byte of processed data. With a large number of concurrent IO operation requests, the scalability of the layout, pool, and fop data structures might become a bottleneck (initially, in the case of small-file IO).
+
+## Deployment
+### Compatibility
+
+#### Network
+No issues at this point.
+
+#### Persistent storage
+The design is not concerned with persistent storage manipulation.
+
+#### Core
+No issues at this point. No additional external interfaces are introduced.
+
+### Installation
+The SNS client module is a part of the m0t1fs.ko kernel module and requires no additional installation. System testing scripts in **m0t1fs/st** must be updated.
+
+## References
+* [0] Outline of M0 core conceptual design
+* [1] Summary requirements table
+* [2] Request Handler HLD
+* [3] Parity De-clustering Algorithm HLD
+* [4] High level design inspection trail of SNS client
+* [5] SNS server HLD
diff --git a/doc/HLD-of-distributed-indexing.md b/doc/HLD-of-distributed-indexing.md
new file mode 100644
index 00000000000..11c38500013
--- /dev/null
+++ b/doc/HLD-of-distributed-indexing.md
@@ -0,0 +1,190 @@
+# High level design of the distributed indexing
+This document presents a high level design **(HLD)** of the Motr distributed indexing.
+The main purposes of this document are:
+1. To be inspected by the Motr architects and peer designers to ascertain that the high level design is aligned with Motr architecture and other designs, and contains no defects.
+2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component.
+3. To serve as a design reference document.
+
+The intended audience of this document consists of Motr customers, architects, designers, and developers.
+
+## Introduction
+Distributed indexing is a Motr component that provides key-value indices distributed over multiple storage devices and network nodes in the cluster for performance, scalability, and fault tolerance.
+
+Distributed indices are exported via the Motr interface and provide applications with a method to store application-level meta-data. Distributed indices are also used internally by Motr to store certain internal meta-data. Distributed indexing is implemented on top of the "non-distributed indexing" provided by the catalogue service (cas).
+
+## Definitions
+- definitions from the high level design of the catalogue service are included: catalogue, identifier, record, key, value, key order, user;
+- definitions from the high level design of the parity de-clustered algorithm are included: pool, P, N, K, spare slot, failure vector;
+- a distributed index, or simply index, is an ordered container of key-value records;
+- a component catalogue of a distributed index is a non-distributed catalogue of key-value records, provided by the catalogue service, in which the records of the distributed index are stored.
+
+## Requirements
+- `[r.idx.entity]`: an index is a Motr entity with a fid. There is a fid type for distributed indices.
+- `[r.idx.layout]`: an index has a layout attribute, which determines how the index is stored in non-distributed catalogues.
+- `[r.idx.pdclust]`: an index is stored according to a parity de-clustered layout with N = 1, i.e., some form of replication. The existing parity de-clustered code is re-used.
+- `[r.idx.hash]`: partition of index records into parity groups is done via key hashing. The hash of a key determines the parity group (in the sense of the parity de-clustered layout algorithm) and, therefore, the location of all replicas and spare spaces.
+- `[r.idx.hash-tune]`: the layout of an index can specify one of the pre-determined hash functions and specify the part of the key used as the input for the hash function. This provides an application with some degree of control over locality.
+- `[r.idx.cas]`: indices are built on top of catalogues and use an appropriately extended catalogue service.
+- `[r.idx.repair]`: the distributed indexing sub-system has a mechanism of background repair. In case of a permanent storage failure, index repair restores redundancy by generating more replicas in spare space.
+- `[r.idx.re-balance]`: the distributed indexing sub-system has a mechanism of background re-balance. When a replacement hardware element (a device, a node, a rack) is added to the system, re-balance copies appropriate replicas from the spare space to the replacement unit.
+- `[r.idx.repair.reuse]`: index repair and re-balance, if possible, are built on top of the copy machine abstraction used by the SNS repair and re-balance;
+- `[r.idx.degraded-mode]`: access to indices is possible during repair and re-balance.
+- `[r.idx.root-index]`: a root index is provided, which has a known built-in layout and, hence, can be accessed without learning its layout first. The root index is stored in a pre-determined pool, specified in the configuration database. The root index contains a small number of global records.
+- `[r.idx.layout-index]`: a layout index is provided, which contains (key: index-fid, value: index-layout-id) records for all indices except the root index, the layout index itself, and other indices mentioned in the root index. The layout of the layout index is stored in the root index. Multiple indices can use the same layout-id.
+- `[r.idx.layout-descr]`: a layout descriptor index is provided, which contains (key: index-layout-id, value: index-layout-descriptor) records for all indices.
+
+Relevant requirements from the Motr Summary Requirements table:
+- `[r.m0.cmd]`: clustered meta-data are supported;
+- `[r.m0.layout.layid]`: a layout is uniquely identified by a layout id (layid).
+- `[r.m0.layout.meta-data]`: layouts for meta-data are supported.
+
+## Design Highlights
+An index is stored in a collection of catalogues, referred to as component catalogues (similarly to a component object, cob), distributed across the pool according to the index layout. Individual component catalogues are either created during an explicit index creation operation or created lazily on the first access.
+
+To access the index record with a known key, the hash of the key is calculated and used as the data unit index input of the parity de-clustered layout algorithm. The algorithm outputs the locations of the N+K component catalogues where the replicas of the record are located, and of the S component catalogues that hold spare space for the record. Each component catalogue stores a subset of the records of the index without any transformation of keys or values.
+
+Iteration through an index from a given starting key is implemented by querying all component catalogues about records following the key and merge-sorting the results. This requires updating the catalogue service to correctly handle the NEXT operation with a non-existent starting key.
+
+A new fid type is registered for index fids.
+
+
+## Functional Specification
+Indices are available through the Motr interface. The Spiel and HA interfaces are extended to control repair and re-balance of indices.
+
+## Logical Specification
+### Index Layouts
+Index layouts are based on the N+K+S parity de-clustered layout algorithm, with the following modifications:
+- N = 1. The layout provides (K+1)-way replication.
+- parity de-clustered layouts for data objects come with unit size as a parameter. The unit size is used to calculate the parity group number, which is an essential input to the parity de-clustered layout function. For indices there is no natural way to partition the key-space into units, so the implementation should provide some flexibility to select a suitable partitioning. One possible (but not mandated) design is to calculate a unit number by specifying an identity mask within a key:
+- an identity mask is a sequence of ranges of bit positions in the index key (keys are considered as bit-strings): [S0, E0], [S1, E1], ..., [Sm, Em], where Si and Ei are bit-offsets counted from 0. The ranges can be empty, overlapping, and are not necessarily monotone offset-wise;
+
+- given a key bit-string X, calculate its seed as
+  - seed = `X[S0, E0] :: X[S1, E1] :: ... :: X[Sm, Em]`
+  where :: is the bit-string concatenation.
+
+
+- if the layout is hashed (a Boolean parameter), then the key belongs to the parity group hash(seed), where the hash is some fixed hash function; otherwise (not hashed), the parity group number equals the seed, which must not exceed 64 bits;
+
+- the intention is that if it is known that some parts of the keys of a particular index have good statistical properties, e.g., are generated as a sequential counter, these parts of the key can be included in the identity mask of a non-hashed layout. In addition, some parts of a key can be omitted from the identity mask to improve the locality of reference, so that "close" keys are stored in the same component catalogue, increasing the possibility of network aggregation.
+  Note that a user can always use a hash function tailored for a particular index by appending arbitrary hash values to the keys.
+
+A few special cases require special mention:
+
+- a redundant, but not striped, layout is a layout with an empty identity mask. In an index with such a layout, all records belong to the same parity group. As a result, the index is stored in (K+1) component catalogues. The location of the next record is known in advance, and iteration through the index can be implemented without broadcasting to all component catalogues. The layout provides fault tolerance but doesn't provide full scalability within a single index; specifically, the total size of an index is bounded by the size of the storage controlled by a single catalogue service.
+  Note, however, that different indices with the same layout will use different sets of services;
+
+- a fully hashed layout is a layout with an infinite identity mask [0, +∞] and with a "hashed" attribute true. Records of an index with such a layout are uniformly distributed across the entire pool. This layout is the default layout for "generic" indices.
+
+- fid index layout. It is expected that there will be many indices using fids as keys. The default hash function should work effectively in this case. Similarly for the case of an index where a 64-bit unsigned integer is used as a key.
+
+- lingua franca layout is the layout type optimized for storing lingua franca namespace indices, by hashing the filename part of the key and omitting attributes from the hash.
+
+A layout descriptor is the set of parameters necessary to do index operations. A layout descriptor consists of:
+
+- the storage pool version fid. Component catalogues of an index using the layout are stored in the pool version. The pool version object is stored in confc and contains, among other attributes, N, K, and P;
+- the identity mask, as described above.
+- the hashed flag, as described above (or the identifier of a hash function to use for this layout, if multiple hash functions are supported).
+- for uniformity, layout descriptors are also defined for catalogues (i.e., non-distributed indices). A catalogue layout descriptor consists of the fid of the service hosting the catalogue.
+
+Typically a layout descriptor will be shared by a large number of indices. To reduce the amount of meta-data, a level of indirection is introduced, see the Internal meta-data sub-section below.
+
+The in-memory representation of a Motr index includes the index fid and the index layout descriptor.
+
+### Internal meta-data
+The root index is intended to be a small index containing a small number of rarely updated global meta-data. As the root index is small and rarely updated, it can be stored in a highly replicated default pool (specified in confd) that can remain unchanged as the system configuration changes over time.
+
+The layout and layout-descr indices collectively provide layouts to indices. The separation between layout and layout-descr allows layout descriptors to be shared between indices. A record in the layout index can contain as a value either a layout identifier (the usual case) or a full layout descriptor (a special case). Because the layout-descr and especially the layout indices can grow very large, it is not possible to store them once and for all in the original default pool. Instead, the layout descriptors of the layout and layout-descr indices are stored in the root index. When the system grows, the layout index can be migrated to a larger pool and its layout descriptor in the root index updated.
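+
+As an illustration of the two preceding sub-sections, a possible in-memory shape of a layout descriptor is sketched below. All struct and field names here are hypothetical, not taken from the Motr sources; only the set of fields follows the layout descriptor definition above.
+
+```
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Placeholder for Motr's fid type (defined in fid/fid.h). */
+struct m0_fid {
+        uint64_t f_container;
+        uint64_t f_key;
+};
+
+/* One [Si, Ei] range of the identity mask; bit-offsets counted from 0. */
+struct im_range {
+        uint32_t ir_start;
+        uint32_t ir_end;
+};
+
+/* Illustrative in-memory layout descriptor; field names are hypothetical. */
+struct idx_layout_descr {
+        struct m0_fid    ild_pver;    /* pool version fid: supplies N, K, P */
+        struct im_range *ild_mask;    /* identity mask [S0,E0], ..., [Sm,Em] */
+        uint32_t         ild_mask_nr; /* number of ranges in the mask */
+        bool             ild_hashed;  /* hash the seed, or use it directly */
+};
+```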
+
+A catalogue-index is a local (non-distributed) catalogue maintained by the index sub-system on each node in the pool. When a component catalogue is created for a distributed index, a record mapping the catalogue to the index is inserted in the catalogue-index. This record is used by index repair and re-balance to find the locations of other replicas.
+
+### Client
+Initialization
+- find the default index pool in confc;
+- construct the root index layout descriptor;
+- fetch the layout and layout-descr layouts from the root index.
+
+Index creation
+- construct the layout descriptor;
+- cas-client: send CREATE to all cas services holding the component catalogues.
+
+
+Index open
+- cas-client: look up the index fid in the layout index, getting a layout-id or a layout descriptor;
+- if an identifier was obtained, look up the descriptor in layout-descr.
+
+Index operation (get, put, next)
+- use the layout descriptor to find the component catalogues;
+- cas-client: operate on the component catalogues.
+
+Operation concurrent with repair or re-balance
+- use spare components;
+- for PUT, use the overwrite flag (see below) when updating the spare;
+- for re-balance, update correct replicas, spares, and the re-balance target (use the overwrite flag);
+- for DEL, delete from spares, the re-balance target, and correct replicas;
+- DEL is done in 2 phases:
+  - use cas-client to update the correct replicas, get a reply;
+  - use cas-client to update the spares and the re-balance target.
+This avoids a possible race, where repair sends the old value to the spares concurrently with a client update.
+
+
+### Service
+The catalogue service (cas) implementation is extended in the following ways:
+- a record is inserted in the meta-index when a component catalogue is created. The key is the catalogue fid, the value is (tree, index-fid, pos, layout-descr), where
+  - tree is the b-tree, as for a catalogue,
+  - index-fid is the fid of the index this catalogue is a component of,
+  - pos is the position of the catalogue within the index layout, from 0 to P;
+  - layout-descr is the layout descriptor of the index;
+  - values in the meta-index can be distinguished by their size;
+- when a catalogue with the fid cat-fid is created as a component of an index with the fid idx-fid, the record (key: idx-fid, val: cat-fid) is inserted in the catalogue-index;
+- the NEXT operation accepts a flag parameter (slant), which allows iteration to start with the smallest key following the start key;
+- the PUT operation accepts a flag (create) instructing it to be a no-op if a record with the given key already exists;
+- the PUT operation accepts a flag (overwrite) instructing it to silently overwrite an existing record with the same key, if any;
+- before executing operations on component catalogues, cas checks that the index fid and layout descriptor supplied in the fop match the contents of the meta-index record.
+
+### Repair
+The index repair service is started along with every cas, similarly to the SNS repair service being started along with every ios.
+
+When index repair is activated (by Halon, using spiel), the index repair service goes through the catalogue-index in index fid order. For each index, repair fetches the layout descriptor from the meta-index, uses it to calculate the spare space location, and invokes cas-client to do the copy. The copy should be done with the create flag to preserve updates to spares made by the clients.
+
+
+### Re-balance
+Similar to repair.
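+
+To tie the preceding sub-sections together, the path from a key to the component catalogues holding its record, per the Design Highlights and Index Layouts sections, can be sketched as follows. All helper functions here are hypothetical names, not the actual Motr implementation.
+
+```
+#include <stdbool.h>
+#include <stdint.h>
+
+struct m0_buf;           /* key bytes */
+struct m0_fid;           /* catalogue fids */
+struct idx_layout_descr; /* as sketched in the Index Layouts section */
+
+/* Hypothetical helpers, illustrative only. */
+extern uint64_t seed_of(const struct idx_layout_descr *ld,
+                        const struct m0_buf *key); /* apply identity mask */
+extern bool     layout_is_hashed(const struct idx_layout_descr *ld);
+extern uint64_t hash64(uint64_t seed);             /* fixed hash function */
+extern void     pdclust_locate(const struct idx_layout_descr *ld,
+                               uint64_t group, struct m0_fid *cats,
+                               uint32_t nr);
+
+/*
+ * Computes the fids of the N + K + S component catalogues (N = 1 for
+ * indices) holding the replicas and spare space of the record with the
+ * given key: the hash of the identity-mask seed (or the seed itself,
+ * for non-hashed layouts) selects the parity group, and the parity
+ * de-clustered layout maps the group to the component catalogues.
+ */
+void dix_targets(const struct idx_layout_descr *ld, const struct m0_buf *key,
+                 struct m0_fid *cats, uint32_t nr)
+{
+        uint64_t seed  = seed_of(ld, key);
+        uint64_t group = layout_is_hashed(ld) ? hash64(seed) : seed;
+
+        pdclust_locate(ld, group, cats, nr);
+}
+```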
+
+## Dependencies
+- cas: add a "flags" field to the cas record structure, with the following bits:
+  - slant: allows the NEXT operation to start with a non-existent key;
+  - overwrite: the PUT operation discards the existing value of the key;
+  - create: the PUT operation is a successful no-op if the key already exists;
+- conf: assign devices to cas services (necessary to control repair and re-balance)
+- spiel: support for repair and re-balance of indices
+- halon: interact with catalogue services (similarly to io services)
+- halon: introduce the "global mkfs" phase of cluster initialization, use it to call a script to create global meta-data.
+
+### Security model
+None at the moment. The security model should be designed for all storage objects together.
+
+## Implementation plan
+- implement basic pdclust math, hashing, identity mask;
+- implement layout descriptors in memory;
+- implement a subset of motr sufficient to access the root index, i.e., without
+  - fetching layouts from the network
+  - the catalogue-index
+- add unit tests, working with the root index
+- implement the layout index and the layout-descr index
+- more UT
+- system tests?
+- implement the catalogue-index
+- modify the conf schema to record devices used by cas
+  - deployment?
+- implement index repair
+- implement index re-balance
+- halon controls for repair and re-balance
+- system tests with halon, repair, and re-balance
+- implement concurrent repair, re-balance, and index access
+- system tests (acceptance, S3)
+
+## References
+- HLD of catalogue service
+- HLD of parity de-clustered algorithm
+- HLD of SNS repair
diff --git a/doc/ISC-Service-User-Guide b/doc/ISC-Service-User-Guide
new file mode 100644
index 00000000000..e78000ad3b5
--- /dev/null
+++ b/doc/ISC-Service-User-Guide
@@ -0,0 +1,358 @@
+# ISC User Guide
+This is the ISC user guide.
+## Preparing Library
+APIs from an external library cannot be linked directly with a Motr instance. A library is supposed to have a function named motr_lib_init(), which links the relevant APIs with Motr. Every function to be linked with Motr shall conform to the following signature:
+
+```
+int comp(struct m0_buf *args, struct m0_buf *out,
+         struct m0_isc_comp_private *comp_data, int *rc)
+```
+
+All relevant library APIs shall be prepared with a wrapper conforming to this signature. Let libarray be the library we intend to link with Motr, with the following APIs: arr_max(), arr_min(), arr_histo().
+
+### Registering APIs
+motr_lib_init() links all the APIs. Here is example code (please see iscservice/isc.h for more details):
+
+```
+void motr_lib_init(void)
+{
+        rc = m0_isc_comp_register(arr_max, "max",
+                                  string_to_fid("arr_max"));
+        if (rc != 0)
+                error_handle(rc);
+
+        rc = m0_isc_comp_register(arr_min, "min",
+                                  string_to_fid("arr_min"));
+        if (rc != 0)
+                error_handle(rc);
+
+        rc = m0_isc_comp_register(arr_histo, "arr_histo",
+                                  string_to_fid("arr_histo"));
+        if (rc != 0)
+                error_handle(rc);
+}
+```
+
+## Registering Library
+Let libpath be the path where the library is located. The program needs to load it on each Motr node. This is done using:
+
+```
+int m0_spiel_process_lib_load(struct m0_spiel *spiel,
+                              struct m0_fid *proc_fid,
+                              char *libpath)
+```
+This ensures that motr_lib_init() is called to register the relevant APIs.
+
+## Invoking API
+Motr has its own RPC mechanism to invoke a remote operation.
+In order to conduct a computation on data stored with Motr it's necessary to share the computation's fid (a unique identifier associated with it during its registration) and the relevant input arguments. Motr uses the fop/fom framework to execute an RPC. A fop represents a request to invoke a remote operation, and it shall be populated with the relevant parameters by a client. A request is executed by a server using a fom. The fop for the ISC service is self-explanatory. The examples in the next subsection make it clearer.
+
+```
+/** A fop for the ISC service */
+struct m0_fop_isc {
+        /** An identifier of the computation registered with the
+         *  service.
+         */
+        struct m0_fid fi_comp_id;
+
+        /**
+         * An array holding the relevant arguments for the
+         * computation.
+         * This might involve gfid, cob fid, and a few other parameters
+         * relevant to the required computation.
+         */
+        struct m0_rpc_at_buf fi_args;
+
+        /**
+         * An rpc AT buffer requesting the output of the computation.
+         */
+        struct m0_rpc_at_buf fi_ret;
+
+        /** A cookie for fast searching of a computation. */
+        struct m0_cookie fi_comp_cookie;
+
+} M0_XCA_RECORD M0_XCA_DOMAIN(rpc);
+```
+
+## Examples
+**Hello-World**
+
+Consider a simple API that, on reception of the string "Hello", responds with "World" along with return code 0. For any other input it does not respond with any string, but returns an error code of -EINVAL. The client needs to send an m0_fop_isc populated with "Hello". First we will see how the client or caller needs to initialise certain structures and send them across. Subsequently we will see what needs to be done at the server side. The following code snippet illustrates how we can initialize m0_fop_isc.
+
+```
+/**
+ * Prerequisite: in_string is null terminated.
+ * isc_fop  : A fop to be populated.
+ * in_args  : Input to be shared with the ISC service.
+ * in_string: Input string.
+ * conn     : An rpc-connection to the ISC service. Should be established
+ *            beforehand.
+ */
+int isc_fop_init(struct m0_fop_isc *isc_fop, struct m0_buf *in_args,
+                 char *in_string, struct m0_rpc_conn *conn)
+{
+        int rc;
+
+        /* A string is mapped to a Mero buffer. */
+        m0_buf_init(in_args, in_string, strlen(in_string));
+        /* Initialise the RPC adaptive transmission data structure. */
+        m0_rpc_at_init(&isc_fop->fi_args);
+        /* Add the Mero buffer to m0_rpc_at. */
+        rc = m0_rpc_at_add(&isc_fop->fi_args, in_args, conn);
+        if (rc != 0)
+                return rc;
+
+        /* Initialise the return buffer. */
+        m0_rpc_at_init(&isc_fop->fi_ret);
+        rc = m0_rpc_at_recv(&isc_fop->fi_ret, conn, REPLY_SIZE, false);
+        if (rc != 0)
+                return rc;
+
+        return 0;
+}
+```
+
+Let's see how this fop is sent across to execute the required computation.
+
+
+```
+#include "iscservice/isc.h"
+#include "fop/fop.h"
+#include "rpc/rpclib.h"
+
+int isc_fop_send_sync(struct m0_fop_isc *isc_fop,
+                      struct m0_rpc_session *session)
+{
+        struct m0_fop          fop;
+        struct m0_fop         *reply_fop;
+        /* Holds the reply from a computation. */
+        struct m0_fop_isc_rep  reply;
+        struct m0_buf          recv_buf;
+        int                    rc;
+
+        M0_SET0(&fop);
+        m0_fop_init(&fop, &m0_fop_isc_fopt, isc_fop, m0_fop_release);
+        /*
+         * A blocking call that comes out only when a reply or an error in
+         * sending is received.
+         */
+        rc = m0_rpc_post_sync(&fop, session, NULL, M0_TIME_IMMEDIATELY);
+        if (rc != 0)
+                return error_handle();
+
+        /* Capture the reply from the computation. */
+        reply_fop = m0_rpc_item_to_fop(fop.f_item.ri_reply);
+        reply = *(struct m0_fop_isc_rep *)m0_fop_data(reply_fop);
+        /* Handle an error received during run-time. */
+        if (reply.fir_rc != 0)
+                return error_handle();
+
+        /* Obtain the result of the computation. */
+        rc = m0_rpc_at_rep_get(&isc_fop->fi_ret, &reply.fir_ret, &recv_buf);
+        if (rc != 0)
+                comp_error_handle(rc, &recv_buf);
+
+        if (strcmp(fetch_reply(&recv_buf), "World") != 0) {
+                comp_error_handle(rc, &recv_buf);
+        } else {
+                /* Process the reply. */
+                reply_handle(&recv_buf);
+                /* Finalize the relevant structures. */
+                m0_rpc_at_fini(&isc_fop->fi_args);
+                m0_rpc_at_fini(&reply.fir_ret);
+        }
+
+        return 0;
+}
+```
+
+We now discuss the callee-side code. Let's assume that the function is registered as "greetings" with the service.
+```
+void motr_lib_init(void)
+{
+        rc = m0_isc_comp_register(greetings, "hello-world",
+                                  string_to_fid("greetings"));
+        if (rc != 0)
+                error_handle(rc);
+}
+
+int greetings(struct m0_buf *in, struct m0_buf *out,
+              struct m0_isc_comp_private *comp_data, int *rc)
+{
+        char *out_str;
+
+        if (m0_buf_streq(in, "Hello")) {
+                /*
+                 * The string allocated here should not be freed by the
+                 * computation; Mero takes care of freeing it.
+                 */
+                out_str = m0_strdup("World");
+                if (out_str != NULL) {
+                        m0_buf_init(out, out_str, strlen(out_str));
+                        *rc = 0;
+                } else
+                        *rc = -ENOMEM;
+        } else
+                *rc = -EINVAL;
+
+        return M0_FSO_AGAIN;
+}
+```
+
+## Min/Max
+The Hello-World example sends across a string. In real applications the input can be a composition of multiple data types. It's necessary to serialise a composite data type into a buffer. Motr provides a mechanism to do so using xcode/xcode.[ch]. Any other serialization mechanism that's suitable and tested can also be used, e.g., Google's Protocol Buffers. But we have not tested any such external library for serialization, and hence this document uses Motr's xcode APIs.
+
+In this example we will see how to send a composite data type to a registered function. The declaration of an object that needs to be serialised shall be tagged with one of the types identified by xcode. Every member of this structure shall also be representable using an xcode type. Please refer to xcode/ut/ for different examples.
+
+Suppose we have a collection of arrays of integers, each stored as a Motr object. Our aim is to find out the min or max of the values stored across all arrays. The caller communicates the list of global fids (the unique identifiers of the stored objects in Motr) to the registered computation for min/max. The computation then returns the min or max of the locally (on the relevant node) stored values. The caller then takes the min or max of all the received values. The following structure can be used to communicate with the registered computation.
+
+```
+/* Arguments for getting min/max. */
+struct arr_fids {
+        /* Number of arrays stored with Mero. */
+        uint32_t       af_arr_nr;
+        /* An array holding the unique identifiers of the arrays. */
+        struct m0_fid *af_gfids;
+} M0_XCA_SEQUENCE;
+```
+
+Before sending the list of fids to identify the min/max, it's necessary to serialise it into a buffer, because it's a requirement of ISC that all computations take input in the form of a buffer. The following snippet illustrates the same.
+
+```
+int arr_to_buff(struct arr_fids *in_array, struct m0_buf *out_buf)
+{
+        int rc;
+
+        rc = m0_xcode_obj_enc_to_buf(XCODE_OBJ(arr_fids),
+                                     &out_buf->b_addr,
+                                     &out_buf->b_nob);
+        if (rc != 0)
+                error_handle(rc);
+
+        return rc;
+}
+```
+
+The output buffer out_buf can now be used with the RPC AT mechanism introduced in the previous subsection. On the receiver side, a computation can deserialize the buffer to convert it into the original structure.
+The following snippet demonstrates the same.
+
+```
+int buff_to_arr(struct m0_buf *in_buf, struct arr_fids *out_arr)
+{
+        int rc;
+
+        rc = m0_xcode_obj_dec_from_buf(XCODE_OBJ(arr_fids),
+                                       &in_buf->b_addr,
+                                       in_buf->b_nob);
+        if (rc != 0)
+                error_handle(rc);
+
+        return rc;
+}
+```
+
+Preparation and handling of a fop is similar to that in the Hello-World example. Once a computation is invoked, it reads each object's locally stored values and finds the min/max of the same, eventually finding the min/max across all arrays stored locally. In the next example we shall see how a computation involving an IO can be designed.
+
+## Histogram
+We now explore a more complex example where a computation involves an IO, and hence needs to wait for the completion of the IO. A user stores an object with Motr. This object holds a sequence of values. The size of the object, in terms of the number of values held, is known. The aim is to generate a histogram of the values stored. This is accomplished in two steps. In the first step, the user invokes a computation on the remote Motr servers, and each server generates a histogram of the values stored with it. In the second step, these histograms are communicated to the user, which adds them cumulatively to generate the final histogram. The following structure describes the list of arguments that will be communicated by a caller to the ISC service for generating a histogram. ISC is involved only in the first step.
+
+```
+/* Input for histogram generation. */
+struct histo_args {
+        /** Number of bins for the histogram. */
+        uint32_t      ha_bins_nr;
+
+        /** Maximum value. */
+        uint64_t      ha_max_val;
+
+        /** Minimum value. */
+        uint64_t      ha_min_val;
+
+        /** Global fid of the object stored with Mero. */
+        struct m0_fid ha_gob_fid;
+
+} M0_XCA_RECORD;
+```
+
+The array of values stored with Motr will be identified using a global id, represented here as ha_gob_fid. It has been assumed that the maximum and minimum values over the array are known or are made available by previous calls to arr_max() and arr_min().
+
+Here we discuss the API for generating a histogram of the values local to a node. The caller or client side shall populate the struct histo_args and send it across using m0_fop_isc.
+
+```
+/*
+ * It is advisable to structure a computation similarly to
+ * Motr foms. It returns M0_FSO_WAIT when it has to wait for
+ * an external event (n/w or disk I/O); else it returns
+ * M0_FSO_AGAIN. These two symbols are defined in Mero.
+ */
+int histo_generate(struct m0_buf *in, struct m0_buf *out,
+                   struct m0_isc_comp_private *comp_data,
+                   int *ret)
+{
+        struct histo_args   *args;
+        struct histogram    *histogram;
+        struct hist_partial *hist_partial;
+        uint32_t             disk_id;
+        uint32_t             nxt_disk;
+        int                  rc;
+        int                  phase;
+
+        phase = comp_phase_get(comp_data);
+
+        switch (phase) {
+        case COMP_INIT:
+                /*
+                 * Deserializes the input buffer into "struct histo_args"
+                 * and stores the same in comp_data.
+                 */
+                histo_args_fetch(in, out, comp_data);
+                rc = args_sanity_check(comp_data);
+                if (rc != 0) {
+                        private_data_cleanup(comp_data);
+                        *ret = rc;
+                        return M0_FSO_AGAIN;
+                }
+                comp_phase_set(comp_data, COMP_IO);
+
+        case COMP_IO:
+                disk_id = disk_id_fetch(comp_data);
+                /*
+                 * This will make the fom (comp_data->icp_fom) wait
+                 * on a Motr channel which will be signalled on completion
+                 * of the IO event.
+                 */
+                rc = m0_ios_read_launch(gfid, disk_id, buf, offset, len,
+                                        comp_data->icp_fom);
+                if (rc != 0) {
+                        private_data_cleanup(comp_data);
+                        /* Computation is complete, with an error. */
+                        *ret = rc;
+                        return M0_FSO_AGAIN;
+                }
+                comp_phase_set(comp_data, COMP_EXEC);
+                /*
+                 * Wait for the IO to complete; returning M0_FSO_WAIT here
+                 * is necessary for the Motr instance to decide whether to
+                 * retry the fom once the IO event is signalled.
+                 */
+                return M0_FSO_WAIT;
+        }
+}
+```
diff --git a/doc/Motr-Epochs-HLD.md b/doc/Motr-Epochs-HLD.md
new file mode 100644
index 00000000000..db66d47fefa
--- /dev/null
+++ b/doc/Motr-Epochs-HLD.md
@@ -0,0 +1,87 @@
+# Motr Epochs HLD
+Motr services may, in general, depend on global state that changes over time. Some of this global state changes only as a result of failures, such as changing the striping formula across storage pools. Since these changes are failure driven, it is the HA subsystem that coordinates state transitions as well as broadcasts of state updates across the cluster.
+
+Because failures can happen concurrently, broadcasting state updates across the cluster can never be reliable. In general, some arbitrarily large portion of the cluster will be in a crashed state or otherwise unavailable during the broadcast of a new state. It is therefore unreasonable to expect that all nodes will share the same view of the global state at all times. For example, following a storage pool layout change, some nodes may still assume an old layout. Epochs are a Motr-specific mechanism to ensure that operations from one node to another are done with respect to a shared understanding of what the global state is. The idea is to assign a number (called the epoch number) to each global state (or item thereof), and communicate this number with every operation. A node should only accept operations from other nodes if the associated epoch number is the same as that of the current epoch of the node. In a sense, the epoch number acts as a proxy to the shared state, since epoch numbers uniquely identify states.
+
+This document describes the design of the HA subsystem in relation to coordinating epoch transitions and recovery of nodes that have not yet joined the latest epoch.
+
+## Definitions
+
+* Shared state item: a datum, copies of which are stored on multiple nodes in the cluster. While a set S of nodes is in a given epoch, they all agree on the value of the shared state.
+* State transition: a node is said to transition to a state when it mutates the value of a datum to match that state.
+* Epoch failure: a class of failures whose recovery entails a global state transition. It is safe to include all failures in this class, but in practice some failures might be excluded if doing so does not affect correctness but improves performance.
+* Epoch: the maximal interval in the system history through which no epoch failures are agreed upon. In terms of HA components, this means that an epoch is the interval of time spanning between two recovery sequences by the Recovery Coordinator, or between the last recovery sequence and infinity.
+* Epoch number: an epoch is identified by an epoch number, which is a 64-bit integer. A later epoch has a greater number. Epoch numbers are totally ordered in the usual way.
+* HA domain: an epoch is always an attribute of a domain. Each domain has its own epoch. Nodes can be members of multiple domains, so each node may be tracking not just one epoch but in fact multiple epochs. A domain also has a set of epoch handlers associated with it.
+
+## Requirements
+
+* [R.MOTR.EPOCH.BROADCAST] When the HA subsystem decides to transition to a new epoch, a message is broadcast to all nodes in the cluster to notify them of an epoch transition. This message may include (recovery) instructions to transition to a new shared state.
+* [R.MOTR.EPOCH.DOMAIN] There can be multiple concurrent HA domains in the cluster. Typically, there is one epoch domain associated with each Motr request handler.
+* [R.MOTR.EPOCH.MONOTONE] The epoch for any given domain on any given node changes monotonically. That is, a node can only transition to a later epoch in a domain, not an older one.
+* [R.MOTR.EPOCH.CATCH-UP] A node A that is told by another node B that a newer epoch exists, either in response to a message to B or because B sent a message to A mentioning a later epoch, can send a failure event to the HA subsystem to request instructions on how to reach the latest epoch.
+
+## Design Highlights
+* Upon transitioning to a new epoch, the HA subsystem broadcasts the new epoch number to the entire cluster.
+* The HA subsystem does not wait for acknowledgements from individual nodes that they have indeed transitioned to the new epoch. Aggregation of acknowledgements is therefore not required.
+* Communication from the HA subsystem to nodes is done using Cloud Haskell messages, to make communication with services uniform within the HA subsystem, and to circumscribe the Motr-specific parts of the epoch functionality to the services themselves.
+
+## Functional Specification
+
+It is the Recovery Coordinator (RC), a component of the HA subsystem, that maintains the authoritative answer as to which is the current epoch of any particular HA domain. Within an HA domain, failures which fall in the class of epoch failures warrant changing the epoch of that domain. This is done by the RC following receipt of an HA failure event.
+
+Because the epoch of a domain is global, the RC notifies the entire cluster of the fact that the domain has just transitioned to a new epoch, through a broadcast.
+
+The broadcast message contains instructions for the node to transition to a new state. A destination node processes the message and transitions to both the new state and the new epoch, as per the content of the message. Transitioning to both a new state and a new epoch is an atomic operation.
+
+Because only nodes that have transitioned to the new epoch also transition to the new state, it is safe for such nodes to communicate. Moreover, because nodes normally refuse to process RPC requests from other nodes unless the epoch number attached to the request matches the expected one, only nodes that are in the same epoch can communicate. A corollary of this fact is that nodes do not need to acknowledge to the RC that they have in fact transitioned to a new epoch. The RC can merely assume that they have, because even if a node fails to transition to the new epoch, it does not jeopardize the safety of the cluster. A node that fails to transition while all other nodes have is effectively quarantined, since it can no longer communicate in any meaningful way with any other node in the cluster.
+
+### Maximal Epoch Failure Events
+A node is expected to post an HA event to the HA subsystem when receiving an RPC request R in the following scenario, where e(R) denotes the epoch of the RPC request and e(self) the epoch of the current node:
+
+if e(R) ≠ e(self), then post an HA event reporting
+
+fst(max((e(R), src(R)), (e(self), self)))
+
+as late, where pairs are ordered lexicographically.
+
+If, in addition, src(R) expects a response, then the node should respond with a "wrong epoch" error.
+
+## Logical Specification
+
+![image](./Images/RC.PNG)
+
+In general, the RC interacts with services using a messaging protocol internal to the HA subsystem.
+This protocol is that of Cloud Haskell (CH). We introduce Cloud Haskell processes to wrap actual Motr services. This provides an endpoint for Cloud Haskell messages as well as a place to monitor the Motr service. Communication between the actual Motr service and the wrapping Cloud Haskell process happens using a Motr-specific communication protocol, Motr RPC.
+
+### Reporting a Late Node
+Upon receipt of an RPC request, the matching request handler checks the epoch of the sender of the message. If the epoch does not match its current epoch, then epoch handlers are invoked in turn until one epoch handler is found that returns an outcome other than M0_HEO_CONTINUE.
+
+An epoch handler is installed that, upon epoch mismatch, sends a message to an RPC endpoint associated with the wrapper process, mentioning the current epoch of the HA domain to which the service belongs. If the original RPC request requires a response, then the outcome of this handler is M0_HEO_ERROR. Otherwise, the outcome is M0_HEO_DROP.
+
+The wrapper CH process listens on the designated endpoint and waits for messages from the Motr service. Upon receipt of an epoch failure message, it forwards this message in the form of an HA event to the tracking station, through the usual channels for HA events.
+
+### Setting the Epoch
+In response to a late epoch HA event, the RC needs to perform recovery of the affected node, by replaying each recovery message associated with each epoch change since the current epoch of the affected node. These messages are sent to the wrapper process, which then forwards them using Motr RPC to the actual Motr service. In the extreme, such messages can be as simple as just updating the epoch of the affected HA domain, without touching any other state in the node. Transitioning to the new epoch is done using an epoch handler, which must respond to messages of the "set" type from the wrapper process with the outcome M0_HEO_OBEY.
+
+Each wrapper process maintains a structure representing the HA domain. Upon receiving an epoch transition message, the wrapper process acquires a write lock on the HA domain structure, increments the epoch, and then releases the write lock. In this manner, any RPC request in a given HA domain sent between the wrapper process and the actual Motr service will mention the new epoch, and hence trigger the epoch handlers in the Motr service.
+
+## Interfaces
+To handle epoch failures, we introduce the following HA event:
+```
+EpochTransitionRequest { service      :: ServiceId
+                       , currentEpoch :: EpochId
+                       , targetEpoch  :: EpochId }
+```
+HA services that wrap Motr services must be able to accept the following additional messages:
+```
+EpochTransition { targetEpoch            :: EpochId
+                , epochTransitionPayload :: a }
+```
+Nodes cannot skip epochs. Therefore, an EpochTransition can only be processed if the current epoch is the one immediately preceding the target epoch. In general, to reach the target epoch of an EpochTransitionRequest, several EpochTransition messages must be sent. These will be processed in order by the target node. An EpochTransition message is parameterized by the type a of the transition payload. The transition payload is specific to each Motr service and in general contains instructions understood by the Motr service to alter the item of shared state associated with the HA domain of the epoch.
+
+We do not include an HA domain identifier explicitly in messages.
+This is because the HA domain is implicit in the service identifier, since we assume that there is exactly one (possibly shared) HA domain for each Motr service.
+
+## Variability
+This design has the Cloud Haskell wrapping process communicate with a Motr service using Motr RPC. However, the internal communication mechanism between the wrapper and the Motr service could instead be a command line interface, pipes, or Unix sockets. The details of how this will be done are a point to be clarified in the future.

From 86ff8f73a0131ff73a006c80d81b533a34f19aa4 Mon Sep 17 00:00:00 2001
From: Swatid
Date: Fri, 29 Jul 2022 18:00:55 +0530
Subject: [PATCH 2/3] uploading motr HA Interface

Signed-off-by: Swatid
---
 doc/HLD-of-Motr-HA-nterface.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 doc/HLD-of-Motr-HA-nterface.md

diff --git a/doc/HLD-of-Motr-HA-nterface.md b/doc/HLD-of-Motr-HA-nterface.md
new file mode 100644
index 00000000000..8407cd08a1c
--- /dev/null
+++ b/doc/HLD-of-Motr-HA-nterface.md
@@ -0,0 +1,30 @@
+# High level design of Motr HA interface
+This document presents a high level design **(HLD)** of the interface between Motr and HA.
+
+The main purposes of this document are:
+1. To be inspected by Motr architects and peer designers to ascertain that the high level design is aligned with Motr architecture and other designs, and contains no defects
+2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component
+3. To serve as a design reference document.
+
+The intended audience of this document consists of Motr customers, architects, designers and developers.
+
+## Definitions
+HA interface — the API that allows Halon to control Motr and allows Motr to receive cluster state information from Halon.
+
+## Requirements
+* the HA interface and Spiel include all kinds of interaction between Motr and Halon;
+* notification/command ordering is enforced by the HA interface;
+* the HA interface is a reliable transport;
+* the HA interface is responsible for the reliability of event/command/notification delivery;
+* the HA interface is responsible for reconnecting after the endpoint on the other end dies;
+* Motr can send an event to Halon on error;
+* Motr can send detailed information about the event;
+* Halon is responsible for decisions about failures (whether something has failed or not);
+* Halon can query the state of a notification/command;
+* each pool repair/rebalance operation has a cluster-wide unique identifier;
+* the HA interface is asynchronous.
+
+## Analysis
+
+### Rationale
+Consubstantiation, as proposed by *D. Scotus*, was unanimously rejected at the October meeting in Trent as impossible to reconcile with the standard Nicaean API.
From a33ca040ef5050b18165a834d778cd5ed0395b84 Mon Sep 17 00:00:00 2001
From: Swatid
Date: Fri, 29 Jul 2022 19:29:34 +0530
Subject: [PATCH 3/3] uploading markdown files

Signed-off-by: Swatid
---
 doc/HLD-Data-Block-Allocator.md | 301 ++++++++++++++++++++++
 doc/ISC-Service-User-Guide.md   | 358 +++++++++++++++++++++++++
 doc/Motr-Lnet-Transport.md      | 444 ++++++++++++++++++++++++++++++++
 3 files changed, 1103 insertions(+)
 create mode 100644 doc/HLD-Data-Block-Allocator.md
 create mode 100644 doc/ISC-Service-User-Guide.md
 create mode 100644 doc/Motr-Lnet-Transport.md

diff --git a/doc/HLD-Data-Block-Allocator.md b/doc/HLD-Data-Block-Allocator.md
new file mode 100644
index 00000000000..bb459137cf6
--- /dev/null
+++ b/doc/HLD-Data-Block-Allocator.md
@@ -0,0 +1,301 @@
+# Data Block Allocator
+
+This document presents a high level design (HLD) of a data block allocator for Motr core. The main purposes of this document are: (i) to be inspected by Motr architects and peer designers to ascertain that the high level design is aligned with Motr architecture and other designs, and contains no defects, (ii) to be a source of material for Active Reviews of Intermediate Design (ARID) and detailed level design (DLD) of the same component, (iii) to serve as a design reference document.
+
+The intended audience of this document consists of Motr customers, architects, designers and developers.
+
+## Introduction
+
+In Motr Core, global objects comprise sub-components, and sub-components are stored in containers. A container is a disk partition, or volume, with some metadata attached. These metadata are used to describe the container, to track the block usage within the container, and for other purposes. Sub-components in the container are identified by an identifier. A sub-component is accessed with that identifier and a logical offset within that sub-component. Sub-components are composed of data blocks from the container. A container uses some metadata to track the block mapping from sub-component-based logical block numbers to container-based physical block numbers. A container also needs some metadata to track the free space within it. The purpose of this document is to present the high level design of such a data block allocator, which tracks the block usage in the container.
+
+The data block allocator manages the free space in the container, and provides "allocate blocks" and "free blocks" interfaces to other components and layers. The main purpose of this document is to describe an efficient algorithm to allocate and free blocks quickly, transactionally, and in a manner friendly to I/O performance. The mapping from logical block numbers in sub-components to physical block numbers is out of the scope of this document. That is another task, and will be discussed in another document.
+
+## Definitions
+
+* Extent. An extent describes a range of space, with a "start" block number and a "count" of blocks (see the sketch after this list).
+
+* Block. The smallest unit of allocation.
+
+* Block Size. The number of bytes of a Block.
+
+* Group. The whole space is equally divided into groups of fixed size. That means every group has the same number of blocks. When allocating space, groups are iterated over to search for the best candidate. A group is locked during this step. Having multiple groups in a container can reduce lock contention.
+
+* Super Group. Group 0 is a special group. It contains the metadata of this container, and is not allocatable to ordinary objects.
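+
+As a sketch of the definitions above, the central concepts could be represented as below. The struct and field names are illustrative, not taken from the Motr sources; the fields follow the definitions and the group statistics described later in this document.
+
+```
+#include <stdint.h>
+
+/* An extent: a contiguous range of blocks. */
+struct mb_extent {
+        uint64_t e_start; /* first block number of the range */
+        uint64_t e_count; /* number of blocks in the range */
+};
+
+/*
+ * Per-group summary, kept alongside the group's free extent table.
+ * Group 0 is the Super Group and is never allocated to ordinary
+ * objects.
+ */
+struct mb_group_desc {
+        uint64_t gd_groupno;     /* group number, starting from zero */
+        uint64_t gd_free_blocks; /* total count of free blocks */
+        uint64_t gd_max_extent;  /* largest free extent, in blocks */
+};
+```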
+
+## Requirements
+
+* The data-block-allocator should perform well for both large contiguous I/O and small object I/O.
+
+* The data-block-allocator should survive node crashes.
+
+* The allocator should have a strategy similar to that of the ext4 allocator.
+
+* The allocator should be designed for concurrent use by multiple processes.
+
+* The allocator should support container inclusion.
+
+* The allocator should support merging allocation data of sub-containers into that of their parents.
+
+* The allocator should leave FOL traces sufficient to support FOL-driven fsck plugins which support all important operations fsck normally provides.
+
+* Pre-allocation is supported.
+
+## Design Highlights
+
+The Motr data-block-allocator will use the same block allocation algorithm as ext4. But instead of using a bitmap to track block usage, Motr will use extents to track free space. These free space extents will be stored in a database and updated with transaction support. Highlights of the Motr allocator (mostly derived from the ext4 allocator) are:
+
+* A container is equally divided into groups, each of which can be locked independently.
+
+* Extents are used to track the free space in groups.
+
+* Every group has statistical meta-data about itself, such as the largest available extent and the total free block count.
+
+* All this information is stored in databases. We will use Oracle DB5 in our project.
+
+* Object-based pre-allocation and group-based pre-allocation are both supported.
+
+* Blocks are allocated in batches; multiple blocks can be allocated at one time. A buddy algorithm is used to do the allocation.
+
+## Functional Specification
+
+### m0_mb_format
+
+Format the specified container, create groups, and initialize the free space extents.
+
+int m0_mb_format(m0_mb_context *ctxt);
+
+* ctxt: context pointer, including the handle to the database, transaction id, and global variables. The allocation database is usually replicated to harden data integrity.
+
+* return value: if succeeded, 0 is returned. Otherwise a negative value is returned to indicate the error.
+
+### m0_mb_init
+
+Initialize the working environment.
+
+int m0_mb_init(m0_mb_context **ctxt);
+
+* ctxt: pointer to a context pointer. The context is allocated in this function, and global variables and the environment are set up properly.
+
+* return value: if succeeded, 0 is returned. Otherwise a negative value is returned to indicate the error.
+
+### m0_mb_allocate_blocks
+
+Allocate blocks from the container.
+
+int m0_mb_allocate_blocks(m0_mb_context *ctxt, m0_mb_allocate_request *req);
+
+* ctxt: context pointer, including the handle to the database, transaction id, and global variables.
+
+* req: request, including the object identifier, logical offset within that object, count of blocks, allocation flags, preferred block number (goal), etc.
+
+* return value: if succeeded, the physical block number in the container is returned. Otherwise a negative value is returned to indicate the error.
+
+### m0_mb_free_blocks
+
+Free blocks back to the container.
+
+int m0_mb_free_blocks(m0_mb_context *ctxt, m0_mb_free_request *req);
+
+* ctxt: context pointer, including the handle to the database, transaction id, and global variables.
+
+* req: request, including the object identifier, logical offset within that object, physical block number, count of blocks, free flags, etc.
+
+* return value: if succeeded, 0 is returned. Otherwise a negative value is returned to indicate the error.
+
+### m0_mb_enforce
+
+Forcibly modify the allocation status: set an extent as allocated or free.
+
+int m0_mb_enforce(m0_mb_context *ctxt, bool alloc, m0_extent *ext);
+
+* ctxt: context pointer, including a handle to the database, transaction id, and global variables.
+
+* alloc: true to set the specified extent as allocated, or false to set it free.
+
+* ext: user specified extent.
+
+* return value: if successful, 0 is returned. Otherwise a negative value is returned to indicate the error.
+
+## Logical Specification
+
+All data blocks have only two states: allocated or free. Free data blocks are tracked by extents. There is no need to track allocated blocks in this layer; allocated data will be managed by the object block mapping or extent mapping metadata, which is covered by other components.
+
+The smallest allocation and free unit is called a block. A block is also the smallest read/write unit from/to this layer. For example, a typical ext4 file system would have a block size of 4096 bytes.
+
+The container is divided into multiple groups, each containing the same number of blocks. To speed up space management and maximize performance, locking is imposed at the granularity of groups. Groups are numbered starting from zero. Group zero, named the "Super Group", is reserved for special purposes and used to store container metadata. It will never be used by ordinary objects.
+
+Every group has a group description, which contains useful information about the group: the largest free block extent, the count of free blocks, etc. Every group description is stored in the database as a separate table.
+
+Free space is tracked by extents. Every extent has a "start" block number and a "count" of blocks. Every group may have multiple chunks of free space, which are represented by multiple extents. The extents belonging to a group are stored in a database table; every group has its own table. Concurrent read/write access to the same table is controlled by a per-group lock.
+
+Block allocation uses the same buddy-style algorithm as ext4. Various flags are passed to the allocator to control the size of an allocation. Different applications may need different allocation sizes and different block placement; e.g., stream data and striped data have different requirements. In all the following operations, FOL records will be generated and logged, and these records may help to do file system checking (fsck-ing).
+
+* m0_mb_format. This routine creates the database, group description tables, and free space extent tables for a container. Every container has a table called super_block, which contains container-wide information, such as block size, group size, etc. Every group has two tables: a description table and a free extent table. They are used to store group-wide information and its allocation state.
+
+* m0_mb_init. This routine creates a working environment, reading information about the container and its groups from the data tables.
+
+* m0_mb_allocate_blocks. This routine searches the groups to find the most suitable free space. It uses the in-memory buddy system to aid the search. If free space is allocated successfully, updates to the group description and free space tables are done within the same transaction.
+
+* m0_mb_free_blocks. This routine updates the in-memory buddy system, and then updates the group description and free space tables to reflect these changes. Sanity checking against double free is done here.
+
+* m0_mb_enforce. This routine is used by fsck or other tools to modify the block allocation status forcibly.
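+
+The following sketch shows how a caller might drive these routines end to end. It is a minimal sketch only: error handling is abbreviated, and the request field names (mar_*, mfr_*) are assumptions based on the descriptions above, not a definitive interface:
+
+```
+/* Hypothetical usage of the m0_mb_* interfaces described above. */
+int container_alloc_demo(void)
+{
+        m0_mb_context          *ctxt;
+        m0_mb_allocate_request  areq;
+        m0_mb_free_request      freq;
+        int                     rc;
+
+        rc = m0_mb_init(&ctxt);           /* set up db handles, buddy data */
+        if (rc != 0)
+                return rc;
+
+        areq.mar_object = 42;             /* object identifier (assumed field) */
+        areq.mar_offset = 0;              /* logical offset in the object */
+        areq.mar_count  = 128;            /* blocks wanted */
+        areq.mar_goal   = 0;              /* preferred physical block, 0 = none */
+        rc = m0_mb_allocate_blocks(ctxt, &areq);
+        if (rc < 0)
+                return rc;                /* rc >= 0 is the physical block number */
+
+        freq.mfr_object = 42;
+        freq.mfr_offset = 0;
+        freq.mfr_block  = rc;             /* physical block returned above */
+        freq.mfr_count  = 128;
+        return m0_mb_free_blocks(ctxt, &freq);
+}
+```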
+
+A comparison of the Motr data-block-allocator and the ext4 multi-block allocator is given in the table below.
+
+| | Ext4 Multi-block Allocator | Motr data-block-allocator |
+| :------------- | :------------- | :------------- |
+| on-disk free block tracking | bitmap | extent |
+| in-memory free block tracking | buddy | buddy with extent |
+| block allocation | multi-block buddy | multi-block buddy |
+| pre-allocation | per-inode, per-group | per-object, per-group |
+| cache | bitmap, buddy all cached | limited cache |
+
+The metadata for free space tracking and space statistics is stored in databases, while the databases themselves are stored in regular files. These files are stored in metadata containers. The high availability, reliability and integrity of these database files rely on these metadata containers. The metadata containers are usually striped over multiple devices, with parity protection. These databases may also use replication technology to improve data availability.
+
+### Conformance
+
+* Every group has its own group description and free space extent table. Locks have group granularity. This reduces lock contention, and therefore leads to good performance.
+
+* Free space is represented as extents. This is efficient in most cases.
+
+* Updates to the allocation status are protected by database transactions. This ensures the data-block-allocator survives a node crash.
+
+* Operations of the allocator are logged in the FOL. This log can be used by other components, e.g. fsck.
+
+### Dependencies
+
+There are some dependencies on containers, but a simulation of a simple container will be used to avoid them.
+
+## State
+
+### States, Events, and Transitions
+
+Every block is either allocated or free. Tracking of free space is covered by this component. Tracking of allocated blocks is managed by the object block mapping, which is another component. Blocks can be allocated from the container. Blocks can also be freed from objects.
+
+Allocated blocks and free blocks should be consistent: together they should cover the whole container space, without any intersection. This will be checked by fsck-like tools in Motr Core. Allocation databases are usually replicated to improve metadata integrity.
+
+### Concurrency Control
+
+Concurrent read access to group descriptions and free space extents is permitted. Write (update) access must be serialized. Concurrent read/write access to different group descriptions and free space extents is permitted. This enables parallel allocation on SMP systems.
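+
+As an illustration of this per-group serialization, the sketch below takes a group's lock for the duration of a search-and-update; the lock field, the group type and the helper name are assumptions for illustration only, while m0_mutex comes from Motr's lib/mutex.h and m0_mb_extent is the hypothetical type sketched earlier:
+
+```
+#include "lib/mutex.h"
+
+/* Hypothetical per-group allocation under the group lock. */
+struct mb_group {
+        struct m0_mutex g_lock; /* serializes updates to this group */
+        /* group description and free-extent table handles ... */
+};
+
+int group_try_allocate(struct mb_group *grp, uint64_t want,
+                       struct m0_mb_extent *out)
+{
+        int rc;
+
+        m0_mutex_lock(&grp->g_lock);
+        /* Search this group's free extents (buddy-assisted); if a fit
+         * is found, update the description and extent tables within one
+         * transaction. Readers of other groups are unaffected. */
+        rc = group_search_and_update(grp, want, out); /* assumed helper */
+        m0_mutex_unlock(&grp->g_lock);
+        return rc;
+}
+```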
+
+## Use Cases
+
+**Scenarios**
+
+Scenario 1
+
+| Scenario | [usecase.data-block-allocator.format] |
+| :------------- | :------------- |
+| Relevant quality attributes | |
+| Stimulus | Initialize a container |
+| Stimulus source | User/Admin |
+| Environment | Container |
+| Artifact | Fully formatted container, ready for use |
+| Response | Initialize the metadata in the db |
+| Response measure | Container is in its initial status, ready for use |
+| Questions and issues | |
+
+Scenario 2
+
+| Scenario | [usecase.data-block-allocator.init] |
+| :------------- | :------------- |
+| Relevant quality attributes | |
+| Stimulus | Container init/startup |
+| Stimulus source | System bootup, user/admin starts the container services |
+| Environment | Container |
+| Artifact | Working environment |
+| Response | Set up the working environment, including db handle, buddy information, etc. |
+| Response measure | All data structures are properly set up |
+| Questions and issues | |
+
+Scenario 3
+
+| Scenario | [usecase.data-block-allocator.allocate] |
+| :------------- | :------------- |
+| Relevant quality attributes | concurrency; scalability should be good |
+| Stimulus | object write or truncate |
+| Stimulus source | object |
+| Environment | Container |
+| Artifact | blocks allocated to the object |
+| Response | free blocks become allocated; free space tables updated |
+| Response measure | correct extents updated to reflect this allocation |
+| Questions and issues | |
+
+Scenario 4
+
+| Scenario | [usecase.data-block-allocator.free] |
+| :------------- | :------------- |
+| Relevant quality attributes | concurrency, scalability |
+| Stimulus | object delete, truncate |
+| Stimulus source | object |
+| Environment | Container |
+| Artifact | allocated blocks become free, usable again by other objects |
+| Response | mark blocks as free, add them into the free space tables |
+| Response measure | free space tables correctly updated; sanity check passed |
+| Questions and issues | |
+
+Scenario 5
+
+| Scenario | [usecase.data-block-allocator.recovery] |
+| :------------- | :------------- |
+| Relevant quality attributes | fault tolerance |
+| Stimulus | node failure |
+| Stimulus source | accidental power down, software bugs |
+| Environment | Container |
+| Artifact | free space and allocated space are consistent |
+| Response | recover the database and object metadata within the same transaction |
+| Response measure | consistent space |
+| Questions and issues | |
+
+Scenario 6
+
+| Scenario | [usecase.data-block-allocator.fscking] |
+| :------------- | :------------- |
+| Relevant quality attributes | fault tolerance |
+| Stimulus | consistency checking |
+| Stimulus source | user/admin |
+| Environment | Container |
+| Artifact | consistent container |
+| Response | fix issues found in the checking |
+| Response measure | whether the container is consistent |
+| Questions and issues | |
+
+Scenario 7
+
+| Scenario | [usecase.data-block-allocator.fixup] |
+| :------------- | :------------- |
+| Relevant quality attributes | fault tolerance |
+| Stimulus | consistency fixup |
+| Stimulus source | user/admin |
+| Environment | container, data or metadata |
+| Artifact | unusable container fixed and usable again |
+| Response | prevent mounting before the fix. Fixes are: delete the objects that are using free space, or mark the free space as used |
+| Response measure | the fix should remove the inconsistency |
+| Questions and issues | |
+
+## Analysis
+
+### Scalability
+
+Lock per group enables concurrent access to the free space extent tables and description tables. This improves scalability.
+
+## References
+
+[0] Ext4 multi-block allocator
diff --git a/doc/ISC-Service-User-Guide.md b/doc/ISC-Service-User-Guide.md
new file mode 100644
index 00000000000..e78000ad3b5
--- /dev/null
+++ b/doc/ISC-Service-User-Guide.md
@@ -0,0 +1,358 @@
+# ISC User Guide
+This is the ISC user guide.
+## Preparing Library
+APIs from an external library cannot be linked directly with a Motr instance. A library is supposed to have a function named motr_lib_init(). This function will then link the relevant APIs with Motr. Every function to be linked with Motr shall conform to the following signature:
+
+```
+int comp(struct m0_buf *args, struct m0_buf *out,
+         struct m0_isc_comp_private *comp_data, int *rc)
+```
+
+All relevant library APIs shall be prepared with a wrapper conforming to this signature. Let libarray be the library we intend to link with Motr, with the following APIs: arr_max(), arr_min(), arr_histo().
+
+### Registering APIs
+motr_lib_init() links all the APIs. Here is an example code (please see iscservice/isc.h for more details):
+
+```
+void motr_lib_init(void)
+{
+        int rc;
+
+        rc = m0_isc_comp_register(arr_max, "max",
+                                  string_to_fid("arr_max"));
+        if (rc != 0)
+                error_handle(rc);
+
+        rc = m0_isc_comp_register(arr_min, "min",
+                                  string_to_fid("arr_min"));
+        if (rc != 0)
+                error_handle(rc);
+
+        rc = m0_isc_comp_register(arr_histo, "arr_histo",
+                                  string_to_fid("arr_histo"));
+        if (rc != 0)
+                error_handle(rc);
+}
+```
+
+## Registering Library
+Let libpath be the path where the library is located. The program needs to load the same at each of the Motr nodes. This is done using:
+
+```
+int m0_spiel_process_lib_load(struct m0_spiel *spiel,
+                              struct m0_fid *proc_fid,
+                              char *libpath)
+```
+This will ensure that motr_lib_init() is called to register the relevant APIs.
+
+## Invoking API
+Motr has its own RPC mechanism to invoke a remote operation. In order to conduct a computation on data stored with Motr it’s necessary to share the computation’s fid (a unique identifier associated with it during its registration) and the relevant input arguments. Motr uses the fop/fom framework to execute an RPC. A fop represents a request to invoke a remote operation, and it shall be populated with the relevant parameters by a client. A request is executed by a server using a fom. The fop for the ISC service is self-explanatory. The examples in the next subsection will make it clearer.
+
+```
+/** A fop for the ISC service. */
+struct m0_fop_isc {
+        /** An identifier of the computation registered with the
+         *  service.
+         */
+        struct m0_fid fi_comp_id;
+
+        /**
+         * An array holding the relevant arguments for the
+         * computation.
+         * This might involve gfid, cob fid, and a few other parameters
+         * relevant to the required computation.
+         */
+        struct m0_rpc_at_buf fi_args;
+
+        /**
+         * An rpc AT buffer requesting the output of the computation.
+         */
+        struct m0_rpc_at_buf fi_ret;
+
+        /** A cookie for fast searching of a computation. */
+        struct m0_cookie fi_comp_cookie;
+} M0_XCA_RECORD M0_XCA_DOMAIN(rpc);
+```
+
+## Examples
+**Hello-World**
+
+Consider a simple API that, on reception of the string “Hello”, responds with “World” along with return code 0. For any other input it does not respond with any string, but returns an error code of -EINVAL. The client needs to send an m0_fop_isc populated with “Hello”. First we will see how the client or caller needs to initialise certain structures and send them across. Subsequently we will see what needs to be done at the server side.
+The following code snippet illustrates how to initialize the m0_fop_isc.
+
+```
+/**
+ * Prerequisite: in_string is NUL-terminated.
+ * isc_fop  : A fop to be populated.
+ * in_args  : Input to be shared with the ISC service.
+ * in_string: Input string.
+ * conn     : An rpc connection to the ISC service. Should be
+ *            established beforehand.
+ */
+int isc_fop_init(struct m0_fop_isc *isc_fop, struct m0_buf *in_args,
+                 char *in_string, struct m0_rpc_conn *conn)
+{
+        int rc;
+
+        /* Map the string to a Motr buffer. */
+        m0_buf_init(in_args, in_string, strlen(in_string));
+        /* Initialise the RPC adaptive transmission data structure. */
+        m0_rpc_at_init(&isc_fop->fi_args);
+        /* Add the Motr buffer to m0_rpc_at. */
+        rc = m0_rpc_at_add(&isc_fop->fi_args, in_args, conn);
+        if (rc != 0)
+                return rc;
+
+        /* Initialise the return buffer. */
+        m0_rpc_at_init(&isc_fop->fi_ret);
+        rc = m0_rpc_at_recv(&isc_fop->fi_ret, conn, REPLY_SIZE, false);
+        if (rc != 0)
+                return rc;
+
+        return 0;
+}
+```
+
+Let’s see how this fop is sent across to execute the required computation.
+
+```
+#include "iscservice/isc.h"
+#include "fop/fop.h"
+#include "rpc/rpclib.h"
+
+int isc_fop_send_sync(struct m0_fop_isc *isc_fop,
+                      struct m0_rpc_session *session)
+{
+        struct m0_fop          fop;
+        struct m0_fop         *reply_fop;
+        /* Holds the reply from a computation. */
+        struct m0_fop_isc_rep  reply;
+        struct m0_buf          recv_buf;
+        int                    rc;
+
+        M0_SET0(&fop);
+        m0_fop_init(&fop, &m0_fop_isc_fopt, isc_fop, m0_fop_release);
+        /*
+         * A blocking call that returns only when the reply or an error
+         * in sending is received.
+         */
+        rc = m0_rpc_post_sync(&fop, session, NULL, M0_TIME_IMMEDIATELY);
+        if (rc != 0)
+                return error_handle();
+
+        /* Capture the reply from the computation. */
+        reply_fop = m0_rpc_item_to_fop(fop.f_item.ri_reply);
+        reply = *(struct m0_fop_isc_rep *)m0_fop_data(reply_fop);
+        /* Handle an error received during run-time. */
+        if (reply.fir_rc != 0)
+                return error_handle();
+
+        /* Obtain the result of the computation. */
+        rc = m0_rpc_at_rep_get(&isc_fop->fi_ret, &reply.fir_ret, &recv_buf);
+        if (rc != 0) {
+                comp_error_handle(rc, &recv_buf);
+        }
+
+        if (strcmp(fetch_reply(&recv_buf), "World") != 0) {
+                comp_error_handle(rc, &recv_buf);
+        } else {
+                /* Process the reply. */
+                reply_handle(&recv_buf);
+                /* Finalize the relevant structures. */
+                m0_rpc_at_fini(&isc_fop->fi_args);
+                m0_rpc_at_fini(&reply.fir_ret);
+        }
+
+        return 0;
+}
+```
+
+We now discuss the callee-side code. Let’s assume that the function is registered as “greetings” with the service.
+```
+void motr_lib_init(void)
+{
+        int rc;
+
+        rc = m0_isc_comp_register(greetings, "hello-world",
+                                  string_to_fid("greetings"));
+        if (rc != 0)
+                error_handle(rc);
+}
+
+int greetings(struct m0_buf *in, struct m0_buf *out,
+              struct m0_isc_comp_private *comp_data, int *rc)
+{
+        char *out_str;
+
+        if (m0_buf_streq(in, "Hello")) {
+                /*
+                 * The string allocated here should not be freed by the
+                 * computation; Motr takes care of freeing it.
+                 */
+                out_str = m0_strdup("World");
+                if (out_str != NULL) {
+                        m0_buf_init(out, out_str, strlen(out_str));
+                        *rc = 0;
+                } else
+                        *rc = -ENOMEM;
+        } else
+                *rc = -EINVAL;
+
+        return M0_FSO_AGAIN;
+}
+```
+
+## Min/Max
+The Hello-World example sends across a string. In real applications the input can be a composition of multiple data types. It’s necessary to serialise a composite data type into a buffer. Motr provides a mechanism to do so using xcode/xcode.[ch]. Any other serialization mechanism that’s suitable and tested can also be used, e.g. Google’s Protocol Buffers.
+But we have not tested any such external library for serialization, and hence in this document we will use Motr’s xcode APIs.
+
+In this example we will see how to send a composite data type to a registered function. The declaration of an object that needs to be serialised shall be tagged with one of the types identified by xcode. Every member of this structure shall also be representable using an xcode type. Please refer to xcode/ut/ for different examples.
+
+Suppose we have a collection of arrays of integers, each stored as a Motr object. Our aim is to find the min or max of the values stored across all arrays. The caller communicates the list of global fids (the unique identifiers of the stored objects in Motr) to the registered computation for min/max. The computation then returns the min or max of the locally (on the relevant node) stored values. The caller then takes the min or max of all the received values. The following structure can be used to communicate with the registered computation.
+
+```
+/* Arguments for getting min/max. */
+struct arr_fids {
+        /* Number of arrays stored with Motr. */
+        uint32_t af_arr_nr;
+        /* An array holding the unique identifiers of the arrays. */
+        struct m0_fid *af_gfids;
+} M0_XCA_SEQUENCE;
+```
+
+Before sending the list of fids it’s necessary to serialise it into a buffer, because it’s a requirement of ISC that all computations take their input in the form of a buffer. The following snippet illustrates the same.
+
+```
+int arr_to_buff(struct arr_fids *in_array, struct m0_buf *out_buf)
+{
+        int rc;
+
+        rc = m0_xcode_obj_enc_to_buf(XCODE_OBJ(arr_fids),
+                                     &out_buf->b_addr,
+                                     &out_buf->b_nob);
+        if (rc != 0)
+                error_handle(rc);
+
+        return rc;
+}
+```
+
+The output buffer out_buf can now be used with the RPC AT mechanism introduced in the previous subsection. On the receiver side a computation can deserialize the buffer to convert it into the original structure. The following snippet demonstrates the same.
+
+```
+int buff_to_arr(struct m0_buf *in_buf, struct arr_fids *out_arr)
+{
+        int rc;
+
+        rc = m0_xcode_obj_dec_from_buf(XCODE_OBJ(arr_fids),
+                                       &in_buf->b_addr,
+                                       in_buf->b_nob);
+        if (rc != 0)
+                error_handle(rc);
+
+        return rc;
+}
+```
+
+Preparation and handling of a fop is similar to that in the Hello-World example. Once the computation is invoked, it reads each object’s locally stored values and finds the min/max of the same, eventually finding the min/max across all arrays stored locally. In the next example we shall see how a computation involving an IO can be designed.
+
+## Histogram
+We now explore a more complex example where a computation involves an IO, and hence needs to wait for the completion of the IO. A user stores an object with Motr. This object holds a sequence of values. The size of the object, in terms of the number of values held, is known. The aim is to generate a histogram of the values stored. This is accomplished in two steps. In the first step the user invokes a computation on the remote Motr servers, and each server generates a histogram of the values stored with it. In the second step, these histograms are communicated to the user, who adds them cumulatively to generate the final histogram. The following structure describes the list of arguments that will be communicated by a caller to the ISC service for generating a histogram. ISC is associated only with the first step.
+
+```
+/* Input for histogram generation. */
+struct histo_args {
+        /** Number of bins for the histogram. */
+        uint32_t ha_bins_nr;
+
+        /** Maximum value. */
+        uint64_t ha_max_val;
+
+        /** Minimum value. */
+        uint64_t ha_min_val;
+
+        /** Global fid of the object stored with Motr. */
+        struct m0_fid ha_gob_fid;
+} M0_XCA_RECORD;
+```
+
+The array of values stored with Motr will be identified using a global id, represented here as ha_gob_fid. It has been assumed that the maximum and minimum values over the array are known or have been made available by previous calls to arr_max() and arr_min().
+
+Here we discuss the API for generating a histogram of the values local to a node. The caller or client side shall populate the struct histo_args and send it across using an m0_fop_isc.
+
+```
+/*
+ * The structure of a computation is advisable to be similar to that of
+ * Motr foms. It returns M0_FSO_WAIT when it has to wait for an external
+ * event (n/w or disk I/O), else it returns M0_FSO_AGAIN. These two
+ * symbols are defined in Motr.
+ */
+int histo_generate(struct m0_buf *in, struct m0_buf *out,
+                   struct m0_isc_comp_private *comp_data,
+                   int *ret)
+{
+        struct histo_args   *args;
+        struct histogram    *histogram;
+        struct hist_partial *hist_partial;
+        uint32_t             disk_id;
+        uint32_t             nxt_disk;
+        int                  rc;
+        int                  phase;
+
+        phase = comp_phase_get(comp_data);
+
+        switch (phase) {
+        case COMP_INIT:
+                /*
+                 * Deserializes the input buffer into "struct histo_args"
+                 * and stores the same in comp_data.
+                 */
+                histo_args_fetch(in, out, comp_data);
+                rc = args_sanity_check(comp_data);
+                if (rc != 0) {
+                        private_data_cleanup(comp_data);
+                        *ret = rc;
+                        return M0_FSO_AGAIN;
+                }
+                comp_phase_set(comp_data, COMP_IO);
+                /* Fall through to launch the IO. */
+
+        case COMP_IO:
+                disk_id = disk_id_fetch(comp_data);
+                /*
+                 * This will make the fom (comp_data->icp_fom) wait on a
+                 * Motr channel which will be signalled on completion of
+                 * the IO event. gfid, buf, offset and len are obtained
+                 * from comp_data (elided here).
+                 */
+                rc = m0_ios_read_launch(gfid, disk_id, buf, offset, len,
+                                        comp_data->icp_fom);
+                if (rc != 0) {
+                        private_data_cleanup(comp_data);
+                        /* The computation is complete, with an error. */
+                        *ret = rc;
+                        return M0_FSO_AGAIN;
+                }
+                comp_phase_set(comp_data, COMP_EXEC);
+                /*
+                 * Returning M0_FSO_WAIT is necessary for the Motr
+                 * instance to decide whether to retry.
+                 */
+                return M0_FSO_WAIT;
+        }
+}
+```
diff --git a/doc/Motr-Lnet-Transport.md b/doc/Motr-Lnet-Transport.md
new file mode 100644
index 00000000000..7bf2528a67f
--- /dev/null
+++ b/doc/Motr-Lnet-Transport.md
@@ -0,0 +1,444 @@
+# HLD OF Motr LNet Transport
+This document presents a high level design (HLD) of the Motr LNet Transport. The main purposes of this document are:
+1. to be inspected by Motr architects and peer designers to ascertain that high level design is aligned with Motr architecture and other designs, and contains no defects,
+2. to be a source of material for Active Reviews of Intermediate Design (ARID) and detailed level design (DLD) of the same component,
+3. to serve as a design reference document.
+
+## Introduction
+The scope of this HLD includes the net.lnet-user and net.lnet-kernel tasks described in [1]. Portions of the design are influenced by [4].
+
+## Definitions
+* **Network Buffer**: This term is used to refer to a struct m0_net_buffer. The word “buffer”, if used by itself, will be qualified by its context - it may not always refer to a network buffer.
+* **Network Buffer Vector**: This term is used to refer to the struct m0_bufvec that is embedded in a network buffer. The related term, “I/O vector” if used, will be qualified by its context - it may not always refer to a network buffer vector.
+* **Event queue, EQ**: A data structure used to receive LNet events. Associated with an MD.
+* **LNet**: The Lustre Networking module.
It implements version 3.2 of the Portals Message Passing Interface, and provides access to a number of different transport protocols including InfiniBand and TCP over Ethernet. +* **LNet address**: This is composed of (NID, PID, Portal Number, Match bits, offset). The NID specifies a network interface end point on a host, the PID identifies a process on that host, the Portal Number identifies an opening in the address space of that process, the Match Bits identify a memory region in that opening, and the offset identifies a relative position in that memory region. +* **LNet API**: The LNet Application Programming Interface. This API is provided in the kernel and implicitly defines a PID with value LUSTRE_SRV_LNET_PID, representing the kernel. Also see ULA. +* **LNetNetworkIdentifierString**: The external string representation of an LNet Network Identifier (NID). It is typically expressed as a string of the form “Address@InterfaceType[Number]” where the number is used if there are multiple instances of the type, or plans to configure more than one interface in the future. e.g. “10.67.75.100@o2ib0”. +* **Match bits**: An unsigned 64 bit integer used to identify a memory region within the address space defined by a Portal. Every local memory region that will be remotely accessed through a Portal must either be matched exactly by the remote request, or wildcard matched after masking off specific bits specified on the local side when configuring the memory region. +* **Memory Descriptor, MD**: An LNet data structure identifying a memory region and an EQ. +* **Match Entry, ME**: An LNet data structure identifying an MD and a set of match criteria, including Match bits. Associated with a portal. +* **NID, lnet_nid_t**: Network Identifier portion of an LNet address, identifying a network end point. There can be multiple NIDs defined for a host, one per network interface on the host that is used by LNet. A NID is represented internally by an unsigned 64 bit integer, with the upper 32 bits identifying the network and the lower 32 bits the address. The network portion itself is composed of an upper 16 bit network interface type and the lower 16 bits identify an instance of that type. See LNetNetworkIdentifierString for the external representation of a NID. +* **PID, lnet_pid_t**: Process identifier portion of an LNet address. This is represented internally by a 32 bit unsigned integer. LNet assigns the kernel a PID of LUSTRE_SRV_LNET_PID (12345) when the module gets configured. This should not be confused with the operating system process identifier space which is unrelated. +* **Portal Number**: This is an unsigned integer that identifies an opening in a process address space. The process can associate multiple memory regions with the portal, each region identified by a unique set of Match bits. LNet allows up to MAX_PORTALS portals per process (64 with the Lustre 2.0 release) +* **Portals Messages Passing Interface**: An RDMA based specification that supports direct access to application memory. LNet adheres to version 3.2 of the specification. +* **RDMA ULA**: A port of the LNet API to user space, that communicates with LNet in the kernel using a private device driver. User space processes still share the same portal number space with the kernel, though their PIDs can be different. Event processing using the user space library is relatively expensive compared to direct kernel use of the LNet API, as an ioctl call is required to transfer each LNet event to user space. 
The user space LNet library is protected by the GNU Public License. ULA makes modifications to the LNet module in the kernel that have not yet, at the time of Lustre 2.0, been merged into the mainstream Lustre source repository. The changes are fully compatible with existing usage. The ULA code is currently in a Motr repository module.
+* **LNet Transport End Point Address**: The design defines an LNet transport end point address to be a 4-tuple string in the format “LNetNetworkIdentifierString : PID : PortalNumber : TransferMachineIdentifier”. The TransferMachineIdentifier serves to distinguish between transfer machines sharing the same NID, PID and PortalNumber. The LNet Transport End Point Addresses concurrently in use on a host are distinct.
+* **Mapped memory page**: A memory page (struct page) that has been pinned in memory using the get_user_pages subroutine.
+* **Receive Network Buffer Pool**: This is a pool of network buffers, shared between several transfer machines. This common pool reduces the fragmentation of the cache of receive buffers in a network domain that would arise were each transfer machine to be individually provisioned with receive buffers. The actual staging and management of network buffers in the pool is provided through the [r.m0.net.network-buffer-pool] dependency.
+* **Transfer Machine Identifier**: This is an unsigned integer that is a component of the end point address of a transfer machine. The number identifies a unique instance of a transfer machine in the set of addresses that use the same 3-tuple of NID, PID and Portal Number. The transfer machine identifier is related to a portion of the Match bits address space in an LNet address - i.e. it is used in the ME associated with the receive queue of the transfer machine.
+
+Refer to [3], [5] and to net/net.h in the Motr source tree for additional terms and definitions.
+
+## Requirements
+* [r.m0.net.rdma] Remote DMA is supported. [2]
+* [r.m0.net.ib] Infiniband is supported. [2]
+* [r.m0.net.xprt.lnet.kernel] Create an LNET transport in the kernel. [1]
+* [r.m0.net.xprt.lnet.user] Create an LNET transport for user space. [1]
+* [r.m0.net.xprt.lnet.user.multi-process] Multiple user space processes can concurrently use the LNet transport. [1]
+* [r.m0.net.xprt.lnet.user.no-gpl] Do not get tainted with the use of GPL interfaces in the user space implementation. [1]
+* [r.m0.net.xprt.lnet.user.min-syscalls] Minimize the number of system calls required by the user space transport. [1]
+* [r.m0.net.xprt.lnet.min-buffer-vm-setup] Minimize the amount of virtual memory setup required for network buffers in the user space transport. [1]
+* [r.m0.net.xprt.lnet.processor-affinity] Provide optimizations based on processor affinity.
+* [r.m0.net.buffer-event-delivery-control] Provide control over the detection and delivery of network buffer events.
+* [r.m0.net.xprt.lnet.buffer-registration] Provide support for hardware optimization through buffer pre-registration.
+* [r.m0.net.xprt.auto-provisioned-receive-buffer-pool] Provide support for a pool of network buffers from which transfer machines can automatically be provisioned with receive buffers. Multiple transfer machines can share the same pool, but each transfer machine is only associated with a single pool. There can be multiple pools in a network domain, but a pool cannot span multiple network domains.
+
+## Design Highlights
+The following figure shows the components of the proposed design and the usage relationships between it and other related components:
+
+![](https://github.com/Seagate/cortx-motr/raw/main/doc/Images/LNET.PNG)
+
+* The design provides an LNet based transport for the Motr Network Layer that co-exists with the concurrent use of LNet by Lustre. In the figure, the transport is labelled m0_lnet_u in user space and m0_lnet_k in the kernel.
+* The user space transport does not use ULA, to avoid GPL tainting. Instead it uses a proprietary device driver, labelled m0_lnet_dd in the figure, to communicate with the kernel transport module through private interfaces.
+* Each transfer machine is assigned an end point address that directly identifies the NID, PID and Portal Number portion of an LNet address, and a transfer machine identifier. The design will support multiple transfer machines for a given 3-tuple of NID, PID and Portal Number. It is the responsibility of higher level software to make network address assignments to Motr components such as servers and command line utilities, and to determine how clients are provided these addresses.
+* The design provides transport independent support to automatically provision the receive queues of transfer machines on demand, from pools of unused, registered network buffers. This results in greater utilization of receive buffers, as fragmentation of the available buffer space is reduced by delaying the commitment of attaching a buffer to specific transfer machines.
+* The design supports the reception of multiple messages into a single network buffer. Events will be delivered for each message serially.
+* The design addresses the overhead of communication between user space and kernel space. In particular, shared memory is used as much as possible, and each context switch covers more than one operation or event where possible.
+* The design allows an application to specify processor affinity for a transfer machine.
+* The design allows an application to control how and when buffer event delivery takes place. This is of particular interest to the user space request handler.
+
+## Functional Specification
+The design follows the existing specification of the Motr Network module described in net/net.h and [5] for the most part. See the Logical Specification for the reasons behind the features described in the functional specification.
+
+### LNet Transfer Machine End Point Address
+The Motr LNet transport defines the following 4-tuple end point address format for transfer machines:
+
+* NetworkIdentifierString : PID : PortalNumber : TransferMachineIdentifier
+
+where the NetworkIdentifierString (a NID string), the PID and the Portal Number are as defined in an LNet Address. The TransferMachineIdentifier is defined in the definitions section.
+
+Every Motr service request handler, client and utility program needs a set of unique end point addresses. This requirement is not unique to the LNet transport: an end point address in general follows the pattern
+
+* TransportAddress : TransferMachineIdentifier
+
+with the transfer machine identifier component further qualifying the transport address portion, resulting in a unique end point address per transfer machine. The existing bulk emulation transports use the same pattern, though they use a 2-tuple transport address and call the transfer machine identifier component a “service id” [5]. Furthermore, there is a strong relationship between a TransferMachineIdentifier and a FOP state machine locality [6] which needs further investigation. These issues are beyond the scope of this document and are captured in the [r.m0.net.xprt.lnet.address-assignment] dependency.
+
+The TransferMachineIdentifier is represented in an LNet ME by a portion of the higher order Match bits that form a complete LNet address. See Mapping of Endpoint Address to LNet Address for details.
+
+All fields in the end point address must be specified. For example:
+
+* 10.72.49.14@o2ib0:12345:31:0
+* 192.168.96.128@tcp1:12345:32:0
+
+The implementation should provide support to make it easy to dynamically assign an available transfer machine identifier, by specifying a * (asterisk) character as the transfer machine identifier component of the end point address passed to the m0_net_tm_start subroutine:
+
+* 10.72.49.14@o2ib0:12345:31:*
+
+If the call succeeds, the actual address assigned can be recovered from the transfer machine’s ntm_ep field. This is captured in refinement [r.m0.net.xprt.lnet.dynamic-address-assignment].
+
+#### Transport Variable
+The design requires the implementation to expose the following variable in user and kernel space through the header file net/lnet.h:
+
+* extern struct m0_net_xprt m0_lnet_xprt;
+
+The variable represents the LNet transport module, and its address should be passed to the m0_net_domain_init() subroutine to create a network domain that uses this transport. This is captured in the refinement [r.m0.net.xprt.lnet.transport-variable].
+
+#### Support for automatic provisioning from receive buffer pools
+
+The design includes support for the use of pools of network buffers that will be used to receive messages from one or more transfer machines associated with each pool. This results in greater utilization of receive buffers, as fragmentation is reduced by delaying the commitment of attaching a buffer to specific transfer machines. This results in transfer machines performing on-demand, minimal, policy-based provisioning of their receive queues. This support is transport independent, and hence can apply to the earlier bulk emulation transports in addition to the LNet transport.
+
+The design uses the struct m0_net_buffer_pool object to group network buffers into a pool. New APIs will be added to associate a network buffer pool with a transfer machine and to control the number of buffers the transfer machine will auto-provision from the pool; additional fields will be added to the transfer machine and network buffer data structures.
+
+The m0_net_tm_pool_attach() subroutine assigns the transfer machine a buffer pool in the same domain. A buffer pool can only be attached before the transfer machine is started. A given buffer pool can be attached to more than one transfer machine, but each transfer machine can only have an association with a single buffer pool. The life span of the buffer pool must exceed that of all associated transfer machines. Once a buffer pool has been attached to a transfer machine, the transfer machine implementation will obtain network buffers from the pool to populate its M0_NET_QT_MSG_RECV queue on an as-needed basis [r.m0.net.xprt.support-for-auto-provisioned-receive-queue].
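+
+As an illustration, a transfer machine might be wired to a receive pool as sketched below. This is a minimal sketch: the m0_net_tm_pool_attach() argument list shown here (transfer machine, pool, receive-queue callbacks) is an assumption based on the description above, not a final interface, and the end point address string is just an example.
+
+```
+/* Hypothetical wiring of a transfer machine to a receive buffer pool. */
+static struct m0_net_buffer_pool recv_pool; /* may be shared by several TMs */
+
+int tm_setup_with_pool(struct m0_net_transfer_mc *tm,
+                       const struct m0_net_buffer_callbacks *recv_cb)
+{
+        int rc;
+
+        /* Attach before the transfer machine is started; only the
+         * receive queue callback in recv_cb is used. */
+        rc = m0_net_tm_pool_attach(tm, &recv_pool, recv_cb);
+        if (rc != 0)
+                return rc;
+        /* Start the TM; its M0_NET_QT_MSG_RECV queue is then provisioned
+         * from recv_pool on demand. */
+        return m0_net_tm_start(tm, "10.72.49.14@o2ib0:12345:31:*");
+}
+```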
+The application provided buffer operation completion callbacks are defined by the callbacks argument of the attach subroutine - only the receive queue callback is used in this case. When the application callback is invoked upon receipt of a message, it is up to the application callback to determine whether to return the network buffer to the pool (identified by the network buffer’s nb_pool field) or not. The application should make sure that network buffers with the M0_NET_BUF_QUEUED flag set are not released back to the pool - this flag would be set in situations where there is sufficient space left in the network buffer for additional messages. See Requesting multiple message delivery in a single network buffer for details.
+
+When a transfer machine is stopped or fails, receive buffers that have been provisioned from a buffer pool will be put back into that pool by the time the state change event is delivered.
+
+The m0_net_domain_buffer_pool_not_empty() subroutine should be used, directly or indirectly, as the “not-empty” callback of a network buffer pool. We recommend direct use of this callback - i.e. the buffer pool is dedicated for receive buffer provisioning purposes only.
+
+Mixing automatic provisioning and manual provisioning in a given transfer machine is not recommended, mainly because the application would have to support two buffer release mechanisms for the automatically and manually provisioned network buffers, which may get confusing. See Automatic provisioning of receive buffers for details on how automatic provisioning works.
+
+#### Requesting multiple message delivery in a single network buffer
+
+The design extends the semantics of the existing Motr network interfaces to support delivery of multiple messages into a single network buffer. This requires the following changes:
+
+* A new field in the network buffer to indicate a minimum size threshold.
+* A documented change in behavior in the M0_NET_QT_MSG_RECV callback.
+
+The API will add the following fields to struct m0_net_buffer:
+
+```
+struct m0_net_buffer {
+        ...
+        m0_bcount_t nb_min_receive_size;
+        uint32_t    nb_max_receive_msgs;
+};
+```
+
+These values are only applicable to network buffers on the M0_NET_QT_MSG_RECV queue. If the transport supports this feature, then the network buffer is reused if possible, provided there is at least nb_min_receive_size space left in the network buffer vector embedded in this network buffer after a message is received. A zero value for nb_min_receive_size is not allowed. At most nb_max_receive_msgs messages are permitted in the buffer.
+
+The M0_NET_QT_MSG_RECV queue callback handler semantics are modified to not clear the M0_NET_BUF_QUEUED flag if the network buffer has been reused. Applications should not attempt to add the network buffer to a queue or de-register it until an event arrives with this flag unset.
+
+See Support for multiple message delivery in a single network buffer.
+
+#### Specifying processor affinity for a transfer machine
+
+The design provides an API for the higher level application to associate the internal threads used by a transfer machine with a set of processors. In particular, the API guarantees that buffer and transfer machine callbacks will be made only on the processors specified.
+
+```
+#include "lib/processor.h"
+
+...
+
+int m0_net_tm_confine(struct m0_net_transfer_mc *tm,
+                      const struct m0_bitmap *processors);
+```
+Support for this interface is transport specific and availability may also vary between user space and kernel space. If used, it should be called before the transfer machine is started. See Processor affinity for transfer machines for further detail.
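+
+For example, a transfer machine could be confined to the first two processors as sketched below, assuming the m0_bitmap helpers from lib/bitmap.h; error handling is abbreviated:
+
+```
+#include "lib/bitmap.h"
+
+/* Confine a TM's internal threads to processors 0 and 1 (sketch). */
+int tm_confine_example(struct m0_net_transfer_mc *tm)
+{
+        struct m0_bitmap cpus;
+        int              rc;
+
+        rc = m0_bitmap_init(&cpus, 2); /* a bitmap with two bits */
+        if (rc != 0)
+                return rc;
+        m0_bitmap_set(&cpus, 0, true);
+        m0_bitmap_set(&cpus, 1, true);
+        /* Must be invoked before the transfer machine is started. */
+        rc = m0_net_tm_confine(tm, &cpus);
+        m0_bitmap_fini(&cpus);
+        return rc;
+}
+```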
+ +#### Controlling network buffer event delivery + +The design provides the following APIs for the higher level application to control when network buffer event delivery takes place and which thread is used for the buffer event callback. +``` +void m0_net_buffer_event_deliver_all(struct m0_net_transfer_mc *tm); +int m0_net_buffer_event_deliver_synchronously(struct m0_net_transfer_mc *tm); +bool m0_net_buffer_event_pending(struct m0_net_transfer_mc *tm); +void m0_net_buffer_event_notify(struct m0_net_transfer_mc *tm, struct m0_chan *chan); +``` +See Request handler control of network buffer event delivery for the proposed usage. + +The m0_net_buffer_event_deliver_synchronously() subroutine must be invoked before starting the transfer machine, to disable the automatic asynchronous delivery of network buffer events on a transport provided thread. Instead, the application should periodically check for the presence of network buffer events with the m0_net_buffer_event_pending() subroutine and if any are present, cause them to get delivered by invoking the m0_net_buffer_event_deliver_all() subroutine. Buffer events will be delivered on the same thread making the subroutine call, using the existing buffer callback mechanism. If no buffer events are present, the application can use the non-blocking m0_net_buffer_event_notify() subroutine to request notification of the arrival of the next buffer event on a wait channel; the application can then proceed to block itself by waiting on this and possibly other channels for events of interest. + +This support will not be made available in existing bulk emulation transports, but the new APIs will not indicate error if invoked for these transports. Instead, asynchronous network buffer event delivery is always enabled and these new APIs will never signal the presence of buffer events for these transports. This allows a smooth transition from the bulk emulation transports to the LNet transport. + +#### Additional Interfaces +The design permits the implementation to expose additional interfaces if necessary, as long as their usage is optional. In particular, interfaces to extract or compare the network interface component in an end point address would be useful to the Motr request handler setup code. Other interfaces may be required for configurable parameters controlling internal resource consumption limits. + +#### Support for multiple message delivery in a single network buffer + +The implementation will provide support for this feature by using the LNet max_size field in a memory descriptor (MD). + +The implementation should de-queue the receive network buffer when LNet unlinks the MD associated with the network buffer vector memory. The implementation must ensure that there is a mechanism to indicate that the M0_NET_BUF_QUEUED flag should not be cleared by the m0_net_buffer_event_post() subroutine under these circumstances. This is captured in refinement [r.m0.net.xprt.lnet.multiple-messages-in-buffer]. + +#### Automatic provisioning of receive buffers + +The design supports policy based automatic provisioning of network buffers to the receive queues of transfer machines from a buffer pool associated with the transfer machine. This support is independent of the transport being used, and hence can apply to the earlier bulk emulation transports as well. 
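+
+In outline, the provisioning policy described in the rest of this section amounts to a top-up loop of the following shape. This is illustrative pseudologic: the helper names (tm_recv_queue_length, tm_colour, pool_get) and the ntm_recv_pool field are assumptions, while ntm_recv_queue_min_length and ntm_recv_pool_callbacks are the fields discussed below.
+
+```
+/* Sketch: top up a TM's receive queue from its pool (assumed helpers). */
+static void tm_provision_recv_queue(struct m0_net_transfer_mc *tm)
+{
+        struct m0_net_buffer *nb;
+
+        while (tm_recv_queue_length(tm) < tm->ntm_recv_queue_min_length) {
+                /* Non-blocking, "coloured" get, favouring buffers last
+                 * used by this TM. */
+                nb = pool_get(tm->ntm_recv_pool, tm_colour(tm));
+                if (nb == NULL)
+                        break; /* pool empty: remedied via not-empty callback */
+                nb->nb_qtype     = M0_NET_QT_MSG_RECV;
+                nb->nb_callbacks = tm->ntm_recv_pool_callbacks;
+                if (m0_net_buffer_add(nb, tm) != 0)
+                        break;
+        }
+}
+```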
+
+A detailed description of a buffer pool object itself is beyond the scope of this document, and is covered by the [r.m0.net.network-buffer-pool] dependency, but briefly, a buffer pool has the following significant characteristics:
+
+* It is associated with a single network domain.
+* It contains a collection of unused, registered network buffers from the associated network domain.
+* It provides non-blocking operations to obtain a network buffer from the pool, and to return a network buffer to the pool.
+* It provides a “not-empty” callback to notify when buffers are added to the pool.
+* It offers policies to enforce certain disciplines, like the size and number of network buffers.
+
+The rest of this section refers to the data structures and subroutines described in the functional specification section, Support for automatic provisioning from receive buffer pools.
+
+The m0_net_tm_pool_attach() subroutine is used, prior to starting a transfer machine, to associate it with a network buffer pool. This buffer pool is assumed to exist until the transfer machine is finalized. When the transfer machine is started, an attempt is made to fill the M0_NET_QT_MSG_RECV queue with a minimum number of network buffers from the pool. The network buffers will have their nb_callbacks value set from the transfer machine’s ntm_recv_pool_callbacks value.
+
+The advantage of using a common pool to provision the receive buffers of multiple transfer machines diminishes as the minimum receive queue length of a transfer machine increases. This is because, as the number increases, more network buffers need to be assigned (“pinned”) to specific transfer machines, fragmenting the total available receive network buffer space. The best utilization of total receive network buffer space is achieved by using a minimum receive queue length of 1 in all the transfer machines; however, this could result in messages getting dropped in the time it takes to provision a new network buffer when the first gets filled. The default minimum receive queue length value is set to 2, a reasonably balanced compromise value; it can be modified with the m0_net_tm_pool_length_set() subroutine if desired.
+
+Transports automatically dequeue receive buffers when they get filled; notification of the completion of the buffer operation is sent by the transport with the m0_net_buffer_event_post() subroutine. This subroutine will be extended to get more network buffers from the associated pool and add them to the transfer machine’s receive queue using the internal in-tm-mutex equivalent of the m0_net_buffer_add subroutine, if the length of the transfer machine’s receive queue is below the value of ntm_recv_queue_min_length. The re-provisioning attempt is made prior to invoking the application callback to deliver the buffer event, so as to minimize the amount of time the receive queue is below its minimum value.
+
+The application has a critical role to play in returning a network buffer back to its pool. If this is not done, it is possible for the pool to get exhausted and messages to get lost. This responsibility is no different from normal non-pool operation, where the application has to re-queue the receive network buffer. The application should note that when multiple message delivery is enabled in a receive buffer, the buffer flags should be examined to determine if the buffer has been dequeued.
+It is possible for the pool to have no network buffers available when the m0_net_buffer_event_post() subroutine is invoked. This means that a transfer machine receive queue length can drop below its configured minimum, and there has to be a mechanism available to remedy this when buffers become available once again. Fortunately, the pool provides a callback on a “not-empty” condition. The application is responsible for arranging that the m0_net_domain_recv_pool_not_empty() subroutine is invoked from the pool’s “not-empty” callback. When invoked in response to the “not-empty” condition, this callback will trigger an attempt to provision the transfer machines of the network domain associated with this pool, until their receive queues have reached their minimum length. While doing so, care should be taken that minimal work is actually done on the pool callback - the pool get operation in particular should not be done. Additionally, care should be taken to avoid obtaining the transfer machine’s lock in this arbitrary thread context, as doing so would reduce the efficacy of the transfer machine’s processor affinity. See Concurrency control for more detail on the serialization model used during automatic provisioning and the use of the ntm_recv_queue_deficit atomic variable.
+
+The use of a receive pool is optional, but if attached to a transfer machine, the association lasts the life span of the transfer machine. When a transfer machine is stopped or fails, receive buffers from (any) buffer pools will be put back into their pool. This will be done by the m0_net_tm_event_post() subroutine before delivering the state change event to the application or signalling on the transfer machine’s channel.
+
+There is no reason why automatic and manual provisioning cannot co-exist. It is not desirable to mix the two, mainly because the application would have to handle two different buffer release schemes; the transport level semantics of the transfer machine are not affected by the use of automatic provisioning.
+
+#### Future LNet buffer registration support
+
+The implementation can support hardware optimizations available at buffer registration time, when made available in future revisions of the LNet API. In particular, Infiniband hardware internally registers a vector (translating a virtual memory address to a "bus address") and produces a cookie identifying the vector. It is this vector registration capability that was the original reason to introduce m0_net_buf_register(), as separate from m0_net_buf_add(), in the Network API.
+
+#### Processor affinity for transfer machines
+
+The API allows an application to associate the internal threads used by a transfer machine with a set of processors. This must be done using the m0_net_tm_confine() subroutine before the transfer machine is started. Support for this interface is transport specific, and availability may also vary between user space and kernel space. The API should return an error if it is not supported.
+
+The design assumes that the m0_thread_confine() subroutine from “lib/thread.h” will be used to implement this support. The implementation will need to define an additional transport operation to convey this request to the transport.
+
+The API provides the m0_net_tm_colour_set() subroutine for the application to associate a “colour” with a transfer machine. This colour is used when automatically provisioning network buffers to the receive queue from a buffer pool. The application can also use this association explicitly when provisioning network buffers for the transfer machine in other buffer pool use cases.
The colour value can be fetched with the m0_net_tm_colour_get() subroutine. + +#### Synchronous network buffer event delivery + +The design provides support for an advanced application (like the Request handler) to control when buffer events are delivered. This gives the application greater control over thread scheduling and enables it to co-ordinate network usage with that of other objects, allowing for better locality of reference. This is illustrated in the Request handler control of network buffer event delivery use case. The feature will be implemented with the [r.m0.net.synchronous-buffer-event-delivery] refinement. + +If this feature is used, then the implementation should not deliver buffer events until requested, and should do so only on the thread invoking the m0_net_buffer_event_deliver_all() subroutine - i.e. network buffer event delivery is done synchronously under application control. This subroutine effectively invokes the m0_net_buffer_event_post() subroutine for each pending buffer event. It is not an error if no events are present when this subroutine is called; this addresses a known race condition described in Concurrency control. + +The m0_net_buffer_event_pending() subroutine should not perform any context switching operation if possible. It may be impossible to avoid the use of a serialization primitive while doing so, but proper usage by the application will considerably reduce the possibility of a context switch when the transfer machine is operated in this fashion. + +The notification of the presence of a buffer event must be delivered asynchronously to the invocation of the non-blocking m0_net_buffer_event_notify() subroutine. The implementation must use a background thread for the task; presumably the application will confine this thread to the desired set of processors with the m0_net_tm_confine() subroutine. The context switching impact is low, because the application would not have invoked the m0_net_buffer_event_notify() subroutine unless it had no work to do. The subroutine should arrange for the background thread to block until the arrival of the next buffer event (if need be) and then signal on the specified channel. No further attempt should be made to signal on the channel until the next call to the m0_net_buffer_event_notify() subroutine - the implementation can determine the disposition of the thread after the channel is signalled. + +#### Efficient communication between user and kernel spaces + +The implementation shall use the following strategies to reduce the communication overhead between user and kernel space: + +* Use shared memory as much as possible instead of copying data. +* The LNet event processing must be done in the kernel. +* Calls from user space to the kernel should combine as many operations as possible. +* Use atomic variables for serialization if possible. Dependency [r.m0.lib.atomic.interoperable-kernel-user-support]. +* Resource consumption to support these communication mechanisms should be bounded and configurable through the user space process. +* Minimize context switches. This is captured in refinement [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm]. + +As an example, consider using a producer-consumer pattern with circular queues to both initiate network buffer operations and deliver events. These circular queues are allocated in shared memory and queue position indices (not pointers) are managed via atomic operations. Minimal data is actually copied between user and kernel space - only notification of production. 
Multiple operations can be processed per transition across the user-kernel boundary.
+
+* The user space transport uses a classical producer-consumer pattern to queue pending operations with the operation dispatcher in the kernel. The user space operation dispatcher will add as many pending operations as possible from its pending buffer operation queue to the circular queue for network buffer operations that it shares with its counterpart in the kernel, the operations processor. As part of this step, the network buffer vector for the network buffer operation will be copied to the shared circular queue, which minimizes the payload of the notification ioctl call that follows. Once it has drained its pending operations queue or filled the circular buffer, the operation dispatcher will then notify the operations processor in the kernel, via an ioctl, that there are items to process in the shared circular queue. The operations processor will schedule these operations in the context of the ioctl call itself, recovering and mapping each network buffer vector into kernel space. The actual payload of the ioctl call itself is minimal, as all the operational data is in the shared circular queue.
+* A similar producer-consumer pattern is used in the reverse direction to send network buffer completion events from the kernel to user space. The event processor in user space has a thread blocked in an ioctl call, waiting for notification on the availability of buffer operation completion events in the shared circular event queue. When the call returns with an indication of available events, the event processor dequeues and delivers each event from the circular queue until the queue is empty. The cycle then continues with the event processor once again blocking on the same kernel ioctl call. The minor race condition implicit in the temporal separation between the test that the circular queue is empty and the ioctl call to wait is easily overcome by the ioctl call returning immediately if the circular queue is not empty. In the kernel, the event dispatcher arranges for such a blocking ioctl call to unblock after it has added events to the circular queue. It is up to the implementation to ensure that there are always sufficient slots available in the circular queue so that events do not get dropped; this is reasonably predictable, being a function of the number of pending buffer operations and the permitted reuse of receive buffers.
+
+This is illustrated in the following figure:
+
+![](https://github.com/Seagate/cortx-motr/raw/main/doc/Images/LNET1.PNG)
+
+### Conformance
+* [i.m0.net.rdma] LNET supports RDMA and the feature is exposed through the Motr network bulk interfaces.
+* [i.m0.net.ib] LNET supports Infiniband.
+* [i.m0.net.xprt.lnet.kernel] The design provides a kernel transport.
+* [i.m0.net.xprt.lnet.user] The design provides a user space transport.
+* [i.m0.net.xprt.lnet.user.multi-process] The design allows multiple concurrent user space processes to use LNet.
+* [i.m0.net.xprt.lnet.user.no-gpl] The design avoids using user space GPL interfaces.
+* [i.m0.net.xprt.lnet.user.min-syscalls] The [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm] refinement will address this.
+* [i.m0.net.xprt.lnet.min-buffer-vm-setup] During buffer registration user memory pages get pinned in the kernel.
+* [i.m0.net.xprt.lnet.processor-affinity] LNet currently provides no processor affinity support. The [r.m0.net.xprt.lnet.processor-affinity] refinement will provide higher layers the ability to associate transfer machine threads with processors.
The [r.m0.net.xprt.lnet.processor-affinity] refinement will provide higher layers the ability to associate transfer machine threads with processors. +* [i.m0.net.buffer-event-delivery-control] The [r.m0.net.synchronous-buffer-event-delivery] refinement will provide this feature. +* [i.m0.net.xprt.lnet.buffer-registration] The API supports buffer pre-registration before use. Any hardware optimizations possible at this time can be utilized when available through the LNet API. See Future LNet buffer registration support. +* [i.m0.net.xprt.auto-provisioned-receive-buffer-pool] The design provides transport independent support to automatically provision the receive queues of transfer machines on demand, from pools of unused, registered network buffers. + +### Dependencies +* [r.lnet.preconfigured] The design assumes that LNet modules and associated LNDs are pre-configured on a host. +* [r.m0.lib.atomic.interoperable-kernel-user-support] The design assumes that the Motr library’s support for atomic operations is interoperable across the kernel and user space boundaries when using shared memory. +* [r.m0.net.xprt.lnet.address-assignment] The design assumes that the assignment of LNet transport addresses to Motr components is made elsewhere. Note the constraint that all addresses must use a PID value of 12345, and a Portal Number that does not clash with existing usage (Lustre and Cray). It is recommended that all Motr servers be assigned low transfer machine identifier values (values close to 0). In addition, it is recommended that some set of such addresses be reserved for Motr tools that are relatively short lived - they will dynamically get transfer machine identifiers at run time. These two recommendations reduce the chance of a collision between Motr server transfer machine identifiers and dynamic transfer machine identifiers. Another aspect to consider is the possible alignment of FOP state machine localities [6] with transfer machine identifiers. +* [r.m0.net.network-buffer-pool] Support for a pool of network buffers involving no higher level interfaces than the network module itself. There can be multiple pools in a network domain, but a pool cannot span multiple network domains. Non-blocking interfaces are available to get and put network buffers, and a callback to signal the availability of buffers is provided. This design benefits considerably from a “colored” variant of the get operation, one that preferentially returns the most recently used buffer last associated with a specific transfer machine, or, failing that, a buffer with no previous transfer machine association, or, failing that, the least recently used buffer in the pool, if any. + +Supporting this variant efficiently may require a more sophisticated internal organization of the buffer pool than is possible with a simple linked list; however, a simple ordered linked list could suffice if coupled with a slightly more sophisticated selection mechanism than “head-of-the-list”. Note that buffers have no transfer machine affinity until first used, and that the nb_tm field of the buffer can be used to determine the last transfer machine association when the buffer is put back into the pool. Here are some possible approaches: + +* Add buffers with no affinity to the tail of the list, and push returned buffers to the head of the list. This approach allows for a simple O(n) worst case selection algorithm with possibly lower average overhead (n is the average number of buffers in the free list).
A linear search from the head of the list will break off when a buffer of the correct affinity is found, or a buffer with no affinity is found, or else the buffer at the tail of the list is selected, meeting the requirements mentioned above. In steady state, assuming an even load over the transfer machines, a default minimum queue length of 2, and a receive buffer processing rate that keeps up with the receive buffer consumption rate, there would only be one network buffer per transfer machine in the free list, and hence the number of list elements to traverse would be proportional to the number of transfer machines in use. In reality, there may be more than one buffer affiliated with a given transfer machine to account for the occasional traffic burst. A periodic sweep of the list to clear the buffer affiliation after some minimum time in the free list (reflecting the fact that the value of such affinity diminishes with time spent in the buffer pool) would remove such extra buffers over time, and serve to maintain the average level of efficiency of the selection algorithm. The nb_add_time field of the buffer could be used for this purpose, and the sweep itself could be piggybacked into any get or put call, based upon some time interval. Because of the sorting order, the sweep can stop when it finds the first un-affiliated buffer or the first buffer within the minimum time bound. +* A further refinement of the above would be to maintain two linked lists, one for un-affiliated buffers and one for affiliated buffers. If the search of the affiliated list is not successful, then the head of the unaffiliated list is chosen. A key aspect of this variant is that returned buffers are added to the tail of the affiliated list. This will increase the likelihood that a get operation would find an affiliated buffer toward the head of the affiliated list, because automatic re-provisioning by a transfer machine takes place before the network buffer completion callback is made, and hence before the application gets to process and return the network buffer to the pool. The sweep starts from the head of the affiliated list, moving buffers to the unaffiliated list, until it finds a buffer that is within the minimum time bound. A sketch of this two-list variant appears below, after the Security Model section. +* Better than O(n) search (closer to O(1)) can be accomplished with more complex data structures and algorithms. Essentially it will require maintaining a per transfer machine list somewhere. The pool can only learn of the existence of a new transfer machine when the put operation is invoked and will have to be told when the transfer machine is stopped. If the per transfer machine list is anchored in the pool, then the set of such anchors must be dynamically extensible. The alternative of anchoring the list in the transfer machine itself has pros and cons; it would work very well for the receive buffer queue, but does not extend to support other buffer pools for arbitrary purposes. In other words, it is possible to create an optimal 2-level pool (a per transfer machine pool in the transfer machine data structure itself, backed by a shared buffer pool) dedicated to receive network buffer processing, but not a generalized solution. Such a pool would exhibit excellent locality of reference but would be more complex because high water thresholds would have to be maintained to return buffers back to the global pool. + +### Security Model +No security model is defined; the new transport inherits whatever security model LNet provides today.
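+
+The following minimal sketch shows the shape of the two-list “colored” get operation described in the approaches above. It is illustrative only: all names are hypothetical, and locking, the nb_add_time-based sweep, and provisioning callbacks are omitted.
+
+```c
+#include <stddef.h>
+
+struct pool_buf {
+        struct pool_buf *pb_next;
+        const void      *pb_tm;  /* last transfer machine association */
+};
+
+struct buf_pool {
+        struct pool_buf *bp_affiliated;   /* have a last-used TM; returns appended at tail */
+        struct pool_buf *bp_unaffiliated; /* never used, or affinity swept */
+};
+
+/* Pop the head of a singly linked list. */
+static struct pool_buf *list_pop(struct pool_buf **head)
+{
+        struct pool_buf *b = *head;
+
+        if (b != NULL)
+                *head = b->pb_next;
+        return b;
+}
+
+/* Colored get: prefer a buffer last used by "tm", then an unaffiliated
+ * buffer, then the head of the affiliated list (its least recently
+ * returned buffer, since returned buffers are appended at the tail). */
+static struct pool_buf *pool_get(struct buf_pool *p, const void *tm)
+{
+        struct pool_buf **link;
+
+        for (link = &p->bp_affiliated; *link != NULL; link = &(*link)->pb_next) {
+                if ((*link)->pb_tm == tm) {
+                        struct pool_buf *b = *link;
+
+                        *link = b->pb_next;    /* unlink and return */
+                        return b;
+                }
+        }
+        if (p->bp_unaffiliated != NULL)
+                return list_pop(&p->bp_unaffiliated);
+        return list_pop(&p->bp_affiliated);
+}
+```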
+ +### Refinement +* [r.m0.net.xprt.lnet.transport-variable] + * The implementation shall name the transport variable as specified in this document. +* [r.m0.net.xprt.lnet.end-point-address] + * The implementation should support the mapping of end point address to LNet address as described in Mapping of Endpoint Address to LNet Address, including the reservation of a portion of the match bit space in which to encode the transfer machine identifier. +* [r.m0.net.xprt.support-for-auto-provisioned-receive-queue] The implementation should follow the strategy outlined in Automatic provisioning of receive buffers. It should also follow the serialization model outlined in Concurrency control. +* [r.m0.net.xprt.lnet.multiple-messages-in-buffer] + * Add a nb_min_receive_size field to struct m0_net_buffer. + * Document the behavioral change of the receive message callback. + * Provide a mechanism for the transport to indicate that the M0_NET_BUF_QUEUED flag should not be cleared by the m0_net_buffer_event_post() subroutine. + * Modify all existing usage to set the nb_min_receive_size field to the buffer length. +* [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm] + * The implementation should follow the strategies recommended in Efficient communication between user and kernel spaces, including the creation of a private device driver to facilitate such communication. +* [r.m0.net.xprt.lnet.cleanup-on-process-termination] + * The implementation should release all kernel resources held by a process using the LNet transport when that process terminates. +* [r.m0.net.xprt.lnet.dynamic-address-assignment] + * The implementation may support dynamic assignment of transfer machine identifiers using the strategy outlined in Mapping of Endpoint Address to LNet Address. We recommend that the implementation dynamically assign transfer machine identifiers from higher numbers downward to reduce the chance of conflicting with well-known transfer machine identifiers. +* [r.m0.net.xprt.lnet.processor-affinity] + * The implementation must provide support for this feature, as outlined in Processor affinity for transfer machines. The implementation will need to define an additional transport operation to convey this request to the transport. Availability may vary between kernel and user space. +* [r.m0.net.synchronous-buffer-event-delivery] + * The implementation must provide support for this feature as outlined in Controlling network buffer event delivery and Synchronous network buffer event delivery. + +### State +A network buffer used to receive messages may be used to deliver multiple messages if its nb_min_receive_size field is non-zero. Such a network buffer may still be queued when the buffer event signifying a received message is delivered. + +When a transfer machine stops or fails, all network buffers associated with buffer pools should be put back into their pool. The atomic variable ntm_recv_pool_deficit, which counts the number of network buffers needed, should be set to zero. This should be done before notification of the state change is made. + +Transfer machines now either support automatic asynchronous buffer event delivery on a transport thread (the default), or can be configured to synchronously deliver buffer events on an application thread. The two modes of operation are mutually exclusive and must be established before starting the transfer machine. + +#### State Invariants +User space buffers pin memory pages in the kernel when registered, as sketched below.
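+
+A minimal sketch of this pinning step, assuming the Linux get_user_pages_fast() interface (whose exact signature has varied across kernel versions); the helper name is hypothetical:
+
+```c
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+/* Pin the pages backing a registered user space buffer. Illustrative
+ * only; the resulting page array would be retained for the lifetime
+ * of the buffer registration and released on de-registration. */
+static int lnet_xo_buf_pin(unsigned long uaddr, size_t len,
+                           struct page ***pages_out, int *nr_out)
+{
+        int nr = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK), PAGE_SIZE);
+        struct page **pages;
+        int rc;
+
+        pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
+        if (pages == NULL)
+                return -ENOMEM;
+        rc = get_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
+        if (rc < 0) {
+                kfree(pages);
+                return rc;
+        }
+        *pages_out = pages;
+        *nr_out    = rc;    /* number of pages actually pinned */
+        return 0;
+}
+```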
Hence, registered user space buffers must be associated with a set of kernel struct page pointers to the referenced memory. + +The invariants of the transfer machine and network buffer objects should capture the fact that if a pool is associated with these objects, then the pool is in the same network domain. The transfer machine invariant, in particular, should ensure that the value of the atomic variable ntm_recv_pool_deficit is zero when the transfer machine is in an inoperable state. + +See the refinement [r.m0.net.xprt.support-for-auto-provisioned-receive-queue]. + +#### Concurrency Control +The LNet transport module is sandwiched between the asynchronous Motr network API above and the asynchronous LNet API below. It must plan on operating within the serialization models of both these components. In addition, significant use is made of the kernel’s memory management interfaces, which have their own serialization model. The use of a device driver to facilitate user space to kernel communication must also be addressed. + +The implementation mechanism chosen will further govern the serialization model in the kernel. The choice of the number of EQs will control how much inherent independent concurrency is possible. For example, sharing of EQs across transfer machines or for different network buffer queues could require greater concurrency control than the use of dedicated EQs per network buffer queue per transfer machine. + +Serialization of the kernel transport is anticipated to be relatively straightforward, with safeguards required for network buffer queues. + +Serialization between user and kernel space should take the form of shared memory circular queues co-ordinated with atomic indices. A producer-consumer model should be used, with opposite roles assigned to the kernel and user space process; appropriate notification of change should be made through the device driver. Separate circular queues should be used for buffer operations (user to kernel) and event delivery (kernel to user). [r.m0.net.xprt.lnet.efficient-user-to-kernel-comm] + +Automatic provisioning can only be enabled before a transfer machine is started. Once enabled, it cannot be disabled. Thus, provisioning operations are implicitly protected by the state of the transfer machine - the “not-empty” callback subroutine will never fail to find its transfer machine, though it should take care to examine the state before performing any provisioning. The life span of a network buffer pool must exceed that of the transfer machines that use the pool. The life span of a network domain must exceed that of associated network buffer pools. + +Automatic provisioning of receive network buffers from the receive buffer pool takes place either through the m0_net_buffer_event_post() subroutine, or is triggered by the receive buffer pool’s “not-empty” callback, the m0_net_domain_buffer_pool_not_empty() subroutine. Two important conditions should be met while provisioning: + +* Minimize processing on the pool callback: The buffer pool maintains its own independent lock domain; it invokes the m0_net_domain_buffer_pool_not_empty() subroutine (provided for use as the not-empty callback) while holding its lock. The callback is invoked on the stack of the caller who used the put operation on the pool. It is essential, therefore, that the not-empty callback perform minimal work - it should only trigger an attempt to reprovision transfer machines, not do the provisioning.
+* Minimize interference with the processor affinity of the transfer machine: Ideally, the transfer machine is only referenced on a single processor, resulting in a strong likelihood that its data structures are in the cache of that processor. Provisioning transfer machines requires iteration over a list, and if the transfer machine lock has to be obtained for each, it could adversely impact such caching. We provided the atomic variable, ntm_recv_pool_deficit, with a count of the number of network buffers to provision so that this lock is obtained only when the transfer machine really needs to be provisioned, and not for every invocation of the buffer pool callback. The transfer machine invariant will enforce that the value of this atomic is 0 when the transfer machine is not in an operable state. + +Actual provisioning should be done on a domain private thread awoken for this purpose. A transfer machine needs provisioning if it is in the started state, it is associated with the pool, and its receive queue length is less than the configured minimum (determined via an atomic variable as outlined above). To provision, the thread will obtain network buffers from the pool with the get() operation, and add them to the receive queue of the transfer machine with the internal equivalent of the m0_net_buffer_add() call, which assumes that the transfer machine is locked. + +The design requires that receive buffers obtained from buffer pools be put back to their pools when a transfer machine is stopped or fails, prior to notifying the higher level application of the change in state. This action will be done in the m0_net_tm_event_post() subroutine, before invoking the state change callback. The subroutine obtains the transfer machine mutex, and hence has the same degree of serialization as that used in automatic provisioning. + +The synchronous delivery of network buffer events utilizes the transfer machine lock internally, when needed. The lock must not be held in the m0_net_buffer_event_deliver_all() subroutine across calls to the m0_net_buffer_event_post() subroutine. + +In the use case described in Request handler control of network buffer event delivery, there is a possibility that the application could wake up for reasons other than the arrival of a network buffer event, and once more test for the presence of network buffer events even while the background thread is making a similar test. It is possible that the application could consume all events and once more make a request for future notification while the semaphore count in its wait channel is non-zero. In this case it would return immediately, find no additional network events and repeat the request; the m0_net_buffer_event_deliver_all() subroutine will not return an error if no events are present. + +### Scenarios +A Motr component, whether it is a kernel file system client, server, or tool, uses the following pattern for multiple-message reception into a single network buffer. + +1. The component creates and starts one or more transfer machines, identifying the actual end points of the transfer machines. +2. The component provisions network buffers to be used for receipt of unsolicited messages. The method differs based on whether a buffer pool is used or not. + i. When a buffer pool is used, these steps are performed. + a. The network buffers are provisioned, with nb_min_receive_size set to allow multiple delivery of messages. The network buffers are added to a buffer pool. + b.
The buffer pool is registered with a network domain and associated with one or more transfer machines. Internally, the transfer machines will get buffers from the pool and add them to their M0_NET_QT_MSG_RECV queues. + ii. When a buffer pool is not used, these steps are performed. + a. Network buffers are provisioned with nb_min_receive_size set to allow multiple delivery of messages. + b. The network buffers are registered with the network domain and added to a transfer machine M0_NET_QT_MSG_RECV queue. +3. When a message is received, two sub-cases are possible as part of processing the message. It is the responsibility of the component itself to coordinate between these two sub-cases. + i. When a message is received and the M0_NET_BUF_QUEUED flag is set in the network buffer, then the component does not re-enqueue the network buffer as there is still space remaining in the buffer for additional messages. + ii. When a message is received and the M0_NET_BUF_QUEUED flag is not set in the network buffer, then the component takes one of two paths, depending on whether a buffer pool is in use or not. + a. When a buffer pool is in use, the component puts the buffer back in the buffer pool so it can be re-used. + b. When a buffer pool is not in use, the component may re-enqueue the network buffer after processing is complete, as there is no space remaining in the buffer for additional messages. + +#### Sending non-bulk messages from Motr components + +A Motr component, whether a user-space server, user-space tool or kernel file system client, uses the following pattern to use the LNet transport to send messages to another component. Memory for send queues can be allocated once, or the send buffer can be built up dynamically from serialized data and references to existing memory. + +1. The component optionally allocates memory to one or more m0_net_buffer objects and registers those objects with the network layer. These network buffers are a pool of message send buffers. +2. To send a message, the component uses one of two strategies. + i. The component selects one of the buffers previously allocated and serializes the message data into that buffer. + ii. The component builds up a fresh m0_net_buffer object out of memory pages newly allocated and references to other memory (to avoid copies), and registers the resulting object with the network layer. +3. The component enqueues the message for transmission. +4. When a buffer operation completes, the component uses one of two strategies, corresponding to the strategy chosen earlier. + i. If the component used previously allocated buffers, it returns the buffer to the pool of send buffers. + ii. If the component built up the buffer from partly serialized and partly referenced data, it de-registers the buffer and de-provisions the memory. + +#### Kernel space bulk buffer access from file system clients + +A Motr file system client uses the following pattern to use the LNet transport to initiate passive bulk transfers with Motr servers. Memory for bulk queues will come from user space memory. The user space memory is not controlled by Motr; it is supplied as a result of system calls, e.g. read() and write(). + +1. The client populates a network buffer from mapped user pages, registers this buffer with the network layer and enqueues the buffer for transmission. +2. When a buffer operation completes, the client will de-register the network buffer and de-provision the memory assigned.
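+
+A minimal sketch of this client-side pattern, assuming net API calls along the lines of m0_net_buffer_register(), m0_net_buffer_add() and m0_net_buffer_deregister(); the signatures and error handling shown are simplified and illustrative, not authoritative:
+
+```c
+/* Step 1 of the pattern above: the buffer vector (nb->nb_buffer) is
+ * assumed to have been built from mapped user pages already; register
+ * the buffer and enqueue it for a passive bulk send. */
+static int client_passive_bulk_send(struct m0_net_domain      *dom,
+                                    struct m0_net_transfer_mc *tm,
+                                    struct m0_net_buffer      *nb)
+{
+        int rc;
+
+        rc = m0_net_buffer_register(nb, dom);
+        if (rc != 0)
+                return rc;
+        nb->nb_qtype = M0_NET_QT_PASSIVE_BULK_SEND;
+        rc = m0_net_buffer_add(nb, tm);
+        if (rc != 0)
+                m0_net_buffer_deregister(nb, dom);
+        /* Step 2 happens in the buffer completion callback: the buffer
+         * is de-registered there and the pinned memory de-provisioned. */
+        return rc;
+}
+```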
+ +#### User space bulk buffer access from Motr servers + +A Motr server uses the following pattern to use the LNet transport to initiate active bulk transfers to other Motr components. + +1. The server establishes a network buffer pool. The server allocates a set of network buffers provisioned with memory and registers them with the network domain. +2. To perform a bulk operation, the server gets a network buffer from the network buffer pool, populates the memory with data to send in the case of active send, and enqueues the network buffer for transmission. +3. When a network buffer operation completes, the network buffer can be returned to the pool of network buffers. + +#### User space bulk buffer access from Motr tools + +A Motr tool uses the following pattern to use the LNet transport to initiate passive bulk transfers to Motr server components: + +1. The tool should use an end point address that is not assigned to any Motr server or file system client. It should use a dynamic address to achieve this. +2. To perform a bulk operation, the tool provisions a network buffer. The tool then registers this buffer and enqueues the buffer for transmission. +3. When a buffer operation completes, the buffer can be de-registered and the memory can be de-provisioned. + +#### Obtaining dynamic addresses for Motr tools + +A Motr tool is a relatively short-lived process, typically a command line invocation of a program to communicate with a Motr server. One cannot assign fixed addresses to such tools, as the failure of an interactive program merely because another instance of the same program is executing is generally considered unacceptable behavior, and one that precludes the creation of scriptable tools. + +Instead, all tools could be assigned a shared combination of NID, PID and Portal Number, and at run time, the tool process can dynamically assign unique addresses to itself by creating a transfer machine with a wildcard transfer machine identifier. This is captured in refinement [r.m0.net.xprt.lnet.dynamic-address-assignment] and Mapping of Endpoint Address to LNet Address. Dependency: [r.m0.net.xprt.lnet.address-assignment] + +#### Request handler control of network buffer event delivery + +The user space Motr request handler operates within a locality domain that includes, among other things, a processor, a transfer machine, a set of FOMs in execution, and handlers to create new FOMs for FOPs. The request handler attempts to perform all I/O operations asynchronously, using a single handler thread, to minimize the thread context switching overhead. + +### Failures +One failure situation that must be explicitly addressed is the termination of the user space process that uses the LNet transport. All resources consumed by this process must be released in the kernel. In particular, where shared memory is used, the implementation design should take into account the accessibility of this shared memory at this time. Refinement: [r.m0.net.xprt.lnet.cleanup-on-process-termination] + +### Analysis +The number of LNet based transfer machines that can be created on a host is constrained by the number of LNet portals not assigned to Lustre or other consumers such as Cray. In Lustre 2.0, the number of unassigned portal numbers is 30. + +In terms of performance, the design is no more scalable than LNet itself. The design does not impose a very high overhead in communicating between user space and the kernel and uses considerably more efficient event processing than ULA.
+ +### Other +We had some concerns and questions regarding the serialization model used by LNet, and whether using multiple portals is more efficient than sharing a single portal. The feedback we received indicates that LNet uses relatively coarse locking internally, with no appreciable difference in performance for these cases. There may be improvements in the future, but that is not guaranteed; the suggestion was to use multiple portals if possible, but that also raises concerns about the restricted available portal space left in LNet (around 30 unused portals) and the fact that all LNet users share the same portal space [4]. + +### Rationale +One important design choice was to use a custom driver rather than the ULA or a re-implementation of it. The primary reason for not using the ULA directly is that it is covered by the GPL, which would limit the licensing choices for Motr overall. It would have been possible to implement our own ULA-like driver and library. After that, a user-level LNet transport would still be required on top of this ULA-like driver. However, Motr does not require the full set of possible functions and use cases supported by LNet. Implementing a custom driver, tailored to the Motr net bulk transport, means that only the functionality required by Motr must be supported. The driver can also be optimized specifically for the Motr use cases, without concern for other users. For these reasons, a re-implementation of the ULA was not pursued. + +Certain LNet implementation idiosyncrasies also impact the design. We call out the following, in particular: + +* The portal number space is huge, but the implementation supports just the values 0-63 [4]. +* Only messages addressed to PID 12345 get delivered. This is despite the fact that LNet allows us to specify any address in the LNetGet, LNetPut and LNetMEAttach subroutines. +* ME matches are constrained to either all network interfaces or to those matching a single NID, i.e. a set of NIDs cannot be specified. +* No processor affinity support. + +Combined, this translates to LNet only supporting a single PID (12345) with up to 64 portals, out of which about half (34, actually) seem to be in use by Lustre and other clients. Looking at this another way: discounting the NID component of an external LNet address, out of the remaining 64 bits (32 bit PID and 32 bit Portal Number), only about 5 bits are available for Motr use! This forced the design to extend its external end point address to cover a portion of the match bit space, represented by the Transfer Machine Identifier. + +Additional information on current LNet behavior can be found in [4]. + +### Deployment +Motr’s use of LNet must co-exist with simultaneous use of LNet by Lustre on the same host. + +#### Network +LNet must be set up using existing tools and interfaces provided by Lustre. Dependency: [r.lnet.preconfigured]. + +LNet transfer machine end point addresses are statically assigned to Motr runtime components through the central configuration database. The specification requires that the implementation use a disjoint set of portals from Lustre, primarily because of limitations in the LNet implementation. See Rationale for details. + +#### Core +This specification will benefit if Lustre is distributed with a larger value of MAX_PORTALS than the current value of 64 in Lustre 2.0. + +#### Installation +LNet is capable of running without Lustre, but currently is distributed only through Lustre packages.
It is not in the scope of this document to require changes to this situation, but it would be beneficial to pure Motr (non-Lustre) servers to have LNet distributed in packages independent of Lustre. + +### References +* [1] T1 Task Definitions +* [2] Mero Summary Requirements Table +* [3] m0 Glossary +* [4] m0LNet Preliminary Design Questions +* [5] RPC Bulk Transfer Task Plan +* [6] HLD of the FOP state machine