This repository has been archived by the owner on May 3, 2024. It is now read-only.

Uploading markdown files to doc folder #2028

Merged 4 commits on Aug 3, 2022
301 changes: 301 additions & 0 deletions doc/HLD-Data-Block-Allocator.md

Large diffs are not rendered by default.

444 changes: 444 additions & 0 deletions doc/HLD-OF-Motr-LNet-Transport.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions doc/HLD-Version-Numbers.md
@@ -13,7 +13,7 @@ A version number is stored together with the file system state whose version it

## Definitions
See the Glossary for general M0 definitions and HLD of FOL for the definitions of file system operation, update, and lsn. The following additional definitions are required:
- For the present design, it is assumed that a file system update acts on units (r.dtx.units). For example, a typical meta-data update acts on one or more "inodes" and a typical data update acts on inodes and data blocks. Inodes, data blocks, directory entries, etc. are all examples of units. It is further assumed that units involved in an update are unambiguously identified (r.dtx.units.identify) and that a complete file system state is a disjoint union of states comprising units. (there are consistent relationships between units, e.g., the inode nlink counter must be consistent with the contents of directories in the name-space).
- For the present design, it is assumed that a file system update acts on units (r.dtx.units). For example, a typical meta-data update acts on one or more "inodes" and a typical data update acts on inodes and data blocks. Inodes, data blocks, directory entries, etc. are all examples of units. It is further assumed that units involved in an update are unambiguously identified (r.dtx.units.identify) and that a complete file system state is a disjoint union of states comprising units. (Of course, there are consistent relationships between units, e.g., the inode nlink counter must be consistent with the contents of directories in the name-space).
- It is guaranteed that operations (updates and queries) against a given unit are serializable in the face of concurrent requests issued by the file system users. This means that the observable (through query requests) unit state looks as if updates of the unit were executed serially in some order. Note that the ordering of updates is further constrained by the distributed transaction management considerations which are outside the scope of this document.
- A unit version number is an additional piece of information attached to the unit. A version number is drawn from some linearly ordered domain. A version number changes on every update of the unit state in such a way that the ordering of unit states in the serial history can be deduced by comparing version numbers associated with the corresponding states.

@@ -27,7 +27,7 @@ See the Glossary for general M0 definitions and HLD of FOL for the definitions o

## Design Highlights

In the presence of caching, requirements [r.verno.resource] and [r.verno.fol] are seemingly contradictory: if two caching client nodes assigned (as allowed by [r.verno.resource]) version numbers to two independent units, then after re-integration of units to their common primary server, the version numbers must refer to primary fol, but clients cannot produce such references without extremely inefficient serialization of all accesses to the units on the server.
In the presence of caching, requirements [r.verno.resource] and [r.verno.fol] are seemingly contradictory: if two caching client nodes assigned (as allowed by [r.verno.resource]) version numbers to two independent units, then after re-integration of units to their common master server, the version numbers must refer to the master's fol, but clients cannot produce such references without extremely inefficient serialization of all accesses to the units on the server.

To deal with that, a version number is made compound: it consists of two components:

464 changes: 464 additions & 0 deletions doc/HLD-of-FDMI.md

Large diffs are not rendered by default.

170 changes: 170 additions & 0 deletions doc/HLD-of-FOL.md

Large diffs are not rendered by default.

66 changes: 66 additions & 0 deletions doc/HLD-of-Metadata-Backend.md
@@ -0,0 +1,66 @@
# HLD of Metadata Backend
This document presents a high level design **(HLD)** of the meta-data back-end for Motr.
The main purposes of this document are:
1. To be inspected by Motr architects and peer designers to ascertain that high level design is aligned with Motr architecture and other designs, and contains no defects.
2. To be a source of material for Active Reviews of Intermediate Design **(ARID)** and detailed level design **(DLD)** of the same component.
3. To serve as a design reference document. The intended audience of this document consists of Motr customers, architects, designers, and developers.


## Introduction
The meta-data back-end (BE) is a module presenting an interface for transactional local meta-data storage. BE users manipulate and access meta-data structures in memory. BE maps this memory to persistent storage. The user groups meta-data updates into transactions. BE guarantees that transactions are atomic in the face of process failures.

BE provides support for a few frequently used data structures: doubly linked list, B-tree, and extent map.


## Dependencies
- a storage object *(stob)* is a container for unstructured data, accessible through the `m0_stob` interface. BE uses stobs to store meta-data on a persistent store. BE accesses persistent store only through the `m0_stob` interface and assumes that every completed stob write survives any node failure. It is up to a stob implementation to guarantee this.
- a segment is a stob mapped to an extent in the process address space. Each address in the extent uniquely corresponds to an offset in the stob and vice versa. The stob is divided into blocks of fixed size; the memory extent is divided into pages of fixed size. The page size is a multiple of the block size (it follows that the stob size is a multiple of the page size). At any given moment, some pages are up-to-date (their contents are the same as those of the corresponding stob blocks) and some are dirty (their contents were modified relative to the stob blocks). In the initial implementation, all pages are up-to-date when the segment is opened; in later versions, pages will be loaded dynamically on demand. The memory extent to which a segment is mapped is called the segment memory.
- a region is an extent within segment memory. A (meta-data) update is a modification of some region.
- a transaction is a collection of updates. The user adds an update to a transaction by capturing the update's region. The user explicitly closes a transaction. BE guarantees that a closed transaction is atomic with respect to process crashes that happen after the close call returns: after such a crash, either all or none of the transaction's updates will be present in the segment memory when the segment is next opened. If a process crashes before a transaction is closed, BE guarantees that none of the transaction's updates will be present in the segment memory.
- a credit is a measure of a group of updates. A credit is a pair (nr, size), where nr is the number of updates and size is the total size in bytes of modified regions.
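The credit arithmetic defined above (a pair that accumulates component-wise as updates are grouped) can be sketched as follows. The class and method names are illustrative only, not part of the BE interface:

```python
class Credit:
    """A credit is a pair (nr, size): the number of updates and the
    total size in bytes of the modified regions."""
    def __init__(self, nr=0, size=0):
        self.nr = nr
        self.size = size

    def add(self, other):
        # Credits accumulate component-wise when updates are grouped
        # into a transaction.
        self.nr += other.nr
        self.size += other.size

# Credit needed by a transaction that will modify two regions,
# of 64 and 128 bytes respectively:
tx_credit = Credit()
tx_credit.add(Credit(nr=1, size=64))
tx_credit.add(Credit(nr=1, size=128))
# tx_credit now holds nr=2, size=192
```

In a real system the user would reserve this credit before opening the transaction, so that BE can guarantee log space for all captured updates.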

## Requirements

* `R.M0.MDSTORE.NUMA`: the allocator respects NUMA topology.
* `R.M0.REQH.10M`: a performance goal of 10M transactions per second on a 16-core system with battery-backed memory.
* `R.M0.MDSTORE.LOOKUP`: lookup of a value by key is supported.
* `R.M0.MDSTORE.ITERATE`: iteration through records is supported.
* `R.M0.MDSTORE.CAN-GROW`: the linear size of the address space can grow dynamically.
* `R.M0.MDSTORE.SPARSE-PROVISIONING`: sparse provisioning, including pre-allocation, is supported.
* `R.M0.MDSTORE.COMPACT`, `R.M0.MDSTORE.DEFRAGMENT`: used container space can be compacted and de-fragmented.
* `R.M0.MDSTORE.FSCK`: a scavenger is supported.
* `R.M0.MDSTORE.PERSISTENT-MEMORY`: the log and dirty pages are (optionally) kept in persistent memory.
* `R.M0.MDSTORE.SEGMENT-SERVER-REMOTE`: backing containers can be either local or remote.
* `R.M0.MDSTORE.ADDRESS-MAPPING-OFFSETS`: the offset structure is friendly to container migration and merging.
* `R.M0.MDSTORE.SNAPSHOTS`: snapshots are supported.
* `R.M0.MDSTORE.SLABS-ON-VOLUMES`: a slab-based space allocator is used.
* `R.M0.MDSTORE.SEGMENT-LAYOUT`: any object layout for a meta-data segment is supported.
* `R.M0.MDSTORE.DATA.MDKEY`: data objects carry a meta-data key for sorting (like the reiser4 key assignment).
* `R.M0.MDSTORE.RECOVERY-SIMPLER`: recovery can be performed twice; either object-level mirroring or logical transaction mirroring can be used.
* `R.M0.MDSTORE.CRYPTOGRAPHY`: meta-data records are optionally encrypted.
* `R.M0.MDSTORE.PROXY`: a proxy meta-data server is supported; a client and a server are almost identical.

## Design Highlights
The BE transaction engine uses write-ahead, redo-only logging. Concurrency control is delegated to BE users.
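Write-ahead, redo-only logging means the new contents of a transaction's captured regions are appended to the log before they reach the segment, and recovery only ever re-applies (redoes) committed transactions; nothing is undone. A toy model of this discipline, with all names illustrative and no relation to the actual BE log format:

```python
# Toy model of write-ahead redo-only logging: a transaction commits by
# appending its updates to the log; recovery replays the log in order.
segment = bytearray(16)   # simplified persistent segment
log = []                  # redo log: one entry per committed transaction

def commit(tx_updates):
    """Write-ahead step: append all of a transaction's updates
    (offset, new_bytes) to the log atomically."""
    log.append(list(tx_updates))

def replay(segment, log):
    """Redo-only recovery: re-apply every committed transaction in log
    order; later transactions overwrite earlier ones."""
    for tx in log:
        for off, data in tx:
            segment[off:off + len(data)] = data

commit([(0, b"ab"), (4, b"cd")])   # first transaction
commit([(4, b"XY")])               # second transaction overwrites bytes 4..6
replay(segment, log)
```

An uncommitted transaction never enters the log, so after a crash its updates simply never reach the segment, which matches the atomicity guarantee stated for closed transactions.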

## Functional Specification
BE provides an interface to make in-memory structures transactionally persistent. A user opens a (previously created) segment. An area of virtual address space is allocated to the segment. The user then reads and writes the memory in this area, using BE-provided interfaces together with normal memory access operations. When a memory address is read for the first time, its contents are loaded from the segment (the initial BE implementation loads the entire segment stob into memory when the segment is opened). Modifications to segment memory are grouped into transactions. After a transaction is closed, BE asynchronously writes the updated memory to the segment stob.
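The capture/close lifecycle above can be modeled in a few lines. This is a conceptual sketch, not the BE API: only regions explicitly captured under a transaction are written back, which is also why uncaptured modifications (such as locks, discussed below) never reach storage:

```python
# Illustrative model of the capture/close lifecycle: the user modifies
# segment memory, captures the modified regions, and closes the
# transaction; BE then writes the captured regions to the segment stob.
class Segment:
    def __init__(self, size):
        self.memory = bytearray(size)   # mapped segment memory
        self.stob = bytearray(size)     # backing storage object

class Tx:
    def __init__(self, seg):
        self.seg = seg
        self.captured = []              # list of (offset, length)

    def capture(self, offset, length):
        # Record that this region was updated under this transaction.
        self.captured.append((offset, length))

    def close(self):
        # On close, BE writes the captured regions (and only those)
        # to the stob; in reality this write is asynchronous.
        for off, length in self.captured:
            self.seg.stob[off:off + length] = self.seg.memory[off:off + length]

seg = Segment(8)
tx = Tx(seg)
seg.memory[0:4] = b"meta"
tx.capture(0, 4)                 # captured: will be made persistent
seg.memory[4:8] = b"lock"        # modified but NOT captured: volatile-only
tx.close()
```

After `close()`, the captured region holds `meta` in the stob, while the uncaptured region stays zeroed there, mirroring the volatile-only behavior of in-segment locks.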

When a segment is closed (perhaps implicitly, as a result of a failure) and re-opened, the same virtual address space area is allocated to it. This guarantees that it is safe to store pointers to segment memory in segment memory. Because of this property, a user can place pointer-based in-memory structures in segment memory: linked lists, trees, hash tables, strings, etc. Some in-memory structures, notably locks, are meaningless on storage, but for simplicity (to avoid allocation and maintenance of a separate set of volatile-only objects) can nevertheless be placed in the segment. When such a structure is modified (e.g., a lock is taken or released), the modification is not captured in any transaction and, hence, is not written to the segment stob.

BE-exported objects (domain, segment, region, transaction, linked list, and B-tree) support the Motr non-blocking server architecture.

## Use Cases
### Scenarios

|Scenario | Description |
|---------|-------------|
|Scenario | `[usecase.component.name]` |
|Relevant quality attributes| [e.g., fault tolerance, scalability, usability, re-usability]|
|Stimulus| [an incoming event that triggers the use case]|
|Stimulus source | [system or external world entity that caused the stimulus]|
|Environment | [part of the system involved in the scenario]|
|Artifact | [change to the system produced by the stimulus]|
|Response | [how the component responds to the system change]|
|Response measure |[qualitative and (preferably) quantitative measures of response that must be maintained]|
|Questions and Answers| |