[Feature Request]: Virtual file system #1184

Closed
JinHai-CN opened this issue May 7, 2024 · 5 comments
Labels
feature request New feature or request

Comments

@JinHai-CN
Contributor
Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

No response

Describe the feature you'd like

Infinity's internal data consists of segments and blocks, where each block is made up of a number of block columns. In the current implementation, each block column is persisted as a separate file on disk, regardless of its size. This can result in a very large number of files for a single table. This feature request aims to solve that problem: Infinity would read and write through a virtual file system, while on disk several block column files are packed into a single actual file. This avoids creating a large number of files and reduces the chance of hitting a 'too many open files' error.

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

JinHai-CN added the feature request label May 7, 2024
JinHai-CN mentioned this issue May 7, 2024
@JinHai-CN
Contributor Author

The goal of the virtual file system is to provide a virtual layer through which each generated block column file, index file, delete file, etc. is stored. Through this layer, Infinity can be backed by the local file system, or by an object store such as S3.

Therefore, the virtual file system needs to provide the following interfaces:
Open/Read/Write/Seek/Truncate/Close.

In the concrete implementation, the VFS needs a metadata store that provides the mapping between physical files and virtual file blocks, as well as which virtual file blocks hold the data of each virtual file. Metadata reads and writes are mainly key-value lookups, so the metadata store can be a KV store.

Each file block should have a fixed size, for example 64KB. A physical file does not have a fixed size, but its size should stay within a bounded range, for example between 16MB and 24MB.

With files being created and deleted constantly, the physical files will accumulate fragments that need to be cleaned up. Considering that S3 will be used as the actual storage, this virtual file system layer should treat physical storage as append-only. Fragment merging and cleanup operations should be logged in the database WAL, just like the create/delete/update/write operations of the VFS.
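
To make the interface concrete, here is a minimal sketch of what such a layer could look like in C++. Class, method, and parameter names are hypothetical; only the operation set (Open/Read/Write/Seek/Truncate/Close), the KV metadata lookup, and the fixed block size come from the description above.

```cpp
// Minimal sketch only, not Infinity's actual VFS API.
#include <cstdint>
#include <memory>
#include <string>

// Handle to a virtual file; offsets are within the virtual file, while the
// data physically lives in fixed-size blocks (e.g. 64KB) of packed files.
struct VirtualFile {
    virtual ~VirtualFile() = default;
    virtual int64_t Read(void *buf, int64_t nbytes) = 0;
    virtual int64_t Write(const void *buf, int64_t nbytes) = 0;
    virtual int64_t Seek(int64_t offset) = 0;
    virtual void Truncate(int64_t new_size) = 0;
    virtual void Close() = 0;
};

struct VirtualFileSystem {
    virtual ~VirtualFileSystem() = default;
    // Looks up the KV metadata store to map the virtual path to its
    // physical file blocks, then returns a handle.
    virtual std::unique_ptr<VirtualFile> Open(const std::string &path, int flags) = 0;
};
```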

@JinHai-CN
Contributor Author

JinHai-CN commented May 16, 2024

Considering the complexity of this feature and the main goals of 0.2.0, we have decided to move this feature out of the 0.2.0 scope for now.

@yuzhichang
Member

yuzhichang commented Jul 11, 2024

Objective

The main idea of the Feature Request is

  • Package small files with a coupled lifecycle into one middle-sized (~200MB uncompressed) file.
  • Keep big files as they are.

It is not

  • A generic distributed file system. Systems such as Ceph and CubeFS merge/split files into fixed-size blocks without regard for their lifecycle, and need a complete and complicated GC mechanism.

Summary

Small files of a segment have a coupled lifecycle:

  • BlockEntry version file
  • BlockColumnEntry file, HeapChunk

A coupled file consists of one or more of the above files from a single SegmentEntry. A small file never spreads over multiple coupled files.

Other files will not be impacted:

  • All index files, such as the fulltext dictionary file
  • Catalog files
  • DeltaOp files

A TableEntry maintains the set of file ids.
BlockEntry, BlockColumnEntry, and ChunkIndexEntry maintain the related file id, offset, size, created_ts, and deleted_ts.
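
For illustration, the per-part location metadata kept by these entries could be sketched as follows. This is a hypothetical struct; only the listed fields come from the description above.

```cpp
// Illustrative sketch, not the actual entry layout.
#include <cstdint>

struct FilePartAddr {
    uint64_t file_id{};     // which coupled file holds this part
    uint64_t offset{};      // byte offset of the part inside that file
    uint64_t size{};        // byte length of the part
    uint64_t created_ts{};  // timestamp when the part was written
    uint64_t deleted_ts{};  // timestamp when the part became garbage
};
```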

Details

insert

  • insert in memory -- no impact

delete

  • delete in memory -- no impact
  • delete a row in an existing coupled file -- read, modify

full checkpoint

A full checkpoint is for the whole system:
Dump the metadata JSON.
Dump changed data to coupled files.

delta checkpoint

A delta checkpoint is for the whole system. The delta log file stays as it is.

Segment Compaction

  • Determine segment list
  • Determine block list
  • Determine file list
  • Do the compaction to generate new files and set their create_ts.

Drop table

Touch only metadata.

Drop index

Touch only metadata.

Create index

All index files, such as the fulltext dictionary file.

Build index

All index files, such as the fulltext dictionary file.
Files dumped by MemoryIndexer::Dump() are lazily persisted to S3.

Cleanup

Iterate over the file set and delete any file whose delete_ts is no larger than the given timestamp.
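
A rough sketch of that cleanup pass, assuming a simple in-memory map from file id to its metadata (names here are hypothetical):

```cpp
// Sketch of the cleanup rule described above, not the actual implementation.
#include <cstdint>
#include <unordered_map>

struct FileMeta {
    uint64_t deleted_ts{};  // 0 while the file is still live
};

// Drop every file whose deleted_ts is no larger than the given timestamp.
void Cleanup(std::unordered_map<uint64_t, FileMeta> &file_set, uint64_t given_ts) {
    for (auto it = file_set.begin(); it != file_set.end();) {
        if (it->second.deleted_ts != 0 && it->second.deleted_ts <= given_ts) {
            // The backing physical file / object would also be removed here.
            it = file_set.erase(it);
        } else {
            ++it;
        }
    }
}
```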

@yuzhichang
Member

yuzhichang commented Jul 11, 2024

PersistenceManager is a newly introduced class (a declaration sketch follows the list below):

  • PersistenceManager(const String &workspace, SizeT coupled_capacity, SizeT alone_capacity) constructs a PersistenceManager. coupled_capacity applies to the cache of coupled files (each object maps to one or more original files). alone_capacity applies to the cache of alone files (each object maps to exactly one original file). Each cache uses an LRU eviction mechanism bounded by its capacity.
  • String CreateObj() generates a UUID as the object_key of a new object.
  • int ObjRoom(const String &object_key) returns the remaining room (capacity - sum_of_parts_size) of the object. Callers should check it before each ObjAppend operation. capacity is a constant, for example 100MB.
  • void ObjAppend(const String &object_key, char *body, SizeT body_len) appends body to the object. Callers should compress body before each ObjAppend operation.
  • void ObjFinalize(const String &object_key) finalizes an object. Subsequent ObjAppend calls on this object are forbidden. PersistenceManager shall upload the whole object to S3 in the background.
  • Tuple<String, SizeT, SizeT> Persist(const String &file_path) appends the content of the given file to some open object and returns the location.
  • Pair<UniquePtr<FileHandler>, Status> ObjOpen(const String &object_key) downloads the whole object from S3 if it is not in the cache, then opens the cached object.
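
Gathered into one declaration, the interface above could look roughly like this. It is a sketch only; the aliases and stub types at the top stand in for Infinity's own String, SizeT, Pair, Tuple, UniquePtr, FileHandler, and Status.

```cpp
// Sketch of the PersistenceManager interface listed above, not the real code.
#include <cstddef>
#include <memory>
#include <string>
#include <tuple>
#include <utility>

using String = std::string;
using SizeT = std::size_t;
template <typename T> using UniquePtr = std::unique_ptr<T>;
template <typename... Ts> using Tuple = std::tuple<Ts...>;
template <typename A, typename B> using Pair = std::pair<A, B>;
struct FileHandler {};  // placeholder for Infinity's file handle type
struct Status {};       // placeholder for Infinity's status type

class PersistenceManager {
public:
    PersistenceManager(const String &workspace, SizeT coupled_capacity, SizeT alone_capacity);

    String CreateObj();                                    // new object_key (UUID)
    int ObjRoom(const String &object_key);                 // capacity - sum_of_parts_size
    void ObjAppend(const String &object_key, char *body, SizeT body_len);  // body is pre-compressed
    void ObjFinalize(const String &object_key);            // no more appends; upload to S3
    Tuple<String, SizeT, SizeT> Persist(const String &file_path);  // (object_key, offset, size)
    Pair<UniquePtr<FileHandler>, Status> ObjOpen(const String &object_key);
};
```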

FileWorker gains the following capabilities:

  • SetSource(const String &object_key, SizeT offset, SizeT len) sets the source object, offset, and length.
  • Tuple<String, SizeT, SizeT> GetDest() gets the destination object, offset, and length.

Every FileWorker subclass's ReadFromFileImpl() shall call PersistenceManager::ObjOpen() to open the given object, then decompress and parse it as needed.
Every FileWorker subclass's WriteToFileImpl(bool to_spill, bool &prepare_success) shall write to a local file. If to_spill is false, it then calls PersistenceManager::Persist(file_path) to persist the file to S3 and records the destination location so that a later GetDest() call can return it.
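
A rough, simplified sketch of that write path. Class, member, and helper names here are hypothetical; only the to_spill branch and the Persist/GetDest interplay come from the description above.

```cpp
// Simplified sketch of a FileWorker-like write path; not the actual code.
#include <cstddef>
#include <string>
#include <tuple>

struct PersistenceManagerStub {
    // Stands in for PersistenceManager::Persist(file_path).
    std::tuple<std::string, std::size_t, std::size_t> Persist(const std::string &file_path);
};

class SomeFileWorker {
public:
    void WriteToFileImpl(bool to_spill, bool &prepare_success) {
        // Always write the data to a local file first.
        std::string local_path = WriteLocalFile();  // hypothetical helper
        prepare_success = true;

        if (!to_spill) {
            // Hand the local file to the PersistenceManager, which appends it
            // to an open object that is later uploaded to S3 ...
            auto [obj_key, offset, size] = pm_->Persist(local_path);
            // ... and remember where it landed so GetDest() can answer later.
            obj_key_ = obj_key;
            obj_offset_ = offset;
            obj_size_ = size;
        }
    }

    std::tuple<std::string, std::size_t, std::size_t> GetDest() const {
        return {obj_key_, obj_offset_, obj_size_};
    }

private:
    std::string WriteLocalFile();  // writes the payload, returns the file path

    PersistenceManagerStub *pm_{};
    std::string obj_key_;
    std::size_t obj_offset_{};
    std::size_t obj_size_{};
};
```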

BlockColumnEntry gains the following capability:

  • An entry may have multiple BufferObjects (for varchar, sparse, tensor, and tensor array), and each BufferObject has a FileWorker, so an entry knows the persisted locations of the block column. Each Flush() operation changes these locations. We don't store the history of locations, since the latest one contains the full data visibility.

yuzhichang mentioned this issue Jul 16, 2024
yuzhichang added a commit that referenced this issue Jul 17, 2024
Introduced PersistenceManager.
Adapted FileWorker for PersistenceManager.

Issue link: #1184

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
@yuzhichang
Member

yuzhichang commented Aug 19, 2024

Completed. VFS is disabled by default. The following piece in the config file enables the VFS:

[persistence]
persistence_dir          = "/var/infinity/persistence"
