Skip to content

Commit

Permalink
RFC-0561: List metadata reuse (#561)
Browse files Browse the repository at this point in the history
* rfc: add metadata to DirEntry

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: rename rfc

1. fix up format of document

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: compilable example code

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: keep example getter style consist with ObjectMeta

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: embedded metadata structure

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: more future possibilities

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: reformat markdown

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: fix markdown-rust misformat

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: make plan A the proposed way

Plan A: simply extend `DirEntry`

Plan B: define a new MetaLite struct and embed it into `DirEntry`
    @ClSlaid prefered.

Plan C: embed `ObjectMetadata` into `DirEntry`
    @Xuanwo prefered

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: avoid using first-person-view

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: add tracking issue for RFC

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

* refactor: remove unimplemented metadata

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>

Signed-off-by: ClSlaid <cailue@bupt.edu.cn>
  • Loading branch information
ClSlaid authored Aug 25, 2022
1 parent 76d904b commit ce5e6b8
Show file tree
Hide file tree
Showing 2 changed files with 211 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,3 +46,4 @@
- [0443-gateway](rfcs/0443-gateway.md)
- [0501-new-builder](rfcs/0501-new-builder.md)
- [0554-write-refactor](rfcs/0554-write-refactor.md)
- [0561-list-metadata-reused](rfcs/0561-list-metadata-reuse.md)
210 changes: 210 additions & 0 deletions docs/rfcs/0561-list-metadata-reuse.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
- Proposal Name: `list_metadata_reuse`
- Start Date: 2022-08-23
- RFC PR: [datafuselabs/opendal#561](https://github.com/datafuselabs/opendal/pull/561)
- Tracking Issue: [datafuselabs/opendal#570](https://github.com/datafuselabs/opendal/pull/570)

# Summary

Reuse metadata returned during listing, by extending `DirEntry` with some metadata fields.

# Motivation

Users may expect to browse metadata of some directories' child files and directories. Using `walk()` of `BatchOperator` seems to be an ideal way to complete this job.

Thus, they start iterating on it, but soon they realized the `DirEntry`, could only offer the name (or path, more precisely) and access mode of the object, and it's not enough.

So they have to call `metadata()` for each name they extracted from the iterator.

The final example looks like:

```rust
let op = Operator::from_env(Scheme::Gcs)?.batch();

// here is a network request
let mut dir_stream = op.walk("/dir/to/walk")?;

while let Some(Ok(file)) = dir_stream.next().await {
let path = file.path();

// here is another network request
let size = file.metadata().await?.content_length();
println!("size of file {} is {}B", path, size);
}
```

But...wait! many storage-services returns object metadata when listing, like HDFS, AWS and GCS. The rust standard library returns metadata when listing local file systems, too.

In the previous versions of OpenDAL those fields were just get ignored. This wastes users' time on requesting on metadata.

# Guide-level explanation

The loop in main will be changed to the following code with this RFC:

```rust
while let Some(Ok(file)) = dir_stream.next().await {
let size = if let Some(len) = file.content_length() {
len
} else {
file.metadata().await?.content_length();
};
let name = file.path();
println!("size of file {} is {}B", path, size);
}
```

# Reference-level explanation

Extend `DirEntry` with metadata fields:

```rust
pub struct DirEntry {
acc: Arc<dyn Accessor>,

mode: ObjectMode,
path: String,

// newly add metadata fields
content_length: Option<u64>, // size of file
content_md5: Option<String>,
last_modified: Option<OffsetDateTime>,
}

impl DirEntry {
pub fn content_length(&self) -> Option<u64> {
self.content_length
}
pub fn last_modified(&self) -> Option<OffsetDateTime> {
self.last_modified
}
pub fn content_md5(&self) -> Option<OffsetDateTime> {
self.content_md5
}
}
```

For all services that supplies metadata during listing, like AWS, GCS and HDFS. Those optional fields will be filled up; Meanwhile for those services doesn't return metadata during listing, like in memory storages, just left them as `None`.

As you can see, for those services returning metadata when listing, the operation of listing metadata will save many unnecessary requests.

# Drawbacks

Add complexity to `DirEntry`. To use the improved features of `DirEntry`, users have to explicitly check the existence of metadata fields.

The size of `DirEntry` increased from 40 bytes to 80 bytes, a 100% percent growth requires more memory.

# Rational and alternatives

The largest drawback of performance usually comes from network or hard disk operations. By letting `DirEntry` storing some metadata, many redundant requests could be avoided.

## Embed a Structure Containing Metadata

Define a `MetaLite` structure containing some metadata fields, and embed it in `DirEntry`

```rust
struct MetaLite {
pub content_length: u64, // size of file
pub content_md5: String,
pub last_modified: OffsetDateTime,
}

pub struct DirEntry {
acc: Arc<dyn Accessor>,

mode: ObjectMode,
path: String,

// newly add metadata struct
metadata: Option<MetaLite>,
}

impl DirEntry {
// get size of file
pub fn content_length(&self) -> Option<u64> {
self.metadata.as_ref().map(|m| m.content_length)
}
// get the last modified time
pub fn last_modified(&self) -> Option<OffsetDateTime> {
self.metadata.as_ref().map(|m| m.last_modified)
}
// get md5 message digest
pub fn content_md5(&self) -> Option<String> {
self.metadata.as_ref().map(|m| m.content_md5)
}
}
```

The existence of those newly added metadata fields is highly correlated. If one field does not exist, the others neither.

By wrapping them together in an embedded structure, 8 bytes of space for each `DirEntry` object could be saved. In the future, more metadata fields may be added to `DirEntry`, then a lot more space could be saved.

This approach could be slower because some intermediate functions are involved. But it's worth sacrificing rarely used features' performance to save memory.

## Embed a `ObjectMetadata` into `DirEntry`

- Embed a `ObjectMetadata` struct into `DirEntry`
- Remove the `ObjectMode` field in `DirEntry`
- Change `ObjectMetadata`'s `content_length` field's type to `Option<u64>`.

```rust
pub struct DirEntry {
acc: Arc<dyn Accessor>,

// - mode: ObjectMode, removed
path: String,

// newly add metadata struct
metadata: ObjectMetadata,
}

impl DirEntry {
pub fn mode(&self) -> ObjectMode {
self.metadata.mode()
}
pub fn content_length(&self) -> Option<u64> {
self.metadata.content_length()
}
pub fn content_md5(&self) -> Option<&str> {
self.metadata.content_md5()
}
// other metadata getters...
}
```

In the degree of memory layout, it's the same as proposed way in this RFC. This approach offers more metadata fields and fewer changes to code.

# Prior art

None.

# Unresolved questions

None.

# Future possibilities

## Switch to Alternative Implement Approaches

As the growing of metadata fields, someday the alternatives could be better. And other RFCs will be raised then.

## More Fields

Add more metadata fields to DirEntry, like:

- accessed: the last access timestamp of object

## Simplified Get

Users have to explicitly check if those metadata fields actual present in the DirEntry. This may be done inside the getter itself.

```rust
let path = file.path();

// if content_length is not exist
// this getter will automatically fetch from the storage service.
let size = file.content_length().await?;

// the previous getter can cache metadata fetched from service
// so this function could return instantly.
let md5 = file.content_md5().await?;
println!("size of file {} is {}B, md5 outcome of file is {}", path, size, md5);
```

1 comment on commit ce5e6b8

@github-actions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deploy preview for opendal ready!

✅ Preview
https://opendal-h8e879qpu-databend.vercel.app

Built with commit ce5e6b8.
This pull request is being automatically deployed with vercel-action

Please sign in to comment.