Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Immutable tiered storage #962

Merged
merged 16 commits into from
Dec 6, 2020
Merged

Immutable tiered storage #962

merged 16 commits into from
Dec 6, 2020

Conversation

itaiad200
Copy link
Contributor

Initial draft of the tiered storage.

@codecov-io
Copy link

codecov-io commented Nov 25, 2020

Codecov Report

Merging #962 (891e9d4) into master (5e34175) will increase coverage by 0.49%.
The diff coverage is 61.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #962      +/-   ##
==========================================
+ Coverage   44.24%   44.74%   +0.49%     
==========================================
  Files         140      147       +7     
  Lines       11570    11850     +280     
==========================================
+ Hits         5119     5302     +183     
- Misses       5800     5867      +67     
- Partials      651      681      +30     
Impacted Files Coverage Δ
config/config.go 33.54% <42.85%> (+0.44%) ⬆️
pyramid/tierFS.go 59.72% <59.72%> (ø)
pyramid/ro_file.go 71.42% <71.42%> (ø)
pyramid/file.go 76.47% <76.47%> (ø)
pyramid/eviction.go 77.77% <77.77%> (ø)
graveler/staging.go 100.00% <0.00%> (ø)
graveler/graveler.go 0.00% <0.00%> (ø)
graveler/staging_iterator.go 81.08% <0.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e34175...891e9d4. Read the comment docs.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Will use this to store commits.

pyramid/file.go Show resolved Hide resolved
pyramid/pyramid.go Outdated Show resolved Hide resolved
pyramid/pyramid.go Outdated Show resolved Hide resolved
pyramid/pyramid.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
pyramid/file.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Show resolved Hide resolved
pyramid/tierFS.go Show resolved Hide resolved
@itaiad200 itaiad200 force-pushed the feature/tiered-storage branch from 5b8f6b0 to b353aae Compare November 26, 2020 15:52
@itaiad200 itaiad200 force-pushed the feature/tiered-storage branch from b353aae to 856e506 Compare November 29, 2020 16:16
@itaiad200 itaiad200 force-pushed the feature/tiered-storage branch from 99cb63e to b9ac0b4 Compare November 30, 2020 16:54
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Probably need to unit-test everything here; in particular I suspect file creation might not do what we want (comments marked).

config/config.go Outdated Show resolved Hide resolved
pyramid/eviction.go Outdated Show resolved Hide resolved
pyramid/eviction.go Outdated Show resolved Hide resolved
pyramid/file.go Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
pyramid/file.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Show resolved Hide resolved
@itaiad200 itaiad200 force-pushed the feature/tiered-storage branch from f26f5b2 to 77ed7cb Compare December 1, 2020 14:58
@itaiad200 itaiad200 changed the title First version of the immutable tiered storage Immutable tiered storage Dec 2, 2020
@itaiad200 itaiad200 requested a review from arielshaqed December 2, 2020 12:45
@itaiad200 itaiad200 marked this pull request as ready for review December 2, 2020 12:45
@itaiad200 itaiad200 force-pushed the feature/tiered-storage branch from add7902 to 21c6826 Compare December 2, 2020 12:49
@itaiad200 itaiad200 requested a review from johnnyaug December 2, 2020 12:50
@itaiad200 itaiad200 force-pushed the feature/tiered-storage branch from 834af95 to fe134d8 Compare December 2, 2020 15:00
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Sorry it's taking so long. As discussed the largest change is probably to write files outside the scope of the same files as reading them. Specifically we need to be able to close a to-be-written-to-S3 file without copying it out to S3, e.g. to abort it when Pebble has a write error. And that can even happen when Pebble is closing the file, at which point it commits to close its received file.

config/config.go Outdated Show resolved Hide resolved
pyramid/file.go Outdated Show resolved Hide resolved
pyramid/pyramid.go Outdated Show resolved Hide resolved
pyramid/pyramid.go Outdated Show resolved Hide resolved
pyramid/file.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Show resolved Hide resolved
pyramid/file.go Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
pyramid/tierFS.go Outdated Show resolved Hide resolved
@itaiad200 itaiad200 requested a review from arielshaqed December 3, 2020 14:53
@itaiad200 itaiad200 force-pushed the feature/tiered-storage branch from 0f4617f to 9acd925 Compare December 3, 2020 15:04
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice, and a lot simpler!

Some issues:

  1. I don't think size-based LRU will make good decisions (I am not familiar with any literature about it either way, though). I think that it always makes cache thrashing worse. This is a criticism of our fork of the LRU cache, I just think it is particularly relevant here in our use-case: most user workloads will have a mix of very large and very small files (say, commits vs. trees), and frequent access to a small file should prefer to displace small files to large ones. We can partly overcome this with multiple pyramids, but then we need to configure each one separately.
  2. Need to delete empty directories from the cache. Otherwise a long-lived system can end up with high fan-out, which is even less efficient on POSIX than on S3.
  3. Can we handle paths with zero-length subdirectories? In S3 the paths s3://foo/a////b and s3://foo/a///b are distinct, whereas in POSIX /a////b, /a///b and /a/b all point to the same place.
    If we never use pyramid to cache general user files on disk, then we can just fail pathnames with an empty directory component. Otherwise we will need metadata.

}

func (f *ROFile) Stat() (os.FileInfo, error) {
return f.fh.Stat()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want to touch here. But... as long as we collect useful metrics from pyramid (not necessarily in this PR!), that's OK either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a touch - metrics in next PR.

return nil
}

func (tfs *TierFS) removeFromLocal(rPath relativePath) {
removeFromLocal(tfs.fsLocalBaseDir, rPath)
// This will be called by the cache eviction mechanism during entry insert.
// We don't want to wait while the file is being removed from the local disk.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! But it does mean we will need to be very careful how we handle errors there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing much we can do, right?
For now logging it, will add metrics later on.

pyramid/tierFS.go Outdated Show resolved Hide resolved
@@ -320,6 +304,10 @@ func (tfs *TierFS) newLocalFileRef(namespace, filename string) localFileRef {
}

func (tfs *TierFS) objPointer(namespace, filename string) block.ObjectPointer {
if runtime.GOOS == "windows" {
filename = strings.ReplaceAll(filename, `\\'`, "/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is fundamentally unsafe, e.g. if/when MacOS decides it has a new way of writing file paths. I think that in all files that are not local, we should avoid filepath.Join in favour of URL-based joiners or our very own, and write our own /s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow - how can we prepare for a scenario when MacOS decides to use '^' as the separator? Filenames can contain the new separator.

)

func TestPyramidWriteFile(t *testing.T) {
filename := uuid.Must(uuid.NewRandom()).String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of UUIDs or of random data in tests, but if we're going to use them then this form is documented as exactly the above:

Suggested change
filename := uuid.Must(uuid.NewRandom()).String()
filename := uuid.New().String()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed - need to some randomness so that subsequent runs after a fail test won't look at the same data (if failed to be deleted)

pyramid/file_test.go Outdated Show resolved Hide resolved
@itaiad200
Copy link
Contributor Author

This is really nice, and a lot simpler!

Some issues:

  1. I don't think size-based LRU will make good decisions (I am not familiar with any literature about it either way, though). I think that it always makes cache thrashing worse. This is a criticism of our fork of the LRU cache, I just think it is particularly relevant here in our use-case: most user workloads will have a mix of very large and very small files (say, commits vs. trees), and frequent access to a small file should prefer to displace small files to large ones. We can partly overcome this with multiple pyramids, but then we need to configure each one separately.
  2. Need to delete empty directories from the cache. Otherwise a long-lived system can end up with high fan-out, which is even less efficient on POSIX than on S3.
  3. Can we handle paths with zero-length subdirectories? In S3 the paths s3://foo/a////b and s3://foo/a///b are distinct, whereas in POSIX /a////b, /a///b and /a/b all point to the same place.
    If we never use pyramid to cache general user files on disk, then we can just fail pathnames with an empty directory component. Otherwise we will need metadata.
  1. FS is already per usage (e.g. separate FS for trees and sstables). There are several reasons for this choice and you mentioned some. Other reasons are the need for separation of the disk usage, dynamic FS allocations and more. I also don't think LRU cache is optimised for variable cost entries - we'll have metrics to understand exactly by how much.
    I'm making 2 assumptions here, that I'll think will hold for the first version:
  • Disk capacity per FS is much bigger than a single file (>1000)
  • Files are roughly the same size.
  1. Done.
  2. Good point. Currently there's no use-case for user based files and we control all paths. Added verification to file paths.

@itaiad200 itaiad200 requested a review from arielshaqed December 6, 2020 14:08
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

2 small requests for follow on PRs:

  1. Rename tierFS.go to tier_fs.go.
  2. Monitoring. It's a cache, and we've already disagreed about lru.

@itaiad200 itaiad200 merged commit e88f3c7 into master Dec 6, 2020
return f.fh.Read(p)
}

func (f *File) ReadAt(p []byte, off int64) (n int, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

off confused me. please change to offset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather keep the same os.File terminology :)

return f.fh.Sync()
}

func (f *File) Close() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be simpler to use Store instead of Close. Store will close the file.
If someone wants to close without storing - he/she should use of.File

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to unify them. But Close may be called by a different layer, e.g. while closing sstable writer.

func TestPyramidWriteFile(t *testing.T) {
filename := uuid.Must(uuid.NewRandom()).String()
filepath := path.Join("/tmp", filename)
defer os.Remove(filepath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be after create

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated comment.


// ROFile is pyramid wrapper for os.file that implements io.ReadCloser
// with hooks for updating evictions.
type ROFile struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File should be WOFile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? it's also readable.

)

// File is pyramid wrapper for os.file that triggers pyramid hooks for file actions.
type File struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the file name kept?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp file name, which is only relevant during the write itself, is held by the os.File handle.
The final filename is passed during Store.

type relativePath string

// localFileRef consists of all possible local file references
type localFileRef struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that the two last properties will be methods on localFileRef

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. But then I'll have to keep a pointer to tfs, which is a bit messy.
I don't feel strongly about it, willing to change if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants