Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrently load the in-memory cache index from disk #123

Closed
wants to merge 1 commit into from

Conversation

bdittmer
Copy link
Contributor

This change loads the on-disk cache into memory concurrently.

err := os.MkdirAll(filepath.Join(dir, cache.CAS.String(), subDir), os.FileMode(0744))

casSubDir := filepath.Join(dir, cache.CAS.String(), subDir)
err := os.MkdirAll(casSubDir, os.FileMode(0777))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This FileMode change looks unrelated- I'm not sure if @buchgr has a reason to use 0744.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not the expert but I figured it's good practice to be as restrictive as possible. @bdittmer what's the reason for changing this to 777?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case is so that a group of users can admin the cache without root access. The permissions on disk are not 0777 but (0777 - umask) for dirs and (0666 - umask) for files.

I think this permission change is OK, but it should land in a separate commit. Added an alternative permissions change in #128 (which will cause a merge conflict here- sorry).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, thanks!


filesChan := make(chan []NameAndInfo, len(dirs))
workChan := make(chan string)
workers := runtime.NumCPU()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many CPU cores have you benchmarked this with? I wonder if this is more constrained by disk access than by CPU time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find the benchmarks (this landed in our branch several months ago) but iirc we saw ~40-50% startup time improvement on a 6-core MBP w/ SSD. If I get some time I'll try to run some additional benchmarks

}

log.Println("Sorting cache files by atime.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add back the log statement.

// Sort in increasing order of atime
sort.Slice(files, func(i int, j int) bool {
return atime.Get(files[i].info).Before(atime.Get(files[j].info))
})

log.Println("Building LRU index.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add back the log statement.

@@ -153,7 +197,6 @@ func (c *diskCache) loadExistingFiles() error {
})
}

log.Println("Finished loading disk cache files.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add back the log statement.

for dir := range workChan {
dirFiles := make([]NameAndInfo, 0)

_ = filepath.Walk(dir, func(name string, info os.FileInfo, err error) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check the return value (and the other instances of this)?

for dir := range workChan {
dirFiles := make([]NameAndInfo, 0)

_ = filepath.Walk(dir, func(name string, info os.FileInfo, err error) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if using fastwalk would be faster?
https://godoc.org/golang.org/x/tools/internal/fastwalk

@@ -40,21 +41,36 @@ type diskCache struct {
func New(dir string, maxSizeBytes int64) cache.Cache {
// Create the directory structure.
hexLetters := []byte("0123456789abcdef")
dirs := make([]string, len(hexLetters)*len(hexLetters)*3)
idx := 0

for _, c1 := range hexLetters {
for _, c2 := range hexLetters {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth adding another for loop over []string{"ac", "cas", "raw"} and avoid some mostly copy+pasted code?

@@ -119,32 +135,60 @@ func migrateDirectory(dir string) error {
// loadExistingFiles lists all files in the cache directory, and adds them to the
// LRU index so that they can be served. Files are sorted by access time first,
// so that the eviction behavior is preserved across server restarts.
func (c *diskCache) loadExistingFiles() error {
log.Printf("Loading existing files in %s.\n", c.dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add back this log message.


i := 0
for f := range filesChan {
if len(f) > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check is required when using range?

cache/disk/disk.go Show resolved Hide resolved
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 10, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 10, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 10, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 10, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 11, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 14, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 14, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 18, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 26, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Feb 26, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Mar 4, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
mostynb added a commit to mostynb/bazel-remote that referenced this pull request Mar 8, 2020
The header is made up of three fields:
1) Little-endian int32 (4 bytes) representing the REAPIv2
   DigestFunction.
2) Little-endian int64 (8 bytes) representing the number
   of bytes in the blob.
3) The hash bytes from the digest, length determined by
   the particular DigestFunction.
   (32 for SHA256. 20 for SHA1, 16 for MD5).

Note that we currently only support SHA256, however.

This header is simple to parse, and does not require buffering the
entire blob in memory if you just want the data.

To distinguish blobs with and without this header, we use new
directories for the affected blobs: ac.v2/ instead of ac/ and
similarly for raw/.

We do not use this header to actually verify data yet, and we
still os.File.Sync() after file writes (buchgr#67).

This also includes a slightly refactored version of PR buchgr#123
(load the items from disk concurrently) by @bdittmer.
@mostynb
Copy link
Collaborator

mostynb commented Sep 25, 2022

I added similar functionality in #581 just now.

@mostynb mostynb closed this Sep 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants