Concurrently load the in-memory cache index from disk #123

bdittmer · 2019-11-14T19:41:47Z

This change loads the on-disk cache into memory concurrently.

mostynb · 2019-11-15T16:17:19Z

cache/disk/disk.go

-			err := os.MkdirAll(filepath.Join(dir, cache.CAS.String(), subDir), os.FileMode(0744))
+
+			casSubDir := filepath.Join(dir, cache.CAS.String(), subDir)
+			err := os.MkdirAll(casSubDir, os.FileMode(0777))


This FileMode change looks unrelated. I'm not sure if @buchgr has a reason to use 0744.

I am not the expert but I figured it's good practice to be as restrictive as possible. @bdittmer what's the reason for changing this to 777?

The use case is so that a group of users can admin the cache without root access. The permissions on disk are not 0777 but (0777 & ~umask) for dirs and (0666 & ~umask) for files.

I think this permission change is OK, but it should land in a separate commit. I added an alternative permissions change in #128 (which will cause a merge conflict here, sorry).

No problem, thanks!
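
To illustrate the umask point from this thread, here is a minimal sketch of how the requested mode and the process umask combine. The helper name `effectiveMode` and the umask values are assumptions made for illustration only; nothing here is taken from the PR.

```go
package main

import (
	"fmt"
	"os"
)

// effectiveMode returns the permission bits that a newly created file or
// directory ends up with: the requested mode with the umask bits cleared.
// (Hypothetical helper, not part of the PR.)
func effectiveMode(requested, umask os.FileMode) os.FileMode {
	return requested &^ umask
}

func main() {
	// Assumed umask values: 0022 is a common default, 0002 is a typical
	// choice when members of a group should be able to write.
	for _, umask := range []os.FileMode{0022, 0002} {
		fmt.Printf("umask %04o: dirs %04o, files %04o\n",
			uint32(umask),
			uint32(effectiveMode(0777, umask)),
			uint32(effectiveMode(0666, umask)))
	}
}
```

With a umask of 0002, directories requested as 0777 come out as 0775, which matches the group-administration use case described above.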

mostynb · 2019-11-15T16:21:42Z

cache/disk/disk.go

+
+	filesChan := make(chan []NameAndInfo, len(dirs))
+	workChan := make(chan string)
+	workers := runtime.NumCPU()


How many CPU cores have you benchmarked this with? I wonder if this is more constrained by disk access than by CPU time?

I cannot find the benchmarks (this landed in our branch several months ago), but IIRC we saw a ~40-50% startup time improvement on a 6-core MBP with an SSD. If I get some time I'll try to run some additional benchmarks.

mostynb · 2019-11-15T16:33:53Z

cache/disk/disk.go

 	}

-	log.Println("Sorting cache files by atime.")


Please add back the log statement.

mostynb · 2019-11-15T16:33:57Z

cache/disk/disk.go

 	// Sort in increasing order of atime
 	sort.Slice(files, func(i int, j int) bool {
 		return atime.Get(files[i].info).Before(atime.Get(files[j].info))
 	})

-	log.Println("Building LRU index.")


Please add back the log statement.

mostynb · 2019-11-15T16:34:05Z

cache/disk/disk.go

@@ -153,7 +197,6 @@ func (c *diskCache) loadExistingFiles() error {
 		})
 	}

-	log.Println("Finished loading disk cache files.")


Please add back the log statement.

mostynb · 2019-11-15T16:35:14Z

cache/disk/disk.go

+			for dir := range workChan {
+				dirFiles := make([]NameAndInfo, 0)
+
+				_ = filepath.Walk(dir, func(name string, info os.FileInfo, err error) error {


Should we check the return value (and the other instances of this)?

mostynb · 2019-11-15T16:36:05Z

cache/disk/disk.go

+			for dir := range workChan {
+				dirFiles := make([]NameAndInfo, 0)
+
+				_ = filepath.Walk(dir, func(name string, info os.FileInfo, err error) error {


I wonder if using fastwalk would be faster?
https://godoc.org/golang.org/x/tools/internal/fastwalk

mostynb · 2019-11-15T17:56:42Z

cache/disk/disk.go

@@ -40,21 +41,36 @@ type diskCache struct {
 func New(dir string, maxSizeBytes int64) cache.Cache {
 	// Create the directory structure.
 	hexLetters := []byte("0123456789abcdef")
+	dirs := make([]string, len(hexLetters)*len(hexLetters)*3)
+	idx := 0
+
 	for _, c1 := range hexLetters {
 		for _, c2 := range hexLetters {


Maybe it's worth adding another for loop over []string{"ac", "cas", "raw"} to avoid some mostly copy+pasted code?

mostynb · 2019-11-15T17:59:35Z

cache/disk/disk.go

@@ -119,32 +135,60 @@ func migrateDirectory(dir string) error {
 // loadExistingFiles lists all files in the cache directory, and adds them to the
 // LRU index so that they can be served. Files are sorted by access time first,
 // so that the eviction behavior is preserved across server restarts.
-func (c *diskCache) loadExistingFiles() error {
-	log.Printf("Loading existing files in %s.\n", c.dir)


Please add back this log message.

mostynb · 2019-11-15T18:26:12Z

cache/disk/disk.go

+
+	i := 0
+	for f := range filesChan {
+		if len(f) > 0 {


Is this check required when using range?
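
As a small illustration of the question: `range` over a channel yields every value that was sent, including empty or nil slices, so whether the guard matters depends on what the loop body does. If the body only appends, as in this hypothetical `flatten` helper (not the PR's code), the guard can be dropped.

```go
package main

import "fmt"

// flatten drains ch and concatenates every received slice.
func flatten(ch <-chan []string) []string {
	var all []string
	for f := range ch {
		// No len(f) > 0 guard: appending an empty or nil slice adds nothing.
		all = append(all, f...)
	}
	return all
}

func main() {
	ch := make(chan []string, 3)
	ch <- []string{"a", "b"}
	ch <- nil // an empty directory would produce something like this
	ch <- []string{"c"}
	close(ch)

	fmt.Println(flatten(ch)) // prints [a b c]
}
```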

cache/disk/disk.go

@bdittmer

The header is made up of three fields: 1) Little-endian int32 (4 bytes) representing the REAPIv2 DigestFunction. 2) Little-endian int64 (8 bytes) representing the number of bytes in the blob. 3) The hash bytes from the digest, length determined by the particular DigestFunction (32 for SHA256, 20 for SHA1, 16 for MD5). Note that we currently only support SHA256, however. This header is simple to parse, and does not require buffering the entire blob in memory if you just want the data. To distinguish blobs with and without this header, we use new directories for the affected blobs: ac.v2/ instead of ac/ and similarly for raw/. We do not use this header to actually verify data yet, and we still os.File.Sync() after file writes (buchgr#67). This also includes a slightly refactored version of PR buchgr#123 (load the items from disk concurrently) by @bdittmer.

mostynb · 2022-09-25T19:26:13Z

I added similar functionality in #581 just now.

Concurrently load the in-memory cache index from disk

bef144b

mostynb reviewed Nov 15, 2019

mostynb mentioned this pull request Feb 10, 2020

disk cache: store a data integrity header for non-CAS blobs #186

Open

mostynb closed this Sep 25, 2022
