[Draft] Improve usage of blob store cache during searchable snapshots shard recovery #69283

Closed
wants to merge 2 commits into from

@tlrx tlrx commented Feb 19, 2021

The blob store cache was introduced in #60522 to speed up searchable snapshots shard recovery by caching (in a system index) the first 4096 bytes, sometimes 8192, of every Lucene file that composes a shard.

Recent experiments using large snapshots suggest that we could adjust the current caching strategy by caching less data (i.e. 1024 bytes) by default for most files and more data (up to 64KB) for Lucene metadata files.

This draft pull request addresses this point by introducing BlobStoreCacheService#computeHeaderByteRange(), which computes the range of bytes to put in the blob store cache depending on the Lucene file type.

We also noticed that compound files could represent a non-negligible share of the total size of a shard (~30% in our tests) and that it may be worth avoiding random seeks and reads by also caching the files that compose .cfs files.

This pull request addresses this point by caching headers and footers of .cfs inner files in the blob store cache. The size of the data to cache for the header is computed using computeHeaderByteRange(). The footer is 16 bytes long.
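As a rough illustration of the header and footer ranges described above, here is a minimal sketch. The class `CfsInnerRangesSketch`, the `ByteRange` type and the `rangesToCache` method are hypothetical names and are not taken from the pull request. The `headerLength` argument stands in for whatever computeHeaderByteRange() would return, and the handling of inner files too short to have a separate footer range is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code from this pull request.
final class CfsInnerRangesSketch {

    // Footer size as stated in the description above.
    static final long FOOTER_LENGTH = 16L;

    /** Half-open byte range [start, end), expressed in .cfs file coordinates. */
    static final class ByteRange {
        final long start;
        final long end;

        ByteRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Computes the ranges to cache for one inner file of a .cfs file.
     *
     * @param innerOffset  offset of the inner file within the .cfs file
     * @param innerLength  length of the inner file
     * @param headerLength number of leading bytes to cache (assumed to come from the header computation)
     */
    static List<ByteRange> rangesToCache(long innerOffset, long innerLength, long headerLength) {
        List<ByteRange> ranges = new ArrayList<>();
        long innerEnd = innerOffset + innerLength;
        // The header never extends past the end of the inner file.
        long headerEnd = innerOffset + Math.min(innerLength, headerLength);
        ranges.add(new ByteRange(innerOffset, headerEnd));
        // Assumption: the footer range is trimmed so that it never overlaps the header.
        long footerStart = Math.max(headerEnd, innerEnd - FOOTER_LENGTH);
        if (footerStart < innerEnd) {
            ranges.add(new ByteRange(footerStart, innerEnd));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Inner file of 100000 bytes starting at offset 4096, with a 1024-byte header.
        System.out.println(rangesToCache(4096L, 100_000L, 1024L)); // [[4096, 5120), [104080, 104096)]
        // Inner file shorter than the header length: a single range covers it entirely.
        System.out.println(rangesToCache(0L, 500L, 1024L)); // [[0, 500)]
    }
}
```

Expressing the ranges in .cfs coordinates keeps the sketch independent of how the cache keys its entries, which the description does not specify.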

Finally, we found that concurrent prewarming and directory opening could prevent some file parts from being cached in the blob store cache the first time an index is mounted, forcing some bytes to be downloaded again the next time that index is mounted.

This pull request addresses this point by detecting when the blob store cache index should be preferred over the disk-based cache. The blob store cache is always preferred while the recovery is not yet finalized, and completely bypassed once the recovery is done.
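The rule stated above can be sketched as a single decision function. This is only an illustration: `CachePreferenceSketch`, `CacheSource` and `preferredSource` are hypothetical names, and the real change may track the recovery state differently.

```java
// Hypothetical sketch of the preference rule, not the code from this pull request.
final class CachePreferenceSketch {

    /** Where a read of cached bytes is served from first (names are assumptions). */
    enum CacheSource {
        BLOB_STORE_CACHE,
        DISK_CACHE
    }

    /**
     * Prefers the blob store cache while the shard recovery is not finalized,
     * and bypasses it entirely once the recovery is done.
     */
    static CacheSource preferredSource(boolean recoveryFinalized) {
        return recoveryFinalized ? CacheSource.DISK_CACHE : CacheSource.BLOB_STORE_CACHE;
    }

    public static void main(String[] args) {
        System.out.println(preferredSource(false)); // BLOB_STORE_CACHE
        System.out.println(preferredSource(true));  // DISK_CACHE
    }
}
```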

I'm opening this PR as a draft to show the complexity introduced by this change. It's possible that we decide to move forward with only a subset of the changes.

@tlrx tlrx added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 labels Feb 19, 2021

private CompoundReaderUtils() {}

public static Map<String, Map<String, Tuple<Long, Long>>> extractCompoundFiles(Directory directory) throws IOException {

I'm really sorry but I did not find any better way to extract the list of files that compose the CFS. Reading only the .cfe is possible but that won't give the inner offsets (only the lengths) and I think it is better to check the right boundaries.

@@ -205,33 +250,36 @@ public void testBlobStoreCache() throws Exception {

assertAcked(client().admin().indices().prepareDelete(restoredIndex));

logger.info("--> mount snapshot [{}] as an index for the second time", snapshot);
final String restoredAgainIndex = mountSnapshot(
cacheEnabled = randomBoolean();

The second mount of the index can now be fully randomized between full cache, partial cache, and no cache.

|| mayReadMoreThanHeader == false) {
assertThat(Strings.toString(indexInputStats), indexInputStats.getBlobStoreBytesRequested().getCount(), equalTo(0L));
}
assertThat(Strings.toString(indexInputStats), indexInputStats.getBlobStoreBytesRequested().getCount(), equalTo(0L));

We can now blindly assume in this test that no bytes were requested when mounting the second time.

@ywelsch ywelsch self-requested a review February 19, 2021 16:08
tlrx added a commit that referenced this pull request Mar 1, 2021
Today searchable snapshots IndexInput implementations use the
blob store cache to cache the first 4096 bytes of every Lucene file.
After some experiments we think that we could adjust the length of
the cached data depending on the Lucene file that is read, caching
up to 64KB for Lucene metadata files (i.e. files that are fully read
when a Directory is opened) and only 1KB for other files.

The files that are cached up to 64KB are the following extensions:

        "cfe", // compound file's entry table
        "dvm", // doc values metadata file
        "fdm", // stored fields metadata file
        "fnm", // field names metadata file
        "kdm", // Lucene 8.6 point format metadata file
        "nvm", // norms metadata file
        "tmd", // Lucene 8.6 terms metadata file
        "tvm", // terms vectors metadata file
        "vem"  // Lucene 9.0 indexed vectors metadata

The 64KB limit can be configured on a per index basis through a new 
index setting. This change is extracted from #69283 and does not 
address the caching of CFS files.
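
A minimal sketch of how the cached header length could be derived from the file extension, assuming the extension list and the 1KB and 64KB limits quoted above. The class `HeaderRangeSketch` and the method `headerLength` are hypothetical names, and the per-index setting is modeled here as a plain parameter because its actual name is not given.

```java
import java.util.Set;

// Hypothetical sketch, not the merged implementation.
final class HeaderRangeSketch {

    // Extensions listed in the commit message as Lucene metadata files.
    private static final Set<String> METADATA_EXTENSIONS = Set.of(
        "cfe", "dvm", "fdm", "fnm", "kdm", "nvm", "tmd", "tvm", "vem"
    );

    static final long DEFAULT_HEADER_LENGTH = 1024L;      // 1KB for other files
    static final long MAX_METADATA_LENGTH = 64L * 1024L;  // 64KB, configurable per index

    /**
     * Returns the number of leading bytes to cache for a file.
     *
     * @param fileName          Lucene file name, e.g. "_0.fnm"
     * @param fileLength        total length of the file in bytes
     * @param maxMetadataLength limit applied to metadata files (stands in for the index setting)
     */
    static long headerLength(String fileName, long fileLength, long maxMetadataLength) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1);
        long limit = METADATA_EXTENSIONS.contains(extension) ? maxMetadataLength : DEFAULT_HEADER_LENGTH;
        // Never cache more bytes than the file contains.
        return Math.min(fileLength, limit);
    }

    public static void main(String[] args) {
        System.out.println(headerLength("_0.fnm", 200_000L, MAX_METADATA_LENGTH)); // 65536
        System.out.println(headerLength("_0.fdt", 200_000L, MAX_METADATA_LENGTH)); // 1024
        System.out.println(headerLength("_0.cfe", 300L, MAX_METADATA_LENGTH));     // 300
    }
}
```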
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Mar 1, 2021
…ic#69431)

Backport of elastic#69431
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Mar 2, 2021
…ic#69431)

Backport of elastic#69431
tlrx added a commit that referenced this pull request Mar 3, 2021
Backport of #69431
tlrx added a commit that referenced this pull request Mar 3, 2021
Backport of #69431

tlrx commented Mar 23, 2021

Parts of this draft pull request have been implemented and merged (#69861, #69415, #68902, #69431).

Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement v8.0.0-alpha1