Add a disk-usage analysis API #68508

@jpountz

Disk usage analysis is a popular ask, as it maps directly to cost efficiency for users who manage large volumes of data. We have internal tools that help with this, but they are impractical to use. An API for this would also be very useful for benchmarking, e.g. analyzing the effect of mapping changes on disk usage: not only how a change affects the overall disk usage, but also the per-field or even per-field, per-data-structure disk usage.

I've been thinking about exposing an API that analyzes disk usage and would work the following way:

  • Create a directory wrapper that tracks all bytes that are read via IndexInput#readBytes, including reads made through clones and slices.
  • Wrap the directory of the index that we wish to analyze with this wrapper.
  • For every (field, data structure) pair in the index, reset the counter of the wrapper and then read all the content of the data structure of the considered field. This should give an approximation of the contribution of this data structure for this field to overall disk usage.
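The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration that does not use Lucene: `TrackingInput` is a hypothetical stand-in for a wrapped IndexInput, and `Region` is an invented way to say where one (field, data structure) pair lives in a file. A real implementation would instead read each structure through the codec readers, since the byte ranges are not known up front. The only point shown here is that clones and slices share one counter, which is reset before each structure is read in full.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class DiskUsageSketch {

    /** Hypothetical stand-in for a wrapped IndexInput: a view over a byte range. */
    static final class TrackingInput {
        final byte[] data;
        final int start;
        final int length;
        final AtomicLong counter; // shared by the original, its clones and its slices
        int pos;

        TrackingInput(byte[] data, int start, int length, AtomicLong counter) {
            this.data = data;
            this.start = start;
            this.length = length;
            this.counter = counter;
        }

        /** Copies len bytes into dst and records them in the shared counter. */
        void readBytes(byte[] dst, int offset, int len) {
            if (len < 0 || pos + len > length) {
                throw new IndexOutOfBoundsException("read past end of input");
            }
            System.arraycopy(data, start + pos, dst, offset, len);
            pos += len;
            counter.addAndGet(len);
        }

        /** A slice is a new view over a sub-range that reports to the same counter. */
        TrackingInput slice(int offset, int len) {
            if (offset < 0 || len < 0 || offset + len > length) {
                throw new IndexOutOfBoundsException("slice out of bounds");
            }
            return new TrackingInput(data, start + offset, len, counter);
        }

        /** A clone keeps the current position and reports to the same counter. */
        TrackingInput cloneInput() {
            TrackingInput copy = new TrackingInput(data, start, length, counter);
            copy.pos = pos;
            return copy;
        }
    }

    /** Hypothetical: where one (field, data structure) pair lives in the file. */
    static final class Region {
        final String field;
        final String structure;
        final int offset;
        final int length;

        Region(String field, String structure, int offset, int length) {
            this.field = field;
            this.structure = structure;
            this.offset = offset;
            this.length = length;
        }
    }

    /** For each region: reset the counter, read everything, record the bytes seen. */
    static Map<String, Long> analyze(byte[] file, Region... regions) {
        AtomicLong counter = new AtomicLong();
        TrackingInput in = new TrackingInput(file, 0, file.length, counter);
        Map<String, Long> usage = new LinkedHashMap<>();
        byte[] buffer = new byte[4];
        for (Region r : regions) {
            counter.set(0);
            TrackingInput slice = in.slice(r.offset, r.length);
            int remaining = r.length;
            while (remaining > 0) {
                int n = Math.min(buffer.length, remaining);
                slice.readBytes(buffer, 0, n);
                remaining -= n;
            }
            usage.put(r.field + "/" + r.structure, counter.get());
        }
        return usage;
    }

    public static void main(String[] args) {
        Map<String, Long> usage = analyze(new byte[30],
            new Region("title", "postings", 0, 10),
            new Region("title", "doc_values", 10, 5),
            new Region("body", "postings", 15, 15));
        usage.forEach((key, bytes) -> System.out.println(key + ": " + bytes + " bytes"));
    }
}
```

Sharing the counter by reference, rather than copying it into each clone or slice, is what lets reads made through derived inputs show up in the total for the structure currently being scanned.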

It won't be 100% accurate. For instance, metadata that is loaded into memory up-front wouldn't be tracked. Another example is skip lists: it's tricky to come up with sequences of calls to PostingsEnum#advance that are guaranteed to read all skip data. But I believe the bytes we would miss are rare enough that they would seldom account for more than 1% of the overall size of an index.

Such an API would be costly, as it would need to scan the entire content of the index and could evict useful content from the filesystem cache.

Metadata

Labels

  • :Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard.)
  • >feature
  • Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.)
