Skip to content

Conversation

@camilojd
Copy link
Contributor

I'd like to propose this API to query index file sizes, as part of SegmentsStats.

Copypasting a comment previously posted on #16131:

Some comments/questions follow.

I'm consolidating file sizes using the same criteria of SegmentsStats to expose disk usage of Lucene index files: Terms, TermVectors, StoredFields, Norms, DocValues (class IndexResources in SegmentsStats).

SegmentsStats currently exposes memory consumption as bytes of each of the aforementioned so it's possible to add these as part of a class (IndexResources) and expose it as two fields of SegmentsStats, one for memory resources (existing stats) and other for disk resources. This is what I implemented in my branch, but I'm not sure if it would be better to expose this as another "sister" class of SegmentsStats, member of CommonStats, and leave SegmentsStats as it is. I'm however aware that my toXContent serialization could break clients expecting certain fields that now are placed elsewhere.

Some other couple questions I've got:

  1. Is it necessary to consolidate sizes by the same criteria of SegmentsStats or may be better doing it using the file extensions? This last option has the advantage that would include additional information about postings, i.e., separate size info for positions, payloads, etc.
  2. The meat of the matter: Is the way I open the Directory from the SegmentInfo and the compound reader a sensible approach? I'm not sure if this could introduce issues with the Store.

Closes #16131

@clintongormley
Copy link
Contributor

@mikemccand would you mind having a look at this PR please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm should we have an "other" category in a final else clause here?

@mikemccand
Copy link
Contributor

Is the way I open the Directory from the SegmentInfo and the compound reader a sensible approach?

Yes, that's correct!

Is it necessary to consolidate sizes by the same criteria of SegmentsStats or may be better doing it using the file extensions? This last option has the advantage that would include additional information about postings, i.e., separate size info for positions, payloads, etc.

I think I'd rather see by file extension? E.g. knowing whether your terms dict is huge, or your docs or positions or payloads are huge, is important information.

We could do extensions plus an "interpretation" e.g. ".tim (terms)", ".tip (terms index)", etc.? Doing it this way also has the advantage that an unknown extension (which can easily happen when lucene changes its file format, e.g. the upcoming dimensional points is a change in 6.0) can just be added into the stats.

@camilojd
Copy link
Contributor Author

We could do extensions plus an "interpretation" e.g. ".tim (terms)", ".tip (terms index)", etc.? Doing it this way also has the advantage that an unknown extension (which can easily happen when lucene changes its file format, e.g. the upcoming dimensional points is a change in 6.0) can just be added into the stats.

I like this compromise. I'll simplify some things and proceed to fix the mentioned issues. 👍

@camilojd camilojd force-pushed the feature/disk-segments-stats branch from 1c116d8 to 8e30cc6 Compare February 19, 2016 22:32
@camilojd
Copy link
Contributor Author

@mikemccand Hi Mike,

I fixed all of the above, and some more. Details:

  • Simplified field: just a ImmutableOpenMap<String, Long> to store per extension sizes.
  • Changed exception handling to limit try/catch blocks to the minimum scope, logging with warn when necessary
  • Transport serialization doesn't serialize the map field until ES Version 3.0
  • XContent serialization takes care of handling known/unknown extension descriptions by querying a map.
  • Querying segment disk usage is optional: I added a boolean flag to CommonStatsFlags, triggered by setting a detailed_segments parameter in the REST endpoint to query segments disk stats. I don't know if there is a nicer way to do this.

So far it works and passes all the tests.

It seems to me that a better place for this would be under StoreStats, because Store already queries the effective size of all physical files and caches the result... but had no idea/luck on how to grab a SegmentInfo for uncommitted segments (only does show some disk usage after forcing a _flush). See camilojd@df0370f

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an else clause here to set detailedSegmentsStats to false so it's explicit?

@mikemccand
Copy link
Contributor

Thanks @camilojd, this is looking very nice. I just left some minor comments.

Querying segment disk usage is optional: I added a boolean flag to CommonStatsFlags, triggered by setting a detailed_segments parameter in the REST endpoint to query segments disk stats. I don't know if there is a nicer way to do this.

I think CommonStatsFlags is fine for now: I don't know of another place to put it (maybe someone else has a suggestion?).

Do we also need to punch the new boolean option through to e.g. the Java client API (and eventually the other language clients)?

It seems to me that a better place for this would be under StoreStats, because Store already queries the effective size of all physical files and caches the result... but had no idea/luck on how to grab a SegmentInfo for uncommitted segments (only does show some disk usage after forcing a _flush).

I do agree it would be more natural to have the stats there, e.g. we could just fix store stats to break out the file sizes by extension, and not cause any additional IO load to the filesystem.

Maybe it would be an OK limitation that it'd only list flushed segments? We could explain that in the docs?

@camilojd camilojd force-pushed the feature/disk-segments-stats branch from 8e30cc6 to e51111d Compare February 23, 2016 03:48
@camilojd
Copy link
Contributor Author

@mikemccand just pushed changes that address all your comments. I really like how the code became more self-contained. :-)

Do we also need to punch the new boolean option through to e.g. the Java client API (and eventually the other language clients)?

The option is in the Client API, accessible through IndicesStatsRequest.detailedSegmentsStats(boolean), so also through NodesStatsRequest.indices.

Currently it can be requested through HTTP in RestIndicesStatsAction and RestNodesStatsAction, but it's kinda ugly, just checks if the key detailed_segments is in the query string. Maybe leave the option only in the Client API by now?

Maybe it would be an OK limitation that it'd only list flushed segments? We could explain that in the docs?

It's possible to do that, but it's kind of weird. Currently StoreStats already informs the real disk usage of the directory, but numbers in my implementation will almost always be very far from the total, because of unflushed segments. SegmentsStats is more accurate (not exactly 100% accurate because of cfs, etc).

@mikemccand
Copy link
Contributor

Thanks @camilojd, I'll review your latest changes.

The option is in the Client API, accessible through IndicesStatsRequest.detailedSegmentsStats(boolean), so also through NodesStatsRequest.indices.

Oh good, sorry I missed this.

Currently it can be requested through HTTP in RestIndicesStatsAction and RestNodesStatsAction, but it's kinda ugly, just checks if the key detailed_segments is in the query string. Maybe leave the option only in the Client API by now?

I think it's good to expose the option in the REST API too.

SegmentsStats is more accurate (not exactly 100% accurate because of cfs, etc).

OK let's leave it where it is now, but maybe add a TODO about whether this could/should move to StoreStats instead, explaining the tradeoffs we discussed here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm can we rename detailedSegmentStats? Maybe includeSegmentFileSizes?

@mikemccand
Copy link
Contributor

This change looks great!

I just left a minor (naming, the hardest part!) comment, and let's add a TODO about maybe moving this to StoreStats in the future?

Can you add some tests here to confirm the feature is working and catch us if anyone breaks it in the future, and also update the docs explaining this new cool parameter? Thanks @camilojd!

@camilojd camilojd force-pushed the feature/disk-segments-stats branch from e51111d to 3b371e8 Compare February 26, 2016 20:27
@camilojd
Copy link
Contributor Author

@mikemccand all comments were addressed.

Do you think there is anything left to improve?
Thanks for the amazing help!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update docs/reference/indices/stats.asciidoc with this new parameter?

@mikemccand
Copy link
Contributor

I think we just need to document the new parameter and then we are done! Thanks @camilojd.

@camilojd camilojd force-pushed the feature/disk-segments-stats branch from 3b371e8 to 27647d4 Compare February 29, 2016 22:44
@camilojd
Copy link
Contributor Author

@mikemccand now it's done! 👍 Thank you sir!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I just realized we are doing too much work here? We should only be visiting the files belonging to this segment, but we are instead visiting all files in the directory (at least, in the non-compound-file case)?

I think instead of directory.listAll(), you could use segmentReader.getSegmentInfo().files()? This should return all file names that this segment uses ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of directory.listAll(), you could use segmentReader.getSegmentInfo().files()? This should return all file names that this segment uses ...

Hi @mikemccand,
Makes sense. Perhaps we can do:

        if (useCompoundFile) {
            files = directory.listAll();
        } else {
            files = segmentReader.getSegmentInfo().files().toArray(new String[]{});
        }

Otherwise getSegmentInfo.files() returns .cfs, .cfe files and the Compound Directory can't find them when querying fileLength

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes @camilojd I agree that makes sense!

@mikemccand
Copy link
Contributor

Thanks @camilojd I left one more comment!

@camilojd camilojd force-pushed the feature/disk-segments-stats branch from 27647d4 to f7ab290 Compare March 2, 2016 01:19
@camilojd
Copy link
Contributor Author

camilojd commented Mar 2, 2016

@mikemccand I updated the code according to the comments :-) thanks!

Edit: fixed the exception handling to log the correct messages and limit the try {} scope to one line at most.

…extension.

Use 'includeSegmentFileSizes' as the flag name to report disk usage.
Added test that verifies reported segment disk usage is growing accordingly after adding a document.
Documentation: Reference the new parameter as part of indices stats.
@camilojd camilojd force-pushed the feature/disk-segments-stats branch from f7ab290 to 3563648 Compare March 2, 2016 02:17
@mikemccand
Copy link
Contributor

Thanks @camilojd, this looks great, I'll push to 5.0 now!

mikemccand pushed a commit that referenced this pull request Mar 2, 2016
Segments stats in indices stats API now optionally includes aggregated file sizes by file extension / index component
@mikemccand mikemccand merged commit 4df83dc into elastic:master Mar 2, 2016
@camilojd
Copy link
Contributor Author

camilojd commented Mar 3, 2016

Awesome! thank you very much @mikemccand!!

@camilojd camilojd deleted the feature/disk-segments-stats branch March 15, 2016 04:06
tlrx added a commit that referenced this pull request Apr 15, 2021
… APIs (#71643)

Since #16661 it is possible to know the total sizes for some Lucene segment files 
by using the Node Stats or Indices Stats API with the include_segment_file_sizes 
parameter, and the list of file extensions has been extended in #71416.

This commit adds a bit more information about file sizes like the number of files 
(count), the min, max and average file sizes in bytes that share the same extension.

Here is a sample:
"cfs" : {
  "description" : "Compound Files",
  "size_in_bytes" : 2260,
  "min_size_in_bytes" : 2260,
  "max_size_in_bytes" : 2260,
  "average_size_in_bytes" : 2260,
  "count" : 1
}

This commit also simplifies how compound file sizes were computed: before 
compound segment files were extracted and sizes aggregated with regular 
non-compound files sizes (which can be confusing and out of the scope of 
the original issue #6728), now CFS/CFE files appears as distinct files.

These new information are provided to give a better view of the segment 
files and are useful in many cases, specially with frozen searchable snapshots 
whose segment stats can now be introspected thanks to the 
include_unloaded_segments parameter.
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Apr 15, 2021
… APIs (elastic#71643)

Since elastic#16661 it is possible to know the total sizes for some Lucene segment files 
by using the Node Stats or Indices Stats API with the include_segment_file_sizes 
parameter, and the list of file extensions has been extended in elastic#71416.

This commit adds a bit more information about file sizes like the number of files 
(count), the min, max and average file sizes in bytes that share the same extension.

Here is a sample:
"cfs" : {
  "description" : "Compound Files",
  "size_in_bytes" : 2260,
  "min_size_in_bytes" : 2260,
  "max_size_in_bytes" : 2260,
  "average_size_in_bytes" : 2260,
  "count" : 1
}

This commit also simplifies how compound file sizes were computed: before 
compound segment files were extracted and sizes aggregated with regular 
non-compound files sizes (which can be confusing and out of the scope of 
the original issue elastic#6728), now CFS/CFE files appears as distinct files.

These new information are provided to give a better view of the segment 
files and are useful in many cases, specially with frozen searchable snapshots 
whose segment stats can now be introspected thanks to the 
include_unloaded_segments parameter.
tlrx added a commit that referenced this pull request Apr 15, 2021
… Stats APIs (#71725)

Since #16661 it is possible to know the total sizes for some Lucene segment files 
by using the Node Stats or Indices Stats API with the include_segment_file_sizes 
parameter, and the list of file extensions has been extended in #71416.

This commit adds a bit more information about file sizes like the number of files 
(count), the min, max and average file sizes in bytes that share the same extension.

Here is a sample:
"cfs" : {
  "description" : "Compound Files",
  "size_in_bytes" : 2260,
  "min_size_in_bytes" : 2260,
  "max_size_in_bytes" : 2260,
  "average_size_in_bytes" : 2260,
  "count" : 1
}

This commit also simplifies how compound file sizes were computed: before 
compound segment files were extracted and sizes aggregated with regular 
non-compound files sizes (which can be confusing and out of the scope of 
the original issue #6728), now CFS/CFE files appears as distinct files.

These new information are provided to give a better view of the segment 
files and are useful in many cases, specially with frozen searchable snapshots 
whose segment stats can now be introspected thanks to the 
include_unloaded_segments parameter.

Backport of #71643
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/Stats Statistics tracking and retrieval APIs >feature v5.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants