API for listing index file sizes #16661
@mikemccand would you mind having a look at this PR please?
Hmm should we have an "other" category in a final else clause here?
Yes, that's correct!
I think I'd rather see the sizes broken out by file extension? E.g. knowing whether your terms dict is huge, or your docs or positions or payloads are huge, is important information. We could do extensions plus an "interpretation" e.g. ".tim (terms)", ".tip (terms index)", etc.? Doing it this way also has the advantage that an unknown extension (which can easily happen when Lucene changes its file format, e.g. the upcoming dimensional points is a change in 6.0) can just be added into the stats.
I like this compromise. I'll simplify some things and proceed to fix the mentioned issues. 👍
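A minimal sketch of the "extension plus interpretation" idea discussed above, written against plain Java collections rather than the actual Elasticsearch or Lucene classes. The class name `FileSizesByExtension`, its methods, and the sample file names are hypothetical; only the `.tim` and `.tip` descriptions come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not the code from this PR.
class FileSizesByExtension {
    // Known extensions and their "interpretation"; deliberately partial.
    private static final Map<String, String> DESCRIPTIONS = new HashMap<>();
    static {
        DESCRIPTIONS.put("tim", "terms");
        DESCRIPTIONS.put("tip", "terms index");
    }

    // Returns the text after the last dot, or "" when the name has no dot.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // Sums file sizes (in bytes) per extension. Unknown extensions are
    // kept as-is, so a new file format still shows up in the stats.
    static Map<String, Long> aggregate(Map<String, Long> fileSizes) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            totals.merge(extensionOf(e.getKey()), e.getValue(), Long::sum);
        }
        return totals;
    }

    // Renders ".tim (terms)" for known extensions and ".xyz" otherwise.
    static String label(String ext) {
        String description = DESCRIPTIONS.get(ext);
        return description == null ? "." + ext : "." + ext + " (" + description + ")";
    }
}
```

Because unknown extensions fall through to a bare label instead of being dropped, no separate "other" bucket is needed in this variant.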
1c116d8 to
8e30cc6
Compare
@mikemccand Hi Mike, I fixed all of the above, and some more. Details:
So far it works and passes all the tests. It seems to me that a better place for this would be under |
Maybe add an else clause here to set detailedSegmentsStats to false so it's explicit?
Thanks @camilojd, this is looking very nice. I just left some minor comments.
I think Do we also need to punch the new boolean option through to e.g. the Java client API (and eventually the other language clients)?
I do agree it would be more natural to have the stats there, e.g. we could just fix store stats to break out the file sizes by extension, and not cause any additional IO load to the filesystem. Maybe it would be an OK limitation that it'd only list flushed segments? We could explain that in the docs?
8e30cc6 to
e51111d
Compare
@mikemccand just pushed changes that address all your comments. I really like how the code became more self-contained. :-)
The option is in the Client API, accessible through Currently it can be requested through HTTP in
It's possible to do that, but it's kind of weird. Currently |
Thanks @camilojd, I'll review your latest changes.
Oh good, sorry I missed this.
I think it's good to expose the option in the REST API too.
OK let's leave it where it is now, but maybe add a TODO about whether this could/should move to |
Hmm can we rename detailedSegmentStats? Maybe includeSegmentFileSizes?
This change looks great! I just left a minor (naming, the hardest part!) comment, and let's add a TODO about maybe moving this to Can you add some tests here to confirm the feature is working and catch us if anyone breaks it in the future, and also update the docs explaining this new cool parameter? Thanks @camilojd!
e51111d to
3b371e8
Compare
@mikemccand all comments were addressed. Do you think there is anything left to improve?
Can you update docs/reference/indices/stats.asciidoc with this new parameter?
I think we just need to document the new parameter and then we are done! Thanks @camilojd.
3b371e8 to
27647d4
Compare
@mikemccand now it's done! 👍 Thank you sir!
Hmm I just realized we are doing too much work here? We should only be visiting the files belonging to this segment, but we are instead visiting all files in the directory (at least, in the non-compound-file case)?
I think instead of directory.listAll(), you could use segmentReader.getSegmentInfo().files()? This should return all file names that this segment uses ...
Hi @mikemccand,
Makes sense. Perhaps we can do:
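    if (useCompoundFile) {
        // directory is the compound directory here, so this lists the files packed inside it
        files = directory.listAll();
    } else {
        // only the files that belong to this segment
        files = segmentReader.getSegmentInfo().files().toArray(new String[0]);
    }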
Otherwise getSegmentInfo().files() returns the .cfs and .cfe files, and the compound directory can't find them when querying fileLength.
Ahh yes @camilojd I agree that makes sense!
Thanks @camilojd I left one more comment!
27647d4 to
f7ab290
Compare
@mikemccand I updated the code according to the comments :-) thanks! Edit: fixed the exception handling to log the correct messages and limit the |
…extension. Use 'includeSegmentFileSizes' as the flag name to report disk usage. Added test that verifies reported segment disk usage is growing accordingly after adding a document. Documentation: Reference the new parameter as part of indices stats.
f7ab290 to
3563648
Compare
Thanks @camilojd, this looks great, I'll push to 5.0 now!
Segments stats in indices stats API now optionally include aggregated file sizes by file extension / index component
Awesome! thank you very much @mikemccand!!
… APIs (#71643) Since #16661 it is possible to know the total sizes for some Lucene segment files by using the Node Stats or Indices Stats API with the include_segment_file_sizes parameter, and the list of file extensions has been extended in #71416. This commit adds a bit more information about file sizes: the number of files (count) and the min, max and average file sizes in bytes for files that share the same extension. Here is a sample: "cfs" : { "description" : "Compound Files", "size_in_bytes" : 2260, "min_size_in_bytes" : 2260, "max_size_in_bytes" : 2260, "average_size_in_bytes" : 2260, "count" : 1 } This commit also simplifies how compound file sizes are computed: previously, compound segment files were extracted and their sizes aggregated with regular non-compound file sizes (which can be confusing and is out of the scope of the original issue #6728); now CFS/CFE files appear as distinct files. This new information is provided to give a better view of the segment files and is useful in many cases, especially with frozen searchable snapshots whose segment stats can now be introspected thanks to the include_unloaded_segments parameter.
… Stats APIs (#71725) Backport of #71643
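The per-extension figures named in the #71643 commit message (count, min, max, average, total size) can be sketched with the JDK's `LongSummaryStatistics`. This is an illustration only, not the Elasticsearch implementation: the class `SegmentFileStats`, the method `byExtension`, and the input map of file names to sizes are all assumptions made for the example.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not the code from the referenced commits.
class SegmentFileStats {
    // Groups file sizes (in bytes) by extension and tracks count, min,
    // max, sum and average for each group.
    static Map<String, LongSummaryStatistics> byExtension(Map<String, Long> fileSizes) {
        Map<String, LongSummaryStatistics> stats = new TreeMap<>();
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            String name = e.getKey();
            int dot = name.lastIndexOf('.');
            // Files without a dot are grouped under the empty extension.
            String ext = dot < 0 ? "" : name.substring(dot + 1);
            stats.computeIfAbsent(ext, k -> new LongSummaryStatistics())
                 .accept(e.getValue().longValue());
        }
        return stats;
    }
}
```

For a single 2260-byte `.cfs` file, as in the sample from the commit message, `getCount()` is 1 and `getMin()`, `getMax()` and `getSum()` are all 2260. Note that `getAverage()` returns a `double` here, whereas the sample JSON shows an integer value.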
I'd like to propose this API to query index file sizes, as part of SegmentsStats.
Copypasting a comment previously posted on #16131:
Some comments/questions follow.
I'm consolidating file sizes using the same criteria of SegmentsStats to expose disk usage of Lucene index files: Terms, TermVectors, StoredFields, Norms, DocValues (class IndexResources in SegmentsStats). SegmentsStats currently exposes memory consumption as bytes of each of the aforementioned so it's possible to add these as part of a class (IndexResources) and expose it as two fields of SegmentsStats, one for memory resources (existing stats) and other for disk resources. This is what I implemented in my branch, but I'm not sure if it would be better to expose this as another "sister" class of SegmentsStats, member of CommonStats, and leave SegmentsStats as it is. I'm however aware that my toXContent serialization could break clients expecting certain fields that now are placed elsewhere.

Some other couple questions I've got:

… SegmentsStats or may be better doing it using the file extensions? This last option has the advantage that would include additional information about postings, i.e., separate size info for positions, payloads, etc.

… SegmentInfo and the compound reader a sensible approach? I'm not sure if this could introduce issues with the Store.

Closes #16131