Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Improve response format of data frame stats endpoint #44350

Merged
merged 19 commits into from
Jul 23, 2019

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Jul 15, 2019

This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.

This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.

Closes #43767

This change adjusts the data frame stats endpoint
to return the format discussed in elastic#43767.

Relates elastic#43767
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

This still doesn't work as the checkpointing info is
always empty in the state-and-stats docs.  This
commit makes that clearer by not actually persisting
empty checkpointing info in the state-and-stats docs,
but instead hardcoding this at the point of use.  The
next step is then to get non-empty checkpointing info
when required...
DataFrameTransformStateAndStatsInfo -> DataFrameTransformStats (user facing)
DataFrameTransformStateAndStats -> DataFrameTransformStoredDoc (internal)
Now that indexer_state has moved into checkpointing.next
it is unreliable to assert on the value of indexer_state
in a YAML test because the tests are so quick that the
next checkpoint can easily move to last before any
assertion is made.
Also renamed some more internal variables to match the
user facing stats output
@droberts195 droberts195 marked this pull request as ready for review July 19, 2019 13:01
@droberts195 droberts195 changed the title [ML] Change response format of data frame stats endpoint [ML] Improve response format of data frame stats endpoint Jul 19, 2019
randomFrom(DataFrameTransformTaskState.values()),
randomBoolean() ? null : randomAlphaOfLength(100),
randomBoolean() ? null : randomNodeAttributes(),
// On the server side the stats has transform ID "_all" when embedded in another doc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this internal ID field be skipped while serializing the XContent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is skipped when serializing. The problem is that this test creates a server-side object, round-trips that through the HLRC back to another server-side object, then asserts equality. After the round-tripping via serialization that discards the ID it ends up as _all. Hence the original object has to be created with an ID of _all so that it is considered equal.

You're also correct that the cleanest solution with the way the rest of the code is today is just not to store a transform ID in the indexer stats. At present when it's persisted to an index it's always inside a state-and-stats wrapper. There is an idea to split the storage so that state and stats are persisted separately. Obviously if that were done then we'd need an ID in the stats during persistence and this problem would come back. But maybe it is best to leave the state and stats wrapped in a single document when they're persisted. Grouping them may avoid consistency problems where the stats endpoint response includes old state and new stats (or vice-versa) for a particular transform due to the two docs becoming searchable at different times. So in fact keeping state and stats grouped during persistence can solve two problems:

  1. Guarantees consistency of state and indexer stats incorporated into any single stats endpoint response
  2. Allows the complex/confusing code related to the sometimes-present ID in indexer stats to be deleted

Any thoughts @hendrikmuhs and @davidkyle?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Our discussions convinced me that we should stay with storing state and stats together. As explained above this avoids consistency problems and in addition we do not have to break backwards compatibility. The deletion of superfluous fields should not break BWC.

@@ -133,7 +132,6 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
throw new IllegalArgumentException("when storing transform statistics, a valid transform id must be provided");
}
builder.field(DataFrameField.ID.getPreferredName(), transformId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need to serialize this ID anylonger? Or could we have some sort of parameter that is passed by the containing stats object?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point, the purpose of the ID was to delete corresponding documents after the deletion of a transform (via dbq). As stats is not stored as such the ID is as superfluous as the doc type, so the whole internal storage handling is obsolete.

(We can also clean up in a later PR if preferred, this is already getting complex)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do this in a followup PR. It will also mean the assertion that IDs are identical on merge will have to go, but I guess that was only a debugging aid - it didn't do anything in production.


private void populateSingleStoppedTransformStat(DataFrameTransformStoredDoc transform,
ActionListener<DataFrameTransformCheckpointingInfo> listener) {
transformsCheckpointService.getCheckpointStats(transform.getId(), transform.getTransformState().getCheckpoint(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are going to find that this will hurt performance in the future. I know refactoring getCheckpointStats would a monstrous undertaking, and probably will have to be done in another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, true, I should probably write a new method that gets all the required checkpoint docs for all the stopped transforms in one step.

Also getting the "operations behind" for many transforms that share the same source indices is inefficient because the indices stats action sends messages to the nodes holding the shards for each index that's involved and we'd do that multiple times. So a more efficient approach would be to get the indices stats for the union of all indices referenced in the source of all transforms we're going to return stats for, then pick the relevant bits out of that for each transform.

The approach in this PR at the moment is actually what's suggested in #42978 but further optimisation would certainly be possible. Since this PR is targeted at 7.4 we have time to do this in a followup. Otherwise like you say the title of this PR will have to be changed to "refactor everything".

AtomicInteger numberRemaining = new AtomicInteger(statsForTransformsWithoutTasks.size());
AtomicBoolean isExceptionReported = new AtomicBoolean(false);

statsForTransformsWithoutTasks.forEach(stat -> populateSingleStoppedTransformStat(stat,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the cluster behave with a larger amount of transforms getting stats at the same time? This could be pretty bad if the UI is loading many transforms at the same time to display their stats and most of them are stopped.

* @param nextCheckpoint the next checkpoint
* @param nextCheckpointIndexerState indexer state for the next checkpoint
* @param nextCheckpointPosition position for the next checkpoint
* @param nextCheckpointProgress progress for the next checkpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yo dog, I heard you like nextCheckpoint :D

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:feelsgood:

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Huge Thanks for this!

I added some comments, e.g. there might be some follow ups but should not stop us here

randomBoolean() ? null : randomNodeAttributes(),
// On the server side the stats has transform ID "_all" when embedded in another doc
// TODO: change this so that the outer document sets the correct transform ID during parsing
// It's very confusing and could cause subtle errors that the inner object has a surprising ID

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for the record: the id in stats is a refactoring leftover, it was forgotten to be removed, stats was stored as individual document before state and stats have been combined, see discussion above. So the todo should be "remove the id"

@@ -133,7 +132,6 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
throw new IllegalArgumentException("when storing transform statistics, a valid transform id must be provided");
}
builder.field(DataFrameField.ID.getPreferredName(), transformId);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point, the purpose of the ID was to delete corresponding documents after the deletion of a transform (via dbq). As stats is not stored as such the ID is as superfluous as the doc type, so the whole internal storage handling is obsolete.

(We can also clean up in a later PR if preferred, this is already getting complex)

/**
* Used as a wrapper for the objects returned from the stats endpoint.
* Objects of this class are expected to be ephemeral.
* Do not persist objects of this class to cluster state or an index.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -661,9 +660,8 @@ protected void doSaveState(IndexerState indexerState, DataFrameIndexerPosition p

// Persisting stats when we call `doSaveState` should be ok as we only call it on a state transition and

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment looks outdated, quick suggestion:

"Persist the current state and stats in the internal index. The interval of this method being called is controlled by AsyncTwoPhaseIndexer#onBulkResponse which calls doSaveState every-so-often when doing bulk indexing calls or at the end of one indexing run"

@droberts195 droberts195 deleted the change_stats_format branch July 23, 2019 09:48
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 23, 2019
Mutes data frame BWC tests prior to backporting elastic#44350
droberts195 added a commit that referenced this pull request Jul 23, 2019
Mutes data frame BWC tests prior to backporting #44350
droberts195 added a commit that referenced this pull request Jul 23, 2019
This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.

This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.

Backport of #44350
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 23, 2019
This is a followup to elastic#44350. The indexer stats used to
be persisted standalone, but now are only persisted as
part of a state-and-stats document. During the review
of elastic#44350 it was decided that we'll stick with this
design, so there will never be a need for an indexer
stats object to store its transform ID as it is stored
on the enclosing document. This PR removes the indexer
stats document ID.
droberts195 added a commit that referenced this pull request Jul 23, 2019
This change adjusts the changes of #44350 to account
for the backport to the 7.x branch in #44743.
droberts195 added a commit that referenced this pull request Jul 25, 2019
This is a followup to #44350. The indexer stats used to
be persisted standalone, but now are only persisted as
part of a state-and-stats document. During the review
of #44350 it was decided that we'll stick with this
design, so there will never be a need for an indexer
stats object to store its transform ID as it is stored
on the enclosing document. This PR removes the indexer
stats document ID.
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 25, 2019
Mutes data frame BWC tests prior to backporting elastic#44350
droberts195 added a commit that referenced this pull request Jul 25, 2019
This is a followup to #44350. The indexer stats used to
be persisted standalone, but now are only persisted as
part of a state-and-stats document. During the review
of #44350 it was decided that we'll stick with this
design, so there will never be a need for an indexer
stats object to store its transform ID as it is stored
on the enclosing document. This PR removes the indexer
stats document ID.

Backport of #44768
droberts195 added a commit that referenced this pull request Sep 24, 2019
…cs (#46821)

The PRs that made these changes are:

- #44350
- #45276
- #45856

Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 24, 2019
The PRs that made these changes are:

- elastic#44350
- elastic#45276
- elastic#45856

Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>

Backport of elastic#46821
droberts195 added a commit that referenced this pull request Sep 25, 2019
…cs (#47034)

The PRs that made these changes are:

- #44350
- #45276
- #45856

Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>

Backport of #46821
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this pull request May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Data frame GET _stats response is confusing
5 participants