
Extract append-only optimization from Engine #84771

Closed
wants to merge 12 commits

Conversation

@nik9000 (Member) commented Mar 8, 2022

This extracts the logic for the "append only" optimization from Engine
into a pluggable behavior class so that we can override it in TSDB.
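
For context, here is a rough sketch of what such a pluggable behavior class could look like, pieced together from the method names quoted in the review comments below. It is an approximation, not the exact interface in this PR:

```java
// Sketch only: approximates the pluggable interface from the method names
// discussed in the review below; the real PR may differ in names and signatures.
import java.util.Map;

import org.elasticsearch.index.engine.Engine;

public interface MayHaveBeenIndexedBefore {

    /** Fast check, based only on the operation itself, that the optimization can apply at all. */
    boolean canOptimizeAddDocument(Engine.Index index);

    /**
     * {@code true} if the operation may already have been processed by the engine.
     * Rare false positives are fine (they merely skip the fast path); a {@code false}
     * answer must always be correct.
     */
    boolean mayHaveBeenIndexedBefore(Engine.Index index);

    /** Restore internal state from the live commit user data during recovery. */
    void bootstrap(Iterable<Map.Entry<String, String>> liveCommitData);

    /** Record internal state into the next commit's user data. */
    void commitData(Map<String, String> commitData);
}
```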

@nik9000 force-pushed the may_have_been_indexed_before branch from 20750c0 to 838964f on March 8, 2022 18:27
nik9000 added 4 commits March 8, 2022 13:38
This extracts the logic for the "append only" optimization from `Engine`
into a pluggable behavior class so that we can override it in TSDB.
@jpountz (Contributor) commented Mar 21, 2022

Out of curiosity, what would the implementation look like for TSDB? Would it track the max timestamp for each timeseries or something like that?

@nik9000 (Member, Author) commented Mar 21, 2022

> Out of curiosity, what would the implementation look like for TSDB? Would it track the max timestamp for each timeseries or something like that?

I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps. Or something like that. Then I got distracted by that _id query optimization. Then I actually merged the id generation and had a bunch of follow-ups to run through.
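
A rough sketch of that idea (hypothetical, not code from this PR): keep a small fixed array of max timestamps, pick the slot from a few bits of the tsid hash, and only take the append-only path when the incoming timestamp is strictly greater than the slot's current max:

```java
// Hypothetical sketch of the bucketed max-timestamp idea, not code from this PR.
// Each bucket covers all tsids whose hash lands in it, so the stored value is an
// upper bound on every timestamp this engine has indexed for those series.
final class BucketedMaxTimestamps {
    private final long[] maxTimestampPerBucket = new long[64]; // 64 buckets from the low bits of the tsid hash

    BucketedMaxTimestamps() {
        java.util.Arrays.fill(maxTimestampPerBucket, Long.MIN_VALUE);
    }

    /**
     * {@code false} means the (tsid, timestamp) pair definitely has not been indexed
     * by this engine, so the append-only fast path is safe. {@code true} means
     * "maybe", and the caller must fall back to the full duplicate check.
     */
    synchronized boolean mayHaveBeenIndexedBefore(int tsidHash, long timestamp) {
        int bucket = tsidHash & 0x3F;
        if (timestamp > maxTimestampPerBucket[bucket]) {
            maxTimestampPerBucket[bucket] = timestamp;
            return false;
        }
        return true;
    }
}
```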

@jpountz (Contributor) commented Mar 22, 2022

I just had a quick chat with @henningandersen about this PR and I believe we have two main options at a high level. The first is the one you are suggesting: extract the timestamp from the ID and optimize writes when that timestamp is greater than every timestamp seen before. The second consists of generating IDs so that Elasticsearch and Lucene can do it mostly automatically, by having the ID concatenate the timestamp in BE order and then the TSID.

The trade-off I'm seeing is that your approach requires more maintenance and is specific to timeseries data streams, but it is likely more space-efficient thanks to better sharing of ID prefixes, and more CPU-efficient in case data comes in timestamp order on a per-timeseries basis but not globally (which is not unlikely?).

I'm curious if you have any data about how much larger the index of the _id field would be if we generated the ID by putting the timestamp first instead of last?
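
To make the two layouts concrete, here is a hypothetical sketch (not the actual tsdb _id format): timestamp-last keeps ids of the same series sharing a tsid prefix, while a big-endian-timestamp-first layout makes ids increase roughly with time, which is what would let the existing append-only machinery work mostly automatically:

```java
// Hypothetical illustration of the two id layouts being discussed,
// not the actual Elasticsearch tsdb id format.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class TsdbIdLayouts {

    /** Option 1 style: tsid first, timestamp last; ids cluster by series and share tsid prefixes. */
    static byte[] tsidThenTimestamp(byte[] tsid, long timestampMillis) {
        return ByteBuffer.allocate(tsid.length + Long.BYTES)
            .order(ByteOrder.BIG_ENDIAN)
            .put(tsid)
            .putLong(timestampMillis)
            .array();
    }

    /** Option 2 style: big-endian timestamp first, then tsid; ids sort roughly by time. */
    static byte[] timestampThenTsid(long timestampMillis, byte[] tsid) {
        return ByteBuffer.allocate(Long.BYTES + tsid.length)
            .order(ByteOrder.BIG_ENDIAN)
            .putLong(timestampMillis)
            .put(tsid)
            .array();
    }
}
```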

@nik9000 (Member, Author) commented Mar 22, 2022 via email

@jpountz (Contributor) commented Mar 22, 2022

At the moment my gut feeling is that we'd need both too. I'm sort of hoping that data about disk usage would confirm that it's not ok so that I feel more confident about moving forward in the direction you suggest.

@nik9000 (Member, Author) commented Mar 22, 2022

> At the moment my gut feeling is that we'd need both too.

👍 I'm glad our stomachs agree. I'll try and get data. Though I might be getting distracted in the short term.

@henningandersen (Contributor) left a comment

A few comments from an initial read.

> I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps.

IIRC, we could have millions of tsids. The way we make this safe today in failover cases is to ensure the replica knows the max unsafe timestamp. For tsdb, it would need to know the max timestamp or we would have to bootstrap that from scratch on failover. Is the latter your thought here?

I wonder if that necessarily goes well, since then on failover, the first request for every tsid will have an extended duration. It might need the normal "does this id exist" check and/or also have to search the tsid to find the largest timestamp.

If we imagine a cluster max'ed out (more or less) indexing with the optimization, it might fall over when a node dies? Or at least it might buffer up loads of data, reject some and require retries from clients. This could even be the case for some relocations maybe, in particular if they have just one data stream with many tsids.
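
For the failover/recovery side, the existing optimization carries its max unsafe auto-generated-id timestamp across recovery via the commit user data; a bucketed tsdb variant could do the same for its per-bucket maxima. A hypothetical sketch, continuing the BucketedMaxTimestamps sketch above (the commit-data key and comma-separated encoding are invented here):

```java
// Hypothetical methods that could be added to the BucketedMaxTimestamps sketch above.
// They assume the maxTimestampPerBucket array from that sketch.
private static final String MAX_TIMESTAMPS_KEY = "tsdb_bucketed_max_timestamps"; // made-up key

/** Record the per-bucket maxima into the next commit's user data. */
void commitData(java.util.Map<String, String> commitData) {
    StringBuilder encoded = new StringBuilder();
    for (long max : maxTimestampPerBucket) {
        if (encoded.length() > 0) {
            encoded.append(',');
        }
        encoded.append(max);
    }
    commitData.put(MAX_TIMESTAMPS_KEY, encoded.toString());
}

/** Restore the per-bucket maxima from the live commit user data on recovery. */
void bootstrap(Iterable<java.util.Map.Entry<String, String>> liveCommitData) {
    for (java.util.Map.Entry<String, String> entry : liveCommitData) {
        if (MAX_TIMESTAMPS_KEY.equals(entry.getKey())) {
            String[] parts = entry.getValue().split(",");
            for (int i = 0; i < parts.length && i < maxTimestampPerBucket.length; i++) {
                maxTimestampPerBucket[i] = Long.parseLong(parts[i]);
            }
        }
    }
}
```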

EngineConfig engineConfig,
int maxDocs,
BiFunction<Long, Long, LocalCheckpointTracker> localCheckpointTrackerSupplier,
MayHaveBeenIndexedBefore mayHaveBeenIndexedBefore

Can we add this new strategy object to the EngineConfig instead? I think we would need it there anyway to avoid having to extend EngineFactory? It also feels like it belongs there.

@@ -2879,21 +2823,12 @@ void updateRefreshedCheckpoint(long checkpoint) {

     @Override
     public final long getMaxSeenAutoIdTimestamp() {
-        return maxSeenAutoIdTimestamp.get();
+        return mayHaveBeenIndexedBefore.getMaxSeenAutoIdTimestamp();

This is used for safety after recovery. I wonder if you would not need something similar with a tsdb specific optimized append?

*/
void bootstrap(Iterable<Map.Entry<String, String>> liveCommitData);

long getMaxSeenAutoIdTimestamp();

I wonder if we should add a specific interface for this method (and the corresponding update method)? Otherwise I think you would have to return dummy data here in the tsdb version. I'd prefer the casting in InternalEngine, I think. Ideally we would turn this into something generic (like a recovery state) and pass a long. I wonder if we can do that without changing serialization; it might be possible?


void writerSegmentStats(SegmentsStats stats);

void commitData(Map<String, String> commitData);

Can we name this updateCommitData or similar, to signal that it is expected to update the commitData and not just read from it?


void handleNonPrimary(Index index);

void writerSegmentStats(SegmentsStats stats);

Can we name this updateSegmentsStats?

Comment on lines +31 to +48
/**
 * {@code true} if it's valid to call {@link #mayHaveBeenIndexedBefore}
 * on the provided {@link Index}, false otherwise. This should be fast
 * and only rely on state from the {@link Index}, not on any
 * internal state.
 */
boolean canOptimizeAddDocument(Index index);

/**
 * Returns {@code true} if the indexing operation may have already been
 * processed by the engine. Note that it is OK to rarely return true even
 * if this is not the case. However, a {@code false} return value must
 * always be correct.
 * <p>
 * This relies on state internal to the implementation and may modify
 * that state.
 */
boolean mayHaveBeenIndexedBefore(Index index);

Can we collapse those into one method? I see a conflict with the assertion in indexIntoLucene, but I think I would prefer to make that an explicit assertXYZ method instead.

@elasticsearchmachine changed the base branch from master to main July 22, 2022 23:08
@mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc removed the v8.6.0 label Nov 16, 2022
@martijnvg (Member) commented:
The index operation results for the tsdb track when running with this change:

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
| --- | --- | --- | --- | --- | --- | --- |
|                                                Min Throughput |                   index | 35402.8         | 36538.3         | 1135.55    | docs/s |   +3.21% |
|                                               Mean Throughput |                   index | 37938.5         | 40071.2         | 2132.69    | docs/s |   +5.62% |
|                                             Median Throughput |                   index | 37776           | 40121.3         | 2345.31    | docs/s |   +6.21% |
|                                                Max Throughput |                   index | 42319           | 44990.3         | 2671.25    | docs/s |   +6.31% |
|                                       50th percentile latency |                   index |   968.057       |   908.885       |  -59.1719  |     ms |   -6.11% |
|                                       90th percentile latency |                   index |  1382.28        |  1319.74        |  -62.5386  |     ms |   -4.52% |
|                                       99th percentile latency |                   index |  3577.88        |  3486.66        |  -91.2231  |     ms |   -2.55% |
|                                     99.9th percentile latency |                   index |  6542.95        |  6145.17        | -397.778   |     ms |   -6.08% |
|                                    99.99th percentile latency |                   index |  8484.75        |  7568.61        | -916.14    |     ms |  -10.80% |
|                                      100th percentile latency |                   index |  9014.89        |  8420.02        | -594.871   |     ms |   -6.60% |
|                                  50th percentile service time |                   index |   968.057       |   908.885       |  -59.1719  |     ms |   -6.11% |
|                                  90th percentile service time |                   index |  1382.28        |  1319.74        |  -62.5386  |     ms |   -4.52% |
|                                  99th percentile service time |                   index |  3577.88        |  3486.66        |  -91.2231  |     ms |   -2.55% |
|                                99.9th percentile service time |                   index |  6542.95        |  6145.17        | -397.778   |     ms |   -6.08% |
|                               99.99th percentile service time |                   index |  8484.75        |  7568.61        | -916.14    |     ms |  -10.80% |
|                                 100th percentile service time |                   index |  9014.89        |  8420.02        | -594.871   |     ms |   -6.60% |

Just attaching this here to keep a record of it. I ran with the benchmark defaults (indexing into a tsdb index, with 8 clients indexing concurrently).

@nik9000 (Member, Author) commented Aug 14, 2024

This one's so stale it won't go in. We might be able to reuse parts of it one day, but no need to keep it open.

@nik9000 closed this Aug 14, 2024