Extract append-only optimization from Engine #84771
Conversation
20750c0 to 838964f (compare)
Out of curiosity, what would the implementation look like for TSDB? Would it track the max timestamp for each timeseries or something like that?
I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps. Or something like that. Then I got distracted by that.
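As an illustration of that idea, here is a minimal sketch (class and method names invented, not code from this PR) that keys a small array of max timestamps off the low bits of the hashed TSID and only allows the append-only optimization when the incoming timestamp is strictly greater than its bucket's max:

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Hypothetical sketch (not code from this PR): track a max timestamp per
 * bucket of TSIDs so that a document whose timestamp is strictly greater than
 * its bucket's max cannot be a duplicate and can skip the id lookup.
 */
class BucketedMaxTimestamps {
    // 64 buckets = the low 6 bits of the hashed tsid; a nibble would give 16.
    private static final int BUCKETS = 64;
    private final AtomicLongArray maxTimestamps = new AtomicLongArray(BUCKETS);

    BucketedMaxTimestamps() {
        for (int i = 0; i < BUCKETS; i++) {
            maxTimestamps.set(i, Long.MIN_VALUE);
        }
    }

    /** true if the timestamp is newer than anything seen for this tsid's bucket. */
    boolean isNewerThanBucketMax(int hashedTsid, long timestamp) {
        return timestamp > maxTimestamps.get(bucket(hashedTsid));
    }

    /** Record the timestamp once the document has been indexed. */
    void update(int hashedTsid, long timestamp) {
        // atomic equivalent of max = Math.max(max, timestamp)
        maxTimestamps.getAndAccumulate(bucket(hashedTsid), timestamp, Math::max);
    }

    private static int bucket(int hashedTsid) {
        return hashedTsid & (BUCKETS - 1);
    }
}
```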
I just had a quick chat with @henningandersen about this PR and I believe that we have two main options at a high level. The first one is the one that you are suggesting: extracting the timestamp from the ID to optimize writes when this timestamp is greater than all other timestamps that have been seen before. The second one consists of generating IDs so that Elasticsearch and Lucene could do it mostly automatically, by having an ID that concatenates the timestamp in BE order and then the TSID.

The trade-off I'm seeing is that your approach requires more maintenance and is specific to timeseries data streams, but it is likely more space-efficient thanks to better sharing of prefixes of IDs, and more CPU-efficient in case data comes in timestamp order on a per-timeseries basis but not globally (which is not unlikely?).

I'm curious if you have any data about how much larger the index of the _id field would be if we generated the ID by putting the timestamp first instead of last?
Not yet. My plan is to see what changes we get from putting the timestamp at the front of the id. My semi-educated guess is that we won't be ok with the space expansion in the inverted index, and that it won't be fast enough. But I'm not going to do anything until I know. Are there any other things like this you think I should try?

If the query is fast enough but too big we can have a look at just plumbing that funny query in my other PR. If not fast enough we'd need something like the timestamp array as well.

I think we'll need both. But I'd be pleasantly surprised to need neither.
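Since both comments above discuss putting the timestamp at the front of the id, a minimal sketch of that encoding may help (hypothetical helper, not code from this PR): the _id is the big-endian timestamp followed by the TSID bytes, so byte-wise term ordering in the _id index becomes timestamp-first.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not code from this PR): build an _id as BE timestamp + TSID bytes. */
final class TimestampPrefixedId {
    private TimestampPrefixedId() {}

    /** Assumes non-negative epoch-millis timestamps; otherwise the sign bit would need flipping. */
    static byte[] encode(long timestampMillis, byte[] tsid) {
        // Big-endian so that byte-wise term ordering sorts ids by timestamp first.
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + tsid.length).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(timestampMillis);
        buf.put(tsid);
        return buf.array();
    }

    static long decodeTimestamp(byte[] id) {
        return ByteBuffer.wrap(id, 0, Long.BYTES).order(ByteOrder.BIG_ENDIAN).getLong();
    }
}
```

This makes the prefix-sharing trade-off above concrete: ids within one time series no longer share a long common prefix, which is what the disk-usage question is getting at.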
At the moment my gut feeling is that we'd need both too. I'm sort of hoping that data about disk usage would confirm that it's not ok so that I feel more confident about moving forward in the direction you suggest.

👍 I'm glad our stomachs agree. I'll try and get data. Though I might be getting distracted in the short term.
A few comments from an initial read.

> I was thinking of keeping some number of max timestamps, yeah. Like grabbing the low nibble from the hashed tsid and storing 64 timestamps.

IIRC, we could have millions of tsids. The way we make this safe today in failover cases is to ensure the replica knows the max unsafe timestamp. For tsdb, it would need to know the max timestamp, or we would have to bootstrap that from scratch on failover. Is the latter your thought here?

I wonder if that necessarily goes well, since then on failover, the first request for every tsid will have an extended duration. It might need the normal "does this id exist" check and/or also have to search the tsid to find the largest timestamp.

If we imagine a cluster maxed out (more or less) indexing with the optimization, it might fall over when a node dies? Or at least it might buffer up loads of data, reject some, and require retries from clients. This could even be the case for some relocations maybe, in particular if they have just one data stream with many tsids.
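To make the failover concern concrete, here is a rough sketch (names and structure hypothetical, not from this PR) of the pessimistic fallback described above: per-bucket max timestamps are not replicated, so a newly promoted primary starts with every bucket unknown and the first operation per bucket takes the normal duplicate-check path.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch (not code from this PR): a newly promoted primary has no
 * per-bucket max timestamps, so every bucket starts "unknown" and is only
 * repopulated once operations have gone through the slow path.
 */
class FailoverAwareTracker {
    private static final long UNKNOWN = Long.MAX_VALUE;
    private final long[] maxTimestamps;

    FailoverAwareTracker(int buckets) {
        maxTimestamps = new long[buckets];
        markAllUnknown();
    }

    /** Called on promotion to primary: forget everything the old primary knew. */
    synchronized void markAllUnknown() {
        Arrays.fill(maxTimestamps, UNKNOWN);
    }

    /** false forces the caller back onto the regular "does this id exist" check. */
    synchronized boolean canSkipDuplicateCheck(int bucket, long timestamp) {
        long max = maxTimestamps[bucket];
        return max != UNKNOWN && timestamp > max;
    }

    /** Repopulate the bucket once the slow path has established the operation is safe. */
    synchronized void update(int bucket, long timestamp) {
        long max = maxTimestamps[bucket];
        maxTimestamps[bucket] = max == UNKNOWN ? timestamp : Math.max(max, timestamp);
    }
}
```

Whether buckets are repopulated lazily like this or eagerly by searching each tsid for its largest timestamp is exactly the trade-off raised in the comment.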
EngineConfig engineConfig,
int maxDocs,
BiFunction<Long, Long, LocalCheckpointTracker> localCheckpointTrackerSupplier,
MayHaveBeenIndexedBefore mayHaveBeenIndexedBefore
Can we add this new strategy object to the EngineConfig instead? I think we would need it there anyway to avoid having to extend EngineFactory? It also feels like it belongs there.
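A tiny sketch of that suggestion (structure assumed; the real EngineConfig has many more fields and constructor arguments): the strategy rides along on the config so a plugin can supply it without extending EngineFactory.

```java
/** Hypothetical fragment, not the real EngineConfig. */
class EngineConfigSketch {
    /** Stand-in for the strategy interface introduced in this PR. */
    interface MayHaveBeenIndexedBefore {}

    private final MayHaveBeenIndexedBefore mayHaveBeenIndexedBefore;

    EngineConfigSketch(MayHaveBeenIndexedBefore strategy) {
        this.mayHaveBeenIndexedBefore = strategy;
    }

    MayHaveBeenIndexedBefore getMayHaveBeenIndexedBefore() {
        return mayHaveBeenIndexedBefore;
    }
}
```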
@@ -2879,21 +2823,12 @@ void updateRefreshedCheckpoint(long checkpoint) {
    @Override
    public final long getMaxSeenAutoIdTimestamp() {
-       return maxSeenAutoIdTimestamp.get();
+       return mayHaveBeenIndexedBefore.getMaxSeenAutoIdTimestamp();
This is used for safety after recovery. I wonder if you would not need something similar with a tsdb-specific optimized append?
 */
void bootstrap(Iterable<Map.Entry<String, String>> liveCommitData);

long getMaxSeenAutoIdTimestamp();
I wonder if we should add a specific interface for this method (and the corresponding update method)? Otherwise I think you would have to return dummy data here in the tsdb version? I'd prefer the casting in InternalEngine, I think. Ideally we would turn this into something generic (like a recovery state) and pass a long. I wonder if we can do that without changing serialization, might be possible?
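One possible shape for that split, as a hedged sketch (the interface name is invented; the method names come from the diff): the auto-id timestamp state sits behind its own small interface that only the existing append-only implementation provides, and InternalEngine casts when it needs it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical narrow interface: only the non-TSDB append-only strategy carries this state. */
interface AutoIdTimestampTracker {
    long getMaxSeenAutoIdTimestamp();

    void updateAutoIdTimestamp(long newTimestamp);
}

/** Sketch of the existing behavior implementing the narrow interface. */
class AppendOnlyStrategy implements AutoIdTimestampTracker {
    private final AtomicLong maxSeenAutoIdTimestamp = new AtomicLong(-1);

    @Override
    public long getMaxSeenAutoIdTimestamp() {
        return maxSeenAutoIdTimestamp.get();
    }

    @Override
    public void updateAutoIdTimestamp(long newTimestamp) {
        maxSeenAutoIdTimestamp.accumulateAndGet(newTimestamp, Math::max);
    }
}
```

The tsdb implementation would then simply not implement AutoIdTimestampTracker instead of returning dummy data.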
void writerSegmentStats(SegmentsStats stats);

void commitData(Map<String, String> commitData);
Can we name this updateCommitData or similar to signal that it is expected to update the commitData and not just read from it?
void handleNonPrimary(Index index);

void writerSegmentStats(SegmentsStats stats);
Can we name this updateSegmentsStats?
/**
 * {@code true} if it's valid to call {@link #mayHaveBeenIndexedBefore}
 * on the provided {@link Index}, false otherwise. This should be fast
 * and only rely on state from the {@link Index} and not rely on any
 * internal state.
 */
boolean canOptimizeAddDocument(Index index);

/**
 * Returns {@code true} if the indexing operation may have already been
 * processed by the engine. Note that it is OK to rarely return true even
 * if this is not the case. However a {@code false} return value must
 * always be correct.
 * <p>
 * This relies on state internal to the implementation and may modify
 * that state.
 */
boolean mayHaveBeenIndexedBefore(Index index);
Can we collapse those into one method? I see a conflict with the assertion in indexIntoLucene, but I think I would prefer to make that an explicit assertXYZ method instead then.
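A hedged sketch of that collapse (names invented; IndexOp stands in for the engine's Index operation type): one method answers the question, and the assertion currently made in indexIntoLucene becomes an explicit hook.

```java
/** Stand-in for the engine's Index operation type in this sketch. */
class IndexOp {}

/** Hypothetical collapsed contract replacing canOptimizeAddDocument + mayHaveBeenIndexedBefore. */
interface AppendOnlyCheck {
    /**
     * true if the operation may already have been processed and the optimization
     * must be skipped; a false return must always be correct.
     */
    boolean mayHaveBeenIndexedBefore(IndexOp index);

    /** Explicit hook for the assertion currently made inline in indexIntoLucene. */
    default void assertCanOptimizeAddDocument(IndexOp index) {
        // implementations can assert their preconditions here; no-op by default
    }
}
```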
The index operation results for the tsdb track when running with this change: [attached results not shown]

Just attaching this here to keep a record of it. I ran with the benchmark defaults (indexing into a tsdb index, 8 clients concurrently indexing).
This one's so stale it won't go in. We might be able to reuse parts of it one day, but no need to keep it open.
This extracts the logic for the "append only" optimization from Engine into a pluggable behavior class so that we can override it in TSDB.
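Pulling the diff snippets quoted in the review together, the extracted behavior roughly amounts to an interface along these lines. This is a reconstruction from the visible hunks rather than the exact code in the PR; Index and SegmentsStats refer to the engine types used in those hunks.

```java
import java.util.Map;

/** Reconstructed sketch of the pluggable "append only" behavior extracted from the engine. */
interface MayHaveBeenIndexedBefore {
    /** Fast check relying only on state carried by the operation itself. */
    boolean canOptimizeAddDocument(Index index);

    /** May rarely return true spuriously; a false return must always be correct. */
    boolean mayHaveBeenIndexedBefore(Index index);

    /** Restore internal state from the live commit user data on engine startup. */
    void bootstrap(Iterable<Map.Entry<String, String>> liveCommitData);

    /** Observe operations that do not originate on the primary. */
    void handleNonPrimary(Index index);

    long getMaxSeenAutoIdTimestamp();

    /** Contribute to segment stats (a rename to updateSegmentsStats was suggested above). */
    void writerSegmentStats(SegmentsStats stats);

    /** Add entries to the commit user data (a rename to updateCommitData was suggested above). */
    void commitData(Map<String, String> commitData);
}
```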