Optimise seeking by timestamp #22129
Comments
The current implementation for seeking by timestamp is here: Lines 727 to 774 in 30697bd
called from here: pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java Lines 1890 to 1936 in c99a51d
I guess the missed optimization is to use the ledger metadata as a first-level filter. pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpFindNewest.java Lines 83 to 137 in 82237d3
LedgerInfo contains the timestamp when the ledger was sealed (it got closed or was rolled over): pulsar/managed-ledger/src/main/proto/MLDataFormats.proto Lines 55 to 61 in 23f46a0
There could be an initial binary search that uses this information, which is available in ManagedLedgerImpl via pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java Lines 3844 to 3846 in e6cd005
I guess there is a gotcha: the ledger's timestamp comes from the broker's clock, but the seek uses the message publish time, which comes from the client's (publisher's) clock. There might be corner cases because of this.
There's also a related issue #10488.
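The first-level filtering suggested above could look roughly like the sketch below. It is not Pulsar code: `LedgerTimestampSearch`, `firstCandidate`, and the `skewMs` parameter are hypothetical names, and the ledgers are reduced to an array of their sealed timestamps in ledger order. The `skewMs` margin is one possible way to tolerate the broker/client clock difference mentioned above, under the assumption that the skew is bounded.

```java
// Hypothetical sketch; not part of Pulsar's managed-ledger API.
class LedgerTimestampSearch {

    /**
     * Binary search over sealed (closed) ledgers, given in ledger order.
     *
     * @param sealedAtMs broker-clock timestamps at which each ledger was sealed, ascending
     * @param targetMs   publish time being sought (client clock)
     * @param skewMs     assumed upper bound on broker/client clock skew
     * @return index of the first ledger that may contain the target, or
     *         sealedAtMs.length if only the currently open ledger can contain it
     */
    static int firstCandidate(long[] sealedAtMs, long targetMs, long skewMs) {
        // A ledger sealed before this threshold only holds older messages.
        long threshold = targetMs - skewMs;
        int lo = 0;
        int hi = sealedAtMs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sealedAtMs[mid] < threshold) {
                lo = mid + 1; // the whole ledger is older than the target, skip it
            } else {
                hi = mid;     // this ledger may contain the target, keep it
            }
        }
        return lo;
    }
}
```

The entry-level search would then start at the returned ledger instead of at the beginning of the topic. A larger `skewMs` makes the filter more conservative at the cost of scanning more entries.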
Search before asking
Motivation
Right now, seeking a reader or a consumer to a specific timestamp appears to be an unoptimised process that can take many seconds, or even over a minute, for larger topics (single-digit GB of data, tens of messages per second). A Slack comment from @lhotari suggests that seeking by timestamp is not optimised, so I'm proposing to optimise it as a valuable feature.
Solution
Seeking currently works by message ID or by timestamp. I assume (though I could be wrong) that seeking by message ID is optimised. I haven't looked at the implementation details properly and am just spitballing ideas, but something like a binary search on the time, or a TreeMap from timestamp to message ID (at any level of sparsity), might make seeking far faster.
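To make the TreeMap idea a bit more concrete, here is a minimal sketch. Everything in it is hypothetical (`SparseTimestampIndex`, `Position`, `onAppend`, `seekStart`); it is not how Pulsar stores or looks up positions. It assumes publish times are roughly non-decreasing within a topic, which client clocks do not guarantee.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a sparse timestamp -> position index.
class SparseTimestampIndex {

    // Stand-in for a message ID, reduced to (ledgerId, entryId).
    static final class Position {
        final long ledgerId;
        final long entryId;

        Position(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }
    }

    private final TreeMap<Long, Position> index = new TreeMap<>();
    private final int sampleEvery; // sparsity: index one message out of this many
    private long seen = 0;

    SparseTimestampIndex(int sampleEvery) {
        this.sampleEvery = sampleEvery;
    }

    // Called for each appended message; only every sampleEvery-th one is recorded.
    void onAppend(long publishTimeMs, long ledgerId, long entryId) {
        if (seen++ % sampleEvery == 0) {
            // Keep the earliest position seen for a given timestamp.
            index.putIfAbsent(publishTimeMs, new Position(ledgerId, entryId));
        }
    }

    // Returns an indexed position strictly before targetMs to start a forward
    // scan from, or null if the scan has to start at the beginning of the topic.
    Position seekStart(long targetMs) {
        Map.Entry<Long, Position> e = index.lowerEntry(targetMs);
        return e == null ? null : e.getValue();
    }
}
```

The lookup uses `lowerEntry` rather than `floorEntry` so that unindexed messages sharing the target timestamp are not skipped. A seek would then scan forward from the returned position for at most roughly `sampleEvery` messages, so the sparsity trades memory for scan length.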
Alternatives
No response
Anything else?
No response
Are you willing to submit a PR?