Records in the queues should be more than just a payload #2834
I can see two driving forces that can influence how this change should be implemented. First, since the data that we handle here can be huge, it is performance sensitive. That part is pretty clear. The second aspect (that I don't quite understand the motivation behind) is having …
Assuming that we want to keep …
@guilload, did you mean using …?
The interface remains backward compatible with the current implementation, so it is kind of a combination of 1) and 2). It should also simplify migration to …
We have plenty of contradictory needs and nice-to-have things we are considering: …
One way to approach it is by separating the concerns. Compression sounds like something mrecordlog should worry about. The validation is something quickwit should worry about. There are a couple of approaches there, but at the end of the day, if you want it to be stored on disk, we still need to turn it into a series of bytes, so I think that concern can stay in the quickwit land. And zero-copy is the concern of this interface. By introducing the Buf trait we can delay the actual act of serialization as far as possible, assuming that serialization always happens. If we think that some sort of pre-parsed representation of records will be needed, we can add a provision for a dual representation of the records. For example, records can be stored in memory in their pre-parsed object format, or serialized and deserialized into bytes for compression or on-disk storage, at mrecordlog's discretion, when and if needed.
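To make the dual-representation idea concrete, here is a minimal sketch, assuming a hypothetical `Record` type (the name and the use of `serde_json`/`bytes` are illustrative, not the actual codebase): the document stays pre-parsed in memory and is serialized lazily into a `Bytes` value, which implements `bytes::Buf`, only when the log needs a byte view.

```rust
use bytes::Bytes;

/// Hypothetical dual representation of a queue record: either the
/// pre-parsed document or its serialized bytes. Serialization is
/// delayed until the log actually needs a byte view.
enum Record {
    Parsed(serde_json::Value),
    Serialized(Bytes),
}

impl Record {
    /// Return the serialized bytes, converting (and caching) lazily.
    /// `Bytes` implements `bytes::Buf`, so this is what a `Buf`-based
    /// append API would consume.
    fn to_bytes(&mut self) -> Bytes {
        if let Record::Parsed(doc) = self {
            let bytes = Bytes::from(serde_json::to_vec(doc).expect("valid JSON"));
            *self = Record::Serialized(bytes);
        }
        match self {
            // Cheap clone: `Bytes` is reference-counted, no copy here.
            Record::Serialized(bytes) => bytes.clone(),
            Record::Parsed(_) => unreachable!("converted above"),
        }
    }
}
```

Compression or any on-disk encoding could then operate on the serialized form at mrecordlog's discretion, without the producer committing to a representation upfront.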
I personally vote for forgetting about chasing zero-copy and sticking to JSON for the moment. @guilload, what do you think?
We're all aligned on what we want the perfect ingest pipeline (early commit, early schema validation, zero-copy, compression, etc.) to look like by Quickwit 0.6.0 or 0.7.0. So how do we get there? I'd like to land the user-friendly features before the performance features because we have some headroom there. What we do today is not particularly clever, yet our indexing throughput is still good because the throughput of … So here's my proposal:

Iteration 1: early commit
…

Iteration 2: early schema validation
After step …

Iteration 3: …
OK on iterations 1 and 2, with the plan of probably doing 4 before 3, since the validation is index specific. I don't understand iteration 3, but I think we don't need to discuss it right now. My high-level point is that I think compression and zero-copy are mutually exclusive. Let's go with iteration 1 anyway.
Picks up changes in quickwit-oss/mrecordlog#31. Relates to #2834.
Since we are changing the binary format of data that can persist across server restarts, what's your current view on the upgrade experience? I cannot find any place where the format version is stored, and the current record is basically just a list of bytes, so theoretically it can be anything. Is handling this gracefully a valid goal? If it is, I can think of a couple of solutions, but they will abuse the fact that the current record (if it is valid) will most likely start with `{`.
Unfortunately, in Quickwit 0.5.0 and potentially 0.6.0, the file format (tantivy) will change and won't be backward compatible. Consequently, we don't need to worry about the upgrade experience for this specific feature because I suspect our users will upgrade with downtime and will reingest their data from scratch. Starting with Quickwit 0.6.0, we want to maintain backward compatibility with future versions.
I see, so you have to start with an empty qwdata, and therefore the situation that I have described will never happen? I think it is a good opportunity to build in some versioning mechanism then, to make the life of whoever works on this next easier. Ideally, it would also be nice to build some mechanism into qwdata to prevent a downgraded server from starting on a queue that was touched by a server with a later serialization version. That way we only have to worry about reading older records, and we can consider anything we cannot recognize as record log corruption and safely ignore it without fear of losing good new records.
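As a concrete illustration, here is a minimal sketch of such a versioning mechanism (hypothetical names and framing, not the actual mrecordlog format), assuming legacy records are raw JSON documents and therefore always start with the byte `{` (0x7B), which leaves every smaller leading byte free to act as a version tag:

```rust
/// Hypothetical framing for versioned records. Legacy records are raw
/// JSON documents and always start with b'{' (0x7B), so any leading
/// byte below that is unambiguously a version tag.
const FORMAT_VERSION: u8 = 1;

fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(payload.len() + 1);
    buf.push(FORMAT_VERSION);
    buf.extend_from_slice(payload);
    buf
}

enum DecodedRecord<'a> {
    /// Pre-versioning record: the raw JSON document itself.
    Legacy(&'a [u8]),
    /// Versioned record: version tag followed by the payload.
    Versioned { version: u8, payload: &'a [u8] },
    /// Unknown leading byte (or a version newer than ours):
    /// treated as record log corruption and skipped.
    Corrupt,
}

fn decode_record(record: &[u8]) -> DecodedRecord<'_> {
    match record.first().copied() {
        Some(b'{') => DecodedRecord::Legacy(record),
        Some(version) if (1..=FORMAT_VERSION).contains(&version) => {
            DecodedRecord::Versioned { version, payload: &record[1..] }
        }
        _ => DecodedRecord::Corrupt,
    }
}
```

A downgraded server would then refuse (or skip) records carrying a version tag above its own `FORMAT_VERSION`, which is exactly the safety property described above.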
Makes it possible to inject different commands such as force commit alongside documents into the ingest stream. See #2834.
Should we close this issue and move iterations 1, 2, 4, and 5 into separate issues or a meta issue with an appropriate title?
Yeah, that makes sense. You can open new issues.
Closing this as done. We had discussed some interesting ideas on this issue related to the following other issues: …
The PushAPI relies on mrecordlog to store the documents that were ingested as is. This is fine for the moment, but it will prevent us from further evolution... In particular, #2699 will most likely require us to append some commit Command to the queue. We need the records in the queue to not be directly documents, but rather something closer to …, with an ad hoc, simple, versioned, and extensible format.
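For example, a minimal sketch of what such a record could look like (hypothetical names; one possible reading of the elided "something closer to" above):

```rust
/// Hypothetical queue record: documents and control commands share one
/// extensible representation instead of the queue holding raw payloads.
enum QueueRecord {
    /// An ingested document payload.
    Doc(Vec<u8>),
    /// A control command, e.g. the force commit needed for #2699.
    Commit,
}

/// A leading version byte keeps the format extensible: new variants can
/// ship behind a bumped version without breaking readers of old records.
fn encode(record: &QueueRecord, version: u8) -> Vec<u8> {
    let mut buf = vec![version];
    match record {
        QueueRecord::Doc(payload) => {
            buf.push(0); // variant tag for documents
            buf.extend_from_slice(payload);
        }
        QueueRecord::Commit => buf.push(1), // variant tag for commit
    }
    buf
}
```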