Persistence Query for Cassandra #77
I have started working on this issue. Will update with progress and create a PR when ready. |
Awesome, I had planned to start on it, but go for it. This PR brings the dependency up to 2.4-RC1, which should get merged soon, so it's worth basing your work off it: #69
Very cool, thanks for hopping on to it @zapletal-martin :-) |
+1 Thanks @zapletal-martin for bringing this here. This is a real MUST for us. We're thrilled to improve the READ layer in our CQRS stuff. We'll have a look as soon as you share your work. Of course @krasserm should coordinate this effort IMHO :) |
I'd love to hear more use cases, my initial thoughts:
There's also a possibility we don't need to go via the query interface, though this is a lot of work. Spark streaming has similar requirements, check out https://issues.apache.org/jira/browse/CASSANDRA-8844 but that is a long way off :) I definitely think we should discuss this in depth before akka 2.4 / plugin 0.4 is released in case we want to make any data model changes.
Thanks @chbatey . As I see it there are 3 options:
If we wanted to use 3) the ideal case would be to have defined ordering (persistenceIds, events by persistenceId and events by tag) and be able to query using an offset. AllPersistenceIds - it is relatively simple to achieve at the moment without any data model changes, but as you mention it may not be feasible considering a large amount of events and the inability to easily detect a change without a notion of order (set difference). A more specialised data structure with the above-mentioned properties would work much better. EventsByPersistenceId - paging could work well; however, it still does not support offset queries. We may need to run a query in the fromSequenceNr to toSequenceNr range. That requirement is directly in the API and would be required considering a pull-only approach.
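To make the offset-based resumption in option 3) concrete, here is a small self-contained Scala sketch. The names (`TaggedEvent`, `OffsetIndexedLog`, `eventsAfter`) are illustrative assumptions, not the plugin's API:

```scala
// Illustrative in-memory stand-in for an offset-indexed events-by-tag table.
final case class TaggedEvent(offset: Long, persistenceId: String, payload: String)

final class OffsetIndexedLog {
  private var log = Vector.empty[TaggedEvent]

  def append(e: TaggedEvent): Unit = log :+= e

  // The query shape discussed above: everything after a given offset, in offset order.
  def eventsAfter(offset: Long): Vector[TaggedEvent] =
    log.filter(_.offset > offset).sortBy(_.offset)
}

// A consumer resumes by remembering the last offset it processed.
def resume(log: OffsetIndexedLog, lastProcessed: Long): (Vector[TaggedEvent], Long) = {
  val events = log.eventsAfter(lastProcessed)
  val newOffset = events.lastOption.map(_.offset).getOrElse(lastProcessed)
  (events, newOffset)
}
```

A projection would persist the returned offset after each batch, so a restart simply replays everything after the stored offset instead of diffing whole result sets.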
i'd maybe incorrectly assumed that this had to work in a different JVM to the writer, WDYT? |
The read and write journal running in the same actor system was probably my incorrect assumption based on reading the LevelDb implementation. I assume you should be able to execute the query from anywhere and therefore that option is not relevant unless we want to communicate remotely. |
Just in case this can help. In our application, we have lots of shards of actors where each shard represents an aggregate in terms of DDD. So, every instance of each aggregate type has its own persistenceId and lifecycle of course. So, from my point of view, all we need is to project every event raised from the aggregate and update the views accordingly. I can imagine that EventsByPersistenceId wouldn't be the way to go because in that case we'd have a live stream instance per persistenceId, one per order for example, unless we have some kind of passivation mechanism as we already do with aggregates. Maybe the EventsByTag query could work if we tag our shards with a label, for example "order" in the classical e-commerce analogy. In that case, it'd be possible to maintain a stream-per-shard approach. I'm not sure if I'm understanding the proposed query API correctly but IMHO our case could be viewed as a canonical example of the DDD/CQRS and Event Sourcing approach. Hope this helps.
Here are my initial thoughts on the general architecture, constraints, implementation options and open issues.
Architecture
Implementation options
Before we start discussing implementation details, I'd find it helpful that we agree on the general architecture, constraints and high-level implementation options. Thoughts? |
Thanks @krasserm for a detailed analysis. I conceptually agree with what you have outlined. There are still unknowns regarding the actual implementation, but we will surely discuss those later. I have done some work, mostly on the Stream Production component. It may not be relevant as we need to discuss how exactly it will work, but I envision it to be relatively straightforward. One thing we must take into consideration during design is backwards compatibility. It may be dependent on the actual implementation, but I assume we want to be able to correctly apply persistence query to events stored before the change as well. I have a few clarification questions. I prefer to ask to make sure we all have a shared, clear understanding. Re 1: "Consequently, we cannot easily determine what have been the last n events, but this is very important for efficiently updating streams with live events?" Shouldn't the typical query rather be events after the nth (i.e. an offset) to determine what is new in the stream? "Since Akka Persistence requires read-your-write consistency for PersistentActors, eventually consistent index reads must be completed with reads of remaining events from the master event table." I probably don't understand the wording and proposed responsibilities of index and master data clearly. What do you mean by index and master here? Would the current master change? Why wouldn't Akka Persistence read directly from the master table similarly to current functionality? Re 2: Good point. Let's find this out. Re 3: "and repeated queries would result in the same event ordering (although not required by akka-persistence-query but relevant if we plan to support insertion order event delivery, for example)". I believe the same event ordering across repeated queries is a requirement of akka-persistence-query.
The documentation for EventsByPersistenceId reads "The same prefix of stream elements (in same order) are returned for multiple executions of the query, except for when events have been deleted." and for EventsByTag "The returned event stream is ordered by the offset (tag sequence number), which corresponds to the same order as the write journal stored the events. The same stream elements (in same order) are returned for multiple executions of the query." |
We can also achieve that with a migration tool or script. Breaking the existing schema shouldn't be an issue then.
Yes exactly, my description was rather imprecise.
With master I mean the raw event log(s) as written by journal actors. With index, I mean tables derived from that log to serve queries efficiently, such a query by tag, for example. Instead of custom tables, this can also be a Cassandra secondary index but I'm not sure at the moment if this has an impact on write throughput, causal consistency or has other issues. Another example are Cassandra materialized views, as already mentioned in a previous comment. Anyway, I hope this clarifies what I mean with master and index.
If we want to produce streams with journal insertion-order then we should consider a change of the master schema. Instead of ordering events by (
With the above changes in the master, a query by
Ok, didn't know that. |
Breaking an existing schema is not an issue, but I initially thought (re)ordering information might be. See below. Currently events are ordered by persistenceId and sequenceNr. It maintains partial ordering per CassandraJournal per persistenceId. Unlike Eventuate, persistenceIds can not be shared by multiple instances in Akka Persistence so it was fine.
Sorry, the ordering requirements I mentioned in previous comment were not accurate and were only true for LevelDB journal. The documentation clearly reads "Akka persistence query is purposely designed to be a very loosely specified API. A very important thing to keep in mind when using queries spanning multiple persistenceIds, such as EventsByTag is that the order of events at which the events appear in the stream rarely is guaranteed (or stable between materializations). Journals may choose to opt for strict ordering of the events, and should then document explicitly what kind of ordering guarantee they provide - for example "ordered by timestamp ascending, independently of persistenceId" is easy to achieve on relational databases, yet may be hard to implement efficiently on plain key-value datastores.". You however mentioned we would want to achieve insertion order to be able to efficiently update a stream, to be able to achieve resumable projections and stability between materializations which I agree with (although this is not a requirement). You however mentioned that would require
Is that correct? Wouldn't it rather require order across CassandraJournal instances and persistenceIds/tags (for AllPersistenceIds and EventsByTag). I am avoiding the term total order, because of lighter requirements if documented properly, see below. Since that requirement did not exist previously I thought it could potentially be challenging to infer the knowledge about order. Tags are yet to be introduced so backwards compatibility is not an issue. We just need to design how to achieve the order going forward. The ordering requirements are not too strict across instances as we don't necessarily need to know the real total order (we just need it persisted across materializations) and could therefore avoid coordination to obtain it. For AllPersistenceIds and EventsByPersistenceId we should be able to infer the knowledge during migration if required. We may need to implement tagging for Akka Persistence Cassandra which I believe does not exist yet. I will create a separate issue for this, but it should be aligned with the outcome of our discussion here. I will work on designing of queries and data representation and report back when ready. |
You can create a total order without coordinating among journal actors. The simplest case is just merging the streams written by each journal actor. If the merged stream shall be repeatable it must be persisted before delivering it. This is fine as long as streams written by different journal actors are independent of each other. This is usually not the case: if a persistent actor is migrated from one journal actor to another (for example, during a failover in a cluster), there is a dependency because events with the same persistenceId are spread across several streams. Simply merging these streams (i.e. without further constraints) might deliver these events in the wrong order. What is therefore needed is a stream merge logic that enforces causal delivery per persistenceId.
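A minimal sketch of such a merge logic, assuming in-memory streams and illustrative names (this is not the plugin's implementation): events are emitted in arrival order, but an event is held back until its per-persistenceId predecessor has been emitted.

```scala
import scala.collection.mutable

// Hypothetical causality-preserving merge: input is the interleaved arrival order
// of events from several journal-actor streams; output preserves sequenceNr order
// per persistenceId by buffering events whose predecessor has not been seen yet.
final case class Evt(persistenceId: String, sequenceNr: Long)

def causalMerge(arrivalOrder: Seq[Evt]): Seq[Evt] = {
  val next = mutable.Map.empty[String, Long].withDefaultValue(1L)
  val buffered = mutable.ArrayBuffer.empty[Evt]
  val out = mutable.ArrayBuffer.empty[Evt]

  def emit(e: Evt): Unit = { out += e; next(e.persistenceId) = e.sequenceNr + 1 }

  // After an emit, buffered successors may have become deliverable.
  def drain(): Unit = {
    var progressed = true
    while (progressed) {
      buffered.find(e => e.sequenceNr == next(e.persistenceId)) match {
        case Some(e) => buffered -= e; emit(e)
        case None    => progressed = false
      }
    }
  }

  arrivalOrder.foreach { e =>
    if (e.sequenceNr == next(e.persistenceId)) { emit(e); drain() }
    else buffered += e // predecessor not yet seen, e.g. the actor migrated journals
  }
  out.toList // events still in `buffered` would indicate gaps in the input
}
```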
Regarding re-ordering of events written by older journal versions, I wouldn't even try to do that but rather deliver the events for each persistent actor independently, either sequentially or interleaved. Applications shouldn't care anyway because they must assume that events with different persistenceIds are not ordered relative to each other. In other words, events written by older journal versions can be used but the new ordering guarantees only apply to events written by the new version.
To clarify, the aforementioned total order is not necessarily wall clock time order (although it may come close). |
Thanks @krasserm . It is a significant change to current functionality. It will be crucial to implement the stream merging / master-to-index transformation. It should then be possible to achieve the desired guarantees as described before. One thing where I am still not completely clear, architecture-wise, is the indisputable advantage of separating master and index. The option I am considering as an alternative would be to have just the index part (e.g. events by persistence id - same as currently, events by tag, and some efficient query structure for persistenceIds). That should efficiently support all the required queries, including current needs for replay etc. CassandraJournal would update all 3 tables directly when writing, rather than updating the master and having a separate process for the denormalized replication. I did not yet think about the table representations, but it could potentially simplify the implementation (master-to-index replication and replay, which are not trivial problems), maybe at the expense of some write speed (if any) and generality (e.g. supporting currently not required queries). Or am I missing something? I have implemented the Stream Production component supporting live streams, refresh interval, max buffer size and causal consistency, given the aforementioned index structure that can be queried by offset, as an ActorPublisher that can be used to create an Akka Streams Source. The next steps will be to start implementing the master and index structures and the merging component. I do not yet have a clear vision of the implementation so please feel free to suggest. If materialized views support the required guarantees that could simplify the implementation, but that would mean a dependency on Cassandra 3, which I am not sure is an option. Since the whole issue is a relatively large amount of work, should I create multiple smaller tickets and smaller PRs into a separate branch, or how would you like to see the work managed?
Regardless of the master-index transformation solution we need to have a discussion about data representation.
EventsByPersistenceIdQueries
The existing table representation PRIMARY KEY ((persistence_id, partition_nr), sequence_nr) should efficiently support the above query and therefore both Persistence Query and replaying. It is also convenient that, thanks to the per-persistence_id order, partitioning can be easily managed even across multiple CassandraJournal instances.
EventsByTagQueries
This looks similar to previous query, but is actually quite different, because tags span persistence_ids and CassandraJournal instances. There are options how to implement this:
AllPersistenceIdsQueries
This query is more difficult. We would ideally store just the distinct values with idempotency attribute so that inserts of the same persistence_id would have no effect (insertion time ordered set). The options could be
MasterQueries
If we had master data we would have to implement completion of reads during replay to ensure read-your-writes (RYOW) consistency as mentioned before. We would therefore have to be able to look up events for a given persistence_id with sequence_nr higher than the highest sequence_nr found in the index table. If we didn't use a master we could update all the tables in a batch, with the potential impact of such an approach. As for transforming data from master to index, it may not be trivial, because 1) any CassandraJournal may be removed at any time and therefore the transformation must not be tied to it, and 2) during replay, if we wanted to complete the read from the master, we may need to check multiple CassandraJournal ordered partitions, again because an actor can change CassandraJournal. Feedback and Cassandra data modelling advice appreciated.
It's the same advantage all databases get from internally using a transaction log (= master) and deriving structured tables from it (= index). With event sourcing and CQRS this separation is visible to the application, and entries in the transaction log (= event log) are usually not deleted. Writes to the master should be atomic and isolated (see also #48). This is in general independent of how the master is structured. You could write a single log (per journal actor) or n logical logs, one per
How do you handle failures in this case, especially with respect to atomicity and isolation? Assuming that different index writes go to different partitions, you could use a logged C* batch to ensure atomicity (not isolation) but this would break the read-your-write consistency requirement of Akka Persistence. Alternatively, the journal actor could make an isolated and atomic write to the master and, if this write succeeds, make further index writes. This would meet the read-your-write consistency requirement but introduces sequential writes, one batch write to the master and then one or more (concurrent) writes to the index, which will reduce write throughput. Furthermore, if index writes fail, you need to retry them, and this is best done with a separate process/worker ... and we again end up with the architecture that separates a master from an index. Hope this makes my motivations for the proposed architecture clearer. There's nothing special about it; it only mirrors how most databases internally work. Anyway, for progressing with the implementation, I'm fine to have an experimental version where the journal actor makes all the writes as long as we document the consequences and later migrate to a clear master - index separation.
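A toy in-memory sketch of this write path, with illustrative names (not the plugin's code): the master write is what acknowledges the event, the index write may fail or lag and would be retried by a separate worker, and replay reads only from the master so read-your-write consistency holds regardless of index state.

```scala
import scala.collection.mutable

// Illustrative event record; `tag` drives the derived index.
final case class JEvent(persistenceId: String, sequenceNr: Long, tag: String)

final class Journal {
  private val master = mutable.ArrayBuffer.empty[JEvent]   // the raw event log
  val tagIndex = mutable.Map.empty[String, Vector[JEvent]] // derived, may lag

  // `indexSucceeds` simulates an index write that can fail; the master write
  // alone determines whether the event is durably stored.
  def write(e: JEvent, indexSucceeds: Boolean = true): Unit = {
    master += e // step 1: atomic, isolated master write
    if (indexSucceeds) // step 2: index write; a worker would retry failures
      tagIndex(e.tag) = tagIndex.getOrElse(e.tag, Vector.empty) :+ e
  }

  // Replay never depends on the (possibly lagging) index.
  def replay(pid: String, fromSequenceNr: Long): Seq[JEvent] =
    master.filter(e => e.persistenceId == pid && e.sequenceNr >= fromSequenceNr).toList
}
```

Even when the index write fails, a recovering actor still sees all of its own events, which is the read-your-write property the comment argues for.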
Great. Looking forward to review the pull request(s).
+1 |
Here are some thoughts regarding data processing and representation of which I think they could solve the issues you mentioned in your previous comment. I make the following assumptions:
Using these assumptions:
To achieve read-your-write consistency from the index (needed during replay, for example) the same mechanism as for resuming index updates can be used: first read the index, then complete the index read by reading from a merged stream from the master. This approach doesn't only make us more independent from Cassandra implementation details such as secondary indices, materialized views or whatever, it could also allow us to do master data management with a different storage backend, such as Kafka, for example (using a Kafka partition per journal actor). Following this approach would make it rather easy to reason about data consistency and deal with failures IMO. It is furthermore very similar to how many stream processing and data analytics pipelines are structured (in contrast to updating several indices simultaneously from a single writer). The only critical thing we need to implement is a stateful, causality-preserving stream merging from master logs (for which solutions definitely exist).
@krasserm @zapletal-martin Now, when Cassandra 3.0 is released with the new materialized views, do you still see any reason for not using them? The EventsByTag query is important for the project that I'm currently working on so I would like to help out implementing it. I played around with materialized views and here is a cqlsh session for EventsByTag: https://gist.github.com/patriknw/4bcec28b8d3e5c5e56cc
hey @patriknw is this going to production any time soon? MVs are a very experimental feature; I would wait for a major release before using them.
so you are saying that MVs in Cassandra 3.0 is not production ready? |
Definitely not; it normally takes 6 months for a major release to become stable, especially for any headline feature, e.g. Light Weight Transactions in 2.0.
Thanks for the information. |
For EventsByTag I can't see any issues feature-wise. However, we'd need to be careful about partition size; we wouldn't want it to get to more than a few million events with the same tag. To get around this problem for persistenceId in the raw data table we add our own synthetic partitioning, which we could also do for the eventsbytag table if it was hand-crafted rather than using the built-in MV feature. Or, if we added year, month, day columns to the original table, we could use them in the MV partition key. This would make queries slower due to multi-partition scanning for cases where there aren't many events for a tag (unless you always query for the latest events less than a day old) but would keep partitions to a max size of one day's events for a tag.
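A sketch of that day-bucketing idea in Scala, with illustrative names (the real bucket granularity and partition key would be a design decision, not what the plugin actually does):

```scala
import java.time.{Instant, LocalDate, ZoneOffset}
import java.time.format.DateTimeFormatter

// Synthetic day-bucket: the partition key of a hand-crafted events-by-tag table
// would become (tag, bucket), capping a tag's partition at one day of events.
val fmt = DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC)

def timeBucket(writeTime: Instant): String = fmt.format(writeTime)

// Reading "all events for a tag from day X to day Y" then scans one partition
// per day, which is the multi-partition cost mentioned above.
def bucketsBetween(from: LocalDate, to: LocalDate): Seq[String] =
  Iterator.iterate(from)(_.plusDays(1))
    .takeWhile(!_.isAfter(to))
    .map(d => fmt.format(d))
    .toSeq
```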
That is a good point. Another thing I'm uncertain about is what we can use as offset column. I was thinking of a |
I would generally be in favour of using materialized views over custom solution, because it would simplify the solution and some of the difficult problems would be solved for us. But there are a few things I can see as potential issues.
I share your concerns @zapletal-martin, especially the causal consistency concern. Since our raw data table is partitioned and there's only eventual consistency between a table and its corresponding MV, we might end up in the following situation: given that event_i is written to partition_j and then event_(i+1) is written to partition_(j+1), the MV update for partition_(j+1) may occur before that for partition_j (as we only have eventual consistency guarantees). A read between these two MV updates violates causal consistency if we read event_(i+1) from the MV but not event_i. The situation gets even more problematic if the MV update order is not even defined for writes to the same partition. With more problematic I mean that the chances for inconsistent reads would significantly increase. @chbatey are there any MV update ordering guarantees given by Cassandra 3.0 to avoid these potential problems?
No. I had a similar example typed up on my phone. The writes to the MV are asynchronous so a newer MV write could potentially overtake an older one.
Interesting questions, indeed. They discuss consistency at length in CASSANDRA-6477, but I'm not sure what the conclusion or implementation is. I assume that the MV is updated in the same order as the base table for one partition, i.e. we have causal consistency per persistenceId (with some caveat when we change partitionId for a persistenceId). I assume that we don't have causal consistency across different persistenceIds, and even though it would be convenient to have that I don't think it's critical. We can achieve a best-effort experience by ordering on a timeuuid on the read side and delaying reading the tip by a few seconds. That will not work in case of network partitions, where MV updates are delayed a long time, and that is also why I don't think we should guarantee exact same results for a query that is run multiple times. A later query may return more events than an earlier query, because the MV was updated with old events that got stuck during the network partition. By delivering the unique timeuuid for each event to the application, it can retry the queries and filter out duplicates. We can also provide a count query for quickly checking if there are any more events.
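The retry-and-filter idea can be sketched like this (illustrative names; a high-water mark per persistenceId stands in for tracking timeuuids):

```scala
import scala.collection.mutable

// Illustrative delivered-event record keyed by per-persistenceId sequenceNr.
final case class Seen(persistenceId: String, sequenceNr: Long)

// Drops events already delivered when a query is retried. Note the trade-off
// discussed above: a sequenceNr high-water mark also drops late events that
// surface after a healed partition, whereas tracking unique timeuuids would
// let those through as new.
final class Deduplicator {
  private val highWater = mutable.Map.empty[String, Long]

  def filterNew(events: Seq[Seen]): Seq[Seen] =
    events.filter { e =>
      val isNew = e.sequenceNr > highWater.getOrElse(e.persistenceId, 0L)
      if (isNew) highWater(e.persistenceId) = e.sequenceNr
      isNew
    }
}
```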
I am not sure we can assume that. Writes for a persistenceId's seq nr 1 could go to one node, seq nr 2 could go to another node, and the async replication of the MV for write two could become available before write one's. Joel Knighton (@joelknighton) did some testing with Jepsen for MVs, will ask him to take a look.
Yes that is a solution that would work. It however explicitly chooses certain tradeoffs (e.g. not guaranteeing exact same results, causality and exploiting randomness instead of guaranteed correctness). Given how MVs seem to work we would have to choose some tradeoffs in any case. It seems the decision is not as much about implementation details, but about conceptual correctness, implementation complexity, use cases and requirements of persistence query. |
I attempted an implementation of the solution that uses Cassandra's materialized views. It involves storing tags, changes to replay, and an event-by-tag query actor. The implementation seems to be pretty simple for the eventsByTag query. We could subpartition the data based on a timestamp (either configurable by the user based on their needs, or year/month/day etc.). Alternatively we could use a different storage format where persistenceId would not be the primary key, to give us more freedom, but that would complicate recovery. There seem to be some caveats; for example, we would have to have multiple rows per persistenceId-sequenceNr combination (an event can have multiple tags and a C* collection is not an option in this case), which complicates replay and stream emitting slightly. The QueryActorPublisher we have should work for the EventsByTag query without major changes. Overall it seems that the implementation of EventsByTag is pretty quick and straightforward; AllPersistenceIds or other complicated queries not as much, though. To summarise the advantages and disadvantages to help us make the best decision:
Materialized views
Disadvantages
Custom solution
Disadvantages
Hybrid approach
Is there anything I missed? I think both solutions are relevant. As mentioned previously we need to consider conceptual correctness, but also use cases, tradeoffs and requirements of persistence query (which are not as strict in terms of guarantees or even which queries must be supported). It would be great to discuss and drive the decision making so we can proceed with the implementation. (I will have quite some time over the next weeks to contribute :))
@zapletal-martin thanks for your spot-on analysis. At least for the projects I work(ed) on
would be a blocker. Furthermore, I also think that
is an important aspect for the long-term evolution of akka-persistence-xxx storage plugins. For example, using a Kafka/Cassandra hybrid where raw logs (written by journal actors) are stored in Kafka topic partitions and indices in Cassandra might be an interesting solution. I also like the hybrid approach you proposed, using MVs where appropriate (e.g. AllPersistenceIds) but a custom solution where clear semantics (causality) and repeatability of query results are important. Here's my +1 for a focus on a custom solution, using a hybrid where appropriate. |
I have also put together a prototype of EventsByTag query using materialized view. I see no blockers for the needs I have in my current project.
I think I can achieve the right replay order for each persistenceId without too much trouble. As I understand it, you want more causality guarantees across different persistenceIds. With the custom indexer you have outlined you can achieve causality per writing journal instance. To be honest, I think it can be dangerous to rely on something like that because it's not location transparent and would not fit well with Cluster Sharding. Have I misunderstood something? I don't think That said, I agree that a custom solution gives more flexibility and power. I suggest that @zapletal-martin continues working on that and I can put together a first I'm also happy to help out with reviewing, if piecemeal pull requests are created.
@patriknw @zapletal-martin sounds good to me! Regarding
I let @zapletal-martin answer that :-) |
Replay order is not an issue if you are storing events partitioned by persistenceId as currently. I believe the goal is to achieve causality control for view precomputation, e.g. per persistenceId or tag, so that you avoid issues such as a first query returning events 1,2,4 and the same query run again returning 1,2,3,4.
I am not sure I fully understand your comment. I assume by location transparency you mean that an actor can live in a cluster and is referenced transparently, whereas a journal is 'physical' and may therefore be added/removed and actors rebalanced. That is true, but something we accounted for. The journal must have a unique id, which should however be possible to reuse. From an external viewer's perspective I think it is location transparent, or the location transparency is achieved in the stream merging mechanism where the independent streams are reconciled. I do not see any issue regarding cluster sharding (actually I use sharding in all my tests). I may be missing something though.
@krasserm, @patriknw sounds good. There are some conflicts that may need changes and migration (e.g. different table definitions, replay etc.), but that is expected. Feel free to use any of the existing code from the wip-persistence-query branch. The read journal, configuration classes, query actors etc. are there and are working (eventsByPersistenceId is there and eventsByTag should not be too different using the existing base class), tested and could easily be reused. Also @patriknw I would be more than happy to help, so please let me know if I can contribute (e.g. the read side, which is partially already done in wip-persistence-query, or other parts that require new work such as replay etc.)! What I am thinking at the moment is I could help Patrik build eventsByPersistenceId and eventsByTag using MVs with the aforementioned tradeoffs, which should be relatively straightforward, and we can then take our time to try to build the PoC and assess its correctness, performance and scalability.
Let me clarify with an example. We have persistent actors A and B.
All these events are tagged with the same tag. You want to guarantee causal consistency so that EventsByTag returns a stream with events in order a1, a2, b1, b2 if A and B were using the same journal instance, but if A and B were not using the same journal instance (i.e. located on different cluster nodes) you can't guarantee that b1 comes after a2 in the query stream. Therefore I challenge the usefulness of the causal consistency per journal that you seem to find so important. The order a1 -> a2 and b1 -> b2 must of course be maintained.
This cannot be guaranteed when making a query by tag from the MV. This may arise if
This is what we mean when talking about causal consistency. We do not aim to achieve causal consistency as Eventuate does, for example. |
Then I'm sorry that I completely misunderstood what you were referring to with causal consistency. I thought you meant across different persistentIds. I'm aware of the async nature of MVs and my plan is to solve that on the read side using the sequence number per persistentId.
No problem. Causal consistency is usually discussed in context of what you described. We've just used the more general definition here: an effect must never be observed before its (potential) cause. |
For persistenceId we have the sequenceNr so we can achieve causality. There is no ordering defined across the journal instance "streams" so your example is right for tag query. I wonder if there is a way to infer a single total order for tags (any repeatable same order with causality by persistenceId).
That sounds like a good idea to me. You may be facing some of the same difficulties that we are (causal merging, potentially unbounded number of persistenceIds etc.). This case however is simplified, because you shouldn't have to worry about failures, resuming the stream etc. @patriknw I also sent you an email seeking cooperation. I think I could help you with the effort and potentially avoid redoing quite a bit of the work that we already did in our branch. Thinking about it, the stream merging logic we have could also be reused for the events-by-tag query causality. @krasserm would it be worth creating a separate branch for this work again?
Sounds good to me. |
I'll create the branch tomorrow morning and open the first pull request. |
@krasserm @patriknw I have created wip-materialized-views-persistence-query branch. I have also created #124 review which contains the bits that I think are relevant to the implementation using materialized views. Some of them are duplicated in Patrik's branch which is fine. But in my opinion some code is quite valuable and would be great if it was reused (see my comments in the review). Let's have a discussion tomorrow and define implementation plan and work distribution. |
I have implemented |
@zapletal-martin I have now migrated this (and sub-issues) to akka/akka-persistence-cassandra.
Akka persistence query complements Akka Persistence by providing a universal asynchronous stream-based query interface that various journal plugins can implement in order to expose their query capabilities. The API is documented at http://doc.akka.io/docs/akka/snapshot/scala/persistence-query.html#persistence-query-scala. An example implementation using LevelDB is described at http://doc.akka.io/docs/akka/snapshot/scala/persistence-query-leveldb.html#persistence-query-leveldb-scala.
Akka-persistence-cassandra should support this new query-side API. The API is available in Akka 2.4, so this work will need to be done against a version of akka-persistence-cassandra supporting 2.4. It will also require introducing a dependency on Akka Streams.
The work is tracked in the following tickets