Skip to content
Randall Hauch edited this page Oct 30, 2015 · 12 revisions

What is Debezium?

Debezium is a prototype for a fast, scalable, distributed, durable, and sequentially consistent data service for mobile apps, and can be easily incorporated into Mobile Backend As A Service (MBaaS) systems. It is easy to use, and can reliably handle large volumes of data and large numbers of concurrent changes. It stores all data in durable, replicated, and partitioned transaction logs, and its functionality is implemented as a series of microservices that consume these logs and produce other logs. Debezium can provide mobile data storage for a Mobile Backend As A Service (MBaaS).

This is a prototype

This project is currently just a prototype for a more complete system. It was developed to help reach some specific goals:

  1. Identify directions that satisfy the requirements of data access, data sync, subscriptions, and push notifications within the context of a data service for the LiveOak and AeroGear projects.
  2. Identify the techniques and patterns client apps can use to access and update data even when:
  3. multiple versions of the client app use the same database (or application in LiveOak terms);
  4. many clients are concurrently updating the same data
  5. Identify the pros/cons of the pre-defined consistency policies, how they might be implemented, and how they might be affected by the notion of public and private database. Are transactions required? Is it possible for a data service to be sequentially or causally consistent without the use of transactions?
  6. Identify the pros/cons of schemas for client app development.
  7. Identify the feasibility of schema learning? What is easily accomplished, and what is more difficult to infer?
  8. Identify how dynamically provisioning databases affects the design of the data service.
  9. Identify options for storing entities, artifacts, etc., especially in a distributed system. JBoss already has experience with database systems (e.g., Cassandra, MongoDB, PostgreSQL, etc.). Are there alternatives?

The prototype effort was also a great place to investigate some new or different technologies:

  1. Transaction logs - very simple but powerful mechanism to handle large volumes of data very quickly. Apache Kafka is one of the best implementations, but how well does it work, how easy is it to use, and how does its approach to replication and partitioning change the design of the data service?
  2. Stream processing - This is often used in real-time analytics (e.g., process the Twitter firehose, analyze and monitor website requests and services, processing data in Hadoop, etc.), but is there potential for data storage and state management? What benefits and drawbacks does a stream-based approach have? How would requests that expect a response be handled? How should the streams be set up to maximize durability and fault-tolerance? Examples of stream processing frameworks include Apache Samza and Apache Storm. Many stream processing architectures also can produce tremendous volumes of metrics about the system itself. What benefit might this have for real-time monitoring of a distributed data service?
  3. Microservices - This is a weakly-defined term that means different things to different people. But is it feasible to implement a mobile data service using small services that consume one stream and output to another? Apache Samza will help with this, though it is more heavily oriented around Apache YARN rather than Kubernetes.
  4. CQRS - This style of architecture separates operations that read data from operations that update data. How does this affect performance, scalability, durability, and extensibility? Is it advantageous to use separate stores/views for separate types of queries/reads? Does the complexity permeate into each individual part (e.g., microservice), or can individual functions be kept simple?

Status

The codebase includes an example application that demonstrates the basics and how data flows through the system. It includes a script that will start all of the processes that form Debezium (including Zookeeper, Kafka, and YARN for the multiple Samza "services"), plus an example app that uses Debezium.

Current features

Debezium has the following user-level features:

  • Easily provision databases
  • Store JSON-based entities in collections, and optionally organize entities within a collection into separate zones
  • Use patches to create, modify, and delete entities
  • Submit multiple patches (on different entities) with a single batch
  • Automatically learns the schema of stored entities
  • Explicitly change the schema
  • Subscribe to a zone for notification of created, updated, and deleted entities in that zone
  • Asynchronous driver API

Debezium's architecture has the following characteristics:

  • All information is persisted on the file system and can be replicated
  • All data is partitioned, making it easier to scale while sharing nothing
  • Fault tolerant: even if the entire system crashes, when restarted Debezium will pick up with whatever requests were accepted without losing or skipping any data.
  • Add new microservices that consume existing data channels

Future features

Several features are likely to be needed beyond the prototype, but these were either lower-risk or lower-priority:

  • Integrate with AeroGear's Unified Push Server (easy)
  • Get all entities in a particular zone (easy)
  • Query database to find entities matching criteria (moderate, would likely use external storage)
  • Integrate into LiveOak (easy)
  • Easily deploy on OpenShift

Using Debezium

Typically the developer of a mobile app would provision a new database for that app, which is then immediately ready for the app to start uploading and adding data. Data is organized into aggregate entities that are represented with JSON, and entities with similar structure are stored in collections named for the type of entity. Entities within a collection can be grouped together in zones. For example, an address book application might have a "Contacts" collection, where each entity in that collection represents the contact information about a particular person, and all entities owned by a user might be stored in a zoned named for the user.

Debezium does use a schema for each database, and the schema describes the structure of all entities within each collection. But you don't have to define this structure up-front, because Debezium can learn the schema as data is put into the database. This learned can be used as-is, or it can be fine-tuned manually. Automatic schema learning makes it very easy to develop an app, since it can help ensure that the app is producing and storing entities with the expected structure. As the app matures and moves closer to production, however, this feature will likely not be needed, so schema learning can be turned off for the database and the schema frozen.

One very unique feature of Debezium is its API. Most CRUD-based data services require clients to read, locally change, then then submit new states of the entities. This style of interaction can be problematic when multiple clients try to update the same data, and can require complex logic (often in the client) to handle race conditions and detection/resolution of conflicts. Because mobile devices do not always remain online, synchronizing data stored locally on the many devices is often quite difficult and requires additional logic. Storing data locally on the device can be a detriment with CRUD-style data services.

Debezium's approach is different. It actually benefits from mobile apps maintaining local copies of some of the data, even when that data is likely to be out of sync. It also leverages the existing push notification services available on mobile devices. In effect, Debezium looks like a state machine: clients submit change requests, which Debezium stores and uses to compute transitions from one state to another. Debezium does cache the intermediate states for performance reasons, but fundamentally the persisted sequence of change requests serve as the primary master data from which all other states and data can be derived.

When a mobile app wants to change data it has cached/stored locally (even if that data may be slightly stale), it simply records the changes it makes to that data. These changes, called a patch, are then submitted to Debezium, which applies the patch to the most recent version it has and returns the updated representation. When the client receives the updated representation, it simply has to update its local store. Mobile apps can also request they be sent push notifications when entities are created, changed, or removed.

Mobile apps therefore have fairly simple and straightforward responsibilities:

  • Read entities as necessary, locally storing some or all of these entities.
  • Subscribe to any zones in which entity changes will be sent to the app via push notification.
  • Record changes to local entities as patches.
  • If connected, submit the patches to Debezium and wait for the response stating whether the patch was successful. If there is no connection, the patches can be stored for later submission.
  • Update local storage with the latest representations, which may arrive via push notifications or be returned from patch submission requests.

There are several features of Debezium that makes all of this possible:

  • All patches that apply to a single entity are processed in the order they are received. Even though Debezium is a distributed system, it still very efficiently maintains the total ordered of all changes to each entity. This is done by partitioning entities and does not need transactions, and this is one reason why Debezium can scale so well.
  • The changes within a patch are either all applied completely or none of them are. Clients can use this to ensure fields within an entity are always updated consistently and atomically. Consider our address book app with a contact for "Sally Johnson". If the app on one device changes the first name to "Sarah" while the app on another device simultaneously changes the first name to "Samantha" and the last name to "Johnston", then it is possible that the contact end up with the name "Sarah Johnston". Instead, if our mobile app always sets the first name and last name together, then the contact's name would either "Sarah Johnson" or "Samantha Johnston". This is dependent upon the order in which the patches are received, but at every point in time the contact exactly reflects what one of the users expected. Of course, client apps are in complete control over the patches they build and submit.
  • Patches can contain the changes to be made and assertions that must be true when the patch is applied before any changes are made. This lets the client place conditions on their changes, in case the current entity's state differs from what the client was expecting. In these cases, the patch failure (along with the current entity state) would be returned to the client, which can then try to rebuild and submit a new patch. This dramatically simplifies conflict resolution.
  • Multiple patches can be submitted as a single batch. This saves network traffic and makes it easier for the mobile application.

Next?

To learn more, start with the results of the prototype and then look at the more detailed documentation of the API, services, streams, schema learning, subscriptions, some example apps, and some Q&A. And, you can always look at and run the code.

Clone this wiki locally