This repository has been archived by the owner on Dec 20, 2024. It is now read-only.

The Doradus Story

Randy Guck edited this page Jun 2, 2014 · 9 revisions

Motivation

Several years ago at Quest Software, we observed a common problem among our auditing products. These products collect data from server logs, network packets, and other feeds and store it in relational databases for reporting. Their common problem was scalability costs: performance degrades as the database grows. Because relational DBs scale vertically, the hardware costs often couldn’t be justified given the role of the product. Customers just wouldn’t pay for massive database servers for this class of products.

Our product teams were creative in how they compensated for this. Some merged fine-grained data into coarse-grained data and deleted the former. Some sealed off the database every N months and started a whole new one. One product ditched the database entirely in favor of hierarchical zip files, front-ended with a Lucene search engine. But these compromises resulted in data loss, reduced query features, and other negative side effects. In an ideal world, we could save years of data, query all of it, and scale for only modest hardware costs.

In 2010, we started looking at NoSQL databases because of their low-cost, horizontal scalability features. Open source NoSQL databases were also attractive from the software licensing cost perspective. We surveyed the field for a NoSQL database with the following criteria:

  • Horizontal scalability: elasticity, replication, failover, etc.
  • Multi-platform: Windows and Linux
  • Rich indexing: multi-field searching, full text queries
  • API: can be called from any programming language
  • Data model: support for typed scalar fields and inter-record relationships
  • Longevity: active user community, evolving with new features

We surveyed a dozen or so NoSQL databases and experimented with maybe half of them. None of the databases met all of our criteria due to limited support for indexing, queries, or something else. We came away with two conclusions from this study:

  1. We liked Cassandra the most because it is cross-platform (pure Java), supports a flexible tabular data model, and has an active user community.
  2. We needed something to complement Cassandra that added missing features. We initially looked at some Cassandra add-ons, such as Lucandra (Lucene + Cassandra), but we found these too slow or not scalable.

To address the second point, we decided to build a common service that many products could use. It would sit between Cassandra and applications, adding the features we needed. The project was launched in late 2011, code-named Doradus.

Why Doradus?

Choosing names is hard. Some of us wait until the last moment, hoping someone else will propose a good name. Originally, our product used vague references such as “the DB service”. But eventually, someone had to start writing code, which required them to give their workspace a name. A quick Internet search for words like big, cluster, and new turned up a Wikipedia article on the Tarantula Nebula, also known as 30 Doradus. It’s described as an active region where many new stars are born. That sounded good, so Doradus was chosen, and the name stuck.

Doradus Goals

As a supplement to Cassandra, Doradus was designed to provide the following features:

  • Data model: support complex objects and inter-object relationships
  • Searching: allow equality/range searching on any field including full text queries
  • REST API: support for XML and JSON to allow a broad range of applications
  • Multi-tenancy: support multiple application databases in a single cluster
  • Scalability: scale horizontally just like Cassandra

For portability, we chose to implement Doradus in Java. For scalability, we chose a stateless peer model so that more instances could be added as needed.

The data model (fully described elsewhere) started with basic scalar types. To support strong inter-object relations we implemented links, which are directed pointers. A pair of links is defined for each relationship, allowing navigation from either end. Other features such as group fields and aliases were added over time, but the core data model is pretty basic: tables of objects whose values are stored in scalar and link fields. These are stored in Cassandra rows in different ways depending on the storage service (see below).
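To make the idea of paired links concrete, here is a minimal in-memory sketch in Python. It is not how Doradus stores data in Cassandra. The `Obj` class, the `add_link` helper, and the `MessagesSent` inverse link name are all hypothetical, chosen only to show that setting one link also sets its inverse so the relationship can be navigated from either end.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """One object in a table: scalar values plus named link fields."""
    obj_id: str
    scalars: dict = field(default_factory=dict)
    links: dict = field(default_factory=dict)   # link name -> set of target IDs


# Hypothetical relationship: each link name maps to the name of its inverse.
INVERSE = {"Sender": "MessagesSent", "MessagesSent": "Sender"}


def add_link(source: Obj, link_name: str, target: Obj) -> None:
    """Set a link on the source object and the inverse link on the target."""
    source.links.setdefault(link_name, set()).add(target.obj_id)
    target.links.setdefault(INVERSE[link_name], set()).add(source.obj_id)


msg = Obj("m1", {"Subject": "Status report"})
person = Obj("p1", {"FirstName": "Jane"})
add_link(msg, "Sender", person)

print(msg.links)     # {'Sender': {'p1'}}
print(person.links)  # {'MessagesSent': {'m1'}}
```

Keeping both directions in sync at write time is what lets a query start from either the message or the person without a separate lookup step.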

The Doradus query language (DQL, described fully elsewhere) started with Lucene-compatible syntax: terms, phrases, wildcards, range clauses, and more. To navigate link fields, we added the idea of link paths, which support quantifiers, filters, transitive searches, and other features. We also decided to employ DQL in two distinct roles: object queries, which return selected objects, and aggregate queries, which perform statistical calculations over selected objects.

v1.0: The Birth of Spider

When Doradus started, our first client was a team developing a product that collected messaging event data. Initially, we thought their most important use cases were object queries with full text clauses. Consequently, we created a storage service that used techniques similar to other Lucene-based search engines: fully inverted indexes, tries, term vectors, etc. Combining full text queries with link paths yielded rich DQL query features such as this:

    GET /Msgs/Message/_query
        ?q=SendDate=[2010-07-01 TO 2010-12-31] AND
            Sender.Person.WHERE(FirstName=J*) AND
            ALL(Recipients.Person.Department):support

This query finds all messages sent between 2010-07-01 and 2010-12-31, whose sender’s first name starts with “J”, and all of whose recipients belong to a department containing the term support. A DQL aggregate query that selects the same messages and counts them, grouped by message tag, is shown below:

    GET /Msgs/Message/_aggregate
        ?m=COUNT(*)
        &q=SendDate=[2010-07-01 TO 2010-12-31] AND
            Sender.Person.WHERE(FirstName=J*) AND
            ALL(Recipients.Person.Department):support
        &f=Tags
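The two requests above are written unencoded for readability. A client sending them over HTTP would normally percent-encode the parameter values. The following Python sketch shows one way to do that with the standard library. The `build_uri` helper is hypothetical, and the exact encoding rules the Doradus REST API expects are an assumption here, not something this page specifies.

```python
from urllib.parse import quote, urlencode


def build_uri(application, table, command, **params):
    """Build a request URI with percent-encoded query parameters."""
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query = urlencode(params, quote_via=quote)
    return f"/{application}/{table}/{command}?{query}"


# The DQL expression shared by both example requests.
dql = ("SendDate=[2010-07-01 TO 2010-12-31] AND "
       "Sender.Person.WHERE(FirstName=J*) AND "
       "ALL(Recipients.Person.Department):support")

object_query = build_uri("Msgs", "Message", "_query", q=dql)
aggregate_query = build_uri("Msgs", "Message", "_aggregate",
                            m="COUNT(*)", q=dql, f="Tags")

print(object_query)
print(aggregate_query)
```

Both URIs carry the same `q` expression; only the command and the extra `m` (metric) and `f` (grouping field) parameters differ, which mirrors the two roles DQL plays.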

DQL proved to be sufficiently expressive, and the Lucene-like storage approach was effective, but it turned out our client team’s focus was shifting, and this approach was not meeting their needs. So we kept this storage service, calling it the Spider Service, but we started a new one with a radically different approach.

v2.0: OLAP - A Radically Different Approach

It turned out that aggregate queries were far more important to our client team than object queries. They wanted to scan millions of objects and perform counts, sums, averages, etc. and group the results by any number of criteria. Initially, we thought we could perform large statistical queries in the background and provide fast request-time performance of the metrics. (This resulted in the statistics feature in the Spider service.) But the searching and grouping criteria turned out to be too vast: we concluded that we would need to perform a million background statistical queries, each of which could take minutes to hours. This approach wouldn’t scale.

So in v2.0, we created a second storage service called the OLAP service. It uses the same basic data model as the Spider service, but it combines techniques from online analytical processing with columnar storage and other features to store data in a radically different way. With Spider, each object is stored in one Cassandra row. The OLAP service takes a different approach:

  • The database is divided into logical partitions called shards. A shard is typically data belonging to a specific time period such as a specific day.
  • Objects are loaded into tables belonging to a specific shard.
  • Each object’s scalar and link field values are extracted into field-specific arrays called segments, which are compressed and stored as segmented blobs.
  • Each object and aggregate query targets a specific range of shards.
  • The shard/field segments needed for a given query are loaded into memory and cached on an LRU basis.
  • The memory-based arrays are scanned linearly at a typical rate of millions of values per second.
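The steps above can be sketched in a few dozen lines of Python. This is an illustration of the general technique only (per-field arrays per shard, compressed blobs, an LRU cache, and a linear scan), not the actual OLAP service. The `SegmentStore` class, the JSON-plus-zlib blob format, and the `Size` field are assumptions made for the example.

```python
from collections import Counter, OrderedDict
import json
import zlib


class SegmentStore:
    """Toy columnar store: one compressed blob per (shard, field) pair."""

    def __init__(self, cache_size=2):
        self.blobs = {}               # (shard, field) -> compressed bytes
        self.cache = OrderedDict()    # LRU cache of decoded field arrays
        self.cache_size = cache_size

    def load_shard(self, shard, objects):
        """Extract each field into its own array and store it compressed."""
        fields = sorted({name for obj in objects for name in obj})
        for name in fields:
            column = [obj.get(name) for obj in objects]
            self.blobs[(shard, name)] = zlib.compress(json.dumps(column).encode())

    def segment(self, shard, name):
        """Return the decoded array for a shard/field, caching on an LRU basis."""
        key = (shard, name)
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as most recently used
            return self.cache[key]
        column = json.loads(zlib.decompress(self.blobs[key]))
        self.cache[key] = column
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)    # evict the least recently used
        return column

    def count_by(self, shards, filter_field, predicate, group_field):
        """Scan the arrays linearly, counting matches grouped by a field."""
        counts = Counter()
        for shard in shards:
            values = self.segment(shard, filter_field)
            groups = self.segment(shard, group_field)
            for value, group in zip(values, groups):
                if predicate(value):
                    counts[group] += 1
        return counts


store = SegmentStore()
store.load_shard("2010-07-01", [
    {"Size": 120, "Tags": "Customer"},
    {"Size": 900, "Tags": "Internal"},
])
store.load_shard("2010-07-02", [
    {"Size": 750, "Tags": "Customer"},
    {"Size": 800, "Tags": "Customer"},
])
result = store.count_by(["2010-07-01", "2010-07-02"], "Size",
                        lambda size: size > 500, "Tags")
print(dict(result))   # {'Internal': 1, 'Customer': 2}
```

Note that the query touches only the two fields it needs in each shard. The other fields are never decompressed, which is the main reason a columnar layout can answer aggregate queries without indexes.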

The OLAP service can scan millions of objects per second without indexing. The columnar storage approach yields very high compression rates, so databases are surprisingly small. The full Doradus data model and query language are available, though there are a few feature differences between OLAP and Spider. The OLAP service also imposes some restrictions, which are spelled out elsewhere.

Doradus offers both the OLAP and Spider services: each application chooses the one that best fits its needs. A Doradus server can be configured to start either or both services, so that only the functionality required by applications is provided.

Open Sourcing Doradus

Dell acquired Quest Software in 2012 and has remained committed to the Doradus project. The messaging insights product that uses Doradus has shipped several versions, and other teams are in various POC stages. But Doradus has never been sold or offered as a standalone product.

As a supporter of the open source community, Dell has released certain projects as open source, such as Blockade. In April 2014, we decided to make Doradus open source under the Apache License version 2.0. We are extremely excited to give back to the open source community, and we hope others can benefit from Doradus and its unique features.
