
Core concepts


This page introduces the core concepts of Summingbird. Summingbird jobs produce two main types of data: streams and snapshots. A stream is a full history of data, whereas a snapshot captures system state at a specific time. As you read the sections below, it will help to ask of each piece of data: is this a stream or a snapshot?

Producer

The Producer is Summingbird’s data stream abstraction. Every job begins by creating an initial Producer[P, T] node out of a P#Source[T] by calling the Producer.source method:

def source[P <: Platform[P], T: Manifest](s: P#Source[T]): Producer[P, T]

Once you’ve created a Producer, all operations in the Producer API are fair game. After building up your desired MapReduce workflow, you’ll need to hand your Producer to an instance of Platform to compile the MapReduce workflow down to that particular Platform’s notion of a “plan”. More on the “plan” later.
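For example, on the in-memory platform introduced below, a small pipeline might look like the following sketch (the values and names here are illustrative, not from the original example):

import com.twitter.summingbird._
import com.twitter.summingbird.memory._

// Build up a graph of operations; nothing runs until the graph is
// planned and executed by a Platform.
val nums: Producer[Memory, Int] = Producer.source(Seq(1, 2, 3, 4))
val scaledEvens: Producer[Memory, Int] = nums.filter(_ % 2 == 0).map(_ * 10)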

Platform

A Summingbird Platform instance can be implemented for any streaming MapReduce library that knows how to make sense of the operations one can perform on a Producer. The Summingbird repository contains Platform implementations for Storm, Scalding, and in-memory execution. Here’s the Platform trait:

trait Platform[P <: Platform[P]] {
  type Source[_]
  type Store[_, _]
  type Sink[_]
  type Service[_, _]
  type Plan[_]

  def plan[T](completed: Producer[P, T]): Plan[T]
}

Each of the type variables locks down one of Summingbird’s core concepts for a particular execution platform. As you build up a graph of operations on your Producer, you’ll pull in various instances of these types. Let’s discuss each type variable in turn.
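For concreteness, here is roughly how the Memory platform fills in these type members (paraphrased from the Summingbird source; the exact definitions may vary between versions):

// Memory platform type members, paraphrased from the source:
type Source[T]       = TraversableOnce[T]
type Store[K, V]     = scala.collection.mutable.Map[K, V]
type Sink[-T]        = (T => Unit)
type Service[-K, +V] = (K => Option[V])
type Plan[T]         = Stream[T]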

Source

A Source represents a stream of data. Each execution platform has its own notion of a data source. For example, the Memory platform defines a Source[T] to be any TraversableOnce[T]:

type Source[T] = TraversableOnce[T]

As a result, any Scala sequence is fair game:

import com.twitter.summingbird._
import com.twitter.summingbird.memory._

val producer: Producer[Memory, Int] = Producer.source(Seq(1,2,3))

Storm’s sources are often backed by realtime distributed queues, so the Storm platform’s Source type is a bit different:

type Source[+T] = com.twitter.tormenta.spout.Spout[(Long, T)]

This source is a Tormenta Spout tuned to produce instances of T along with an associated timestamp. Though they differ mechanically, the Memory platform’s TraversableOnce[T] and the Storm platform’s Spout[(Long, T)] are both, logically, generators of data, and each admits sensible implementations of the methods in the Producer API.

Store

The Store is where the “reduce” of Summingbird’s streaming MapReduce comes into play. A Store[K, V] contains a snapshot of the aggregated value for each of its keys. When a Producer instance contains (K, V) pairs, calling producer.sumByKey(store) will group the values by key, sum the values for each key using an instance of Monoid[V], and push a snapshot of the resulting (key, summed-value) pairs into the supplied Store[K, V].

Because Summingbird uses a Monoid[V], updates into a Store are always associative:

newStore = assoc(oldStore, value)

where assoc(assoc(a, b), c) == assoc(a, assoc(b, c)). The associative property is exploited by the Storm and Scalding execution platforms for parallelism and fault tolerance. Associativity is also helpful when merging results from separate batch and realtime computations; see batch and realtime for more information.
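To make this concrete, here is a minimal word-count sketch on the Memory platform. It assumes that the Memory platform’s Store[K, V] is a scala.collection.mutable.Map[K, V] and that Memory exposes plan and run methods, as in the Summingbird source at the time of writing; the implicit Monoid[Long] comes from Algebird:

import scala.collection.mutable
import com.twitter.summingbird._
import com.twitter.summingbird.memory._

// Word count: sum 1L per word into a mutable Map acting as the Store.
val store: mutable.Map[String, Long] = mutable.Map.empty

val job = Producer.source[Memory, String](Seq("hello world", "hello"))
  .flatMap { sentence => sentence.split("\\s+").map(_ -> 1L) }
  .sumByKey(store)

// Planning yields a lazy plan; run forces it and populates the store.
val memory = new Memory
memory.run(memory.plan(job))
// store now contains Map("hello" -> 2, "world" -> 1)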

Sink

Unlike the Store, the Sink allows you to materialize an un-aggregated “stream” representation of the Producer’s values. A Sink is a stream, not a snapshot. On the Memory and Storm platforms, a Sink is just a function call. In Storm, that function call might populate a log stream, or another realtime queue that a further Summingbird topology could pull in as a Source.

In the word count example from Summingbird’s README, inserting a call to write(sink) before the call to sumByKey would send a stream of (word, 1L) pairs into the sink:

source.flatMap { sentence => toWords(sentence).map(_ -> 1L) }
  .write(sink)
  .sumByKey(store)

Calling producer.write(sink) after a sumByKey will produce a stream of monotonically increasing counts for each word, effectively a scanLeft output of the updates to every key in the stream.
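On the Memory platform, for example, where a Sink is assumed to be a plain T => Unit function, the un-aggregated stream can be captured in a buffer. This is a sketch with illustrative names, building on the word-count job above:

import scala.collection.mutable
import com.twitter.summingbird._
import com.twitter.summingbird.memory._

// Capture each un-aggregated (word, 1L) pair as it flows past.
val seen = mutable.Buffer.empty[(String, Long)]
val sink: ((String, Long)) => Unit = pair => seen += pair

val counts: mutable.Map[String, Long] = mutable.Map.empty

val tapped = Producer.source[Memory, String](Seq("a b", "a"))
  .flatMap { s => s.split("\\s+").map(_ -> 1L) }
  .write(sink)      // each raw (word, 1L) pair also goes to the sink
  .sumByKey(counts)
// After planning and running as before, `seen` holds ("a",1), ("b",1),
// ("a",1), while `counts` holds the summed values.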

Service

A Service allows the user to perform a “lookup join”, or leftJoin, against the current values within a Producer’s stream.

The joined values can come from another Store’s snapshot, another Sink’s stream, or even some other asynchronous function call that requires non-negligible time to compute.

A Service can also provide access to data from a Store that is being materialized earlier in the job, or in a dependent job. Here at Twitter, one might imagine joining a Service[TCOShortenedURL, ExpandedURL] against a Producer[(TCOShortenedURL, V)] to create a stream of (TCOShortenedURL, (V, Option[ExpandedURL])).

On the Memory or Storm platforms, a Service is a key-value store against which the stream performs lookups. Before a service join, a producer must have type Producer[P, (K, V)]. A join against a P#Service[K, RightV] will return a Producer[P, (K, (V, Option[RightV]))].
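On the Memory platform, where a Service is assumed to be a plain K => Option[V] lookup function, a service join might look like this sketch (the URL-shortening data is illustrative):

import com.twitter.summingbird._
import com.twitter.summingbird.memory._

// A Service backed by an in-memory Map's get method.
val expander: String => Option[String] =
  Map("t.co/x" -> "https://example.com/x").get

val clicks: Producer[Memory, (String, Long)] =
  Producer.source(Seq("t.co/x" -> 1L, "t.co/y" -> 1L))

// Each (K, V) pair picks up the service's current value for K.
val joined: Producer[Memory, (String, (Long, Option[String]))] =
  clicks.leftJoin(expander)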

The Scalding platform allows you to join against the stream output by another Summingbird job’s sink. The join is implemented in a way that prevents any key on the left side of the join from seeing the right side’s value at a future point in time. Put another way, the Scalding platform’s service join is non-clairvoyant.

Plan

The plan is the final representation of the MapReduce flow produced by a Platform after a call to platform.plan(producer). For Storm, the plan is a StormTopology instance that the user can execute using Storm’s supplied methods. For the Memory platform, the plan is an in-memory Stream[T] containing the output of the supplied producer. Once you have a plan, you can jump back out to the APIs backing the specific Platform for job customization and submission.
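For instance, since the Memory platform’s Plan[T] is assumed to be a Scala Stream[T], planning and running an in-memory job amounts to forcing a lazy stream (a sketch, assuming Memory exposes plan and run as in the source):

import com.twitter.summingbird._
import com.twitter.summingbird.memory._

// Planning yields a lazy Stream; forcing it executes the job.
val memory = new Memory
val plan: Stream[Int] = memory.plan(Producer.source[Memory, Int](Seq(1, 2, 3)))
memory.run(plan)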