
Welcome to the armor wiki!

What is Armor

The Armor library defines a columnar storage file format, similar to Parquet and ORC in that it is columnar, but different in that it is entity/domain driven. It is meant as an alternative for building columnar files to be read by Presto over S3. It also provides a low-cost way to deploy time-series style tables over S3. Our main goal was to eliminate the requirement of deploying additional infrastructure, whether it be a database, cluster, etc., in order to implement the two previous features. Within the library you'll find ways to read and write Armor files easily.

Why Armor? What problems are we trying to solve?

The reasons for Armor stem from our initial use of our EMR/Spark/Parquet/Presto ingestion pipeline. Although it works, we have had some major pain points that motivated us to improve our current process. Our goal was to find a way to simplify how we build our tables such that a technology like Presto can read them easily. To give you a better idea of what we faced, here are some of the issues.

Complex Deployment from top to bottom

Due to our dependence on Spark, we are forced to rely on a Hadoop-based cluster framework such as Amazon EMR. It is possible to run Spark standalone, but that isn't the convention for most Spark applications, and doing unconventional things brings further unknown complexities of its own. On top of that, most Spark applications are job based, meaning they run once rather than perpetually. To compensate for this we monitor the amount of work to be done and the state of each cluster to constantly see if we need to submit more jobs, kill jobs, kill clusters, create clusters, etc.

Another complexity is the need for a Hive metastore. The Hive metastore is required for Presto to read Parquet files; without it, Presto doesn't know the shape of the Parquet tables or where they exist. Thus, in order to use Presto and Parquet you'll need Hive. However, requiring a metastore is cumbersome. First, the Hive metastore database (Postgres or MySQL backed) is abstracted by the Hive library. This abstraction can present problems since: a) upgrading between versions is risky, and b) the client-side libraries available to interact with the Hive metastore aren't easy to use. In fact, we've found that simply re-using a Hadoop cluster to connect to the metastore and run HQL statements is much easier than doing so from a simple Java application. Finally, for multi-tenant deployments where a schema equates to one tenant, executing standard HQL statements is taxing. Things like adding or dropping a column require iterating through all schemas, and adding or deleting tenants requires yet more HQL scripts to be run.

Lastly, building the right type of Spark cluster is a process in itself. Often you'll want the right mix of resources allocated between workers and masters to achieve an expected level of performance. This varies per pipeline and thus adds another layer of complexity that has to be understood in order to achieve optimal performance. Below is an AWS tuning guide for EMR Spark clusters that demonstrates the complexity.

best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr

Below is an overview of the deployment model we were trying to simplify. Notice the various technologies that need to be deployed; from a soft-cost perspective alone, the cost is very high.

Complex deployment

High Learning curve

Our stack in this domain is comprised of Presto/EMR/Hadoop/Hive/Spark/S3/Parquet. Though these technologies are popular and well known, they are not as ubiquitous in the Java world as something like Spring. What we have found is that most Java developers have very little experience with these technologies, and finding someone to fill these roles isn't easy. On top of that, some of these technologies require a lot of background and experience to really become adept in. Thus the learning curve for this set of technologies is high.

Lack of time series features

One thing we found lacking in our rollout for building Parquet files is the slow nature of building time-based tables. For us, we snapshot each and every table for each tenant on a weekly basis. The snapshot approach is slow by nature, and if you want to snapshot at smaller intervals such as an hour or a day, the process of snapshotting so many tables across all tenants via Hive may take longer than the interval you want to establish. Another drawback is that Hive must be notified of the existence of each and every new snapshot. This is yet another downside of a centralized metastore.

Another feature that we'd like to enable is recording events over time, which is basically the opposite of snapshotting. However, trying to build this type of table using Spark across thousands of tenants is complex; perhaps it is beyond what Spark was designed for. But the inability to build these types of tables precludes us from providing richer features to our users.

Requires diffing

This often applies to a good number of data ingestion pipelines where some part of the pipeline has to execute a diff against an entity, often identified by a unique key. Basically you'll load what is stored into memory, compare that against what you are receiving, and then execute deletes, updates, and inserts, as sketched below.
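
The sketch below shows that diff step in plain Java, assuming a hypothetical Row type keyed by a unique id; it is a generic illustration of the pattern, not Armor code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiffStep {
  // Hypothetical row type identified by a unique key.
  record Row(long id, String name, String state) {}

  // Compare what is already stored against the incoming batch and
  // bucket the work into inserts, updates, and deletes.
  public static void diff(Map<Long, Row> stored, List<Row> incoming,
                          List<Row> inserts, List<Row> updates, List<Long> deletes) {
    Map<Long, Row> remaining = new HashMap<>(stored);
    for (Row row : incoming) {
      Row existing = remaining.remove(row.id());
      if (existing == null) {
        inserts.add(row);                   // new key, never stored before
      } else if (!existing.equals(row)) {
        updates.add(row);                   // key exists but the data changed
      }                                     // identical rows need no work
    }
    deletes.addAll(remaining.keySet());     // anything left disappeared upstream
  }
}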

A more efficient approach is to push this logic down into the data layer, avoiding the cost of the diff by executing a delete/insert instead. For example, a simple way to do this in a relational database is:

BEGIN;
DELETE FROM my_table WHERE id = 1;
INSERT INTO my_table (id, col1, col2, col3)
VALUES (1, 100, 'fred', 'texas'),
       (1, 120, 'fred', 'california'),
       (1, -1, 'tom', 'vermont'),
       (1, 300, 'tom', 'california');
COMMIT;

ORM frameworks like Hibernate make this extremely simple for the developer. The above example could be done like this, provided your POJO is fully initialized:

MyObj pojo = getPojo();
Transaction tx = session.beginTransaction();
session.update(pojo);
tx.commit();

Performance-wise this is likely to be slow, but it is much more domain driven, which is a plus. Spark doesn't provide this level of abstraction, so developers generally must use the former pattern to accomplish this. The inefficiency in either approach is that there can be a lot of network shuffling or table row shifting by either Spark or the file-backed database.

High Cost, hard to optimize

Due to the clustered nature of running Hadoop, you often pay a premium since you must run at least one master and one worker to launch a single cluster dedicated to ingesting. Most often, however, you'll need to run at least two if not three or more workers, and many clusters, to concurrently ingest more events and tables. Not only is there additional cost in running these clusters, but there is additional cost in maintaining a metastore as well as maintaining and orchestrating the clusters.

How does Armor address the problems?

Simple architecture leads to simple deployment

All you need to build Armor tables is to know Java and use the library. There is no dependence on frameworks, technologies, or other libraries. Building a pipeline to start building Armor tables is very simple: you can have a single thread or process build and update a portion of an Armor table in a safe, consistent fashion. This gives you the ability to run many Java-based microservices on host machines, VMs, or containers that write to many different tables or to a single table.

On top of that, there is no metastore required to pre-define what you'll be building. Rather, the table definitions are derived from what you build. This includes the time-based tables, columns, and schemas that you would otherwise have to define before you started writing.

Finally, Armor tables can reside on simple-to-use file systems like S3 as well as standard file systems. All you have to do is pick the file system and a root destination, and Armor will start to build tables for you in a multi-tenant fashion.

Parquet, on the other hand, is commonly built and read using frameworks like Spark or Hive. Building a Java application that directly writes or reads Parquet files is possible, and below you'll find the various ways you can achieve this. However, notice that none of the implementations below provide an ORM API. You may also find that many approaches require importing more than one library. Finally, the APIs for these libraries don't lend themselves to succinct and concise code, which is part of the reason why many turn to frameworks like Spark for that level of abstraction.
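
To give a hedged sense of that verbosity, here is roughly what writing a Parquet file directly from Java looks like using the parquet-avro module; the exact builder methods and required Hadoop dependencies vary by version, and the file path and schema here are purely illustrative.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class DirectParquetWrite {
  public static void main(String[] args) throws Exception {
    // The schema must be declared up front; there is no domain object to lean on.
    Schema schema = SchemaBuilder.record("person").fields()
        .requiredLong("id")
        .requiredString("name")
        .requiredString("state")
        .endRecord();

    // Illustrative local path; writing to S3 adds further Hadoop filesystem configuration.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/people.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1L);
      record.put("name", "fred");
      record.put("state", "texas");
      writer.write(record);
    }
  }
}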

Below is a potential deployment model using either standard VMs or containers. In this scenario you only need Java to run on a host. Also notice the Hive/MySQL requirement is removed.

Simple deployment model with Armor

Low learning curve

Armor is just a library, and our goal is to provide simple APIs to build tables. There is no need to learn a framework or memorize an abundant set of APIs and settings to get started. Our aim is for even the most junior Java developer to start writing Armor tables as easily as possible. If you have experience building Java applications, it shouldn't take more than an hour to get started.

Supports time series

When we designed and built Armor, we also tackled the lack of time-series capabilities in other solutions. This means you can take advantage of Armor's simple deployment model to integrate with Presto in a time-series manner. This might be a good choice if you want to store your data as you would in Druid or Pinot, but without the hassle of deploying and integrating such technologies with Presto. The Armor API provides simple ways to dynamically create time-based tables at various intervals, even down to the minute level. Because it is all calendar based, it handles leap years, daylight savings time, etc. with no problems. The Presto-Armor connector is also fully aware of the time-series layout of Armor and applies this information when deciding which Armor files to download and load.

Time series deployment
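
As a conceptual sketch only (this is not the Armor API), the calendar-based bucketing idea can be pictured with java.time, which already accounts for leap years and daylight savings; the table-naming scheme below is purely hypothetical.

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class IntervalBuckets {
  // Hypothetical naming scheme; the real layout is defined by Armor itself.
  private static final DateTimeFormatter MINUTE =
      DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmm");

  // e.g. ("orgEvents", 2021-07-28T14:03:21-07:00) -> "orgEvents_20210728T2103"
  public static String minuteTable(String baseTable, ZonedDateTime eventTime) {
    ZonedDateTime utc = eventTime.withZoneSameInstant(ZoneOffset.UTC);
    return baseTable + "_" + MINUTE.format(utc);
  }
}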

No diffing, more natural way to update entities

We modelled our API to have the simplicity of ORM libraries like Hibernate. It is very domain driven and easy for the user to understand. Under the covers, we designed the file format to take advantage of the entity/domain-driven process we were running. You can check it out here, but a short summary follows.

The Armor file format maintains an entity index that tracks which portion of the file belongs to which entity. Since all of an entity's data is stored in a contiguous region, each new version of the entity can simply be appended to the end of the file, and the previous version can be marked for deletion. This avoids shuffling and makes it easy and fast to compact and remove the soft-deleted regions in one pass.
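
Below is a minimal, in-memory sketch of that append-and-mark idea; it is purely illustrative and does not reflect the actual on-disk layout.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EntityIndexSketch {
  // Stands in for the file body: each element is one entity version's contiguous region.
  private final List<byte[]> regions = new ArrayList<>();
  // The entity index: entity id -> position of its live region.
  private final Map<String, Integer> index = new HashMap<>();

  public void writeEntity(String entityId, byte[] newVersion) {
    Integer previous = index.get(entityId);
    regions.add(newVersion);                  // append the new version at the end
    index.put(entityId, regions.size() - 1);  // repoint the index, no shuffling needed
    if (previous != null) {
      regions.set(previous, null);            // soft delete: the old region is only marked
    }
  }

  public void compact() {
    // One pass keeps only the live regions and rebuilds the index.
    List<byte[]> live = new ArrayList<>();
    Map<String, Integer> rebuilt = new HashMap<>();
    index.forEach((id, pos) -> {
      rebuilt.put(id, live.size());
      live.add(regions.get(pos));
    });
    regions.clear();
    regions.addAll(live);
    index.clear();
    index.putAll(rebuilt);
  }
}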

Low Cost and easy to quantify

Having no metastore and no clusters allows us to tune and optimize for cost more efficiently. Instead of having to run, say, 100 instances across 20 clusters, we could perhaps achieve the same level of ingestion with 10 instances or 20 containers. It is much easier to extract more power this way than with the master/worker cluster model.

How is Armor broken down

There are three main parts to this repo which are armor-base (core support classes), armor-write (for writing armor files), and armor-read (for reading armor files).

The library provides a set of patterns and classes to read and write Armor files. On the read side there are primarily two ways to read Armor files: locally on disk via a set of APIs, or via a specialized reader that reads the file as quickly and efficiently as possible in order to run on PrestoDB.

Writing Armor files can be customized to suit whatever the use case is, but out of the box we provide ways to write Armor files either to local disk or to Amazon S3. Either approach supports transactions.

Armor Weak-points

  • The format doesn't lend itself to perfectly balanced shards.
  • Since Armor is entity based, where one entity has one to many rows, there is no way to update just a portion of an entity. Users are responsible for ensuring that each update contains the full state of the entity.