With REST we had an architectural style with a set of constraints that lets you interact with resources. The advent of GraphQL emphasized the fuzziness of those constraints by providing a full-blown specification that defines the conceptual model as a graph. Despite their simplicity, graphs have a rich structure that makes them very efficient at representing hierarchical data, with and without cycles (in the case of connected acyclic graphs we get simple trees). Compared to REST, GraphQL:
- Eliminates overfetching and underfetching, as a request is fulfilled in a single round trip to the backend and contains only the necessary data;
- Has built-in type safety;
- Reduces the amount of client-side joins, error handling, and retry logic.
Taking bits from Eric Evans' Domain-Driven Design book: in domain modeling we shouldn't separate concepts from their implementation, so we need to remove the sequentiality of the "design, then build" process. During the development phase, domain experts and technologists interact with each other (using a Ubiquitous Language), trying to find the best domain representation while addressing business goals. As time advances, technologists become more familiar with the scenario and will eventually modify the domain model many times.
When developing new APIs in GraphQL it's possible to reuse the same endpoint and just focus on connecting elements to the graph by expanding the schema. Clients won't need to build new integrations: they can simply ask for more (or different) fields, and the data will still be served in a single request.
Evans is an Extreme Programming supporter, and at this point it's pretty obvious that GraphQL is a great fit for multiple iterations on the graph's schema.
In this article, we're going to use AppSync, a fully managed GraphQL service, and DynamoDB, the most famous serverless NoSQL database.
When performing an operation, AppSync lets you attach different resolvers to fetch the nested fields of a given return type. These resolvers may get the data from the same data source as the parent type or from another one (or a combination of both, depending on how you modeled your domain).
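For illustration, here's roughly what a resolver attached to the nested reviews field of User could look like, written for AppSync's JavaScript runtime (the same logic can be expressed in VTL). The REVIEW# sort-key prefix and the file name are my own assumptions for the sake of the example:

// resolvers/User.reviews.js (APPSYNC_JS runtime)
import { util } from '@aws-appsync/utils';

export function request(ctx) {
  // Query every item under the parent user's partition whose sort key
  // marks it as a review (assumed single-table key layout).
  return {
    operation: 'Query',
    query: {
      expression: 'PK = :pk AND begins_with(SK, :sk)',
      expressionValues: util.dynamodb.toMapValues({
        ':pk': ctx.source.PK, // the parent User item resolved before us
        ':sk': 'REVIEW#',     // assumed prefix for review items
      }),
    },
  };
}

export function response(ctx) {
  return ctx.result.items;
}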
Now we need to make a difficult choice:
- Leverage Single-Table Design and store everything in a single DynamoDB table;
- Use DynamoDB pretty much like a SQL database and create a new table for each data model.
The purpose of this article is to compare the effectiveness of each approach by executing three queries with different nesting levels. Here's the GraphQL schema used in both cases:
schema {
  query: Query
}

type Query {
  getUser(userID: String!): User
}

type User {
  PK: String!
  SK: String!
  __typename: Typename!
  email: String!
  name: String!
  bio: String
  followersCount: Int
  followingCount: Int
  reviewsCount: Int
  listsCount: Int
  likesCount: Int
  updatedAt: AWSDateTime!
  # nested
  reviews: [Review!]
  lists: [List!]
  followers: [User!] # we can use pipelines here!
  following: [User!] # here too!
}

type Review {
  PK: String!
  SK: String!
  __typename: Typename!
  rating: Int!
  description: String
  likesCount: Int!
  updatedAt: AWSDateTime!
  # nested
  movie: Movie!
}

type Movie {
  PK: String!
  SK: String!
  __typename: Typename!
  title: String!
  releaseDate: AWSDateTime!
  director: String!
  listsCount: Int!
  reviewsCount: Int!
  updatedAt: AWSDateTime!
  # nested
  reviews: [Review!]
}

type List {
  PK: String!
  SK: String!
  __typename: Typename!
  name: String!
  updatedAt: AWSDateTime!
  # nested
  movies: [Movie!]
}

enum Typename {
  USER
  MOVIE
  REVIEW
  LIST
  ASSOCIATION
  FOLLOW
}
Here's how I structured the single table that contains all the information about users, movies, reviews, and lists (data might be inconsistent due to auto-generated values).
With the Multi-Table Design, on the other hand, every table contains only elements with the same data model.
The loaded data is the same for both approaches.
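To make the difference concrete, here's a small TypeScript sketch (with invented key values; the real auto-generated keys differ) of how the same data might be keyed under each approach:

// Single-table: every entity lives in one table, distinguished by key
// prefixes and the __typename attribute.
const singleTableItems = [
  { PK: 'USER#alice', SK: 'PROFILE', __typename: 'USER', name: 'Alice' },
  { PK: 'USER#alice', SK: 'REVIEW#inception', __typename: 'REVIEW', rating: 5 },
  { PK: 'USER#alice', SK: 'FOLLOW#bob', __typename: 'FOLLOW' },
];

// Multi-table: one table per data model, so each table holds a single,
// consistent item shape.
const userItem = { PK: 'USER#alice', SK: 'PROFILE', __typename: 'USER', name: 'Alice' };
const reviewItem = { PK: 'REVIEW#inception', SK: 'USER#alice', __typename: 'REVIEW', rating: 5 };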
Query no. 1, one level of nesting:
query MyQuery($userID: String!) {
  getUser(userID: $userID) {
    PK
    following {
      PK
      SK
    }
  }
}
Results:
Query no. 2, two levels of nesting:
query MyQuery($userID: String!) {
  getUser(userID: $userID) {
    PK
    following {
      PK
      SK
      reviews {
        PK
        SK
      }
    }
  }
}
Results:
Query no. 3, three levels of nesting:
query MyQuery($userID: String!) {
  getUser(userID: $userID) {
    PK
    following {
      PK
      SK
      reviews {
        PK
        SK
        movie {
          PK
          SK
        }
      }
    }
  }
}
Results:
Hot take: AppSync doesn't implement the DataLoader pattern, or at least not in the way you might expect. On the AppSync Architecture page we learn that resolvers are executed concurrently, just like a set of promises resolved via Promise.all() in JS. This implies that the total number of database accesses is not optimized when using normal DynamoDB VTL resolvers.
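Conceptually (this is just an analogy in TypeScript, not AppSync's actual implementation, and the table name is a placeholder), resolving N sibling items behaves roughly like this:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand } from '@aws-sdk/lib-dynamodb';

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// One GetItem request per nested item, all fired concurrently:
// nothing coalesces them into a single BatchGetItem behind the scenes.
async function resolveSiblings(keys: { PK: string; SK: string }[]) {
  return Promise.all(
    keys.map((Key) => doc.send(new GetCommand({ TableName: 'AppTable', Key })))
  );
}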
Here's the detailed X-Ray report for query no. 3:
Request and response mappings, as well as database accesses for the same nested type, are executed almost at the same time.
When we look at the database accesses, this is even clearer:
A series of GetItem-based resolvers of the same type, on the same data source, are launched in parallel, as configured.
There's no trace of intelligent batching here.
In order to really get the DataLoader functionality we have to know the data models deeply and do everything ourselves with Lambda resolvers, either by interpreting the query or by making use of external libraries.
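As a sketch of that do-it-yourself route (the dataloader npm package plus BatchGetItem inside a Lambda resolver; the table name and key shapes are my own assumptions, not the setup used above):

import DataLoader from 'dataloader';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, BatchGetCommand } from '@aws-sdk/lib-dynamodb';

const TABLE = 'AppTable'; // placeholder table name
const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

type Key = { PK: string; SK: string };

// Collects every .load() call made in the same tick and turns them into
// a single BatchGetItem (up to DynamoDB's 100-key limit per request).
const itemLoader = new DataLoader<Key, Record<string, any> | undefined, string>(
  async (keys) => {
    const { Responses } = await doc.send(
      new BatchGetCommand({ RequestItems: { [TABLE]: { Keys: [...keys] } } })
    );
    const items = Responses?.[TABLE] ?? [];
    // BatchGetItem returns items unordered: re-align them with the input keys.
    return keys.map((k) => items.find((i) => i.PK === k.PK && i.SK === k.SK));
  },
  { maxBatchSize: 100, cacheKeyFn: (k) => `${k.PK}|${k.SK}` }
);

// Inside the Lambda resolver, each nested field simply does something like:
// const movie = await itemLoader.load({ PK: 'MOVIE#inception', SK: 'METADATA' });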
While good domain modeling is necessary to create useful APIs, having a GraphQL API in place doesn't take care of data modeling for you.
Knowing our access patterns is mandatory to design our data model and run efficient, predictable queries.
Pre-joining data via Single-Table Design allows us to leverage the internal DynamoDB structure, based on partitions and B-trees, to fetch all the necessary data (often) in a single round trip. But it comes at the big cost of denormalization.
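For comparison, this is roughly what that single round trip looks like with the DocumentClient (table name and key layout are the same assumed ones as in the sketches above):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// One Query against the user's partition returns the profile item plus every
// pre-joined item (reviews, lists, follows...) in a single request.
async function getUserPartition(userId: string) {
  const { Items } = await doc.send(
    new QueryCommand({
      TableName: 'AppTable', // placeholder table name
      KeyConditionExpression: 'PK = :pk',
      ExpressionAttributeValues: { ':pk': `USER#${userId}` },
    })
  );
  return Items ?? [];
}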
As the application grows and domain models evolve, we might need to add new access patterns or modify the ones already in place. This can be difficult and eventually lead to a complete redesign of the Data model.
Denormalizing a big SQL database and migrating it to DynamoDB (or, in reality, to any NoSQL database) is a huge burden and needs to be done very carefully in order to keep the 'relational value' of the data (wink at Rick Houlihan for me if you see him around).
The denormalization step is very "sequential", and here we see the first discrepancy with DDD. Denormalization can happen only after you've defined the access patterns (as adding more later would be a pain) and migrated the existing data, spread across multiple tables, into a single table (which translates to operational load and, in the worst case, downtime).
In my comparison it's pretty clear that Single-Table Design is more efficient in terms of raw latency, but is it worth it?
The efficiency gains can easily be outweighed by the ease of use of the multi-table approach (no complex custom VTL logic).
I don't want to start a philosophical debate, but the serverless proposition is about saying goodbye to everything that doesn't bring direct value to the project.
Let's imagine the following scenario: you've just launched a big OLTP service that relies solely on serverless products from AWS. The first big development phase has just finished, and you're preparing for the next sprints when a colleague asks: "Can you please give me the number of apple orders placed between dates X and Y by customers who live in Taiwan and don't have pears in their fridge?"
DynamoDB is not meant for analytical applications (it doesn't even have the simplest aggregation functions, like SUM or MAX), and you haven't prepared any infrastructure that can handle such a complex query.
If the application data is spread across multiple tables and, within each table, the data model is consistent, we can easily export that data to build aggregations and run queries via Athena, Redshift, or Glue (Crawlers are cool!).
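A hedged sketch of that export path (the table ARN and bucket name are invented; the export also requires point-in-time recovery to be enabled on the table):

import { DynamoDBClient, ExportTableToPointInTimeCommand } from '@aws-sdk/client-dynamodb';

const ddb = new DynamoDBClient({});

// Export one of the per-model tables to S3; a Glue crawler (or a manually
// defined schema) then makes the exported data queryable from Athena.
async function exportReviewsTable() {
  return ddb.send(
    new ExportTableToPointInTimeCommand({
      TableArn: 'arn:aws:dynamodb:eu-west-1:123456789012:table/Reviews', // invented ARN
      S3Bucket: 'my-analytics-exports', // invented bucket name
      ExportFormat: 'DYNAMODB_JSON',
    })
  );
}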
If we favored a Single-Table Design, we're most likely going to end up writing a custom script/application that makes a lot of requests and logically joins the data. Then we'd probably start scratching our heads trying to figure out how to include that access pattern (maybe using metadata objects).
What happens if, the subsequent day, our curious colleague asks for a completely different type of information? I believe I’ve made my point.
If you're bootstrapping a new project, it's better to go for a multi-table approach, because:
- Query latencies won't matter much during the development phase;
- You'll get to know the domain and business needs better over time, so you'll change access patterns very often.
If you're migrating a long-running product that relies on a SQL database to DynamoDB:
- Going for a Single-Table Design might be the right choice because access patterns are well-defined.
In general, if you want to improve your solution's overall latencies, Single-Table Design might do the trick. If speed is your concern, have a look at the different caching mechanisms in AWS; they might serve you better.
Additional readings on the topic:
- AWS Blog on DynamoDB modeling
- AWS Blog on AppSync and Single Table Design
- Rick Houlihan's office hours
- Alex DeBrie's post
- Alex DeBrie's video
- Alex DeBrie's DynamoDB analysis, or if you feel brave enough the full DynamoDB paper
- Awesome DynamoDB Repo
- Benoit Boure's post
- Rafal Wilinski's lessons
- Paul Swail's comparison