
---

Great write-up, really enjoyed reading it. I like the idea of lazy or on-demand generation; that makes a whole lot of sense for many reasons. One thing I was thinking of as a way of stopping regeneration of the same data in multiple domains is to generate a hash of each object and check it against a map before generating again:

```js
const hashCode = o.hashCode;
if (map.has(hashCode)) {
  // do nothing, this data has already been generated
} else {
  map.set(hashCode, o);
}
```

---

Here is some relevant prior art related to this problem of graph generation.

---

Thinking about this some more, I am getting some real vibes of Monte Carlo simulation here. The implementation of this that I always think of is Guesstimate. Other links with implementations:

---

@jbolda This is a really cool find. Basically, the idea would be that your only responsibility would be to specify a distribution, and then the engine would choose at random along that distribution.
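As a rough sketch of what that could look like (the names here are made up, not an existing engine API): a declared distribution is just a function the engine samples on demand.

```ts
// Sketch: a declared distribution is just a function the engine can sample from.
type Distribution = () => number;

// Normal distribution via the Box-Muller transform.
function normal(mean: number, stdDev: number): Distribution {
  return () => {
    const u1 = Math.random() || Number.MIN_VALUE; // avoid log(0)
    const u2 = Math.random();
    const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    return mean + stdDev * z;
  };
}

// Declare "posts per author" as a distribution; the engine samples it on demand.
const postsPerAuthor = normal(10, 3);
console.log(Math.max(0, Math.round(postsPerAuthor())));
```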

---

The payoff for having a high-fidelity simulation is huge, but how do you realize the benefits of simulating a mature API without implementing it twice?
If we’re making a simulation of a complex API, we can end up in a bad situation where we have two completely separate and divergent implementations: the simulated implementation and the “real” implementation. What good is it if your simulation implementation is a constant maintenance headache?
Ideally, we would be able to automatically generate as much of the implementation as possible, so that the cost of maintaining the simulation does not add significantly to the overhead of maintaining the actual service. The way to go about this is to take what we know about a service and use it to generate as much as we can. For our base cases, let’s assume that we have some form of schema for the service. Popular API schema definitions are OpenAPI, RAML, and GraphQL. Hopefully there will be a lot of overlap in what we can generate from each of these schemas, and we’ll consider them all in turn, but let’s start with GraphQL since it has a very concise syntax for expressing type information and also because we use it a lot.
Suppose we have the following schema:
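As a minimal, illustrative sketch of the kind of schema the rest of this section works with (the exact fields and query are assumptions for the discussion):

```graphql
# Illustrative only – a user with a couple of simple scalar fields.
type User {
  name: String!
  email: String!
}

type Query {
  user(email: String!): User
}
```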
Value Generation
Let’s build this from the ground up, from the simplest properties to the biggest. For starters, we want to generate, with zero configuration, good names and emails. We can see that the type is `String!`, so we can use an all-purpose string generator. However, we also know that the words `name` and `email` correspond to specific kinds of strings, and so we want to generate using those specific kinds without having to spell it out. Only in the event that we guess wrong should you need to narrow the type of data generated. One way to do this would be with a GraphQL directive (see the sketch below). The directive names aren’t important, but the idea is that you can declare how a field ought to be generated.
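For example, something along these lines, where the directive name and its arguments are hypothetical:

```graphql
# Hypothetical directive – only the idea of declaring a generation strategy matters.
directive @gen(strategy: String!) on FIELD_DEFINITION

type User {
  name: String! @gen(strategy: "person.name")
  email: String! @gen(strategy: "internet.email")
}
```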
Whenever you define a new type in your schema, you would need to specify (also via a directive?) a strategy to generate values of that type.
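On the engine side, that strategy lookup could be as simple as a registry of generator functions keyed by strategy name; this is a sketch with made-up strategy names, not an actual API:

```ts
// Hypothetical registry mapping generation strategies to generator functions.
type Generator = () => string;

const generators: Record<string, Generator> = {
  "person.name": () => ["Alice", "Bob", "Carmen"][Math.floor(Math.random() * 3)],
  "internet.email": () => `user${Math.floor(Math.random() * 1e6)}@example.com`,
  // all-purpose fallback for String! fields we know nothing about
  "string": () => Math.random().toString(36).slice(2),
};

function generate(strategy: string): string {
  return (generators[strategy] ?? generators["string"])();
}

console.log(generate("person.name"), generate("internet.email"));
```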
Product types such as `User` can be generated just by generating all of their fields! However, you may want to control default parameters on how a user is generated, for instance the statistical variance of using `gmail` vs `yahoo` vs some custom domain for the email. Could we use a directive there as well?

Persistent Records
Now that we’ve talked about simple values, it’s time to jump up a level of complexity and talk about simulating the persistence represented by the user itself. Unlike the simple values, the user (usually) represents a record that is persistent in the system. For example, when you search for a user by email, you should get the same record over and over and over again. When you traverse a relationship (more on that later) you should get the same related record no matter how many times you traverse it. I don’t think that it’s enough to have just the schema information at this point. You have to have some more information about how records are retrieved. If we have that information, then we can generate the persistence as well.
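A sketch of what that persistence could look like inside the simulator, assuming a simple in-memory store that lazily generates a record on first lookup and then always returns that same record (the names here are illustrative):

```ts
// Lazily-populated store: the first lookup generates a record, every later
// lookup for the same key returns that exact same record.
interface UserRecord {
  id: string;
  name: string;
  email: string;
}

const usersByEmail = new Map<string, UserRecord>();

function findUserByEmail(email: string): UserRecord {
  let user = usersByEmail.get(email);
  if (!user) {
    user = {
      id: `user-${usersByEmail.size + 1}`,
      name: `Generated User ${usersByEmail.size + 1}`, // stand-in for real value generation
      email,
    };
    usersByEmail.set(email, user);
  }
  return user;
}

// Searching twice by the same email yields the same record.
console.log(findUserByEmail("a@example.com") === findUserByEmail("a@example.com")); // true
```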
For example, in GraphQL a very common persistence protocol is Relay (https://relay.dev). It provides interfaces and types for retrieving and relating data. If we know that a particular schema is an implementation of Relay, then we can use that information not only to write resolvers that retrieve persistent records, but also to automatically generate scenarios to seed those records.
We can use Relay to determine which types represent persistent roots, and which are merely derived values that hang off of those persistent roots. In this case, anything that implements the `Node` interface is a persistent root. So we’d have to re-declare the `User` type along the lines of the sketch below.
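Something like this, for illustration:

```graphql
# Illustrative re-declaration – the point is only that User implements Node.
interface Node {
  id: ID!
}

type User implements Node {
  id: ID!
  name: String!
  email: String!
}
```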
Armed with this knowledge, we can programmatically generate a scenario `createUser()` that takes parameters `name` and `email` and which:

- generates the value of the `User` using the rules for value generation for the user type described in the previous section
- creates an id and then persists it in the store.
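A sketch of what that generated scenario and the store behind it might look like; the function and store names are assumptions, not an actual API:

```ts
// Hypothetical generated scenario: generate a User value, give it an id,
// and persist it so that later lookups return that same record.
interface User {
  id: string;
  name: string;
  email: string;
}

const store = new Map<string, User>();

function createUser(params: { name?: string; email?: string } = {}): User {
  const id = `user-${store.size + 1}`;
  const user: User = {
    id,
    // value generation as in the previous section, overridable by the caller
    name: params.name ?? `Generated Name ${store.size + 1}`,
    email: params.email ?? `user${store.size + 1}@example.com`,
  };
  store.set(id, user);
  return user;
}

// A generated node(id: ID!) resolver can then just read from the same store.
function node(id: string): User | undefined {
  return store.get(id);
}
```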
We can also generate the `node(id: ID!): Node` resolver that can look up `User` records for you.

It’s worth re-iterating that no scenario is generated for creating non-node types, because they are not persistent. There is no scenario for creating a string or a number. Likewise, there is no scenario created for a product type like an `Address`: it is just a value like a string or a number, and while we can generate values for it inside our app, we can’t do anything with it. Because a user is both a value AND a persistent root, it does get a scenario. So, for example, suppose the user has an address field, along the lines of the sketch below.
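An illustrative version of such a schema:

```graphql
# Address is a plain value type – it does not implement Node.
type Address {
  street: String!
  city: String!
  postalCode: String!
}

type User implements Node {
  id: ID!
  name: String!
  email: String!
  address: Address!
}
```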
There is no scenario `createAddress()`, because addresses are just values, not nodes, and an address value will be generated as part of generating a user value.

Relationships
many to one
Now that we have a way to automate the generation of persistent root values like `User`, how do we handle generating user values that also have fields which are themselves persistent roots? Let’s now add a `BlogPost` type to our schema that is a persistent root (see the sketch below).

The question is: what do we use for the value of the author of a post? It seems like our only two options here are either to create a brand new one, or to re-use one that has already been created for us. Which one is appropriate? It’s a tough call, and I’m not sure what the right answer is, although my intuition is that you want to create new ones up to a certain point, and then start recycling them after that.
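Here is the kind of `BlogPost` type meant above (illustrative):

```graphql
# Illustrative sketch of the new persistent root.
type BlogPost implements Node {
  id: ID!
  title: String!
  author: User!
}
```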
One possibility is to declare a set of constraints that the generated author must satisfy. For example: “we want our blog posts to be evenly distributed between 4 and 10 authors.” That means that for the first four, you generate a new one every time; after that, you occasionally generate a new one; and once you have 10 authors, you never generate any more and only distribute between them.
To do this, we would have to have some sort of control function that guides the creation of the relationship. This function would take as its inputs the current stats (P blog posts, N authors) and return both the author to use (newly created or recycled) as well as how the stats have been augmented by this.
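A minimal sketch of such a control function, using the “between 4 and 10 authors” constraint from above; the 0.3 creation rate and all names are placeholders:

```ts
// Control function guiding the many-to-one relationship: always create new
// authors below the minimum, never create past the maximum, and mix creation
// with recycling in between.
interface Stats {
  posts: number;
  authors: number;
}

interface Author {
  id: string;
}

const authorPool: Author[] = [];

function authorForNewPost(stats: Stats, min = 4, max = 10): { author: Author; stats: Stats } {
  const mustCreate = stats.authors < min || authorPool.length === 0;
  const mayCreate = stats.authors < max && Math.random() < 0.3;
  let author: Author;
  let authorCount = stats.authors;
  if (mustCreate || mayCreate) {
    author = { id: `author-${authorCount + 1}` };
    authorPool.push(author);
    authorCount += 1;
  } else {
    // recycle an existing author instead of growing the set forever
    author = authorPool[Math.floor(Math.random() * authorPool.length)];
  }
  return { author, stats: { posts: stats.posts + 1, authors: authorCount } };
}
```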
https://www.gamasutra.com/blogs/DanielCook/20141215/232300/Loot_drop_best_practices.php
In other words, if you view the author as loot that appears with a declared probability when you create a blog post, then we can search the literature already out there for potential solutions.
In addition to these
one to many
The flip side of the coin is the reverse end of that relationship, because most of the time there will be two sides to it: if a `BlogPost` has an author, then the author will have many blog posts. This means that when we generate an author, we also have to make sure that there are enough blog posts for it. We could do this, for example, by declaring the outcome that we want, e.g. “authors average around 10 blog posts”, and our generating function can shoot for that by generating a random outcome and then checking whether it hit the target. Sometimes it will generate 6, other times it will generate 24, and it will iterate until the generation hits the target.
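One way to get that behavior is to sample the per-author post count from a distribution whose mean is the declared target; here is a sketch using a Poisson sample (Knuth’s algorithm), with the specific numbers as placeholders:

```ts
// Sample how many blog posts an author gets, aiming at a declared average.
// Knuth's algorithm for a Poisson-distributed count with the given mean.
function poisson(mean: number): number {
  const limit = Math.exp(-mean);
  let k = 0;
  let p = 1;
  do {
    k += 1;
    p *= Math.random();
  } while (p > limit);
  return k - 1;
}

// "authors average around 10 blog posts": individual samples vary (6, 24, ...)
// but the long-run average converges on the target.
const counts = Array.from({ length: 1000 }, () => poisson(10));
console.log(counts.reduce((a, b) => a + b, 0) / counts.length); // ≈ 10
```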
This produces a counter-intuitive result. Let’s say that we want the average author to have about 15 blog posts. That means that in order to create a blog post, we need to create both the author and the other blog posts of that author… enough that it would satisfy the constraints of the simulation.
many to many
The necessity of this approach is evident when sets of persistent roots are related to each other in more complex ways. Suppose that blog posts can have more than one author. How does this affect the creation of a single blog post? Let’s say that blog posts have an average of 1.2 authors. It might be that a particular generation produces a blog post with three authors. Then we need to produce blog posts for all of those authors that match the generation profile, each of which may in turn have one or more authors of its own. This could cause an infinite explosion of entities if we don’t re-use some of the authors that have been created for previous blog posts. In other words, we need to make sure that our generation converges on a finite set rather than exploding infinitely.
First Class Edges
In summary, at the lowest level we are generating values, and at the persistent level we are generating values and the relationships between them. In this way, we can think of creating a `User` record as generating a relationship between the root (or `Query`, in GraphQL terms) and a `User` value.

Coordinating Scenarios and Things in the Store
All of the preceding describes how arranging state would work within a single domain, in this case a GraphQL API. What it doesn’t address is how to arrange a simulation that spans multiple domains. Let’s say, for example, that we are simulating both Auth0 and a GraphQL gateway. In this case, we want a single scenario to create a person, and then that person is imported into both the Auth0 domain and the GraphQL domain, with the result that you can log in and complete a flow that spans both systems having invoked only a single scenario to create a person.
This could get nasty because of what we talked about in the prior discussion. If we create a person in the world, and then that person is imported into the GraphQL gateway as an author, which results in more authors getting created, which results in more people getting created, which are then imported into Auth0, which could potentially create more users, which are then mapped back to the world, you could end up with another explosion.
One option is to map the different entities in the various domains: we could have some sort of orchestrated agreement on a relationship topology beforehand, and then, once that agreement is reached, do the generation of values lazily. This could be done by mapping node equivalencies, and then making sure that allocating a node at any level, no matter which domain, allocates against that equivalency.
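A sketch of that idea, assuming a canonical person node that each domain lazily materializes its own record against (all names here are made up):

```ts
// Cross-domain coordination: agree on a canonical person node first, then let
// each domain lazily materialize its own record against that node.
interface PersonNode {
  canonicalId: string;
}

const auth0Users = new Map<string, { userId: string; email: string }>();
const graphqlAuthors = new Map<string, { id: string; name: string }>();

function auth0UserFor(node: PersonNode) {
  let user = auth0Users.get(node.canonicalId);
  if (!user) {
    user = { userId: `auth0|${node.canonicalId}`, email: `${node.canonicalId}@example.com` };
    auth0Users.set(node.canonicalId, user);
  }
  return user;
}

function authorFor(node: PersonNode) {
  let author = graphqlAuthors.get(node.canonicalId);
  if (!author) {
    author = { id: node.canonicalId, name: `Author ${node.canonicalId}` };
    graphqlAuthors.set(node.canonicalId, author);
  }
  return author;
}

// Both domains allocate against the same equivalency, so importing a person
// into one domain never triggers fresh generation in the other.
const person: PersonNode = { canonicalId: "person-1" };
auth0UserFor(person);
authorFor(person);
```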
Open Questions