
---

Great write-up, really enjoyed reading it. I like the idea of lazy or on-demand generation; that makes a whole lot of sense for many reasons. One thing I was thinking of as a way of stopping regeneration of the same data in multiple domains is to generate a hash of each object and check it against a map before generating again:

```js
const hashCode = o.hashCode;
if (map.has(hashCode)) {
  // do nothing, this data has already been generated
} else {
  map.set(hashCode, o);
}
```

---

Here is some relevant prior art related to this problem of graph generation.

---

Thinking about this some more, I am getting some real vibes of Monte Carlo simulation here. The implementation of this that I always think of is Guesstimate. Other links with implementations:

---

@jbolda This is a really cool find. Basically, the idea would be that your only responsibility would be to specify a distribution, and then the engine would choose at random along that distribution.
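As a rough sketch of what that could look like (the names here are made up, not an existing engine API): a declared distribution is just a function the engine samples on demand.

```ts
// Sketch: a declared distribution is just a function the engine can sample from.
type Distribution = () => number;

// Normal distribution via the Box-Muller transform.
function normal(mean: number, stdDev: number): Distribution {
  return () => {
    const u1 = Math.random() || Number.MIN_VALUE; // avoid log(0)
    const u2 = Math.random();
    const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    return mean + stdDev * z;
  };
}

// Declare "posts per author" as a distribution; the engine samples it on demand.
const postsPerAuthor = normal(10, 3);
console.log(Math.max(0, Math.round(postsPerAuthor())));
```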

---

The payoff for having a high-fidelity simulation is huge, but how do you realize the benefits of simulating a mature API without implementing it twice?
If we’re making a simulation of a complex API, we can end up in a bad situation where we have two completely separate and divergent implementations: the simulated implementation and the “real” implementation. What good is it if your simulation implementation is a constant maintenance headache?
Ideally, we would be able to automatically generate as much of the implementation as possible, so that the cost of maintaining the simulation does not add significantly to the overhead of maintaining the actual service. The way to go about this is to take what we know about a service and use it to generate as much as we can. For our base cases, let’s assume that we have some form of schema for the service. Popular API schema definitions are OpenAPI, RAML, and GraphQL. Hopefully there will be a lot of overlap in what we can generate from each of these schemas, and we’ll consider them all in turn, but let’s start with GraphQL since it has a very concise syntax for expressing type information and also because we use it a lot.
Suppose we have the following schema:
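As a minimal, illustrative sketch of the kind of schema the rest of this section works with (the exact fields and query are assumptions for the discussion):

```graphql
# Illustrative only – a user with a couple of simple scalar fields.
type User {
  name: String!
  email: String!
}

type Query {
  user(email: String!): User
}
```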
Value Generation
Let’s build this from the ground up, from the simplest properties to the biggest. For starters, we want to generate, with zero configuration, good names and emails. We can see that the type is `String!`, so we can use an all-purpose string generator. However, we also know that the words `name` and `email` correspond to specific kinds of strings, and so we want to generate using those specific kinds without having to spell it out. Only in the event that we guess wrong should you need to narrow the type of data generated. One way to do this would be with a GraphQL directive (see the sketch below). The directive names aren’t important, but the idea is that you can declare how a field ought to be generated.
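For example, something along these lines, where the directive name and its arguments are hypothetical:

```graphql
# Hypothetical directive – only the idea of declaring a generation strategy matters.
directive @gen(strategy: String!) on FIELD_DEFINITION

type User {
  name: String! @gen(strategy: "person.name")
  email: String! @gen(strategy: "internet.email")
}
```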
Whenever you define a new type in your schema, you would need to specify (also via a directive?) a strategy to generate values of that type.
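On the engine side, that strategy lookup could be as simple as a registry of generator functions keyed by strategy name; this is a sketch with made-up strategy names, not an actual API:

```ts
// Hypothetical registry mapping generation strategies to generator functions.
type Generator = () => string;

const generators: Record<string, Generator> = {
  "person.name": () => ["Alice", "Bob", "Carmen"][Math.floor(Math.random() * 3)],
  "internet.email": () => `user${Math.floor(Math.random() * 1e6)}@example.com`,
  // all-purpose fallback for String! fields we know nothing about
  "string": () => Math.random().toString(36).slice(2),
};

function generate(strategy: string): string {
  return (generators[strategy] ?? generators["string"])();
}

console.log(generate("person.name"), generate("internet.email"));
```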
Product types such as `User` can be generated just by generating all of their fields! However, you may want to control default parameters on how a user is generated, for instance the statistical variance of using `gmail` vs `yahoo` vs some custom domain for the email. Could we use a directive there as well?

Persistent Records
Now that we’ve talked about simple values, it’s time to jump up a level of complexity and talk about simulating the persistence represented by the user itself. Unlike the simple values, the user (usually) represents a record that is persistent in the system. For example, when you search for a user by email, you should get the same record over and over and over again. When you traverse a relationship (more on that later) you should get the same related record no matter how many times you traverse it. I don’t think that it’s enough to have just the schema information at this point. You have to have some more information about how records are retrieved. If we have that information, then we can generate the persistence as well.
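A sketch of what that persistence could look like inside the simulator, assuming a simple in-memory store that lazily generates a record on first lookup and then always returns that same record (the names here are illustrative):

```ts
// Lazily-populated store: the first lookup generates a record, every later
// lookup for the same key returns that exact same record.
interface UserRecord {
  id: string;
  name: string;
  email: string;
}

const usersByEmail = new Map<string, UserRecord>();

function findUserByEmail(email: string): UserRecord {
  let user = usersByEmail.get(email);
  if (!user) {
    user = {
      id: `user-${usersByEmail.size + 1}`,
      name: `Generated User ${usersByEmail.size + 1}`, // stand-in for real value generation
      email,
    };
    usersByEmail.set(email, user);
  }
  return user;
}

// Searching twice by the same email yields the same record.
console.log(findUserByEmail("a@example.com") === findUserByEmail("a@example.com")); // true
```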
For example, in GraphQL a very common persistence protocol is Relay (https://relay.dev). It provides interfaces and types for retrieving and relating data. If we know that a particular schema is an implementation of Relay, then we can use that information not only to write resolvers that retrieve persistent records, but also to automatically generate scenarios to seed those records.
We can use Relay to determine which types represent persistent roots, and which are merely derived values that hang off of those persistent roots. In this case, anything that implements the `Node` interface is a persistent root. So we’d have to re-declare the `User` type along the lines of the sketch below.
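Something like this, for illustration:

```graphql
# Illustrative re-declaration – the point is only that User implements Node.
interface Node {
  id: ID!
}

type User implements Node {
  id: ID!
  name: String!
  email: String!
}
```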
Armed with this knowledge, we can programmatically generate a scenario `createUser()` that takes parameters `name` and `email` and which:

- generates the value of the `User` using the rules for value generation for the user type described in the previous section
- creates an id and then persists it in the store.
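A sketch of what that generated scenario and the store behind it might look like; the function and store names are assumptions, not an actual API:

```ts
// Hypothetical generated scenario: generate a User value, give it an id,
// and persist it so that later lookups return that same record.
interface User {
  id: string;
  name: string;
  email: string;
}

const store = new Map<string, User>();

function createUser(params: { name?: string; email?: string } = {}): User {
  const id = `user-${store.size + 1}`;
  const user: User = {
    id,
    // value generation as in the previous section, overridable by the caller
    name: params.name ?? `Generated Name ${store.size + 1}`,
    email: params.email ?? `user${store.size + 1}@example.com`,
  };
  store.set(id, user);
  return user;
}

// A generated node(id: ID!) resolver can then just read from the same store.
function node(id: string): User | undefined {
  return store.get(id);
}
```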
We can also generate the `node(id: ID!): Node` resolver that can look up `User` records for you.

It’s worth re-iterating that no scenario is generated for creating non-node types, because they are not persistent. There is no scenario for creating a string or a number. Likewise, there is no scenario created for a product type like an `Address`: it is just a value like a string or a number, and while we can generate values for it inside our app, we can’t do anything with it. Because a user is both a value AND a persistent root, it does get a scenario. So, for example, suppose the user has an address field, along the lines of the sketch below.
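An illustrative version of such a schema:

```graphql
# Address is a plain value type – it does not implement Node.
type Address {
  street: String!
  city: String!
  postalCode: String!
}

type User implements Node {
  id: ID!
  name: String!
  email: String!
  address: Address!
}
```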
There is no scenario `createAddress()`, because addresses are just values, not nodes, and an address value will be generated as part of generating a user value.

Relationships
many to one
Now that we have a way to automate the generation of persistent root values like `User`, how do we handle generating user values that also have fields which are themselves persistent roots? Let’s now add a `BlogPost` type to our schema that is a persistent root (see the sketch below).

The question is: what do we use for the value of the author of a post? It seems like our only two options here are either to create a brand new one, or to re-use one that has already been created for us. Which one is appropriate? It’s a tough call, and I’m not sure what the right answer is, although my intuition is that you want to create new ones up to a certain point, and then start recycling them after that.
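Here is the kind of `BlogPost` type meant above (illustrative):

```graphql
# Illustrative sketch of the new persistent root.
type BlogPost implements Node {
  id: ID!
  title: String!
  author: User!
}
```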
One possibility is to declare a set of constraints that the generated author must satisfy. For example: “we want our blog posts to be evenly distributed between 4 and 10 authors.” That means that for the first four, you generate a new one every time; after that, you occasionally generate a new one; and once you have 10 authors, you never generate any more and only distribute between them.
To do this, we would have to have some sort of control function that guides the creation of the relationship. This function would take as its inputs the current stats (P blog posts, N authors) and return both the author to use (newly created or recycled) as well as how the stats have been augmented by this.
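A minimal sketch of such a control function, using the “between 4 and 10 authors” constraint from above; the 0.3 creation rate and all names are placeholders:

```ts
// Control function guiding the many-to-one relationship: always create new
// authors below the minimum, never create past the maximum, and mix creation
// with recycling in between.
interface Stats {
  posts: number;
  authors: number;
}

interface Author {
  id: string;
}

const authorPool: Author[] = [];

function authorForNewPost(stats: Stats, min = 4, max = 10): { author: Author; stats: Stats } {
  const mustCreate = stats.authors < min || authorPool.length === 0;
  const mayCreate = stats.authors < max && Math.random() < 0.3;
  let author: Author;
  let authorCount = stats.authors;
  if (mustCreate || mayCreate) {
    author = { id: `author-${authorCount + 1}` };
    authorPool.push(author);
    authorCount += 1;
  } else {
    // recycle an existing author instead of growing the set forever
    author = authorPool[Math.floor(Math.random() * authorPool.length)];
  }
  return { author, stats: { posts: stats.posts + 1, authors: authorCount } };
}
```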
https://www.gamasutra.com/blogs/DanielCook/20141215/232300/Loot_drop_best_practices.php
In other words, if you view the author as loot that appears with a declared probability when you create a blog post, then we can search the literature already out there for potential solutions.
In addition to these
one to many
The flip side of the coin is the reverse end of that relationship, because most of the time there will be two sides to it: if a `BlogPost` has an author, then the author will have many blog posts. This means that when we generate an author, we also have to make sure that there are enough blog posts for it. We could do this, for example, by declaring the outcome that we want, e.g. “authors average around 10 blog posts”, and our generating function can shoot for that by generating a random outcome and then checking whether it hit the target. Sometimes it will generate 6, other times it will generate 24, and it will iterate until the generation hits the target.
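One way to get that behavior is to sample the per-author post count from a distribution whose mean is the declared target; here is a sketch using a Poisson sample (Knuth’s algorithm), with the specific numbers as placeholders:

```ts
// Sample how many blog posts an author gets, aiming at a declared average.
// Knuth's algorithm for a Poisson-distributed count with the given mean.
function poisson(mean: number): number {
  const limit = Math.exp(-mean);
  let k = 0;
  let p = 1;
  do {
    k += 1;
    p *= Math.random();
  } while (p > limit);
  return k - 1;
}

// "authors average around 10 blog posts": individual samples vary (6, 24, ...)
// but the long-run average converges on the target.
const counts = Array.from({ length: 1000 }, () => poisson(10));
console.log(counts.reduce((a, b) => a + b, 0) / counts.length); // ≈ 10
```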
This produces a counter-intuitive result. Let’s say that we want the average author to have about 15 blog posts. That means that in order to create a blog post, we need to create both the author and the other blog posts of that author… enough that it would satisfy the constraints of the simulation.
many to many
The necessity of this approach is evident when sets of persistent roots are related to each other in more complex ways. Suppose that blog posts can have more than one author. How does this affect the creation of a single blog post? Let’s say that blog posts have an average of 1.2 authors. It might be that a particular generation produces a blog post with three authors. Then we need to produce blog posts for all of those authors that match the generation profile, each of which may in turn have one or more authors of its own. This could cause an infinite explosion of entities if we don’t re-use some of the authors that have been created for previous blog posts. In other words, we need to make sure that our generation converges on a finite set rather than exploding infinitely.
First Class Edges
In summary, at the lowest level we are generating values, and at the persistent level we are generating values and the relationships between them. In this way, we can think of creating a `User` record as generating a relationship between the root (or `Query`, in GraphQL terms) and a `User` value.

Coordinating Scenarios and Things in the Store
All of the preceding describes how arranging state would work within a single domain, in this case a GraphQL API. What it doesn’t address is how to arrange a simulation that spans multiple domains. Let’s say, for example, that we are simulating both Auth0 and a GraphQL gateway. In this case, we want a single scenario to create a person, and then that person is imported into both the Auth0 domain and the GraphQL domain, with the result that you can log in and complete a flow that spans both systems having invoked only a single scenario to create a person.
This could get nasty because of what we talked about in the prior discussion. If we create a person in the world, and then that person is imported into the GraphQL gateway as an author, which results in more authors getting created, which results in more people getting created, which are then imported into Auth0, which could potentially create more users, which are then mapped back to the world, you could end up with another explosion.
One option is to map the different entities in the various domains: we could have some sort of orchestrated agreement on a relationship topology beforehand, and then, once that agreement is reached, do the generation of values lazily. This could be done by mapping node equivalencies, and then making sure that allocating a node at any level, no matter which domain, allocates against that equivalency.
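A sketch of that idea, assuming a canonical person node that each domain lazily materializes its own record against (all names here are made up):

```ts
// Cross-domain coordination: agree on a canonical person node first, then let
// each domain lazily materialize its own record against that node.
interface PersonNode {
  canonicalId: string;
}

const auth0Users = new Map<string, { userId: string; email: string }>();
const graphqlAuthors = new Map<string, { id: string; name: string }>();

function auth0UserFor(node: PersonNode) {
  let user = auth0Users.get(node.canonicalId);
  if (!user) {
    user = { userId: `auth0|${node.canonicalId}`, email: `${node.canonicalId}@example.com` };
    auth0Users.set(node.canonicalId, user);
  }
  return user;
}

function authorFor(node: PersonNode) {
  let author = graphqlAuthors.get(node.canonicalId);
  if (!author) {
    author = { id: node.canonicalId, name: `Author ${node.canonicalId}` };
    graphqlAuthors.set(node.canonicalId, author);
  }
  return author;
}

// Both domains allocate against the same equivalency, so importing a person
// into one domain never triggers fresh generation in the other.
const person: PersonNode = { canonicalId: "person-1" };
auth0UserFor(person);
authorFor(person);
```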
Open Questions