Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for seeding the database with test data #259

Closed
soggycactus opened this issue Jul 30, 2021 · 23 comments · Fixed by #291
Closed

Add support for seeding the database with test data #259

soggycactus opened this issue Jul 30, 2021 · 23 comments · Fixed by #291

Comments

@soggycactus
Copy link

I have used goose at pretty much every job to manage my relational databases. It works like a charm, but I always find myself creating some sort of wrapper that allows me to use goose to seed my local and development environments with test data. It would be really nice if this were supported already.

I propose creating a new command goose seed that will allow users to continue managing their schema migrations with goose and also seed their non-production environments with test data for integration & end-to-end testing. Please note that I am not suggesting actual seed data be managed this way - if your application requires some static, pre-existing data, that can just be inserted along with the regular schema migrations.

What are your thoughts on this? If we agree that this is useful and decide on an interface, I'd like to take on this work and open a PR to implement it.

@mfridman
Copy link
Collaborator

Is your proposal similar to #235 (comment), specifically step number 4?

Where by "seed" you mean import an already dumped database?

I certainly think there is room for improvement, as I've also (and in internal projects) have extended goose. So why not add those "nice" things to the core package.

@soggycactus maybe if you can flush out what you're proposing we can have a more technical understanding of what a command such as "seed" would do

@soggycactus
Copy link
Author

It's similar to #235 but not exactly the same. Typically in our projects what we've done is create another migrations directory that essentially consists of a lot of INSERT INTO migrations to fill out the database tables with some static data that can be used for integration tests; for example, our tests check that our /accounts endpoint is able to correctly read, update, and delete data, so when these tests run we run them with the expectation that some data already exists to test this functionality.

We do some work outside of goose to make sure these additional migrations are only run locally and inside of CI and dev environments, since the static testing data helps keep our end-to-end test results predictable. When we deploy and run migrations for prod, we don't run the additional migrations that insert test data into the database.

I'm curious what your thoughts are on what a potential seed command would look like. My initial thought is that we add a type attribute to the migrations, where each migration is either a schema type or a seed type, and we tell goose to either run all migrations or only schema migrations. This probably would be very hard to make backwards compatible, so another thought is to perhaps have a separate seed command that references a different migrations directory and also uses a different database table to track the seed migrations - this way we don't run into the versioning problem

@soggycactus
Copy link
Author

bumping @mfridman any thoughts on the above?

@mfridman
Copy link
Collaborator

mfridman commented Aug 3, 2021

My initial thought was why is it a goose concern, e.g., you can already tell goose via -dir to run from dirA (seed data) and then dirB (actual migrations). You just have to call goose twice with different args.

So I'm struggling a bit to understand how a seed command would help in this case. You'd still have to run goose seed and then goose up, which is effectively goose up -dir seedDir/ and goose up -dir migrations/

and we tell goose to either run all migrations or only schema migrations.

Is the intention to keep seed data alongside the actual migration data? Or to be able to identify it in the database table?

Sorry maybe I'm missing something.


this way we don't run into the versioning problem

Any chance you can elaborate on this. I suspect there is a way to address your issues (and many others) but I'd like to understand a bit better.

@soggycactus
Copy link
Author

So I'm struggling a bit to understand how a seed command would help in this case. You'd still have to run goose seed and then goose up, which is effectively goose up -dir seedDir/ and goose up -dir migrations/

@mfridman you're exactly right about this - we manage our current project doing exactly this. It's a little annoying to have to write a few scripts and make sure that we're only running the second set of migrations in the lower environments.

The versioning problem arises because the seed migrations always have to come after the schema migrations. So in a regular workflow, it's really easy to create a schema migration with a later timestamp than the existing seed migrations, apply the schema migrations, and then try to apply the seed migrations only to have goose say there's nothing to run. We get around this by making every seed migration happen in the year 3000, but again it feels hacky and is kind of annoying.

I agree that a seed command in this context might not be as useful as it sounds - what if goose seed simply introspected the existing structure of the database, used something like https://github.com/bxcodec/faker to generate the fake data and write some dummy records to all tables.

I just feel like there has to be a better way to seed the database with dummy data than what we're currently doing.

@mfridman
Copy link
Collaborator

mfridman commented Aug 3, 2021

There's a bit of history w.r.t to the current versioning done by goose, I'll save this for a blog post.

However, say goose supported an opt-in flag such as --strict=off .. and the behaviour of this would be to apply all migrations regardless of timestamp and/or sequence, would that solve or mitigate your current issues? Such as described in this comment, #172 (comment)

Curious, for your INSERT seed migrations, are you adding ON CONFLICT DO NOTHING or similar depending on your database?

@soggycactus
Copy link
Author

However, say goose supported an opt-in flag such as --strict=off .. and the behaviour of this would be to apply all migrations regardless of timestamp and/or sequence, would that solve or mitigate your current issues? Such as described in this comment, #172 (comment)

It might mitigate my current issues IF the seed migrations run after the schema migrations; don't want to be inserting data into tables that don't exist.

We don't need to use the ON CONFLICT statement because the seed data is managed by goose, so if we run the seed migrations again and again, goose recognizes that it's already applied them.

@mfridman
Copy link
Collaborator

mfridman commented Aug 3, 2021

It might mitigate my current issues IF the seed migrations run after the schema migrations; don't want to be inserting data into tables that don't exist.

Yes, this is the intention, regardless when the migrations were created they'd be applied regardless. This pushes the complexity to the user to manage their migrations, but the outcome is what most users would want anyways.

As long as the migration steps, migrations step followed by seed step, are coordinated and applied serially then it accomplishes what you want.

I think this is another data point, that we should introduce --strict=off to allow for the above behaviour. That way you can treat your directories seedDir and migrationsDir for what they are instead of worrying about coordinating versions/timestamps between them.

@soggycactus
Copy link
Author

@mfridman That sounds good, should I leave this issue open in the meantime so you can reference it if you implement --strict=off?

@mfridman
Copy link
Collaborator

mfridman commented Aug 3, 2021

@soggycactus Yes please, I'm going to carve a bit of time to think this through and make sure we get the correct implementation based on all the issues we received w.r.t versioning.

I'd first like to land module support #261 and have a clear non-breaking (/v3) vs breaking (/v4) path forward for the project. We should be able to land strict mode off in /v3.

@soggycactus soggycactus changed the title [Discussion] - Add support for seeding the database with test data Add support for seeding the database with test data Aug 3, 2021
@ukiahsmith
Copy link

I am looking for a similar feature. I need a way to seed and teardown test data. It is easy enough to run goose using two different directories, one for sql structure (migrations/) then a second on sql for test data (seedDir/), but it is not possible to rollback the test data.

When running "goose down" on the test data directory I received the error no migration 1. With "1" being the name of the only migration in the sql structure directory.

It appears that goose fails to execute the "down" sql unless the file for the previous migrations are also present.

My use case is that I need to have a HEAD of sql statements that seed test data, and roll them back at test cleanup. The seed sql only needs to be valid for the most recent version of the db schema, so it can change over time.

@mfridman
Copy link
Collaborator

Do you need to keep track of "seeded" migrations in the database?

Also, if I understood correctly, the order of operations will be:

  1. Apply migrations
  2. Seed database
  3. Delete seeded data
  4. Down migrations ?

@ukiahsmith
Copy link

Are you asking for the order of operations to get the error I described? Or order of operations for my desired workflow?

For my desired workflow:

I don't think I need to keep track of the seeded migrations in the database in the same that that migrations are tracked with a serial/datetime id. I do not expect to have multiple layers of seed/test data.

I do not down the migrations as part of my testing process.

  1. Apply migrations
  2. Seed database
  3. Run tests, make changes, fix things
  4. Delete seed data
  5. Commit updates

I would also image that step 3 above could include deleting seed data, updating seed data in response to test changes, re-seed database, repeat as needed. This would be "as needed" as I've yet to encounter it.

My workaround was to create two sql files, a down and an up, and run each one when needed using Exec() manually. The up is run after migrations, and the down is run defered and run at the end of TestMain().

Using os.ReadAll() and Exec() is not that much work, but I was hoping to have up/down of sql all be done one way with one tool for consistency.

@mfridman
Copy link
Collaborator

Apologies, I should have clarified.. I was looking for order of operations for desired workflow. I believe I understand the current limitation, so this is useful. Thank you!

fwiw I'd like to add this functionality because I also found myself wrapping around goose doing a bunch of INSERT commands.

What are your thoughts, as @soggycactus suggested, to have a distinct command (or flag) that applies ad-hoc SQL ? Something like:

goose -dir ./seedDir -seed up
goose -dir ./seedDir -seed down

And the -seed flag would instruct goose to functionally behave the same, but not track applied "changes". So in effect you get an all or nothing for both up and down?

So /seedDir/00001_add_organizations.sql could look something like:

-- +goose Up
INSERT INTO ...

-- +goose Down
TRUNCATE TABLE ..

@mfridman
Copy link
Collaborator

mfridman commented Nov 25, 2021

I've been playing up around with an implementation and quite like it. The only thing I'm not sure about is how to expose the functionality to the end-user:

  1. a flag -seed (as described above): goose -dir .. -seed {up|down}
  2. a subcommand: goose -dir .. seed {up|down}
  3. a command: goose -dir .. {seed-up|seed-down}

The -dir .. here is intended to be a separate "seed" directory that contains one or more migration files. The up and down functionality is intended to be all or none, likely within a transaction.

Suggestions or alternatives welcome.

One item for discussion, from this comment, is whether seed operations should be tracked in the database or not.

@ukiahsmith
Copy link

I see two different use cases, "insert seed data after migration in test/prod" and "insert test data on demand in local/dev".

In thinking this through I see that migrations and seed data are two different uses. I would then separate out migrations to only alter the structure and shape of the database, and seed data to only insert into tables and update rows.

Insert seed data after migration in test/prod

I can see there being a seed data file for every migration file. The migration updates the structure, or adds new structure, and the seed data inserts test/dummy data needed to use the new feature.

In this case I would want the option to rollback seed data independently of the structure, or to roll them back in lock-step.

Seed data would also need to only be applied after it's corresponding migration was applied. So, migration-01, seed-01, migration-02, seed-02.

Insert test data on demand in local/dev

I would want looser options around apply test data. I want to add test data per-test, and then delete it after the test is run.

Here is where I think that the test data wouldn't need to update the migration version in the goose table. Goose would just run the corresponding up or down sql, in file order, from the directory it's pointed to. I would see seed data files being separate from migration files.


Really for my needs I think an "ignore version" or "don't update version" flag would allow me to point goose to another directory with sql and just up it or down it.

I pointed goose to a different directory with sql and said "just run these" but goose complained that the current version in the db table was not available in that directory. If I had a flag to ignore that check it would have just up'd my test data inserts.

@mfridman
Copy link
Collaborator

mfridman commented Nov 30, 2021

That's a great summary of the problem. Thanks for putting the time to write that up.

For case 1 I think they are one and the same and can be accomplished today (for the most part, caveat below). Whenever we needed some data to always be inserted into the database, we captured it within the same migration file (or a subsequent file).

Caveat, there is no way (today) to run multiple migrations within a single transaction. There is already an open issue for this #222.

So to keep goose simple, I can't see an immediate use for special casing the first case with new semantics for "seed data after migrations".

For case 2, however, I think we can introduce a flag --no-versioning or whatever we call it, to indicate that goose commands should be executed ad-hoc without versioning in the database I think you phrased it nicely:

would just run the corresponding up or down sql, in file order, from the directory it's pointed to.

@ukiahsmith
Copy link

I agree for case 1, nothing needs to be changed. This would be more of a pattern to be documented.

I also agree for case 2. What is the working that would be most direct? "Ignore versioning", "No versioning", "Ad Hoc", "Omit versioning" Don't mind me, this bike sheds a decent enough color already.

Would --no-versioning be able to have a corresponding short flag too?

@mfridman
Copy link
Collaborator

mfridman commented Dec 1, 2021

I'll carve out some time to add this feature, update the docs and write a short blog post explaining the use case.

Don't mind bikeshedding a bit. With this feature I'd err on the side of caution to make the flag explicit and self explanatory:

--no-versioning, --omit-versioning, --ad-hoc

If you have suggestion for a short flag, I'm open to it. Although something tells me this shouldn't have a short flag.

Would --no-versioning be able to have a corresponding short flag too?

The short answer is, potentially, but not yet.

The long answer is, goose is 9 year old project and there are a few components of the project I'd like to refactor to make more idiomatic (this will be a /v4). One of these is improving the CLI experience, so before committing to additional flags I'd like to think it through properly. Whatever the outcome is, the flags will be backported to the current /v3.

@ukiahsmith
Copy link

I like either --ignore-versioning or --no-version or --no-versioning. I'm sure they are all about the same.

I understand about it being an old codebase with it's own ghosts. I appreciate you helping to maintain it.

@mfridman
Copy link
Collaborator

@ukiahsmith This functionality has been added in c8aa123, merged via #291

You'll likely find the following commands do what you described above:

# apply all migrations 
goose -dir ./seed -no-versioning up

# apply all down migrations (same as down-to 0)
goose -dir ./seed -no-versioning reset

All commands should support the -no-versioning flag and will treat the -dir as the source of truth. Note there are a few limitations, such as running down multiple times will always run the same file, and up-by-one .. but that is to be expected since we don't know where to begin (i.e., the version state is not tracked when running -no-versioning).

Thanks for the input, glad we got this functionality into goose

@ukiahsmith
Copy link

@mfridman Thank you for taking the time to talk through this and to implement it.

@mfridman
Copy link
Collaborator

Thanks for the feedback, wrote a quick post to summarize how this works for the next person.

Ad-hoc migrations with no versioning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants