Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query optimizations #478

Merged
merged 16 commits into from
May 11, 2020
Merged

Conversation

ghost
Copy link

@ghost ghost commented Apr 18, 2020

This PR optimizes performer and scene queries by doing a few things:

  • Selecting only columns that are actually used by graphql. The primary purpose of this is excluding performer images and scene covers from the queries since they are never used.
  • Fetching all entity data at once instead of first querying for ids, and then querying each individual id for the data. Despite the sqlite manual's claims, doing a single query is significantly faster, at least in my tests.
  • Removing pointless joins. Many places entity tables are joined to do counts and existence checks, despite the join table alone being sufficient. E.g: Joining performers with scenes through performer_scenes and counting scenes.id, rather than just counting performers_scenes.scene_id.

The performance gains for my test cases have been:
120 Performers: ~560ms -> 30ms.
60 Scenes with performers and custom covers: 220ms -> 70ms.

Note that this is with the migration to move the performer image to the last column. Otherwise the difference for performers is likely less significant.

The scene query is harder to optimize due to a larger number of joins for each entity, but it's still a nice gain.

The diff is a bit hard to follow, and I didn't test all the filters extensively, so it's possible I missed something, but generally speaking there should be no changes other than the performance improvements.

@WithoutPants
Copy link
Collaborator

Looks like its failing the unit tests. Can you take a look at the results of make it?

bnkai
bnkai previously approved these changes Apr 20, 2020
@bnkai bnkai dismissed their stale review April 20, 2020 22:03

misssing field

@WithoutPants
Copy link
Collaborator

I really don't like the idea of changing the queries to get a static list of column names, but it'll have to do for now until we get rid of the blobs in the schema. I don't like having an extra place to remember to maintain columns for objects.

@WithoutPants
Copy link
Collaborator

WithoutPants commented Apr 22, 2020

Just as an addendum to last comment, I think having an arbitrary set of columns in the Find functions are a bad idea. It is non-obvious to the calling code just what is being returned.

My feeling is that the column changes should be dropped for this PR, so that we can get the performance gain of getting rid of the unnecessary joins, and perform the column stuff in a separate PR. The design of the query builder PR needs to be iterated on further imo.

If we absolutely must include it in this PR, my thinking would be that the existing find interfaces should be left as they originally were (getting all columns) and adding a new interface that accepts a list of columns. That way, the calling code has control over what columns it queries for, and makes it more explicit that the graphql code will be querying for specific columns.

I think though that this should probably be done separately as part of a larger refactoring of our query interfaces.

@ghost
Copy link
Author

ghost commented Apr 22, 2020

I agree that it's not a great solution. Ideally the resolvers would figure out the required columns based on the query and only fetch those, but that would probably make for a really hairy resolver.

Unfortunately once I revert the column changes, the performance gains largely evaporate. I'll try to come up with a cleaner solution, but as you say the query builders should probably be refactored altogether, and I don't think I'm the right person to do that.

@WithoutPants
Copy link
Collaborator

Assuming we remove the blobs from the tables, is there going to be an appreciative performance difference between doing select * vs select <specific columns>?

If there is, then it makes sense to refactor the query functions to accept columns. Otherwise, I would suggest changing the unnecessary joins and leaving the column selection as is with the view that the blob stuff should be done as priority for the next release.

I'm going to submit a PR to your branch with an alternative approach, but I am hesitant to merge it into 0.2 presuming we release reasonably soon, since there's a decent risk of regression.

@WithoutPants
Copy link
Collaborator

I've had a quick go at it. You can take a look at the branch here: https://github.com/WithoutPants/stash/tree/select-columns

I still don't really like the idea in general. When we make a call that gets a Scene or Performer object, we should really be fully populating the object, because otherwise other functions have no idea how much an object has been populated. I can see this being a massive maintenance headache.

@ghost
Copy link
Author

ghost commented Apr 29, 2020

Column changes have been reverted, as have the bulk fetching since it proved to be slower than fetching everything one by one without the column change.

Also removed some groupBys in performer findByScene since it's really slow and doesn't make much sense anyway. If someone for some reason attaches the same performer multiple times to a scene then we should just return that.

Finally, refactored scene query to use the querybuilder because why not.

@WithoutPants
Copy link
Collaborator

Changes look ok and tested ok during my quick testing. #513 is now merged, so if you can merge in the develop branch then we can ensure that at least the new unit tests pass.

@ghost ghost force-pushed the query-optimizations branch from 352787a to 4508340 Compare May 4, 2020 15:21
Copy link
Collaborator

@WithoutPants WithoutPants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add the following unit tests to querybuilder_scene_test.go and remove the applicable TODO comments:

func TestSceneCountByTagID(t *testing.T) {
	sqb := models.NewSceneQueryBuilder()

	sceneCount, err := sqb.CountByTagID(tagIDs[tagIdxWithScene])

	if err != nil {
		t.Fatalf("error calling CountByTagID: %s", err.Error())
	}

	assert.Equal(t, 1, sceneCount)

	sceneCount, err = sqb.CountByTagID(0)

	if err != nil {
		t.Fatalf("error calling CountByTagID: %s", err.Error())
	}

	assert.Equal(t, 0, sceneCount)
}

func TestSceneCountByMovieID(t *testing.T) {
	sqb := models.NewSceneQueryBuilder()

	sceneCount, err := sqb.CountByMovieID(movieIDs[movieIdxWithScene])

	if err != nil {
		t.Fatalf("error calling CountByMovieID: %s", err.Error())
	}

	assert.Equal(t, 1, sceneCount)

	sceneCount, err = sqb.CountByMovieID(0)

	if err != nil {
		t.Fatalf("error calling CountByMovieID: %s", err.Error())
	}

	assert.Equal(t, 0, sceneCount)
}

func TestSceneCountByStudioID(t *testing.T) {
	sqb := models.NewSceneQueryBuilder()

	sceneCount, err := sqb.CountByStudioID(studioIDs[studioIdxWithScene])

	if err != nil {
		t.Fatalf("error calling CountByStudioID: %s", err.Error())
	}

	assert.Equal(t, 1, sceneCount)

	sceneCount, err = sqb.CountByStudioID(0)

	if err != nil {
		t.Fatalf("error calling CountByStudioID: %s", err.Error())
	}

	assert.Equal(t, 0, sceneCount)
}

func TestFindByMovieID(t *testing.T) {
	sqb := models.NewSceneQueryBuilder()

	scenes, err := sqb.FindByMovieID(movieIDs[movieIdxWithScene])

	if err != nil {
		t.Fatalf("error calling FindByMovieID: %s", err.Error())
	}

	assert.Len(t, scenes, 1)
	assert.Equal(t, sceneIDs[sceneIdxWithMovie], scenes[0].ID)

	scenes, err = sqb.FindByMovieID(0)

	if err != nil {
		t.Fatalf("error calling FindByMovieID: %s", err.Error())
	}

	assert.Len(t, scenes, 0)
}

func TestFindByPerformerID(t *testing.T) {
	sqb := models.NewSceneQueryBuilder()

	scenes, err := sqb.FindByPerformerID(performerIDs[performerIdxWithScene])

	if err != nil {
		t.Fatalf("error calling FindByPerformerID: %s", err.Error())
	}

	assert.Len(t, scenes, 1)
	assert.Equal(t, sceneIDs[sceneIdxWithPerformer], scenes[0].ID)

	scenes, err = sqb.FindByPerformerID(0)

	if err != nil {
		t.Fatalf("error calling FindByPerformerID: %s", err.Error())
	}

	assert.Len(t, scenes, 0)
}

func TestFindByStudioID(t *testing.T) {
	sqb := models.NewSceneQueryBuilder()

	scenes, err := sqb.FindByStudioID(performerIDs[studioIdxWithScene])

	if err != nil {
		t.Fatalf("error calling FindByStudioID: %s", err.Error())
	}

	assert.Len(t, scenes, 1)
	assert.Equal(t, sceneIDs[sceneIdxWithStudio], scenes[0].ID)

	scenes, err = sqb.FindByStudioID(0)

	if err != nil {
		t.Fatalf("error calling FindByStudioID: %s", err.Error())
	}

	assert.Len(t, scenes, 0)
}

Three of these tests are failing (TestSceneCountByTagID, TestSceneCountByMovieID and TestFindByMovieID). I believe it is due to the selectAll("all") calls in lines 30 and 36 of querybuilder_scene.go. This is also causing the movies and tags pages to error out.

@@ -10,7 +10,9 @@ import (
"github.com/stashapp/stash/pkg/database"
)

var scenesForPerformerQuery = selectAll("scenes") + `
const TABLE_NAME = "scenes"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golang constants are visible module-wide, so TABLE_NAME is not a good constant name for a scene-specific value. Suggest changing this to sceneTable or similar.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duh, I keep forgetting that go is basically C, scope and all. Fixed.

@ghost ghost force-pushed the query-optimizations branch from 2dd38ee to 7681b2a Compare May 9, 2020 06:50
@WithoutPants WithoutPants merged commit 8ba7678 into stashapp:develop May 11, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
* Remove slow and largely pointless groupbys
* Change scene.query to use querybuilder
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants