perf: Fix TestResults SQL query to not take as long #880

Merged
merged 11 commits into from
Oct 18, 2024

joseph-sentry
Contributor

this commit replaces the previous django orm code with a raw sql query
that performs much better for repos with a large amount of data.

one worry with this approach is the risk of sql injection

in this specific case, we're passing most of the dynamic, user-provided
values used in the sql through query parameters, which the django docs say
are sanitized and should be used to protect against sql injection

the user-provided values that are not passed through the sql parameters
are checked against a strict set of allowed values, so we shouldn't be
substituting any unexpected strings into the query
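
A minimal sketch of the two safeguards described above, using the standard-library sqlite3 module instead of Django so that it runs on its own. The table, the column names, the allow-lists and the `fetch_tests` helper are all hypothetical and not taken from this PR. Values go through driver-level parameters, while the pieces that cannot be bound as parameters (the ORDER BY column and direction) are checked against a fixed set before they are interpolated.

```python
import sqlite3

# Hypothetical allow-lists: only these strings may ever be interpolated.
ALLOWED_ORDER_COLUMNS = {"name", "failure_rate", "avg_duration"}
ALLOWED_DIRECTIONS = {"ASC", "DESC"}


def fetch_tests(conn, repoid, order_by, direction, term=None):
    # Identifiers cannot be bound as parameters, so validate them strictly.
    if order_by not in ALLOWED_ORDER_COLUMNS:
        raise ValueError(f"invalid ordering column: {order_by!r}")
    if direction not in ALLOWED_DIRECTIONS:
        raise ValueError(f"invalid ordering direction: {direction!r}")

    params = [repoid]
    term_clause = ""
    if term:
        # The clause text is hardcoded; only its value is dynamic, and it is bound.
        term_clause = "AND name LIKE ?"
        params.append(f"%{term}%")

    query = f"""
        SELECT name, failure_rate
        FROM tests
        WHERE repoid = ? {term_clause}
        ORDER BY {order_by} {direction}, name
    """
    return conn.execute(query, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tests (repoid INTEGER, name TEXT, failure_rate REAL, avg_duration REAL)"
)
conn.executemany(
    "INSERT INTO tests VALUES (?, ?, ?, ?)",
    [(1, "test_a", 0.5, 10.0), (1, "test_b", 0.1, 20.0), (2, "test_c", 0.9, 5.0)],
)
rows = fetch_tests(conn, 1, "failure_rate", "DESC")
```

In this sketch, a hostile `order_by` is rejected before any SQL is built, and a hostile `term` is only ever compared as a string.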

@joseph-sentry joseph-sentry requested a review from a team as a code owner October 10, 2024 20:53

codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 95.90164% with 5 lines in your changes missing coverage. Please review.

Project coverage is 96.28%. Comparing base (8945e37) to head (9c16f21).

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| utils/test_results.py | 94.89% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #880      +/-   ##
==========================================
- Coverage   96.31%   96.28%   -0.03%     
==========================================
  Files         823      823              
  Lines       19079    19130      +51     
==========================================
+ Hits        18376    18420      +44     
- Misses        703      710       +7     
| Flag | Coverage Δ |
|---|---|
| unit | 92.64% <95.90%> (-0.02%) ⬇️ |
| unit-latest-uploader | 92.64% <95.90%> (-0.02%) ⬇️ |


@codecov-notifications

codecov-notifications bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 94.57831% with 9 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| utils/test_results.py | 93.57% | 9 Missing ⚠️ |


Contributor

@matt-codecov matt-codecov left a comment

chiming in on the SQL injection risk: currently looks good. {order_by} and {order} are computed based on validated data, and every other string substitution controls the presence/omission of a hardcoded clause, nothing dynamic. the hardcoded clauses use query parameters and any dynamic values are provided separately

https://docs.djangoproject.com/en/5.1/topics/db/sql/#passing-parameters-into-raw it looks like you can name your query parameters like %(repoid)s and then make your params list a dict instead where the key is the parameter name. that would probably make this query a lot easier to work with

i imagine Django chose to use Python's % string interpolation syntax for their parameterized query syntax to make switching to safe queries easier, but man i wish the safe code didn't look so much like the dangerous code
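
A small sketch of the named-parameter idea and of the look-alike problem mentioned above. It uses the standard-library sqlite3 module so it runs on its own; sqlite3 spells named placeholders `:repoid` where the Django docs linked above use `%(repoid)s`, but the shape (query text plus a dict of values) is the same. The table and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tests (repoid INTEGER, branch TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO tests VALUES (?, ?, ?)",
    [(1, "main", "test_a"), (1, "dev", "test_b"), (2, "main", "test_c")],
)

# Named placeholders: each value is looked up by key, so the order of the
# dict entries does not matter.
query = """
    SELECT name FROM tests
    WHERE repoid = :repoid AND branch = :branch
    ORDER BY name
"""
params = {"branch": "main", "repoid": 1}
names = [row[0] for row in conn.execute(query, params)]

# The dangerous look-alike: Python's own % operator builds the SQL text
# itself, so a hostile value becomes part of the query.
hostile = "main' OR '1'='1"
unsafe_query = (
    "SELECT name FROM tests WHERE branch = '%(branch)s' ORDER BY name"
    % {"branch": hostile}
)
leaked = [row[0] for row in conn.execute(unsafe_query)]

# Bound as a parameter, the same hostile value is only a string to compare
# against, and it matches nothing.
safe = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM tests WHERE branch = :branch", {"branch": hostile}
    )
]
```

Here `leaked` ends up containing every row in the table, while `safe` is empty.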

do you have a sample of the SQL that was generated with the Django ORM query? i am not a django-optimizing expert but dynamically generating a query like this is fragile enough that it's worth a second look to see if we can't improve the performance in place

@joseph-sentry
Contributor Author

> https://docs.djangoproject.com/en/5.1/topics/db/sql/#passing-parameters-into-raw it looks like you can name your query parameters like %(repoid)s and then make your params list a dict instead where the key is the parameter name. that would probably make this query a lot easier to work with

this is a really good idea i will implement this

> do you have a sample of the SQL that was generated with the Django ORM query? i am not a django-optimizing expert but dynamically generating a query like this is fragile enough that it's worth a second look to see if we can't improve the performance in place

the issue i ran into with Django was that doing arbitrary joins and using derived tables seems to be impossible, and i had been messing around with the SQL for a while trying to avoid using self joins and derived tables but I could not find a performant way to query this information without doing so. If we can come up with SQL that doesn't use derived tables and is performant then I'd be willing to try using Django again but until then I think it's impossible.

params: dict[str, int | str | tuple[str, ...]]


def convert_tuple_or_none(value: set[str] | list[str] | None) -> tuple[str, ...] | None:
Contributor

convert_to_tuple_else_none maybe?

return tuple(value) if value else None


def encode_after_or_before(value: str | None) -> str | None:
Contributor

this function is actually just encode_to_b64, isn't it? Nothing in it is specific to after or before


term_filter = f"%{term}%" if term else None

if interval_num_days not in {1, 7, 30}:
Contributor

nit: I'm going back and forth on moving all this validation logic into a separate "validate_params" fn or similar to clean up this function a bit

select test_id, commits_where_fail as cwf
from base_cte
where array_length(commits_where_fail,1) > 0
) as foo, unnest(cwf) as unnested_cwf group by test_id
Contributor

foo


filtered_test_ids = set([test["test_id"] for test in totals])

test_ids = test_ids & filtered_test_ids if test_ids else filtered_test_ids
Contributor

this intersection is quicker than doing a filter on term prior to flags?

fail_count_sum=Sum("fail_count"),
pass_count_sum=Sum("pass_count"),
)
.filter(skip_count_sum__gt=0, fail_count_sum=0, pass_count_sum=0)
Contributor

what are your thoughts on dropping the fail_count_sum check here? that would also surface tests which were constantly failing and were then skipped

Contributor Author

I guess it makes sense to include tests that have never passed here, but there's no way of knowing whether the failures and the skips are interleaved. i'm cool with trying it though

Contributor

I think it's more relatable personally

.annotate(v=Distinct(Unnest(F("commits_where_fail"))))
.values("v")
if not first and not last:
first = 25
Contributor

I think 20 is our default typically

we change the query to only do ordering and return the entire set of tests,
then we do the cursor filtering in the application using a binary search.
the result of the query is ordered by the ordering parameter and then by
name, and the binary search takes that into account.

The reason we have to do the filtering in the application is so we can
get the correct value for the totalCount.

Otherwise we would need one query to compute the totalCount and
another query to do the actual cursor filtering.
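
A rough sketch of the approach this commit message describes, under simplifying assumptions: rows are sorted ascending by (failure_rate, name), the cursor is a plain (failure_rate, name) tuple, and `Row`, `first_index_after` and `page_after` are hypothetical names rather than the PR's actual helpers. Descending order and `last` are not covered here.

```python
from dataclasses import dataclass


@dataclass
class Row:  # hypothetical stand-in for the PR's TestResultsRow
    name: str
    failure_rate: float


def first_index_after(rows, cursor):
    """Binary search for the first row whose (failure_rate, name) is > cursor."""
    left, right = 0, len(rows) - 1
    while left <= right:
        mid = (left + right) // 2
        if (rows[mid].failure_rate, rows[mid].name) <= cursor:
            left = mid + 1  # row is at or before the cursor, look right
        else:
            right = mid - 1  # row is after the cursor, look left
    return left


def page_after(rows, cursor, first):
    """Return (page, total_count) from one fully ordered result set."""
    start = first_index_after(rows, cursor) if cursor is not None else 0
    # total_count comes from the same result set, so no second query is needed.
    return rows[start : start + first], len(rows)


rows = [
    Row("test_a", 0.1),
    Row("test_b", 0.1),
    Row("test_c", 0.5),
    Row("test_d", 0.9),
]
page, total = page_after(rows, (0.1, "test_b"), first=2)
```

Including the name in the comparison key is what lets the search land on a unique position even when several rows share the same ordering value.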
@codecov-qa

codecov-qa bot commented Oct 17, 2024

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2701 | 1 | 2700 | 6 |
View the top 1 failed tests by shortest run time
utils.tests.unit.test_cursor test_cursor
Stack Traces | 0.001s run time
def test_cursor():
>       row = TestResultsRow(
            test_id="test",
            name="test",
            computed_name="test",
            updated_at=datetime.fromisoformat("2024-01-01T00:00:00Z"),
            commits_where_fail=1,
            failure_rate=0.5,
            avg_duration=100,
            last_duration=100,
            flake_rate=0.1,
            total_fail_count=1,
            total_flaky_fail_count=1,
            total_skip_count=1,
            total_pass_count=1,
        )
E       TypeError: TestResultsRow.__init__() got an unexpected keyword argument 'computed_name'

.../tests/unit/test_cursor.py:7: TypeError


Test Failures Detected: Due to failing tests, we cannot provide coverage reports at this time.

❌ Failed Test Results:

Completed 2707 tests with 1 failed, 2700 passed and 6 skipped.

View the full list of failed tests

pytest

  • Class name: utils.tests.unit.test_cursor
    Test name: test_cursor

    def test_cursor():
    > row = TestResultsRow(
    test_id="test",
    name="test",
    computed_name="test",
    updated_at=datetime.fromisoformat("2024-01-01T00:00:00Z"),
    commits_where_fail=1,
    failure_rate=0.5,
    avg_duration=100,
    last_duration=100,
    flake_rate=0.1,
    total_fail_count=1,
    total_flaky_fail_count=1,
    total_skip_count=1,
    total_pass_count=1,
    )
    E TypeError: TestResultsRow.__init__() got an unexpected keyword argument 'computed_name'

    .../tests/unit/test_cursor.py:7: TypeError

@@ -234,7 +234,7 @@ def test_fetch_test_result_last_duration(self) -> None:
result["owner"]["repository"]["testAnalytics"]["testResults"]["edges"][0][
"node"
]["lastDuration"]
== 0.0
Contributor

do you know why this guy changed?

Contributor Author

we stopped automatically setting the lastDuration value to 0. we had been doing that since we stopped computing it, back when we thought it was single-handedly ruining the perf of the query

Contributor

ahhhh okay yeah i remember

term_filter = f"%{term}%" if term else None

if should_reverse:
ordering_direction = "DESC" if ordering_direction == "ASC" else "ASC"
Contributor

is this a typo?

Contributor Author

nope, we want to reverse the ordering direction when we get last passed to the query

Contributor

oh I see, yeah that makes sense
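
To illustrate why flipping the direction works for `last`, here is a small pure-Python sketch. `effective_direction` mirrors the snippet under review; `last_n`, and the step that flips the page back to the requested order, are assumptions made for illustration and not necessarily what the PR does.

```python
def effective_direction(ordering_direction: str, should_reverse: bool) -> str:
    # Same flip as in the reviewed snippet.
    if should_reverse:
        return "DESC" if ordering_direction == "ASC" else "ASC"
    return ordering_direction


def last_n(values, n, ordering_direction="ASC"):
    # Asking for the last n in one direction is the same as asking for the
    # first n in the opposite direction.
    direction = effective_direction(ordering_direction, should_reverse=True)
    flipped = sorted(values, reverse=(direction == "DESC"))
    page = flipped[:n]
    # Assumption: the page is flipped back so it reads in the requested order.
    return page[::-1]


values = [3, 1, 4, 1, 5, 9, 2, 6]
tail = last_n(values, 3)
```

The result matches taking the tail of the list sorted in the requested direction, without having to know the total number of rows up front.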

return (row_value_str > cursor_value_str) - (row_value_str < cursor_value_str)

left, right = 0, len(rows) - 1
while left <= right:
Contributor

holy smokes is this binary search in the wild?


left, right = 0, len(rows) - 1
while left <= right:
print(left, right)
Contributor

print

return TestResultsQuery(query=base_query, params=filtered_params)


def search_base_query(
Contributor

Why are we doing all this? A function description would go a long way for this guy

@@ -76,131 +298,152 @@ def generate_test_results(
:param branch: optional name of the branch we want to filter on, if this is provided the aggregates calculated will only take into account
test instances generated on that branch. By default branches will not be filtered and test instances on all branches wil be taken into
account.
:param history: timedelta for filtering test instances used to calculated the aggregates by time, the test instances used will be
those with a created at larger than now - history.
:param interval: timedelta for filtering test instances used to calculated the aggregates by time, the test instances used will be
Contributor

used to calculate*

:param history: timedelta for filtering test instances used to calculated the aggregates by time, the test instances used will be
those with a created at larger than now - history.
:param interval: timedelta for filtering test instances used to calculated the aggregates by time, the test instances used will be
those with a created at larger than now - interval.
:param testsuites: optional list of testsuite names to filter by
Contributor

test suites filtering is also union right?

Contributor Author

indeed i will add that as a comment


rows = [TestResultsRow(*row) for row in aggregation_of_test_results]

page_size: int = first or last or 20
Contributor

at this point first or last will always have a value thanks to your check on 399 right?

Contributor Author

the typing is broken unless i do this, but you are correct

@joseph-sentry joseph-sentry added this pull request to the merge queue Oct 18, 2024
Merged via the queue into main with commit 7857053 Oct 18, 2024
15 of 19 checks passed
@joseph-sentry joseph-sentry deleted the joseph/fix-sql-query branch October 18, 2024 20:54