Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPIRE has a lot of MySQL QPS #1351

Closed
mcpherrinm opened this issue Jan 22, 2020 · 8 comments
Closed

SPIRE has a lot of MySQL QPS #1351

mcpherrinm opened this issue Jan 22, 2020 · 8 comments

Comments

@mcpherrinm
Copy link
Contributor

mcpherrinm commented Jan 22, 2020

We've noticed in our MySQL metrics that SPIRE is doing a lot of QPS.

I'm attaching our data to this ticket for reference. The source of this cluster has approximately 3500 SPIRE Agents and 31000 workloads

Here are the top queries (everything over 1% exec):

            Execution Duration    
    QPS Load Count Exec % min avg max
  Total 47048 7.8 4062095 100.00% 0.004 ms 0.17 ms 2.6 s
1 select e.id as e_id...... 6829 3.5 589593 44.66% 0.24 ms 0.51 ms 2.6 s
2 prepare 14861 2.7 1283065 34.23% 0.04 ms 0.18 ms 68 ms
3 SELECT node_resolver_map_entries 6832 1 589836 13.35% 0.08 ms 0.15 ms 64 ms
4 close stmt 14867 0.3 1283650 3.90% 0.005 ms 0.020 ms 50 ms
5 SELECT bundles 614 0.09 53001 1.19% 0.08 ms 0.15 ms 48 ms
6 SELECT attested_node_entries 595 0.09 51336 1.17% 0.08 ms 0.15 ms 31 ms

This is the top query:
select e.id as e_id, e.entry_id as entry_id, e.spiffe_id, e.parent_id, e.ttl as reg_ttl, e.admin, e.downstream, e.expiry, s.id as selector_id, s.type as selector_type, s.value as selector_value, b.trust_domain, d.id as dns_name_id, d.value as dns_name from registered_entries e left join (select ? as joinitem union select ? union select ?) as joinitems on true left join selectors s on joinitem=? and e.id=s.registered_entry_id left join dns_names d on joinitem=? and e.id=d.registered_entry_id left join (federated_registration_entries f inner join bundles b on f.bundle_id=b.id) on joinitem=? and e.id=f.registered_entry_id where e.id in(?+) order by e_id, selector_id, dns_name_id

MySQL says about that:

  Bytes Sent Lock Time Query Time Rows Affected Rows Examined Rows Sent Temporary Tables Temporary Table Sizes
min 1030 B 0.05 ms 0.24 ms 0 0 0 3 0
avg 1893 B 0.11 ms 0.51 ms 0 30 4 3 62659
max 1212963 B 24 ms 2.6 s 0 41319 6260 6 34220880
sum 1115819362 B 63 s 300 s 0 17468138 2606360 1768788 36943502592

The stats on the query itself look OK, so I think it's mostly the number of times its being run.

@azdagron
Copy link
Member

azdagron commented Jan 22, 2020

Are all agents authorized for all workloads?

@azdagron
Copy link
Member

azdagron commented Jan 22, 2020

Unfortunately, [1] is used a bunch (list registration entries) when figuring out what workloads are authorized for an agent. However, it looks like we can shave a bunch of load by using prepared statements (i.e. [2] and [4]).

I'm working on PRs now that should reduce the counts for [3] and [5].

@mbyczkowski
Copy link
Contributor

Are all agents authorized for all workloads?

No, workloads are assigned to particular agents and if they are moved around then new registration entries are created (and old ones are removed).

Also a small nit/clarification to your 2nd comment @azdagron when you say #1, #2 etc., surely you mean the number of the query from the table that @mcpherrinm posted, right?

@azdagron
Copy link
Member

Yes, fixed them to not be links to issues :)

@mbyczkowski
Copy link
Contributor

I should have clarified that we have a number of daemon sets that run on all hosts, so a big portion of the 31k workloads comes from them (NUM_DAEMON_SETS * NUM_AGENTS). I'm not sure if that changes anything. Those workloads still have parent_id set to the spiffe_id of a particular agent.

@azdagron
Copy link
Member

I'm pretty confident we can halve the load shown above. I'll have PRs out shortly.

@mcpherrinm
Copy link
Contributor Author

#1355 indeed cut our MySQL QPS in half:

Screen Shot 2020-02-13 at 5 05 17 PM

The CPU usage drop is even more dramatic:

Screen Shot 2020-02-13 at 5 05 59 PM

@azdagron
Copy link
Member

QPS were greatly reducde by #1350 and more so #1355. We also considered taking #1357 but decided not to at this time due to back-compat concerns. I'm closing this issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants