Speed up filtered queries (but how?) #716

Open · SIMULATAN opened this issue Dec 7, 2024 · 4 comments
Comments

@SIMULATAN (Contributor) commented Dec 7, 2024

Today, I tried looking up my coding time for a specific label. Unfortunately, the requests time out with an HTTP 502 before completing.

Sadly, my host systems are pretty slow (old server CPU & RAM) and thus take ages to compute summaries. Looking at routes/summaries, it appears as if all heartbeats are fetched no matter what filters are applied, which leads me to believe that filtering is done either in the template or somewhere in the frontend - at least not on the database level. My user has about 350,000 heartbeats; filtering them already took long enough before, but today it reached the breaking point.

Since the database can use indexes on queryable fields and filter much faster than application-level code, I'd appreciate it if we could brainstorm ideas on how to improve the current situation.

@SIMULATAN (Contributor, Author)

Please let me know in case I misunderstood something; my experience with the codebase is rather limited.

@muety (Owner) commented Dec 8, 2024

Thanks for bringing this up. You're right: as soon as filters are involved, summaries need to be recomputed instead of being fetched from the "cache" (i.e. the summaries table). The relevant part in the code is this method, which, in turn, is called from the SummaryService.

At first sight, your suggestion to simply fetch only those heartbeats that are relevant to the filter (e.g. ... WHERE label = 'foo') seems sensible. Unfortunately, it's not that easy.

If you take a look at the aggregation logic, you'll find that, in order to get the total coding time for label foo right, you might still need other heartbeats that fall "in between". Suppose this situation, with heartbeatsTimeout = 120s:

| Timestamp | Heartbeat ID | Label |
|-----------|--------------|-------|
| t=0s      | 1            | foo   |
| t=60s     | 2            | bar   |
| t=150s    | 3            | foo   |

When only fetching heartbeats for label foo, you'll end up with a total coding time of 120s + 120s = 240s. When considering the fact that, in between working on label foo, you were additionally working on a different label (e.g. a different project), and thus Wakapi has more fine-grained heartbeat information, total coding time will be 60s + 90s + 120s = 270s, of which only 60s + 120s = 180s belong to foo. Or, similarly, if there were only heartbeats 1 and 2, then you'd get a total of 120s when only considering foo-heartbeats, but 60s + 120s = 180s when taking heartbeat 2 into account as well.
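To make the arithmetic concrete, here is a small SQL sketch (Postgres syntax) that reproduces both numbers on the toy data above. The duration rule it uses (each heartbeat counts the gap to its successor, capped at the timeout; a trailing heartbeat counts the full timeout) is a simplification assumed for illustration, not necessarily Wakapi's exact aggregation logic:

```sql
-- Correct: compute durations over ALL heartbeats first, filter afterwards.
WITH hb(ts, label) AS (
    VALUES (0, 'foo'), (60, 'bar'), (150, 'foo')
)
SELECT label, SUM(secs) AS total_seconds
FROM (
    SELECT label,
           LEAST(COALESCE(LEAD(ts) OVER (ORDER BY ts) - ts, 120), 120) AS secs
    FROM hb
) d
GROUP BY label;
-- => bar: 90, foo: 180

-- Wrong: filter first, then compute durations on the reduced set.
WITH hb(ts) AS (
    VALUES (0), (150)  -- only the foo heartbeats were fetched
)
SELECT SUM(LEAST(COALESCE(gap, 120), 120)) AS total_seconds
FROM (
    SELECT LEAD(ts) OVER (ORDER BY ts) - ts AS gap FROM hb
) d;
-- => 240, overestimating foo's time by 60s
```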

So instead of "give me all heartbeats for 'foo'", your query would have to be something like "give me all heartbeats for 'foo', plus everything in between each two of those that are farther than heartbeatsTimeout apart" or something - which, of course, would also be hard to implement as an indexed query.
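For illustration, here is roughly what such a fetch could look like as a hypothetical SQL sketch; the table and column names (heartbeats, user_id, "time", label) and the hard-coded 120s timeout are assumptions, not Wakapi's actual schema:

```sql
-- Fetch the foo heartbeats plus every heartbeat that falls within the
-- timeout window after some foo heartbeat. That's a superset of the
-- strictly necessary "in between" rows, but it is enough to compute foo's
-- durations correctly under the simplified rule above.
SELECT h.*
FROM heartbeats h
WHERE h.user_id = 'some-user'
  AND (
    h.label = 'foo'
    OR EXISTS (
        SELECT 1
        FROM heartbeats f
        WHERE f.user_id = h.user_id
          AND f.label = 'foo'
          AND f."time" < h."time"
          AND h."time" - f."time" < interval '120 seconds'
    )
  );
-- The correlated range condition is essentially a self-join on time ranges,
-- which a plain index on label can't serve efficiently.
```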

So while I definitely see the problem here, I can't think of a straightforward solution right now.

@muety changed the title from Database-level queries to Speed up filtered queries (but how?) on Dec 8, 2024
@SIMULATAN (Contributor, Author)

Ah, yeah, that explains why this wasn't implemented on the DB level in the first place. A classic case of the user thinking an improvement is easy while failing to understand the complexity behind it :^)

In theory, I guess we could move the current calculation into an SQL query; Postgres in particular should support such complex computations. I do imagine that a vendor-agnostic implementation would be rather challenging, though. Performance would probably improve, yes, but I don't think the difference is big enough to warrant such a complicated change.

My next course of action, should I get around to it, would be to try profiling Wakapi on my local machine, connected to the remote database, to find potential bottlenecks. It'll probably just be my slow I/O, but there may be some potential left.

@muety (Owner) commented Dec 10, 2024

I attempted to implement (a previous version of) the calculation logic entirely in SQL before (see here). However, as you correctly pointed out, it would require individual queries per supported database system, as the query can't be done in standard SQL.

We could implement MySQL- and Postgres-specific queries and fall back to "application-side" aggregation for all other databases. But I'm a bit reluctant, because it would increase complexity by a lot, and every change to the logic would then have to be made three times. When I find a spare moment, I might give it a try, let's see...
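For the record, a Postgres-specific variant might take roughly the following shape: compute each heartbeat's duration from the gap to its successor across all labels, and only filter by label afterwards. The table and column names, the date range, and the simplified duration rule are, again, assumptions for illustration rather than a working implementation:

```sql
-- Hypothetical Postgres sketch: aggregate in the database, filter last.
SELECT SUM(secs) AS total_seconds
FROM (
    SELECT label,
           LEAST(
               COALESCE(EXTRACT(EPOCH FROM (LEAD("time") OVER w - "time")), 120),
               120
           ) AS secs
    FROM heartbeats
    WHERE user_id = 'some-user'
      AND "time" BETWEEN '2024-12-01' AND '2024-12-08'
    WINDOW w AS (ORDER BY "time")
) d
WHERE label = 'foo';
-- The inner query still scans every heartbeat of the user in the time range
-- (which is exactly what makes the result correct), so an index on
-- (user_id, time) helps, but an index on label alone does not.
```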
