[usage] List workspaces for each workspace instance in usage period #10495
type workspaceWithInstances struct {
	Workspace db.Workspace
	Instances []db.WorkspaceInstance
}
Thought: I wonder if we actually want to load into memory the "fully-hydrated" shapes of both Workspace and WorkspaceInstance.
In the past, we've had issues with memory consumption, because we were keeping full shapes in memory, while only a few fields were actually relevant. This is especially true for workspace instances, because there are so many of them.
A usage record will probably look something like this:
export type WorkspaceUsageRecord = {
    instanceId: string;
    workspaceId: string;
    userId: string;
    projectId?: string;
    teamId?: string;
    fromTime: string;
    toTime: string;
};
So, in my mind, the most efficient flow to get these usage records is:
- Query only instanceId, workspaceId, creationTime, stoppedTime from d_b_workspace_instance (for the current period) -- all the other fields are unnecessary
- Trim the time bounds between beginningOfMonth and min(getCurrentTime(), endOfMonth) (also set the toTime to the maximum bound when there is no stoppedTime yet)
- For each unique workspaceId, query only the owner userId and possible projectId from d_b_workspace
- For each userId, query only the possible teamId (randomly pick one team for now)
If we query and then load the full Workspace and WorkspaceInstance shapes into memory, and start passing them around a lot, I worry that we'll have a bad time (because many workspace instances are created every month, and that number is rapidly increasing over time).
Sure, this makes sense. But I don't think it makes sense to do this now.
For me, this is an optimisation once we have something working. For now, I'd prefer to waste memory to get to a working version and then measure what is in fact the cost of these queries.
Fair point, thanks. (I don't really see why it's better to ship a less optimized version first when we know how to easily write a more optimized one, but I also agree about shipping this sooner rather than later, and maybe my suggestion could already be considered premature optimization). So, fine to defer this to later. 👍
Largely because I think the current handling will go away anyway once we do add attribution to the WSI. If we optimized here, it would then be wasted work because it would also be replaced by the attribution work from WSI.
"I don't really see why it's better to ship a less optimized version first when we know how to easily write a more optimized one"
We don't have any hard data to know it is in fact an optimisation. A suboptimal version with data is more valuable than an optimal one without baseline metrics.
Many thanks @easyCZ!
With the unit tests, this seems to be working as intended. 🎉 I couldn't resist adding a few more thoughts on (premature?) optimization, mostly because I generally find it easier to design an efficient algorithm from scratch (minimal data footprint, limited iterations) than to optimize existing complex code, but you're right -- the current code might already go away when we attribute instances on start. Someone clever once said, "one of the most expensive mistakes engineers make is to optimize something that shouldn't exist in the first place".
Anyway, so many words to say, thanks for the progress here! Looking forward to the next iterations. 🚀
	workspaceIDs = append(workspaceIDs, instance.WorkspaceID)
}

workspaces, err := db.ListWorkspacesByID(ctx, u.conn, toSet(workspaceIDs))
Thanks for making the IDs unique! I would have preferred to make the collection a unique "set" before all of the appends, in order to iterate just once over all instances (not once, then again in toSet to create the map, then another time over the unique IDs to convert them back to a list) -- but okay for quick iterations without worrying too much about complexity this early. 🛹
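For illustration, a single-pass version of that idea could look roughly like the sketch below. The instance type and the uniqueWorkspaceIDs helper are hypothetical stand-ins (the PR's db.WorkspaceInstance and toSet are not reproduced here); the point is only that the "seen" check and the append can happen in the same loop.

```go
package main

import "fmt"

// instance is a hypothetical stand-in for db.WorkspaceInstance,
// reduced to the fields this example needs.
type instance struct {
	ID          string
	WorkspaceID string
}

// uniqueWorkspaceIDs collects each WorkspaceID once, in first-seen order,
// while iterating over the instances a single time.
func uniqueWorkspaceIDs(instances []instance) []string {
	seen := make(map[string]struct{}, len(instances))
	ids := make([]string, 0, len(instances))
	for _, inst := range instances {
		if _, ok := seen[inst.WorkspaceID]; ok {
			continue // already collected
		}
		seen[inst.WorkspaceID] = struct{}{}
		ids = append(ids, inst.WorkspaceID)
	}
	return ids
}

func main() {
	instances := []instance{
		{ID: "i-1", WorkspaceID: "ws-a"},
		{ID: "i-2", WorkspaceID: "ws-b"},
		{ID: "i-3", WorkspaceID: "ws-a"},
	}
	fmt.Println(uniqueWorkspaceIDs(instances))
}
```

This trades one extra map lookup per instance for skipping the two follow-up passes; whether that matters in practice would need measuring.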
if err != nil {
	return nil, fmt.Errorf("failed to load workspaces for workspace instances in time range: %w", err)
}
status.Workspaces = len(workspaces)
So, at this point we have:
- a collection of all "fully hydrated" instances
- a collection of all "fully hydrated" workspaces, each with an attached sub-collection of all its corresponding "fully hydrated" instances (although these will just be references to a single instance in memory, right, not a full copy of the object?)
I'm still slightly uneasy with loading so many large objects into memory, and iterating over them multiple times. So, I'm looking forward to seeing exactly how we'll use these collections, and see if we can later reduce the amount of used memory and the number of iterations. 👀
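On the references-versus-copies question: in Go, a field declared as a slice of struct values (like Instances []db.WorkspaceInstance) stores its own copy of each struct that is appended to it, whereas a slice of pointers shares the original. How this PR actually populates the slice is not shown here, so the sketch below uses hypothetical types (instance, withValues, withPointers) purely to demonstrate the language behaviour.

```go
package main

import "fmt"

// instance is a hypothetical stand-in for db.WorkspaceInstance.
type instance struct {
	ID    string
	Phase string
}

// withValues holds struct values: each append stores a copy.
type withValues struct {
	Instances []instance
}

// withPointers holds pointers: each element refers to the original struct.
type withPointers struct {
	Instances []*instance
}

func attachCopies(all []instance) withValues {
	var w withValues
	for _, inst := range all {
		w.Instances = append(w.Instances, inst) // copies the struct value
	}
	return w
}

func attachRefs(all []instance) withPointers {
	var w withPointers
	for i := range all {
		w.Instances = append(w.Instances, &all[i]) // shares the element in all
	}
	return w
}

func main() {
	all := []instance{{ID: "i-1", Phase: "running"}}
	copies := attachCopies(all)
	refs := attachRefs(all)

	// Mutate the original after attaching.
	all[0].Phase = "stopped"

	fmt.Println(copies.Instances[0].Phase) // the copy still has the old value
	fmt.Println(refs.Instances[0].Phase)   // the pointer sees the change
}
```

The copy is shallow: string contents, slices, and pointer fields inside the struct still share their underlying data, so the duplicated cost is the struct's own fields rather than everything they reference.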
Description
We first compute all workspace instances in a given period. Then, for each WorkspaceID, we fetch the underlying Workspace from the DB so that we have the information needed to determine the owner (cost center).
Related Issue(s)
Fixes #
How to test
Unit tests
Release Notes
Documentation
NONE