Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

codeintel: Consider dependency graph for LSIF data retention #22930

Merged
merged 2 commits into from
Jul 16, 2021

Conversation

efritz
Copy link
Contributor

@efritz efritz commented Jul 16, 2021

This updates the query that periodically soft-deletes stale LSIF data to better support dependency indexing. Previously, we would mark all records older than 720h (configurable) that were not used to resolve code intelligence data from a non-stale tag or the tip of a non-stale branch. This ignores dependencies, meaning that go to definition could stop working for a repository we care about.

Now, the query deletes all irrelevant LSIF data by:

  • initializing a working set by adding all uploads younger than 720h as well as all uploads used to resolve code intelligence for a non-stale branch or tag
  • expanding the working set by adding dependencies of the working set to itself
  • deleting all uploads outside of the working set

This ensures that while an upload is protected, all of its transitive dependencies are as well. Fixes #19120.

This query may be susceptible in the future to performance issues as we grow the dependency graph. I've tried to make adjustments early on to help with scaling concerns, but was so far unsuccessful:

  • We thought of storing the set of branch and tags from which each upload is visible. This turns out to be a huge mapping, as some uploads in our production database today are visible from 8k distinct interesting commits. We don't want to materialize what is basically a cross-join of our dependency graph and our upload records in our database.
  • We thought of selecting a small set of old candidates, then tracing each one up the dependency graph. This should be more efficient when the number of uploads to delete is rather small (as it would amount to only a small number of traversals of the dependency graph). I'm not sure how to do this in SQL without issuing n+1 queries from the application (there's not an obvious way to share traversal work, and I'm not sure how to invoke a recursive CTE on each row of a candidate set).

We should check on this query periodically as we grow. It current runs in 1m10s on a clone of the production database and is nowhere near a hot path.

@efritz efritz added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) dependency-indexing labels Jul 16, 2021
@efritz efritz self-assigned this Jul 16, 2021
@efritz efritz force-pushed the ef/improved-eviction-strategy branch from 387b624 to 50ec75d Compare July 16, 2021 16:23
Copy link
Contributor

@Strum355 Strum355 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

@efritz efritz enabled auto-merge (squash) July 16, 2021 19:25
@efritz efritz merged commit b74c3b1 into main Jul 16, 2021
@efritz efritz deleted the ef/improved-eviction-strategy branch July 16, 2021 19:39
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
dependency-indexing team/graph Graph Team (previously Code Intel/Language Tools/Language Platform)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code intel: Improve eviction strategy
2 participants