Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid inserting duplicate records in typerefs (fixes #61) #67

Merged
merged 5 commits into from
Feb 10, 2024

Conversation

jhrcek
Copy link
Collaborator

@jhrcek jhrcek commented Feb 7, 2024

First naive attempt at solving #61.

Here's a quick comparison of this PR with master:

  1. I built current master of haskell-language-server (to get bunch of .hie files to index):
    cabal clean && cabal build all --ghc-options=-fwrite-ide-info

  2. I indexed that directory with hiedb binary:

# build hiedb with optimizations:
cabal build --ghc-options=-O2
# find hiedb exe in cabal build dir and use it to index build artifacts in hls repo:
dist-newstyle/build/x86_64-linux/ghc-9.4.8/hiedb-0.5.0.1/x/hiedb/build/hiedb/hiedb -- -D tmp.db index ~/Devel/github.com/haskell/haskell-language-server/
  • master:
    • indexing stats: 303 indexed, 0 skipped in 1m13s + 0.07s gc
    • maximum residency: 68 908 kb
    • resulting sqlite size: 218MB
$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
342879 <<< ~3x redundancy
118819
  • this PR:
    • indexing stats: 303 indexed, 0 skipped in 56.61s + 0.07s gc <<< 1.3x faster
    • maximum residency: 69 924 kb
    • resulting sqlite size: 117MB <<< ~50% size reduction
    • no duplicates in sqlite:
$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
118819 <<< no redundancy
118819

You can see that difference with hls codebase is not that significant.
But the difference is much more significant with our work codebase which has much more deriving of stuff:

  • master:
    • indexing stats: 208 indexed, 0 skipped in 1m39s + 0.04s gc
    • maximum residency: 68 900 kb
    • sqlite size: 771 MB
$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
2398458  <<< ~13x redundancy
189136
  • this PR:
    • indexing stats: 208 indexed, 0 skipped in 41.16s + 0.05s gc <<< 2.4x faster!
    • maximum residency: 55 232 kb <<< wut? Even residency is smaller? Probably fluke
    • sqlite size: 102 MB <<< size reduced by 87%
$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
189136 <<< no redundancy
189136

@jhrcek jhrcek marked this pull request as draft February 7, 2024 20:49
src/HieDb/Utils.hs Outdated Show resolved Hide resolved
@wz1000
Copy link
Owner

wz1000 commented Feb 8, 2024

It would also be good to have a comparison with using UNIQUE constraints in the db and using INSERT OR IGNORE when inserting.

@jhrcek
Copy link
Collaborator Author

jhrcek commented Feb 8, 2024

I tried with the approach with UNIQUE constraint and it didn't find noticeable difference in perf. between it and master branch.
I won't open PR with that change and just paste the diff so you can check it's the change you had in mind:

git diff --patch master..HEAD 
diff --git a/src/HieDb/Create.hs b/src/HieDb/Create.hs
index a29838d..3f28842 100644
--- a/src/HieDb/Create.hs
+++ b/src/HieDb/Create.hs
@@ -161,6 +161,7 @@ initConn (getConn -> conn) = do
                 \, ec      INTEGER NOT NULL \
                 \, FOREIGN KEY(id) REFERENCES typenames(id) DEFERRABLE INITIALLY DEFERRED \
                 \, FOREIGN KEY(hieFile) REFERENCES mods(hieFile) ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED \
+                \, CONSTRAINT uniqtyperef UNIQUE (id, hieFile, depth, sl, sc, el, ec) ON CONFLICT IGNORE \
                 \)"
   execute_ conn "CREATE INDEX IF NOT EXISTS typeref_id ON typerefs(id)"
   execute_ conn "CREATE INDEX IF NOT EXISTS typerefs_mod ON typerefs(hieFile)"
diff --git a/src/HieDb/Utils.hs b/src/HieDb/Utils.hs
index 0b5c073..596e4bf 100644
--- a/src/HieDb/Utils.hs
+++ b/src/HieDb/Utils.hs
@@ -56,7 +56,7 @@ addTypeRef (getConn -> conn) hf arr ixs sp = go 0
         Nothing -> pure ()
         Just occ -> do
           let ref = TypeRef occ hf d sl sc el ec
-          execute conn "INSERT INTO typerefs VALUES (?,?,?,?,?,?,?)" ref
+          execute conn "INSERT OR IGNORE INTO typerefs VALUES (?,?,?,?,?,?,?)" ref
       let next = go (d+1)
       case arr A.! i of
         HTyVarTy _ -> pure ()

@wz1000
Copy link
Owner

wz1000 commented Feb 8, 2024

I think we should add the UNIQUE constraint anyway so we can detect violations of this property.

@jhrcek jhrcek force-pushed the jhrcek/avoid-duplicate-typeref-indexing branch from c4562a1 to 225c08d Compare February 8, 2024 13:17
@jhrcek
Copy link
Collaborator Author

jhrcek commented Feb 8, 2024

I think we should add the UNIQUE constraint anyway so we can detect violations of this property.

Would this require schema version bump?

@jhrcek
Copy link
Collaborator Author

jhrcek commented Feb 8, 2024

Tried this and the issue is that with the UNIQUE constraint the indexing time doubles. It's not as bad as on master, but still much worse than without it. Not sure if it's worth it..

@wz1000
Copy link
Owner

wz1000 commented Feb 8, 2024

Would this require schema version bump?

not strictly, though I guess it would be good if all the existing databases used by HLS are rebuilt so they don't violate the property.

@wz1000
Copy link
Owner

wz1000 commented Feb 8, 2024

Tried this and the issue is that with the UNIQUE constraint the indexing time doubles. It's not as bad as on master, but still much worse than without it. Not sure if it's worth it..

ok, I guess checking in haskell should be sufficient.

src/HieDb/Utils.hs Outdated Show resolved Hide resolved
@jhrcek jhrcek marked this pull request as ready for review February 8, 2024 14:52
indexed <- get
when (Set.notMember (occ, d) indexed) $ do
let isTypeIndexed = ISet.member (fromIntegral occ) (IMap.findWithDefault ISet.empty depth indexed)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use i to ensure this still works on 32 bit machines.

Copy link
Collaborator Author

@jhrcek jhrcek Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't follow.
I think we have to use the occ (which corresponds to type's id looked up from typenames table).
This is unique across multiple indexed hie files (autoincremented id in sqlite).

Whereas i corresponds to index of a type within single hie file.

When I tried it, duplicated rows started to be created again, not sure why.

Copy link
Collaborator Author

@jhrcek jhrcek Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we COULD maybe switch Int64 -> Int when looking up type indices here and then this would work?

HieDb/src/HieDb/Create.hs

Lines 172 to 173 in 404b4fa

addArr :: HieDb -> A.Array TypeIndex HieTypeFlat -> IO (A.Array TypeIndex (Maybe Int64))
addArr (getConn -> conn) arr = do

Is there a need for Int64 to represent type IDs? If yes, that I guess that would also break on 32 bit machines..

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct about i not working. But I think sqlite indices are Int64, not much we can do about it.

@jhrcek
Copy link
Collaborator Author

jhrcek commented Feb 8, 2024

Turns out when switching to int map/set I forgot to flip the boolean condition:
when (Set.notMember ...) -> unless (ISet.member ..., that's why it was apparently faster.
Fixed in the last commit and the indexing speed is back to what it was before.

@wz1000
Copy link
Owner

wz1000 commented Feb 8, 2024

Bump the schema version and this should be good to go I think

@jhrcek
Copy link
Collaborator Author

jhrcek commented Feb 8, 2024

Bumped. Is that just to force users to regenerate new hiedb?

@jhrcek jhrcek requested a review from wz1000 February 9, 2024 17:10
@jhrcek
Copy link
Collaborator Author

jhrcek commented Feb 10, 2024

Hey @wz1000
this should be ready to merge.
It would be also great to get this on hackage/use that in hls.
I can prepare a version bump PR if you want (and then push the update to hackage as I believe I'm still among the mainainers of this package).

@wz1000 wz1000 merged commit 109d8e9 into wz1000:master Feb 10, 2024
5 checks passed
@wz1000
Copy link
Owner

wz1000 commented Feb 10, 2024

@jhrcek feel free to prepare a release after updating the changelogs. Thanks!

@jhrcek jhrcek deleted the jhrcek/avoid-duplicate-typeref-indexing branch February 22, 2024 05:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants