Avoid inserting duplicate records in typerefs (fixes #61) #67

jhrcek · 2024-02-07T20:49:28Z

First naive attempt at solving #61.

Here's a quick comparison of this PR with master:

I built current master of haskell-language-server (to get bunch of .hie files to index):
cabal clean && cabal build all --ghc-options=-fwrite-ide-info
I indexed that directory with hiedb binary:

# build hiedb with optimizations:
cabal build --ghc-options=-O2
# find hiedb exe in cabal build dir and use it to index build artifacts in hls repo:
dist-newstyle/build/x86_64-linux/ghc-9.4.8/hiedb-0.5.0.1/x/hiedb/build/hiedb/hiedb -- -D tmp.db index ~/Devel/github.com/haskell/haskell-language-server/

master:
- indexing stats: 303 indexed, 0 skipped in 1m13s + 0.07s gc
- maximum residency: 68 908 kb
- resulting sqlite size: 218MB

$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
342879 <<< ~3x redundancy
118819

this PR:
- indexing stats: 303 indexed, 0 skipped in 56.61s + 0.07s gc <<< 1.3x faster
- maximum residency: 69 924 kb
- resulting sqlite size: 117MB <<< ~50% size reduction
- no duplicates in sqlite:

$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
118819 <<< no redundancy
118819

You can see that difference with hls codebase is not that significant.
But the difference is much more significant with our work codebase which has much more deriving of stuff:

master:
- indexing stats: 208 indexed, 0 skipped in 1m39s + 0.04s gc
- maximum residency: 68 900 kb
- sqlite size: 771 MB

$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
2398458  <<< ~13x redundancy
189136

this PR:
- indexing stats: 208 indexed, 0 skipped in 41.16s + 0.05s gc <<< 2.4x faster!
- maximum residency: 55 232 kb <<< wut? Even residency is smaller? Probably fluke
- sqlite size: 102 MB <<< size reduced by 87%

$ sqlite3 tmp.db "select count(*) from typerefs; select count(*) from (select distinct * from typerefs)"
189136 <<< no redundancy
189136

src/HieDb/Utils.hs

wz1000 · 2024-02-08T10:01:43Z

It would also be good to have a comparison with using UNIQUE constraints in the db and using INSERT OR IGNORE when inserting.

jhrcek · 2024-02-08T13:11:46Z

I tried with the approach with UNIQUE constraint and it didn't find noticeable difference in perf. between it and master branch.
I won't open PR with that change and just paste the diff so you can check it's the change you had in mind:

git diff --patch master..HEAD 
diff --git a/src/HieDb/Create.hs b/src/HieDb/Create.hs
index a29838d..3f28842 100644
--- a/src/HieDb/Create.hs
+++ b/src/HieDb/Create.hs
@@ -161,6 +161,7 @@ initConn (getConn -> conn) = do
                 \, ec      INTEGER NOT NULL \
                 \, FOREIGN KEY(id) REFERENCES typenames(id) DEFERRABLE INITIALLY DEFERRED \
                 \, FOREIGN KEY(hieFile) REFERENCES mods(hieFile) ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED \
+                \, CONSTRAINT uniqtyperef UNIQUE (id, hieFile, depth, sl, sc, el, ec) ON CONFLICT IGNORE \
                 \)"
   execute_ conn "CREATE INDEX IF NOT EXISTS typeref_id ON typerefs(id)"
   execute_ conn "CREATE INDEX IF NOT EXISTS typerefs_mod ON typerefs(hieFile)"
diff --git a/src/HieDb/Utils.hs b/src/HieDb/Utils.hs
index 0b5c073..596e4bf 100644
--- a/src/HieDb/Utils.hs
+++ b/src/HieDb/Utils.hs
@@ -56,7 +56,7 @@ addTypeRef (getConn -> conn) hf arr ixs sp = go 0
         Nothing -> pure ()
         Just occ -> do
           let ref = TypeRef occ hf d sl sc el ec
-          execute conn "INSERT INTO typerefs VALUES (?,?,?,?,?,?,?)" ref
+          execute conn "INSERT OR IGNORE INTO typerefs VALUES (?,?,?,?,?,?,?)" ref
       let next = go (d+1)
       case arr A.! i of
         HTyVarTy _ -> pure ()

wz1000 · 2024-02-08T13:17:31Z

I think we should add the UNIQUE constraint anyway so we can detect violations of this property.

jhrcek · 2024-02-08T13:19:49Z

I think we should add the UNIQUE constraint anyway so we can detect violations of this property.

Would this require schema version bump?

jhrcek · 2024-02-08T13:24:07Z

Tried this and the issue is that with the UNIQUE constraint the indexing time doubles. It's not as bad as on master, but still much worse than without it. Not sure if it's worth it..

wz1000 · 2024-02-08T13:25:27Z

Would this require schema version bump?

not strictly, though I guess it would be good if all the existing databases used by HLS are rebuilt so they don't violate the property.

wz1000 · 2024-02-08T13:27:20Z

Tried this and the issue is that with the UNIQUE constraint the indexing time doubles. It's not as bad as on master, but still much worse than without it. Not sure if it's worth it..

ok, I guess checking in haskell should be sufficient.

src/HieDb/Utils.hs

wz1000 · 2024-02-08T14:55:10Z

src/HieDb/Utils.hs

          indexed <- get
-          when (Set.notMember (occ, d) indexed) $ do
+          let isTypeIndexed = ISet.member (fromIntegral occ) (IMap.findWithDefault ISet.empty depth indexed)


I think you can use i to ensure this still works on 32 bit machines.

Sorry, I don't follow.
I think we have to use the occ (which corresponds to type's id looked up from typenames table).
This is unique across multiple indexed hie files (autoincremented id in sqlite).

Whereas i corresponds to index of a type within single hie file.

When I tried it, duplicated rows started to be created again, not sure why.

But we COULD maybe switch Int64 -> Int when looking up type indices here and then this would work?

HieDb/src/HieDb/Create.hs

Lines 172 to 173 in 404b4fa

addArr :: HieDb -> A.Array TypeIndex HieTypeFlat -> IO (A.Array TypeIndex (Maybe Int64))

addArr (getConn -> conn) arr = do

Is there a need for Int64 to represent type IDs? If yes, that I guess that would also break on 32 bit machines..

You are correct about i not working. But I think sqlite indices are Int64, not much we can do about it.

jhrcek · 2024-02-08T15:06:54Z

Turns out when switching to int map/set I forgot to flip the boolean condition:
when (Set.notMember ...) -> unless (ISet.member ..., that's why it was apparently faster.
Fixed in the last commit and the indexing speed is back to what it was before.

wz1000 · 2024-02-08T17:40:00Z

Bump the schema version and this should be good to go I think

jhrcek · 2024-02-08T17:57:34Z

Bumped. Is that just to force users to regenerate new hiedb?

jhrcek · 2024-02-10T09:59:07Z

Hey @wz1000
this should be ready to merge.
It would be also great to get this on hackage/use that in hls.
I can prepare a version bump PR if you want (and then push the update to hackage as I believe I'm still among the mainainers of this package).

wz1000 · 2024-02-10T15:48:36Z

@jhrcek feel free to prepare a release after updating the changelogs. Thanks!

Avoid inserting duplicate records in typerefs

d97134f

jhrcek marked this pull request as draft February 7, 2024 20:49

jhrcek commented Feb 8, 2024

View reviewed changes

src/HieDb/Utils.hs Outdated Show resolved Hide resolved

jhrcek commented Feb 8, 2024

View reviewed changes

src/HieDb/Utils.hs Show resolved Hide resolved

evalStateT at the level of Span

225c08d

jhrcek force-pushed the jhrcek/avoid-duplicate-typeref-indexing branch from c4562a1 to 225c08d Compare February 8, 2024 13:17

wz1000 reviewed Feb 8, 2024

View reviewed changes

src/HieDb/Utils.hs Outdated Show resolved Hide resolved

Switch to IntMap IntSet state

6cafa26

jhrcek marked this pull request as ready for review February 8, 2024 14:52

wz1000 reviewed Feb 8, 2024

View reviewed changes

Use unless 🤦

4649a1f

Bump schema version

c81fe34

jhrcek requested a review from wz1000 February 9, 2024 17:10

wz1000 merged commit 109d8e9 into wz1000:master Feb 10, 2024
5 checks passed

jhrcek deleted the jhrcek/avoid-duplicate-typeref-indexing branch February 22, 2024 05:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid inserting duplicate records in typerefs (fixes #61) #67

Avoid inserting duplicate records in typerefs (fixes #61) #67

jhrcek commented Feb 7, 2024 •

edited

Loading

wz1000 commented Feb 8, 2024

jhrcek commented Feb 8, 2024

wz1000 commented Feb 8, 2024

jhrcek commented Feb 8, 2024

jhrcek commented Feb 8, 2024

wz1000 commented Feb 8, 2024

wz1000 commented Feb 8, 2024

wz1000 Feb 8, 2024

jhrcek Feb 8, 2024 •

edited

Loading

jhrcek Feb 8, 2024 •

edited

Loading

wz1000 Feb 8, 2024

jhrcek commented Feb 8, 2024 •

edited

Loading

wz1000 commented Feb 8, 2024

jhrcek commented Feb 8, 2024

jhrcek commented Feb 10, 2024

wz1000 commented Feb 10, 2024

	addArr :: HieDb -> A.Array TypeIndex HieTypeFlat -> IO (A.Array TypeIndex (Maybe Int64))
	addArr (getConn -> conn) arr = do

Avoid inserting duplicate records in typerefs (fixes #61) #67

Avoid inserting duplicate records in typerefs (fixes #61) #67

Conversation

jhrcek commented Feb 7, 2024 • edited Loading

wz1000 commented Feb 8, 2024

jhrcek commented Feb 8, 2024

wz1000 commented Feb 8, 2024

jhrcek commented Feb 8, 2024

jhrcek commented Feb 8, 2024

wz1000 commented Feb 8, 2024

wz1000 commented Feb 8, 2024

wz1000 Feb 8, 2024

Choose a reason for hiding this comment

jhrcek Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

jhrcek Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

wz1000 Feb 8, 2024

Choose a reason for hiding this comment

jhrcek commented Feb 8, 2024 • edited Loading

wz1000 commented Feb 8, 2024

jhrcek commented Feb 8, 2024

jhrcek commented Feb 10, 2024

wz1000 commented Feb 10, 2024

jhrcek commented Feb 7, 2024 •

edited

Loading

jhrcek Feb 8, 2024 •

edited

Loading

jhrcek Feb 8, 2024 •

edited

Loading

jhrcek commented Feb 8, 2024 •

edited

Loading