Use hash code instead of full source text for language service caches #5944

cartermp · 2018-11-20T21:16:01Z

Not sure if this is the correct overall approach - but hopefully this can help encourage a better fix 🙂

The following issues track large source files ultimately living on the large object heap, forcing horrible GC delays over time:

#5937
#5936
#5935
#4881

But as mentioned in #4881 (comment), the language service uses the full source text as a key for caches, which makes it harder to simply stop passing the full source text around. This seeks to rectify that. It may also help the linked issues.

dsyme · 2018-11-21T00:12:08Z

src/fsharp/service/service.fs

-    let AreSameForParsing((fileName1: string, source1: string, options1), (fileName2, source2, options2)) =
-        fileName1 = fileName2 && options1 = options2 && source1 = source2
+    let AreSameForParsing((fileName1: string, source1Hash: int, options1), (fileName2, source2Hash, options2)) =
+        fileName1 = fileName2 && options1 = options2 && source1Hash = source2Hash


It depends how accurate the string hash codes are and how much we care about very odd behaviour on collisions.

Checking only hash codes means that when there is a collision we will re-use results from some other random parse or typecheck.

So if collisions happen very rarely this may just be ok.

If hash codes are ever based on the partial hash of the string (e.g. a prefix) then this shouldn't be done.

We should definitely have a think about this. At least initially, I feel that if the file names are the same and the hashes are the same, it's highly unlikely we'd get some arbitrary cached value. But let's definitely make sure we feel confident on that :)
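To make the collision concern concrete, here is a minimal sketch. It assumes a deliberately weak, hypothetical hash (`weakHash`, which just sums character codes) standing in for `String.GetHashCode`, and uses a plain string in place of the real parsing options. None of these names come from the compiler; the point is only to show what a hash-only key does when two different texts collide.

```fsharp
// Hypothetical stand-in for a hash with frequent collisions:
// it sums character codes, so any two anagrams collide.
let weakHash (source: string) : int =
    source |> Seq.sumBy int

// Same shape as the hash-based comparison in the diff above;
// the options are a plain string here purely for illustration.
let areSameForParsing ((fileName1: string, hash1: int, options1: string),
                       (fileName2: string, hash2: int, options2: string)) =
    fileName1 = fileName2 && options1 = options2 && hash1 = hash2

let oldSource = "let x = 12"
let newSource = "let x = 21"   // edited file with a different meaning

let oldKey = ("a.fs", weakHash oldSource, "--debug")
let newKey = ("a.fs", weakHash newSource, "--debug")

// True even though the text changed: a cache keyed this way would
// return the results computed for oldSource.
let staleHit = areSameForParsing (oldKey, newKey)
printfn "sources equal: %b, keys equal: %b" (oldSource = newSource) staleHit
```

With a real 32-bit hash the collision is far less likely than with this toy function, but the failure mode is the same: same file name, same options, same hash, different text.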

0x53A · 2018-11-21T00:25:22Z

I don't really trust string.GetHashCode, I doubt it was designed with strong collision resistance in mind.

If a true cryptographic hash like sha is too expensive, can you maybe at least use a specialised one like murmurHash?

(Sorry, can't sleep)

cartermp · 2018-11-21T00:38:12Z

I think what's worth thinking about is if this is a scenario where strong collision resistance is important. For example, in each cache here, the key varies:

AreSimilarForParsing uses the file name as the key
AreSameForParsing uses the file name, FSharpParsingOptions (also contains source files, so this needs to be thought about), and (now) the hash of the source file as the key

And so on. All current keys are an aggregate of some information. Given that, does it make sense to use a cryptographically strong hash function? Is the Birthday Problem applicable for IDE tooling? Would doing so be meaningful perf-wise? Not entirely sure how to answer each question, but we definitely do want to move away from storing the whole source text if we don't actually need it.

0x53A · 2018-11-21T01:01:42Z

I think I often sound too negative. Hashes are awesome and this pr is great.

String.GetHashCode is 32 bits, so roughly 4 billion possible values.

Really large source files can be many kB to maybe even MB if they are auto-generated?

I was never that good in statistics, and am just talking out of my behind because I'm bored. But that's a few magnitudes too few for me to be comfortable ;D

But I would reverse that question: is there any reason NOT to use a strong hash?

Typechecking is probably so expensive that running a sha256 over the source wouldn't be noticeable.

And if it is, there is still murmur3, which is a fast, specialised, 128 bit hash.
Heck, you could also use md5 (also 128 bit)
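As a rough illustration of the "strong hash" option, the sketch below hashes the source text with SHA-256 from the BCL and uses the hex digest as the cache key component. The helper name `sourceKey` and the choice of UTF-8 encoding are assumptions made for the example, not something taken from the PR. murmur3 is not shown because it is not part of the BCL.

```fsharp
open System
open System.Security.Cryptography
open System.Text

// Hypothetical helper: hash the full source text with SHA-256 and
// return the digest as a lowercase hex string usable as a compact key.
let sourceKey (source: string) : string =
    use sha = SHA256.Create()
    let bytes = Encoding.UTF8.GetBytes(source)
    let digest = sha.ComputeHash(bytes)
    BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant()

let keyA = sourceKey "let x = 12"
let keyB = sourceKey "let x = 21"
printfn "%s" keyA
printfn "%s" keyB
printfn "equal: %b" (keyA = keyB)
```

The key is 64 hex characters regardless of how large the source file is, so the cache no longer needs to hold on to the full text.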

cartermp · 2018-11-21T06:23:56Z

So I'm pretty sure my stats needs some work (it's been a while!), but here's a quick calculation in the awesome Fable repl. So for 10k files, we could see a ~1.1% chance of a collision with string's GetHashCode, assuming we only use that as a key for a cache. Given that we aggregate additional information I think it's fine, but I wouldn't take that as gospel 🙂
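The original REPL link is not preserved here, so the following is a reconstruction under assumptions rather than the calculation actually used. It applies the standard birthday-bound approximation p ≈ 1 - exp(-n(n-1)/(2d)), where n is the number of files and d is the number of possible hash values, and assumes hashes are uniformly distributed.

```fsharp
// Birthday-bound approximation: probability that at least two of n
// uniformly distributed hashes of the given bit width collide.
let collisionProbability (n: float) (bits: int) : float =
    let space = 2.0 ** float bits
    let x = n * (n - 1.0) / (2.0 * space)
    // For tiny x, 1 - exp(-x) rounds to 0 in double precision,
    // so fall back to the first-order approximation p ≈ x.
    if x < 1e-9 then x else 1.0 - exp (-x)

for bits in [ 32; 128; 160 ] do
    printfn "%3d-bit hash, 10000 files: %g" bits (collisionProbability 10000.0 bits)
```

For 10,000 files and a 32-bit hash this comes out a little under 1.2%, in the same ballpark as the figure quoted above. For 128-bit and wider hashes the probability is vanishingly small.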

That said, I certainly don't see anything wrong with a stronger hash function if the perf isn't a problem. We'd have to measure that, but if it's negligible then I think something like MD5 is fine. I'd much rather use something in the BCL than having to hand-code some F# implementation of something in here.

cartermp · 2018-11-21T22:01:54Z

So, second commit is uhhh...a lot more. But this addresses all sources of source files being used as keys in the caches. Let me make sure this PR is marked as WIP so we don't ever accidentally pull it before really vetting the approach.

vasily-kirichenko · 2018-11-22T12:16:15Z

> Typechecking is probably so expensive that running a sha256 over the source wouldn't be noticeable.
> And if it is, there is still murmur3, which is a fast, specialised, 128 bit hash. Heck, you could also use md5 (also 128 bit)

The .NET MD5 and SHA-1 implementations have approximately the same performance, while SHA-256/384/512 are way slower, so I'd use SHA-1, or murmur3 if it's fast.
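Since the thread agrees that the cost would have to be measured, here is a small timing sketch using only BCL types. The payload is synthetic (roughly 900 kB of repeated text), the helper name `timeHash` is made up for this example, and a `Stopwatch` loop like this is only a rough indication; results vary by machine and runtime, so no timings are claimed here.

```fsharp
open System.Diagnostics
open System.Security.Cryptography
open System.Text

// Synthetic "source file" of roughly 900 kB; real measurements
// should use real inputs.
let payload = Encoding.UTF8.GetBytes(String.replicate 50000 "let value = 12345\n")

// Hash the payload `iterations` times with the given algorithm,
// print the elapsed time, and return the digest size in bytes.
let timeHash (name: string) (create: unit -> HashAlgorithm) (iterations: int) : int =
    use algorithm = create ()
    let stopwatch = Stopwatch.StartNew()
    let mutable digest = Array.empty<byte>
    for i in 1 .. iterations do
        digest <- algorithm.ComputeHash(payload)
    stopwatch.Stop()
    printfn "%-8s %4d bits  %6d ms" name (digest.Length * 8) stopwatch.ElapsedMilliseconds
    digest.Length

let md5Size = timeHash "MD5" (fun () -> MD5.Create() :> HashAlgorithm) 100
let sha1Size = timeHash "SHA-1" (fun () -> SHA1.Create() :> HashAlgorithm) 100
let sha256Size = timeHash "SHA-256" (fun () -> SHA256.Create() :> HashAlgorithm) 100
```

A proper comparison would use a benchmarking harness with warm-up runs and files of realistic size.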

ncave · 2018-11-24T17:13:26Z

@cartermp Is the second commit necessary, what are we saving by hashing source file names?

cartermp · 2018-11-24T17:37:03Z

Ah, yeah they're just the path+names (even though they're not named that way). I'll be rid of the second commit.

TIHan · 2018-12-15T02:41:44Z

Closing this, as PR #6001 makes source text equality abstract for caching. If you are using the string implementation of ISourceText, it will do a direct string compare. But for VS, our implementation of ISourceText's ContentEquals delegates to Roslyn's SourceText ContentEquals, which does the right thing already: http://source.roslyn.io/#Microsoft.CodeAnalysis/Text/SourceText.cs,dc290adccf6f9ba7
Other IDEs will be encouraged to implement their own ISourceText so that they are in charge of the efficiency there; but I will always recommend that they just take a dependency on Microsoft.CodeAnalysis and implement ISourceText using Roslyn's SourceText.
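The idea of making equality abstract can be sketched as follows. `ISourceTextSketch` is a hypothetical, trimmed-down interface written for this example; it is not the actual ISourceText from #6001, and the string-backed type below is only an assumption about what a "direct string compare" implementation might look like.

```fsharp
open System

// Hypothetical, trimmed-down interface for illustration only.
type ISourceTextSketch =
    abstract Length: int
    abstract ContentEquals: ISourceTextSketch -> bool

// String-backed implementation: falls back to a direct string comparison.
type StringText(text: string) =
    member this.Text = text
    interface ISourceTextSketch with
        member this.Length = text.Length
        member this.ContentEquals(other: ISourceTextSketch) =
            match other with
            | :? StringText as s -> String.Equals(text, s.Text, StringComparison.Ordinal)
            | _ -> false

// The cache comparison no longer cares how equality is decided;
// each host supplies its own ContentEquals.
let areSameForParsing ((fileName1: string, source1: ISourceTextSketch),
                       (fileName2: string, source2: ISourceTextSketch)) =
    fileName1 = fileName2 && source1.ContentEquals(source2)

let first = StringText("let x = 1") :> ISourceTextSketch
let same = StringText("let x = 1") :> ISourceTextSketch
let edited = StringText("let x = 2") :> ISourceTextSketch

printfn "same text: %b" (areSameForParsing (("a.fs", first), ("a.fs", same)))
printfn "edited text: %b" (areSameForParsing (("a.fs", first), ("a.fs", edited)))
```

An editor host could provide a different implementation of the same interface, for example one that compares precomputed checksums, without the cache code changing.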

TIHan · 2018-12-15T03:12:53Z

A path forward for caching will be interesting. Ideally we might want to use an int based on time to determine file changes.

dsyme reviewed Nov 21, 2018


ncave mentioned this pull request Nov 21, 2018

(fable-compiler, REPL) Incremental building for projects with multiple files fable-compiler/Fable#1648

Closed

cartermp changed the title ~~Use hash code instead of full source text for some language service caches~~ [WIP] Use hash code instead of full source text for some language service caches Nov 21, 2018

cartermp changed the title ~~[WIP] Use hash code instead of full source text for some language service caches~~ [WIP] Use hash code instead of full source text for language service caches Nov 21, 2018

cartermp and others added 3 commits November 24, 2018 14:18

Use hash code instead of full source text for language service caches

e40fd3f

Clean diff

67a25a9

you gotta be kidding me

969ed66

cartermp changed the title ~~[WIP] Use hash code instead of full source text for language service caches~~ Use hash code instead of full source text for language service caches Nov 28, 2018

TIHan closed this Dec 15, 2018

This was referenced Dec 18, 2018

Source files are ending up on the large object heap #4881

Closed

Use smaller keys in language service caches #6028

Closed

cartermp mentioned this pull request Jan 2, 2019

Use hashes for language service cache keys #6058

Merged

ncave mentioned this pull request Feb 1, 2019

Added ParseAndCheckFileInProject fable-compiler/Fable#1714

Merged
