Intern strings #811
Conversation
@layus Although a 10% speedup is nice, I have a few concerns:
The answers to these questions weren't obvious to me from a quick look at the…
@ryantrinkle Thanks for your review :-)
It should. There is a possibility of introducing incorrect behavior by sorting on the InternedText values rather than on the Texts themselves. I had to reorder operations to sort after uninterning. That is mostly because Nix has sorted attrsets, which are not sorted internally.
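To illustrate why the reordering matters, here is a minimal sketch, assuming this PR uses the `intern` package's `InternedText`, whose Ord instance compares intern ids (allocation order) rather than the text itself:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Interned      (intern, unintern)
import           Data.Interned.Text (InternedText)
import           Data.List          (sort)
import           Data.Text          (Text)

-- The safe pattern: unintern first, then sort the Texts, because
-- sorting InternedText directly would follow interning order.
sortedKeys :: [InternedText] -> [Text]
sortedKeys = sort . map unintern

main :: IO ()
main = print (sortedKeys (map intern ["out", "src", "builder"]))
-- ["builder","out","src"], regardless of the order the names were interned
```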
No, by design. We could imagine something like ref-counting, but it does not really matter. Most of the Texts are held in memory for the whole lifetime of the program, and you get better space savings by interning strings than by garbage collecting them. Try to guess how many times the text "out" appears. Or "src" :-)
Not at all, if you trust…
By the way, that's what Nix does. I also trust them to have properly benchmarked it and balanced the pros and cons of prevented garbage collection.
```diff
@@ -172,3 +181,39 @@ alterF
 alterF f k m = f (M.lookup k m) <&> \case
   Nothing -> M.delete k m
   Just v -> M.insert k v m
+
+type VarName = InternedText
```
Great idea to hash-cons. I had forgotten about, and only remotely knew of, such a thing. Kudos for the type synonym.

Please explain (maybe in the code) why hash-cons VarName in particular. Hash-consing VarName looks like minor stuff when there are elephants in the heap. From what I have read, hash consing is computationally heavy and only pays off in some situations.

There should be things in the project that are definitely great to hash-cons; so far my mind has not landed on any particular one, nor do I have practice with doing it. You probably have a better idea of what to say, and people would want to ask this anyway.
Looking at the library: it is indeed mostly used just for plain text types.
> Although a 10% speedup is nice

It is a 10% speedup of the currently slow code, so the improvement is drastic. I am also interested in how interning VarName brings this degree of improvement.
It would be great if there is a way to dedupe (import ...).

It may indeed be a drastic improvement if the uses of pkgs and so on get deduped. with ... scopes are the ones that probably crave deduplication. with {pkgs, haskellPackages, lib, etc...} can be considered almost unique, big & static scopes; their match hash tables should be small. I wonder whether they are deduped as-is currently.
```diff
@@ -130,12 +132,12 @@ builtins = do
   where
     go b@(Builtin TopLevel _) = b
     go (Builtin Normal (name, builtin)) =
-      Builtin TopLevel ("__" <> name, builtin)
+      Builtin TopLevel (intern $ "__" <> unintern name, builtin)
```
Maybe add a coerce-style conversion for InternedText <-> Text. That way the code would read easily, and now there is HLS to show the type signature of the equivalent transformation.

The idea is to abstract this theme into the well-known coerce abstraction, so that the interned types get represented and read by people simply as Text, just in a different type. That frees other people, during their work, from the need to go read the specific library functions and read about hash consing to understand what intern <-> unintern does, if their work is not related to the hash consing.
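One caveat: if I recall the `intern` package's representation correctly, `InternedText` is a data type pairing an id with the Text, so a literal `Coercible` instance is not available. Plain conversion helpers are a sketch of the same readability goal (`toVarName`/`fromVarName` are hypothetical names, not part of the PR):

```haskell
import           Data.Interned      (intern, unintern)
import           Data.Interned.Text (InternedText)
import           Data.Text          (Text)

type VarName = InternedText

-- Hypothetical wrappers: call sites read as plain Text conversions,
-- with the hash consing hidden behind the names.
toVarName :: Text -> VarName
toVarName = intern

fromVarName :: VarName -> Text
fromVarName = unintern
```

With these, the builtins hunk above would read `Builtin TopLevel (toVarName $ "__" <> fromVarName name, builtin)`.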
@Anton-Latukha The implementation is indeed rough. I started by changing the type synonym for VarName, and ended up using the same name for a variety of things that do not really fit the "varname" name. I also had to implement many instances that were missing for InternedText. Then I had to move everything into an easily includable module, so I stuffed it all in Nix.Utils. There are definitely some interface improvements to be made.

Decisions still to be made / benchmarked
The heap usage shows a lot of tuples, most probably originating from the HashMap implementation. That makes me think it would be worth considering hashtables. Again, speedscope is very interesting for understanding what is going on.
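On the "many instances that were missing" above: one plausible shape is an orphan instance like the following sketch (the exact instance set added in the PR may differ; hashing via `unintern` keeps hashes independent of interning order):

```haskell
{-# OPTIONS_GHC -fno-warn-orphans #-}

import Data.Hashable      (Hashable (hashWithSalt))
import Data.Interned      (unintern)
import Data.Interned.Text (InternedText)

-- Orphan instance sketch: hash the underlying Text, so a value hashes
-- the same whether or not it was interned first.
instance Hashable InternedText where
  hashWithSalt salt = hashWithSalt salt . unintern
```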
@layus I put together an alternative way of doing interning here: obsidiansystems@0adb998#diff-3d76949b3ec463adc3cb7c86318020458995f4f80602365dc8e413c34f278caeR189-R229 The basic idea is that we do not intern globally; instead, when two values are compared, if they are equal, we opportunistically "intern" one of them by making it point at the other. This has a number of advantages.
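A minimal sketch of that idea with IORefs (`eqShare` is a hypothetical helper; the real change lives in the linked commit):

```haskell
import Data.IORef (IORef, readIORef, writeIORef)

-- Opportunistic "interning": if the two referenced values turn out to
-- be equal, point the second reference at the first value, so the
-- duplicate copy becomes garbage.
eqShare :: Eq a => IORef a -> IORef a -> IO Bool
eqShare ra rb = do
  a <- readIORef ra
  b <- readIORef rb
  if a == b
    then True <$ writeIORef rb a  -- share one copy from now on
    else pure False
```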
Now to the performance: I tested by running this, throwing away the first two samples, and taking the "real" time of the rest:
That gave me 8 results for each. For master (7e6cd97), I got: 15.57, 15.57, 15.57, 15.56, 15.58, 15.65, 15.63, 15.59 (average: 15.5900). From this test, it appears that global interning improves performance by 3.16% and local interning improves it by 3.04%. I'm not entirely sure how much of the difference is due to measurement error (maybe someone with stats knowledge can chime in!) but the two approaches seem pretty close!

Further Info

I decided I wanted to get some cleaner data, so I pinned the program to a single CPU and gave it… The code to run this was (run as root):
The machine I ran this on has an AMD EPYC 7702P.

Final Thoughts

Given that global interning seems to have a large impact on @layus's machine, I would be very interested to know whether my "local" interning also does.
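On the measurement-error question above: a quick back-of-envelope over the eight master timings quoted earlier (plain Haskell, sample standard deviation):

```haskell
samples :: [Double]
samples = [15.57, 15.57, 15.57, 15.56, 15.58, 15.65, 15.63, 15.59]

mean :: Double
mean = sum samples / fromIntegral (length samples)            -- 15.59

stddev :: Double
stddev = sqrt (sum [(x - mean) ^ (2 :: Int) | x <- samples]
                 / fromIntegral (length samples - 1))         -- ≈ 0.033 s
```

That is roughly 0.2% run-to-run noise, so the ~3% improvements sit well above it, even if the 0.12% gap between the two approaches does not.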
Here is a proper benchmarking of the three solutions:
In this case global interning adds a variable slowdown, and local interning has no effect whatsoever. I was looking for a solution to make attrset lookups and matching faster. In many places we extract, convert, compare and look up attribute sets. The flamechart at https://www.speedscope.app/#profileURL=https://gist.githubusercontent.com/layus/2dfb3f18b12c1daec65a69d34ee29639/raw/10d33407cbd78a4a2ea3ffe89ac5db3b91c31de7/hnix.eventlog.json spends a lot of time manipulating these attrsets, from lists to hashmaps and conversely. Attrset-related functions take about 11s out of 30. So local interning does not help, because it does not speed up these attrset queries. But global interning does not help either. I guess the…

Next attempt: use hashtables?
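If the hashtables experiment goes ahead, a minimal sketch of what a mutable attrset could look like with the hashtables package (`AttrSet` and the `Value` placeholder are hypothetical names, not hnix's actual types):

```haskell
import qualified Data.HashTable.IO as H
import           Data.Text         (Text)

type Value   = ()                           -- placeholder for hnix values
type AttrSet = H.BasicHashTable Text Value  -- mutable, lives in IO

-- Build an attrset once from an association list.
mkAttrSet :: [(Text, Value)] -> IO AttrSet
mkAttrSet = H.fromList

-- O(1) average lookup, without converting between lists and maps.
lookupAttr :: AttrSet -> Text -> IO (Maybe Value)
lookupAttr = H.lookup
```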
Hunches
Beautiful.
Tests are very unstable. I guess they are bound by some resource (RAM, cache?) because running with anything else in parallel (despite having multiple cores) changes the results. On a second run with Firefox, I get the same times for all the implementations. I am clueless ATM.
Okay, a final benchmark on a very still machine (read: coffee pause :-D) yields more stable results (see variance).
Interning does not look like a killer feature. It drives me crazy that we are 40× slower than Nix. I guess I could live with a factor of 10, but a factor of 2 would make me happy. Sadly, Christmas is already over 🎅
Tangential offtopic

Yeah, one probably should reload into the Linux console to run more or less stable benchmarks on a desktop. An empty, network-less single-board setup is probably a good-enough testing platform; it would punish hungry processes in the metrics and so distinguish between computations more sharply. But in most cases that means ARM, and ARM chips have different implementations and instruction sets with different optimizations, and GHC probably does not land perfectly on them; overall it should still give some results.

I personally am not bothered that HNix is slower. It could be 100 times slower and I would not care, because the first concern is bug-for-bug language coverage, and the second is a good language debugging story. People would use HNix because it is in Haskell, or because it can actually do what Nix can not: really check & debug the code. I just plan to do a lot of clean-up and refactoring, and we would find what the causes are.

--- that is, after the new Store release comes out (because I basically do not care that HNix is slow, especially before builtins and all the clean-up & refactors). The base needs to fully work before speed or memory become a concern. People would start using HNix not because of the speed, but because it does something that Nix can not do. When Haskell.Nix infra breaks, they would only care to find the place where and why; Nix can not deliver that, HNix can. idk the situation in…

But at the end of it all, we should arrive at a maximally lean and flexible implementation by code & design, since Haskell implements the language with the same paradigms. Because the languages share the same mathematical paradigms in their type systems, in the most theoretically efficient implementation the HNix language core is a functor from the Nix category into the Hask category. So I think in the end game, after our work is done, HNix should be performant enough for use.

Hashing the structures definitely should be effective in some places in the implementation. Maybe narrow it down to the most logically cost-effective case and let's roll.
Good catch! That's quite a good example of a counter-performance :-). Is it sitting on any branch we can merge?
Notifying on…
Dropping for now. Adds complexity for no real gain. |
Some speedup is gained:
avg (5 runs): 21.2s vs 23.4s on evaluating nixpkgs.hello
Nix also caches the strings' content. I guess we should just benchmark it. The only downside is that we encounter bigger texts, and compute a hash unnecessarily on some of them.
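A sketch of that hash-caching idea using the hashable package (`HashedText` is a hypothetical name): store the hash once next to the content, so equality checks can fail fast, at the cost of hashing every string up front.

```haskell
import Data.Hashable (Hashable (hashWithSalt), hash)
import Data.Text     (Text)

-- Cache the hash alongside the text.
data HashedText = HashedText !Int !Text

-- The one unconditional hash computation mentioned above.
mkHashedText :: Text -> HashedText
mkHashedText t = HashedText (hash t) t

-- Equality rejects unequal strings by comparing cached hashes first;
-- the full texts are only compared when the hashes agree.
instance Eq HashedText where
  HashedText h1 t1 == HashedText h2 t2 = h1 == h2 && t1 == t2

-- Hashing becomes a constant-time field read.
instance Hashable HashedText where
  hashWithSalt salt (HashedText h _) = hashWithSalt salt h
```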