Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latency #402

Closed
antoine-levitt opened this issue Jan 5, 2021 · 25 comments · Fixed by #809
Closed

Latency #402

antoine-levitt opened this issue Jan 5, 2021 · 25 comments · Fixed by #809
Labels
good first issue Good for newcomers performance Performance regression or performance-related

Comments

@antoine-levitt
Copy link
Member

Could be good to run https://discourse.julialang.org/t/new-tools-for-reducing-compiler-latency/52882, although I would imagine that our latency cost is because of our deps, not because of us.

@mfherbst
Copy link
Member

mfherbst commented Jan 5, 2021

Actually I think SnoopCompile will help quite a bit. Could also be super helpful for my new design in asedftk. I'm definitely going to take a look at that at some point.

@antoine-levitt
Copy link
Member Author

At least it'll tell us which packages slow us down and then we can bug them.

@antoine-levitt
Copy link
Member Author

I just tried the new 1.6 beta, a lot of things in general feel faster but time julia silicon.jl is actually slower, so there might be some scope for latency optimization in DFTK.

@mfherbst mfherbst added the performance Performance regression or performance-related label Jan 19, 2021
@antoine-levitt
Copy link
Member Author

Something is seriously wrong somewhere. Just taking out lobpcg_hyper_impl.jl with a simple test script

A = randn(10, 10); A = A+A'
X = randn(10, 2)
ret = LOBPCG(A, randn(10, 2))
println(ret.n_matvec)

has a second @time include of 4s, due to inference. This is on julia 1.6.1

@antoine-levitt antoine-levitt added the good first issue Good for newcomers label Jul 6, 2021
@antoine-levitt
Copy link
Member Author

Playing with JuliaLang/julia#41612:

julia> using DFTK
    125.7 ms  FillArrays
    965.6 ms  ArrayLayouts
    853.7 ms  StaticArrays
    859.2 ms  DiffResults
      4.0 ms  Compat
      6.0 ms  Ratios
      1.0 ms  Requires
      0.9 ms  Adapt
      7.9 ms  WoodburyMatrices
     11.6 ms  AxisAlgorithms
     57.2 ms  ChainRulesCore
     55.1 ms  OffsetArrays
    198.2 ms  Interpolations
     86.4 ms  DataStructures
      0.7 ms  NaNMath
      0.3 ms  UnPack
      1.4 ms  DataAPI
      0.4 ms  StatsAPI
      0.9 ms  SortingAlgorithms
     11.3 ms  Missings
     46.0 ms  StatsBase
      1.8 ms  PositiveFactorizations
      1.8 ms  Parameters
      4.3 ms  DocStringExtensions
      0.6 ms  CompilerSupportLibraries_jll
      3.1 ms  LogExpFunctions
      0.6 ms  Preferences
      2.8 ms  JLLWrappers
     12.6 ms  OpenSpecFun_jll
     33.3 ms  SpecialFunctions
     54.1 ms  MacroTools
     57.5 ms  CommonSubexpressions
      1.1 ms  DiffRules
    193.2 ms  ForwardDiff
      0.3 ms  IfElse
     48.9 ms  Static
    198.5 ms  ArrayInterface
     14.6 ms  FiniteDiff
      7.8 ms  NLSolversBase
    478.1 ms  LineSearches
    564.8 ms  Optim
      0.5 ms  ExprTools
     26.1 ms  RecipesBase
      2.2 ms  Mocking
     67.1 ms  TimeZones
    102.5 ms  Intervals
    311.7 ms  MutableArithmetics
   1202.4 ms  Polynomials
    461.6 ms  Libxc_jll
    466.3 ms  Libxc
      6.7 ms  Distances
      0.3 ms  VersionParsing
     13.2 ms  Conda
   1476.3 ms  PyCall
     10.2 ms  AbstractFFTs
      4.3 ms  MPICH_jll
     52.1 ms  MPI
      0.6 ms  Reexport
    297.0 ms  LinearMaps
      0.5 ms  CommonSolve
      1.8 ms  ConstructionBase
     47.6 ms  TimerOutputs
    195.7 ms  Unitful
   1121.0 ms  PeriodicTable
     36.0 ms  IterTools
      3.1 ms  ProgressMeter
     20.5 ms  IterativeSolvers
      4.9 ms  NLsolve
    333.1 ms  UnitfulAtomic
      1.3 ms  FFTW_jll
    274.1 ms  FFTW
    193.0 ms  BlockArrays
      1.4 ms  spglib_jll
      9.5 ms  Roots
   8915.1 ms  DFTK

@mfherbst
Copy link
Member

Hmm once #483 is in we can probably make PyCall optional. That already is quite a large chunk. As for PeriodicTable that is also mostly for convenience, but I wonder why it takes so long to load. It is a pretty small package.

@mfherbst
Copy link
Member

But 9 seconds is really not great. We should definitely work on that.

@antoine-levitt
Copy link
Member Author

The load time is not even that much of an issue, the time to first SCF feels longer in my experience. The LOBPCG stuff is apparently responsible for some of it (though I have no idea why)

@antoine-levitt
Copy link
Member Author

https://discourse.julialang.org/t/ann-new-package-snoopprecompile/84778/4 sounds like exactly what we need

@mfherbst
Copy link
Member

mfherbst commented Jul 30, 2022

No it does not help. I tried it yesterday. We have too many invalidations. We need to fix those first.

@antoine-levitt
Copy link
Member Author

Pretty sure that's not our issue. They always talk about that and it's one source of latency, but I don't think that's the one we suffer from. At least I couldn't find anything problematic when I took at look last time

@antoine-levitt
Copy link
Member Author

List of things to try:

@antoine-levitt
Copy link
Member Author

The julia beta released today has the latency improvements, which look pretty spectacular. Not at a computer for a while, but very curious to see what it looks like for us

@rashidrafeek
Copy link

rashidrafeek commented Dec 30, 2022

I just tested this expecting some improvements. But it seems there is a regression. I tested the first example in the README and measured the time to first SCF:

using DFTK, Unitful, UnitfulAtomic

# 1. Define lattice and atomic positions
a = 5.431u"angstrom"          # Silicon lattice constant
lattice = a / 2 * [[0 1 1.];  # Silicon lattice vectors
                   [1 0 1.];  # specified column by column
                   [1 1 0.]];

# Load HGH pseudopotential for Silicon
Si = ElementPsp(:Si, psp=load_psp("hgh/lda/Si-q4"))

# Specify type and positions of atoms
atoms     = [Si, Si]
positions = [ones(3)/8, -ones(3)/8]

# 2. Select model and basis
model = model_LDA(lattice, atoms, positions)
kgrid = [4, 4, 4]     # k-point grid (Regular Monkhorst-Pack grid)
Ecut = 7              # kinetic energy cutoff
# Ecut = 190.5u"eV"  # Could also use eV or other energy-compatible units
basis = PlaneWaveBasis(model; Ecut, kgrid)
# Note the implicit passing of keyword arguments here:
# this is equivalent to PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid)

# 3. Run the SCF procedure to obtain the ground state
@time scfres = self_consistent_field(basis, tol=1e-5);

Following are the results for two runs each in 1.8.4 and 1.9.0-beta2.

1.8.4:

julia> Pkg.status()
Status `~/.julia/environments/dftk/Project.toml`
  [acf6eb54] DFTK v0.6.0 `~/GitProjects/DFTK`

julia> versioninfo()
Julia Version 1.8.4
Commit 00177ebc4fc (2022-12-23 21:32 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, nehalem)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_PKG_DEVDIR = /home/rashid/GitProjects
  JULIA_NUM_PRECOMPILE_TASKS = 6
  JULIA_CONDAPKG_VERBOSITY = 0
  JULIA_CONDAPKG_EXE = conda

Run 1: 70.294600 seconds (128.09 M allocations: 6.876 GiB, 5.40% gc time, 98.92% compilation time)
Run 2: 72.600813 seconds (128.07 M allocations: 6.875 GiB, 4.58% gc time, 98.76% compilation time)

1.9.0-beta2:

julia> Pkg.status()
Status `~/.julia/environments/dftk/Project.toml`
  [acf6eb54] DFTK v0.6.0 `~/GitProjects/DFTK`

julia> versioninfo()
Julia Version 1.9.0-beta2
Commit 7daffeecb8c (2022-12-29 07:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, nehalem)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_PKG_DEVDIR = /home/rashid/GitProjects
  JULIA_NUM_PRECOMPILE_TASKS = 6
  JULIA_CONDAPKG_VERBOSITY = 0
  JULIA_CONDAPKG_EXE = conda

Run 1: 81.235901 seconds (80.53 M allocations: 5.735 GiB, 7.46% gc time, 168.66% compilation time)
Run 2: 81.807179 seconds (80.53 M allocations: 5.735 GiB, 6.91% gc time, 167.89% compilation time)

@antoine-levitt
Copy link
Member Author

Oh :( thanks for testing though. Next step is to put this snippet in DFTK.jl so it gets precompiled... Otherwise something must be blocking precompolation, maybe the timeroutputs macro...

@antoine-levitt
Copy link
Member Author

From what I read the new changes introduce a load time regression, the trade-off being a more efficient precompilation, so these numbers are not that crazy, they just mean that precompilation is not working for us

@antoine-levitt
Copy link
Member Author

The following diff adds precompilation to DFTK:

diff --git a/Project.toml b/Project.toml
index 284eacdc..94b15148 100644
--- a/Project.toml
+++ b/Project.toml
@@ -35,6 +35,7 @@ PseudoPotentialIO = "cb339c56-07fa-4cb2-923a-142469552264"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 Requires = "ae029012-a4dd-5104-9daa-d747884805df"
 Roots = "f2b01f46-fcfa-551c-844a-d8ac1e96c665"
+SnoopPrecompile = "66db9d55-30c0-4569-8b51-7e840670fc0c"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
 SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"
 Spglib = "f761d5c5-86db-4880-b97f-9680a7cccfb5"
diff --git a/src/DFTK.jl b/src/DFTK.jl
index 09416da9..7985d9a1 100644
--- a/src/DFTK.jl
+++ b/src/DFTK.jl
@@ -232,4 +232,19 @@ function __init__()
     end
 end
 
+using SnoopPrecompile
+
+begin
+a = 10.26  # Silicon lattice constant in Bohr
+lattice = a / 2 * [[0 1 1.];
+                   [1 0 1.];
+                   [1 1 0.]]
+Si = ElementPsp(:Si, psp=load_psp("hgh/lda/Si-q4"))
+atoms     = [Si, Si]
+positions = [ones(3)/8, -ones(3)/8]
+
+model = model_LDA(lattice, atoms, positions)
+basis = PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
+scfres = self_consistent_field(basis, tol=1e-8)
+end
 end # module DFTK

Some numbers on 1.9beta

Without precompilation code
1 dependency successfully precompiled in 5 seconds. 111 already precompiled.
42.652828 seconds (122.06 M allocations: 8.922 GiB, 7.15% gc time, 141.72% compilation time: 1% of which was recompilation)

With precompilation code
1 dependency successfully precompiled in 59 seconds. 103 already precompiled.
8.373334 seconds (18.88 M allocations: 1.944 GiB, 4.29% gc time, 57.40% compilation time: 59% of which was recompilation)

So all in all a very nice win for users. Not so much for developers though. I asked on the julia slack, and people recommend turning off precompilation of deved packages. Should we just merge this diff, at least when 1.9 is released? @rashidrafeek can you post your numbers with this diff, since they're much slower than mine?

@rashidrafeek
Copy link

Its a lot better after applying this patch! Number from my system with 1.9.0-beta2:

Without precompilation code:

  1 dependency successfully precompiled in 13 seconds. 103 already precompiled.
 78.601886 seconds (80.51 M allocations: 5.735 GiB, 6.91% gc time, 169.27% compilation time)

With precompilation code:

  1 dependency successfully precompiled in 167 seconds. 103 already precompiled.
 10.451499 seconds (10.32 M allocations: 875.533 MiB, 3.96% gc time, 149.20% compilation time)

@antoine-levitt
Copy link
Member Author

Can you try the the same patch but with @precompile_all_calls before the begin?

@mfherbst
Copy link
Member

mfherbst commented Jan 4, 2023

Hmm that's still pretty impressive overall. Nice step in the right direction!

@rashidrafeek
Copy link

Can you try the the same patch but with @precompile_all_calls before the begin?

  1 dependency successfully precompiled in 170 seconds. 103 already precompiled.
  1.412220 seconds (467.34 k allocations: 216.655 MiB, 4.57% gc time, 41.23% compilation time)

😲

@antoine-levitt
Copy link
Member Author

Huh I had more than this. Did you do both steps in the same session?

@rashidrafeek
Copy link

Meaning, both precompilation and running code? I tried in the same session as well as after restarting julia, as I was not sure of the result. Both gave similar results. A result after restarting julia:

  1.186028 seconds (466.23 k allocations: 216.232 MiB, 3.85% gc time, 38.92% compilation time)

@antoine-levitt
Copy link
Member Author

Oh wow. OK fine, let's just do it and eat the precompilation cost. I'll add a flag so devs can disable it.

@antoine-levitt
Copy link
Member Author

Ah I see where the discrepancy between your numbers and mine come from, you're measuring the last @time in the above, I was measuring the whole thing (using, model and basis construction, self_consistent_field). So we get similar relative timings but your machine is slower than mine. (Thanks @giordano on slack!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers performance Performance regression or performance-related
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants