
Inefficient CPU utilization when compiling dependencies #14200

Open
liamwhite opened this issue Jan 19, 2025 · 6 comments

@liamwhite
Contributor

Elixir and Erlang/OTP versions

Erlang/OTP 27 [erts-15.2] [source] [64-bit] [smp:32:32] [ds:32:32:10] [async-threads:1] [jit:ns]

Elixir 1.18.1 (compiled with Erlang/OTP 27)

Operating system

Linux

Current behavior

I have an application that has a lot of its own files and a fair number of dependencies. When running mix deps.compile for the first time (so no _build directory exists yet), mix spends a lot of time compiling applications serially, one after the other.

The problem with this approach is that I have a 16-core, 32-logical-core processor, and System.schedulers_online() returns 32, so mix should always be trying to keep 32 compilation units in flight across all applications whose dependencies are already satisfied, which it does not appear to be doing.

Since most dependency applications are just a handful of files, this results in only a few cores being used most of the time. The logging and utilization suggest that all of the files within each application get compiled in parallel, but then all of those compilations are awaited before the next application is started.
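The observed behavior can be illustrated with a sketch (this is not the actual Mix implementation; `apps_in_topological_order` and `compile_file/1` are hypothetical names):

```elixir
# Illustrative sketch only -- not actual Mix internals.
# Files within one app compile in parallel, but each app is a barrier:
for app <- apps_in_topological_order do
  app.files
  |> Task.async_stream(&compile_file/1,
       max_concurrency: System.schedulers_online())
  |> Stream.run()

  # The next application only starts after this one fully completes,
  # so an app with 3 files leaves 29 of 32 schedulers idle.
end
```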

An example of a deps list in a new mix project that doesn't make the best use of the CPU during compilation follows:

  # Run "mix help deps" to learn about dependencies.
  defp deps do
    [
      {:phoenix, "~> 1.7"},
      {:phoenix_pubsub, "~> 2.1"},
      {:phoenix_ecto, "~> 4.4"},
      {:ecto_sql, "~> 3.9"},
      {:postgrex, ">= 0.0.0"},
      {:phoenix_html, "~> 3.3"},
      {:phoenix_view, "~> 2.0"},
      {:phoenix_live_reload, "~> 1.4", only: :dev},
      {:gettext, "~> 0.22"},
      {:jason, "~> 1.4"},
      {:bandit, "~> 1.2"},
      {:phoenix_pubsub_redis, "~> 3.0"},
      {:ecto_network, "~> 1.3"},
      {:bcrypt_elixir, "~> 3.0"},
      {:pot, "~> 1.0"},
      {:secure_compare, "~> 0.1"},
      {:nimble_parsec, "~> 1.2"},
      {:qrcode, "~> 0.1"},
      {:redix, "~> 1.2"},
      {:remote_ip, "~> 1.1"},
      {:briefly, "~> 0.4"},
      {:req, "~> 0.5"},
      {:exq, "~> 0.17"},
      {:ex_aws, "~> 2.0"},
      {:ex_aws_s3, "~> 2.0"},
      {:sweet_xml, "~> 0.7"},
      {:inet_cidr, "~> 1.0"},
      {:swoosh, "~> 1.17"},
      {:mua, "~> 0.2.0"},
      {:mail, "~> 0.3.0"},
      {:ex_doc, "~> 0.30"},
      {:sobelow, "~> 0.11"},
      {:mix_audit, "~> 2.1"},
      {:dialyxir, "~> 1.2"}
    ]
  end

Example

$ time mix deps.compile
[...]

real	0m30.369s
user	1m56.178s
sys	0m30.771s

-> 3.83x speedup over serial compilation (user/real: 116.2 s / 30.4 s ≈ 3.83)

Expected behavior

My application contains 640+ elixir compilation units. After cleaning and running mix compile, mix will max out the CPU immediately and consistently schedule work for all 32 logical cores until there are no new compilation tasks to complete.

Example:

$ time mix compile
[...]
Generated philomena app

real	0m13.939s
user	1m40.286s
sys	0m22.636s

-> 7.19x speedup over serial compilation (user/real: 100.3 s / 13.9 s ≈ 7.19)

This would also be the desired behavior when compiling application dependencies.

@josevalim
Member

This is tricky because compiling Elixir code means running Elixir code, and that code may rely on global values, the most obvious one being the current working directory.

Therefore, the only way we could optimize this is by starting multiple mix instances, but then we need to figure out a way of splitting the work that doesn't require loading the same dependencies multiple times. For example, splitting the work across dependencies that require plug may be tricky, because it means you either compile plug twice or you have to wait until plug is compiled and then load it across the multiple instances.

What we could do is a graph analysis and see if the graph has reasonably distinct branches, and emit a separate mix deps.compile command. Rebar and make dependencies would also be trivial to parallelize, as they already require a separate VM, but in your case you only have 4 of them, so it is unlikely to make a large difference.
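The graph analysis mentioned above could look roughly like this (a hedged sketch: `deps_graph` is a hypothetical app-to-direct-deps map, e.g. derived from mix deps.tree output, and the module is an invented name, not a Mix API):

```elixir
defmodule SubtreeSplitter do
  # Transitive closure of one app's dependencies (assumes an acyclic graph).
  def closure(graph, app) do
    Enum.reduce(Map.get(graph, app, []), MapSet.new([app]), fn dep, acc ->
      MapSet.union(acc, closure(graph, dep))
    end)
  end

  # Merge roots whose closures overlap; each resulting disjoint group could
  # then be handed to a separate `mix deps.compile <apps...>` invocation.
  def split(graph, roots) do
    Enum.reduce(roots, [], fn root, groups ->
      cl = closure(graph, root)
      {overlapping, disjoint} =
        Enum.split_with(groups, &(not MapSet.disjoint?(&1, cl)))

      [Enum.reduce(overlapping, cl, &MapSet.union/2) | disjoint]
    end)
  end
end
```

If the groups turn out to be few and lopsided (common when most deps sit under phoenix/ecto), the win from this split would be correspondingly small.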

@TylerWitt
Contributor

Could there be a global cache somehow (like a mix supervisor or something) that could cache per application, to avoid compiling the same dependency more than once?

For example, splitting the work on dependencies that require plug may be tricky, because it means you either compile plug twice or you have to wait until plug is compiled and then load it across multiple the multiple instances.

Using this as an example, we could check the cache for a compiled plug, and return its path if it already exists, otherwise, compile plug?
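Sketched out, the idea would be something like the following (`BuildCache`, `compile_app/1`, and `ensure_compiled/1` are all invented names for illustration, not a real Mix API):

```elixir
# Hypothetical per-application build cache shared between parallel workers.
def ensure_compiled(app) do
  case BuildCache.lookup(app) do
    {:ok, path} ->
      # Already built elsewhere: reuse the compiled beams.
      path

    :miss ->
      # Otherwise compile once and publish the result for other workers.
      path = compile_app(app)
      BuildCache.put(app, path)
      path
  end
end
```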

@josevalim
Member

josevalim commented Jan 20, 2025

I'd say that's a separate problem. One is caching and the other is optimizing the (uncached) build itself.

When it comes to caching, keep in mind that compile time configuration and optional dependencies may lead to different builds for a single dep, and the issue is that a bug in caching can lead to very subtle, hard to debug, issues. For caching itself, I'd prefer to explore first caching within the same build but different results, as in #12520, because I think our compilation graph is robust enough to deal with that and recompile stale parts.

@liamwhite
Contributor Author

liamwhite commented Jan 20, 2025

The scope of this issue is for the uncached build :)

and that code may rely on global values, the most obvious one being the current working directory

It is really unfortunate that such a highly concurrent language has the concept of a global working directory, though given that Erlang exposes it as file:get_cwd, I guess it would have been a leaky concept at compile time anyway.

I have no issue with spawning new mix processes to accomplish parallel compilation. I do wonder how expensive it is to redundantly reload compiled beam files across multiple VMs. I have to imagine that for the small, handful-of-files projects that seem to bottleneck my compilation today, it would clearly be worth it.

EDIT: also, maybe independent mix processes could avoid the overhead of code reloading by making compilation use a distributed erlang setup, and having the compiled code reside on the node that compiled it?

@josevalim
Member

It is really unfortunate that such a highly concurrent language has a concept of a global working directory, though I guess given that erlang has it as file:get_cwd, it would have been a leaky concept at compile time anyway.

Most system-related things, such as OS environment variables, the current working directory, or even the file system itself, are global shared resources. Of course, we could provide some reasonable "shielding", but at some point the rubber has to hit the road.

I do wonder how expensive it is to redundantly reload compiled beam files across multiple VMs?

The issue may not be reload per se but introducing synchronization points. If we can split the dep tree into subtrees without dependencies, then everyone's life will be easier. :) When I have some time, I will provide a script that computes these subtrees, so you and other people could experiment with this.
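Once such disjoint subtrees are known, fanning them out to separate OS-level Mix instances could be sketched like this (a hedged sketch, assuming `groups` is a list of lists of app names produced by the subtree script José mentions; mix deps.compile does accept explicit dependency names as arguments):

```elixir
# One OS process (and therefore one BEAM VM) per disjoint dependency group.
groups
|> Task.async_stream(
  fn apps ->
    System.cmd("mix", ["deps.compile" | Enum.map(apps, &to_string/1)])
  end,
  max_concurrency: System.schedulers_online(),
  timeout: :infinity
)
|> Stream.run()
```

Each instance compiles its own subtree without synchronization points, at the cost of booting one VM per group.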

@liamwhite
Contributor Author

I made an unoptimized proof-of-concept demo that compiles dependencies quite a lot faster using Elixir v1.17. (Note that a lock was added recently which would probably prevent this from working as intended on v1.18, but I did not test that.)

$ rm -rf _build
$ time mix deps.compile
[...]
real	0m28.457s
# iex -S mix
# then in a different terminal: rm -rf _build
iex(1)> Code.require_file("dep_solver.ex")
# ...
iex(2)> :timer.tc(fn -> ConcurrentDependencyProcessor.process_all_concurrently() end)
# ...
{16039126,
 [:ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok,
  :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok,
  :ok, :ok]}

so a 1.77x speedup (28.5 s / 16.0 s), and the CPU is fully utilized until the very end.
