Time to complete an epoch depends on the number of epochs #430

Closed
nickgnd opened this issue Dec 18, 2022 · 10 comments

@nickgnd
Contributor

nickgnd commented Dec 18, 2022

Hi 👋

this is the 2nd issue I'm opening in Axon in a short period of time. I hope I'm not misusing the medium; I'm learning ML, so there's a high chance I'm doing something funky. Please bear with me.

Again, thanks for all the work you are doing; it's great, and I'm enjoying learning ML with the Nx* libraries so far.


The issue

While playing with Axon I noticed that the training slows down when the number of epochs is increased. I'm not referring to the overall time, which is of course expected to grow, but to the time to complete a single epoch.

Here's a quick benchmark I assembled.

Mix.install(
  [
    {:exla, "~> 0.4"},
    {:nx, "~> 0.4"},
    {:axon, "~> 0.3.1"},
    {:benchee, "~> 1.1.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

# Generate the data

# 2 inputs
inputs =
  Nx.iota({9000, 2}, type: :f32)
  |> Nx.divide(9000)
  |> Nx.subtract(0.5)
  |> Nx.shuffle()

# one-hot encode the labels
labels =
  Enum.map(0..8999, fn _ -> Enum.random([0, 1]) end)
  |> Nx.tensor()
  |> Nx.reshape({:auto, 1})
  |> Nx.equal(Nx.tensor([0, 1]))

# split the dataset in batches
batch_size = 250
inputs_batches = Nx.to_batched(inputs, batch_size)
labels_batches = Nx.to_batched(labels, batch_size)
train_batches = Stream.zip(inputs_batches, labels_batches)

defmodule CustomEventHandler do
  def maybe_exit(%Axon.Loop.State{epoch: epoch} = state) do
    if epoch > 100 do
      IO.puts("Early exit!")
      {:halt_loop, state}
    else
      {:continue, state}
    end
  end
end

# Create a loop that early exits after the 100th epoch
loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam)
  |> Axon.Loop.handle(:epoch_started, &CustomEventHandler.maybe_exit/1)

Benchee.run(
  %{
    "100" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA) end,
    "1000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 1000, compiler: EXLA) end,
    "10_000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 10_000, compiler: EXLA) end
  },
  time: 4,
  memory_time: 2
)

And here are the results:

Name             ips        average  deviation         median         99th %
100             0.39         2.58 s     ±0.32%         2.58 s         2.59 s
1000            0.25         4.04 s     ±0.00%         4.04 s         4.04 s
10_000        0.0559        17.90 s     ±0.00%        17.90 s        17.90 s

Comparison:
100             0.39
1000            0.25 - 1.56x slower +1.46 s
10_000        0.0559 - 6.92x slower +15.31 s

Memory usage statistics:

Name      Memory usage
100          246.69 MB
1000         250.49 MB - 1.02x memory usage +3.80 MB
10_000       263.74 MB - 1.07x memory usage +17.05 MB

**All measurements for memory usage were the same**

Given that the training always exits after the 100th epoch, I'd expect comparable results; instead there is a remarkable slowdown when the max epochs is set to 10_000. Is this expected?

Thanks in advance, and please let me know if there is anything else I can do for you 🙇‍♂️

Best,
Nicolò

@seanmor5
Contributor

Hi @nickgnd, thanks for the report! This is not expected. Any chance you can run a profile on the training loop at different epoch counts with eprof?

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

@seanmor5 yup, sure, happy to help! I'll look into it and get back to you :)

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

Hey @seanmor5 👋
I used the profiler option in Benchee (which uses eprof by default).

Benchee.run(
  %{
    "100" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA) end,
    "1000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 1000, compiler: EXLA) end,
    "10_000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 10_000, compiler: EXLA) end
  },
  time: 1,
  memory_time: 2,
  profile_after: true
)

So I profiled the training loop for each epochs option (100, 1000, 10_000). I hope that's what you meant; let me know. :)
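
In case running eprof directly (outside Benchee) is closer to what you had in mind, a minimal sketch would look like the following, reusing the loop and train_batches defined above (the output file name is just an example):

# Sketch: profile a single training run with :eprof from the Erlang stdlib
:eprof.start()
:eprof.profile(fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA) end)
# Point the analysis at a file, then print/write it
:eprof.log(~c"eprof_100_epochs.txt")
:eprof.analyze()
:eprof.stop()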

You can find the results in this Google Sheet

And you can find the raw data returned from eprof in this folder

From what I can see, the time spent on garbage collection is way higher with 10_000 epochs compared to the other cases, while the number of calls is more or less the same 🤔

| # epochs | function                  | calls | % time | time (us) |
|----------|---------------------------|-------|--------|-----------|
| 100      | :erlang.garbage_collect/0 | 3600  | 0.04   | 1436      |
| 1000     | :erlang.garbage_collect/0 | 3636  | 0.07   | 2912      |
| 10_000   | :erlang.garbage_collect/0 | 3636  | 69.50  | 11226690  |

Hope this helps; let me know if there is anything else I can do 😉


Side note:

To convert the eprof results to CSV I used this small script (results is the raw eprof output captured as a string):

results
|> String.split("\n", trim: true)
|> Enum.map(fn
  "anonymous fn" <> _rest = row ->
    [anonymous_fn, rest] =
      String.split(row, ~r{anonymous fn/\d+ in .*/\d+}, trim: true, include_captures: true)
    [anonymous_fn] ++ String.split(rest, ~r/\s/, trim: true)
    
  row ->
    String.split(row, ~r/\s/, trim: true)
end)
|> Enum.map(& Enum.join(&1, ", "))
|> Enum.join("\n")
|> then(& File.write!("results.csv", &1))

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

Hey @seanmor5 👋

I took a look at Axon.Loop.run, searching for parts of the implementation that depend on the number of epochs.

I noticed that the final_metrics_map is fully initialized at the beginning of the training, and its size (number of keys) equals the number of epochs.

axon/lib/axon/loop.ex, lines 1643 to 1648 in f67d45e:

final_metrics_map =
  for i <- epoch_start..epoch_end do
    {i, Map.new(metric_fns, fn {k, _} -> {k, Nx.tensor(0)} end)}
  end
  |> Map.new()
  |> Map.merge(loop_state.metrics)
I did a quick spike and changed the code so that final_metrics_map is not initialized at the beginning but updated after each epoch, just to see if there were any improvements. If I re-run the same benchmark, the results are now comparable:

Name             ips        average  deviation         median         99th %
10_000          0.42         2.37 s     ±0.00%         2.37 s         2.37 s
100             0.42         2.37 s     ±0.00%         2.37 s         2.37 s
1000            0.42         2.37 s     ±0.00%         2.37 s         2.37 s

Comparison:
10_000          0.42
100             0.42 - 1.00x slower +0.00040 s
1000            0.42 - 1.00x slower +0.00360 s

Memory usage statistics:

Name      Memory usage
10_000       175.26 MB
100          173.53 MB - 0.99x memory usage -1.73569 MB
1000         175.26 MB - 1.00x memory usage +0 MB

This is just to highlight the possible issue; the linked PR doesn't propose a solution, since the system would still degrade as the epoch number grows :/

A couple of thoughts:

  • Currently, final_metrics_map is a map where the key is the epoch number and the value is the metrics map. It would be interesting to try a different data structure, maybe a list, prepending the latest metrics and reversing it at the very end. But that's just a naive idea that should be validated :)

  • Nevertheless, do you think it would make sense to not initialize the final_metrics_map at the beginning, but to populate it step by step and then fill in the missing epochs at the very end (see the sketch below)? This would mitigate the problem a bit, I think.
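
A minimal sketch of that second idea, with hypothetical names (this is not the actual Axon internals, just the shape of the change):

# Keep only the metrics gathered so far while the loop runs, and fill in the
# epochs that were never reached once the loop has finished.
metric_fns = %{"loss" => :placeholder}
zero_metrics = Map.new(metric_fns, fn {k, _} -> {k, Nx.tensor(0)} end)

fill_missing_epochs = fn metrics_so_far, epoch_start, epoch_end ->
  Enum.reduce(epoch_start..epoch_end, metrics_so_far, fn epoch, acc ->
    Map.put_new(acc, epoch, zero_metrics)
  end)
end

# e.g. if the loop halted after epoch 99 but epochs: 10_000 was requested:
# fill_missing_epochs.(metrics_gathered_so_far, 0, 9_999)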

Happy to help :)

Cheers ✌️

@josevalim
Contributor

Awesome find @nickgnd! I assume this PR also addresses the issue?

#428

If it does not, we definitely need to go with the second suggestion. :)

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

@josevalim nice! 🚀

I tried the same benchmark I posted above in my first message on top of your PR #428, and it is indeed way better:

Name             ips        average  deviation         median         99th %
100             0.43         2.30 s     ±0.00%         2.30 s         2.30 s
1000            0.41         2.41 s     ±0.00%         2.41 s         2.41 s
10_000          0.28         3.51 s     ±0.00%         3.51 s         3.51 s

Comparison:
100             0.43
1000            0.41 - 1.05x slower +0.114 s
10_000          0.28 - 1.53x slower +1.21 s

Memory usage statistics:

Name      Memory usage
100          174.68 MB
1000         176.51 MB - 1.01x memory usage +1.83 MB
10_000       177.26 MB - 1.01x memory usage +2.59 MB

In that benchmark the training always exits after the 100th epoch.
Therefore, I wanted to see how it behaves in a full run.

Here's a recap:

| Max epochs | # Epoch | Time (us) | x times slower than 1st epoch run |
|------------|---------|-----------|-----------------------------------|
| 100        | 1       | 20161     | -                                 |
| 100        | 99      | 20722     | 1.02                              |
| 1000       | 1       | 22891     | -                                 |
| 1000       | 99      | 25058     | 1.09                              |
| 1000       | 999     | 25874     | 1.13                              |
| 10000      | 1       | 29597     | -                                 |
| 10000      | 99      | 32655     | 1.10                              |
| 10000      | 999     | 38673     | 1.30                              |
| 10000      | 9999    | 103420    | 3.49                              |

I didn't consider the 1st epoch run (index 0) because it consistently takes longer than the 2nd one in all the scenarios (probably due to some initialization).

Here's the link to the Google Sheet with all the data.

> I assume this PR also addresses the issue?

Your changes definitely improve the performance and mitigate the issue. Still, given the same loop/model:

  1. the time to complete an epoch, even for the first ones, depends on the number of epochs
  2. and when the training runs for many epochs, performance degrades and the time to complete an epoch keeps growing as training progresses

Probably, not initializing the whole final_metrics_map at the beginning would partially address the 1st point, but I don't think it would address the 2nd point (the metrics accumulated so far would still grow with the epoch index, so later epochs would still pay for it).

But I also don't have enough experience to judge whether this ("many" epochs) is a common case, and I don't know how other ML libraries behave in similar scenarios (just to have a comparison); maybe it's even expected 🙈 So... your call ☎️

But I would love to hear your opinion about it :)


For completeness, I share below how I changed the snippet of code from my first message. Now the custom handler writes to a CSV file the time spent completing the latest epoch.

# Change the filename based on the scenario
{:ok, file} = File.open("./100_epochs_times.csv", [:append])

defmodule CustomEventHandler do
  def write(%Axon.Loop.State{epoch: epoch, times: times} = state, file) do
    epoch_n = Nx.to_number(epoch)
    IO.binwrite(file, "#{epoch_n}, #{Map.get(times, epoch_n)}\n")

    {:continue, state}
  end
end

loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam, log: -1)
  |> Axon.Loop.handle(:epoch_completed, &CustomEventHandler.write(&1, file))

# Change the epochs based on the scenario
Axon.Loop.run(loop, train_batches, %{},
  epochs: 100,
  compiler: EXLA,
  event: :iteration_completed,
  filter: :always
)

:ok = File.close(file)

My current setup:

  • macOS Monterey 12.6
  • erlang 25.1.2
  • elixir 1.14.2-otp-25
  • processor: 2.3 GHz Quad-Core Intel Core i7
  • memory: 32 GB
  • graphic card: Intel Iris Plus Graphics 1536 MB

@seanmor5
Contributor

@nickgnd I think this was resolved in #427 :)

@josevalim
Contributor

@nickgnd, can you please try this branch? #440

It should show no variation between epochs now.

@nickgnd
Contributor Author

nickgnd commented Jan 5, 2023

Hey @josevalim,
I quickly tried the same benchmark; you can find all the results in this Google Sheet.

| Max epochs | # Epoch | Time to complete epoch (us) |
|------------|---------|-----------------------------|
| 100        | 1       | 23023                       |
| 1000       | 1       | 21908                       |
| 10000      | 1       | 23961                       |
| 100        | 99      | 19839                       |
| 1000       | 99      | 19269                       |
| 10000      | 99      | 19609                       |
| 1000       | 999     | 25703                       |
| 10000      | 999     | 25432                       |
| 10000      | 9999    | 102406                      |

> It should show no variation between epochs now.

Yup, the time to complete an epoch no longer depends on the max epochs, yay! 🎉

Thanks for looking into that 🙇
Have a great day!

@nickgnd
Contributor Author

nickgnd commented Mar 3, 2023

Hey @josevalim and @seanmor5 👋

Out of curiosity, I re-ran the same benchmark with Nx v0.5, Axon v0.5, and EXLA v0.5, and I can tell you there was a huge performance boost 🚀.
In particular, the time to complete an epoch is now more "stable", in the sense that the time needed to complete the 1st epoch is comparable to the one needed to complete the 10000th epoch. Impressive job! 👏

Here's the updated recap:

| Max epochs | # Epoch | Time to complete epoch (us) |
|------------|---------|-----------------------------|
| 100        | 1       | 15264                       |
| 1000       | 1       | 13700                       |
| 10000      | 1       | 14823                       |
| 100        | 99      | 16949                       |
| 1000       | 99      | 13437                       |
| 10000      | 99      | 13548                       |
| 1000       | 999     | 11482                       |
| 10000      | 999     | 11608                       |
| 10000      | 9999    | 14482                       |

Cheers 👋


The snippet I used for the benchmark:
Mix.install(
  [
    {:exla, "~> 0.5"},
    {:nx, "~> 0.5"},
    {:axon, "~> 0.5"},
    {:benchee, "~> 1.1.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

# Generate the data (labels one-hot encoded)
inputs =
  Nx.iota({9000, 2}, type: :f32)
  |> Nx.divide(9000)
  |> Nx.subtract(0.5)
  |> Nx.shuffle()

labels =
  Enum.map(0..8999, fn _ -> Enum.random([0, 1]) end)
  |> Nx.tensor()
  |> Nx.reshape({:auto, 1})
  |> Nx.equal(Nx.tensor([0, 1]))

batch_size = 250
inputs_batches = Nx.to_batched(inputs, batch_size)
labels_batches = Nx.to_batched(labels, batch_size)
train_batches = Stream.zip(inputs_batches, labels_batches)

# Change the filename based on the scenario
{:ok, file} = File.open("./10000_epochs_axon_0_5.csv", [:append])

defmodule CustomEventHandler do
  def write(%Axon.Loop.State{epoch: epoch, times: times} = state, file) do
    epoch_n = Nx.to_number(epoch)
    IO.binwrite(file, "#{epoch_n}, #{Map.get(times, epoch_n)}\n")

    {:continue, state}
  end
end

loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam, log: -1)
  |> Axon.Loop.handle(:epoch_completed, &CustomEventHandler.write(&1, file))

# Change the epochs based on the scenario
Axon.Loop.run(loop, train_batches, %{}, epochs: 10000, compiler: EXLA)

:ok = File.close(file)
