Time to complete an epoch depends on the number of epochs #430

Closed
nickgnd opened this issue Dec 18, 2022 · 10 comments

@nickgnd
Contributor

nickgnd commented Dec 18, 2022

Hi 👋

this is the 2nd issue I'm opening in Axon in a short period of time. I hope I'm not misusing the medium; I'm learning ML, so there's a high chance I'm doing something funky. Please bear with me.

Again, thanks for all the work you are doing; it's great, and I'm enjoying learning ML with the Nx* libraries so far.


The issue

While playing with Axon I noticed that the training slows down when the number of epochs is increased. I'm not referring to the overall time, which is of course expected to grow, but to the time to complete a single epoch.

Here's a quick benchmark I assembled.

Mix.install(
  [
    {:exla, "~> 0.4"},
    {:nx, "~> 0.4"},
    {:axon, "~> 0.3.1"},
    {:benchee, "~> 1.1.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

# Generate the data

# 2 inputs
inputs =
  Nx.iota({9000, 2}, type: :f32)
  |> Nx.divide(9000)
  |> Nx.subtract(0.5)
  |> Nx.shuffle()

# one-hot encode the labels
labels =
  Enum.map(0..8999, fn _ -> Enum.random([0, 1]) end)
  |> Nx.tensor()
  |> Nx.reshape({:auto, 1})
  |> Nx.equal(Nx.tensor([0, 1]))

# split the dataset in batches
batch_size = 250
inputs_batches = Nx.to_batched(inputs, batch_size)
labels_batches = Nx.to_batched(labels, batch_size)
train_batches = Stream.zip(inputs_batches, labels_batches)

defmodule CustomEventHandler do
  def maybe_exit(%Axon.Loop.State{epoch: epoch} = state) do
    if epoch > 100 do
      IO.puts("Early exit!")
      {:halt_loop, state}
    else
      {:continue, state}
    end
  end
end

# Create a loop that early exits after the 100th epoch
loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam)
  |> Axon.Loop.handle(:epoch_started, &CustomEventHandler.maybe_exit/1)

Benchee.run(
  %{
    "100" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA) end,
    "1000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 1000, compiler: EXLA) end,
    "10_000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 10_000, compiler: EXLA) end
  },
  time: 4,
  memory_time: 2
)

And here are the results:

Name             ips        average  deviation         median         99th %
100             0.39         2.58 s     ±0.32%         2.58 s         2.59 s
1000            0.25         4.04 s     ±0.00%         4.04 s         4.04 s
10_000        0.0559        17.90 s     ±0.00%        17.90 s        17.90 s

Comparison:
100             0.39
1000            0.25 - 1.56x slower +1.46 s
10_000        0.0559 - 6.92x slower +15.31 s

Memory usage statistics:

Name      Memory usage
100          246.69 MB
1000         250.49 MB - 1.02x memory usage +3.80 MB
10_000       263.74 MB - 1.07x memory usage +17.05 MB

**All measurements for memory usage were the same**

Given that the training always exits after the 100th epoch, I'd expect comparable results; instead there is a remarkable slowdown when the max epochs is set to 10_000. Is this expected?

Thanks in advance, and please let me know if there is anything else I can do for you 🙇‍♂️

Best,
Nicolò

@seanmor5
Contributor

Hi @nickgnd, thanks for the report! This is not expected. Any chance you can run a profile on the training loop at different epoch counts with eprof?

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

@seanmor5 yup, sure, happy to help! I'll look into it and get back to you :)

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

Hey @seanmor5 👋
I used the profiler option in Benchee (which uses eprof by default).

Benchee.run(
  %{
    "100" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA) end,
    "1000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 1000, compiler: EXLA) end,
    "10_000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 10_000, compiler: EXLA) end
  },
  time: 1,
  memory_time: 2,
  profile_after: true
)

So I profiled the training loop for each epochs option (100, 1000, 10_000). I hope that's what you meant; let me know. :)
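
In case running eprof directly (outside Benchee) is closer to what you had in mind, a minimal sketch would look like the following, reusing the loop and train_batches defined above (the output file name is just an example):

# Sketch: profile a single training run with :eprof from the Erlang stdlib
:eprof.start()
:eprof.profile(fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA) end)
# Point the analysis at a file, then print/write it
:eprof.log(~c"eprof_100_epochs.txt")
:eprof.analyze()
:eprof.stop()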

You can find the results in this Google Sheet

And you can find the raw data returned from eprof in this folder

From what I can see, the time spent on garbage collection is way higher with 10_000 epochs compared to the other cases, while the number of calls is more or less the same 🤔

| # epochs | function                  | calls | % time | time (us) |
|----------|---------------------------|-------|--------|-----------|
| 100      | :erlang.garbage_collect/0 | 3600  | 0.04   | 1436      |
| 1000     | :erlang.garbage_collect/0 | 3636  | 0.07   | 2912      |
| 10_000   | :erlang.garbage_collect/0 | 3636  | 69.50  | 11226690  |

Hope this helps; let me know if there is anything else I can do 😉


Side note:

To convert the eprof results to CSV I used this small script (results is the raw eprof output captured as a string):

results
|> String.split("\n", trim: true)
|> Enum.map(fn
  "anonymous fn" <> _rest = row ->
    [anonymous_fn, rest] =
      String.split(row, ~r{anonymous fn/\d+ in .*/\d+}, trim: true, include_captures: true)
    [anonymous_fn] ++ String.split(rest, ~r/\s/, trim: true)
    
  row ->
    String.split(row, ~r/\s/, trim: true)
end)
|> Enum.map(& Enum.join(&1, ", "))
|> Enum.join("\n")
|> then(& File.write!("results.csv", &1))

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

Hey @seanmor5 👋

I took a look at Axon.Loop.run, searching for parts of the implementation that depend on the number of epochs.

I noticed that the final_metrics_map is fully initialized at the beginning of the training, and its size (number of keys) equals the number of epochs.

axon/lib/axon/loop.ex, lines 1643 to 1648 in f67d45e:

final_metrics_map =
  for i <- epoch_start..epoch_end do
    {i, Map.new(metric_fns, fn {k, _} -> {k, Nx.tensor(0)} end)}
  end
  |> Map.new()
  |> Map.merge(loop_state.metrics)
I did a quick spike and changed the code so that final_metrics_map is not initialized at the beginning but updated after each epoch, just to see if there were any improvements. If I re-run the same benchmark, the results are now comparable:

Name             ips        average  deviation         median         99th %
10_000          0.42         2.37 s     ±0.00%         2.37 s         2.37 s
100             0.42         2.37 s     ±0.00%         2.37 s         2.37 s
1000            0.42         2.37 s     ±0.00%         2.37 s         2.37 s

Comparison:
10_000          0.42
100             0.42 - 1.00x slower +0.00040 s
1000            0.42 - 1.00x slower +0.00360 s

Memory usage statistics:

Name      Memory usage
10_000       175.26 MB
100          173.53 MB - 0.99x memory usage -1.73569 MB
1000         175.26 MB - 1.00x memory usage +0 MB

This is just to highlight the possible issue; the linked PR doesn't propose a solution, since the system would still degrade as the epoch number grows :/

A couple of thoughts:

  • Currently, final_metrics_map is a map where the key is the epoch number and the value is the metrics map. It would be interesting to try a different data structure, maybe a list, prepending the latest metrics and reversing it at the very end. But that's just a naive idea that should be validated :)

  • Nevertheless, do you think it would make sense to not initialize the final_metrics_map at the beginning, but to populate it step by step and then fill in the missing epochs at the very end (see the sketch below)? This would mitigate the problem a bit, I think.
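
A minimal sketch of that second idea, with hypothetical names (this is not the actual Axon internals, just the shape of the change):

# Keep only the metrics gathered so far while the loop runs, and fill in the
# epochs that were never reached once the loop has finished.
metric_fns = %{"loss" => :placeholder}
zero_metrics = Map.new(metric_fns, fn {k, _} -> {k, Nx.tensor(0)} end)

fill_missing_epochs = fn metrics_so_far, epoch_start, epoch_end ->
  Enum.reduce(epoch_start..epoch_end, metrics_so_far, fn epoch, acc ->
    Map.put_new(acc, epoch, zero_metrics)
  end)
end

# e.g. if the loop halted after epoch 99 but epochs: 10_000 was requested:
# fill_missing_epochs.(metrics_gathered_so_far, 0, 9_999)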

Happy to help :)

Cheers ✌️

@josevalim
Contributor

Awesome find @nickgnd! I assume this PR also addresses the issue?

#428

If it does not, we definitely need to go with the second suggestion. :)

@nickgnd
Contributor Author

nickgnd commented Dec 19, 2022

@josevalim nice! 🚀

I tried the same benchmark I posted above in my first message on top of your PR #428, and it is indeed way better:

Name             ips        average  deviation         median         99th %
100             0.43         2.30 s     ±0.00%         2.30 s         2.30 s
1000            0.41         2.41 s     ±0.00%         2.41 s         2.41 s
10_000          0.28         3.51 s     ±0.00%         3.51 s         3.51 s

Comparison:
100             0.43
1000            0.41 - 1.05x slower +0.114 s
10_000          0.28 - 1.53x slower +1.21 s

Memory usage statistics:

Name      Memory usage
100          174.68 MB
1000         176.51 MB - 1.01x memory usage +1.83 MB
10_000       177.26 MB - 1.01x memory usage +2.59 MB

In that benchmark the training always exits after the 100th epoch.
Therefore, I wanted to see how it behaves in a full run.

Here's a recap:

| Max epochs | # Epoch | Time (us) | x times slower than 1st epoch run |
|------------|---------|-----------|-----------------------------------|
| 100        | 1       | 20161     | -                                 |
| 100        | 99      | 20722     | 1.02                              |
| 1000       | 1       | 22891     | -                                 |
| 1000       | 99      | 25058     | 1.09                              |
| 1000       | 999     | 25874     | 1.13                              |
| 10000      | 1       | 29597     | -                                 |
| 10000      | 99      | 32655     | 1.10                              |
| 10000      | 999     | 38673     | 1.30                              |
| 10000      | 9999    | 103420    | 3.49                              |

I didn't consider the 1st epoch run (index 0) because it consistently takes longer than the 2nd one in all the scenarios (probably due to some initialization).

Here's the link to the Google Sheet with all the data.

> I assume this PR also addresses the issue?

Your changes definitely improve the performance and mitigate the issue. Still, given the same loop/model:

  1. the time to complete an epoch, even for the first ones, depends on the number of epochs
  2. and when the training runs for many epochs, performance degrades and the time to complete an epoch keeps growing as training progresses

Probably, not initializing the whole final_metrics_map at the beginning would partially address the 1st point, but I don't think it would address the 2nd point (the metrics accumulated so far would still grow with the epoch index, so later epochs would still pay for it).

But I also don't have enough experience to judge whether this ("many" epochs) is a common case, and I don't know how other ML libraries behave in similar scenarios (just to have a comparison); maybe it's even expected 🙈 So... your call ☎️

But I would love to hear your opinion about it :)


For completeness, I share below how I changed the snippet of code from my first message. Now the custom handler writes to a CSV file the time spent completing the latest epoch.

# Change the filename based on the scenario
{:ok, file} = File.open("./100_epochs_times.csv", [:append])

defmodule CustomEventHandler do
  def write(%Axon.Loop.State{epoch: epoch, times: times} = state, file) do
    epoch_n = Nx.to_number(epoch)
    IO.binwrite(file, "#{epoch_n}, #{Map.get(times, epoch_n)}\n")

    {:continue, state}
  end
end

loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam, log: -1)
  |> Axon.Loop.handle(:epoch_completed, &CustomEventHandler.write(&1, file))

# Change the epochs based on the scenario
Axon.Loop.run(loop, train_batches, %{},
  epochs: 100,
  compiler: EXLA,
  event: :iteration_completed,
  filter: :always
)

:ok = File.close(file)

My current setup:

  • macOS Monterey 12.6
  • erlang 25.1.2
  • elixir 1.14.2-otp-25
  • processor: 2.3 GHz Quad-Core Intel Core i7
  • memory: 32 GB
  • graphic card: Intel Iris Plus Graphics 1536 MB

@seanmor5
Contributor

@nickgnd I think this was resolved in #427 :)

@josevalim
Contributor

@nickgnd, can you please try this branch? #440

It should show no variation between epochs now.

@nickgnd
Contributor Author

nickgnd commented Jan 5, 2023

Hey @josevalim,
I quickly tried the same benchmark; you can find all the results in this Google Sheet.

| Max epochs | # Epoch | Time to complete epoch (us) |
|------------|---------|-----------------------------|
| 100        | 1       | 23023                       |
| 1000       | 1       | 21908                       |
| 10000      | 1       | 23961                       |
| 100        | 99      | 19839                       |
| 1000       | 99      | 19269                       |
| 10000      | 99      | 19609                       |
| 1000       | 999     | 25703                       |
| 10000      | 999     | 25432                       |
| 10000      | 9999    | 102406                      |

> It should show no variation between epochs now.

Yup, the time to complete an epoch no longer depends on the max epochs, yay! 🎉

Thanks for looking into that 🙇
Have a great day!

@nickgnd
Contributor Author

nickgnd commented Mar 3, 2023

Hey @josevalim and @seanmor5 👋

Out of curiosity, I re-ran the same benchmark with Nx v0.5, Axon v0.5, and EXLA v0.5, and I can tell you there was a huge performance boost 🚀.
In particular, the time to complete an epoch is now more "stable", in the sense that the time needed to complete the 1st epoch is comparable to the one needed to complete the 10000th epoch. Impressive job! 👏

Here's the updated recap:

| Max epochs | # Epoch | Time to complete epoch (us) |
|------------|---------|-----------------------------|
| 100        | 1       | 15264                       |
| 1000       | 1       | 13700                       |
| 10000      | 1       | 14823                       |
| 100        | 99      | 16949                       |
| 1000       | 99      | 13437                       |
| 10000      | 99      | 13548                       |
| 1000       | 999     | 11482                       |
| 10000      | 999     | 11608                       |
| 10000      | 9999    | 14482                       |

Cheers 👋


The snippet I used for the benchmark:
Mix.install(
  [
    {:exla, "~> 0.5"},
    {:nx, "~> 0.5"},
    {:axon, "~> 0.5"},
    {:benchee, "~> 1.1.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

# Generate the data (labels one-hot encoded)
inputs =
  Nx.iota({9000, 2}, type: :f32)
  |> Nx.divide(9000)
  |> Nx.subtract(0.5)
  |> Nx.shuffle()

labels =
  Enum.map(0..8999, fn _ -> Enum.random([0, 1]) end)
  |> Nx.tensor()
  |> Nx.reshape({:auto, 1})
  |> Nx.equal(Nx.tensor([0, 1]))

batch_size = 250
inputs_batches = Nx.to_batched(inputs, batch_size)
labels_batches = Nx.to_batched(labels, batch_size)
train_batches = Stream.zip(inputs_batches, labels_batches)

# Change the filename based on the scenario
{:ok, file} = File.open("./10000_epochs_axon_0_5.csv", [:append])

defmodule CustomEventHandler do
  def write(%Axon.Loop.State{epoch: epoch, times: times} = state, file) do
    epoch_n = Nx.to_number(epoch)
    IO.binwrite(file, "#{epoch_n}, #{Map.get(times, epoch_n)}\n")

    {:continue, state}
  end
end

loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam, log: -1)
  |> Axon.Loop.handle(:epoch_completed, &CustomEventHandler.write(&1, file))

# Change the epochs based on the scenario
Axon.Loop.run(loop, train_batches, %{}, epochs: 10000, compiler: EXLA)

:ok = File.close(file)
