Time to complete an epoch depends on the number of epochs #430
Hi @nickgnd, thanks for the report! This is not expected. Any chance you can run a profile on the training loop at different epoch counts with eprof?
@seanmor5 yup, sure, happy to help! I'll look into it and get back to you :)
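For reference, profiling a single run directly with Erlang's `:eprof` (rather than through Benchee's `profile_after`) could look like the sketch below; `loop` and `train_batches` are assumed to be defined as in the benchmark that follows.

```elixir
# Minimal sketch: profile one training run with :eprof and print the analysis.
# Assumes `loop` and `train_batches` are built as in the benchmark below.
:eprof.start()

:eprof.profile(fn ->
  Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA)
end)

:eprof.analyze()
:eprof.stop()
```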
Hey @seanmor5 👋

I profiled the training loop for each different number of epochs with the following benchmark:

```elixir
Benchee.run(
  %{
    "100" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 100, compiler: EXLA) end,
    "1000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 1000, compiler: EXLA) end,
    "10_000" => fn -> Axon.Loop.run(loop, train_batches, %{}, epochs: 10_000, compiler: EXLA) end
  },
  time: 1,
  memory_time: 2,
  profile_after: true
)
```

You can find the results in this Google Sheet, and the raw data returned from eprof in this folder. From what I can see, the time spent on garbage collection is way higher with 10_000 epochs compared to the other cases, while the number of calls is more or less the same 🤔
Hope this helps, let me know if there is anything else I can do 😉

Side note: to convert the eprof results to CSV I used this small script:

```elixir
results
|> String.split("\n", trim: true)
|> Enum.map(fn
  "anonymous fn" <> _rest = row ->
    [anonymous_fn, rest] =
      String.split(row, ~r{anonymous fn/\d+ in .*/\d+}, trim: true, include_captures: true)

    [anonymous_fn] ++ String.split(rest, ~r/\s/, trim: true)

  row ->
    String.split(row, ~r/\s/, trim: true)
end)
|> Enum.map(&Enum.join(&1, ", "))
|> Enum.join("\n")
|> then(&File.write!("results.csv", &1))
```
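As a rough cross-check of the garbage-collection numbers reported by eprof, one could also compare the VM-wide GC counters around a run; a minimal sketch, again assuming the same `loop` and `train_batches`:

```elixir
# Snapshot VM-wide GC statistics before and after one training run
# and print how much GC work the run triggered.
{gcs_before, words_before, _} = :erlang.statistics(:garbage_collection)

Axon.Loop.run(loop, train_batches, %{}, epochs: 10_000, compiler: EXLA)

{gcs_after, words_after, _} = :erlang.statistics(:garbage_collection)

IO.puts("GC runs: #{gcs_after - gcs_before}, words reclaimed: #{words_after - words_before}")
```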
Hey @seanmor5 👋 I took a look, and I noticed the snippet at lines 1643 to 1648 in f67d45e.
I did a quick spike and updated the code so that it doesn't do that initialization.
This is just to highlight the possible issue; the linked PR doesn't propose an actual solution, since the system would still degrade as the number of epochs goes up :/ A couple of thoughts:
Happy to help :) Cheers ✌️
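To illustrate the general pattern under discussion (a hypothetical sketch, not the actual Axon code at f67d45e): if per-epoch bookkeeping is pre-built for every epoch before training starts, the amount of allocated and GC-tracked data grows with `max_epochs`, whereas accumulating it lazily keeps each epoch's cost roughly constant.

```elixir
# Hypothetical illustration only; names and structure are not Axon's.
defmodule EpochStateSketch do
  # Eager: allocate an entry for every epoch up front, so memory and GC work
  # scale with max_epochs before a single epoch has even run.
  def run_eager(max_epochs, epoch_fun) do
    initial = Map.new(0..(max_epochs - 1), fn epoch -> {epoch, nil} end)

    Enum.reduce(0..(max_epochs - 1), initial, fn epoch, acc ->
      Map.put(acc, epoch, epoch_fun.(epoch))
    end)
  end

  # Lazy: only record an entry once the epoch has actually completed.
  def run_lazy(max_epochs, epoch_fun) do
    Enum.reduce(0..(max_epochs - 1), %{}, fn epoch, acc ->
      Map.put(acc, epoch, epoch_fun.(epoch))
    end)
  end
end
```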
@josevalim nice! 🚀 I tried the same benchmark I posted above in my first message on top of your PR #428, and indeed it is way better.
In that benchmark the training always exits after the 100th epoch. Here's a recap:
I didn't consider the 1st epoch run (index 0) because it consistently takes longer than the 2nd one in all the scenarios (probably due to some initialization). Here's the link to the Google Sheet with all the data.
Your changes definitely improve the performance and mitigate the issue. Still, given the same loop/model:
Probably, not initializing the whole thing upfront would help. But I also don't have enough experience to judge whether this ("many" epochs) is a common case, and I don't know how other ML libraries behave in similar scenarios (just to have a comparison), maybe it is even expected 🙈 So... your call ☎️ But I would love to hear your opinion about it :)

For completeness, I share below how I changed the snippet of code that I initially shared in my first message. Now the custom handler writes to a CSV file the time spent completing the last epoch.

```elixir
# Change the filename based on the scenario
{:ok, file} = File.open("./100_epochs_times.csv", [:append])

defmodule CustomEventHandler do
  def write(%Axon.Loop.State{epoch: epoch, times: times} = state, file) do
    epoch_n = Nx.to_number(epoch)
    IO.binwrite(file, "#{epoch_n}, #{Map.get(times, epoch_n)}\n")
    {:continue, state}
  end
end

loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam, log: -1)
  |> Axon.Loop.handle(:epoch_completed, &CustomEventHandler.write(&1, file))

# Change the epochs based on the scenario
Axon.Loop.run(loop, train_batches, %{},
  epochs: 100,
  compiler: EXLA,
  event: :iteration_completed,
  filter: :always
)

:ok = File.close(file)
```

My current setup:
Hey @josevalim
Yup, the time to complete an epoch no longer depends on the max number of epochs, yay! 🎉 Thanks for looking into that 🙇
Hey @josevalim and @seanmor5 👋 Out of curiosity, I re-ran the same benchmark with Nx v0.5, Axon v0.5 and EXLA v0.5, and I can tell you that there was a huge performance boost 🚀. Here's the updated recap:
Cheers 👋

Expand to see the snippet I used for the benchmark:

```elixir
Mix.install(
  [
    {:exla, "~> 0.5"},
    {:nx, "~> 0.5"},
    {:axon, "~> 0.5"},
    {:benchee, "~> 1.1.0"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

# Generate the data (labels one-hot encoded)
inputs =
  Nx.iota({9000, 2}, type: :f32)
  |> Nx.divide(9000)
  |> Nx.subtract(0.5)
  |> Nx.shuffle()

labels =
  Enum.map(0..8999, fn _ -> Enum.random([0, 1]) end)
  |> Nx.tensor()
  |> Nx.reshape({:auto, 1})
  |> Nx.equal(Nx.tensor([0, 1]))

batch_size = 250
inputs_batches = Nx.to_batched(inputs, batch_size)
labels_batches = Nx.to_batched(labels, batch_size)
train_batches = Stream.zip(inputs_batches, labels_batches)

# Change the filename based on the scenario
{:ok, file} = File.open("./10000_epochs_axon_0_5.csv", [:append])

defmodule CustomEventHandler do
  def write(%Axon.Loop.State{epoch: epoch, times: times} = state, file) do
    epoch_n = Nx.to_number(epoch)
    IO.binwrite(file, "#{epoch_n}, #{Map.get(times, epoch_n)}\n")
    {:continue, state}
  end
end

loop =
  Axon.input("data")
  |> Axon.dense(100, activation: :sigmoid)
  |> Axon.dense(2, activation: :softmax)
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam, log: -1)
  |> Axon.Loop.handle(:epoch_completed, &CustomEventHandler.write(&1, file))

# Change the epochs based on the scenario
Axon.Loop.run(loop, train_batches, %{}, epochs: 10000, compiler: EXLA)

:ok = File.close(file)
```
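As a follow-up, the per-epoch times that `CustomEventHandler.write/2` appends to the CSV can be summarised afterwards. Here is a small sketch; the file name matches the one above, and epoch 0 is skipped since, as noted earlier, it includes one-off initialization work.

```elixir
# Read the "epoch, time" lines written by the handler and print the mean
# epoch time, skipping epoch 0 (it includes one-off initialization work).
times =
  "./10000_epochs_axon_0_5.csv"
  |> File.read!()
  |> String.split("\n", trim: true)
  |> Enum.map(fn line ->
    [epoch, time] = String.split(line, ",", parts: 2)
    {String.to_integer(String.trim(epoch)), elem(Float.parse(String.trim(time)), 0)}
  end)
  |> Enum.reject(fn {epoch, _time} -> epoch == 0 end)
  |> Enum.map(fn {_epoch, time} -> time end)

IO.puts("mean epoch time: #{Enum.sum(times) / length(times)}")
```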
Hi 👋
This is the 2nd issue I'm opening in Axon in a short period of time. I hope I'm not misusing the medium; I'm learning ML, so there's a high chance that I'm doing something funky. Please bear with me.
Again, thanks for all the work you are doing, it's great, and I'm enjoying learning ML with the Nx* libraries so far.
The issue
While playing with Axon I noticed that the training slows down when increasing the number of epochs. I'm not referring to the overall time, which is of course expected, but to the time to complete a single epoch.
Here's a quick benchmark that I assembled.
And here are the results:
Given that the training always exits after the 100th epoch, I'd expect comparable results; instead there is a remarkable bump-up when the max epoch is set to 10000. Is this expected?
Thanks in advance, and please let me know if there is anything else I can do for you 🙇♂️
Best,
Nicolò