[BUG]: EXCEPTION_ACCESS_VIOLATION during garbage collection in PySR #661
Comments
Can you try with |
Regrettably. I tried |
Could automatically setting |
Hm, Can you show the rest of your code? |
from pysr import PySRRegressor

# data load code
X_123e = data_X_123e.to_numpy()
y_123e = data_y_123e.to_numpy()

sr_model = PySRRegressor(
    binary_operators=[
        "*",
        "+",
        "-",
        "/",
    ],
    unary_operators=["square", "cube", "exp", "log", "sqrt"],
    maxsize=80,
    maxdepth=10,
    niterations=100,
    populations=32,
    population_size=100,
    ncycles_per_iteration=550,
    constraints={
        "/": (-1, 9),
        "^": (-1, 5),
        "exp": 6,
        "square": 6,
        "cube": 6,
        "log": 6,
        "sqrt": 6,
        "abs": 9,
    },
    nested_constraints={
        "square": {"square": 0, "cube": 0, "exp": 1},
        "cube": {"square": 0, "cube": 0, "exp": 1},
        "exp": {"square": 0, "cube": 0, "exp": 0},
        "sqrt": {"sqrt": 0, "log": 0},
        "log": {"log": 0},
    },
    complexity_of_operators={
        "square": 2,
        "cube": 3,
        "exp": 3,
        "log": 3,
        "sqrt": 2,
    },
    complexity_of_constants=4,
    adaptive_parsimony_scaling=150.0,
    weight_add_node=0.79,
    weight_insert_node=5.1,
    weight_delete_node=1.7,
    weight_do_nothing=0.21,
    weight_mutate_constant=0.048,
    weight_mutate_operator=0.47,
    weight_swap_operands=0.1,
    weight_randomize=0.23,
    weight_simplify=0.5,
    weight_optimize=0.5,
    crossover_probability=0.066,
    perturbation_factor=0.076,
    cluster_manager=None,
    precision=32,
    turbo=True,
    bumper=True,
    progress=True,
    elementwise_loss="""
    function loss_fnc(prediction, target)
        percentage_error = abs((prediction - target) / target) * 100
        return percentage_error
    end
    """,
    multithreading=False,
    equation_file=symbol_regression_csv_path,
)
complexity_of_variables = []  # list of complexity
sr_model.fit(
    X_123e, y_123e, complexity_of_variables=complexity_of_variables
)
Here is the main code of the workflow. |
At the same time, I will put the above code in a multi-layer loop to test different feature data sets and the stability of the symbolic regression results. A single loop takes about 2.2 minutes. The program crashes after running for 3-4 hours, running about 80-110 rounds. |
That looks good. Great to see all those options being used! 🙂 (Random comment: your element wise loss divides by the target, so make sure the target > 0, otherwise one target will dominate. But I’m assuming you’re aware of that!) Other comment: can you try with You can also set But it’s curious that it crashes. Since it runs for a few hours, did you notice anything else happening, like the memory usage gradually increasing over that time and not going down? |
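The remark about the loss dividing by the target can be checked before fitting. The following is a minimal sketch in plain Python, not PySR code: `percentage_error` mirrors the Julia loss from the post, and `check_targets_positive` is a hypothetical helper name introduced here for illustration.

```python
def percentage_error(prediction, target):
    """Python mirror of the Julia loss in the post: |(p - t) / t| * 100."""
    return abs((prediction - target) / target) * 100


def check_targets_positive(y):
    """Hypothetical pre-fit check: raise if any target is zero or negative."""
    bad = [i for i, value in enumerate(y) if not value > 0]
    if bad:
        raise ValueError(f"{len(bad)} target(s) are not > 0, first at index {bad[0]}")


# The same absolute error of 1.0 weighs very differently depending on the target:
large_target = percentage_error(101.0, 100.0)  # roughly 1 percent
small_target = percentage_error(1.01, 0.01)    # roughly 10000 percent
```

A target close to zero inflates the loss by orders of magnitude, which is why a single such row can dominate the search.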
If I use multithreading instead of multiprocessing, the calculation speed will drop from 30it/s to 7it/s on my device, which is a bit unacceptable to me. In addition, I have made sure that my y_true values are all greater than 0. And the memory usage does not fluctuate when the program crashes, occupying only 30% of the total memory. |
Maybe try
import os
os.environ["PYTHON_JULIACALL_THREADS"] = str(num_cores * 2)  # environment values must be strings
Where The default behavior of PySR is to start Julia with The full list of available juliacall environment variables is here: https://juliapy.github.io/PythonCall.jl/stable/juliacall/#julia-config |
I tried:
import os
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
# or
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
os.environ["PYTHON_JULIACALL_PROCS"] = "64"
But it did not improve the calculation speed: processor usage was only 20-30%. I am using a 24-core/32-thread 14900K processor. |
To confirm, this was before importing PySR right? As a test, if you set it to 1, the CPU usage should only be 1 core. Also note that the |
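The ordering question above (set the variable before importing PySR) can be made explicit in the script. This is a rough sketch, assuming a hypothetical helper name `set_julia_threads`; the only real names used are `PYTHON_JULIACALL_THREADS` from this thread and the `juliacall` module that PySR loads.

```python
import os
import sys


def set_julia_threads(n_threads):
    """Set the juliacall thread count, refusing if juliacall is already loaded.

    Hypothetical helper: the variable is only read when Julia starts, so
    setting it after `import pysr` or `import juliacall` has no effect.
    """
    if "juliacall" in sys.modules:
        raise RuntimeError("juliacall is already imported; set the variable earlier")
    # Environment values must be strings, so convert explicitly.
    os.environ["PYTHON_JULIACALL_THREADS"] = str(n_threads)
    return os.environ["PYTHON_JULIACALL_THREADS"]


# Test suggested above: with one thread, CPU usage should stay at about one core.
value = set_julia_threads(1)
# from pysr import PySRRegressor  # import only after the variable is set
```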
I hit a similar problem after giving up on Windows and moving to Ubuntu 24.04 LTS. I also used a tool (tm5) to test the memory. After testing for 1 hour, there was no error and the temperature was stable at 45℃. It doesn't seem to be a hardware problem. This problem is so strange.
Traceback (most recent call last):
File "/home/zc/Documents/GitHub/MLPIP/notebooks/TC/S2_symbol_regression/S202_sr_123e.py", line 192, in <module>
sr_model.fit(
File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 2088, in fit
self._run(X, y, runtime_params, weights=weights, seed=seed)
File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 1890, in _run
out = SymbolicRegression.equation_search(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zc/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl", line 223, in __call__
return self._jl_callmethod($(pyjl_methodnum(pyjlany_call)), args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
[1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
@ Base ./stream.jl:410
[2] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base ./stream.jl:949
[3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base ./stream.jl:955
[4] unsafe_read
@ ./io.jl:774 [inlined]
[5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base ./io.jl:773
[6] read!
@ ./io.jl:775 [inlined]
[7] deserialize_hdr_raw
@ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
[8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
[9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
[10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
juliacall.JuliaError: TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:352 [inlined]
[2] fetch
@ ./task.jl:372 [inlined]
[3] _main_search_loop!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:882
[4] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, saved_state::Nothing)
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:599
[5] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Nothing, runtests::Bool, saved_state::Nothing, return_state::Bool, verbosity::Int64, progress::Bool, v_dim_out::Val{1})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:571
[6] equation_search
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:449 [inlined]
[7] #equation_search#26
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:412 [inlined]
[8] equation_search
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:360 [inlined]
[9] #equation_search#28
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:442 [inlined]
[10] pyjlany_call(self::typeof(equation_search), args_::Py, kwargs_::Py)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:36
[11] _pyjl_callmethod(f::Any, self_::Ptr{PythonCall.C.PyObject}, args_::Ptr{PythonCall.C.PyObject}, nargs::Int64)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/base.jl:72
[12] _pyjl_callmethod(o::Ptr{PythonCall.C.PyObject}, args::Ptr{PythonCall.C.PyObject})
@ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/C.jl:63
nested task error: Distributed.ProcessExitedException(423)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:931
[2] wait()
@ Base ./task.jl:995
[3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
@ Base ./condition.jl:130
[4] wait
@ ./condition.jl:125 [inlined]
[5] take_buffered(c::Channel{Any})
@ Base ./channels.jl:477
[6] take!(c::Channel{Any})
@ Base ./channels.jl:471
[7] take!(::Distributed.RemoteValue)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:726
[8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID; kwargs::@Kwargs{})
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:461
[9] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
[10] remotecall_fetch
@ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
[11] call_on_owner
@ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:565 [inlined]
[12] fetch(r::Distributed.Future)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:619
[13] (::SymbolicRegression.var"#67#72"{SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, Int64, Int64})()
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:984 |
Just to confirm, there is no crash now? Just that this message is printed? I see this message sometimes during testing. So far, it has seemed to be harmless, and has never caused a crash – it simply indicates that one of the worker processes has exited, due to the search returning, and the However, if this is what is causing the error, perhaps it is not harmless, and we should close the asynchronous |
I do think it would be better if there was a way to get multithreading to be faster, by increasing |
This message appears when the search process reaches about 30%, and then the search process stops. I can try to reproduce it again to see if it crashes. Also, does using the slurm backend help avoid this problem? |
Thanks. So if this reproduces on ubuntu, it seems like a deeper issue. Can you share your data so that I can reproduce it on my machine? If there is some script I can run which reproduces the error exactly on my computer it will be easier to help debug it. Also, the more minimal the code, the easier it will be for me to debug it. So perhaps try (1) reducing the dataset size, (2) creating conditions that cause the error to occur earlier during training, (3) using fewer parameters of PySR. I guess this might be hard to make a smaller MWE but (2) would be most useful. The Slurm backend is only if you’re using a Slurm computing cluster, but won’t be available otherwise. |
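For suggestion (1), reducing the dataset size, a reproducible subsample keeps the smaller test case stable between runs. This is only a sketch using the standard library; `subsample_rows` is a hypothetical helper, not something PySR provides.

```python
import random


def subsample_rows(X, y, n_rows, seed=0):
    """Return a reproducible random subset of matching rows from X and y."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    n_rows = min(n_rows, len(X))
    rng = random.Random(seed)  # fixed seed so the same rows are picked each run
    keep = sorted(rng.sample(range(len(X)), n_rows))
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data standing in for the real feature matrix and targets.
X_full = [[float(i), float(i) * 2.0] for i in range(100)]
y_full = [float(i) + 1.0 for i in range(100)]
X_small, y_small = subsample_rows(X_full, y_full, n_rows=10, seed=42)
```

The function returns plain lists, so with NumPy inputs the results would need converting back to arrays before being passed to `fit`.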
I have confirmed this point. If I use os.environ["PYTHON_JULIACALL_THREADS"] = "1", it will warn Warning: You are using multithreading mode, but only one thread is available. Try starting julia with |
Thank you very much. I need to apply for the relevant code and data to be provided. In addition, I have an Ubuntu 20 server running a single-node slurm. In the preliminary test, the calculation speed is consistent with multi-process. I can test on that device to confirm whether it is a device problem. |
I have confirmed that this prompt will cause the search process to be interrupted. I temporarily bypassed the crash by using |
I think I have found a temporary workaround: manually end the julia processes after each search.
import time, os
time.sleep(10)
os.system("killall julia") |
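The workaround above can be wrapped so that it reports whether anything was actually killed. This is a hedged sketch, not a recommended fix: `kill_stray_julia` is a hypothetical name, and the `runner` and `which` parameters exist only so the function can be exercised without touching real processes.

```python
import shutil
import subprocess
import time


def kill_stray_julia(grace_seconds=10.0, runner=subprocess.run, which=shutil.which):
    """Wait, then ask `killall` to end leftover julia processes.

    Returns True if killall exited with status 0 (at least one process was
    signalled), and False if killall is unavailable or found nothing.
    """
    time.sleep(grace_seconds)
    if which("killall") is None:
        return False  # e.g. on Windows, where killall does not exist
    result = runner(["killall", "julia"], check=False, capture_output=True)
    return result.returncode == 0


# Intended use, after each search in the outer loop:
# sr_model.fit(X_123e, y_123e)
# kill_stray_julia()
```

Note that `killall julia` ends every process named julia that the user is allowed to signal, including unrelated Julia sessions on the same machine.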
Thanks. That is good to know. I do think the way SymbolicRegression.jl launches processes is a bit problematic for large-scale use-cases at the moment. The way it works is that it calls What would be better is if PySR did one of the following alternative strategies:
I'm not sure how much work each of these options would be. They might be fairly easy to get working though. But it would definitely require some Julia coding (if you are up for it). |
Just going to keep this open until there's a better solution than a manual workaround. Ideally the workaround shouldn't be needed |
What happened?
The program crashed while using PySR, with an error message indicating a memory access violation (EXCEPTION_ACCESS_VIOLATION). This error occurred during the garbage collection process.
Version
v0.19.0
Operating System
Windows
Package Manager
pip
Interface
Script (i.e., python my_script.py)
Relevant log output
Extra Info
turbo=True, bumper=True