-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removing .arrow files without closing Julia seems impossible in Windows #492
Comments
Arrow uses memory map which means as long as the memory map is not garbage collected, Windows can't delete, because windows filesystem is kinda annoying. What you should do is to make sure reference is cleared and forced GC by running GC.gc() a few times |
thanks @Moelf . I tried this code, but removal still gives error adata_arrow = Arrow.Table("adata.arrow")
adata_saved = DataFrame(adata_arrow)
@test size(adata_saved) == (11, 2)
@test propertynames(adata_saved) == [:step, :mean_weight]
mdata_arrow = Arrow.Table("mdata.arrow")
mdata_saved = DataFrame(mdata_arrow)
@test size(mdata_saved) == (6, 3)
@test propertynames(mdata_saved) == [:step, :flag, :year]
adata_stream_arrow = Arrow.Stream("adata.arrow")
mdata_stream_arrow = Arrow.Stream("mdata.arrow")
@test size(vcat(DataFrame.(adata_stream_arrow)...)) == (11, 2)
@test size(vcat(DataFrame.(mdata_stream_arrow)...)) == (6, 3)
if Sys.iswindows()
adata_arrow = nothing
mdata_arrow = nothing
mdata_stream_arrow = nothing
adata_stream_arrow = nothing
for _ in 1:5
GC.gc()
end
end
rm("adata.arrow")
rm("mdata.arrow")
@test !isfile("adata.arrow")
@test !isfile("mdata.arrow") edit: yes actually there was a typo, will rerun tests |
this is also a reference holding |
I fixes the typo in the above code and rerun test, still getting |
I'm out of ideas here. |
thanks anyway for the help @Moelf . To be able to bisect the issue more easily I created this MWE: using Arrow, DataFrames
function writer_arrow(filename, data, append)
if append
Arrow.append(filename, data)
else
Arrow.write(filename, data; file = false)
end
end
filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
GC.gc(); GC.gc()
rm("MWE.arrow") which gives when runned in REPL: julia> rm("MWE.arrow")
ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)
Stacktrace:
[1] uv_error
@ .\libuv.jl:100 [inlined]
[2] unlink(p::String)
@ Base.Filesystem .\file.jl:972
[3] rm(path::String; force::Bool, recursive::Bool)
@ Base.Filesystem .\file.jl:283
[4] rm(path::String)
@ Base.Filesystem .\file.jl:273
[5] top-level scope
@ REPL[11]:1 Version info of my settings: julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 12 × AMD Ryzen 5 5600H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 1 on 12 virtual cores Hopefully this will help someone who can solve the issue :-) |
ok strangely enough to me if you add e.g. this works fine using Arrow, DataFrames
function writer_arrow(filename, data, append)
if append
Arrow.append(filename, data)
else
Arrow.write(filename, data; file = false)
end
end
filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
data_saved = nothing
GC.gc(); GC.gc()
rm("MWE.arrow") |
ohh ok got it, the dataframe doesn't copy the content of the arrow table, just steal the reference, if one tries: julia> deleteat!(data_saved, 2)
ERROR: ArgumentError: `Arrow.Primitive{Int64, Vector{Int64}}` does not support `deleteat!`; arrow data is by nature immutable
Stacktrace:
[1] deleteat!(x::Arrow.Primitive{Int64, Vector{Int64}}, inds::Vector{Int64})
@ Arrow C:\Users\bob\.julia\packages\Arrow\R2Rvz\src\arraytypes\arraytypes.jl:32
[2] deleteat!(A::SentinelArrays.ChainedVector{Int64, Arrow.Primitive{Int64, Vector{Int64}}}, inds::Vector{Int64})
@ SentinelArrays C:\Users\bob\.julia\packages\SentinelArrays\cav7N\src\chainedvector.jl:560
[3] _deleteat!_helper(df::DataFrame, drop::Vector{Int64})
@ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:919
[4] deleteat!(df::DataFrame, inds::Int64)
@ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:861
[5] top-level scope
@ REPL[33]:1 is this correct behaviour though? It isn't even possible to add new rows to this dataframe since the underlying arrow table is immutable: julia> append!(data_saved, data)
ERROR: MethodError: no method matching resize!(::Arrow.Primitive{Int64, Vector{Int64}}, ::Int64)
Closest candidates are:
resize!(::Vector, ::Integer)
@ Base array.jl:1246
resize!(::BitVector, ::Integer)
@ Base bitarray.jl:814
resize!(::SparseArrays.ReadOnly, ::Any)
@ SparseArrays C:\Users\bob\AppData\Local\Programs\Julia-1.9.3\share\julia\stdlib\v1.9\SparseArrays\src\readonly.jl:33
...
.... |
All in all, I think this is probably a problem with DataFrames.jl. edit: okay, actually it is specified in the DataFrames.jl docs so problem solved:
|
Reopening since even with this new information I couldn't get the code to work on Windows in the tests of Agents.jl and so...I got a new updated MWE: using Arrow, DataFrames, Test
@testset "reproduce" begin
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
Arrow.write("MWE.arrow", data; file = false)
DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
nothing
for _ in 1:10 GC.gc() end
rm("MWE.arrow")
end this again throws "ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)", notice that strangely if I remove the if statement wrapping all the code everything works fine. edit: using a testset to make it more realistic |
try to add a few more |
updated the code with 10 GC runs, same error |
maybe try isolate the workload into a function, call that function, and run GC outside? idk if you're testing this in REPL or not, but in REPL you would have |
I'm testing in REPL but using Arrow, DataFrames, Test
function create_file()
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
Arrow.write("MWE.arrow", data; file = false)
DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
end
@testset "reproduce" begin
create_file()
GC.gc()
rm("MWE.arrow")
end removes the file fine. But still I have problems understanding why that other version without the function call instead gives error, could this be a GC issue? |
well, I think global scope and running GC.gc() in a |
I'm feeling that this problem is similar to https://discourse.julialang.org/t/how-to-force-an-object-to-be-freed-by-the-garbage-collector/77985. (also, even if I change |
I think the general problem is GC interaction happens with compiled code, and the code we human write are not what actually happens line by line at runtime, so you shouldn't rely on compiler does something specific. I agree compiler should account for GC effects but that's probably hard, meanwhile just remember to isolate by unit of individual function gets you very far already |
thanks for all the help @Moelf, do you think we can close this one then since the problem seems unrelated to Arrow in particular? |
Hi,
I'm a bit in a desperate situation since I'd like to resolve an issue in Agents.jl involving the Arrow.jl package on Windows but even if I tried hard I couldn't: it seems to me that in reality it could be a problem with Arrow.jl since all my (maybe poor) attempts to solve the problem failed. This is the PR trying to solve it: JuliaDynamics/Agents.jl#909.
In practice we have currently in Agents.jl here a test which performs:
On Linux and MacOS it works fine, but on Windows the operating system refuses to remove the .arrow files giving:
As you can see for Windows it is disables since we didn't manage to make it work.
It's probably necessary to describe what is the relationship between the Agents.jl function
offline_run!
and Arrow.jl, it is rather simple,offline_run!
calls multiple times this other function:I'm trying multiple ways to solve the issue, but none worked. I will try to provide an MWE if necessary.
The text was updated successfully, but these errors were encountered: