remove global metadata cache (OBJ_METADATA) #90
Yeah, we could do something like that; we could even do a …
Could this just be a …
Hmmmm, a …
Hmm, that is annoying. The other thing I've realised is that I've been hitting this case a lot lately - I have programs that read or write lots of large Arrow files, and keep on getting OOMs. It always takes me a while to remember that this is the culprit!
I'm always forgetting to lock Arrow reads / writes, and as a result I keep on getting metadata corruption issues when I'm multithreading reads or writes. I'm wondering - is the global metadata Dict necessary? Could a simpler API be something like:
I suspect that there are some good reasons you wrote it the way you did, though. I do feel strongly that, at least, concurrent Arrow reads should be thread-safe.
Fixes #90. There's no need to store table metadata globally when we can just store it in the `Table` type itself and overload `getmetadata`. This should avoid metadata bloat in the global store.
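A minimal sketch of the approach this PR describes, with hypothetical type and field names (this is not Arrow.jl's actual `Table` definition): the table owns its metadata, so the metadata is garbage-collected together with the table instead of being pinned in a process-global store.

```julia
# Hypothetical sketch of per-table metadata storage; `MyTable` and its fields
# are illustrative names, not Arrow.jl's real Table type.
struct MyTable
    columns::Dict{Symbol,Vector}
    metadata::Union{Nothing,Dict{String,String}}  # owned by the table itself
end

# Overloading a getmetadata-style accessor replaces lookups in a global store.
getmetadata(t::MyTable) = t.metadata
getmetadata(x) = nothing  # fallback: objects without metadata return nothing

t = MyTable(Dict(:a => [1, 2, 3]), Dict("hello" => "goodbye"))
```

Because the metadata lives in a field, dropping the last reference to `t` frees the metadata too; no cleanup call is needed.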
Yeah, very good points all around. I think we can allow providing metadata at write-time as well, but note that types can already store their own metadata and overload …
This solves this issue on the "read side", but not the "write side", correct? Naive usage of Arrow.jl's exposed … IIUC, after #165, metadata for a newly constructed … Unless I'm mistaken, should we reopen this issue until we address the write-side portion? Furthermore, can we explicitly document that …
@jrevels I agree that writing is still dangerous. A very quick fix - not ideal, but at least makes things thread-safe - would just be to have a global lock that locks the …
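A sketch of that quick fix, assuming the store name used in the issue (`OBJ_METADATA`); the lock name and exact signatures here are illustrative, not necessarily what Arrow.jl shipped.

```julia
# Sketch: serialize every access to the global metadata store through one lock,
# so concurrent setmetadata!/getmetadata calls can no longer race.
const OBJ_METADATA = IdDict{Any,Dict{String,String}}()
const OBJ_METADATA_LOCK = ReentrantLock()

function setmetadata!(x, meta::Dict{String,String})
    lock(OBJ_METADATA_LOCK) do
        OBJ_METADATA[x] = meta
    end
    return x
end

function getmetadata(x)
    lock(OBJ_METADATA_LOCK) do
        get(OBJ_METADATA, x, nothing)
    end
end
```

Note this makes access thread-safe but does nothing about the retention problem discussed below: the store still holds strong references to every key.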
… store. Follow-up to #90, based on discussions in that issue.
Alright, here's the global lock on object metadata: #183
There’s still a memory leak when writing many tables with metadata, right? I think @femtomc is running into that. I think it would be great to be able to pass table-wide metadata at …

Example of the issue:

julia> using Arrow
julia> obj = rand(1000, 1000);
julia> metadata = Dict("hello" => "goodbye");
julia> function short_varinfo(mod=Main; n=10)
md = varinfo(mod, sortby=:size, imported=true)
rows = md.content[].rows
length(rows) > n && resize!(rows, n)
return md
end
short_varinfo (generic function with 2 methods)
julia> short_varinfo()
name size summary
–––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––
Base Module
Core Module
Main Module
obj 7.629 MiB 1000×1000 Matrix{Float64}
Pkg 4.911 MiB Module
Revise 1.284 MiB Module
Arrow 938.192 KiB Module
InteractiveUtils 251.212 KiB Module
metadata 484 bytes Dict{String, String} with 1 entry
julia> foreach(i -> Arrow.setmetadata!(copy(obj), copy(metadata)), 1:100)
julia> short_varinfo()
name size summary
–––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––
Base Module
Core Module
Main Module
Arrow 763.926 MiB Module
obj 7.629 MiB 1000×1000 Matrix{Float64}
Pkg 4.919 MiB Module
Revise 1.285 MiB Module
InteractiveUtils 253.235 KiB Module
metadata 484 bytes Dict{String, String} with 1 entry
(where here …)
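The growth reported above is consistent with the cache pinning all 100 copies: each 1000×1000 `Float64` matrix is 8 × 10^6 bytes, about 7.629 MiB (matching the `obj` row in the `varinfo` output), and 100 of them account for the ~763 MiB attributed to the `Arrow` module.

```julia
# Back-of-envelope check: the ~763 MiB growth is exactly 100 pinned matrices.
bytes_per_matrix = 8 * 1000 * 1000        # Float64 is 8 bytes per element
mib_per_matrix = bytes_per_matrix / 2^20  # ≈ 7.629 MiB, matching varinfo's `obj`
total_mib = 100 * mib_per_matrix          # ≈ 762.94 MiB
```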
Yes. From my comment above:
Nobody ever replied in the affirmative, but I'll take that as lack of opposition to me reopening 😁
Just to say I had been plagued by memory problems and OOMs on big computational jobs for weeks and I finally narrowed it down to this. Somehow I had convinced myself the problem was something else, but once I had a solid reproduction, just adding …

So anyway, not sure what that says about my diagnosing skills given that I was very aware of this issue, but it makes me feel very strongly that we should rip out the global cache ASAP. (And if it's not clear, the real problem is not that the cache gets big, it's that it holds references to every table you ever add metadata to and they never get GC'd).
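That retention behavior is easy to demonstrate with a plain `IdDict` (the sketch below uses small stand-in arrays rather than Arrow tables): an `IdDict` holds strong references to its keys, so everything ever registered in the cache stays reachable and is never collected.

```julia
# An IdDict keeps strong references to its keys: every object ever inserted
# stays reachable through the cache, even after all other references are gone.
cache = IdDict{Any,Dict{String,String}}()

for _ in 1:100
    obj = zeros(1000)              # stand-in for a large Arrow table
    cache[obj] = Dict("k" => "v")  # pins `obj` for the lifetime of the cache
end

GC.gc()
length(cache)  # still 100: none of the arrays were collected
```

A `WeakKeyDict` would let unreferenced keys be collected, but then entries can silently disappear, which is its own trade-off; attaching metadata to the object itself avoids the question entirely.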
Thinking a bit about what the "right" API should be for handling table-level metadata... IIUC, Arrow has three notions of "custom metadata":
The metadata that we're discussing in this issue as "table-level metadata" is actually a … With that in mind, here's my take on what we should do: …
This path should resolve our problems and make the API more consistent (with both itself and the underlying Arrow structures) while still preserving some of the useful convenience functionality exposed by the current API. Thoughts?
That all sounds like a good plan to me. I finally got a new CSV.jl release out, so I'm going to try to start picking back up some of the issues in Arrow.jl; happy to work on this if you want, or I can tackle other open issues since you have a pretty good plan here already. Oh, one note on the plan above: yeah, we don't want to define …
I'll see if I can pick this up this weekend :)
nice, not needing to support a …
EDIT: I rescoped this issue, see #90 (comment)
I noticed that metadata is stored in a global `IdDict` - would it make sense to provide an `unsetmetadata!(x)` (or use a sentinel, e.g. `setmetadata!(x, nothing)`) that calls `delete!(OBJ_METADATA, x)`? I could see a memory-leaky scenario where e.g. a long-running service writes a bunch of Arrow objects, attaches a small amount of metadata to each one, and eventually OOMs or something.
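A sketch of the proposal, using the names from the issue text (`OBJ_METADATA`, `setmetadata!`); the `unsetmetadata!` definition and the sentinel method are the hypothetical additions, not existing Arrow.jl API.

```julia
# Names taken from the issue; this is a sketch of the proposed API, not Arrow.jl code.
const OBJ_METADATA = IdDict{Any,Dict{String,String}}()

setmetadata!(x, meta::Dict{String,String}) = (OBJ_METADATA[x] = meta; x)
getmetadata(x) = get(OBJ_METADATA, x, nothing)

# Proposed addition: explicitly drop the entry so `x` can be garbage-collected.
unsetmetadata!(x) = (delete!(OBJ_METADATA, x); x)

# Sentinel variant from the issue: setmetadata!(x, nothing) removes the entry.
setmetadata!(x, ::Nothing) = unsetmetadata!(x)
```

This plugs the leak only when callers remember to unset; the thread ultimately converges on removing the global store altogether, which needs no such discipline.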