Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove global metadata cache (OBJ_METADATA) #90

Closed
jrevels opened this issue Dec 27, 2020 · 16 comments · Fixed by #165 or #238
Closed

remove global metadata cache (OBJ_METADATA) #90

jrevels opened this issue Dec 27, 2020 · 16 comments · Fixed by #165 or #238

Comments

@jrevels
Copy link
Contributor

jrevels commented Dec 27, 2020

EDIT: I rescoped this issue, see #90 (comment)

I noticed that metadata is stored in a global IdDict - would it make sense to provide an unsetmetadata!(x) (or use a sentinel e.g. setmetadata!(x, nothing)) that calls delete!(OBJ_METADATA, x)?

I could see a memory-leaky scenario where a e.g. a long running service writes a bunch of Arrow objects and attaches a small amount of metadata to each one and eventually OOMs or something.

@quinnj
Copy link
Member

quinnj commented Dec 29, 2020

Yeah, we could do something like that; we could even do a finalizer on Arrow.Table/Arrow.Stream that removes their metadata, which would probably clean up most cases. I basically followed how docs are stored/setup in Base, which hasn't seemed to bring up many issues in practice.

@msavael
Copy link

msavael commented Jan 25, 2021

Could this just be a WeakKeyDict? If OBJ_METADATA is the only thing referencing the user object, it should be GCable. (I think I just encountered this exact scenario - a big job that writes 100s of GB of Arrow files with metadata, and memory wasn't being freed. For now, I am just calling Base.empty!(Arrow.OBJ_METADATA) periodically).

@quinnj
Copy link
Member

quinnj commented Jan 26, 2021

Hmmmm, a WeakKeyDict would indeed make sense, except for one case: immutable tables. WeakKeyDicts require the key type to be a mutable struct, which makes sense (immutables can't/aren't "garbage collected"). This would probably work for most cases, except if you have an immutable table, like (col1=arr1, col2=arr2). NamedTuples are immutable, so you wouldn't be able to attach metadata to them.

@msavael
Copy link

msavael commented Feb 5, 2021

Hmm, that is annoying.

The other thing I've realised is that setmetadata! doesn't seem to be threadsafe. I also hadn't realised that setmetdata! is called when a Table is created / read, so references are stored to all read Arrow tables - not just written.

I've been hitting this case a lot lately - I have programs that read or write lots of large Arrow files, and keep on getting OOMs. It always takes me a while to remember that this is the culprit!

@msavael
Copy link

msavael commented Mar 12, 2021

I'm always forgetting to lock Arrow reads / writes, and as a result I keep on getting metadata corruption issues when I'm multithreading reads or writes.

I'm wondering - is the global metadata Dict necessary? Could a simpler API be something like:

  • when writing, you can only specify the metadata at write-time, i.e. as an argument to Arrow.write. Or perhaps you can also use the current method that adds it to the global Dict, but it's explicitly only for writing and once the object is written, the corresponding Dict entry is removed (the Dict would also have a global lock).
  • when reading, the metadata is added as a property of the Arrow.Table object, and there is no global reference to the table.

I suspect that there are some good reasons you wrote it the way you did, though.

I do feel strongly that, at least, concurrent Arrow reads should be thread-safe.

quinnj added a commit that referenced this issue Apr 3, 2021
Fixes #90. There's no need to store table metadata globally when we can
just store it in the Table type itself and overload the `getmetadata`.
This should avoid metadata bloat in the global store.
@quinnj
Copy link
Member

quinnj commented Apr 3, 2021

Yeah, very good points all around. I think we can allow providing metadata at write-time as well, but note that types can already store their own metadata and overload Arrow.getmetadata and it should "just work". But yeah, here's the PR to not store Table metadata globally at least, which solves the global metadata bloat and thread-safety concerns: #165

quinnj added a commit that referenced this issue Apr 4, 2021
* Don't store table metadata globally

Fixes #90. There's no need to store table metadata globally when we can
just store it in the Table type itself and overload the `getmetadata`.
This should avoid metadata bloat in the global store.
@jrevels
Copy link
Contributor Author

jrevels commented Apr 4, 2021

But yeah, here's the PR to not store Table metadata globally at least, which solves the global metadata bloat and thread-safety concerns

This solves this issue on the "read side", but not the "write side", correct? Naive usage of Arrow.jl's exposed setmetadata! API function could still result in the problematic behavior mentioned in the OP?

IIUC, after #165, metadata for a newly constructed Arrow.Table will be stored in that table's local metadata field. But the global metadata store still exists as the sole mechanism by which callers can attach table metadata to non-Arrow.Table tables.

Unless I'm mistaken, should we reopen this issue until we address the write-side portion?

Furthermore, can we explicitly document that Arrow.getmetadata should always return an alias to the metadata (wherever it's stored) rather than a copy? That way callers can write generic code for updating any table's metadata, regardless of where it's stored (e.g. can call getmetadata and then mutate the returned value).

@msavael
Copy link

msavael commented Apr 21, 2021

@jrevels I agree that writing is still dangerous. A very quick fix - not ideal, but at least makes things threadsafe - would just be to have a global lock that locks the OBJ_METADATA Dict used in the generic get/setmetadata functions. Using setmetadata! on a Table object during writing is a bit awkward since an Arrow.Table object is not created explicitly (usually one is writing a DataFrame or a NamedTuple or similar).

quinnj added a commit that referenced this issue Apr 23, 2021
store

Follow up to #90, based on discussions in that issue.
@quinnj
Copy link
Member

quinnj commented Apr 23, 2021

Alright, here's the global lock on object metadata: #183

quinnj added a commit that referenced this issue Apr 24, 2021
…183)

* Add global metadata lock to ensure thread safety of global metadata
store

Follow up to #90, based on discussions in that issue.

* fix
@ericphanson
Copy link
Member

ericphanson commented Jun 26, 2021

There’s still a memory leak when writing many tables with metadata, right?

I think @femtomc is running into that. I think it would be great to be able to pass table-wide metadata at Arrow.write time to avoid the global store. Then there would a way to avoid the global store on both reads and writes for table-wide metadata, which would be enough for e.g. Legolas.jl’s usage of metadata (storing and validating schemas and read/write time).

Example of the issue:

julia> using Arrow

julia> obj = rand(1000, 1000);

julia> metadata = Dict("hello" => "goodbye");

julia> function short_varinfo(mod=Main; n=10)
        md = varinfo(mod, sortby=:size, imported=true)
       rows = md.content[].rows
       length(rows) > n && resize!(rows, n)
       return md
       end
short_varinfo (generic function with 2 methods)

julia> short_varinfo()
  name                    size summary
  –––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––
  Base                         Module
  Core                         Module
  Main                         Module
  obj                7.629 MiB 1000×1000 Matrix{Float64}
  Pkg                4.911 MiB Module
  Revise             1.284 MiB Module
  Arrow            938.192 KiB Module
  InteractiveUtils 251.212 KiB Module
  metadata           484 bytes Dict{String, String} with 1 entry

julia> foreach(i -> Arrow.setmetadata!(copy(obj), copy(metadata)), 1:100)

julia> short_varinfo()
  name                    size summary
  –––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––
  Base                         Module
  Core                         Module
  Main                         Module
  Arrow            763.926 MiB Module
  obj                7.629 MiB 1000×1000 Matrix{Float64}
  Pkg                4.919 MiB Module
  Revise             1.285 MiB Module
  InteractiveUtils 253.235 KiB Module
  metadata           484 bytes Dict{String, String} with 1 entry

(where here copy(obj) represents some big table, and copy(metadata) represents some small metadata, and the foreach loop represents some reading, processing, and writing of a table).

@jrevels
Copy link
Contributor Author

jrevels commented Jun 26, 2021

There’s still a memory leak when writing many tables with metadata, right?

Yes. From my comment above:

This solves this issue on the "read side", but not the "write side", correct? Naive usage of Arrow.jl's exposed setmetadata! API function could still result in the problematic behavior mentioned in the OP? IIUC, after #165, metadata for a newly constructed Arrow.Table will be stored in that table's local metadata field. But the global metadata store still exists as the sole mechanism by which callers can attach table metadata to non-Arrow.Table tables. Unless I'm mistaken, should we reopen this issue until we address the write-side portion?

Nobody ever replied in the affirmative, but I'll take that as lack of opposition to me reopening 😁

@jrevels jrevels reopened this Jun 26, 2021
@jrevels
Copy link
Contributor Author

jrevels commented Jul 2, 2021

My vote is to rescope this issue to remove the global cache entirely as discussed here:

Screen Shot 2021-06-30 at 11 46 35 AM

@jrevels jrevels changed the title provide mechanism to free metadata stored in OBJ_METADATA? remove global metadata cache (OBJ_METADATA) Jul 2, 2021
@ericphanson
Copy link
Member

Just to say I had been plagued by memory problems and OOMs on big computational jobs for weeks and I finally narrowed it down to this. Somehow I had convinced myself the problem was something else, but once I had a solid reproduction, just adding empty!(Arrow.OBJ_METADATA) fixed it.

So anyway, not sure what that says about my diagnosing skills given that I was very aware of this issue, but it makes me feel very strongly that we should rip out the global cache ASAP.

(And if it's not clear, the real problem is not that the cache gets big, it's that it holds references to every table you ever add metadata to and they never get GC'd).

@jrevels
Copy link
Contributor Author

jrevels commented Sep 8, 2021

Thinking a bit about what the "right" API should be for handling table-level metadata...

IIUC, Arrow has three notions of "custom metadata":

  1. custom metadata attached to a Schema
  2. custom metadata attached to a Field
  3. custom metadata attached to a Message

The metadata that we're discussing in this issue as "table-level metadata" is actually a Schema's custom_metadata field. It seems to me that if Arrow.Table is supposed to essentially serve as a "Julia view" of Arrow-formatted memory, it should surface a "Julia view" of the related Schema's custom_metadata as well. In other words, we should treat Arrow-formatted metadata in the same manner that we treat Arrow-formatted columns from an API perspective.

With that in mind, here's my take on what we should do:

  1. Delete OBJ_METADATA and related code entirely.
  2. Internally surface any available Schema-level custom_metadata via a custom_metadata::FlatBuffers.Array{KeyValue} property on Arrow.Meta.Schema. This is already done.
  3. Externally surface custom_metadata attached to an Arrow.Table's underlying schema::Arrow.Meta.Schema via the Arrow.getmetadata(::Table) API function. We should not use the current implementation of Arrow.getmetadata(::Table), though. Instead, we should decide whether the function should return an immutable Julia-ized representation of the custom_metadata, a mutable Julia-ized view of the custom_metadata, or a mutable Julia-ized copy of custom_metadata. If we go with the latter, we should ensure we explicitly document that the return value is not an alias. My preference is the "immutable Julia-ized representation" approach, though, since it seems like the safest/sanest option. Also, we should not define Arrow.setmetadata!(t::Table, m) unless we implement it to edit the underlying Arrow-formatted buffer. Arrow.getmetadata(::Any) should be defined to return nothing.
  4. Add a metadata=Arrow.getmetadata(t) kwarg to Arrow.write(_, t) that accepts any generic iterable of pairs of AbstractStrings. Note that with this approach, Arrow.write(io, t::Table; metadata=m) should write out m (i.e. not Arrow.getmetadata(t)) as the table's metadata.
  5. Tag a major release, since this is a breaking change. We could keep OBJ_METADATA around and just deprecate it for a minor release or two, but given how fragile the OBJ_METADATA-based implementation is in the first place, I'm not sure it's worth the churn.

This path should resolve our problems and make the API more consistent (with both itself, and the underlying Arrow structures) while still preserving some of the useful convenience functionality exposed by the current API.

Thoughts?

@quinnj
Copy link
Member

quinnj commented Sep 8, 2021

That all sounds like a good plan to me. I finally got a new CSV.jl release out, so I'm going to try and start picking back up some of the issues in Arrow.jl; happy to work on this if you want, or I can tackle other open issues since you have a pretty good plan here already.

Oh, one note on the plan above: yeah, we don't want to define setmetadata! on a Arrow.Table; "editing a flatbuffer" isn't really a thing, you have to rebuild the entire message flatbuffer. I like the idea of returning an immutable view into the metadata and make sure you can "copy" to get something mutable.

@jrevels
Copy link
Contributor Author

jrevels commented Sep 9, 2021

I'll see if I can pick this up this weekend :)

Oh, one note on the plan above: yeah, we don't want to define setmetadata! on a Arrow.Table; "editing a flatbuffer" isn't really a thing, you have to rebuild the entire message flatbuffer. I like the idea of returning an immutable view into the metadata and make sure you can "copy" to get something mutable.

nice, not needing to support a setmetadata! should make it way easier 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants