remove global metadata cache (OBJ_METADATA) #90

jrevels · 2020-12-27T23:48:02Z

EDIT: I rescoped this issue, see #90 (comment)

I noticed that metadata is stored in a global IdDict - would it make sense to provide an unsetmetadata!(x) (or use a sentinel e.g. setmetadata!(x, nothing)) that calls delete!(OBJ_METADATA, x)?

I could see a memory-leaky scenario where a e.g. a long running service writes a bunch of Arrow objects and attaches a small amount of metadata to each one and eventually OOMs or something.

The text was updated successfully, but these errors were encountered:

quinnj · 2020-12-29T15:39:50Z

Yeah, we could do something like that; we could even do a finalizer on Arrow.Table/Arrow.Stream that removes their metadata, which would probably clean up most cases. I basically followed how docs are stored/setup in Base, which hasn't seemed to bring up many issues in practice.

msavael · 2021-01-25T20:30:54Z

Could this just be a WeakKeyDict? If OBJ_METADATA is the only thing referencing the user object, it should be GCable. (I think I just encountered this exact scenario - a big job that writes 100s of GB of Arrow files with metadata, and memory wasn't being freed. For now, I am just calling Base.empty!(Arrow.OBJ_METADATA) periodically).

quinnj · 2021-01-26T05:34:29Z

Hmmmm, a WeakKeyDict would indeed make sense, except for one case: immutable tables. WeakKeyDicts require the key type to be a mutable struct, which makes sense (immutables can't/aren't "garbage collected"). This would probably work for most cases, except if you have an immutable table, like (col1=arr1, col2=arr2). NamedTuples are immutable, so you wouldn't be able to attach metadata to them.

msavael · 2021-02-05T15:22:54Z

Hmm, that is annoying.

The other thing I've realised is that setmetadata! doesn't seem to be threadsafe. I also hadn't realised that setmetdata! is called when a Table is created / read, so references are stored to all read Arrow tables - not just written.

I've been hitting this case a lot lately - I have programs that read or write lots of large Arrow files, and keep on getting OOMs. It always takes me a while to remember that this is the culprit!

msavael · 2021-03-12T14:14:42Z

I'm always forgetting to lock Arrow reads / writes, and as a result I keep on getting metadata corruption issues when I'm multithreading reads or writes.

I'm wondering - is the global metadata Dict necessary? Could a simpler API be something like:

when writing, you can only specify the metadata at write-time, i.e. as an argument to Arrow.write. Or perhaps you can also use the current method that adds it to the global Dict, but it's explicitly only for writing and once the object is written, the corresponding Dict entry is removed (the Dict would also have a global lock).
when reading, the metadata is added as a property of the Arrow.Table object, and there is no global reference to the table.

I suspect that there are some good reasons you wrote it the way you did, though.

I do feel strongly that, at least, concurrent Arrow reads should be thread-safe.

Fixes #90. There's no need to store table metadata globally when we can just store it in the Table type itself and overload the `getmetadata`. This should avoid metadata bloat in the global store.

quinnj · 2021-04-03T23:30:18Z

Yeah, very good points all around. I think we can allow providing metadata at write-time as well, but note that types can already store their own metadata and overload Arrow.getmetadata and it should "just work". But yeah, here's the PR to not store Table metadata globally at least, which solves the global metadata bloat and thread-safety concerns: #165

* Don't store table metadata globally Fixes #90. There's no need to store table metadata globally when we can just store it in the Table type itself and overload the `getmetadata`. This should avoid metadata bloat in the global store.

jrevels · 2021-04-04T14:11:57Z

But yeah, here's the PR to not store Table metadata globally at least, which solves the global metadata bloat and thread-safety concerns

This solves this issue on the "read side", but not the "write side", correct? Naive usage of Arrow.jl's exposed setmetadata! API function could still result in the problematic behavior mentioned in the OP?

IIUC, after #165, metadata for a newly constructed Arrow.Table will be stored in that table's local metadata field. But the global metadata store still exists as the sole mechanism by which callers can attach table metadata to non-Arrow.Table tables.

Unless I'm mistaken, should we reopen this issue until we address the write-side portion?

Furthermore, can we explicitly document that Arrow.getmetadata should always return an alias to the metadata (wherever it's stored) rather than a copy? That way callers can write generic code for updating any table's metadata, regardless of where it's stored (e.g. can call getmetadata and then mutate the returned value).

msavael · 2021-04-21T21:29:32Z

@jrevels I agree that writing is still dangerous. A very quick fix - not ideal, but at least makes things threadsafe - would just be to have a global lock that locks the OBJ_METADATA Dict used in the generic get/setmetadata functions. Using setmetadata! on a Table object during writing is a bit awkward since an Arrow.Table object is not created explicitly (usually one is writing a DataFrame or a NamedTuple or similar).

store Follow up to #90, based on discussions in that issue.

quinnj · 2021-04-23T00:44:06Z

Alright, here's the global lock on object metadata: #183

…183) * Add global metadata lock to ensure thread safety of global metadata store Follow up to #90, based on discussions in that issue. * fix

ericphanson · 2021-06-26T12:31:20Z

There’s still a memory leak when writing many tables with metadata, right?

I think @femtomc is running into that. I think it would be great to be able to pass table-wide metadata at Arrow.write time to avoid the global store. Then there would a way to avoid the global store on both reads and writes for table-wide metadata, which would be enough for e.g. Legolas.jl’s usage of metadata (storing and validating schemas and read/write time).

Example of the issue:

julia> using Arrow

julia> obj = rand(1000, 1000);

julia> metadata = Dict("hello" => "goodbye");

julia> function short_varinfo(mod=Main; n=10)
        md = varinfo(mod, sortby=:size, imported=true)
       rows = md.content[].rows
       length(rows) > n && resize!(rows, n)
       return md
       end
short_varinfo (generic function with 2 methods)

julia> short_varinfo()
  name                    size summary
  –––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––
  Base                         Module
  Core                         Module
  Main                         Module
  obj                7.629 MiB 1000×1000 Matrix{Float64}
  Pkg                4.911 MiB Module
  Revise             1.284 MiB Module
  Arrow            938.192 KiB Module
  InteractiveUtils 251.212 KiB Module
  metadata           484 bytes Dict{String, String} with 1 entry

julia> foreach(i -> Arrow.setmetadata!(copy(obj), copy(metadata)), 1:100)

julia> short_varinfo()
  name                    size summary
  –––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––
  Base                         Module
  Core                         Module
  Main                         Module
  Arrow            763.926 MiB Module
  obj                7.629 MiB 1000×1000 Matrix{Float64}
  Pkg                4.919 MiB Module
  Revise             1.285 MiB Module
  InteractiveUtils 253.235 KiB Module
  metadata           484 bytes Dict{String, String} with 1 entry

(where here copy(obj) represents some big table, and copy(metadata) represents some small metadata, and the foreach loop represents some reading, processing, and writing of a table).

jrevels · 2021-06-26T17:06:52Z

There’s still a memory leak when writing many tables with metadata, right?

Yes. From my comment above:

This solves this issue on the "read side", but not the "write side", correct? Naive usage of Arrow.jl's exposed setmetadata! API function could still result in the problematic behavior mentioned in the OP? IIUC, after #165, metadata for a newly constructed Arrow.Table will be stored in that table's local metadata field. But the global metadata store still exists as the sole mechanism by which callers can attach table metadata to non-Arrow.Table tables. Unless I'm mistaken, should we reopen this issue until we address the write-side portion?

Nobody ever replied in the affirmative, but I'll take that as lack of opposition to me reopening 😁

jrevels · 2021-07-02T15:48:28Z

My vote is to rescope this issue to remove the global cache entirely as discussed here:

ericphanson · 2021-09-07T16:55:06Z

Just to say I had been plagued by memory problems and OOMs on big computational jobs for weeks and I finally narrowed it down to this. Somehow I had convinced myself the problem was something else, but once I had a solid reproduction, just adding empty!(Arrow.OBJ_METADATA) fixed it.

So anyway, not sure what that says about my diagnosing skills given that I was very aware of this issue, but it makes me feel very strongly that we should rip out the global cache ASAP.

(And if it's not clear, the real problem is not that the cache gets big, it's that it holds references to every table you ever add metadata to and they never get GC'd).

jrevels · 2021-09-08T03:06:31Z

Thinking a bit about what the "right" API should be for handling table-level metadata...

IIUC, Arrow has three notions of "custom metadata":

custom metadata attached to a Schema
custom metadata attached to a Field
custom metadata attached to a Message

The metadata that we're discussing in this issue as "table-level metadata" is actually a Schema's custom_metadata field. It seems to me that if Arrow.Table is supposed to essentially serve as a "Julia view" of Arrow-formatted memory, it should surface a "Julia view" of the related Schema's custom_metadata as well. In other words, we should treat Arrow-formatted metadata in the same manner that we treat Arrow-formatted columns from an API perspective.

With that in mind, here's my take on what we should do:

Delete OBJ_METADATA and related code entirely.
Internally surface any available Schema-level custom_metadata via a custom_metadata::FlatBuffers.Array{KeyValue} property on Arrow.Meta.Schema. This is already done.
Externally surface custom_metadata attached to an Arrow.Table's underlying schema::Arrow.Meta.Schema via the Arrow.getmetadata(::Table) API function. We should not use the current implementation of Arrow.getmetadata(::Table), though. Instead, we should decide whether the function should return an immutable Julia-ized representation of the custom_metadata, a mutable Julia-ized view of the custom_metadata, or a mutable Julia-ized copy of custom_metadata. If we go with the latter, we should ensure we explicitly document that the return value is not an alias. My preference is the "immutable Julia-ized representation" approach, though, since it seems like the safest/sanest option. Also, we should not define Arrow.setmetadata!(t::Table, m) unless we implement it to edit the underlying Arrow-formatted buffer. Arrow.getmetadata(::Any) should be defined to return nothing.
Add a metadata=Arrow.getmetadata(t) kwarg to Arrow.write(_, t) that accepts any generic iterable of pairs of AbstractStrings. Note that with this approach, Arrow.write(io, t::Table; metadata=m) should write out m (i.e. not Arrow.getmetadata(t)) as the table's metadata.
Tag a major release, since this is a breaking change. We could keep OBJ_METADATA around and just deprecate it for a minor release or two, but given how fragile the OBJ_METADATA-based implementation is in the first place, I'm not sure it's worth the churn.

This path should resolve our problems and make the API more consistent (with both itself, and the underlying Arrow structures) while still preserving some of the useful convenience functionality exposed by the current API.

Thoughts?

quinnj · 2021-09-08T06:08:36Z

That all sounds like a good plan to me. I finally got a new CSV.jl release out, so I'm going to try and start picking back up some of the issues in Arrow.jl; happy to work on this if you want, or I can tackle other open issues since you have a pretty good plan here already.

Oh, one note on the plan above: yeah, we don't want to define setmetadata! on a Arrow.Table; "editing a flatbuffer" isn't really a thing, you have to rebuild the entire message flatbuffer. I like the idea of returning an immutable view into the metadata and make sure you can "copy" to get something mutable.

jrevels · 2021-09-09T14:13:23Z

I'll see if I can pick this up this weekend :)

Oh, one note on the plan above: yeah, we don't want to define setmetadata! on a Arrow.Table; "editing a flatbuffer" isn't really a thing, you have to rebuild the entire message flatbuffer. I like the idea of returning an immutable view into the metadata and make sure you can "copy" to get something mutable.

nice, not needing to support a setmetadata! should make it way easier 😁

ericphanson mentioned this issue Mar 14, 2021

reconsidering the current type registration/serialization mechanism (and its internal usage) #88

Closed

quinnj mentioned this issue Apr 3, 2021

Don't store table metadata globally #165

Merged

quinnj closed this as completed in #165 Apr 4, 2021

jrevels mentioned this issue Apr 6, 2021

document guarantee that getmetadata returns alias not copy #169

Merged

quinnj added a commit that referenced this issue Apr 23, 2021

Add global metadata lock to ensure thread safety of global metadata

cfc217d

store Follow up to #90, based on discussions in that issue.

quinnj mentioned this issue Apr 23, 2021

Add global metadata lock to ensure thread safety of global metadata #183

Merged

quinnj added a commit that referenced this issue Apr 24, 2021

Add global metadata lock to ensure thread safety of global metadata (#…

8de2d7d

…183) * Add global metadata lock to ensure thread safety of global metadata store Follow up to #90, based on discussions in that issue. * fix

jrevels reopened this Jun 26, 2021

jrevels changed the title ~~provide mechanism to free metadata stored in OBJ_METADATA?~~ remove global metadata cache (OBJ_METADATA) Jul 2, 2021

ericphanson mentioned this issue Aug 31, 2021

use of setmetadata! on generic tables basically requires calling Tables.columns on the input #211

Closed

ericphanson mentioned this issue Sep 11, 2021

Unable to write or read metadata from a Table after attaching metadata to the table #236

Open

jrevels mentioned this issue Sep 12, 2021

remove global metadata cache, refactor custom_metadata API #238

Merged

jrevels closed this as completed in #238 Sep 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remove global metadata cache (OBJ_METADATA) #90

remove global metadata cache (OBJ_METADATA) #90

jrevels commented Dec 27, 2020 •

edited

Loading

quinnj commented Dec 29, 2020

msavael commented Jan 25, 2021 •

edited

Loading

quinnj commented Jan 26, 2021

msavael commented Feb 5, 2021 •

edited

Loading

msavael commented Mar 12, 2021

quinnj commented Apr 3, 2021

jrevels commented Apr 4, 2021

msavael commented Apr 21, 2021

quinnj commented Apr 23, 2021

ericphanson commented Jun 26, 2021 •

edited

Loading

jrevels commented Jun 26, 2021

jrevels commented Jul 2, 2021

ericphanson commented Sep 7, 2021

jrevels commented Sep 8, 2021 •

edited

Loading

quinnj commented Sep 8, 2021

jrevels commented Sep 9, 2021

remove global metadata cache (OBJ_METADATA) #90

remove global metadata cache (OBJ_METADATA) #90

Comments

jrevels commented Dec 27, 2020 • edited Loading

quinnj commented Dec 29, 2020

msavael commented Jan 25, 2021 • edited Loading

quinnj commented Jan 26, 2021

msavael commented Feb 5, 2021 • edited Loading

msavael commented Mar 12, 2021

quinnj commented Apr 3, 2021

jrevels commented Apr 4, 2021

msavael commented Apr 21, 2021

quinnj commented Apr 23, 2021

ericphanson commented Jun 26, 2021 • edited Loading

jrevels commented Jun 26, 2021

jrevels commented Jul 2, 2021

ericphanson commented Sep 7, 2021

jrevels commented Sep 8, 2021 • edited Loading

quinnj commented Sep 8, 2021

jrevels commented Sep 9, 2021

jrevels commented Dec 27, 2020 •

edited

Loading

msavael commented Jan 25, 2021 •

edited

Loading

msavael commented Feb 5, 2021 •

edited

Loading

ericphanson commented Jun 26, 2021 •

edited

Loading

jrevels commented Sep 8, 2021 •

edited

Loading