Skip to content

All node metadata re-design #1105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 task
dehann opened this issue Nov 5, 2024 · 21 comments
Open
1 task

All node metadata re-design #1105

dehann opened this issue Nov 5, 2024 · 21 comments
Assignees
Milestone

Comments

@dehann
Copy link
Member

dehann commented Nov 5, 2024

To decide

  • Does metadata have crud?
  • What is metadata structure?
  • How is metadata serialized?

Metadata Structure
Look into:

  • JSON3.Object ❌ (JSON3.object is not mutable)
    • JSON3.Object fast enough for Compute_Node_ level?
    • Delete smalldata types union
    • Allows better nesting for JSON types
  • key-value Dict{Symbol, SmallDataTypes} (existing)
    Reminder SmallDataTypes are basically JSON values excluding null and objects (nesting):
    const SmallDataTypes = Union{
        Int,
        Float64,
        String,
        Bool,
        Vector{Int},
        Vector{Float64},
        Vector{String},
        Vector{Bool},
    }
  • Dict{Symbol, String}
  • Custom Struct/Dict

Serialization

  • JSON string
  • Base64 encoded JSON string (current)
  • Modelled Structure

TAC:

  • The Metadata design is documented in this issue and approved for DFG v1
@dehann dehann added this to the v1.0.0 milestone Nov 5, 2024
@dehann dehann added the data: entry=>blob Previously bigdata, suggested over `smalldata` label Nov 5, 2024
@Affie
Copy link
Member

Affie commented Nov 12, 2024

My vote is to keep the design as is with SmallDataTypes. Its been this way for 4 years without any issues. see #576

Edit: might change my mind pending the CRUD decision.

@Affie Affie modified the milestones: v1.0.0, v0.25.2 Nov 12, 2024
@Affie Affie self-assigned this Nov 12, 2024
@Affie Affie modified the milestones: v0.25.2, v0.25.3 Dec 2, 2024
@Affie
Copy link
Member

Affie commented Dec 2, 2024

Partially addressed in #1107

@Affie Affie modified the milestones: v0.25.3, v0.25.4 Jan 30, 2025
@Affie Affie modified the milestones: v0.25.4, v0.25.5 Feb 14, 2025
@Affie
Copy link
Member

Affie commented Feb 17, 2025

Also considered just using Dict{Symbol, String} or creating a new struct such as

struct MetadataT
    int::Dict{Symbol, Vector{Int}}
    float::Dict{Symbol, Vector{Float64}}
    string::Dict{Symbol, Vector{String}}
    bool::Dict{Symbol, Vector{Bool}}
end

@dehann
Copy link
Member Author

dehann commented Mar 21, 2025

SmallDataTypes good, that is equivalent to just using existing JSONable content. That also allows nesting of dicts (which we want).

I'm against hard top level definition such as Dict{String, String/Any}. Way to restrictive or loose.

Just stay with JSON, the whole GQL universe is simply JSON.

@Affie
Copy link
Member

Affie commented Mar 21, 2025

SmallDataTypes good, that is equivalent to just using existing JSONable content.

Yes, idea of SmallDataTypes is to force JSONable content.

That also allows nesting of dicts (which we want).

Can you elaborate on this please? We cannot currently nest dicts, metadata in essence is just key value pairs. The value itself can be a json serialized dict as a string.

Do you agree on metadata as just key value pairs to help keep it small as anything bigger should be stored in a blob?

@Affie
Copy link
Member

Affie commented Mar 26, 2025

Does metadata have CRUD in the SDKs?

Currently add/update/delete would have to: get, add/update/delete, set. That makes Isolation not possible, so we would have to change the structure and implement it in the backend for proper CRUD.

Using Dict{Symbol, String} might be the easiest way to implement crud. Metadata is not meant to be used in the solver directly and can be copied into the cache if needed so performance should not be an issue. Only the UX of having to convert strings to variables.

@dehann
Copy link
Member Author

dehann commented Mar 29, 2025

The value itself can be a json serialized dict as a string. [Can you elaborate on this please?]
[NOPE] Only the UX of having to convert strings to variables.
[NOPE] Using Dict{Symbol, String} might be the easiest way ...

No this this the same thing that keeps coming back. NEVER put a JSON.write inside a value of a key=>value pair. That is the whole point. Pure JSON includes the idea of nested dicts.

Do you agree on metadata as just key value pairs to help keep it small as anything bigger should be stored in a blob?

No, that is up to the user. Keeping size of content in metadata small is not forced, that is simply convention.

Does metadata have CRUD in the SDKs?

Yes

Currently add/update/delete would have to: get, add/update/delete, set. That makes Isolation not possible, so we would have to change the structure and implement it in the backend for proper CRUD.

What do you mean by isolation. Do you mean cannot guarantee atomic updates? metadata in SDKs need CRUD, no way around that I think.

Metadata is not meant to be used in the solver directly and can be copied into the cache if needed so performance should not be an issue.

yes, agree

@Affie
Copy link
Member

Affie commented Mar 31, 2025

What do you mean by isolation. Do you mean cannot guarantee atomic updates?

"Isolation" in ACID

metadata in SDKs need CRUD, no way around that I think.

I've been using metadata without CRUD as "get" -> modify -> "set", so not needed.

@Affie
Copy link
Member

Affie commented Mar 31, 2025

No this this the same thing that keeps coming back. NEVER put a JSON.write inside a value of a key=>value pair. That is the whole point. Pure JSON includes the idea of nested dicts.

The current design is that smallData/metadata can't be nested, they are just key-value pairs.

@Affie
Copy link
Member

Affie commented Mar 31, 2025

Packing metadata as a vector of key-value pairs can avoid the whole nested JSON string issue. Dict{Symbol, String} would be the best option then and CRUD can then be supported. Example Json would be: [{"key": "a", "value": "1"}, {"key": "b", "value": "one"}]

@Affie Affie changed the title All node metadata design All node metadata re-design Apr 1, 2025
@Affie Affie added metadata and removed data: entry=>blob Previously bigdata, suggested over `smalldata` labels Apr 1, 2025
@dehann
Copy link
Member Author

dehann commented Apr 2, 2025

Yes, idea of SmallDataTypes is to force JSONable content.
The current design is that smallData/metadata can't be nested, they are just key-value pairs.

Okay, then I have been referencing SmallDataTypes incorrectly in previous comments. Ignore reference to SmallDataTypes and lets simplify the discussion to "Nested JSON" vs "key=>value JSON only".

"Isolation" in ACID

Okay, new name for atomic+globalized operation order. I have been saying atomic instead of isolation. I'll restrict my use of atomic to compute in a hard-struct only (i.e. lower level random access threading reads or writes). I will use isolate in the higher level query/mutate context via some API.

From #1118

The current design is that smallData/metadata can't be nested, they are just key-value pairs.
The options there are currently "base64 encoded JSON string" or a "JSON string"

Two risks when using "key-value JSON pattern" with random data (incl already serialized JSON/JSONstring buried in the data):

  • False curly/square termination or opens { or } where data itself is interfering with JSON frames.
  • Introduction of escape chars makes JSON unusable in multiple languages. So write this today and tomorrow in a different language the JSON serde fails.

Please elaborate why it is so bad in metadata

.metadata lives at the top level of the Json object. So it is tempting to use a flat key-value layout. While this works for small cases, it places a serious limitation on use to the point that "metadata" becomes the wrong noun. That is why in the old days we used smalldata to emphasize that this is not a general purpose store. For me, it really really sucks if every aspect of a config or parameter/config/calibration system needs to be reduced to a single flat kay-value map. For me nesting of dicts is a vital requirement on AnythingDFG.metadata.

I think it is fine to have one condition on .metadata, and this is jstr = json.serialize(anydfg.metadata) should run through in any language, and that result be read in any other programming language with standard json.deser(jstr). This way we can avoid bespoke pack/unpack functions while giving user maximum freedom -- basically adopt a pass through posture on JSON. The JSON folks figured this out, so we can just do what they do.

One layer of nesting is already included upon transmission or persistence: json.serialize(anydfg.metadata), what difference does it make that metadata field itself has more nested JSON inside of it? You mentioned database limitations, so I went reading...

Think at this point we should talk more on the phone about this, but if Neo4j does not support nested JSON then it is starting to become a pretty poor choice. If that forces us to into short-term-long-term tandem plans then we should consider it.

Packing metadata as a vector of key-value pairs can avoid the whole nested JSON string issue. Dict{Symbol, String} would be the best option then and CRUD can then be supported. Example Json would be: [{"key": "a", "value": "1"}, {"key": "b", "value": "one"}]

maybe as a workaround only, but feels like that would result in lots of wasted effort overall.


Side question: this sounds like Update from CRUD to me:

I've been using metadata without CRUD as "get" -> modify -> "set", so not needed.

do you mean CRUD at driver/DB level vs CRUD at SDK level? I've been talking at SDK level.

Sorry, something went wrong.

@dehann
Copy link
Member Author

dehann commented Apr 2, 2025

Just copying link from duplicate 1118 - This should be just a JSON dict, not a JSON3.write ::String:

@Affie
Copy link
Member

Affie commented Apr 2, 2025

... While this works for small cases, it places a serious limitation on use to the point that "metadata" becomes the wrong noun. That is why in the old days we used smalldata to emphasize that this is not a general purpose store.

This is currently still the case and smalldata == metadata as we should absolutely not be storing blobs in metadata for performance reasons. The definition from the dictionary is still valid on "store a couple of bytes".

Smalldata: Pollutes the graph, but a convenient way to store a couple of bytes directly in the graph. Use this with caution and keep it small and simple

@Affie
Copy link
Member

Affie commented Apr 2, 2025

Side question: this sounds like Update from CRUD to me:
> I've been using metadata without CRUD as "get" -> modify -> "set", so not needed.
do you mean CRUD at driver/DB level vs CRUD at SDK level? I've been talking at SDK level.

I don't make a distinction between SDK and db, as a user I expect update to be ACID everywhere. DFG currently has update, but it is only valid in GraphsDFG as a reference to a variable is returned on getVariable.

In case it's not clear:

  • CRUD on metadata refers to CRUD on the individual keys, eg. deleteVariableMetadata!(dfg, variableLabel, key) deletes only the key-value pair and not the entire metadata object.
  • The problem with using "get" -> modify -> "set" for updates in the SDK is that, in concurrent scenarios, the first transaction's changes can be overwritten by the last and it will be hidden from the user.

@Affie Affie modified the milestones: v0.25.5, v0.x.0 Apr 3, 2025
@Affie
Copy link
Member

Affie commented Apr 8, 2025

Decisions from call on Apr 8 2025:

  • There are no custom pack unpack functions, therefore only JSON'able types are supported (Currently SmallDataTypes).
  • We support CRUD on metadata. Consequence:
    • key-value pairs (We want to stay away from complex custom resolvers for CRUD (and ACID))
    • A JSONObject as a value can be implemented as a string (escaped quotes "),

Note: we can look into expanding support in SmallDataTypes in the future. One level is a small change in DFG by adding a Dict as in {Union{DFG.SmallDataTypes, Dict{Symbol, DFG.SmallDataTypes}}

@Affie
Copy link
Member

Affie commented Apr 11, 2025

Graphql has 2 options:

  • treat the key-value as a scalar - requires custom resolvers
  • create a key-value type - this looks like the preferred solution

With the decision to support CRUD, the julia options giving the current limitations:

  • Keep SmallDataTypes
    • In julia
        :a => 1
        :b => "two"
    • json (no escaped quotes)
        [
            {
                "key": "a",
                "value": 1
            },
            {
                "key": "b",
                "value": "two"
            }
        ]
    • nesting manually as json string
          {
              "key": "g",
              "value": "{\"two\":2,\"one\":\"one\"}"
          }
  • New wrapper type over Any (or complex nested)
    • Serialize wrapper as JSON string (note value is always a string):
         [
             {
                 "key": "a",
                 "value": "1"
             },
             {
                 "key": "b",
                 "value": "\"two\""
             },
             {
                 "key": "h",
                 "value": "[\"one\",2,{\"two\":2,\"one\":\"one\"},{\"two\":{\"two\":{\"two\":2,\"one\":\"one\"},\"one\":\"one\"},\"one\":1}]"
             }    
         ]
    • Fully flexible nesting in DFG, but complex beyond key-value in SDK.

To support CRUD, key-value is the best option with current db. With value either:

  • a JSON string or
  • SmallDataTypes

These 2 choices basically comes down to: Dict{Symbol, String} vs Dict{Symbol, SmallDataTypes}

@dehann
Copy link
Member Author

dehann commented Apr 16, 2025

Hi @Affie, which option are we going with?

  • Dict{Symbol, SmallDataTypes} which is JSONed once as part of the node serialization, or
  • Dict{Symbol,String} where each key-value pair is JSONed twice (once to insert the values in dict, and once at node serialization).

Sorry, something went wrong.

@Affie
Copy link
Member

Affie commented Apr 16, 2025

Hi @dehann, I wanted to get your input on which you prefer as well before we proceed, I'm still leaning towards Dict{Symbol, SmallDataTypes} as it is now because it's not JSONed twice in most cases. It will still currently need to be JSONed twice for nested structures (manually written as a String in SmallDataTypes).

@dehann
Copy link
Member Author

dehann commented Apr 17, 2025

Okay so decision is we work towards node.metadata::Dict{Symbol, SmallDataTypes} as fast we can.


Shall we separate the time pressures by using blobless entries for CRUD in the meantime? (see other discussion E&T 187)

@Affie
Copy link
Member

Affie commented Apr 17, 2025

Shall we separate the time pressures by using blobless entries for CRUD in the meantime? (see other discussion E&T 187)

Crud on variable metadata is already supported in dfg, the rest should follow over the next few minor releases to standardize to only one design. In the mean time we can use blobless entries as you suggest or the get-modify-set pattern

@dehann
Copy link
Member Author

dehann commented Apr 25, 2025

right, think we can move this out of the Active Discussions column?

@dehann dehann removed their assignment Apr 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants