Schema <> Block interaction + "advanced layouts logic" #118
Comments
I don’t see why we would not want to contain all the schema information in the schema instead of as metadata in the reference to the schema, so definitely:
I’d also like to avoid describing the block boundaries and instead just state that the I don’t see why we would embed the DSL text for the schema rather than the parsed schema in JSON types. Then we could inline it or link to it encoded into any codec that supports the data model. This would also allow the tools to more easily embed versioning information or other parser requirements.
Well, that's mainly because many schemas will have to be constructed of multiple, inter-dependent types, so a loader will have to know which type it's loading in the current block.
We'd need a rule like: the first type in a schema describes the current block, the rest are dependents. Unfortunately that would mean you can't use different parts of a schema for different purposes. You couldn't have a block that's a
OK, I think I just had a
Fair. Although that means we have to make sure we're paying as much attention to the parsed form (maybe we can call that the AST) as the DSL syntax. I'm trying to do that here, but it should also make it into this repo's schema spec.
Re: schema <> block interaction: There are a lot of different ways we can do this. I'm not sure any way is clearly more correct than any other, and it might often be correct to leave this choice to the user.
I did some writing on schema versioning and migration pathways which might be relevant: https://github.com/ipld/go-ipld-prime/blob/cd9283ddd86af15b9bc4a1f5b71f00fbfc2f8b94/doc/schema.md#schemas-and-migration
In particular there's even a section on "Strongly linked schemas".
Having explicit pointers to schemas is something we can do. I dunno about "should", though. It smells to me a lot like nominative typing, and I tend to feel we get better migratability over time by aiming more towards the vicinity of structural typing. (Not that nominative typing isn't great in programming languages; but data description is a different ballgame...)
(EDIT: the next section may have taken "native type/butterfly" slightly too literally; if you meant "schema type", your text makes more sense to me now, and this next paragraph is still true but slightly off-topic.)
The big issue with explicitly linked schemas in the data itself is that it doesn't generally give you a way to translate into native types. Codegen (or our reflecty 'bind' stuff in go-ipld-prime) is something you get to do once at compile time. (Okay, I'm wearing very not-javascript^Wlisp-tinted glasses here -- suppose we have languages where
Parameters to advanced layout logic being directly linked from the data is a different matter than the above rant about explicit schema links being problematic, though. (Concretely, supposing the HAMT code is already written, plugging params into that at runtime is fine, contrasted with plugging other schema content into codegen at runtime, which is... not.)
(No comment yet on the rest of the advanced layout logic part, need more time to parse that, so I'll make it a separate comment. I think I'm mostly nodding, though.)
+1 to always linking the IPLD form of schemas rather than the DSL. I guess this isn't committed to any documentation yet, but I also had some comments about the longevity of one vs the other over in ipld/go-ipld-prime#10 (comment) .
FWIW, I tend to imagine using a bunch of small schemas in many programs, yeah. I think this probably also works better and is easier to imagine scaling up ecosystemically when going with the structural-typing instead of nominative-typing take. One just doesn't get most of the same issues when composing things with the structural-typing point of view -- the schema versions just aren't there to have the potential to conflict.
Lots of nodding about the general concept of registering logic for handling some
An interesting devil in the details is... should there be different treatment of the algorithm (e.g. "HAMT" -- it's one thing; that's easy enough to register) vs the parameters the algorithm might have (e.g. bucket width -- an int, thus countably infinite range, and not so clear how to register)?
There's a brief mention of potential relationship to selectors in the first post, and... I'd say, yeah, nah, I wouldn't try to worry about this. We can use selectors in various forms of interaction with advanced data layouts, sure: either by using them with a schema that points out the advLayout, thus allowing a single path step to translate into the entire internal operation of the advLayout; or by using them with a schema that doesn't point out the advLayout, and points out all the internal types of the advLayout's guts instead, in which case it can get some fraction of the internal nodes, etc. That toggle is kinda cool. But things like HAMT sharding instantly get into more ✨ logic ✨ than we'll probably ever be capable of expressing in selectors (short of going full wasm), so... IMO, we should lean into that freely, and ignore selectors entirely while developing advanced layouts. If it so happens that there's an advanced layout which is amenable to certain kinds of selector traversal mapping onto things like "the first half of this tree", that's great; if not, well, fine.
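One way to picture the algorithm-vs-parameters split is a registry keyed only by algorithm name, with parameters read from the data at runtime. This is a minimal sketch, not a proposal for an actual API: `LAYOUTS`, `register`, `make_hamt`, `instantiate`, the `bucketWidth` field name and its default of 8 are all made up for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry: one entry per algorithm name,
# not one entry per parameter combination.
LAYOUTS: Dict[str, Callable[[dict], dict]] = {}


def register(name: str, factory: Callable[[dict], dict]) -> None:
    LAYOUTS[name] = factory


def make_hamt(params: dict) -> dict:
    # Parameters such as bucket width arrive with the data at runtime,
    # so the unbounded integer range never has to be registered.
    # The default of 8 is an arbitrary assumption for this sketch.
    width = int(params.get("bucketWidth", 8))
    if width <= 0:
        raise ValueError("bucketWidth must be positive")
    return {"algorithm": "HAMT", "bucketWidth": width}


register("HAMT", make_hamt)


def instantiate(descriptor: dict) -> dict:
    """Look up the algorithm by name, then pass its parameters through."""
    factory = LAYOUTS[descriptor["algorithm"]]
    return factory(descriptor.get("params", {}))


narrow = instantiate({"algorithm": "HAMT", "params": {"bucketWidth": 3}})
wide = instantiate({"algorithm": "HAMT", "params": {"bucketWidth": 256}})
```

Both calls go through the single "HAMT" registration; only the runtime parameters differ.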
I wrote up another issue with some broad thoughts about high level design rules I've had cooking in the back of my head for a while, particularly with an eye towards convergence properties in the hashes in all the stuff we generate, and that also gets into a bunch of this territory: #119
Ahhh, gotcha, ya I hadn’t realized yet that the schema definitions are quite literally just definitions and don’t present a clear definition that would be applied. In that case, I’d just change |
Right, but even more than that, you could imagine a large schema definition that had links throughout it (another reason I want to make sure we work with the parsed schema), linking to previously encoded definitions throughout. There are performance tradeoffs on either side of this: on one side you get better deduplication and caching, and on the other side you get better loading time when there is no cache state if you inline everything. Given that we don’t know all the performance tradeoffs an application might want to make, it’s best for us to just leave these open ended and think in terms of the data and its shape rather than where we expect things to link together.
I think this is getting to the divergence point between a JS implementation and a Go one (and friends of either down the road). In JS it's easy to imagine being able to switch out versions of things, essentially providing future-proofing by allowing an endless set of versions of the same thing. Version 1 of a HAMT schema takes one path; encounter version 2 of that schema, divert and take another; iterate for the next 25 years till we have 10 slightly different versions of a HAMT, each "better" than the previous, but all within reach if our loader wants to instantiate an IPLD form of them. In fact we could ship them all in the same codebase. With Go + codegen you're going to get into quite a mess if you need to cope with that kind of flexibility. I'm instinctively not keen on ever being able to say that this version of a collection we're shipping today is going to be the same type that you'll be using for the next 25 years! "Oh, did you show up with the codegen form of HAMT v2? Sorry, we're looking at a v1 now". How do you see addressing that kind of inflexibility into the future?
re that last comment about codegen, some relevant pieces from https://github.com/ipld/go-ipld-prime/blob/cd9283ddd86af15b9bc4a1f5b71f00fbfc2f8b94/doc/schema.md (which I hadn't seen till now, ooops!)
I'll have to ponder this more, maybe that's good enough of an answer. Still up for discussion is how to link to a schema: is it enough to just link to a CID or do you need to provide metadata to say what the entry point to the schema is? Is it enough to have the first type in the schema represent the parent entry point? My proposal for a Also the interaction between schema and the advanced layout logic, but it sounds like we all just have vague ideas about how that will work. So without further suggestions I'll just work through it with code, see what happens and come back with more concrete proposals.
I'm hoping in those cases, we'll have the codegen for golang emit a short snippet that's more or less a nicely-typed/nicely-autocompleting wrapper, and the wrapper will call out to library code which implements doing the HAMT-v$n logic in a very generic/weakly-typed way. In go-ipld-prime we already have very "generic"
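The wrapper idea can be sketched roughly as below. Python is used purely for illustration (the comment is about Go codegen), and `generic_map_get` and `StringToIntMap` are hypothetical names; the generic function is only a stand-in for real HAMT logic, which would hash the key and descend through nodes.

```python
from typing import Any, Dict, Optional


def generic_map_get(node: Dict[str, Any], key: str) -> Optional[Any]:
    """Stand-in for weakly typed library code shared by every wrapper."""
    return node.get("entries", {}).get(key)


class StringToIntMap:
    """What emitted code might look like: a thin, nicely typed wrapper."""

    def __init__(self, node: Dict[str, Any]) -> None:
        self._node = node

    def get(self, key: str) -> Optional[int]:
        # All real work is delegated; the wrapper only enforces the type.
        value = generic_map_get(self._node, key)
        if value is not None and not isinstance(value, int):
            raise TypeError(f"expected int for {key!r}, got {type(value).__name__}")
        return value


typed_map = StringToIntMap({"entries": {"a": 1}})
```

Under this split, supporting another HAMT version would mean swapping the generic function the wrapper calls, while the typed surface stays the same.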
Yeah, I like your schemaCID+entrypointType pairing idea. 👍
Do you mean the
If the schema itself is just IPLD, we could also use CID + Path to point to a specific type of the schema.
👍
We could, but A) |
Given that we do that "CID + Path" thing: Uniformity ("you want something out of IPLD, use a path") and less concepts to learn. |
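For what it's worth, "CID + Path" into a schema that is itself plain data could look something like the sketch below. Everything here is assumed for illustration: the `STORE` dict stands in for a block store, the CID is a fake string, and the `types/...` layout of the parsed schema is invented rather than taken from the schema spec.

```python
from typing import Any, Dict

# Hypothetical block store: fake CID strings mapped to already-decoded schemas.
STORE: Dict[str, Any] = {
    "cid-schema-1": {
        "types": {
            "FancyCollectionRootBlock": {"kind": "struct"},
            "FancyCollectionNode": {"kind": "struct"},
        }
    }
}


def resolve(cid: str, path: str) -> Any:
    """Walk a slash-separated path through the decoded data behind a CID."""
    node = STORE[cid]
    for segment in filter(None, path.split("/")):
        node = node[segment]
    return node


entry_type = resolve("cid-schema-1", "types/FancyCollectionRootBlock")
```

The same `resolve` call works for any other data, which is the uniformity argument: pointing at a schema type needs no concept beyond a path.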
Based on (#3 (comment) #83) we’ll eventually want a “Link Type” that is extensible the same way we are doing for collections. I can see us wanting to add some features to the language for expressing simple paths in order to make building those types easier. We could piggy-back on that expressive syntax in order to define the We keep throwing around this term
Here’s my (likely faulty and in need of intervention) mental model for the primary way of using schemas: I’ve been imagining that an IPLD block could optionally have a predictable property at its root, _schema or $schema, that has association information in it. Maybe the schema is in there or there’s a CID pointing to a schema. The latter is nice because you can use it to associate with known types.
Where:
CID -> the schema text itself
FancyCollectionRootBlock -> the Type in the schema that this block will map to
or embed the schema text in the block itself with:
A loader encounters that and can load the schema from CID or just read it from the inline definition and create an appropriately instantiated form of the block according to the schema.
So myNicelyShapedObject = Loader.Load(someCID) would: read the block, find a _schema and know how to turn the raw block layout into NicelyShapedObjectFromUglyBlockLayout form. Maybe it’s a struct from a tuple representation or one of the ugly string concatenation representations but now it’s a beautiful native object/butterfly.
Or, for the case of objects which have associated logic, like a multi-block collection that is useless to the user in its decoded+instantiated form, we have a way of “registering” logic with the loader. Maybe the loaders we ship (whatever that is, js-ipld-stack, go-ipld-prime, or some layer above) have some of our standard types already registered. A certain CID might point to a schema of our standard HAMT, so we register the logic for that HAMT into the loader so when it encounters that CID it knows it can not only decode+instantiate a nicely shaped block but it can hand it off to that logic for further processing or pass that logic back to the user for their own handling.
So multiblockMap = Loader.Init(someCID) would: read the block, find a _schema, see that it has a CID that’s been pre-registered with some bit of code (let’s call it “advanced layout logic”), decode and instantiate the block in its schema-defined form, pass it off to the “advanced layout logic” and give that back to the user. Maybe they get an object that they can do multiblockMap.Get(Key) on and it’ll do some cascading interaction between the loader and the “advanced layout logic” to traverse the tree in an appropriate manner, decoding+instantiating child nodes and figuring out how to traverse further down to Get what the user wants.
Maybe the “advanced layout logic” has to have some additional functionality associated with it to allow for interaction with selectors/paths? So you can traverse right into the block-spanning-data-structure with a selector, relying on the “advanced layout logic” that the loader knows about.
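To make the Load/Init flow a bit more concrete, here is a rough sketch in Python. None of it is an existing IPLD API: the `Loader` class, the shape of the `_schema` property (`cid`, `type`, `inline`), the `fields` list in the schema, and the fake CID strings are all assumptions, and the "advanced layout logic" is a toy that scans child blocks rather than a real HAMT.

```python
from typing import Any, Callable, Dict, Optional


class Loader:
    """Hypothetical loader; none of these names come from an IPLD library."""

    def __init__(self, blocks: Dict[str, dict]) -> None:
        self.blocks = blocks  # stand-in block store: fake CID -> decoded block
        self.logic: Dict[str, Callable[["Loader", dict], Any]] = {}

    def register(self, schema_cid: str, factory: Callable[["Loader", dict], Any]) -> None:
        self.logic[schema_cid] = factory

    def load(self, cid: str) -> dict:
        """Read a block and reshape its tuple data using the type named in _schema."""
        block = self.blocks[cid]
        ref = block["_schema"]
        # The schema may be linked by CID or embedded inline in the block.
        schema = ref["inline"] if "inline" in ref else self.blocks[ref["cid"]]
        fields = schema["types"][ref["type"]]["fields"]
        return dict(zip(fields, block["data"]))

    def init(self, cid: str) -> Any:
        """Like load, then hand the result to logic registered for the schema CID.

        Registration is keyed by schema CID, so this needs a linked schema.
        """
        schema_cid = self.blocks[cid]["_schema"]["cid"]
        return self.logic[schema_cid](self, self.load(cid))


class MultiblockMap:
    """Toy "advanced layout logic": entries are spread over linked child blocks."""

    def __init__(self, loader: Loader, root: dict) -> None:
        self.loader, self.root = loader, root

    def get(self, key: str) -> Optional[Any]:
        # Cascading interaction: ask the loader to decode each child in turn.
        for child_cid in self.root["children"]:
            entries = self.loader.load(child_cid)["entries"]
            if key in entries:
                return entries[key]
        return None


blocks = {
    "schema-cid": {"types": {"Root": {"fields": ["children"]},
                             "Node": {"fields": ["entries"]}}},
    "leaf-a": {"_schema": {"cid": "schema-cid", "type": "Node"}, "data": [{"k1": 1}]},
    "leaf-b": {"_schema": {"cid": "schema-cid", "type": "Node"}, "data": [{"k2": 2}]},
    "root": {"_schema": {"cid": "schema-cid", "type": "Root"},
             "data": [["leaf-a", "leaf-b"]]},
}
loader = Loader(blocks)
loader.register("schema-cid", MultiblockMap)
multiblock_map = loader.init("root")
```

In this sketch `load` only does the decode+instantiate step, while `init` additionally wraps the result in whatever was registered for the schema CID, so the user gets back an object with a `get` method instead of raw block data.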
This CID-registration thing would allow for the loading of custom things that span one or more blocks: a user might make their own FancyCollection that we don’t need to care about, they register the CID for its schema with the loader, Init() a root block and get back their collection. It would also allow versioning of the data types. Maybe we build a HAMT this year with a single elementMap and decide next year that we really should have gone with separate CHAMP dataMap and nodeMap elements, so the root and child node blocks will be different and the logic to traverse will be different. So we just register those two different schema CIDs with two different sets of logic and users will get back a Map with the same API but it loads and traverses blocks differently under the hood.
I know I at least have holes in this thinking relating to how IPLD is currently used today. I’m focused on the “get me a collection that I can interact with programmatically” case but maybe there’s a lot more (selector traversal?) and maybe that’s not even the right case to be thinking about.
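The versioning case might be sketched like this: two schema CIDs, two sets of logic, one API. The CID strings, class names and `init` helper are hypothetical, and the lookups are deliberately simplistic stand-ins (in particular, keying nodeMap by the first character of the key is made up and is not how a HAMT or CHAMP indexes children).

```python
from typing import Any, Dict, Optional, Type


class MapV1:
    """Toy logic for a layout with a single elementMap holding everything."""

    def __init__(self, root: dict) -> None:
        self.root = root

    def get(self, key: str) -> Optional[Any]:
        return self.root["elementMap"].get(key)


class MapV2:
    """Toy logic for a layout with separate dataMap and nodeMap."""

    def __init__(self, root: dict) -> None:
        self.root = root

    def get(self, key: str) -> Optional[Any]:
        if key in self.root["dataMap"]:
            return self.root["dataMap"][key]
        # Made-up child selection: index nodeMap by the key's first character.
        child = self.root["nodeMap"].get(key[:1])
        return child.get(key) if child else None


# Two schema CIDs (fake strings here) registered with two sets of logic.
REGISTRY: Dict[str, Type] = {"hamt-schema-v1": MapV1, "hamt-schema-v2": MapV2}


def init(block: dict) -> Any:
    """Pick the logic by the schema CID found in the block."""
    return REGISTRY[block["_schema"]](block["root"])


old_map = init({"_schema": "hamt-schema-v1", "root": {"elementMap": {"a": 1}}})
new_map = init({"_schema": "hamt-schema-v2",
                "root": {"dataMap": {"a": 1}, "nodeMap": {"b": {"bc": 2}}}})
```

Callers only ever use `get`; which traversal runs underneath is decided entirely by the schema CID in the block.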