
HIP 1009 - Native Schema Registry Service #1009

Open
wants to merge 15 commits into
base: main

Conversation

justin-atwell
Contributor

Description:
Base PR for Schema Registry HIP

Related issue(s):

Fixes #

Notes for reviewer:

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

@justin-atwell justin-atwell requested a review from mgarbs as a code owner July 13, 2024 21:53

@kantorcodes
Contributor

Just wondering - is the idea that a schema is essentially a blueprint for how NFT metadata or Topic Messages should be structured? Given a schema, will services enforce the schema when metadata or messages are submitted?

@justin-atwell
Contributor Author

justin-atwell commented Jul 14, 2024

> Just wondering - is the idea that a schema is essentially a blueprint for how NFT metadata or Topic Messages should be structured? Given a schema, will services enforce the schema when metadata or messages are submitted?

Long term, yes that's the direction. However, there are uncertainties with this HIP that should be mentioned: AVRO supports much more than just the "structure". It supports nullable fields, data validation, etc. This HIP is to provide Schema Definition now, and we will work with engineering to determine what we'll need to add on later.
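
For instance (illustrative only, not part of the HIP text), a nullable field in Avro is expressed as a union with null:

{
  "name": "middleName",
  "type": ["null", "string"],
  "default": null
}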

@kantorcodes
Contributor

> Just wondering - is the idea that a schema is essentially a blueprint for how NFT metadata or Topic Messages should be structured? Given a schema, will services enforce the schema when metadata or messages are submitted?
>
> Long term, yes that's the direction. However, there are uncertainties with this HIP that should be mentioned: AVRO supports much more than just the "structure". It supports nullable fields, data validation, etc. This HIP is to provide Schema Definition now, and we will work with engineering to determine what we'll need to add on later.

Thanks! Overall like where this is headed a lot. Keen to see more as it progresses :)

@mattsmithies

mattsmithies commented Jul 15, 2024

There will certainly be a significant benefit in binding schemas to the assets we generate. The challenge here is presenting the story of the system in a way that everyone understands the benefit and can communicate it effectively to others.

One of the underlying problems in many ecosystem verticals within the space is that data structure is not considered a first-class entity. Before HIP412 was created, everyone decided their own standards for how NFTs should be visible and interacted with. This siloed approach significantly reduced interoperability, meaning that systems could not communicate with each other in any meaningful way.

The main benefit here is to ensure the correctness of assets to schema and to have a shared resource defining what is considered valid data structure from an authoritative source.

Sustainability use cases are currently suffering greatly from this lack of standardization of schemas. This HIP will go a long way to bridging the gap for different Guardian instances, allowing them to access the same data structure to create digital environmental assets with more assurance.

HIP412 addressed the structure outline issue for NFTs, but it is still up to the individual developer/project team to ensure their code and assets actually conform to a standard, which still creates friction for adoption.

So, the question would be: is the eventual aim of this HIP to validate data based on a provided schema in real-time, whether for an NFT or a topic message, as part of the SDK?

In addition, this seems like a decentralized alternative to services like AWS Glue Schema Registry. Being able to not only have a standard resource for a particular schema but also track the actual version of a particular schema would be highly beneficial.

@Neurone
Contributor

Neurone commented Jul 16, 2024

Very cool idea!

I have the following suggestions/questions:

  • We could add an option when submitting an HCS message via TopicMessageSubmitTransaction or minting a token via TokenMintTransaction to require validation of the data against the schema at consensus level. We could set a new flag in the protobuf definition (default false) and in the SDKs (default false). Those transactions would cost more than normal HCS messages / mint operations, and those fees would cover the increased cost of the operations.
  • Decentralized schema management: this method would change the approach to schema management, which would become more decentralized/permanent web-oriented and less in control of the "schema administrator".
    • The schemaID should be unique, so I suggest it be computed at consensus level using a multihash; a sketch of this appears after this list. This way, we can be compatible with several P2P and content-addressable networks (e.g., IPFS CID v0). Side note: while I consider multihash and IPFS CIDv0 fairly agnostic, I cannot say the same for IPFS CIDv1, so I don't suggest considering that CID format.
    • If a user tries to create another schema with the same schemaID (i.e., the same content), the transaction succeeds but no changes are applied to the network.
  • While create and read are pretty clear functions, I think we need to expand the update and delete functions.
    • Update: If the schemaID is based on the content, anyone can create an updated version of the schema. It's up to the users to switch their NFT/HCS topic to follow that schema by setting the new schemaID with the NFT/HCS admin key.
    • Delete: what is the effect of deleting a schema? Can we do it even if there are still on-chain entities referencing it?
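
To make the content-derived schemaID idea concrete, here is a minimal sketch (not part of the HIP) of deriving an IPFS CIDv0-compatible identifier from the schema content, assuming Node.js and the bs58 package; schemaIdFor is a hypothetical helper name:

const crypto = require('crypto')
const bs58 = require('bs58')

function schemaIdFor(schemaJson) {
  // Hash the schema content; a real implementation would need a stricter
  // canonicalization than JSON.stringify so identical schemas always hash identically.
  const digest = crypto.createHash('sha256')
    .update(Buffer.from(JSON.stringify(schemaJson)))
    .digest()

  // multihash = <0x12 sha2-256> <0x20 length> <digest>; CIDv0 is its base58btc encoding
  const multihash = Buffer.concat([Buffer.from([0x12, 0x20]), digest])
  return bs58.encode(multihash)
}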

Answering @mattsmithies, but please @justin-atwell confirm, I would say full support in the SDKs is definitely part of the proposal.

@AdrianKBL

I’m really excited about this HIP and its potential to close the gap we currently have with Arweave and IPFS. Many users who initially used IPFS have transitioned or are considering transitioning to Arweave due to various reasons. Presently, in marketplaces, it is quite challenging to display filters for users regarding a specific NFT collection. We have to make a call to the Arweave or IPFS CDN for each NFT within that collection. Imagine doing this for a collection of 100,000 NFTs!

With this new approach, if I understand correctly, we could maintain the same structure of metadata pointers to Arweave/IPFS within the metadata field but leverage it through the Schema field via the Mirror Node. This would be a significant improvement.

As discussed above, we would need a mechanism to update the schema. Currently, Tokens can update their metadata via an Admin Key and NFTs via a Metadata Key. These keys should also be responsible for executing schema updates, depending on whether it is a fungible / non-fungible token or an individual NFT.

I see there’s discussion about schema validation versus the metadata field for tokens or NFTs. Will Hashgraph start using CDNs for Arweave, IPFS, etc.? How scalable is this approach? If not, how will the validation process work? Apologies if this is a basic question—I might have misunderstood this part.

Overall, I love this HIP, but more clarification on its impact on end-users would be great. Additionally, examples of creating an NFT using the basic standards of HIP-412 would be very helpful.

@mb-swirlds

> So, the question would be: is the eventual aim of this HIP to validate data based on a provided schema in real-time, whether for an NFT or a topic message, as part of the SDK?

@mattsmithies - It's part of a wider plan, as @justin-atwell replied to another message above, but the first stage is to create the registry and provide the ability to tie it to on-chain assets in a way that shows the relationship; this first stage would not enforce or validate anything on-chain. If that turns out to be needed functionality, which I think could eventually be very useful, we would need to design and implement it alongside the engineering teams.

@Neurone - Your questions around validated inserts are similar to the above and the same answer applies. Not in a first release implementation.

In terms of schema IDs, we haven't really fleshed that part out. If we were to use HCS for the storage, we had initially considered a format like x.x.x:r, where x.x.x is the HCS topic ID and r is the revision.

We also need to consider multiple schema formats and how that would be implemented, most likely as separate schema registry instances for each type (Avro, Protobuf, JSON, et al.).

In terms of lifecycle, it's more akin to CRVD (Create, Read, Version (forward-only updates), and Deprecate). We don't have, and probably wouldn't want, delete; instead, the schema instance would signify which records are active/current and which are deprecated/unused.

@mb-swirlds

Hi @AdrianKBL

> I’m really excited about this HIP and its potential to close the gap we currently have with Arweave and IPFS. Many users who initially used IPFS have transitioned or are considering transitioning to Arweave due to various reasons. Presently, in marketplaces, it is quite challenging to display filters for users regarding a specific NFT collection. We have to make a call to the Arweave or IPFS CDN for each NFT within that collection. Imagine doing this for a collection of 100,000 NFTs!
>
> With this new approach, if I understand correctly, we could maintain the same structure of metadata pointers to Arweave/IPFS within the metadata field but leverage it through the Schema field via the Mirror Node. This would be a significant improvement.

So our initial thinking is that you can attach a schema (which is really independent of the token and could be a single instance used by all NFTs that want to adhere to that standard), and when you come to read the metadata referenced in IPFS/Arweave, you can use that schema to validate/read the metadata in a programmatic way.

> As discussed above, we would need a mechanism to update the schema. Currently, Tokens can update their metadata via an Admin Key and NFTs via a Metadata Key. These keys should also be responsible for executing schema updates, depending on whether it is a fungible / non-fungible token or an individual NFT.

Schemas are orthogonal to the lifecycle of an NFT and would be managed separately. The admin key might be the same for both, but you would manage your schemas independently of the token. As we said above, in the first iteration the parts will be loosely coupled, in the sense that we don't want to make widespread changes across a lot of services with a large impact. If we see adoption and usage increase, we can figure out the next stages of integration.

> I see there’s discussion about schema validation versus the metadata field for tokens or NFTs. Will Hashgraph start using CDNs for Arweave, IPFS, etc.? How scalable is this approach? If not, how will the validation process work? Apologies if this is a basic question—I might have misunderstood this part.

Not entirely sure about this question but any validation in this current implementation would be done by the user on their own hardware. We have no plan for any kind of CDN use/access either in this HIP.

> Overall, I love this HIP, but more clarification on its impact on end-users would be great. Additionally, examples of creating an NFT using the basic standards of HIP-412 would be very helpful.

In its most simple sense you would do this (a rough sketch follows the list below):

  1. Encode the schema for your NFT metadata using JSON Schema. Save and upload it to the schema registry and keep the identifier (exact format TBC, but current thinking is something like x.y.z@r, which relates to the HCS ID and revision).
  2. When you create the token via the SDK, there will be new options in the builder API to add this schemaRegistryID to the token; you can later amend this value (for example, to change the revision number).
  3. When it comes time to read the metadata and check for a schema, you would request the schemaRegistryID (if any) associated with the token.
  4. You then retrieve the schema instance via the SDK/API. At this point you could cache it on your own infrastructure for later use.
  5. You then retrieve your metadata files from Arweave/IPFS.
  6. You can then validate the metadata, read it into an object in your language of choice, or store it where appropriate.
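
A rough sketch of how steps 1-6 might look from a developer's point of view, purely illustrative: setSchemaRegistryId, getSchemaRegistryId and getSchema are hypothetical names (no such SDK or mirror-node APIs exist today), while the rest uses the existing @hashgraph/sdk token builder.

const { TokenCreateTransaction, TokenType } = require('@hashgraph/sdk')

async function createTokenWithSchema(client, schemaRegistryId) {
  // Step 2: attach the schema registry identifier (e.g. "x.y.z@r") when creating the token
  // (treasury account, supply key, etc. omitted for brevity)
  const response = await new TokenCreateTransaction()
    .setTokenName('My NFT Collection')
    .setTokenSymbol('MYNFT')
    .setTokenType(TokenType.NonFungibleUnique)
    .setSchemaRegistryId(schemaRegistryId) // hypothetical builder option proposed by this HIP
    .execute(client)
  return response.getReceipt(client)
}

async function readAndValidate(mirrorClient, tokenId) {
  // Steps 3-4: look up the schemaRegistryId associated with the token, then fetch and cache the schema
  const schemaId = await mirrorClient.getSchemaRegistryId(tokenId) // hypothetical
  const schema = await mirrorClient.getSchema(schemaId)            // hypothetical
  // Steps 5-6: fetch the metadata from Arweave/IPFS and validate it against the schema (user code)
}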

I hope that makes sense, can expand on this further or answer any other questions you may have.

@AdrianKBL

AdrianKBL commented Jul 22, 2024

Hello @mb-swirlds, thanks for the answers, although I still have some questions.

> So our initial thinking is that you can attach a schema (which is really independent of the token and could be a single instance used by all NFTs that want to adhere to that standard), and when you come to read the metadata referenced in IPFS/Arweave, you can use that schema to validate/read the metadata in a programmatic way.

In this case, what would be the benefit of using schemas? Imagine I create a schema for all the NFTs under TokenID 0.0.1, but then decide to use a different structure for each NFT in the metadata field. What happens if the schema and metadata don't match? What is the expected behavior for DApps and wallets in this scenario?

What is the difference between reading directly from the metadata field and getting the information from Arweave/IPFS, compared to reading from the schema first and then from Arweave/IPFS? Isn't this redundant?

Additionally, what happens if each NFT has a different structure within the same TokenID?

How will we differentiate schemas related to tokens (fungible or non-fungible) and schemas related to NFTs? How will we attach different schemas to different NFTs within the same TokenID?


Let me know if there are any other questions or clarifications you need!

Example:

Fungible Token Metadata / HIP-405

{
  "creator": "Rafa",
  "description": "KBL Tests",
  "lightLogo": "ar://7tnJ1AxxrVQ0IjP0RnfVtPFOlTHGxEJs-u2XsFPQiWw",
  "lightLogoType": "image/jpeg",
  "darkLogo": "ar://7tnJ1AxxrVQ0IjP0RnfVtPFOlTHGxEJs-u2XsFPQiWw",
  "darkLogoType": "image/jpeg"
}

Non Fungible Token Metadata / HIP-766

{
    "description": "description of NFT Collection - max. of 500 characters - RECOMMENDED",
    "creator": "creator(s) - RECOMMENDED",
    "website": "link to website -  OPTIONAL", 
    "discussion": "link to discussion/discord -  OPTIONAL", 
    "whitepaper": "link to whitepaper -  OPTIONAL",
    "properties": {
        // arbitrary additional JSON data relevant to the token - OPTIONAL
    },
    "socials": [ // Array acting as a container for social links
        {
            "url": "link to social - REQUIRED",
            "label": "textual identifier for social url - REQUIRED",
            "info": "additional information about the social URL - OPTIONAL"
        }
    ],
    "lightLogo": "IPFS CID or path to the token's light background logo file - RECOMMENDED",
    "lightLogoType": "mime type - i.e. image/jpeg - CONDITIONALLY OPTIONAL",
    "lightBanner": "IPFS CID or path to the token's light banner file - RECOMMENDED",
    "lightBannerType": "mime type - i.e. image/jpeg - CONDITIONALLY OPTIONAL",
    "lightFeaturedImage": "IPFS CID or path to the token's light featured image file - RECOMMENDED",
    "lightFeaturedImageType": "mime type - i.e. image/jpeg - CONDITIONALLY OPTIONAL",
    "darkLogo": "IPFS CID or path to the token's dark background logo file - RECOMMENDED",
    "darkLogoType": "mime type - i.e. image/jpeg - CONDITIONALLY OPTIONAL ",
    "darkBanner": "IPFS CID or path to the token's dark banner file - RECOMMENDED",
    "darkBannerType": "mime type - i.e. image/jpeg - CONDITIONALLY OPTIONAL",
    "darkFeaturedImage": "IPFS CID or path to the token's dark featured image file - RECOMMENDED",
    "darkFeaturedImageType": "mime type - i.e. image/jpeg - CONDITIONALLY OPTIONAL"
}

NFT Metadata / HIP-412:

{
  "name": "SIWA",
  "creator": "Kabila",
  "description": "SIWAS are Kabila's main PFP-type NFT Collection. They are little creative beigns of the desert, digital nomads and noble defenders of Web3. The SIWA Oasis is their main meeting point, where they use Plazas to organize themselves, finance their ideas, make decisions and thrive in community. Loyal followers of Dr. Leemon Baird, they travel the world gossiping the word of Hashgraph.",
  "type": "image/png",
  "image": "ipfs://bafybeif6uznua7a2myuwlhfalvo65ozd2jfkfwh5zvj5buzs2ohaazu42y/SIWA11.png",
  "attributes": [
    {
      "trait_type": "Batch",
      "value": "#3 HANGRY Edition"
    },
    {
      "trait_type": "Background",
      "value": "Djanet Desert"
    },
    {
      "trait_type": "Body",
      "value": "Golden"
    },
    {
      "trait_type": "Feet",
      "value": "Barefoot"
    },
    {
      "trait_type": "Accesory",
      "value": "GM Cup"
    },
    {
      "trait_type": "Clothing",
      "value": "Naked"
    },
    {
      "trait_type": "Mouth",
      "value": "Yellow Bandana"
    },
    {
      "trait_type": "Lenses",
      "value": "Gold Wide"
    },
    {
      "trait_type": "Head",
      "value": "Bald"
    }
  ]
}

So, how would the schema proposed by this HIP look for a Non-Fungible Token (NFT) + Metadata and its associated NFTs + metadata?
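
(For illustration only, and assuming JSON Schema ends up being one of the supported formats, which is not confirmed, a registry entry describing the HIP-412 structure above might look roughly like this:)

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "HIP-412 NFT metadata (illustrative)",
  "type": "object",
  "required": ["name", "image", "type"],
  "properties": {
    "name": { "type": "string" },
    "creator": { "type": "string" },
    "description": { "type": "string" },
    "type": { "type": "string" },
    "image": { "type": "string" },
    "attributes": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["trait_type", "value"],
        "properties": {
          "trait_type": { "type": "string" },
          "value": { "type": "string" }
        }
      }
    }
  }
}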

Thanks again for the answers... I'm still trying to understand well the benefits of this.

@mattsmithies

mattsmithies commented Jul 22, 2024

@mb-swirlds thank you for the response. Returning to the validation discussion, I want to provide further context to help shape this HIP. 😇

In retrospect, I completely understand that on-chain validation is, at this point, completely up in the air, and there are a huge number of dependencies outside of core Hedera tech that would need to be considered.

Since all clients using the registry or other HAPI services will always have complete flexibility in where the data referring to schemas is stored, on-chain validation doesn't make sense; imagine the nightmare of Hedera having to pull from an IPFS/Arweave node, which would naturally yield unknown resource allocation.

Naturally, as the creation of metadata for NFTs is normally a separate process from the creation of a given token, it doesn't necessarily make sense for coupling of data and token to take place at this point; it would be a distinct design change from what the ecosystem has been working with.

However, I highly recommend that there be validation within the schema registry SDK itself as part of v1. This rationale isn't driven by the needs of myself or my team, but by the greater ecosystem - especially those who expect additional support or, above all, accessibility within Hedera technology.

Allow me to spin some yarn, a tale of sorts.

The current state of HIP412 structure and expectations

If we consider the structure of HIP412, developers are expected to conform to this given structure to produce an asset that is visible within a wallet in the ecosystem. However, wallets themselves often add failsafes to display assets that are not strictly to spec. For instance, Hashpack can detect whether a piece of media is a video or 3D object (and render accordingly) without necessarily describing the correct MIME type.

While Hashpack, in this case, is UX-focused on end users, this in itself is, in my opinion, indicative of far-reaching implications around the non-essential "busy work" expected of developers, resulting in ecosystem-wide resource waste.

Personally, I've encountered this issue and addressed it in HIP412 with my open-source libraries by creating validation at the API level (createMetadataRequest.js). This mechanism provides a way for anyone to validate HIP412 metadata to ensure that a produced asset will render in a wallet, at least structurally.

Adding validation isn't necessarily difficult, but it becomes more complex with every new standard or structure. Developers are expected to:

  • Understand a given problem domain and search for a related HIP.
  • Find the related structure of a particular standard.
  • Have enough knowledge/training to create a validation for a particular standard in their language of choice.

Optionally: share this work with others to use as a resource, which rarely happens, as caffeinated developers high on hubris just want to build and tend to repeat the efforts of those before them.

This effort may have to be duplicated for any given standard, even for the schema registry-based standards.

This is one reason why I believe having general validation inside of v1, with a corresponding demo, would go a long way and could potentially trigger an update of HIP412 itself to be stricter about child structures.

Understanding Sustainability with the Guardian

We all know that the schema registry within this context would be vital to enable an authority to produce and publish a schema that can be consumed with assured quality for asset creation. Within the context of Guardian, there are deeper functionalities such as the concept of a "schema tree," which provides the ability to display all nested schemas visually.

Below is an image of the expected data requirements for a monitoring report for ACM0001 - flaring or use of landfill gas. This work, digitized through Guardian/Envision and validated by UNFCCC/CDM authorities, underscores the need for general validation methods. Creating and validating such detailed schemas would otherwise be highly resource-intensive. @anvabr @dubgeis et al

[Image: Monitoring Report schema tree]

Viewing this image highlights the necessity of a general validation method. In order to validate data for such a schema, a general function would be required. Coupled with the rapid growth of the Guardian ecosystem, there will be dozens, if not hundreds, of policies on the sustainability side that will be digitized. This will become a larger issue 12 to 18 months down the line, especially as the schema registry becomes a core part of this functionality.

At @dovuofficial, we foresee this coming and are preparing to create general validation methods for sustainability. However, this expectation will likely be pushed onto the schema registry and will have significant implications. Preparing for a thin layer of validation, potentially opinionated at this point, would yield more benefits downstream.

Alternatively, each product team working with the schema registry would need to develop their own schema validation methods on a use case or general basis. As noted with reference to NFTs/HIP412, this would result in an invisible ecosystem-wide wastage of resources.

An approach to v1 validation

If we take the stance that the creation of metadata (persisted with any data persistence provider) and the creation of a particular asset will always be decoupled, we can add optional validation of the metadata itself before it is sent to a data provider to generate a CID.

Using an example from createMetadataHandler.js:

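// createMetadataRequest (a helper from the author's open-source library) validates the payload
// against the HIP-412 structure and returns any validation errors, or a falsy value when the
// payload is valid.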
const validationErrors = createMetadataRequest(req.body)

if (validationErrors) {
  return Response.unprocessibleEntity(
	  res,
	  validationErrors,
	  Language.ensureNftStorageAvailable.meta
  )
}

// Example of a third party storage provider.
const cid = await FilebaseStorage.storeData(req.body)

And expanding createMetadataRequest into a general "validate" function that checks the data/payload against a schema registry element:

{
  "Schemas": [
    {
      "UltraSchema": {
        "type": "record",
        "name": "UltraSchema",
        "namespace": "com.UltraSchema.avro",
        "fields": [
          {
            "name": "firstName",
            "type": "string"
          },
          {
            "name": "lastName",
            "type": "string"
          }
        ]
      }
    }
  ]
}

Using the mirror node, one could simply recurse down the fields to n levels deep. Here is some pseudocode for validation:

function validate(data, schema) {
  const keys = Object.keys(data)

  // Check all keys exist in schema (there might need to be a requirement to define "required" keys in schema)
  checkKeys(data, schema)

  // Produce a structure like { firstName: { type: "string" } }, to reduce iteration loops
  const mapped = mapSchemaFields(schema)

  // Iterate through keys; every field must pass for the data to be valid
  return keys.every(key => {
    const child = mapped[key]
    const element = data[key]

    // Nested records are validated recursively
    if (child.type === "record") {
      return validate(element, child)
    }

    // Basic validation: compare the runtime type against the schema's primitive type
    return typeof element === child.type
  })
}

This example outlines a possible process. The function itself could either throw an error or return a boolean value, depending on the developer's preference. Ultimately, it would be up to the developer to attach compliant data to the provided asset.

In summary, implementing a thin layer of validation in the schema registry SDK v1, even if opinionated, could yield significant downstream benefits. It would reduce the need for individual product teams to develop their own validation methods, preventing resource wastage and ensuring ecosystem-wide consistency.

@AlexIvanHoward

This will be very powerful, especially if combined with JSON-LD.

@mgarbs mgarbs changed the title from "Hip 994 - Native Schema Registry Service" to "HIP 1009 - Native Schema Registry Service" on Jul 29, 2024
@mb-swirlds

Sorry for the radio silence on this one.

We are definitely considering an approach that gives you some kind of standard way to validate against the schema(s); how this will work in reality is still up for discussion.

We are considering a number of options: utilising schemas off-chain in an SDK you could use to validate your NFT metadata as a preflight check, and also as an SDK you could wire into your existing code base for reading and validating.

As for on-chain capabilities, I don't think we would want to do that yet, until we have worked with the community to see whether the schema registry would be adopted and used widely. There are many use cases where a public repository like this would add value to the ecosystem or a vertical by making the data available for wider use.

In terms of the validation itself, there is a lot of prior art in Avro and other toolchains already out there, and we really want to tap into that ecosystem versus trying to reinvent the wheel with some bespoke schema validation framework. Once you can read the schema of your choice, you would pick a standard validation library and use the schema against your data.
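
As a hedged illustration of that last point, once the schema has been fetched (by whatever mechanism the registry ends up exposing), validation could lean on an existing library such as avsc for Avro; fetchSchema here is a placeholder for the registry lookup:

const avro = require('avsc')

async function validateAgainstRegistry(fetchSchema, schemaId, data) {
  const schemaJson = await fetchSchema(schemaId)  // placeholder for the registry / mirror-node lookup
  const type = avro.Type.forSchema(schemaJson)    // standard avsc API for parsing an Avro schema
  return type.isValid(data)                       // true if the data conforms to the schema
}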

@AdrianKBL - as for your questions around an NFT creator's choice to change or make metadata that doesn't adhere to a standard, I don't know how you could ever solve that problem. If they are using a storage system of their own choosing and create these files of their own accord, you could never really stop them putting whatever they want in them, and this is the same problem on many other chains as well. With the schema registry you could set a schema/standard you wish to use for your metadata and validate the files preflight, before they are committed to IPFS or similar, but again this is on them to do. I'd be open to hearing how you think this could otherwise be solved.

Thank you everyone for the thoughtful answers/ideas/suggestions as well.

I'm sure there will be a few more questions post this so we will do our best to answer them. 🤞

@mb-swirlds

We are awaiting some internal review of the approach, and @ty-swirldslabs will hopefully have an update soon on the agreed approach for the first iteration/version of this HIP.
