Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storing and querying CUE in database #1435

Open
codewithcheese opened this issue Dec 16, 2021 · 10 comments
Open

Storing and querying CUE in database #1435

codewithcheese opened this issue Dec 16, 2021 · 10 comments
Labels
FeatureRequest New feature or request

Comments

@codewithcheese
Copy link

codewithcheese commented Dec 16, 2021

At balena.io we make extensive use of manifest type files we call contracts, they are a mix of data and schema, and we have an internal database for storing and querying contracts. Contracts are stored in JSONB fields and schema is defined using JSONSchema. I opened a discussion about this see #1250.

We are keen to adopt CUE for contracts for a few reasons. Mixing data and schema with CUE is more natural and much less verbose than JSONSchema. CUE has some nice ergonomic improvements over plain JSON (see #130), like no requirement to quote property names and inline objects with a single property. Also, contracts can be evaluated together to validate compatibility, solve constraints and generate new output formats and CUE has some nice properties for evaluation vs JavaScript with JSON/JSONSchema.

As far as I know there is no way to encode and store complex CUE files include expressions and literals as data, so that concrete values can be queried and resulting files can be decoded back to CUE and expressions evaluated.

CUE to JSONSchema is an option, but it would limit encoding and decoding to what can be expressed as JSONSchema.

I propose an (en|de)coder that can convert any arbitrarily complex CUE file to a plain object format, with concrete data as literals and CUE expressions serialized to string and appended as a special properties for later decoding. Internally I am calling this format CUEdata but I am totally open to any name, such as those mentioned in #130, QSON or CUEL.

We do still have a need for JSONSchema, contracts can also include queries that are defined using a schema, we have a JSONSchema to SQL compiler. I propose an attribute could be used to control how a struct is encoded. When a format attribute is specified; CUE expressions are encoded to that format instead of being serialized to a syntax string.

cc @myitcv @mpvl

@codewithcheese codewithcheese added FeatureRequest New feature or request Triage Requires triage/attention labels Dec 16, 2021
@mpvl
Copy link
Member

mpvl commented Dec 16, 2021

Internally there is a definition of data-only CUE either fully expanded (data) and allowing references only (graph).

There is no file extension for these, but you can invoke them by specifying output or input types. For instance,

cue def foo.cue --out cue+data

is really like cue export.

Similarly

cue eval data: foo.cue

checks that the input has no references or expressions, while

cue eval graph: foo.cue

checks that the input has no expressions other than literals and references.

How is this different from what you want? Can you give examples of how this is different?

@codewithcheese
Copy link
Author

codewithcheese commented Dec 16, 2021

@mpvl I see I was not clear enough. I have updated the problem statement:

As far as I know there is no way to encode and store complex CUE files include expressions and literals as data, so that concrete values can be queried and resulting files can be decoded back to CUE and expressions evaluated.

We want to store CUE expressions alongside literals, so when the data is retrieved it can be decoded back to CUE and evaluated. A JSONSchema encoder world allow for the encoding of a subset of expressions, the format I am proposing would allow for encoding all CUE expressions.

For example:

name: "example"
author: "codewithcheese"
uri: "\(envVars.HOSTNAME):\(envVars.PORT)"
envVars: {
    HOSTNAME: string
    PORT: number
    CONSTANT: 1
}

Would be encoded as follows, which can be stored in JSON field and queried for any literal like name, author, envVars.CONSTANT

name:   "example"
author: "codewithcheese"
envVars: {
	CONSTANT: 1
	$$cue:    "HOSTNAME: string\nPORT: number"
}
$$cue: "uri: \"\\(envVars.HOSTNAME):\\(envVars.PORT)\""

The result could be decoded and unified with other results that specify values for HOSTNAME and PORT and the uri expression evaluated.

JSONSchema encoder could encode envVars, but it cannot handle more complex CUE expressions like uri (if there is a way to do string interpolation that I am not aware of then imagine the expression is more complex).

See testdata in the PR for more examples.

@verdverm
Copy link

@codewithcheese what about storing the concrete values which can be queried for in a jsonb column and storing the original CUE source in a second, text column?

@codewithcheese
Copy link
Author

@verdverm that's an option. I think that's a more complex solution than using an encoding. Both options (encoding or separate field) require a change to our data handling, however a separate field also requires a migration.

It would also mean the concrete values could get out of sync with the original source if an update was applied to one of the concrete values. Might require some extra logic. We're not likely to apply updates to the concrete values alone but its a potential pitfall down the track.

@mpvl
Copy link
Member

mpvl commented Dec 17, 2021

The use case is still mostly unclear to me, but I see what your are proposing.

The semantics of this seems somewhat peculiar and specific to a certain job. There are also many aspects to consider, which would make this a massive design undertaking, as far as I can tell.
For instance, why $$cue and why not attributes? Why not store it as JSON? If this is because of easy of parsing, why not have an interpreter that simply treats any non-number or non-string as an opaque expression, or drops it from the input? That would allow a relatively simple parser for regular CUE without any rewriting (note that the CUE lexer+parser is already considerably simpler than a YAML lexer+parser) . There seem to be too many specific design choices related to a custom case to warrant making this a standard thing.

Instead, it seems that users can fairly easily writing something themselves that converts an evaluated CUE value to such a format, tailored to their purpose. Alternatively, one could just store CUE as is, and have a small tool to strip any expressions from the AST before evaluation (see ast/astutil). This would just be a few lines of code and be a cleaner alternative, IMO.

So unless someone comes up with a good argument that this is a generally useful, with good justifications why a design should be exactly as proposed, and come up with an easy set of principles from which a design flows naturally, this does not seem to be something that belongs in core.

@codewithcheese
Copy link
Author

codewithcheese commented Dec 17, 2021

@mpvl I think there is still a misunderstanding, or more likley there is something about CUE I am totally misunderstanding. I am keen to continue the discussion to get on the same page if that's ok.

I am fine for this not to be included in core, we can use it as a separate tool (if it still makes sense). My intention with the PR is to see if it something that is generally useful.

For instance, why $$cue and why not attributes? Why not store it as JSON?

$$cue is only mean to function as a field label, for a field with a syntax string value. It's not a new special identifier and only has meaning to the encoder. "CUEdata" is not really meant to be used as CUE but encoded as JSON or YAML. The goal is to make an arbitrary CUE file concrete by rewriting expressions as syntax strings while maintaining structs and their literal values, that way it can be exported as a data format like JSON and stored as structured data in a DB.

As far as I can see attributes are ignored when exporting to JSON, the purpose is to do the very opposite, to maintain the expressions, as syntax strings, in JSON. That way can they be decoded and evaluated later.

Alternatively, one could just store CUE as is, and have a small tool to strip any expressions from the AST before evaluation (see ast/astutil). This would just be a few lines of code and be a cleaner alternative, IMO.

As far as I understand CUE is not "code as data", unlike a Lisp for example. How would you store CUE file containing literals and expressions "as is" in a database so that the literal values may be queried?

@verdverm
Copy link

verdverm commented Dec 17, 2021

How would you store CUE file containing literals and expressions "as is" in a database so that the literal values may be queried?

The answer for CUE is probably similar to other unsupported formats like Yaml or Dhall.

  • convert to queryable / concrete fields to JSON, as this is the native and supported query format for the database, akin to my two column suggestion, an application layer solution
  • write custom DB functions or plugins (for example, https://www.postgresql.org/docs/12/server-programming.html)

@codewithcheese
Copy link
Author

@verdverm seems you understand what I am getting at, what do you consider to be the pros and cons of encoding to JSON with embedded syntax strings vs 2 column solution?

@verdverm
Copy link

verdverm commented Dec 17, 2021

@codewithcheese the embedded idea seems arbitrary and hacky imho. I generally agree with @mpvl that something like this would require a lot more flushing out.

  • the two column solution is more maintainable to me, no need for an extra codec, just write a lib function to avoid the issues you were concerned with
  • the semantics is weird ($$cue: string nested everywhere?), it should at a minimum have a different file extension
  • You've expressed resistance to refactoring your application in other issues / threads. I would personally start with rethinking its design before trying a custom codec (embedding), internally or through CUE. Personally, I wouldn't be putting CUE in a database in the first place. I'd be working towards the Proposal: package management #851 mindset, i.e. using modules & imports, though I'm not familiar with the application

I would think there might be something around doing this more generally within a CUE value and then being able to support this in both Yaml and JSON. Still, this would seem better as a library than in core.

@codewithcheese
Copy link
Author

@mpvl it seemed there may of been some misunderstanding of the design I proposed, I tried to clarify a few messages above. Is there anything else you would like to add before we close this issue?

@rogpeppe rogpeppe removed the Triage Requires triage/attention label Apr 14, 2022
@myitcv myitcv added the zGarden label Jun 15, 2023
@mvdan mvdan removed the zGarden label Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FeatureRequest New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants