Database Access RFC #161

Draft: wants to merge 3 commits into base: main
160 additions & 0 deletions to `rfcs/002 - Database Access.md`
# 002 - Database Access RFC

## Current Status

### Proposed

2023-08-02

### Accepted

YYYY-MM-DD

#### Approvers

- Luis de Pombo <luis@alantechnologies.com>
- Alejandro Guillen <alejandro@alantechnologies.com>

### Implementation

- [ ] Implemented: [One or more PRs](https://github.com/alantech/marsha/some-pr-link-here) YYYY-MM-DD
- [ ] Revoked/Superseded by: [RFC ###](./000 - RFC Template.md) YYYY-MM-DD

## Author(s)

- David Ellis <david@alantechnologies.com>

## Summary

LLMs are good at foundational things: loops, branches, recursion, etc. These haven't really changed since almost the beginning of computing, and LLMs use them well. They also do well on widely-used, stable libraries. But the specifics of the exact problem at hand need to be fully described, or they have no hope of working. Making queries to a database and then using the query result to generate some output *can* work with Marsha as-is -- if you're willing to write the SQL query for it. The moment you simply describe the query you want, it will fail more often than not, because it doesn't know what your database schema is; it will just guess a schema that seems reasonable for the requested query, and then fail.

We have already demonstrated a way to handle this correctly with [SQL Pal](https://github.com/alantech/sqlpal), so we could integrate much of that work into the Marsha compiler to handle the schema problem better. But we likely need new syntax to support this. Querying existing databases with externally-defined schemas might be doable without new syntax, but it would be reasonable for Marsha applications to define their own database for their needs, especially Marsha applications meant to run as a web server with different REST endpoints to accomplish different tasks.

## Proposal

There are only alternatives right now as we dig into the possibilities here. Ideally, the solution should:

1. Declare what database will be used by the application, whether a local sqlite file, a remote Postgres/MySQL/etc server, or perhaps something more esoteric like MongoDB, Redis, etc.
2. Declare whether the connection is read-only or read-write, so it *never* tries to mutate a read-only database connection.
> **Contributor:** maybe this is a complexity we do not need to address now? we might rely on the db user and roles management and the credentials given to connect with Marsha
>
> **Member Author:** So, I agree that it would be a later feature that may not be implemented for a while, but I don't think relying on credential permissions is the right approach here: how often were you given a separate set of read-only credentials for a database in a company when you were allowed write access to said database?
>
> Basically, there can be times when you want to reduce the "blast radius" but have to do it by being careful instead of relying on the DBA to be responsive :) This is basically telling Marsha to be careful. ;)

3. Allow the user to define the schema within Marsha if the database is *owned* by the Marsha application (one level above read-write, as it should ideally also automatically migrate the database on startup).
4. When run as a web server, it should use a singleton connection instead of creating a new connection per request, since many databases handle large connection volumes poorly. (Likely the CLI approach could use the same mechanism, though it wouldn't matter there)
5. Multiple databases should be allowed in the same Marsha application (the syntax shouldn't be a singleton type), so there needs to be a clear way to associate schemas and queries with a particular database.
6. When possible, get the answer from the database during compilation, instead of making the user explicitly spell it out in Marsha text.

### Alternatives Considered

#### Fully separate syntax

A wholly separate syntax added to Marsha: `database`, `query`, and `schema` blocks. The example syntaxes below are vaguely based on the syntax explanations [Postgres uses](https://www.postgresql.org/docs/current/sql-select.html): `{something}` is a value the user needs to provide and `[ something ]` is an optional part of the syntax. Unbracketed/unbraced words are keywords. Outside of the more rigid syntactic areas, the example data is just that -- example data.

```md
# database {name} {url or path} [ro | rw]
```

Declares the database to be used, names it, and specifies the server URL or the path to the sqlite file. Optionally specifies read-only or read-write, defaulting to read-only.
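For example, a read-write connection to a local sqlite file (the name and path are invented for illustration):

```md
# database users_db ./users.sqlite rw
```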

```md
# query {name}(args) using {dbname}

Query description

---
return column 1, return column 2, return column 3
example 1 value 1, example 1 value 2, example 1 value 3
example 2 value 1, example 2 value 2, example 2 value 3
example 3 value 1, example 3 value 2, example 3 value 3
```

Similar to a function, but specifying which database it is connected to with `using {dbname}` and the return type is a custom type defined at the end after a horizontal rule `---`.
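A hypothetical instance of the query syntax, with the database name, description, and result rows invented for illustration:

```md
# query get_adults(min_age) using users_db

Get the name and age of every user whose age is greater than or equal to min_age, ordered by age descending.

---
name, age
Jane Doe, 28
John Doe, 27
```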

```md
# schema {name} for {dbname} {example.csv}
## index [{name} columns] {column name 1, column name 2}
## join {column name} = {schema name}.{column name}
```

```md
# schema {name} for {dbname}
column 1, column 2, column 3
example 1, example 2, example 3
example 4, example 5, example 6
## index [{name} columns] {column name 1, column name 2}
## join {column name} = {schema name}.{column name}
```
> **Comment on lines +76 to +83**
>
> **Contributor:** is this a usage example? I think I do not fully follow the `## index` and `## join` syntax
>
> **Member Author:** So it's sort of an unholy mixture of SQL and Markdown. Declare an index and optionally name it; if you name it, include the `columns` keyword and then put the comma-separated column names after that.
>
> Similarly for declaring a join column. It's not as flexible as real SQL, and this is a negative of this approach, because any feature in any database we support may be requested by a user to be added, further bulking up the complexity of this syntax.


```md
# schema {name} for {dbname} using type {type name}
## index [{name} columns] {column name 1, column name 2}
## join {column name} = {schema name}.{column name}
```

Schemas are like `type`s, but with index, join, and dbname directives included. There are three variants for the schema, the first two matching the variants for declaring a type and the third one for just deriving a schema from a type. Similarly there should be a way to derive a type from a schema:

```md
# type {name} from schema {dbname}.{type name}
```
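For concreteness, a hypothetical schema using the second (inline data) variant; all names and data are invented for illustration, and the join assumes a separate `orders` schema exists:

```md
# schema users for users_db
name, age, enjoys_udon
John Doe, 27, true
Jane Doe, 28, false
## index users_by_name columns name
## join name = orders.user_name
```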

Queries would be used by Marsha functions by referencing them in the description; the query definition would then be included with the function generation, folding it into the function. So that different functions don't get slightly different versions of the query, there should be a pre-pass to generate the query, which would then be embedded into the functions as a SQL template given to the LLM.

#### Embedded syntax

Databases would still be declared top-level, and the `# schema` type would also exist, but queries would live in a new `## query` subsection of a function instead. This would have to come after the description section but before the function examples, so the query would be defined *after* its "use" in the description, which might be a bit weird; but if it were defined before the description, we would need a `## description` sub-heading to make things clear again.
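A sketch of what that could look like; the function, database name, and example data are invented for illustration:

```md
# func get_adults(min_age): list of user objects

This function returns every user at least min_age years old, using the query below.

## query using users_db

Get the name and age of every user whose age is greater than or equal to min_age.

* get_adults(21) = [{"name": "John Doe", "age": 27}, {"name": "Jane Doe", "age": 28}]
```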
> **Contributor:** yes, not sure if adding complexity to the function definitions makes sense
>
> **Member Author:** It is a path to consider, though, since it reduces the number of top-level things to learn.


#### References syntax

Rather than coming up with a special syntax for databases or for websites to mock (the [other RFC being worked on in parallel](https://github.com/alantech/marsha/pull/159)), having a references block with a listing of references [in the Markdown style](https://www.markdownguide.org/basic-syntax/#reference-style-links) that can be referred to by number within the function blocks may provide a similar benefit, though it is much more reliant on the LLM to make sense of the provided data.

It would still require us to connect to the URL/database/etc. to gather information to provide to the LLM, but it would be a more modular "document generator" mechanism that could be plugin-based, where parsing of the markdown in the description section of the function blocks would figure out which document(s) to generate and provide to the LLM during function generation.

This reduces the amount of new syntax dramatically, but eschews any database schema management within Marsha for at least the time being.

Some of the desired functionality, like a shared connection object to reduce connection pressure on the database, could be done with a Marsha function told to memoize generated connection objects and other Marsha functions told to acquire the connection from said function. This places the burden on the user to "do the right thing" but keeps language simplicity and flexibility high. Potentially the user burden can be reduced by developing a standard library for Marsha and/or a module system.
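A minimal sketch of that memoization pattern in Marsha's existing `# func` syntax; the function names and database path are invented for illustration:

```md
# func get_connection(): sqlite database connection object

This function opens a connection to the sqlite database at ./db.sqlite the first time it is called, memoizes the connection object, and returns the same memoized object on every subsequent call.

* get_connection() = a sqlite connection object
* get_connection() raises database access exception

# func get_user_count(): integer

This function acquires the shared connection from get_connection() and returns the number of rows in the users table.

* get_user_count() = 2
```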

An example of a function using such a reference:

```md
# func get_users(): list of user objects

This function gets all user records from [the database][1] and returns them, or fails if unable to connect to the database.

* get_users() = [{"name": "John Doe", "age": 27, "enjoys_udon": true}, {"name": "Jane Doe", "age": 28, "enjoys_udon": false}]
* get_users() raises database access exception

# references

[1]: ./db.sqlite
```

> **Contributor:** seems a bit tricky what substitution would need to be done here depending on each case for the final prompt for the LLM to know what to do.
>
> **Member Author:** Oh, so I was thinking it would lightly transform `[the database][1]` into `[the database](#the-database)`, where `#the-database` is whatever the title of the reference is after it is turned into a markdown document itself following this convention.
>
> The exact transform would depend on the extension and how it deals with the URL (https should be queried and maybe it drops part of the `<body>` into the document if it's an HTML document, while a file path that ends in .sqlite reads that file with the sqlite3 standard library and provides a summary of the schema).

In this approach, *all* of the logic SQL Pal uses to guess which tables are relevant will be necessary (or we make it very expensive with large-context GPT-4 calls).

#### Extensions Syntax

LLM abilities probably haven't plateaued yet, and I see value in both the fully-separate syntax and the references syntax; which is preferable may depend on personal preference more than anything, but needing to choose one over the other may limit the future of Marsha in undesirable ways. At least while we're still figuring out what to leave to the LLM, what to guide the LLM with, and what to explicitly require the user to provide, it could be useful to have users specify which manipulations of the Markdown syntax they want via special interpreter directives defined at the beginning of the file, kinda like [Raku](https://www.raku.org/) or [Racket](https://racket-lang.org/).

We could define a `# using {extension 1} [, {extension 2} ...]` header that *must* be the first header if present, specifying which parsing extensions are to be used on the text after it. We could even potentially put *all* functionality behind different extensions, giving us the flexibility to make breaking changes to things like functions without breaking any code still using the older function behavior, but that could be a wonky barrier to entry for people who are not programmers.
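For example, a file opting into two hypothetical extensions (the extension names are invented for illustration) might begin with:

```md
# using references, databases
```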
> **Contributor:** yes, when I started reading it my thought was that last sentence, but thinking about it: while we get to a "final" solution, users that are not programmers either are not going to use these "advanced modes" so often, or they have not discovered Marsha yet?
>
> **Member Author:** While we're actively developing on it during this alpha state I think that's fine, but I do think there should be a default "base" extension that's automatically used, with extensions long-term being just development artifacts and/or for esoteric purposes. I think I mention something like that in the following paragraph.


These extensions ought to register the sets of `# something` blocks they manage, with the last one registered winning; then there can be a set of "base" blocks that an extension can replace. Breaking changes could be tested with such an extension and then, if accepted, promoted to base, while the current base behavior is demoted into a back-compat extension you can switch back to if desired.

But for something like the `# references` block to make any sense, it also needs to *modify* the behavior of the `# func` block to update it with the documents the function description references, if any. This implies that it may be a better idea for the extensions to decide to either replace or extend the existing behavior of `# something` blocks they have referenced. The current behavior that parses the `# func` blocks into more verbose markdown would then have that extended markdown intercepted by the references extension and modified with the references used by the function.

Similarly, the current type-insertion behavior for `# func` blocks that specify an explicit type could be turned into an extension interception behavior in the same way, simplifying and segregating those different concerns, likely with near-zero latency impact (the function could not be parsed by the LLM until the dependent type is turned into a Python `class`, but that is true today; it would just insert an extra function call in the Markdown parse and re-generation, afaict).
> **Contributor:** hmm, not sure if I follow this, but the classes and the functions are generated in the same call; there's no previous call to generate classes and then another one for functions, it's all in one.
>
> **Member Author:** Yeah, I was reading through it more closely this evening and noticed that; however, that doesn't mean it can't be broken into separate paths that do just a text transform and are then merged together and fed to the LLM as separate event handlers, so we can add new `#` blocks independently.


This also opens up several interesting features:

1. With a standardized interface like this, user-written extensions, including extensions written in Marsha to modify Marsha, become possible; since Python is an interpreted language, dynamic loading of these extensions should be doable.
2. The "base" and set of extensions could be derived from the target language, with Python just being the default. We could have different bases for different target languages using the same general "framework".
3. Embedded code for the target language could just be another extension; for instance, a `# def my_python_function(argname):` block could be the beginning of manually-written Python to include in the output automatically (see the sketch after this list).
4. Modules/standard lib/etc could just be another extension that defines a block that then introduces the block(s) actually desired.
5. The five "layers" of Marsha compilation: `.mrsh` parse, code and test generation, syntactic validation, test-and-iterate, and final formatting + helper code writing, could themselves be turned into extensions. This would imply an event-based structure where extensions listen to some events and emit to others, extension selection determines which event listeners are registered, and compiler flags determine which event gets initial data pushed and which event is treated as the "done" event.
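A minimal sketch of the embedded-code idea from item 3; the `# def` body below is invented for illustration and assumes Python is the target language:

```md
# def my_python_function(argname):
    return argname * 2
```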

That last concept, the event-based compilation layers, has interesting implications for parallelization and true multicore usage: the LLM calls themselves could just be another event consumer and emitter, and switching from, say, OpenAI to llama.cpp + LLaMa 2 could be made trivial. It would also make `# using` extensions and compiler configuration equivalent: any compiler flag ought to have an equivalent `# using` directive. So you could decide to make using compiler flags instead of `# using` directives a lint error, or you could decide to have a Marsha-based environment that uses extensions you provide that the user does not need to be aware of.

## Expected Semver Impact

Minor update if post-1.0.0

## Affected Components

TODO

## Expected Timeline

TODO