Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integrate cr-sqlite #147

Closed
tantaman opened this issue Mar 18, 2023 · 10 comments
Closed

integrate cr-sqlite #147

tantaman opened this issue Mar 18, 2023 · 10 comments

Comments

@tantaman
Copy link
Contributor

tantaman commented Mar 18, 2023

integrate cr-sqlite

Lmk if you'd like to track this some other way but --

Opening a high level task to track and discuss the integration.

I'm assuming we'd want to drop the extension in under the ext folder. If so I can start moving in that direction.

A few issues were raised on the doc, however:

  1. Coming up with a custom syntax to define CRRs / CRDTs
  2. Using the internal commit and update hooks

Should we work those out first?

@psarna
Copy link
Collaborator

psarna commented Mar 19, 2023

Hi! Yeah, ext/ is the right place, although feel free to also hook it into the autotools build system, e.g. so that you can ./configure --enable-crdt && make.

Ad 1, let's create a separate issue for the custom syntax, and we can bikeshed the exact syntax in there. /cc @penberg

Ad 2, I think we can figure out the internal hooks later. It's ok for the first iteration to work with the original SELECT magic_function_with_side_effects() syntax, and then we'll spray enough syntactic sugar over it to make the experience smooth. Again, /cc @penberg

@penberg
Copy link
Collaborator

penberg commented Mar 19, 2023

@tantaman I am with @psarna on this: let's just integrate it and work towards a syntax and internal hooks incrementally.

tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Mar 27, 2023
tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Mar 27, 2023
tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Mar 27, 2023
@tantaman
Copy link
Contributor Author

tantaman commented Apr 4, 2023

@psarna - any docs (or just a code pointer) on how I might use the internal commit and update hooks?

Also, here are some ideas on the custom syntax for defining CRDTs / CRRs.

Table Syntax

Each table can be modeled as some sort of set CRDT. Grow only set, observe remove set, causal length set, etc.

Given that, we could add a modified CREATE TABLE that is CREATE [set_type]

Example:

CREATE [SetType] foo ( ... );

Other alternatives would be to keep CREATE TABLE and append some modifier to the end:

CREATE TABLE foo ( ... ) AS [SetType]; -- option 1
CREATE TABLE foo ( ... ) CRDT = [SetType]; -- option 2

Where SetType = CausalLengthSet | GrowOnlySet | DeleteWinsSet | AddWinsSet

Note that, in practice, we may end up only supporting CausalLengthSet given that type can be used to model all of the other sets in user space.

Column Syntax

Within a row, each column can be a specific type of CRDT. The type of CRDT chosen controls the semantics of how that given cell is merged.

Some options for a column:

  1. Last Write Wins Register
  2. Counter
  3. Distributed Fractional Index*
  4. Multi-Value Register
  5. RGA
  6. Peritext

One idea is to add an extra type (the crdt type) to the column type.

Example:

CREATE TABLE foo (
  id PRIMARY KEY,
  views COUNTER,
  content PERITEXT,
  owner_id LWW INTEGER
) AS CausalLengthSet;

Or generally:

CREATE TABLE [table_name] (
  [col_name] [CRDTType] [Type],
  ...
) AS [SetType];

where CRDTType = LWW | Counter | DistFractIndex | MVRegister | RGA | Peritext

* Distributed Fractional Index would need to be parameterized. I can go into more detail on this one after we talk through what has been proposed so far.

@asg017
Copy link

asg017 commented Apr 6, 2023

I'm curious: would it be possible to use virtual tables instead of implementing custom syntax? For example, your foo example could be represented as a virtual table like so:

create virtual table foo using CausalLengthSet(
  id PRIMARY KEY,
  views COUNTER,
  content PERITEXT,
  owner_id LWW INTEGER
);

The SQLite virtual table syntax is very relaxed, and you can enforce any syntax rules as you like. So your virtual table logic for CausalLengthSet (or any other "virtual table modules" that you define like GrowOnlySet or DeleteWinsSet) can validate and enforce any CRDTType in column definitions.

Personally, I would think that an extension with virtual tables would be preferable to new syntax, but I might be missing something!

@tantaman
Copy link
Contributor Author

tantaman commented Apr 6, 2023 via email

@tantaman
Copy link
Contributor Author

tantaman commented Apr 6, 2023

I did some experimentation and it looks mostly possible. A virtual table table is still created but maybe I can do something like drop the vtab in the commit hook such that the crr vtab just becomes a new syntax for creating real tables rather than creating vtabs.

Another problem is that the virtual table gets the desired table name and the real table has to get some other name :(

So the vtab route might look like:

CREATE VIRTUAL TABLE foo_blah USING CausalLengthSet(
  id PRIMARY KEY,
  views COUNTER,
  content PERITEXT,
  owner_id LWW INTEGER
);

where the user must suffix their table name with _something. That suffix is stripped to create foo. The virtual table foo_blah will just exist but not actually do anything.

It does seem better than select crsql_as_crr('foo') -- especially as as_crr accumulates more options.

tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Apr 6, 2023
Alex made a really good point here: tursodatabase/libsql#147 (comment)

this'll make `crr` creation a bit more ergonomic for the base extension
@asg017
Copy link

asg017 commented Apr 6, 2023

creating a virtual table that just proxied a real table had a large perf overhead on read.

This seems odd to me - do you mind sharing the setup of this virtual table when you tried it out? If your virtual table is backed by a regular SQLite table, queries should be as fast as querying the underlying table. You have to jump through a few hoops, but any WHERE clause or only SELECTing a subset of column can be made quick by overwriting constraints and using colsUsed in the XBestIndex method. You can also override any ORDER BY clauses in virtual tables.

That being said, if you're doing analytical queries on a virtual table, then the current implementation isn't great at those. Any COUNT(*) or other aggregate functions will be slow, although there are a few tricks to make it it faster.

In the end we just want a real table + some triggers and metadata tables.

Shadow tables could be a solution to this. You'd have a virtual table named foo, which under-the-hood creates shadow tables like foo_data or foo_metadata that stores whatever you need. They can also be renamed during ALTER TABLE foo RENAME TO bar statements with xRename.

Same with triggers, you could create those in the xCreate() method, and rename during xRename().

Another problem is that the virtual table gets the desired table name and the real table has to get some other name :(

Is there a reason why you would want the desired table name to be a real table, and not a virtual one? Even if you have a virtual table named foo and the "real" table is stored in "foo_data", user's dont have to worry about that - they can just query and INSERT into the foo table, and never even care about "foo_data".

  • you can’t alter a virtual table.

Yea that's the big issue here, you can't use ALTER TABLE to rename/add/drop columns. For current virtual tables like fts5 and rtree the convention is to just create a new table and copy all the rows over, but that isn't great.

Do you expect that people would alter columns in their crdt-backed tables a lot? One trick you might be able to use is similar to SQLite's FTS5 special insert commands, where they reserve a special hidden column for higher-level manipulation of the virtual table, like so:

INSERT INTO ft(ft, rank) VALUES('automerge', 8);
INSERT INTO ft(ft, rank) VALUES('crisismerge', 16);
INSERT INTO ft(ft) VALUES('integrity-check');

Where ft is the special hidden column (same name as the ft virtual table that it's a part of), and 'automerge' / 'crisismerge' / 'integrity-check' are special values that perform specific tasks to the ft virtual table.

So you could possibly do something like:

insert into foo(foo, a, b) values ('drop-column', 1, 1);

Would "drop" the a and b columns. You wouldn't be retroactively change the schema of the already-existing foo table in the same connection, but you can possible just make future queries that reference a and b illegal?

select a, b, c from foo; -- works fine
insert into foo(foo, a, b) values ('drop-column', 1, 1); -- drops columns a and b
select a, b, c from foo; -- fails with custom error "a and b were dropped"
select c from foo; -- works fine

And for future connections, you can possibly store the "new' schema in a virtual table, and on xConnect() read in that new schema into sqlite3_declare_vtab (with a and b omitted).

This also has the added benefit of not altering tables with a scalar SELECT function like crsql_begin_alter(). I get a little scared when I use a SQL library that has SELECT function that change state of the underlying table, but if it's in a write-only query like INSERT INTO, it feels a bit better. It also plays better with tools that allow arbritary read-only queries like Datasette.

Happy to provide more examples if needed!

@tantaman
Copy link
Contributor Author

tantaman commented Apr 6, 2023

"creating a virtual table that just proxied a real table had a large perf overhead on read." - This seems odd to me

It was a surprise to myself and @ivertom as well. We discovered that ~75% of query time for some queries is spent in the parsing and planning phase. Issuing a query against the virtual table then re-constructing a query for the base table was a significant hit since it involved running that phase twice. Once at the virtual layer, again when querying the real table from the vtab implementation. >4x slowdown in some cases.

Unfortunately I don't think we ever bothered merging the vtab changes from Iver's fork to the main repo once the perf problems surfaced. @ivertom, any chance you still have the virtual table approach you took to creating CRRs?

Is there a reason why you would want the desired table name to be a real table

mainly for the perf reasons. I don't want reads slowed down at all so querying the real table, rather than going through a virtual layer first, is the optimal situation.

Even if you have a virtual table named foo and the "real" table is stored in "foo_data", user's dont have to worry about that - they can just query and INSERT into the foo table, and never even care about "foo_data".

Yeah, I'm aware. crsql_changes is currently a virtual table that is used to read and write patch sets out of and into the database. I'm also using virtual tables to provide a fractional indexing API so users can INSERT x AFTER Y and get a correctly generated fractional index for the insertion.

Where ft is the special hidden column (same name as the ft virtual table that it's a part of), and 'automerge' / 'crisismerge' / 'integrity-check' are special values that perform specific tasks to the ft virtual table.

This is interesting. I'll need to mull it over a bit.

It also plays better with tools that allow arbritary read-only queries like Datasette.

I hadn't thought about this angle. Another good argument for moving all mutations to be insert driven.

tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Apr 6, 2023
Alex made a really good point here: tursodatabase/libsql#147 (comment)

this'll make `crr` creation a bit more ergonomic for the base extension
@ivertom
Copy link

ivertom commented Apr 6, 2023 via email

tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Aug 11, 2023
Alex made a really good point here: tursodatabase/libsql#147 (comment)

this'll make `crr` creation a bit more ergonomic for the base extension
tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Aug 12, 2023
Alex made a really good point here: tursodatabase/libsql#147 (comment)

this'll make `crr` creation a bit more ergonomic for the base extension
tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Aug 15, 2023
Alex made a really good point here: tursodatabase/libsql#147 (comment)

this'll make `crr` creation a bit more ergonomic for the base extension
tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Sep 5, 2023
tantaman added a commit to vlcn-io/cr-sqlite that referenced this issue Sep 5, 2023
Alex made a really good point here: tursodatabase/libsql#147 (comment)

this'll make `crr` creation a bit more ergonomic for the base extension
MarinPostma added a commit that referenced this issue Oct 17, 2023
147: End to end testing r=MarinPostma a=MarinPostma

This PR boostraps end-to-end testing for sqld.

I haven't yet setup CI, and plan to do it in a subsequent PR. enough suffering for now, and I need to shift focus to other higher priority tasks.

Co-authored-by: ad hoc <postma.marin@protonmail.com>
@tantaman
Copy link
Contributor Author

tantaman commented Nov 10, 2023

The process of syncing to libsql is fully automated via sync-libsql.sh as of vlcn-io/cr-sqlite#399

Well it still requires submitting PR so not quite fully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants