Discussion: consider backward compatibility #41

xxchan · 2023-01-27T13:15:12Z

I think we should start thinking backward compatibility during development given that we have users.
https://risingwave-labs.slack.com/archives/C03L8DNMDDW/p1674026516400229?thread_ts=1674020631.181329&cid=C03L8DNMDDW

previous breaking-change https://github.com/risingwavelabs/risingwave/issues?q=label%3Abreaking-change

There are many aspects to consider for compatibility

SQL syntax
data format in storage/meta
protobuf
- RFC: Backward Compatibility of Stream Plan #43
CLI args/config file (find a reference from databend)
...

However

Maintaining compatibility is hard and may not worth it. If there is a need to upgrade version for the already deployed cases, we can write a simple tool to migrate old metadata to the new one on demand.
risingwavelabs/risingwave#7281 (comment)

So I just create an issue first to collect ideas :)

The text was updated successfully, but these errors were encountered:

TennyZhuang · 2023-01-28T06:21:49Z

That's a good question, and I've thought a lot about compatibility, so let's take it case by case.

SQL syntax

After our first stable release, it's better to always keep SQL compatibility. Generally speaking, this won't be too hard, since most of our SQL is just following how Postgres behaves.

We can mark a stabilized syntax as deprecated, but never have to remove it. I guess it doesn't take much effort.

For very rare design flaws, used by few users, break changes are allowed. We should discuss them case by case.

Example: risingwavelabs/risingwave#4503 doesn't break anything, and we may still find the better property naming convention in the future.

Data format in storage

SST format

All objects stored in our storage backend are in our private SST format. There is an SstableMeta in the header of each SST file, with 4 bytes of version stored in it. I think it's flexible enough for further break changes.

Row Encoding

The SST data blocks are organized as rows of key/value pairs.

We use memcomparable encoding for the key and value encoding for the value. There's a lot of design space, and we've even made several break changes now. After we stabilize, we need to ensure that any old-version key/value pairs can be decoded correctly.

Fortunately, we can ensure that all key/value pairs in the same SST file have the same encoding version, so we don't need to store the version per row.

IMHO, a good practice to maintain encoding compatibility is to use separate crates and versions for the row encodings. Based on the power of cargo, we can use different versions of the same crate in one project.

For example:

memcomparable_0_1 = { package = "memcomparable", version = "0.1.2" }
memcomparable_0_2 = { package = "memcomparable", version = "0.2.3" }
memcomparable_1 = { package = "memcomparable", version = "1.4" }
value_encoding_0_1 = { package = "value_encoding", version = "0.1" }
value_encoding_1 = { package = "value_encoding", version = "1.2" }
value_encoding_2 = { package = "value_encoding", version = "2.1" }

And dispatch the implementation statically while decoding the SSTs.

Pseudocode:

fn decode<const SST_VERSION: usize>(buf: &Bytes) -> (Key, Value) {
  match SST_VERSION {
    1 => (memcomparable_0_1.decode(buf), value_encoding_0_1.decode(buf))
    2 => (memcomparable_0_2.decode(buf), value_encoding_1.decode(buf))
    3 => (memcomparable_0_2.decode(buf), value_encoding_2.decode(buf))
    4 => (memcomparable_1.decode(buf), value_encoding_2.decode(buf)),
    _ => unreachable!("corrupted data")
  }
}

The encoding crate version should follow the semver rules, so that we can release some patches (e.g. bump deps) to the older versions.

Upgrade During Compacting

When a new storage version (both SST and row encoding) is released, no data is changed until the SST is compacted. Compacting will rebuild the SST file so that we can upgrade the version in the meantime.

Compaction always reconstructs an entire SST file, so we can ensure that all key/value pairs in the same SST file have the same encoding version.

Data Format in Meta

Data in the meta is organized as a tree structure, and currently the major meta backend is etcd. The meta values are encoded by protobuf encoding.

If a change only adds/reduces fields in the meta, then the protobuf can ensure that nothing is broken.

If a change must reorganize the tree structure itself, then a break change must be made.

Since the meta-store data is never too large, a migration script is preferred for such a break change.

If we implement a SQL-based metastore later, we can use some SQL like migrations/5_to_6.sql to do the migration, but currently a Rust snippet is recommended.

Protocols between nodes

In general, the protobuf can handle the situation well enough.

To make things easier, we can add a version field to each request, so that an older compute node can always reject some DDL operations from a newer frontend.

For CN-CN communication (mostly exchange), protobuf compatibility should be handled carefully by developers.

CLI args/config file

No ideas, we should create a good enough configuration framework before the first release.

arkbriar · 2023-01-28T07:06:56Z

In production, there will always be cases in which we upgrade a RisingWave cluster. The problem is that upgrading isn't an atomic process, meaning multiple node versions can exist. One simple solution is to fully stop the cluster and start the newer version afterward, but the connections from our user side will be disconnected. Apparently, it hurts user experience and thus shouldn't be the first choice in most cases.

Compatibility of the communication protocols should be carefully kept and tested for those promised compatible versions, i.e., versions with the same major (and minor) numbers in semantic versions.

For the CLI args and config parameters, I'd like to propose that the unrecognized parameters should trigger warnings instead of panics. This makes the cloud easier to manage the config templates. Also, critical args should be carefully handled because the operator can not distinguish from versions. If the operator also applies the arg changes (e.g., supports a new argument or deprecates an old one), the older RisingWave versions won't be deployable anymore. That has already happened several times.

wangrunji0408 · 2023-01-28T07:10:45Z

What about testing. Is it possible to setup some kind of compatibility test in CI? 👀

TennyZhuang · 2023-01-28T07:23:39Z

In production, there will always be cases in which we upgrade a RisingWave cluster. The problem is that upgrading isn't an atomic process, meaning multiple node versions can exist. One simple solution is to fully stop the cluster and start the newer version afterward, but the connections from our user side will be disconnected. Apparently, it hurts user experience and thus shouldn't be the first choice in most cases.

I'd prefer always rolling-update the whole cluster, with specified orders. Two versions of nodes should only co-exist in several seconds to minutes, so we can reject some older requests directly.

TennyZhuang · 2023-01-28T07:24:14Z

What about testing. Is it possible to setup some kind of compatibility test in CI? 👀

Not ready, we haven't implemented all mechanisms, just some thoughts.

arkbriar · 2023-01-28T09:11:59Z

I'd prefer always rolling-update the whole cluster, with specified orders. Two versions of nodes should only co-exist in several seconds to minutes, so we can reject some older requests directly.

I agree, but it could take a long time to upgrade from a really old version.

neverchanje · 2023-01-30T08:55:53Z

#41 (comment)

What @TennyZhuang mentioned is absolutely essential that we should ensure the extensibility of our major components. But the reality is, even some trivial configuration changes could break the compatibility: risingwavelabs/risingwave#7527. Without actually testing out an upgrade, we'll never know what could break. @arkbriar @sumittal Let's include upgrade testing in our release process.

arkbriar · 2023-01-30T09:27:52Z

@neverchanje I've commented under risingwavelabs/risingwave#7530 (comment) and proposed to keep the backward compatibility for at least one patch version so that the change won't block any others at the exact time it is made to make others life easier.

huangjw806 · 2023-01-30T09:42:10Z

Let's include upgrade testing in our release process.

Of course we can add an upgrade test during the release process, but what if the test fails? Whether to roll back the kernel or upgrade the cloud, such break changes will cause a lot of trouble, and even affect customers, because the cloud operator is public in eks, and not all problems can be solved by upgrading the operator. We should formulate more detailed compatibility rules.

neverchanje · 2023-01-30T10:10:50Z

but what if the test fails?

@huangjw806 We shouldn't release any stable versions with upgradability issues, whether or not it's a patch release or a major release. If the migration tools are not ready, we should revert the changes that cause the issues.

st1page · 2023-01-31T10:50:57Z

draft a RFC to discuss how to maintain and manage the Backward Compatibility of the stream plan and SQL features. need more discussion #43

xxchan · 2023-02-10T09:53:28Z

#43 also contains valuable discussions about the requirements for rolling upgrade.

This comment was marked as resolved.

Sign in to view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: consider backward compatibility #41

Discussion: consider backward compatibility #41

xxchan commented Jan 27, 2023 •

edited

Loading

TennyZhuang commented Jan 28, 2023 •

edited

Loading

arkbriar commented Jan 28, 2023

wangrunji0408 commented Jan 28, 2023

TennyZhuang commented Jan 28, 2023 •

edited by xxchan

Loading

TennyZhuang commented Jan 28, 2023

arkbriar commented Jan 28, 2023

This comment was marked as resolved.

neverchanje commented Jan 30, 2023 •

edited

Loading

arkbriar commented Jan 30, 2023

huangjw806 commented Jan 30, 2023

This comment was marked as resolved.

neverchanje commented Jan 30, 2023

This comment was marked as resolved.

st1page commented Jan 31, 2023

xxchan commented Feb 10, 2023

Discussion: consider backward compatibility #41

Discussion: consider backward compatibility #41

Comments

xxchan commented Jan 27, 2023 • edited Loading

TennyZhuang commented Jan 28, 2023 • edited Loading

SQL syntax

Data format in storage

SST format

Row Encoding

Upgrade During Compacting

Data Format in Meta

Protocols between nodes

CLI args/config file

arkbriar commented Jan 28, 2023

wangrunji0408 commented Jan 28, 2023

TennyZhuang commented Jan 28, 2023 • edited by xxchan Loading

TennyZhuang commented Jan 28, 2023

arkbriar commented Jan 28, 2023

This comment was marked as resolved.

neverchanje commented Jan 30, 2023 • edited Loading

arkbriar commented Jan 30, 2023

huangjw806 commented Jan 30, 2023

This comment was marked as resolved.

neverchanje commented Jan 30, 2023

This comment was marked as resolved.

st1page commented Jan 31, 2023

xxchan commented Feb 10, 2023

xxchan commented Jan 27, 2023 •

edited

Loading

TennyZhuang commented Jan 28, 2023 •

edited

Loading

TennyZhuang commented Jan 28, 2023 •

edited by xxchan

Loading

neverchanje commented Jan 30, 2023 •

edited

Loading