Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(hydro_lang): editing pass #1693

Merged
merged 3 commits into from
Jan 30, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 6 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,9 @@
<a href="https://docs.rs/hydro_lang/"><img src="https://img.shields.io/badge/docs.rs-Hydro-blue?style=flat-square&logo=read-the-docs&logoColor=white" alt="Docs.rs"></a>
</p>

Hydro is a novel distributed programming library for standard Rust. Hydro allows developers to build distributed systems that are efficient, scalable, and correct.
Hydro is a high-level distributed programming framework for Rust. Hydro can help you quickly write scalable distributed services that are correct by construction. Much like Rust helps with memory safety, Hydro helps with [**distributed safety**](https://hydro.run/docs/hydro/correctness.md).

Hydro integrates naturally into standard Rust constructs and IDEs, providing types and programming constructs for ensuring distributed correctness. Under the covers it provides a metaprogrammed compiler that optimizes for cross-node issues of scaling and data movement while leveraging Rust and LLVM for per-node performance.
Hydro integrates naturally into standard Rust constructs and IDEs, providing types and programming constructs for ensuring distributed safety. Under the covers it provides a metaprogrammed compiler that optimizes for cross-node issues of scaling and data movement while leveraging Rust and LLVM for per-node performance.

We often describe Hydro via a metaphor: *LLVM for the cloud*. Like LLVM, Hydro is a layered compilation framework with a low-level Internal Representation language. In contrast to LLVM, Hydro focuses on distributed aspects of modern software.

Expand All @@ -17,10 +17,10 @@ We often describe Hydro via a metaphor: *LLVM for the cloud*. Like LLVM, Hydro i
</div>


## The Language (and the Low-Level IR)
Hydro provides a [high-level language](https://hydro.run/docs/hydro) that allows you to program an entire fleet of processes from a single program, and then launch your fleet locally or in the cloud via [Hydro Deploy](https://hydro.run/docs/deploy). Get started with Hydro via the language [documentation](https://hydro.run/docs/hydro) and [examples](https://github.com/hydro-project/hydro/tree/main/hydro_test/examples).
## The Hydro API (and the Low-Level IR)
Hydro provides a [high-level API](https://hydro.run/docs/hydro) that allows you to program an entire fleet of processes from a single program, and then launch your fleet locally or in the cloud via [Hydro Deploy](https://hydro.run/docs/deploy). Get started with Hydro via the API [documentation](https://hydro.run/docs/hydro) and [examples](https://github.com/hydro-project/hydro/tree/main/hydro_test/examples).

> Internally, the Hydro stack compiles Hydro programs into a low-level Dataflow Internal Representation (IR) language called [DFIR](https://hydro.run/docs/dfir); each process corresponds to a separate DFIR program. In rare cases you may want to compose one or more processes in DFIR by hand; see the DFIR [documentation](https://hydro.run/docs/dfir) or [examples](https://github.com/hydro-project/hydro/tree/main/dfir_rs/examples) for details.
> Internally, the Hydro stack compiles Hydro programs into a low-level single-threaded DataFlow Internal Representation (IR) language called [DFIR](https://hydro.run/docs/dfir); each Hydro process corresponds to a separate DFIR program. In rare cases you may want to hand-author one or more processes in DFIR; see the DFIR [documentation](https://hydro.run/docs/dfir) or [examples](https://github.com/hydro-project/hydro/tree/main/dfir_rs/examples) for details.

## Development Setup

Expand All @@ -32,7 +32,7 @@ There have been many frameworks and platforms for distributed programming over t
**Higher level frameworks** have been designed to serve specialized distributed use cases. These including *Client-Server (Monolith)* frameworks (e.g. Ruby on Rails + DBMS), parallel *Bulk Dataflow* frameworks (e.g. Spark, Flink, etc.), and step-wise *Workflows / Pipelines / Serverless / μservice Orchestration* frameworks (e.g. Kafka, Airflow). All of these frameworks offer limited expressibility and are inefficient outside their sweet spot. Each one ties developers' hands in different ways.

**Lower level asynchronous APIs** provide general-purpose distributed interfaces for sequential programming, including
*RPCs*, *Async/Await* frameworks and *Actor* frameworks (e.g. Akka, Ray, Unison, Orleans, gRPC). These interfaces allow developers to build distributed systems *one async sequential process* at a time. While they offer low-level control of individual processes, they provide minimal help for global correctness of the fleet.
*RPCs*, *Async/Await* frameworks and *Actor* frameworks (e.g. Akka, Ray, Unison, Orleans, gRPC). These interfaces allow developers to build distributed systems *one async sequential process* at a time. While they offer low-level control of individual processes, they provide minimal help with ensuring the global correctness of the fleet.

## Towards a more comprehensive approach
What's wanted, we believe, is a proper language stack addressing distributed concerns:
Expand Down
4 changes: 2 additions & 2 deletions docs/docs/hydro/correctness.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,8 @@ sidebar_position: 3
---

# Safety and Correctness
Just like Rust's type system helps you avoid memory safety bugs, Hydro helps you ensure **distributed safety**. Hydro's type systems helps you avoid many kinds of distributed systems bugs, including:
- Non-determinism due to message delays (which reorder arrival) or retries (which result in duplicates)
Much like Rust's type system helps ensure memory safety, Hydro helps ensure **distributed safety**. Hydro's type system helps you avoid many kinds of distributed systems bugs, including:
- Non-determinism due to message delays (which affect arrival order), interleaving across streams (which affect order of handling) or retries (which result in duplicates)
- See [Live Collections / Eventual Determinism](./live-collections/determinism.md)
- Using mismatched serialization and deserialization formats across services
- See [Locations and Networking](./locations/index.md)
Expand Down
2 changes: 1 addition & 1 deletion docs/docs/hydro/dataflow-programming.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ sidebar_position: 1
---

# Dataflow Programming
Hydro uses a dataflow programming model, which will be familiar if you have used APIs like Rust iterators. Instead of using RPCs or async/await to describe distributed computation, Hydro instead uses **asynchronous streams**, which represent data arriving over time. Streams can represent a series of asynchronous events (e.g. inbound network requests) or a sequence of data items.
Hydro uses a dataflow programming model, which will be familiar if you have used APIs like Rust iterators. Instead of using RPCs, async/await or actors to describe distributed computation, Hydro instead uses **asynchronous streams**, which represent data arriving over time. Streams can represent a series of asynchronous events (e.g. inbound network requests) or a sequence of data items.

Programs in Hydro describe how to **transform** streams and other collections of data using operators such as `map` (transforming elements one by one), `fold` (aggregating elements into a single value), or `join` (combining elements from multiple streams on matching keys).

Expand Down
6 changes: 4 additions & 2 deletions docs/docs/hydro/index.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,11 @@ sidebar_position: 0
---

# Introduction
Hydro is a high-level distributed programming framework for Rust powered by the [DFIR runtime](../dfir/index.mdx). Unlike traditional architectures such as actors or RPCs, Hydro offers _choreographic_ APIs, where expressions and functions can describe computation that takes place across many locations. It also integrates with [Hydro Deploy](../deploy/index.md) to make it easy to deploy and run Hydro programs to the cloud.
Hydro is a high-level distributed programming framework for Rust. Hydro can help you quickly write scalable distributed services that are correct by construction. Much like Rust helps with memory safety, Hydro helps with [**distributed safety**](./correctness.md). Hydro also makes it easy to get started by running your distributed programs in either testing or deployment modes.

Hydro uses a two-stage compilation approach. Hydro programs are standard Rust programs, which first run on the developer's laptop to generate a _deployment plan_. This plan is then compiled to individual binaries for each machine in the distributed system (enabling zero-overhead abstractions), which are then deployed to the cloud using the generated plan along with specifications of cloud resources.
Hydro is a distributed dataflow language, powered by the high-performance single-threaded [DFIR runtime](../dfir/index.mdx). Unlike traditional architectures such as actors or RPCs, Hydro offers _choreographic_ APIs, where expressions and functions can describe computation that takes place across many locations. It also integrates with [Hydro Deploy](../deploy/index.md) to make it easy to deploy and run distributed Hydro programs either locally or in the cloud.

Hydro uses a two-stage compilation approach. Hydro programs are standard Rust programs, which first run on the developer's laptop to generate a _deployment plan_. This plan is then compiled to DFIR to generate individual binaries for each machine in the distributed system (enabling zero-overhead abstractions), which are then deployed to the cloud using the generated plan along with specifications of cloud resources.

Hydro has been used to write a variety of high-performance distributed systems, including implementations of classic distributed protocols such as two-phase commit and Paxos. Work is ongoing to develop a distributed systems standard library that will offer these protocols and more as reusable components.

Expand Down
4 changes: 2 additions & 2 deletions docs/docs/hydro/live-collections/bounded-unbounded.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ sidebar_position: 0
---

# Bounded and Unbounded Types
Although live collections can be continually updated, some collection types also support **termination**, after which no additional changes can be made. For example, a live collection created by reading integers from an in-memory `Vec` will become terminated once all the elements of the `Vec` have been loaded. But other live collections, such as one being updated by the network, may never become terminated.
Although live collections can be continually updated, some collection types also support **termination**, after which no additional changes can be made. For example, a live collection created by reading integers from an in-memory `Vec` will become terminated once all the elements of the `Vec` have been loaded. But other live collections, such as events streaming into a service from a network, may never become terminated.

In Hydro, certain APIs are restricted to only work on collections that are **guaranteed to terminate** (**bounded** collections). All live collections in Hydro have a type parameter (typically named `B`), which tracks whether the collection is bounded (has the type `Bounded`) or unbounded (has the type `Unbounded`). These types are used in the signature of many Hydro APIs to ensure that the API is only called on the appropriate type of collection.

Expand Down Expand Up @@ -32,7 +32,7 @@ let input: Singleton<_, _, Bounded> = tick.singleton(q!(0));
let unbounded: Singleton<_, _, Unbounded> = input.into();
```

Converting from an unbounded collection **to a bounded collection**, however is more complex. This requires cutting off the unbounded collection at a specific point in time, which may not be possible to do deterministically. For example, the most common way to convert an unbounded `Stream` to a bounded one is to batch its elements non-deterministically using `.tick_batch()`.
Converting from an unbounded collection **to a bounded collection**, however is more complex. This requires cutting off the unbounded collection at a specific point in time, which may not be possible to do deterministically. For example, the most common way to convert an unbounded `Stream` to a bounded one is to batch its elements non-deterministically using `.tick_batch()`. Because this is non-deterministic, we require that it be placed in an `unsafe` block.

```rust,no_run
# use hydro_lang::*;
Expand Down
14 changes: 7 additions & 7 deletions docs/docs/hydro/live-collections/determinism.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,26 +3,26 @@ sidebar_position: 1
---

# Eventual Determinism
Most programs have strong guarantees on **determinism**, the property that when provided the same inputs, the outputs of the program are always the same. Even when the inputs and outputs are live collections, we can focus on the _eventual_ state of the collection (as if we froze the input and waited until the output stops changing).
Many programs benefit from strong guarantees on **determinism**, the property that when provided the same inputs, the outputs of the program are always the same. This is critical for consistency across *replicated* services. It is also extremely helpful for predictability, including consistency across identical runs (e.g. for testing or reproducibility.) Determinism is particularly tricky to reason about in distributed systems due to the inherently non-deterministic nature of asynchronous event arrivals. However, even when the inputs and outputs of a program are live collections, we can focus on the _eventual_ state of the collection as if we froze the input and waited until the output stopped changing.

:::info

Our consistency and safety model is based on the POPL'25 paper [Flo: A Semantic Foundation for Progressive Stream Processing](https://arxiv.org/abs/2411.08274), which covers the formal details and proofs underlying this system.

:::

Hydro thus guarantees **eventual determinism**: given a set of streaming inputs which have arrived, the outputs of the program will **eventually** have the same _final_ value. This makes it easy to build composable blocks of code without having to worry about runtime behavior such as batching or network delays.
Hydro thus guarantees **eventual determinism**: given a set of specific live collections as inputs, the outputs of the program will **eventually** have the same _final_ value. Eventual determinism makes it easy to build composable blocks of code without having to worry about non-deterministic runtime behavior such as batching or network delays.

:::note

Much existing literature in distributed systems focuses on consistency levels such as "eventual consistency" which typically correspond to guarantees when reading the state of a _replicated_ object (or set of objects) at a _specific point_ in time. Hydro does not use such a consistency model internally, instead focusing on the values local to each distributed location _over time_. Concepts such as replication, however, can be layered on top of this model.
Much existing literature in distributed systems focuses on data consistency levels such as "eventual consistency". These typically correspond to guarantees when reading the state of a _replicated_ object (or set of objects) at a _specific point_ in time. Hydro does not use such a consistency model internally, instead focusing on the values local to each distributed location _over time_. Concepts such as replication, however, can be layered on top of this model.

:::

## Unsafe Operations in Hydro
All **safe** APIs in Hydro (the ones you can call regularly in Rust) guarantee determinism. But often it is necessary to do something non-deterministic, like generate events at a fixed wall-clock-time interval, or split an input into arbitrarily sized batches.

Hydro offers APIs for such concepts behind an **`unsafe`** guard. This keyword is typically used to mark Rust functions that may not be memory-safe, but we reuse this in Hydro to mark non-deterministic APIs.
Hydro offers APIs for such concepts behind an **`unsafe`** guard. This keyword is typically used to mark Rust functions that may not be memory-safe, but we reuse this in Hydro to mark APIs with non-deterministic constructs.

To call such an API, the Rust compiler will ask you to wrap the call in an `unsafe` block. It is typically good practice to also include a `// SAFETY: ...` comment to explain why the non-determinism is there.

Expand All @@ -39,9 +39,9 @@ unsafe {
}.for_each(q!(|v| println!("Sample: {:?}", v)))
```

When writing a function with Hydro that involves `unsafe` code, it is important to be extra careful about whether the non-determinism is exposed externally. In some applications, a utility function may involve local non-determinism (such as sending retries), but not expose it outside the function (via deduplication).
When writing a function with Hydro that involves `unsafe` code, it is important to be extra careful about whether the non-determinism is exposed externally. In some applications, a utility function may involve local non-determinism (such as sending retries), but not expose it outside the function (e.g., via deduplicating received responses).

But other utilities may expose the non-determinism, in which case they should be marked `unsafe` as well. If the function is public, Rust will require you to put a `# Safety` section in its documentation to explain the non-determinism.
But other functions may expose the non-determinism, in which case they should be marked `unsafe` as well. If the function is public, Rust will require you to put a `# Safety` section in its documentation to explain the non-determinism.

```rust
# use hydro_lang::*;
Expand All @@ -65,7 +65,7 @@ unsafe fn print_samples<T: Debug, L>(
```

## User-Defined Functions
Another source of potential non-determinism is user-defined functions, such as those provided to `map` or `filter`. Hydro allows for arbitrary Rust functions to be called inside these closures, so it is possible to introduce non-determinism that will not be checked by the compiler.
Another source of potential non-determinism comes from user-defined functions or closures, such as those provided to `map` or `filter`. Hydro allows for arbitrary Rust functions to be called inside these closures, so it is possible to introduce non-determinism that will not be checked by the compiler.

In general, avoid using APIs like random number generators inside transformation functions unless that non-determinism is explicitly documented somewhere.

Expand Down
Loading
Loading