
Conversation

@mschristensen (Contributor)

Description

Adds a "Token Streaming" section to the AIT docs, with a page covering token streaming using one message per token.

Covers:

  • Using a realtime client on the agent side to guarantee order
  • Publishing tokens without awaiting the acknowledgement for high throughput
  • Common patterns for token publishing and subscribing:
    • Continuous token stream
    • Token streams for distinct responses
    • Token streams with explicit start/stop events (sketched after this list)
  • Common patterns for client hydration:
    • Using rewind
    • Using persisted history with untilAttach
    • Loading complete responses from the database and hydrating tokens for live responses
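
For illustration, the publish side of the explicit start/stop pattern might look roughly like this (a sketch only: the event names, the responseId header, and the `channel`/`llmStream` variables are assumptions, not taken from the docs page):

```javascript
// Sketch: one response's tokens bracketed by explicit start/stop events,
// correlated with a responseId carried in message extras headers.
const responseId = crypto.randomUUID();
const extras = { headers: { responseId } };

await channel.publish({ name: 'start', extras });
for await (const event of llmStream) {
  if (event.type === 'token') {
    // Fire-and-forget for throughput; a single realtime connection
    // preserves publish order.
    channel.publish({ name: 'token', data: event.text, extras });
  }
}
await channel.publish({ name: 'stop', extras });
```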

Note that the 100 message rewind limit will change soon, and these docs will be updated to reflect that.

Checklist

  • Add intro describing the pattern, its properties, and use cases. Includes continuous token streams, correlating tokens for distinct responses, and explicit start/end events.
  • Split each token streaming approach into a distinct pattern, showing the publish-side and subscribe-side behaviour alongside one another.
  • Include hydration with rewind and hydration with persisted history + untilAttach (both sketched below). Describe the pattern for handling in-progress live responses alongside complete responses loaded from the database.
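
For reference, the two hydration approaches might look roughly like this (a sketch, assuming an existing ably-js `realtime` client, an illustrative channel name, and message persistence enabled for the history case):

```javascript
// 1. Rewind: attach with the rewind channel param to replay recent
//    messages (currently capped at 100) before live messages arrive.
const rewound = realtime.channels.get('conversation:1234', {
  params: { rewind: '100' },
});
await rewound.subscribe('token', (message) => {
  // Rewound tokens are delivered first, in order, then the live stream.
});

// 2. Persisted history + untilAttach: subscribe first (attaching the
//    channel), then page backwards through history up to the attach
//    point so no messages are missed or duplicated.
const channel = realtime.channels.get('conversation:1234');
await channel.subscribe('token', (message) => {
  // Handle live tokens.
});

let page = await channel.history({ untilAttach: true });
while (page) {
  for (const message of page.items) {
    // Items are returned newest-first by default.
  }
  page = page.hasNext() ? await page.next() : null;
}
```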
coderabbitai bot commented Dec 10, 2025:

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the `.coderabbit.yaml` file in this repository. To trigger a single review, invoke the `@coderabbitai review` command.

You can disable this status message by setting `reviews.review_status` to `false` in the CodeRabbit configuration file.


```javascript
// ✅ Do this - publish without await for maximum throughput
for await (const event of stream) {
  if (event.type === 'token') {
    channel.publish('token', event.text);
  }
}
```
Contributor:

Do we have any guidance on how users are meant to handle the result of the publish in this scenario? In some failure modes (e.g. a bunch of messages end up queued client-side and then get failed due to the connection becoming SUSPENDED, but the user just ploughs on publishing subsequent messages) they might end up with gaps in the published token stream.

Contributor:

(Or, perhaps an even more realistic scenario: some publishes are rejected due to rate limits but we plough ahead with subsequent publishes, some of which might succeed once the rate limiting subsides)

Contributor (author):

We are considering a page about discontinuity handling generally, and I think we can consider how to tackle this problem as part of that, but it needs some more thinking. I'll make a note. If you have any ideas on how to handle it, I'm all ears :)
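
One possible direction, purely as a sketch (with the promise-based ably-js API, publish returns a promise, so rejections can at least be observed without awaiting each call):

```javascript
// Sketch: surface failures on fire-and-forget publishes so the agent
// can react (e.g. flag the stream as discontinuous or re-publish).
let discontinuous = false;
for await (const event of stream) {
  if (event.type === 'token') {
    channel.publish('token', event.text).catch((err) => {
      // e.g. rejected due to rate limits or a SUSPENDED connection
      discontinuous = true;
      console.warn('token publish failed', err);
    });
  }
}
```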

```javascript
const channel = realtime.channels.get('{{RANDOM_CHANNEL_NAME}}');

const responses = new Map();
```
@lawrence-forooghian (Contributor) commented Dec 10, 2025:

A "Track responses by ID" comment, as above, would be useful here I think.

```javascript
const channel = realtime.channels.get('{{RANDOM_CHANNEL_NAME}}');

// Track responses by ID
const responses = new Map();
```
Contributor:

I'm not sure that it makes sense to suggest storing the partial responses in the case where we don't have explicit start and stop events given that the storage will potentially grow unboundedly. I'd suggest perhaps only showing the Map solution in the explicit start / stop events case and perhaps here just log the response ID alongside the message. Or have I missed something?

Contributor (author):

I included it because I wanted to illustrate that responses could be multiplexed on the channel (see "even when delivered concurrently" above, although we will likely have a specific page for this concept in more detail). I think in this case it's okay - the example is intended to be illustrative (and I wanted it to show how the client would append tokens for the same response together). In a real app, you would likely have more complex solutions if the data could genuinely grow large enough to cause memory issues (e.g. local storage and loading only the data into memory that is currently visible at your scroll position, and so on).

```javascript
// Handle response stop
await channel.subscribe('stop', (message) => {
  const responseId = message.extras?.headers?.responseId;
  const finalText = responses.get(responseId);
  // ...
});
```
Contributor:

Perhaps (assuming that the idea of the `responses` map is just to accumulate response content during generation) remove from `responses`?
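
That is, presumably something along the lines of (sketch):

```javascript
const finalText = responses.get(responseId);
responses.delete(responseId); // drop the accumulated response once handled
```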

Contributor (author):

Could do, although per my comment above, the example is intended to be illustrative, and if you want to render the messages they need to be somewhere (and I think it's out of scope for this page to discuss strategies for managing and displaying unbounded data in web apps generally).

@GregHolmes (Contributor) left a comment:

I've only made a couple of minor suggestions about starting the sentences more directly. Other than that, I think you've got this spot on.


## Publishing tokens <a id="publishing"/>

You should publish tokens from a [Realtime](/docs/api/realtime-sdk) client, which maintains a persistent connection to the Ably service. This allows you to publish at very high message rates with the lowest possible latencies, while preserving guarantees around message delivery order. For more information, see [Realtime and REST](/docs/basics#realtime-and-rest).
Contributor:

Suggested change:

```diff
- You should publish tokens from a [Realtime](/docs/api/realtime-sdk) client, which maintains a persistent connection to the Ably service. This allows you to publish at very high message rates with the lowest possible latencies, while preserving guarantees around message delivery order. For more information, see [Realtime and REST](/docs/basics#realtime-and-rest).
+ Publish tokens from a [Realtime](/docs/api/realtime-sdk) client, which maintains a persistent connection to the Ably service. This allows you to publish at very high message rates with the lowest possible latencies, while preserving guarantees around message delivery order. For more information, see [Realtime and REST](/docs/basics#realtime-and-rest).
```



[Channels](/docs/channels) are used to separate message traffic into different topics. For token streaming, each conversation or session typically has its own channel.
Contributor:

Suggested change:

```diff
- [Channels](/docs/channels) are used to separate message traffic into different topics. For token streaming, each conversation or session typically has its own channel.
+ [Channels](/docs/channels) separate message traffic into different topics. For token streaming, each conversation or session typically has its own channel.
```
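
For context, the client and per-conversation channel setup being described might look like this (a sketch; the key placeholder and channel naming scheme are assumptions, not from the docs page):

```javascript
import * as Ably from 'ably';

// One Realtime client per agent process, reused across conversations.
const realtime = new Ably.Realtime({ key: 'YOUR_ABLY_API_KEY' });

// One channel per conversation or session (naming scheme illustrative).
const conversationId = 'abc123'; // hypothetical session identifier
const channel = realtime.channels.get(`conversation:${conversationId}`);
```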
