Server-side Blazor Production and Reliability

**Edit: @rynowak hijacking top post for great justice**

## Summary

This issue tracks the work needed to support server-side Blazor in production.

I plan to dig into the following areas and, for each one, assess the current state, make recommendations, log bugs, and write documentation and guidance.

Working in order:

- Error handling
- Logging and Diagnostics
- Network Reliability
- Resiliency to app recycling
- Scale out

In the background the team will be fixing any high priority issues that come up.

## Error Handling

This includes the possible causes and categories of errors that can occur in a server-side Blazor application, and how application developers should be prepared to deal with them.

- [x] Make error handling explicit on boundaries between framework and user code (lifecycle methods, rendering)

We divide unhandled exceptions into two categories:
- Exception thrown with an observer (event handler)
- Exception thrown without an observer (during rendering)

Exceptions that are thrown as a result of an event handler (observed) are not always bugs. For example, it might be reasonable for a component to throw an exception in response to invalid data. We think that logging is good enough for these cases.

- [x] Make sure errors thrown on event handlers are logged on the server
- [x] Make sure client-side code cannot see unsanitized exception details

Unobserved exceptions are generally thrown during the rendering process and can corrupt state. We should not attempt to recover or reuse the circuit if an exception is thrown while rendering.

- [x] Tear-down/dispose crashed circuits and disconnect the client.
- ~~Notify the client that the circuit has crashed (ideally with UI).~~

**UPDATE** Most of this is done. The part that isn't is "displaying an error UI on the client", which is not planned for 3.0.
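The policy in this section can be modeled roughly as follows. This is a minimal TypeScript sketch, not framework code: `CircuitModel`, `ExceptionSource`, `Outcome`, and `handleUnhandledException` are hypothetical names invented for illustration, and the generic client message is an assumption.

```typescript
// Hypothetical model of the error-handling policy; none of these names are Blazor APIs.
type ExceptionSource = "eventHandler" | "rendering";

interface CircuitModel {
  id: string;
  connected: boolean;
  disposed: boolean;
}

interface Outcome {
  serverLog: string;     // full details, kept on the server
  clientMessage: string; // sanitized text that is safe to send to the browser
}

function handleUnhandledException(
  circuit: CircuitModel,
  source: ExceptionSource,
  error: Error
): Outcome {
  // Both categories are logged on the server with full details.
  const serverLog = "[circuit " + circuit.id + "] " + source + ": " + error.message;

  if (source === "rendering") {
    // Unobserved: state may be corrupt, so tear down the circuit and disconnect.
    circuit.disposed = true;
    circuit.connected = false;
  }
  // Observed (event handler): logging is enough, the circuit stays alive.

  // The client never sees the raw exception text.
  return { serverLog: serverLog, clientMessage: "An error has occurred." };
}
```

The point of the split is that only the rendering path ends the circuit, while both paths keep exception details on the server.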

## Logging and Diagnostics (Handled as part of https://github.com/aspnet/AspNetCore/issues/11792)

This includes logging and diagnostics of communications between server-side .NET code and client-side JS code, as well as any significant events on the server side. This can also include DiagnosticSource, EventSource, and EventCounters. We will likely make a prioritized list in this area and draw a cutline. My assumption is that the priority here is around the network ingress/egress.

We need to dial up the amount of logging we can produce on both the server and client, and make it possible for developers to diagnose and report issues that we can take action upon using logs.

- [ ] We need logging for entry/exit/results of all Hub calls on both the client and server. Some of this is provided by SignalR, but we need to log *the relevant data* at an appropriate level.
- [ ] Where we're using JS interop for fundamental framework concerns, we need to make the diagnostic information first-class. One way to do this is by converting JS interop to a hub method.
- [ ] We need to add logging and diagnostics for JS Interop. 

## Network Reliability

This covers the reliability of the SignalR connection, the ability to resume a circuit, and the ability for the browser to reconnect without data loss.

We have some user-reported issues that we are acting upon here, but we need to identify a strategy for testing reliability, and ideally this would dovetail with our other E2E testing strategy.

- [ ] When a client disconnects, rendering updates will be queued on the server and delivered in order once the client reconnects. https://github.com/aspnet/AspNetCore/issues/11964
- ~~[ ] When a message fails to send on the client, the message will be queued and delivered in order.~~ This will apply to ACKs only; we've decided not to do anything for JS interop.
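
The server-side queueing described in the first item could look roughly like the sketch below. It is an illustration under assumptions, not the actual implementation: `RenderBatch`, `RenderQueue`, and the idea of numbering batches so that one ACK covers everything up to that id are all hypothetical.

```typescript
// Hypothetical sketch: queue render updates while the client is away,
// then deliver them in order on reconnect. Not a Blazor or SignalR API.
interface RenderBatch {
  id: number;
  payload: string;
}

class RenderQueue {
  private pending: RenderBatch[] = []; // batches not yet acknowledged by the client
  private nextId = 1;
  private connected = true;

  constructor(private send: (batch: RenderBatch) => void) {}

  // Record the batch, and send it right away only if a client is connected.
  enqueue(payload: string): number {
    const batch: RenderBatch = { id: this.nextId++, payload: payload };
    this.pending.push(batch);
    if (this.connected) {
      this.send(batch);
    }
    return batch.id;
  }

  // An ACK for `id` releases that batch and every earlier one.
  acknowledge(id: number): void {
    this.pending = this.pending.filter(b => b.id > id);
  }

  disconnect(): void {
    this.connected = false;
  }

  // Replay everything still unacknowledged, oldest first.
  reconnect(): void {
    this.connected = true;
    for (const batch of this.pending) {
      this.send(batch);
    }
  }

  pendingCount(): number {
    return this.pending.length;
  }
}
```

In this sketch a batch that was sent but not acknowledged before the disconnect is sent again on reconnect, so a real client would need to ignore batch ids it has already applied.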

The three items below will be handled as part of milestone verification work after the Preview 8 CC date (as part of https://github.com/aspnet/AspNetCore/issues/12196).
- [ ] Test that a client can disconnect and reconnect multiple times without loss of data. 
- [ ] Test clients with a slow connection/high 
- [ ] Test scenarios with a high interaction rate, with the goal of providing guidance about patterns that do and do not work well for server-side Blazor.

## Resource Consumption

This covers understanding and mitigating the causes of excessive resource consumption on the server. Due to Blazor's stateful and connected nature, keeping an eye on how usage patterns can lead to resource exhaustion is important.

- [ ] Proactively remove a circuit when the user closes a tab. https://github.com/aspnet/AspNetCore/issues/12197
- [ ] ~~Proactively stop/start rendering when a user isn't looking at the tab.~~ No plans to do this
- [ ] ~~Deactivate circuits due to inactivity. This could be a sample since it might not fit all use cases.~~ No plans to do this
- [ ] ~~Can we rate-limit the number of connections open (per user/total)? This could use the `CircuitHandler` if we added the ability to reject a connection.~~ No plans to do this
- [ ] ~~Can we rate-limit events per-connection?~~ No plans to do this
- [ ] Provide guidance and documentation for understanding resource consumption per-user/per-circuit. https://github.com/aspnet/AspNetCore.Docs/issues/13294


## Resiliency to App Recycling

This covers the set of infrastructure and guidance users will need to build applications that function well when the server is shut down or crashes. This is important because server-side Blazor holds the application state in memory on the server: by default, if the server goes away, so does all of your state that hasn't been persisted to a data store.

- [x] Guidance and documentation for how to architect apps that don't rely on keeping all of the state for a workflow in memory (paginated form/wizard).
- [x] ~~Provide a sample of how components can be notified for a circuit shutdown/load and use that callback to save/load state.~~ No, we wouldn't recommend persisting only on circuit shutdown, as that would be highly unreliable. What if the server goes down unexpectedly? Instead we recommend and have guidance for persisting state frequently, e.g., whenever the user changes that state.
- [x] Sample of a component that persists UI state to local storage to be resilient to catastrophic failures.
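
As a rough illustration of "persist whenever the user changes the state", here is a minimal TypeScript sketch. All names (`KeyValueStore`, `MemoryStore`, `PersistedWizard`, `WizardState`) are hypothetical; the `KeyValueStore` shape is chosen so that a browser's `window.localStorage` would satisfy it, and the in-memory store exists only so the sketch runs outside a browser. This is not the sample referenced in the checklist.

```typescript
// Minimal storage shape; a browser's window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// In-memory stand-in for local storage (assumption, for running outside a browser).
class MemoryStore implements KeyValueStore {
  private data: { [key: string]: string } = {};

  getItem(key: string): string | null {
    const found = Object.prototype.hasOwnProperty.call(this.data, key);
    const value = found ? this.data[key] : undefined;
    return value === undefined ? null : value;
  }

  setItem(key: string, value: string): void {
    this.data[key] = value;
  }
}

interface WizardState {
  step: number;
  answers: { [field: string]: string };
}

// A multi-step form that writes its state to the store on every change,
// so a new instance can pick up where the previous one left off.
class PersistedWizard {
  private state: WizardState;

  constructor(private store: KeyValueStore, private key: string) {
    this.state = this.load();
  }

  private load(): WizardState {
    const raw = this.store.getItem(this.key);
    if (raw === null) {
      return { step: 1, answers: {} };
    }
    try {
      return JSON.parse(raw) as WizardState;
    } catch (_err) {
      // Unreadable saved data: start over rather than fail.
      return { step: 1, answers: {} };
    }
  }

  private save(): void {
    this.store.setItem(this.key, JSON.stringify(this.state));
  }

  setAnswer(field: string, value: string): void {
    this.state.answers[field] = value;
    this.save(); // persist on every change, not only on shutdown
  }

  nextStep(): void {
    this.state.step += 1;
    this.save();
  }

  current(): WizardState {
    return this.state;
  }
}
```

Because every mutation calls `save()`, an unexpected server crash loses at most the change that was in flight, which is the reason given above for not persisting only on circuit shutdown.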

## Scale out 

This covers the set of steps that are required to deploy server-side Blazor in a scalable way (multiple servers). Since Blazor uses server-side memory to hold state, we expect applications to commonly need multiple servers and a scale-out strategy.

- [x] Scale out strategies for server-side Blazor will rely on stickiness provided by the Azure SignalR service.
- [x] A non-Azure-based deployment of server-side Blazor will rely on stickiness/affinity being enforced by a load balancer.
- [x] Address the scale-out problems caused by data protection (we're using data protection for CircuitIds which introduces the need for external storage). 
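
To make "stickiness/affinity" concrete, the sketch below shows one simple way a load balancer could pin each client to the server that holds its circuit. It is an assumption-laden illustration: `StickyBalancer` and its round-robin choice for new clients are hypothetical and do not describe the Azure SignalR service or any particular load balancer.

```typescript
// Hypothetical affinity table: the first request from a client picks a server,
// and every later request from that client goes to the same one.
class StickyBalancer {
  private affinity: { [clientId: string]: string } = {};
  private next = 0;

  constructor(private servers: string[]) {
    if (servers.length === 0) {
      throw new Error("at least one server is required");
    }
  }

  route(clientId: string): string {
    const known = Object.prototype.hasOwnProperty.call(this.affinity, clientId);
    const existing = known ? this.affinity[clientId] : undefined;
    if (existing !== undefined) {
      return existing; // sticky: in-memory circuit state lives on this server
    }
    // New client: pick the next server in round-robin order and remember it.
    const chosen = this.servers[this.next % this.servers.length] as string;
    this.next += 1;
    this.affinity[clientId] = chosen;
    return chosen;
  }
}
```

Without this kind of pinning, a reconnecting client could land on a server that has no in-memory state for its circuit.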

### Known Items
- [x] https://github.com/aspnet/AspNetCore/issues/9893 Circuits are not being cleaned up without traffic 
- [ ] https://github.com/aspnet/AspNetCore/issues/5496 Investigate accumulation of Disposable transient services
- [ ] https://github.com/aspnet/AspNetCore/issues/8003 Allow robust reconnects when clients do not perform graceful disconnects
- [ ] https://github.com/aspnet/AspNetCore/issues/10449 Server-side Blazor E2E performance and capacity testing
- [ ] ~~https://github.com/aspnet/AspNetCore/issues/9117 Better server-side limits~~

