Design & implement system that reduces storage pressure and provides client feedback when PUSH is throttled #6685

Closed
Tracked by #6686
vladsud opened this issue Jul 9, 2021 · 21 comments


vladsud commented Jul 9, 2021

This is a fork of #6596. That issue will continue to track handling of the nack with retryAfter, while this issue tracks the following:

  • The lack of feedback to the client when PUSH is throttled due to flushing ops. Currently, 99% of that flushing is due to new clients joining; only when ops flushing is coalesced with a new summary commitment is there a direct way to provide client feedback, in the form of a nack.
  • Optional - reduce or completely eliminate the need for PUSH to flush ops to storage on client join.
  • Optional - expose a way to fetch ops from PUSH (Redis).

The key goal should be to push throttling up the stack, such that if the issue is too high an op rate, the client has enough information to learn about it and possibly adjust its behavior. While long term we may even want to disallow users from making changes, for now we will focus on consuming this signal in the summarization workflow, i.e. make sure the client does not keep piling up uncommitted summaries while PUSH has trouble, which creates a positive feedback loop.

The discussion of signals in #6596 is directly related to the problem being forked into this issue.
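As a purely hypothetical illustration of "consuming this signal in the summarization workflow", the TypeScript sketch below shows a summarizer pausing while the server reports storage pressure; ThrottleSignal and BackoffSummarizer are invented names, not actual FluidFramework APIs:

```ts
// Hypothetical sketch; ThrottleSignal and BackoffSummarizer are invented
// names, not actual FluidFramework APIs.
interface ThrottleSignal {
    type: "storagePressure";
    retryAfterMs: number; // how long the server suggests backing off
}

class BackoffSummarizer {
    private pausedUntil = 0;

    // Called when the server communicates a throttle signal.
    public onThrottleSignal(signal: ThrottleSignal): void {
        this.pausedUntil = Math.max(this.pausedUntil, Date.now() + signal.retryAfterMs);
    }

    // Checked before attempting a summary upload: skip summarization while
    // PUSH reports pressure, so uncommitted summaries do not pile up and
    // amplify the problem (the positive feedback loop described above).
    public canSummarize(): boolean {
        return Date.now() >= this.pausedUntil;
    }
}
```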

@vladsud vladsud added bug Something isn't working design-required This issue requires design thought labels Jul 9, 2021
@vladsud vladsud added this to the July 2021 milestone Jul 9, 2021
@vladsud vladsud self-assigned this Jul 9, 2021

vladsud commented Jul 9, 2021

Here is the summary of discussion between @GaryWilber, @marcmasmsft & me:

  1. We commit to a plan for the SPO fetch-ops API to be the single source of truth for ops, and thus to fetch ops (on an as-needed basis) from PUSH / Redis.
    • The details need to be designed. It's likely that the client would need to provide some hint for this flow to be engaged, both as a staging mechanism and as a way to reduce extra work on the server (the latter might be indirect; for example, we should disambiguate cases where the client asked for a specific range vs. proactively asking for the op tail with no specific expectation of how many ops are available).
    • There is some uncertainty here due to the 3 entities involved (and service-to-service calls), so we need to test its feasibility.
    • Committing ops while submitting a summary needs more discussion. It's possible that we will stay with the existing flow (i.e. push model, not pull), which would require the client to ask PUSH to flush ops before it can proceed with submitting a summary in the single-commit flow.
  2. Marcelo to work on a design doc, publish it, and figure out an execution plan.
  3. This will take substantial time to evolve, so if we need some shorter-term mechanism to solve immediate scalability problems, we should address it independently.

A temp solution in this space should be staged in a way where the service can retire such a temp capability easily; in other words, the client (and the whole system) should behave correctly when such a new capability is missing or becomes missing in the future.


vladsud commented Jul 9, 2021

Here is a proposal from me on how to address this problem (or rather, consideration points). I start with just the flush() API, but look into a getOps()-style API as well (it has very similar implementation requirements).

  1. PUSH to implement a flush() API (available via socket messages) to flush some set (current?) of ops to SPO.
  2. Client to provide an indication in the connection message if it supports flushing and will use it when the server supports it (enlightened clients).
    • The server would not flush ops on connection for such a client; otherwise it keeps the old behavior.
  3. Server to communicate back to the client whether it supports flushing.
    • The client should keep the existing behavior (not calling this API) if the server does not support it, such that PUSH can revoke this API in the future without notice.
  4. Client usage of flush:
    • Not flushing on summary (at least for now with 2-phase summary commit), with the assumption that ops flushing is bundled with summary upload by PUSH and a nack is used for communicating the response.
      - We can revisit this in the future.
    • The client will ask to flush ops on hitting a gap in ops (after performing a first successful storage ops fetch call but not filling in the gap).
      • Providing feedback (success / failure) is optional from the POV of a client that tries to fetch missing ops (such a client can just keep asking SPO for ops for 30 seconds, though feedback would reduce frequency and make telemetry more actionable).
      • But it's not optional from the POV of a summarizer client who is blocked by this operation. So there is a need to provide that feedback (at least 429 / any other recoverable failures) to all clients.
      • Instead of flush, we could use getOps() here with similar impact on the client. In that case feedback communication is part of the API. The perf characteristics are slightly different: it will result in fewer SPO calls and improved latencies, so it definitely needs to be considered, but flush() is more versatile, as it may be used in the future as a pre-step for summary upload.
  5. Benefits (assuming all enlightened clients):
    • PUSH does not need to flush ops on connection, substantially reducing the number of SPO calls (in some situations, including expiring tokens), as well as providing a recovery mechanism for fetching ops that were lost in the pipeline.
  6. Gap:
    • If we have a mix of clients, then enlightened clients need to know about anybody (clients, PUSH) doing flushes and hitting problems. But that's not enough - the summarizer may be a non-enlightened client and thus not aware of why summaries are not coming through, and thus continue to post summaries and make things worse (a positive feedback loop).

With that, I think there are two possible practical workflows:

  1. flush(), plus signaling when PUSH has issues on any flush().
    • Pros:
      • Slightly better characteristics in terms of client awareness of what's going on in a mix of enlightened & non-enlightened clients (if the summarizer is enlightened).
      • Possibly reusable for other workflows.
    • Cons: has more implementation nuances (signals need to be re-broadcast to any new client that joins, signals can be lost, etc.).
  2. getOps()
    • Pros:
      • More localized (it's just a single client's activity).
      • Better perf profile / latency.
    • Cons: a mix of enlightened & non-enlightened clients will behave worse in terms of scaling down on throttling failures.

My choice would be to go with a getOps()-style API, which I'd guess will look like this:

  • A "getOps" message from the client.
  • Ops are returned through the normal (socket) flow, maybe with some marker that they are the result of a prior getOps call (it may help the client assert that normal ops have to come sequentially).
  • Some message back to the client indicating completion of the call (the client can use it to understand whether the gap was filled in or not). If failures occur, they are communicated in the same message.
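A rough sketch of what those messages might look like on the wire; all field names here are guesses for illustration (the actual protocol was still to be designed at this point):

```ts
// Hypothetical message shapes for the proposed getOps flow; field names are guesses.
interface GetOpsRequest {
    type: "getOps";
    clientId: string;
    from: number;  // first missing sequence number (inclusive)
    to: number;    // last missing sequence number (inclusive)
    nonce: string; // lets the client correlate the completion message
}

// Ops themselves arrive through the normal broadcast flow, possibly marked:
interface BroadcastOp {
    sequenceNumber: number;
    contents: unknown;
    fromGetOps?: boolean; // marker: this op is the result of a prior getOps call
}

// Completion message; failures (e.g. throttling) ride on the same message.
interface GetOpsComplete {
    type: "getOpsComplete";
    nonce: string;
    code: number;          // success, or e.g. 429 on throttling
    retryAfterMs?: number; // present on recoverable failures
}
```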

@anthony-murphy

I wonder if this is all necessary.

Why does push need to flush on each join?

We support IDocumentDeltaConnection.initialMessages - does this not include all unflushed ops? Isn't that its purpose? Is it not sufficient?

Even with IDocumentDeltaConnection.initialMessages there is still the problem of feedback. One direction is server-to-client feedback: pushing ops to storage is behind, so we should hold off summarizing, and maybe in the future take more drastic action like going readonly. The other is client-to-server: basically, that the client cannot find some op gap. If we only have to solve the latter, and not the former, I wonder if other solutions are possible.


vladsud commented Jul 9, 2021

Gary would have a more precise description here; there is no guarantee on ordering / absence of gaps. I do not remember the precise reasoning, but it had something to do with fetching these ops being async: some ops could be added to Redis after this call is satisfied, without having been broadcast earlier to that (connecting) client. So they exist only in Redis past the point of connection (from the client's perspective) and are not available to the client in any other form.

@anthony-murphy

I'm not super familiar with this layer of the code, but I remember it working like this: the driver connects to the socket, which quickly starts sending ops, which the client should keep. The server then sends connect document success, which includes outstanding ops as of that point, but only needs to contain ops up to the point when the client started receiving them. Older ops come from the server.

It's possible push started flushing on each join because it wasn't possible to reliably do the above. @GaryWilber and @tanviraumi, is this a push-only behavior? AFAIK r11s writes these directly to mongo, so there is no need to flush anything. It seems other solutions are possible here, like tracking the first seq on each connection, and ensuring server-side that messages up to that seq are either flushed or in initial messages.

Also, @vladsud, do we envision this all being encapsulated in the driver? Seems like it could be hard to encapsulate given the separation of delta stream and delta storage.


GaryWilber commented Jul 10, 2021

@anthony-murphy Yes, it's a push-only behavior.

For r11s, ops are inserted into MongoDB. When clients want to get ops, the server checks MongoDB, so there is a single source of truth for that data.

For push, ops are either in SPO or Push. And technically, ops could be in 2 separate places within Push.
Clients can access SPO ops via the get ops api, and clients get the Push Redis ops via initialMessages. But due to how we are broadcasting messages (we broadcast the op before inserting it into Redis), there is a point in time when the op is technically inaccessible: it could be in a Kafka topic, waiting to be consumed and inserted into Redis. For docs with high op rates, clients are likely to miss some ops when they connect due to this. This is why we always post ops to SPO when clients join.

I have prototyped the solution you described, where we track the join op seq# and send the missed ops to the client. However, that solution cannot work for read-only connections, because there is no join op for them, so I stopped working on it; I think we need a more general solution for this.

One other solution is to insert ops into Redis before broadcasting them. This would ensure that joining clients always get all the ops. However, it would increase latency, because we would now be waiting for that Redis insert call before broadcasting - we didn't want that to be a thing... but maybe it would actually be worth it, since it simplifies the system & removes this problem. I'm conflicted about it.
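To make the trade-off concrete, here is a simplified sketch of the two orderings; insertIntoRedis and broadcastToRoom are stand-ins for Push's actual internals:

```ts
interface ISequencedOp { documentId: string; sequenceNumber: number; contents: unknown; }
declare function broadcastToRoom(documentId: string, op: ISequencedOp): void;
declare function insertIntoRedis(documentId: string, op: ISequencedOp): Promise<void>;

// Today: broadcast first, insert later. Low latency, but a client that
// connects between the two steps can miss the op entirely (it is neither
// in Redis yet nor broadcast to that client).
async function broadcastFirst(op: ISequencedOp): Promise<void> {
    broadcastToRoom(op.documentId, op);
    await insertIntoRedis(op.documentId, op); // op invisible to joiners until here
}

// Alternative: insert first, then broadcast. Joining clients always find the
// op in Redis, but every op now pays the Redis round-trip before delivery.
async function insertFirst(op: ISequencedOp): Promise<void> {
    await insertIntoRedis(op.documentId, op);
    broadcastToRoom(op.documentId, op);
}
```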

@GaryWilber

@vladsud Here's what the "get_ops" and "flush_ops" websocket apis look like right now.

Both apis have a request and a response, and take in a client id & an object as params. Both objects contain a client-provided nonce.

Get ops - only returns ops from Push's Redis:

  • If all ops in the inclusive range given by the from & to params were found, the code is 200 and the ops are returned in the messages property.
  • If no ops were found, the code is 404.
  • If only some ops in the range were found, the code is 206.

Flush ops:

  • It returns a 200 if it flushed ops, and includes the last persisted sequence number.
  • If no ops are flushed, it returns a 204.

In error cases the apis will return a code & a retryAfter property (if available).

I could make the get_ops api broadcast the ops normally instead of only returning them in "get_ops_response", if you would prefer that. I would probably rename the event to "request_ops" instead of "get_ops" in that case.
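Reconstructed from the description above, the response payloads presumably look roughly like this (a sketch, not the authoritative definitions):

```ts
// Sketch reconstructed from the prose above; not the authoritative types.
interface GetOpsResponse {
    nonce: string;        // echoes the client-provided nonce
    code: number;         // 200 = full range found, 206 = partial, 404 = none
    messages?: unknown[]; // the ops, present on 200 / 206
    retryAfter?: number;  // present on error cases, if available
}

interface FlushOpsResponse {
    nonce: string;
    code: number; // 200 = ops flushed, 204 = nothing to flush
    lastPersistedSequenceNumber?: number; // included on 200
    retryAfter?: number;
}
```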

@anthony-murphy

@GaryWilber I don't know enough about how the server code is structured but, ideally, we can know the first op sent to any client, and then the server knows it needs to send the gap from that op down to the last flushed op in initialMessages, which is part of connect document success. This should be agnostic to write vs. read connections.

@GaryWilber

@anthony-murphy That is a great idea. Originally the frontends were "dumb": they did not track what they were broadcasting to clients - they simply broadcast whatever they were told and never inspected the payloads. This is how r11s works too.

Due to some issues seen in the scale tests, I recently had to make the frontends smarter. They now parse the ops being broadcast to clients and track sequence numbers per document in order to detect issues (op gaps). Now that we have that capability, it should be possible to extend that logic further to per-client tracking too.

@vladsud Would a client recover correctly if this situation occurs:

  1. Client gets initialMessages [1, 2, 3].
  2. Server broadcasts 6.
  3. Server broadcasts 7.
  4. One second later, the server broadcasts 4, 5 to that client.

I think that was supported at some point but I forget if that's still okay or not. Today the per-document op gap recovery logic will not do that. But adding per-client op gap recovery logic may result in that sequence of events.


vladsud commented Jul 13, 2021

Yes, the client will recover just fine.
It will make a call to SPO on step 2, and will cancel it (today that means it will stop retrying infinitely for nothing) once step 4 occurs.
We can improve here and let it wait a bit before calling (in anticipation that the gap will be filled in).

But I'm not sure this solves it; or rather, I'm not sure the sequence you described is the one where we have issues. I think it's more like this:

  1. Client initiates a connection, which triggers a request to Redis.
  2. Client starts receiving ops 5, 6.
  3. Step 1 completes and the client receives the connect_document_success message with initialMessages containing 1, 2.
  4. Client keeps receiving ops 7, 8, ...

I believe that's what happens (I can confirm the observed behavior from the client side if needed), and it happens because of asynchrony in the system - ops 3, 4 are on the way to Redis but were missed by the call in step 1.

So if PUSH wants to fill in gaps, it needs to detect the gap between the very first op it sent and the last op in initialMessages, and attempt to fill it in.

Please correct me if I got it wrong.

@GaryWilber

That's right. I think both scenarios can happen (the one I listed and yours). The main issue is what you described:

So if PUSH wants to fill in gaps, it needs to detect the gap between the very first op it sent and the last op in initialMessages, and attempt to fill it in.

Push should be able to detect that, fetch the gap ops, and send them (per-client). I am going to try testing this in spdf this week.
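An illustrative sketch of that per-client gap check; all names here are invented, not Push's actual implementation:

```ts
// Per-client tracking: what the client got in initialMessages vs. what it
// has seen broadcast on the socket.
interface ClientOpTracking {
    firstBroadcastSeq?: number;     // first op this client received on the socket
    lastInitialMessagesSeq: number; // last op included in initialMessages
}

// Ops in (lastInitialMessagesSeq, firstBroadcastSeq) were still in flight to
// Redis when initialMessages was assembled, so this client never saw them.
function findGapToFill(t: ClientOpTracking): { from: number; to: number } | undefined {
    if (t.firstBroadcastSeq === undefined) {
        return undefined; // nothing broadcast yet; no gap to reason about
    }
    if (t.firstBroadcastSeq > t.lastInitialMessagesSeq + 1) {
        return { from: t.lastInitialMessagesSeq + 1, to: t.firstBroadcastSeq - 1 };
    }
    return undefined;
}
```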

@anthony-murphy

Awesome, it would be great if this works and doesn't necessitate any protocol changes. This still leaves the issue of client feedback in general when push is throttled for whatever reason. I'm really curious whether @tanviraumi or someone he knows has had to solve a similar issue in their throttling design. Does it just use NACKs? Or are there other mechanisms?


vladsud commented Jul 14, 2021

I briefly talked to Gary and suggested authoring a document describing the current behavior of PUSH (with respect to interactions with Redis, Kafka, etc., as well as document and client state tracking and the processes involved in filling in op gaps).

As for next steps, the client would need a change to not immediately hit storage on an op gap, since PUSH will fill in the gap within the next 0-20 ms (with this new design); today, ops coming out of order result in an immediate request to storage, which will be unnecessary with these changes. We already have enough telemetry to notice when we make unneeded storage calls, as well as cases where the client can't fill an op gap in a reasonable amount of time (30 seconds), so PUSH changes in this space can be validated by observing client telemetry.
The staging will look like this:

  1. PUSH implements logic to fill in the gaps.
  2. Client implements delaying the call to storage when a gap is detected, in anticipation of PUSH filling in the gap in the next N ms (see the sketch after this list).
  3. We validate via telemetry that the client never makes unneeded requests to storage (i.e. if the client does request ops from storage to fill in a gap, then the ops are always actually in storage). This tells us that step 1 is actually working and that the threshold we chose is reasonable (it's probabilistic, so the goal of step 2 is to reduce, but not completely eliminate, calls to storage).
  4. PUSH stops flushing ops to storage on client connection.
    • If there is a way to flight this change, we should, and validate it in our stress tests (likely by the client having a local change that provides some hint on connection to enable that behavior).
    • Validation here also includes ensuring that the client never fails due to not being able to find ops anywhere for 30 seconds.
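A sketch of step 2 with invented helper names: the client gives PUSH a short window to fill the gap over the socket before falling back to a storage fetch (the N ms threshold would be tuned via the telemetry from step 3):

```ts
// Invented names; a sketch of delaying the storage call, not the actual driver code.
const GAP_FILL_WAIT_MS = 100; // probabilistic threshold, tuned via telemetry

async function onGapDetected(
    waitForSequenceNumber: (seq: number) => Promise<boolean>,
    fetchFromStorage: (from: number, to: number) => Promise<void>,
    from: number,
    to: number,
): Promise<void> {
    // Give PUSH a chance to deliver the missing ops over the socket first.
    const filled = await Promise.race([
        waitForSequenceNumber(to),
        new Promise<boolean>((resolve) => setTimeout(() => resolve(false), GAP_FILL_WAIT_MS)),
    ]);
    if (!filled) {
        // PUSH did not fill the gap in time; fall back to storage.
        await fetchFromStorage(from, to);
    }
}
```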

@vladsud vladsud removed the design-required This issue requires design thought label Jul 14, 2021

vladsud commented Jul 16, 2021

Based on discussion at the PUSH review meeting:
PUSH has a lot of logic to resolve various issues with reliability of op delivery and ordering guarantees. And even after the latest set of fixes, ops are not ordered all the time, so the client still needs step 2 from above.

It feels like we need to move further toward one of the extremes:

  1. PUSH continues on this path and provides strong ordering guarantees, ensuring that all ops are delivered in order at all times, including initial ops on connection (this is almost the case, but not guaranteed; ops can come out of order).
  2. We remove most of the logic that helps with the above from PUSH and instead force the client (driver) to fetch ops from PUSH through a getOps-type API (the latter may be moved to be brokered by SPO).
    • This solution is described at the beginning of this issue.
    • The biggest con of that approach is additional network latency, so the feasibility of that path depends on how often it happens.

The most telling telemetry event is fluid:telemetry:DeltaManager:enqueueMessages (its reason & previousReason properties).

Query:

```
Office_Fluid_FluidRuntime_Performance
| where Data_eventName == "fluid:telemetry:DeltaManager:enqueueMessages"
| summarize count(), countif(Data_initialGap > 0), countif(Data_gap > 0), countif(Data_duplicate > 0) by Data_eventName, Data_reason, Data_previousReason
| order by count_
```

First, for scale: the relevant data set (limited to 0.4x loader versions, as that's when this event was added) corresponds to

  • 205K container sessions
  • 563K (re)connections.

The terms mean:

  • ReconnectOps means processing initial ops on some reconnect (not the first connection for this container).
  • InitialOps - same as above, but the very first connection in the container.
  • DocumentOpen_pending means we fetched ops from storage on document open, but there was some gap or duplicate.
  • opHandler - the op is coming from the socket.
  • reason (one of the above) - where the op came from when we detected a gap or duplicate ops.
  • previousReason - the source (reason) of the previously successfully processed op (before the gap/duplication occurred).

What this data tells us:

  1. Ranges do not have that many gaps. I.e. there are cases where initial ops as observed by DeltaManager would have a gap, due to establishing an early op handler and the potential for ops to trickle in while we wait for the "connect_document_success" message to bring initial ops - such early ops are merged into the initial ops set by the driver. The number of gaps for reason == "InitialOps" or reason == "ReconnectOps" is around 300.
  2. There are a lot of duplicates between initial ops (ReconnectOps) and ops coming from the web socket (on a previous connection), i.e. there are a ton of fast reconnects where initialOps bring the same content the client saw before (over 30K cases).
  3. Most interesting is the initialGap property when reason == "opHandler", i.e. when we consume an op from the socket and realize there is a gap. Here are the top offenders:
previousReason        count
opHandler             10066
ReconnectOps           6736
DocumentOpen_fetch      751

So in 10K cases we do see a gap in the op stream itself (that's 2% of connections). I believe this is the issue of the front end losing its Redis connection.
And in 6K cases (1.2% of connections) we see a gap between the first op and the initial ops on (presumably) the same connection.


vladsud commented Jul 16, 2021

And here is Gary's description of current system:

To recap, the 2 things causing the gaps:

  1. The fact that Push broadcasts to clients before inserting the op into Redis. This results in ops sometimes not being accessible to newly joining clients.
  2. Redis pubsub issues, so frontends sometimes lose ops.

And the workarounds:

  1. Push posting ops to SPO whenever a client joins. This results in the client asking SPO for ops after joining, and it gets them successfully.
  2. Added extra logic to detect these gaps, fetch the ops from Redis, and broadcast them to the client.

With my change in SPDF, both are now fixed with the same logic:

  1. The frontend detects gaps on a per-document level and sends the ops to the clients.
  2. The frontend detects gaps on a per-client level and sends the ops to the client (the gap ops may be sent up to 5 seconds after detection, so the client needs to handle non-sequential ops).

vladsud added a commit to vladsud/FluidFramework that referenced this issue Aug 2, 2021
…6685

(cherry picked from commit 9fd2a5cb7a7ea218dd0de69e38c9a86e15f99d59)
@vladsud vladsud modified the milestones: July 2021, August 2021 Aug 2, 2021
vladsud added a commit that referenced this issue Aug 2, 2021
Implementing solution described in #6685.

After implementing #6947, the client again hits the "too many retries" issue (a critical failure due to the client not being able to get ops within 30 seconds).
With this PR, the client always asks PUSH for any missing ops in parallel with fetching the same ops from storage and/or the local cache.
This reduces the number of cases where we get "too many retries", but does not eliminate them.

I've added minimal telemetry, but most requests can be tracked via storage request telemetry, as every call will be duplicated to PUSH if there is an active connection.

The flow can be optimized further by:

  • not asking PUSH for op ranges that precede the first op on the socket;
  • asking for ops in sequence (not in parallel), in the order local cache / PUSH / storage.

This PR (in its current form) should unblock further investigation and understanding of the "too many retries" problem, but also allow PUSH to be simpler (if needed / desired) by eliminating various workarounds, if we choose to go that route.
Or, if we choose for PUSH to provide stronger guarantees and ensure ops always come in order, then the lack of hits for the newly added telemetry will allow us to remove this code with confidence that it's not needed.

vladsud commented Aug 10, 2021

@GaryWilber, some observations based on my prototyping (in priority order):

  1. "If no ops are flushed, it returns a 204.". This is not very useful as it does not tell client the latest sequence number that SPO has. So client has no ability to make decision on what to do next (in situation where it needs to ensure that certain op made it to SPO in order to proceed with summarizaiton).
  2. It is very useful for IConnected response to have supportedFeatures property listing all optional features like "api_flush_ops" & "api_get_ops" such that client can make decision on whether to use them, and whether to assert if it does not get proper response from server when such features are supported (i.e. defense in depth). For example, client for most part ensures that it will not have overlapping ops requests, so it could assert that responses indeed come back before next request shows up. Currently I can't do it as I do not know if particular PUSH box supports that feature.
  3. Payloads do not have enough context on multiplexed socket for clients to understand if given response is applicable to given document. Similar, client can't tell if it's response for its request unless it tracks all nonces. In both cases it's possible to format nonce in a way where it has that data (i.e. it is formed in format of documentId-clientId-randomId, but it would be great for responses in both cases to contain documentId & clientId.
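For example, a nonce carrying that context could be built and parsed like this (illustrative helpers only):

```ts
// Illustrative helpers for a documentId-clientId-randomId nonce format.
function makeNonce(documentId: string, clientId: string): string {
    const randomId = Math.random().toString(36).slice(2);
    return `${documentId}-${clientId}-${randomId}`;
}

function parseNonce(nonce: string): { documentId: string; clientId: string } | undefined {
    const parts = nonce.split("-");
    // This only works if documentId / clientId themselves contain no "-",
    // which is one reason explicit fields in the response would be better.
    return parts.length >= 3 ? { documentId: parts[0], clientId: parts[1] } : undefined;
}
```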


GaryWilber commented Aug 10, 2021

@vladsud Great observations.

  1. If no ops are flushed, it would hopefully mean that the latest op the client processed is already in SPO. However, we know that ops are broadcast before being inserted into Redis, so that may not be the case.
    This is a bit tricky to solve, because Push does not explicitly track the last op persisted to SPO. It is tracked by the deli lambda, but it's not exposed to clients, and not exposed in a way that's easy for me to grab during the flush_ops call. Is this a blocker for the single stage post summary work?

  2. That does sound useful. For returning supportedFeatures, would this work:
    export type FluidSupportedFeatures = "api_get_ops" | "api_flush_ops"
    and add supportedFeatures?: FluidSupportedFeatures[]; to IConnected.

  3. I will add tenantId, documentId & clientId to the responses.


vladsud commented Aug 10, 2021

For #1, I'll go for now with the assumption that "all ops are flushed", but if we see SPO rejecting summaries due to a lack of ops in such cases, we might need to revisit it.

I'd prefer to go with connected.supportedFeatures = { api_get_ops: true, api_flush_ops: true };
mostly to allow a more extensible format (i.e. maybe some future features would want to communicate a level of support, or some other properties).
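A sketch of that shape (not the final definition):

```ts
// Sketch of the extensible record form; the final definition may differ.
interface IConnected {
    // ...existing members elided...

    // A record leaves room for richer values later (e.g. a level of support),
    // while today's features simply map to `true`.
    supportedFeatures?: Record<string, unknown>;
}

// Server side:
const connected: IConnected = {
    supportedFeatures: { api_get_ops: true, api_flush_ops: true },
};

// Client side: only use an API when the server advertises it, so PUSH can
// revoke the capability in the future without breaking clients.
const canFlush = connected.supportedFeatures?.api_flush_ops === true;
```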


vladsud commented Aug 11, 2021

@GaryWilber, need more insights / feedback:

  1. I'm getting 429s in our local stress tests in flush_ops_response. I have not looked at the response, but I assume we do not have retryAfter there. We need it.
  2. I'm also getting a lot of 409s. Why? There are no other file activities from the client side (flush & summary upload are done in sequence by the same client, and there are no images in the stress test, so there are really no other client activities that modify the file). Is this because PUSH-initiated flushes are async relative to client-initiated flushes? Shouldn't they be a single queue with no concurrency, and thus no way to hit a 409?

@GaryWilber

@vladsud

  1. retryAfter will be there. Let me know if it's not.
  2. Yes, it's because the flushes are async between the two. The flushes clients run can happen at the same time as the ones Push runs. The client-initiated flushes do not run through the same queue system, so they can happen concurrently.
    I looked at the 409s. From what I see, every single conflict was from Push posting ops due to client joins at the same time. So once we stop posting ops when clients join, you should see a large reduction in 409s.

vladsud added a commit that referenced this issue Aug 12, 2021

Please see #6685 for more details on the API.
The flush workflow is only enabled if the full summary tree (including the .protocol tree) is uploaded,
and only if the flush_ops feature is supported by PUSH (i.e. PUSH has a kill switch).
The client attempts to ensure that the required ops are flushed from PUSH's Redis to SPO before the summary is uploaded to SPO.

vladsud commented Aug 13, 2021

Everything covered in this issue has been implemented, including the get_ops & flush_ops flows.
Next step: we should start re-evaluating places where we added complexity on the PUSH side that are no longer required with the get_ops flow.
Specifically:

  1. We can deprecate the initialMessages flow - see Deprecate IConnected.initialMessages #7132.
  2. PUSH can stop flushing ops to SPO on client connection.
  3. We should evaluate any other places where PUSH can be simplified.

@GaryWilber - if you have any PBIs tracking future work in this domain, it would be great to list them.
Otherwise, I'm not aware of any client work in this domain, so closing this as fixed.

@vladsud vladsud closed this as completed Aug 13, 2021
appym31 added a commit that referenced this issue Aug 20, 2021
* Add assert short codes before release (#6852)

* Upgrade socket.io in R11s from v2 to v4 (#6836)

* Add ability to disable summarizer heuristics (#6841)

* Add ability to disable summarizer heuristics

* Split summarize heuristic data from runner

* end-to-end test for caching createNewSummary (#6835)

* Remove readAndParseFromBlobs api as no longer needed (#6816)

* Improve SummaryManager encapsulation (#6840)

Reduce calls on IContainerContext, reduce duplication of connection state management.

* throw error in protocol handler upon exception (#6846)

* [bump] package version to 0.45.0 for development after client release (#6858)

@fluidframework/build-common:     0.23.0 (unchanged)
     @fluidframework/eslint-config-fluid:     0.24.0 (unchanged)
      @fluidframework/common-definitions:     0.21.0 (unchanged)
            @fluidframework/common-utils:     0.32.0 (unchanged)
   @fluidframework/container-definitions:     0.40.0 (unchanged)
         @fluidframework/core-interfaces:     0.40.0 (unchanged)
      @fluidframework/driver-definitions:     0.40.0 (unchanged)
    @fluidframework/protocol-definitions:   0.1025.0 (unchanged)
                                  Server:   0.1028.0 (unchanged)
                                  Client:     0.44.0 -> 0.45.0
                  @fluid-tools/benchmark:     0.40.0 (unchanged)
                         generator-fluid:      0.3.0 (unchanged)
                             tinylicious:      0.4.0 (unchanged)
                             dice-roller:      0.0.1 (unchanged)

* Bump dependencies on container-definitions and core-interfaces to 39.7 everywhere (#6857)

* Bump dependencies version

   @fluidframework/container-definitions -> ^0.39.7

* Bump dependencies version

         @fluidframework/core-interfaces -> ^0.39.7

* update driver-definitions dep.

* lint

* revert driver-definitions bump

* revert changes in common/

* revert fix

* fix lerna and other package.jsons

* Add docs on auth (#6859)

* Docs system: Introduce new mechanisms to update and manage docs (#6819)

* Lint for banned words in docs (#6866)

* New react tutorial (#6842)

* @fluid-experimental/property-changeset TS conversion - Fix linting [1/3] (#6808)

* add fixes to ambient type definitions in changeset and properties packages (#6741)

* Accessibility fixes for website (#6786)

* Handling document lambda factory errors in document partition (#6865)

* Website reorg and cleanup (#6875)

- Remove advanced and concepts section. Everything is in /deep now.
- Add support for "outdated" and "discussion" article statuses.
- Add "placeholder" shortcode to mark unwritten sections in docs.
- Mark tutorial outdated.

* Do automatic hash algorithm fallback in common-utils (#6877)

P1 of fixing #6757

History here is we had a function to override the hashing function to enable scenarios where people wanted to do stuff locally involving insecure contexts (because crypto.subtle isn't available there). We decided against doing automatic fallback because it added the sha.js library to the webpack bundle for analyses and muddied those numbers (production impact is non-existent because it's a dynamic import).

Turns out irl usage quickly goes beyond "call this function here" and it's not at all clear what to do, so change this to automatic fallback. The time saved from not having to talk to people about this will far outweigh the risk of someone accidentally introducing a production dependency on the sha.js library and having to revert it.

(also cleanup some other build warning because it's annoying)

* Add removed client to IAudience's removeMember event (#6825)

One of the problems with using IAudience is that only getting the clientId in the removeMember event can make it cumbersome to make incremental changes to data on audience members, because once the event fires the associated IClient is gone and you're stuck either keying your data on the clientId (where use cases tend to favor keying on a user) or iterating to match the clientId. Add the IClient to the args for the event to make this easier. Also handle the case where an audience join signal gets lost by only emitting an event when a member is actually removed, and log when we try to remove a non-existent member. There should be very little functional change for consumers by doing this because in these situations the consumer couldn't find the missing member in the audience anyway,

* Create an API to change the start/end of an interval (#6800)

Allow changes to start and end independently of each other. A value of "undefined" passed as start or end indicates no change to that endpoint.

* update routerlicious deps on common-utils (#6884)

Part ??? of fixing #6757

Going through bumping all the layers to use prerelease versions of common-utils following a change there. This change bumps usages in serve/routerlicious.

Next steps:

Create prerelease for server 0.1028
Update client and tinylicious to use common-utils 0.32.0-0 and server 0.1028.0-0. Tinylicious must be updated here because the common-utils change is hashing changes to facilitate local testing in insecure environments, and that entire dependency chain needs to be updated.

* Remove IOdspSnapshot, IOdspSnapshotCommit, IOdspSnapshotBlob from odsp driver. (#6765)

* Remove blob contents from snapshot in rehydrate container, loader changes (#6822)

* Replace checkNotTimeout with raceTimer helper (#6809)

Replace checkNotTimeout with raceTimer helper

* Remove fluid-object-interfaces example package (#6881)

* Track serverMetadata per client in deli (#6882)

* Stricter typing for producers (#6883)

* Update base64-js version in common-utils (#6892)

As part of bumping the common-utils version. Client build is complaining there are multiple versions of base64-js, so bump the version in common-utils to latest to match the duplicate.

* Clarify audience docs (#6895)

* Container.close() to be more robust (#6894)

Closes #6690
Keeping #4244 opened as ideally we make callback non-optional and remove old behavior or raising "error" event.

* bump container-definitions dependencies to 0.39.8 (#6874)

* Add afterSequenceNumber option to on-demand summarize (#6860)

Add enqueueSummarize

* Client may get stuck fetching ops from server not realizing it has all the ops (#6898)

Address https://onedrive.visualstudio.com/SPIN/_workitems/edit/1166727

* Add snapshot conversion logic for new binary odsp snapshot format (#6560)

* Add ADO pipeline for building the docs site (#6899)

We currently build TSDocs in CI, but not the complete docs site. This PR will enable building the whole site for validation as part of PR checks. As a welcome side effect, this will get us linting coverage on our docs markdown.

* React tutorial: Fix bug in object destructuring (#6906)

* Fix typos in dds interceptions documentation (#6905)

* Pending ops telemetry (#6838)

* Add threshold telemetry sender

* Add (c) header

* Fix export

* Fix tests

* Fix tests

* Rename

* Fix event name property in the tests

* Add some documentation

* strictEqual -> deepStrictEqual

* Fix event names in test

* Simplify event

* Restructure loggers

* Add undefined as data point in the tests

* rename parameter

* Rename const

* Some PR feedback

* SharedMatrix: Fixed not serializing handles in toString (#6904)

* Deli metrics (#6868)

* Refactor the emums

* Adding a config to enable new telemetry framework

* Adding an enum

* Noop sooner if lumber is disabled

* Use the lambda serviceConfiguration

* Remove always-null parentBranch reference (#6912)

* Update client/tinylicious to newest common-utils/server prerelease versions (#6890)

Fixes #6757

Update client and tinylicious packages to use latest common-utils/server prerelease version. Tinylicious needs to be updated here as well because the common-utils change targets local testing with tinylicious.

* chore: convert @fluid-experimental/property-properties to TS [ 2 / 3 ] (#6891)

* add initial ts config
* move tests into src folder
* fix source compiling
* make tests run
* fix policy
* rename files to camelCase

* Extend CombinedProducer to support sequential sends (#6923)

* Fluid Debugger: Only Sanitize if anonymize is true (#6922)

fixes #6921

* Introduce normalizeError for getting a valid IFluidErrorBase from an arbitrary caught error (#6764)

This change introduces the `IFluidErrorBase` interface described in #6676, and adds a function `normalizeError`.

`normalizeError` will take any input at all and return a valid `IFluidErrorBase`, whether the original object or a new one if necessary.

It also applies annotations to the resulting error, either telemetry props or an error code to use if there isn't one on the given error.

* LoggingError support for keys to omit from logging (#6900)

All the own properties on a LoggingError are logged by default, and we need a way to specify that a particular member should not be logged (e.g. AuthorizationError.claims).  Someone can still do bad things by doing `(error as any).foo = somethingPrivate;` but at least mainline cases can be handled this way.  See #6485 for more context.

* Add logic to track speed of receiving join ops and reset connection on timeout (#6931)

Add logic to track speed of receiving join ops on "write" connection and force reconnect if not received in 30 seconds.
This is related to https://onedrive.visualstudio.com/SPIN/_queries/edit/1156694/?triage=true - an issue where clients are not receiving any ops on "write" connection for 1 hour while there are a lot of changed in the document.
Related PR: #6928

* Fixing potential issues around client having connection that does not broadcast ops (#6928)

We observed that some clients in ODSP are not observing any ops on a "write" connection for an hour, until token expired. On reconnect, client finds itself being behind with many thousands of ops.

We do not have good theory, but I went through client code looking for cases where we might either not have proper "op" handler installed, or maybe connection is completely broken. Based on code inspection:

Due to asynchrony, DeltaManager might get closed connection as a new connection and not realize that. Currently there is no way to check for that state, so adding IDisposable to connection to be able to detect it and recover.
event registration logic is a bit complicated and while code inspection did not find a bug, I want to be able to assert that event registration / propagation is done properly, so refactoring this code and adding more asserts.

More details are can be found in https://onedrive.visualstudio.com/SPIN/_queries/edit/1156694/?triage=true
Also related PR: https://github.com/microsoft/FluidFramework/pull/6931

* Too many fluid:telemetry:RemoteChannelContext:StorePendingOps events in stress tests (#6933)

* Missing audience member: Get rid of early signals (#6935)

Closes #6910

* Try to read from new format while summarizing in detached container (#6903)

* Disable recovery for when client does not observe its own join op for 30 seconds (#6937)

Newly added telemetry / recovery happens too often in our own stress tests.
Based on telemetry, event and recovery happens correctly - only when only when client is in "connecting" state for 30 seconds with "write" connection mode and there is no transition to "connected state". I.e. indeed we have not observed our own join op.
And based on auxiliary telemetry (or rather lack of it) we are not processing ops, though we need to get better telemetry here to confirm that.

So all in all, the problem is rather easy to hit in stress tests, but I have no good theory. So disabling recovery path to keep code on par with previous behavior, but keep logging and increasing it to 60 seconds to get new data. This should allow ODSP tests to run tonight undisturbed, while I use local tests to get better sense of what's going on.

* Remove existing property - Part 6, IFluidDataStoreRuntime (#6869)

* Make ds runtime back compat

* Runtime.existing

* Fix datastoreruntime

* Fix datastoreruntime

* Add existing to factory method

* more experimenting

* Some more refactoring

* Change lazyloaded data object factory to not use runtime.existing

* Fix datastorehelpers, add existing to testfluid object and scheduler

* Simplify puredataobjectfactory

* back compat data object

* Extract back-compat

* Remove usage of runtime.existing

* Remove usage of runtime.existing take 2

* Some formatting fixes

* Fix scheduler

* Add note in BREAKING.md

* Fix BREAKING.md

* PR feedback - rename instantiateExisting func

* PR feedback - get rid of document.existing

* Some more notes about breaking changes, some PR feedback about simplfying initialize internal

* PR feedback: leave getDataObject alone

* Small correction to BREAKING.md

* QuorumProxy: move to client, adjust event lister warning limit (#6949)

QuorumProxy is heavily used object (as its exposed to all data stores / DDSs) and thus we do see Node warnings in various stress tests about exceeding event listener limit (of 10).

Raise this limit to 50 and move implementation to client as it does not belong on the server side.

* Fix deli server metadata issue (#6939)

* Fix ordered client election loggers (#6955)

* r11s-driver: expose minBlobSize config via driver policies (#6618)

* Fetching ops from PUSH (#6954)

Implementing solution described in #6685.

After implementing #6947, the client hits again "too many retries" issue (critical failure due to client not being able to get ops within 30 seconds).
With this PR, client always asks PUSH for any missing ops in parallel to fetching same ops from storage and/or local cache.
This reduces number of cases when we get "too many retries", but does not eliminate it.

I've added minimum telemetry, but most request can be tracked by tracking storage request telemetry, as every call will be duplicated to PUSH if there is active connection.

Flow can be optimized further by

Not asking PUSH for ops ranges that are preceding first op on socket
Ask for ops in sequence (not in parallel), in order of local cache / PUSH / storage.
This PR (in current form) should unblock further investigations and understanding of "too many retries" problem, but also allow PUSH to be simpler (if needed / desired) by eliminating various work arounds, if we chose to go that route.
Or, if we chose for PUSH to provide stronger guarantees and ensure ops are always coming in order, than lack of hits for newly added telemetry will allow us to remove this code and have confidence it's not needed.

* Small optimiaaiton - push() always went through processDeltas(), even on paused connection (#6932)

* Move webpack-fluid-loader to @fluid-tools scope (#6956)

* Better telemetry for fetching ops (#6947)

Problem statement:

Newly added NoJoinOp telemetry event points out to a condition where ops are not processed for very long time.
Examining telemetry shows that all such cases have one thing in common - there is outstanding ops request to service that takes a long time. And in pretty much all the cases actual network request (as indicated by OpsFetch event) takes relatively short time, but overall process (GetDeltas_end) takes long time, occasionally minutes.

I believe in all these cases ops never get to storage (in reasonable time), but in majority cases client actually receives missing ops through websocket (though in all cases, read on). DeltaManager does cancel request in such case (see ExtraStorageCall event), but request is not immediately cancelled, blocking future requests (see fetchMissingDeltasCore - it allows only one outstanding call). As result, whole process does not more forward for the long time.

I do not have in-depth understanding where we get stuck in the process, but one such case is obvious waitForConnectedState() - it's possible that browser lies to us or does not quickly reacts to online/offline, which may cause process to get stuck for up to 30 seconds.

The other one more likely reason - 429s returned from SPO for fetching ops. We do not have logging for individual retryable attempts, so this goes unnoticed today.

Fix:

1. Make op fetching process return on cancellation immediately by listening for cancelation event.
2. Add telemetry for some sub-processes, like fetching ops from cache, if it takes longer than 1 second.
3. Remove ExtraStorageCall event as it fires on all successful fetches, and instead make core op fetching logic raise GetDeltas_cancel event instead if cancel was processed before all ops were fetched.
4. Add telemetry (logNetworkFailure in getSingleOpBatch) for individual failed fetched, such that we get insights for things like 429 that may block fetching process (but currently not visible in telemetry).

Outcome:

This does address many, but not all NoJoinOp issues (remaining needs to be looked deeper).
But this in turn brings back "too many retries" errors, indicating that one of the reasons we run into initial problem is due to client not being able to find relevant ops (and on top of it - not failing sooner, but hanging). These errors needs to also be looked deeper to understand if bugs are on client or server side.

* Throttler unit tests (#6909)

Fixes #6472.

Adds comments and tests to Throttler.
Changes SummaryManager to depend on IThrottler to be passed in rather than creating locally.
"Fixes" and enables a test to handle the case when subsequent getDelay() is called before the previous delay elapsed. Before the virtual times would keep increasing further into the future, but now they are capped at the real current time by subtracting them back down.

* E2E Pipeline: Run Local to get baseline logs (#6934)

Run the local server tests as a baseline in the e2e pipeline. The primary benefit here is to get logs for those tests, so when analyzing we can easily tell if an error is unique to a sever, or not.

related to #6910

* Fewer events in stress tests (#6959)

* Use PropertiesManager to handle property merges on Interval (#6824)

Add a PropertiesManager to Interval and SequenceInterval and let it handle changes to Interval properties. Add a changeProperties API. Support cross-client property changes via the change op.

* Update test-real-service.yml for Azure Pipelines

* Update test-real-service.yml for Azure Pipelines

* Add if-match header while summary upload (#6963)

* Test summarizer node (#6885)

Fixes #4459 by adding some tests for SummarizerNode, setting up the infrastructure for more.

* Adding tracking of get_ops PUSH requests (#6966)

Tracking from, to, duration.

* Extract request summarizer from SummaryManager (#6908)

Extract request summarizer function from SummaryManager.
Switch to use requestFluidObject.

* Add rushstack-based eslint config (#6920)

* Add some more throttler tests (#6967)

Consolidate and add more Throttler tests

* Pull out opsUntilFirstConnect check and use opsSinceLastAck instead (#6907)

Instead of doing "join" op sequence number minus DeltaManager.initialSequenceNumber, we use SummaryCollection.opsSinceLastAck. The former was a count of ops we "caught up" with, but the latter is ops since the last summary ack. Both should be equivalent when loading from latest snapshot, but in cases of loading from cached snapshot, the latter is more accurate.

Also removes use of PromiseTimer for initial delay, and instead uses the new delay common-utils function.
Also pulls out the handler to further reduce number of ContainerContext references within SummaryManager.

Changed the logic to just check opsSinceLastAck at the point of deciding whether to initial delay or not. Now it uses
a new checkBypassInitialDelay() function in conjunction with a deferred to simplify this logic. We also now check more frequently if we need to bypass the delay- any time refreshSummarizer() is called in either Off or Starting state (Starting State is new here).

* Docs: Add package placeholder (#6970)

* Snapshot tests: Added option to generate new reference snapshot files (#6925)

* Clarify intent of new hash fallback chunk (#6964)

Recent change to do automatic hash fallback when running in insecure contexts uses a dynamic import which webpack will create under all circumstances even if it is not expected to ever get served. As is it receives anonymous naming ("1.js" in our local outputs), which is non-descriptive and can be confusing for those updating their fluid version and seeing a large bundle size increase. This change adds additional notes around that increase and gives the chunk a more descriptive name matching its functionality.

* Removed client-api dependency from replay tool (#6941)

- Removed client-api dependency from replay tool.
- Added simple code loade, data object factory and runtime factory to replay tool.

* Bump tar from 4.4.13 to 4.4.15 in /server/routerlicious (#6975)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.15.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.15)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump tar from 4.4.13 to 4.4.15 in /server/historian (#6974)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.15.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.15)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump tar from 4.4.13 to 4.4.15 in /server/gitrest (#6973)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.15.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.15)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add more logging and recovery attempts for stalled connection (#6978)

1. Added NoJoinOpRecovered to see if connection ever recovers
2. Converted NoJoinOp to heartbeat instead of single event. Likely will undo that in future when we have more data here, but currently it will help understand how long connection existed in cases where currently NoJoinOp is the last event in whole trace (not clear if that's because process dies, or some other reasons).
3. Added noop pings and reconnects as recovery options taking on broken connection. Telemetry around if it helps will point us to next steps to take in this area.
4. Validate that we somehow did not miss "disconnect" event - all key methods on connection to assert it's not used in disconnected state.

* Docs: Broken link (#6981)

The Docs on the Fluid Service https://fluidframework.com/docs/deep/containers-runtime/ attempts to reference the Summarizer topic, but the outgoing link is incorrect.

* Add explicit types for some public APIs (#6983)

* hot fix: change compressSmallBlobs level to 2 (#6918)

* Scribe metrics (#6944)

Scribe metrics

* Snapshot tests: Enabled GC and added ability to test snapshots created via detached container (#6942)

* Test summary manager (#6916)

* Add FrsClient integration test infra (#6761)

* added tinyliciousclient test cases + exisiting flag fix + allow DiceRoller in test import

* removed extra new line

* added first test case for FrsClient

* added local tinylicious test case + scripts

* added frs test infra

* removed tinylicious test case since it's not required anymore with the toggle

* added CreateClient + run frs test script

* rebuilt project

* reverted back lock packages

* removed js files

* reverted back routes.ts

* frsclient package.json script ordering

* added copyright to CreateClient

* reordered package.json

* added cross-env to define env variable

* converted CreateClient.ts to a single function + changed file name to FrsClientFactory

* converted CreateClient.ts to a single function + changed file name to FrsClientFactory

* removed assert + type/assert and renamed scripts for consistency

* Shared object cleanup (#6990)

null check in constructor, toJSON, setOwner, debugAssert

* Publish initial sequence docs on website (#6265)

* Fix shared Internal e2e tests (#6993)

The conflicting op test in e2e's SharedInterval.spec.ts expects certain operation to be sequenced in a certain order.
Even if we create the op in some order in the tests, if they are from different client session, the network doesn't guarantee the order of arrival. So we need to make sure to call processOutgoing in between to make sure the generated op has round trip the server and get sequence (but not processed yet), before generating the second conflicting op.

Also reduce the test timeout targeting r11s from 30s to 5s (in the pass, all test finish < 2.2s)

* Improved console output for getkeys (#6988)

* Adjustments based on latest scalability run (#6994)

1. We incorrectly quantify fetch timeouts as non-recoverable errors - that's not the case.
2. sending noops on broken connection does not really make a difference (though these ops make it to server), so remove this code and switch to forcing reconnect on error
3. Non-recoverable errors do not register as errors. Fix it.
4. Assert is hit due to timer callback firing after timer was cancelled - looks like a race condition.

* Canvas view/model separation (#6897)

This change does the following:

-Exposes a public interface on the Canvas data object to be used by a separate view (really just the Ink DDS -- a better public interface is probably possible, potentially by more deeply incorporating elements of the InkCanvas class).
-Splits the view out from the data object, and rewrites the view in React
-Removes usage if IFluidHTMLView
-Uses the example-utils ContainerViewRuntimeFactory to combine the view/model.
-Enables strict mode for the package

* React view/model separation (#6913)

This change updates SyncedDataObject and all of its customers to no longer implement IFluidHTMLView but instead use a split view/model approach.

For the clicker-react examples which use webpack-fluid-loader and a container-views approach it does this with a ContainerViewRuntimeFactory.

For likes-and-comments which does not use webpack-fluid-loader this change converts it to an external-views approach, where the app pairs the view with the data object itself.

* Adding Fetch timeouts (#7001)

Closes #6997
We will monitor telemetry to asses if approach of timeouts is valid, and if 30 second timeout is the good point to be

* Moving off method from LocalDocumentDeltaConnection.create() and adding connected property expected on socket (#7006)

Required for PR #6986.
It needs connected property. I can't patch getter property similar to how we patch "off" method, so I have to change it, and given I'm changing it, remove patching part of "off" method as well

* Don't try to insert blob if already present (#7008)

* Update reference to test snapshots in (#7011)

* [Property Proxy Typescript Migration 3/3] migrating property proxy to TS (#6742)

* rename files to *.ts
* migrate utilities to TS
* remove tsdoc-incompatible type description in comment
* migrate proxyhandler to TS
* migrate propertyProxy to TS
* cast property to its correct type
* migrate componentSet to TS
* remove unused ambient types
* add type guards in utilities.ts
* fixes for propertyProxy.ts
* use proxy type in proxyHandler.ts
* remove unneeded import in utility.ts
* update interfaces
* add import to propertyProxy.ts
* migrate componentMap to TS
* migrate componentArray to TS
* add missing include to propertyProxy.ts
* componentArray.ts fix missing changes
* add comment to lastIndexOf issue in componentArray.ts
* update comment componentArray.ts
* adjust eslint, tsconfig, build & test scripts
* update configs
* add exports to index.ts
* move ReferenceType to interface.ts
* update imports
* cleanup tsconfig
* cleanup package.json
* cleanup arrayProxyHandler.ts
* cleanup
* lint fixes
* fix some linting errors
* fix linting errors
* remove unused index.d.ts
* cleanup
* fix @typescript-eslint/ban-types linter errors
* fix @typescript-eslint/consistent-type-assertions linter errors
* convert utility class to namespace
* refactor set to be more type friendly
* cleanup
* adjust jest version
* remove tsdoc-metadata.json
* add handling of NaN
* cleanup componentSet.ts
* further cleanup
* fix comment
* add todo

* Snapshot tests: Added information to README on adding and updating and submitting changes to snapshots (#6984)

Added the following information to the snapshot tests README:
- How to submit changes to the test snapshot content.
- How to add new test snapshots to the repo.
- How to update existing test snapshots in the repo.

* Guarantee expected op ordering in conflicting changeProperties test (#7009)

Use processOutgoing to make sure ops that originate on different clients are processed in the order that the test expects.

* Copyprops precedence (#6969)

Fixes #6758.

As an error flows through FF, we may add telemetry properties at various points. The closer to the source that a prop is added, the more authoritative it is, so when adding subsequent properties, do not overwrite.
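
A sketch of the precedence rule (not the actual LoggingError implementation); props added by earlier, closer-to-the-source layers win:

```typescript
type TelemetryProps = Record<string, string | number | boolean | undefined>;

// Later layers only fill in props that are still missing.
function addPropsWithoutOverwrite(existing: TelemetryProps, incoming: TelemetryProps): void {
    for (const [key, value] of Object.entries(incoming)) {
        if (existing[key] === undefined) {
            existing[key] = value; // keep the earlier, more authoritative value
        }
    }
}
```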

* Primitives example view/model separation (#6879)

This change does the following:

-Exposes a public interface on the DdsCollection data object to be used by a separate view
-Modifies the view to take that object and use its public interface
-Removes usage of IFluidHTMLView from the model
-Uses the example-utils ContainerViewRuntimeFactory to combine the now-separate view and model
-Minor renames and cleanup (further improvements are certainly possible but the purpose/scope of this change is view/model separation).

* Image gallery example view/model separation (#6965)

Also rewrites the view and updates the libraries used

* Deprecate unused DriverErrorType.genericError (#6489)

Since it's unused and indistinguishable from ContainerErrorType.genericError.

* Remove stop() from IRuntime (#6998)

Also throw in implementation in ContainerRuntime.

* CreateProcessingError now annotates all errors as dataProcessingErrors (#7012)

* Introduce UsageError to replace some asserts (#6961)

Fixes #6315

Asserts should be reserved for low-level internal invariants that indicate core flaws when hit. These two asserts are about proper use of the API, so we're converting them to a new error type.

These also include the telemetry prop usageError which would indicate that an error isn't a fault of the framework, but how the API is being invoked by the consumer.
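
A sketch of the conversion, assuming a UsageError class along the lines described above (names are illustrative, not the exact framework API):

```typescript
class UsageError extends Error {
    public readonly errorType = "usageError";
    // Telemetry prop signaling caller misuse rather than a framework fault.
    public readonly usageError = true;
}

function startSummarizer(clientType: string): void {
    // Previously: assert(clientType === "summarizer", "wrong client type");
    if (clientType !== "summarizer") {
        throw new UsageError("startSummarizer must be called on a summarizer client");
    }
    // ...
}
```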

* Expose R11s WholeSummaryUpload functionality in client (#7020)

* Add hooks for taggedLogger in the runtime. (#6926)

Last step of #5560. Creates a tagged-logger adapter class (that can wrap vanilla loggers that do not handle tags) and adds runtime handling for taggedLogger property of IContainerContext.

* Removing legacy container creation path from `container.load()` (#7005)

See #3429 and #6033

The snapshot tests were the only consumers (there was an assert which could only be bypassed with a 'magic string' used solely by these tests) of this code path. Considering that snapshot tests are now always starting with an empty snapshot, this is no longer needed.

* Enable API report for all client packages (#6888)

There are several reasons for this change. First, changes to the API report don't always represent breaking changes, but they do always represent changes to the public API - even if the change is just "this method wasn't documented but now it is." Changes to BREAKING.md are intentional so auto-labeling them remains.

The other reason is that we don't have good visibility into the changes that are being made that affect the public API. It's difficult to tell from code changes alone if a change will affect a public API. With this change, all the packages' public API changes will be tracked, and we can use this history over the coming months to help inform how we manage changes and breakages.

* Optionally have the Alfred generate a container id on creation (#7022)

* Add utility api to convert uint8Array to Array buffer (#7013)

* Fixing a bug: always fetching ops from PUSH (#7017)

Fetching ops from PUSH was under an "if (from < this.firstCacheMiss)" check, which resulted in ops not always being requested from PUSH.
Fixing it by making fetching from PUSH a distinct callback that is always invoked.

How found:

In the case of customDimensions.containerId == "65f88376-9665-465b-bc79-ae18d0ea0647", forcing a reconnect actually resolved the issue of a stalled client, because the initial ops resolved the op gap. The client was stalled prior to that for two reasons:

fetching ops was first not bringing any ops, and then it started timing out.
The fact that it was resolved on reconnect forced me to inspect the code, where I saw that we do not always request ops from PUSH when we are looking for ops. That's the bug.
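
A sketch of the fix's shape, with assumed names (cache/push/storage sources); the point is that the PUSH request becomes its own step that always runs when ops are still missing:

```typescript
interface Op { sequenceNumber: number; }
interface IOpSource { getOps(from: number, to: number): Promise<Op[] | undefined>; }

async function getOps(
    from: number,
    to: number,
    cache: IOpSource,   // local cache; misses tracked via firstCacheMiss before the fix
    push: IOpSource,    // PUSH (redis) - previously only consulted inside the cache branch
    storage: IOpSource, // SPO storage, the fallback
): Promise<Op[] | undefined> {
    return await cache.getOps(from, to)
        ?? await push.getOps(from, to)   // always invoked now, not gated on firstCacheMiss
        ?? await storage.getOps(from, to);
}
```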

* Verify runtime sequence number matches protocol sequence number (#7015)

Fixes #7002 by storing the sequenceNumber in .metadata blob within the runtime-generated summary. Then comparing this to the one that the server generated in the .protocol tree.

If they don't match, close the container with a critical error by default. Allow this behavior to be overridden with a new runtimeOption: "loadSequenceNumberVerification", which defaults to "close". If set to "log", instead just log an error to telemetry on mismatch, and if set to "bypass", do not even perform the check.

Split IContainerRuntimeMetadata into a separate ReadContainerRuntimeMetadata (which unions with undefined and has sequenceNumber as required) type and WriteContainerRuntimeMetadata (which has sequenceNumber required).

This PR also removes the getBackCompatRuntimeOptions function, which was used to handle compatibility when the format of the runtimeOptions changed from a flat list of properties.
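
A sketch of the check described above; names approximate the PR description, not the exact ContainerRuntime code:

```typescript
type SeqVerificationMode = "close" | "log" | "bypass"; // "loadSequenceNumberVerification"

function verifySequenceNumbers(
    runtimeSeq: number | undefined, // from the .metadata blob in the runtime summary
    protocolSeq: number,            // from the server-generated .protocol tree
    mode: SeqVerificationMode,      // defaults to "close" per the PR
    logError: (eventName: string) => void,
    closeWithError: (message: string) => void,
): void {
    if (mode === "bypass" || runtimeSeq === undefined || runtimeSeq === protocolSeq) {
        return;
    }
    logError("SequenceNumberMismatch");
    if (mode === "close") {
        closeWithError(`runtime seq ${runtimeSeq} !== protocol seq ${protocolSeq}`);
    }
}
```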

* fix blobManager assert tripped after serialization/rehydration (#7003)

Include detached blob IDs in snapshot returned by serialize() and load them upon rehydration.

* policy check (#7024)

* Add documentation to FF.com for SharedMap (#6867)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
Co-authored-by: Skyler Jokiel <skjokiel@microsoft.com>

* Integrate new binary snapshot format in odsp driver (#6962)

* policy (#7029)

* Markdown-lint fixes (#7026)

* version (#7031)

* Add SHA256 and base64 encoding to common-utils hashFile fn (#7007)

P1 #6999

odsp-driver is using sha.js for some hashing stuff which is pulling in a bunch of extra packages. This didn't show when I was doing the buffer removal before because we were just looking at base-host (which doesn't use odsp-driver). We can move odsp-driver to use common-utils hash instead by adding support for SHA256 algorithm and base64 output format. This removes the sha.js and buffer packages from the odsp-driver bundle, which is roughly 37KB parsed size.

* fix returned objects in docs (#7027)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Revert "Add SHA256 and base64 encoding to common-utils hashFile fn (#7007)" (#7032)

This reverts commit 9f8c9dd21c44785e6003621644ed15d8300bf87d.

* [bump] package version to 0.46.0 for development after client release (#7033)

@fluidframework/build-common:     0.23.0 (unchanged)
     @fluidframework/eslint-config-fluid:     0.24.0 (unchanged)
      @fluidframework/common-definitions:     0.21.0 (unchanged)
            @fluidframework/common-utils:     0.32.0 -> 0.33.0
   @fluidframework/container-definitions:     0.40.0 (unchanged)
         @fluidframework/core-interfaces:     0.40.0 (unchanged)
      @fluidframework/driver-definitions:     0.40.0 (unchanged)
    @fluidframework/protocol-definitions:   0.1025.0 (unchanged)
                                  Server:   0.1028.0 -> 0.1029.0
                                  Client:     0.45.0 -> 0.46.0
                  @fluid-tools/benchmark:     0.40.0 (unchanged)
                         generator-fluid:      0.3.0 (unchanged)
                             tinylicious:      0.4.0 (unchanged)
                             dice-roller:      0.0.1 (unchanged)

* Improved logging of ODSP error responses (#6958)

Add types modeling the shape of the response from ODSP, based on looking at telemetry, and only log the response if it matches.

* Append dev-defined user properties to the JWT token (#6982)

* custom user properties changes

* generic typing and rendering additional details

* custom interface in view

* remove urls

* iteration

* replace with memberStrings

* frs-client.api.md

* Ensure `flush` cannot be called from `orderSequentially`'s callback in `ContainerRuntime` (#6991)

Allowing flush would silently break orderSequentially's guarantees.

Part of #4048

* Remove audience error logging in container (#7014)

Fixes #6910
We're frequently hitting a race condition on initial connection (such as the transition from read to write client) where one client disconnects very close to when another client connects, and the disconnect audience signal is sent to the connecting client that never knew about the disconnecting client. This obfuscates what we really want to check (mismatched audience joins/leaves, e.g. legitimately lost signals), so just remove it because it's nbd(TM) (jk, see the attached bug for more info)

* Restore #7007 to add SHA256 hash and fix tests (#7041)

This change restores #7007 which was reverted due to broken coverage tests and not to block the release. It also fixes the tests, which were broken because of an incompatibility between nyc and jest code coverage. I opened #7039 to track fixing this nyc/jest issue more broadly across the code base.

* Update tenant manager to include documentId (#7043)

* @fluid-experimental/property-changeset to TS [2/3]  (#6893)

* Rename files to use camelCase partial porting to TS
* Fix linting
* Fix imports + auto fixable issues
* Fix build + tests
* Fix build
* Fix policy check + file export
* Fix exports/imports
* Try to fix build
* Add missing scripts
* Fix policy check again
* Fix package scripts
* require() must match filename casing
* Remove lib for now
* Address spaces issue
* Address comments on lodash imports
* Remove added async declarations
* Fix policy check

Co-authored-by: Daniel Lehenbauer <DLehenbauer@users.noreply.github.com>

* r11s: Explicitly allow blank document ids on create doc only (#7038)

* Remove extraneous devDependency from frs-client (#7051)

A dependency is duplicated in devDependencies and isn't getting bumped by the bump tool

* Bump common-utils prerelease version dep to release (#7055)

* remove prerelease version

* missing lock files

* Use path-browserify instead of node path in server-services-telemetry (#7057)

We should not be using node libraries in code that is not explicitly node-only. Webpack no longer implicitly does a polyfill fallback, so our downstream consumers are stuck handling this when we do. path-browserify is used elsewhere (dds/map) over node path, so use that there as well.
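
The swap is mechanical, since path-browserify mirrors node's posix "path" API; a sketch (the example path is made up):

```typescript
// Before: import * as path from "path"; (pulls in a node-only module)
import * as path from "path-browserify";

// Call sites stay the same.
const telemetryPath = path.join("telemetry", "properties.json");
```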

* Use common-utils instead of shajs in odsp-driver (#7010)

Fixes #6999, next part of PR #7007

Use the updated hash functions from common-utils (with new SHA256/base64 support) instead of sha.js, which removes the sha.js and downstream dependencies and cuts ~37KB from the odsp-driver package. Requires making getHashedDocumentId async.
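
A sketch of the new call shape, assuming the hashFile signature added in #7007 (algorithm and encoding parameters); the document-id derivation here is simplified for illustration:

```typescript
import { IsoBuffer, hashFile } from "@fluidframework/common-utils";

async function getHashedDocumentId(driveId: string, itemId: string): Promise<string> {
    const buffer = IsoBuffer.from(`${driveId}_${itemId}`);
    // SHA-256 + base64 replaces the old sha.js-based hashing; note the API is async.
    return hashFile(buffer, "SHA-256", "base64");
}
```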

* Update server dependencies in Historian (#7056)

* r11s-driver: Use documentId from server as source of truth (#7037)

* changed comment for disableIsolatedChannels (#7058)

* Bump dependencies version (#7068)

Server -> ^0.1028.1

* Add preliminary doc for testing/automation (#6854)

Add a doc for getting started with writing automation against tinylicious or frs, and also re-order the docs in the testing group to better reflect how they should be read when not jumping around

* move cra-demo to FluidExamples, remove from monorepo (#7059)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Remove Legacy Debug Logging (#7082)

Before we had logging infra we used the debug library. This was never removed as we moved to logging infra. These old logs are generally not useful, and are only available on the client. This change removes the dependency, and many of the spurious log statements, which will save some bits over the wire as well.

fixes #6253

* Small telemetry adjustments based on analyzing stress tests. (#7087)

Changes are mostly around uniformly representing data, i.e. using the same property names and event names that better reflect what they track.

* Bump path-parse from 1.0.6 to 1.0.7 in /server/routerlicious (#7079)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump jszip from 3.6.0 to 3.7.1 in /server/routerlicious (#7078)

Bumps [jszip](https://github.com/Stuk/jszip) from 3.6.0 to 3.7.1.
- [Release notes](https://github.com/Stuk/jszip/releases)
- [Changelog](https://github.com/Stuk/jszip/blob/master/CHANGES.md)
- [Commits](https://github.com/Stuk/jszip/compare/v3.6.0...v3.7.1)

---
updated-dependencies:
- dependency-name: jszip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump path-parse from 1.0.6 to 1.0.7 in /server/gateway (#7077)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump jszip from 3.6.0 to 3.7.1 in /server/historian (#7076)

Bumps [jszip](https://github.com/Stuk/jszip) from 3.6.0 to 3.7.1.
- [Release notes](https://github.com/Stuk/jszip/releases)
- [Changelog](https://github.com/Stuk/jszip/blob/master/CHANGES.md)
- [Commits](https://github.com/Stuk/jszip/compare/v3.6.0...v3.7.1)

---
updated-dependencies:
- dependency-name: jszip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump path-parse from 1.0.6 to 1.0.7 in /server/historian (#7075)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump url-parse from 1.5.1 to 1.5.3 in /server/gitrest (#7074)

Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.3.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.3)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump path-parse from 1.0.6 to 1.0.7 in /server/gitrest (#7073)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Rework non-immediate noop sending logic (send them more often) (#7085)

Please see issues #5629 & #6364 for context, as well as the big comment in the code on the new characteristics.

* Disable running it in stress test for now (#7093)

* ODSP driver: Tell PUSH if client supports get_ops flow (#7025)

* FRS documentation with TokenProvider and Azure function (#7052)

* custom user properties changes

* generic typing and rendering additional details

* custom interface in view

* remove urls

* iteration

* replace with memberStrings

* frs-client.api.md

* token provider docs draft

* link

* optional

* return token

* changes

* update sample FrsMember

* space

* Replace CreateContainerError usages with either normalizeError or new GenericError (#6940)

CreateContainerError as-is is basically equivalent to normalizeError*. So wherever CreateContainerError was given a thrown object, just use normalizeError.

We also called CreateContainerError to raise new error cases; those should directly create GenericError, with the string previously used as the error code serving as the message as well.

I also reimplemented CreateProcessingError with normalizeError in here, but that could be split into a different PR.

* Here are the differences between what CreateContainerError returned and what normalizeError now returns where it's used instead:

The returned error is no longer an instance of LoggingError, so arbitrary props set on it won't be logged; you must use addTelemetryProperties.
"Partially" valid errors - e.g. with an errorType but no telemetry prop functions, or vice versa - will not have those properties brought over. These are hypothetical cases that don't happen in practice, so we decided to stop supporting them for simplicity.
Tangentially related to @vladsud 's PR #6936
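
A sketch of the replacement pattern (import paths and constructor shapes are approximations of the utils packages, not verified signatures):

```typescript
import { GenericError } from "@fluidframework/container-utils";
import { normalizeError } from "@fluidframework/telemetry-utils";

function handleThrown(thrown: unknown) {
    // Was: CreateContainerError(thrown) - for caught/unknown errors, just normalize.
    return normalizeError(thrown);
}

function raiseNewCase(): never {
    // Was: CreateContainerError("someErrorCode") - new cases create GenericError
    // directly, with the old error-code string serving as the message.
    throw new GenericError("someErrorCode");
}
```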

* Bump socket.io-client dep from 2.1.1 to 2.4.0 to resolve security issue in xmlhttprequest-ssl (#7099)

There is a security issue in xmlhttprequest-ssl 1.5.5, which we are getting from our socket.io-client version. It is resolved in 1.6.1, which we can get by bumping our dep for socket.io-client to 2.4.0. We already resolve socket.io-client to 2.4.0, so this should functionally be a no-op for us.

* Revert 1c141b - Early signal processing, PR #6935 (Main) (#7098)

Per feedback, that change exposes issues in Audience consumers, as signals are dropped.
We will investigate more thoroughly whether this is a race condition that was always present but is more exposed by this change, or whether this code fundamentally needs to stay as-is.

* Enable Alfred and Scribe to upload using one single call to storage (#7088)

* Uploading the initial summary in one single call

* Added support for scribe

* Add server-side doc id generation to Tinylicious (#7104)

* Put arraybuffer contents in blobs while rehydrating container (#7030)

* Session metrics (#7047)

Introducing session and startsession metrics

* Fix contents of blobs when binary contents are specified for r11s driver (#7103)

* [0.45] Add assert tags (#7107) (#7109)

* add tags

* fix line lengths

* ODSP driver: flush_ops() implementation for single-commit summary  (#7086)

Please see #6685 for more details on the API.
The flush workflow is only enabled if the full summary tree (including the .protocol tree) is uploaded,
and only if the flush_ops feature is supported by PUSH (i.e. PUSH has a kill-switch).
The client attempts to ensure that the required ops are flushed from PUSH's redis to SPO before the summary is uploaded to SPO.
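
A heavily hedged sketch of the client-side ordering only; the flush_ops name comes from the PR title, and everything else (method names, shapes) is illustrative:

```typescript
interface IPushConnection {
    supportsFlushOps: boolean; // PUSH advertises the feature (kill-switch aware)
    flushOps(untilSequenceNumber: number): Promise<void>;
}

async function uploadSingleCommitSummary(
    connection: IPushConnection,
    referenceSequenceNumber: number,
    uploadFullTreeToSpo: () => Promise<string>, // includes the .protocol tree
): Promise<string> {
    if (connection.supportsFlushOps) {
        // Ensure ops up to the summary's reference sequence number have moved
        // from PUSH's redis to SPO before the summary lands.
        await connection.flushOps(referenceSequenceNumber);
    }
    return uploadFullTreeToSpo();
}
```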

* We may miss "disconnected" event on socket (#6986)

When a connection is established, the connection object is returned back to DeltaManager. During this transition time, nobody listens on the "disconnected" event, making it possible to miss it.

Add proper handling for such cases, as well as validation that if the connection object is not disposed, the socket itself should be connected.

We had somewhat similar behavior earlier for the "error" handler, but it was not fully correct. Extending it to the "disconnect" event and fixing issues.
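
A sketch of the hardening (socket shape is illustrative, socket.io-like): attach the listener before the handoff window and treat an already-dropped socket as an immediate disconnect:

```typescript
interface ISocketLike {
    connected: boolean;
    on(event: "disconnect", listener: () => void): void;
}

function adoptConnection(socket: ISocketLike, onDisconnect: () => void): void {
    socket.on("disconnect", onDisconnect);
    // If the socket dropped during the handoff, the event already fired with
    // nobody listening - validate the state and synthesize the disconnect.
    if (!socket.connected) {
        onDisconnect();
    }
}
```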

* Bugfix and added whole summary option as config. (#7114)

* Bump color-string from 1.5.4 to 1.6.0 (#7113)

Bumps [color-string](https://github.com/Qix-/color-string) from 1.5.4 to 1.6.0.
- [Release notes](https://github.com/Qix-/color-string/releases)
- [Changelog](https://github.com/Qix-/color-string/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Qix-/color-string/compare/1.5.4...1.6.0)

---
updated-dependencies:
- dependency-name: color-string
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump handlebars from 4.7.6 to 4.7.7 (#7112)

Bumps [handlebars](https://github.com/wycats/handlebars.js) from 4.7.6 to 4.7.7.
- [Release notes](https://github.com/wycats/handlebars.js/releases)
- [Changelog](https://github.com/handlebars-lang/handlebars.js/blob/master/release-notes.md)
- [Commits](https://github.com/wycats/handlebars.js/compare/v4.7.6...v4.7.7)

---
updated-dependencies:
- dependency-name: handlebars
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump tar from 4.4.13 to 4.4.17 (#7111)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.17.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.17)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump jszip from 3.6.0 to 3.7.1 (#7091)

Bumps [jszip](https://github.com/Stuk/jszip) from 3.6.0 to 3.7.1.
- [Release notes](https://github.com/Stuk/jszip/releases)
- [Changelog](https://github.com/Stuk/jszip/blob/master/CHANGES.md)
- [Commits](https://github.com/Stuk/jszip/compare/v3.6.0...v3.7.1)

---
updated-dependencies:
- dependency-name: jszip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* remove document id from IFluidDataStoreContext and IFluidDataStoreRuntime (#7064)

* Fix tinylicious (#7115)

* Fix tinylicious

* sort dependencies

* fixed tinylicious server readme extra word (#7117)

* Strip .app in buildHeirarchy from protocol-base (#7116)

* Added test for validating audience correctness (#7097)

* upgrade server deps in client packages (#7118)

* Removed gcData from summarize (#7090)

* Disable TLS 1.0 and 1.1 in Nginx Ingress Controller for security reasons (#7105)

* Disable TLS 1.0 and 1.1 in Nginx Ingress Controller for security reasons

* Add description to README

* Also add comments in tmpl file

Co-authored-by: yhou46 <yunpengdevelop@gmail.com>

* get_ops flow is broken due to raising events on wrong object. (#7123)

The bug became obvious due to the latest PUSH rollout to SPDF, which removes initial ops on connection, making this part of the code critical in the loading flow.
Unfortunately we did not see this issue despite a bunch of testing, and it was missed on the PUSH rollout as PUSH is using 0.44.x bits.

* Introduce errorInstanceId for errors telemetry (#7045)

This is a scoped version of #6968.

Add errorInstanceId to IFluidErrorBase (which none of our error classes implement yet).
In wrapError, add the inner error's errorInstanceId to the wrapping error if present.
Add wrapErrorAndLog, which wraps and then logs the inner error.
No longer copy telemetry props from the inner error in wrapError, since we can now tie the inner/outer errors together in telemetry via errorInstanceId (see the sketch below).

Co-authored-by: Tony Murphy <anthonm@microsoft.com>
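
A sketch of the propagation rule (type and helper shapes approximate the description above, not the exact telemetry-utils API):

```typescript
import { v4 as uuid } from "uuid";

interface IHasInstanceId { errorInstanceId?: string; }

function wrapError<T extends Error & IHasInstanceId>(
    inner: unknown,
    newErrorFn: (message: string) => T,
): T {
    const message = inner instanceof Error ? inner.message : String(inner);
    const wrapped = newErrorFn(message);
    // Share one instance id so inner and outer errors correlate in telemetry,
    // instead of copying all telemetry props across.
    wrapped.errorInstanceId = (inner as IHasInstanceId)?.errorInstanceId ?? uuid();
    return wrapped;
}
```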

* Start running e2e test targeting tinylicious in realsvc pipeline (#7122)

* Move tinylicious-client from experimental/framework -> packages/framework (#7101)

Move tinylicious-client from experimental/framework -> packages/framework

* Change references to "frs" -> "azure" in service-specific client packages (#7084)


Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* add exponential backoff retry mechanism for writing ops to mongodb

* fix blob test failing on r11s (#7126)

This code was left out when replacing the error string with the shortcodes.

The behavior being tested is not fully implemented, and is only partially implemented on ODSP. The test checks that attach() throws an error when attachment blobs are present, or that an error is thrown by the r11s or local document service factory beforehand. The error message is the same in r11s and local, but was replaced by different shortcodes.

* Rename uber package to drop "@fluid-experimental" scope (#7108)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Move fluid-static from experimental/framework -> packages/framework (#7133)

Move fluid-static from experimental/framework -> packages/framework

* update packages to reflect latest layers (#7140)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Add a hook for an event that will fire on changes to Interval properties (#7019)

* Add a hook for an event that will fire on changes to Interval properties

* Update API signature

* Remove redundant init of range labels

* Respond to review feedback

* Fixed not waiting for clients to get to connected state in Audience tests (#7142)

* use "container" type payload when uploading summary in attach() with blobs (#7110)

Use "container" type summary payload instead of "channel" type when uploading initial summary in attach() when attachment blobs are present.

* Adjust latency telemetry based on recent regression in scalability runs experience (#7143)

One of the recent client changes overwhelmed PUSH, but it was not obvious from client logs that a regression had occurred.
Make sure such issues are more visible through the use of error events.
We will likely need to adjust the threshold in the future; https://onedrive.visualstudio.com/SPIN/_queries/edit/1083644/?triage=true is tracking work on the PUSH side to better understand latencies and what to do next.

Also correctly format duration for another event by using performance event API (it converts floats to ints, making it easier to consume in telemetry and reducing size a bit).

* Follow ups to fluid:telemetry:OdspDriver:GetDeltas_cancel events #7040 (#7144)

Please see issue #7040 for more details - this event duplicates another event with exactly the same name, which makes data analysis very confusing.
Closes #7040

* remove value from DB close log (#7150)

* Bump path-parse from 1.0.6 to 1.0.7 (#7081)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Rename FrsAzTokenProvider -> AzureFunctionTokenProvider on FF website

* Adjust nop frequency up from 250ms to 2s based on feedback from ODSP TAP 40 scalability run. (#7127)

With the current (prior to fix) numbers, we were spamming PUSH with too many ops, causing it to not handle the workload.
The 2s adjustment will make (using the TAP 40 payload) an 8x difference in outbound traffic, and should have zero impact on final file size (due to the 5 ops/sec of real traffic, which results in none of the noops being sequenced).
That said, it will regress certain workloads (search quality, size of snapshots), so we will need to re-evaluate impact and next steps.

Long term, we should strongly consider the "alternative solution" proposed in #5629, as well as making our system less susceptible to collab window size (i.e. how summaries work for Sequence and how search works for Sequence).
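
A sketch of heartbeat-noop scheduling at the adjusted cadence (names are illustrative, not the DeltaManager internals):

```typescript
class NoopHeartbeat {
    private timer: ReturnType<typeof setTimeout> | undefined;

    constructor(
        private readonly submitNoop: () => void,
        private readonly frequencyMs: number = 2000, // was 250ms prior to this change
    ) {}

    // Called when remote ops were processed and the server may need a noop
    // from us to advance the minimum sequence number.
    public schedule(): void {
        if (this.timer === undefined) {
            this.timer = setTimeout(() => {
                this.timer = undefined;
                this.submitNoop();
            }, this.frequencyMs);
        }
    }

    public cancel(): void {
        if (this.timer !== undefined) {
            clearTimeout(this.timer);
            this.timer = undefined;
        }
    }
}
```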

* Enable using multiple account for odsp e2e tests (#7148)

ODSP throttles and causes tests to fail if there are too many concurrent real-service e2e jobs running in the pipeline.
To avoid this, use multiple accounts to spread the load.
- Add a new specification for multiple accounts that can be used in login__odsp__test__tenants
- The ODSP test driver will generate a set of accounts and pick one randomly when the test driver is created.
  - For e2e, that means every test file.
  - If there are sufficient accounts, the workload should be spread enough to avoid throttling.

To support multiple users in a single session:
- Change the token cache to cache based on the full user id when the username auth method is used.
- The ODSP driver multiplexes the socket, but it is associated with a single user. Add options to the ODSP driver to specify whether we want to isolate the socket cache by factory.

Also updated the pipeline to use this format.

* reuse retry logic

* Enable multiuser account for stress test (#7161)

- Increase the time to wait for the token cache file lock (as more accounts need to be authenticated and cached).
- Switch the pipeline to use the tenant accounts format

* Enforce single-use tokens in R11s createDocument API (#7141)

* Bump path-parse from 1.0.6 to 1.0.7 in /docs (#7072)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Allow changing local lambda controller setup type (#7165)

* Update CI pipeline to support non-scoped packages (#7159)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Move summary nack messages logic to deli (#7166)

* set perMessageDeflate (#7167)

* Add RestLessServer and RestLessClient (#7168)

* Fixing parameters when creating Routerlicious driver's DocumentStorageService (#7174)

* Add option to always record telemetry during token fetch (#7172)

* WebPack: Remove '--package' arg (#7182)

* Update Tinylicious to use socket.io v4 (#7176)

Updates Tinylicious to use socket.io 4.1.2, same as Routerlicious. Removes @types/socket.io (no longer required; more recent versions of socket.io include typings).

* Fix published path for server docker builds, optional NOTICE, and test docker image tag (#7181)

- PR #7159 subdivided the published packages into scoped and non-scoped, published in separate steps. The server packages are all scoped and need to be put into the right path for publishing.
- NOTICE generation fails randomly; make it not required if we don't publish the docker image.
- If we are doing a manual test build, don't push it to the same container repository as the main official CI build (use `test/` instead of `build/`).

* Adding TLS support for rdkafka and other Kafka changes (#7171)

* Making sure node-rdkafka is compiled with SSL by adding SSL deps in our Dockerfile and avoiding volume mapping for it in docker-compose.dev.yaml. Also starting the experiment with some variables

* Adding support for SSL configs in the RdKafka consumer and producer classes

* Also enabling TLS for the Kafka Admin Client

* Adding Readme and updating config name

* Removing debug logs and updating Readme

* Throwing an error if rdkafka tries to set up SSL but SSL is not enabled. Also adding more comments

* move logic to services-core, add logger

* artificially throw error

Co-authored-by: Jatin Garg <48029724+jatgarg@users.noreply.github.com>
Co-authored-by: Zach Newton <znewton@microsoft.com>
Co-authored-by: Arin Taylor <artaylor@microsoft.com>
Co-authored-by: chensixx <34214774+chensixx@users.noreply.github.com>
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
Co-authored-by: Kabir Brar <kabir@brar.xyz>
Co-authored-by: Tyler Butler <tylerbu@microsoft.com>
Co-authored-by: Rick Kirkham <Rick-Kirkham@users.noreply.github.com>
Co-authored-by: Nedal Horany <nedalhy@gmail.com>
Co-authored-by: Marcus Karlbowski <43415869+karlbom@users.noreply.github.com>
Co-authored-by: Henrique Da Silveira <41453887+hedasilv@users.noreply.github.com>
Co-authored-by: Helio Liu <59622401+heliocliu@users.noreply.github.com>
Co-authored-by: Paul Leathers <pleath@users.noreply.github.com>
Co-authored-by: Gary Wilber <41303831+GaryWilber@users.noreply.github.com>
Co-authored-by: Vlad Sudzilouski <vlad@sudzilouski.com>
Co-authored-by: Wes Carlson <49205066+wes-carlson@users.noreply.github.com>
Co-authored-by: Pragya Garg <praggarg@microsoft.com>
Co-authored-by: Andrei Iacob <84357545+andre4i@users.noreply.github.com>
Co-authored-by: Navin Agarwal <45832642+agarwal-navin@users.noreply.github.com>
Co-authored-by: Pradeep Vairamani <pradeeprv123@gmail.com>
Co-authored-by: Elchin Valiyev <elchin.valiyev@autodesk.com>
Co-authored-by: Tony Murphy <anthony.murphy@microsoft.com>
Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Donovan Lange <dolange@microsoft.com>
Co-authored-by: Tim Wang <82841707+timtwang@users.noreply.github.com>
Co-authored-by: Curtis Man <curtism@microsoft.com>
Co-authored-by: sumedhb1995 <sumedhb1995@gmail.com>
Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
Co-authored-by: Skyler Jokiel <skjokiel@microsoft.com>
Co-authored-by: sdeshpande3 <46719950+sdeshpande3@users.noreply.github.com>
Co-authored-by: Tanvir Aumi <mdaumi@microsoft.com>
Co-authored-by: Daniel Lehenbauer <DLehenbauer@users.noreply.github.com>
Co-authored-by: yunho-microsoft <75456899+yunho-microsoft@users.noreply.github.com>
Co-authored-by: yhou46 <yunpengdevelop@gmail.com>
Co-authored-by: Tony Murphy <anthonm@microsoft.com>