Pending ops telemetry #6838

Merged
merged 17 commits into from
Jul 29, 2021

Conversation

andre4i
Contributor

@andre4i andre4i commented Jul 21, 2021

See #4616

Added telemetry for pending ops in FluidDataStoreContext and RemoteChannelContext. The threshold is 1000 in both cases, as per the initial issue; please let me know if I need to adjust that. As events across classes have the same name, each class will have a child logger to better differentiate the events.

When realizing, it will send the counter if it's above the threshold. When adding ops, it will only send when it surpasses a multiple of the threshold, to minimize noise.
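
The two sending rules described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class name `PendingOpsCounter`, the `CounterEvent` shape, and the `sendEvent` callback are hypothetical, and treating "above the threshold" as "at or above" is an assumption.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not the PR's actual API.
interface CounterEvent {
    eventName: string;
    value: number;
}

type SendEvent = (event: CounterEvent) => void;

class PendingOpsCounter {
    private readonly threshold: number;
    private readonly sendEvent: SendEvent;

    constructor(threshold: number, sendEvent: SendEvent) {
        this.threshold = threshold;
        this.sendEvent = sendEvent;
    }

    // Used when realizing: report the count if it is at or above the threshold
    // (the ">=" boundary is an assumption).
    public send(eventName: string, value: number): void {
        if (value >= this.threshold) {
            this.sendEvent({ eventName, value });
        }
    }

    // Used while ops accumulate: report only on exact multiples of the
    // threshold, so a growing queue produces one event per 1000 ops.
    public sendIfMultiple(eventName: string, value: number): void {
        if (value > 0 && value % this.threshold === 0) {
            this.sendEvent({ eventName, value });
        }
    }
}
```

With a threshold of 1000, adding ops one at a time up to 2500 would produce events only at 1000 and 2000 under this sketch.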

@github-actions github-actions bot requested review from vladsud and arinwt July 21, 2021 21:26
@github-actions github-actions bot added the area: runtime Runtime related issues label Jul 21, 2021
@msfluid-bot
Collaborator

msfluid-bot commented Jul 21, 2021

@fluidframework/base-host: +552 Bytes

| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| main.js | 178.51 KB | 179.05 KB | +552 Bytes |
| Total Size | 178.51 KB | 179.05 KB | +552 Bytes |

@fluid-example/bundle-size-tests: +727 Bytes

| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| container.js | 159.56 KB | 160.27 KB | +727 Bytes |
| map.js | 48.96 KB | 48.96 KB | No change |
| matrix.js | 147.64 KB | 147.64 KB | No change |
| odspDriver.js | 200.21 KB | 200.21 KB | No change |
| odspPrefetchSnapshot.js | 29.5 KB | 29.5 KB | No change |
| sharedString.js | 168.1 KB | 168.1 KB | No change |
| Total Size | 753.97 KB | 754.68 KB | +727 Bytes |

Baseline commit: 9a8041e

Generated by 🚫 dangerJS against 08ef584

@github-actions github-actions bot requested a review from vladsud July 26, 2021 20:56
@andre4i andre4i merged commit bd8c094 into microsoft:main Jul 29, 2021
appym31 added a commit that referenced this pull request Aug 20, 2021
* Add assert short codes before release (#6852)

* Upgrade socket.io in R11s from v2 to v4 (#6836)

* Add ability to disable summarizer heuristics (#6841)

* Add ability to disable summarizer heuristics

* Split summarize heuristic data from runner

* end-to-end test for caching createNewSummary (#6835)

* Remove readAndParseFromBlobs api as no longer needed (#6816)

* Improve SummaryManager encapsulation (#6840)

Reduce calls on IContainerContext, reduce duplication of connection state management.

* throw error in protocol handler upon exception (#6846)

* [bump] package version to 0.45.0 for development after client release (#6858)

@fluidframework/build-common:     0.23.0 (unchanged)
     @fluidframework/eslint-config-fluid:     0.24.0 (unchanged)
      @fluidframework/common-definitions:     0.21.0 (unchanged)
            @fluidframework/common-utils:     0.32.0 (unchanged)
   @fluidframework/container-definitions:     0.40.0 (unchanged)
         @fluidframework/core-interfaces:     0.40.0 (unchanged)
      @fluidframework/driver-definitions:     0.40.0 (unchanged)
    @fluidframework/protocol-definitions:   0.1025.0 (unchanged)
                                  Server:   0.1028.0 (unchanged)
                                  Client:     0.44.0 -> 0.45.0
                  @fluid-tools/benchmark:     0.40.0 (unchanged)
                         generator-fluid:      0.3.0 (unchanged)
                             tinylicious:      0.4.0 (unchanged)
                             dice-roller:      0.0.1 (unchanged)

* Bump dependencies on container-definitions and core-interfaces to 39.7 everywhere (#6857)

* Bump dependencies version

   @fluidframework/container-definitions -> ^0.39.7

* Bump dependencies version

         @fluidframework/core-interfaces -> ^0.39.7

* update driver-definitions dep.

* lint

* revert driver-definitions bump

* revert changes in common/

* revert fix

* fix lerna and other package.jsons

* Add docs on auth (#6859)

* Docs system: Introduce new mechanisms to update and manage docs (#6819)

* Lint for banned words in docs (#6866)

* New react tutorial (#6842)

* @fluid-experimental/property-changeset TS conversion - Fix linting [1/3] (#6808)

* add fixes to ambient type definitions in changeset and properties packages (#6741)

* Accessibility fixes for website (#6786)

* Handling document lambda factory errors in document partition (#6865)

* Website reorg and cleanup (#6875)

- Remove advanced and concepts section. Everything is in /deep now.
- Add support for "outdated" and "discussion" article statuses.
- Add "placeholder" shortcode to mark unwritten sections in docs.
- Mark tutorial outdated.

* Do automatic hash algorithm fallback in common-utils (#6877)

P1 of fixing #6757

History here is we had a function to override the hashing function to enable scenarios where people wanted to do stuff locally involving insecure contexts (because crypto.subtle isn't available there). We decided against doing automatic fallback because it added the sha.js library to the webpack bundle for analyses and muddied those numbers (production impact is non-existent because it's a dynamic import).

Turns out irl usage quickly goes beyond "call this function here" and it's not at all clear what to do, so change this to automatic fallback. The time saved from not having to talk to people about this will far outweigh the risk of someone accidentally introducing a production dependency on the sha.js library and having to revert it.

(also cleanup some other build warning because it's annoying)

* Add removed client to IAudience's removeMember event (#6825)

One of the problems with using IAudience is that only getting the clientId in the removeMember event can make it cumbersome to make incremental changes to data on audience members, because once the event fires the associated IClient is gone and you're stuck either keying your data on the clientId (where use cases tend to favor keying on a user) or iterating to match the clientId. Add the IClient to the args for the event to make this easier. Also handle the case where an audience join signal gets lost by only emitting an event when a member is actually removed, and log when we try to remove a non-existent member. There should be very little functional change for consumers by doing this because in these situations the consumer couldn't find the missing member in the audience anyway.
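
A rough sketch of the behavior described above, under simplifying assumptions: `SimpleAudience`, `ClientInfo`, `onRemoveMember`, and `unknownRemovals` are hypothetical stand-ins (the real IAudience and IClient have more members, and the real implementation logs through telemetry rather than an array).

```typescript
// Hypothetical, simplified shapes; not the real IAudience / IClient.
interface ClientInfo {
    userId: string;
}

type RemoveListener = (clientId: string, client: ClientInfo) => void;

class SimpleAudience {
    private readonly members = new Map<string, ClientInfo>();
    private readonly removeListeners: RemoveListener[] = [];
    // Stand-in for a telemetry log of removals that matched no member.
    public readonly unknownRemovals: string[] = [];

    public onRemoveMember(listener: RemoveListener): void {
        this.removeListeners.push(listener);
    }

    public addMember(clientId: string, client: ClientInfo): void {
        this.members.set(clientId, client);
    }

    // Returns true only if a member was actually removed.
    public removeMember(clientId: string): boolean {
        const removed = this.members.get(clientId);
        if (removed === undefined) {
            // A lost join signal can lead here: record it and emit nothing.
            this.unknownRemovals.push(clientId);
            return false;
        }
        this.members.delete(clientId);
        // Listeners receive the removed client, so they can key on the user.
        for (const listener of this.removeListeners) {
            listener(clientId, removed);
        }
        return true;
    }
}
```

Because the listener receives the removed client, a consumer that tracks per-user data no longer needs to keep its own clientId-to-user map.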

* Create an API to change the start/end of an interval (#6800)

Allow changes to start and end independently of each other. A value of "undefined" passed as start or end indicates no change to that endpoint.
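
As a minimal illustration of the "undefined means no change" convention, assuming a hypothetical `SimpleInterval` shape and `changeInterval` helper (the real interval collection API carries more state and a different signature):

```typescript
// Hypothetical minimal interval; real sequence intervals carry more state.
interface SimpleInterval {
    start: number;
    end: number;
}

// `undefined` for start or end means "leave that endpoint unchanged".
function changeInterval(
    interval: SimpleInterval,
    start: number | undefined,
    end: number | undefined,
): SimpleInterval {
    return {
        // `??` only falls back on undefined/null, so a new position of 0 is honored.
        start: start ?? interval.start,
        end: end ?? interval.end,
    };
}
```

For example, `changeInterval({ start: 2, end: 5 }, undefined, 9)` moves only the end and yields `{ start: 2, end: 9 }`.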

* update routerlicious deps on common-utils (#6884)

Part ??? of fixing #6757

Going through bumping all the layers to use prerelease versions of common-utils following a change there. This change bumps usages in server/routerlicious.

Next steps:

Create prerelease for server 0.1028
Update client and tinylicious to use common-utils 0.32.0-0 and server 0.1028.0-0. Tinylicious must be updated here because the common-utils change is hashing changes to facilitate local testing in insecure environments, and that entire dependency chain needs to be updated.

* Remove IOdspSnapshot, IOdspSnapshotCommit, IOdspSnapshotBlob from odsp driver. (#6765)

* Remove blob contents from snapshot in rehydrate container, loader changes (#6822)

* Replace checkNotTimeout with raceTimer helper (#6809)

Replace checkNotTimeout with raceTimer helper

* Remove fluid-object-interfaces example package (#6881)

* Track serverMetadata per client in deli (#6882)

* Stricter typing for producers (#6883)

* Update base64-js version in common-utils (#6892)

As part of bumping the common-utils version. Client build is complaining there are multiple versions of base64-js, so bump the version in common-utils to latest to match the duplicate.

* Clarify audience docs (#6895)

* Container.close() to be more robust (#6894)

Closes #6690
Keeping #4244 open, as ideally we make the callback non-optional and remove the old behavior of raising an "error" event.

* bump container-definitions dependencies to 0.39.8 (#6874)

* Add afterSequenceNumber option to on-demand summarize (#6860)

Add enqueueSummarize

* Client may get stuck fetching ops from server not realizing it has all the ops (#6898)

Address https://onedrive.visualstudio.com/SPIN/_workitems/edit/1166727

* Add snapshot conversion logic for new binary odsp snapshot format (#6560)

* Add ADO pipeline for building the docs site (#6899)

We currently build TSDocs in CI, but not the complete docs site. This PR will enable building the whole site for validation as part of PR checks. As a welcome side effect, this will get us linting coverage on our docs markdown.

* React tutorial: Fix bug in object destructuring (#6906)

* Fix typos in dds interceptions documentation (#6905)

* Pending ops telemetry (#6838)

* Add threshold telemetry sender

* Add (c) header

* Fix export

* Fix tests

* Fix tests

* Rename

* Fix event name property in the tests

* Add some documentation

* strictEqual -> deepStrictEqual

* Fix event names in test

* Simplify event

* Restructure loggers

* Add undefined as data point in the tests

* rename parameter

* Rename const

* Some PR feedback

* SharedMatrix: Fixed not serializing handles in toString (#6904)

* Deli metrics (#6868)

* Refactor the enums

* Adding a config to enable new telemetry framework

* Adding an enum

* Noop sooner if lumber is disabled

* Use the lambda serviceConfiguration

* Remove always-null parentBranch reference (#6912)

* Update client/tinylicious to newest common-utils/server prerelease versions (#6890)

Fixes #6757

Update client and tinylicious packages to use latest common-utils/server prerelease version. Tinylicious needs to be updated here as well because the common-utils change targets local testing with tinylicious.

* chore: convert @fluid-experimental/property-properties to TS [ 2 / 3 ] (#6891)

* add initial ts config
* move tests into src folder
* fix source compiling
* make tests run
* fix policy
* rename files to camelCase

* Extend CombinedProducer to support sequential sends (#6923)

* Fluid Debugger: Only Sanitize if anonymize is true (#6922)

fixes #6921

* Introduce normalizeError for getting a valid IFluidErrorBase from an arbitrary caught error (#6764)

This change introduces the `IFluidErrorBase` interface described in #6676, and adds a function `normalizeError`.

`normalizeError` will take any input at all and return a valid `IFluidErrorBase`, whether the original object or a new one if necessary.

It also applies annotations to the resulting error, either telemetry props or an error code to use if there isn't one on the given error.
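
The idea can be sketched as below. Everything here is hypothetical and reduced for illustration: `SketchFluidError` is not the real `IFluidErrorBase`, the `Annotations` shape is assumed, and letting existing props win on key conflicts is an assumption rather than something this description states.

```typescript
// Hypothetical, reduced error shape; not the real IFluidErrorBase.
interface SketchFluidError {
    errorType: string;
    fluidErrorCode: string;
    message: string;
    props: Record<string, unknown>;
}

interface Annotations {
    props?: Record<string, unknown>;
    errorCodeIfNone?: string;
}

function isSketchFluidError(e: unknown): e is SketchFluidError {
    if (typeof e !== "object" || e === null) {
        return false;
    }
    const c = e as Record<string, unknown>;
    return typeof c.errorType === "string"
        && typeof c.fluidErrorCode === "string"
        && typeof c.message === "string"
        && typeof c.props === "object" && c.props !== null;
}

function messageOf(error: unknown): string {
    if (error instanceof Error) {
        return error.message;
    }
    return typeof error === "string" ? error : "Unknown error";
}

// Accepts anything that was thrown and always returns a valid error object:
// the original if it is already valid, otherwise a newly built one.
function normalizeErrorSketch(error: unknown, annotations: Annotations = {}): SketchFluidError {
    const normalized: SketchFluidError = isSketchFluidError(error)
        ? error
        : {
            errorType: "genericError",
            // The fallback code is only used when the input had none.
            fluidErrorCode: annotations.errorCodeIfNone ?? "none",
            message: messageOf(error),
            props: {},
        };
    const extra = annotations.props ?? {};
    for (const key of Object.keys(extra)) {
        // Assumption: props already on the error are kept on conflict.
        if (!(key in normalized.props)) {
            normalized.props[key] = extra[key];
        }
    }
    return normalized;
}
```

A caller in a `catch` block could then pass whatever was thrown, for example `normalizeErrorSketch(caught, { errorCodeIfNone: "opFetchFailed" })`, and rely on the result having a message and an error code.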

* LoggingError support for keys to omit from logging (#6900)

All the own properties on a LoggingError are logged by default, and we need a way to specify that a particular member should not be logged (e.g. AuthorizationError.claims).  Someone can still do bad things by doing `(error as any).foo = somethingPrivate;` but at least mainline cases can be handled this way.  See #6485 for more context.
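
A minimal sketch of the omit-keys idea, assuming hypothetical classes `SketchLoggingError` and `SketchAuthorizationError` and a hypothetical `getTelemetryProperties` signature (the real LoggingError differs in detail):

```typescript
// Hypothetical sketch of the idea; not the real LoggingError implementation.
class SketchLoggingError extends Error {
    private readonly omitFromLogging: string[];

    constructor(message: string, omitFromLogging: string[] = []) {
        super(message);
        // Keep the prototype chain intact when compiling to older targets.
        Object.setPrototypeOf(this, new.target.prototype);
        // The bookkeeping list itself should never be logged either.
        this.omitFromLogging = [...omitFromLogging, "omitFromLogging"];
    }

    // Collect own enumerable properties, skipping any key marked as omitted.
    public getTelemetryProperties(): Record<string, unknown> {
        const props: Record<string, unknown> = { message: this.message };
        const self = this as unknown as Record<string, unknown>;
        for (const key of Object.keys(this)) {
            if (this.omitFromLogging.indexOf(key) === -1) {
                props[key] = self[key];
            }
        }
        return props;
    }
}

class SketchAuthorizationError extends SketchLoggingError {
    public readonly claims: string;
    public readonly tenantId: string;

    constructor(message: string, claims: string, tenantId: string) {
        // "claims" may hold private data, so it is excluded from logging.
        super(message, ["claims"]);
        this.claims = claims;
        this.tenantId = tenantId;
    }
}
```

As the description notes, this only covers the mainline case: a property attached later under a different key would still be logged.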

* Add logic to track speed of receiving join ops and reset connection on timeout (#6931)

Add logic to track speed of receiving join ops on "write" connection and force reconnect if not received in 30 seconds.
This is related to https://onedrive.visualstudio.com/SPIN/_queries/edit/1156694/?triage=true - an issue where clients are not receiving any ops on "write" connection for 1 hour while there are a lot of changes in the document.
Related PR: #6928

* Fixing potential issues around client having connection that does not broadcast ops (#6928)

We observed that some clients in ODSP are not observing any ops on a "write" connection for an hour, until token expired. On reconnect, client finds itself being behind with many thousands of ops.

We do not have a good theory, but I went through the client code looking for cases where we might either not have a proper "op" handler installed, or where the connection might be completely broken. Based on code inspection:

- Due to asynchrony, DeltaManager might get a closed connection as a new connection and not realize that. Currently there is no way to check for that state, so adding IDisposable to the connection to be able to detect it and recover.
- Event registration logic is a bit complicated, and while code inspection did not find a bug, I want to be able to assert that event registration / propagation is done properly, so refactoring this code and adding more asserts.

More details can be found in https://onedrive.visualstudio.com/SPIN/_queries/edit/1156694/?triage=true
Also related PR: https://github.com/microsoft/FluidFramework/pull/6931

* Too many fluid:telemetry:RemoteChannelContext:StorePendingOps events in stress tests (#6933)

* Missing audience member: Get rid of early signals (#6935)

Closes #6910

* Try to read from new format while summarizing in detached container (#6903)

* Disable recovery for when client does not observe its own join op for 30 seconds (#6937)

Newly added telemetry / recovery happens too often in our own stress tests.
Based on telemetry, the event and recovery happen correctly - only when the client is in the "connecting" state for 30 seconds with "write" connection mode and there is no transition to the "connected" state. I.e. indeed we have not observed our own join op.
And based on auxiliary telemetry (or rather lack of it) we are not processing ops, though we need to get better telemetry here to confirm that.

So all in all, the problem is rather easy to hit in stress tests, but I have no good theory. So disabling recovery path to keep code on par with previous behavior, but keep logging and increasing it to 60 seconds to get new data. This should allow ODSP tests to run tonight undisturbed, while I use local tests to get better sense of what's going on.

* Remove existing property - Part 6, IFluidDataStoreRuntime (#6869)

* Make ds runtime back compat

* Runtime.existing

* Fix datastoreruntime

* Fix datastoreruntime

* Add existing to factory method

* more experimenting

* Some more refactoring

* Change lazyloaded data object factory to not use runtime.existing

* Fix datastorehelpers, add existing to testfluid object and scheduler

* Simplify puredataobjectfactory

* back compat data object

* Extract back-compat

* Remove usage of runtime.existing

* Remove usage of runtime.existing take 2

* Some formatting fixes

* Fix scheduler

* Add note in BREAKING.md

* Fix BREAKING.md

* PR feedback - rename instantiateExisting func

* PR feedback - get rid of document.existing

* Some more notes about breaking changes, some PR feedback about simplifying initialize internal

* PR feedback: leave getDataObject alone

* Small correction to BREAKING.md

* QuorumProxy: move to client, adjust event listener warning limit (#6949)

QuorumProxy is a heavily used object (as it's exposed to all data stores / DDSs), and thus we see Node warnings in various stress tests about exceeding the event listener limit (of 10).

Raise this limit to 50 and move implementation to client as it does not belong on the server side.

* Fix deli server metadata issue (#6939)

* Fix ordered client election loggers (#6955)

* r11s-driver: expose minBlobSize config via driver policies (#6618)

* Fetching ops from PUSH (#6954)

Implementing solution described in #6685.

After implementing #6947, the client again hits the "too many retries" issue (a critical failure due to the client not being able to get ops within 30 seconds).
With this PR, the client always asks PUSH for any missing ops in parallel with fetching the same ops from storage and/or local cache.
This reduces the number of cases where we get "too many retries", but does not eliminate them.

I've added minimal telemetry, but most requests can be tracked through storage request telemetry, as every call will be duplicated to PUSH if there is an active connection.

Flow can be optimized further by:

- Not asking PUSH for op ranges that precede the first op on the socket.
- Asking for ops in sequence (not in parallel), in order of local cache / PUSH / storage.

This PR (in its current form) should unblock further investigation and understanding of the "too many retries" problem, but also allow PUSH to be simpler (if needed / desired) by eliminating various workarounds, if we choose to go that route.
Or, if we choose for PUSH to provide stronger guarantees and ensure ops always arrive in order, then a lack of hits for the newly added telemetry will allow us to remove this code with confidence that it's not needed.

* Small optimization - push() always went through processDeltas(), even on paused connection (#6932)

* Move webpack-fluid-loader to @fluid-tools scope (#6956)

* Better telemetry for fetching ops (#6947)

Problem statement:

The newly added NoJoinOp telemetry event points to a condition where ops are not processed for a very long time.
Examining telemetry shows that all such cases have one thing in common - there is an outstanding ops request to the service that takes a long time. In pretty much all cases the actual network request (as indicated by the OpsFetch event) takes a relatively short time, but the overall process (GetDeltas_end) takes a long time, occasionally minutes.

I believe in all these cases ops never get to storage (in reasonable time), but in the majority of cases the client actually receives the missing ops through the websocket (though in all cases, read on). DeltaManager does cancel the request in such a case (see the ExtraStorageCall event), but the request is not immediately cancelled, blocking future requests (see fetchMissingDeltasCore - it allows only one outstanding call). As a result, the whole process does not move forward for a long time.

I do not have an in-depth understanding of where we get stuck in the process, but one such case is obvious: waitForConnectedState() - it's possible that the browser lies to us or does not quickly react to online/offline changes, which may cause the process to get stuck for up to 30 seconds.

The other, more likely reason is 429s returned from SPO for fetching ops. We do not have logging for individual retryable attempts, so this goes unnoticed today.

Fix:

1. Make the op fetching process return on cancellation immediately by listening for the cancellation event.
2. Add telemetry for some sub-processes, like fetching ops from cache, if they take longer than 1 second.
3. Remove the ExtraStorageCall event, as it fires on all successful fetches, and instead make the core op fetching logic raise a GetDeltas_cancel event if cancel was processed before all ops were fetched.
4. Add telemetry (logNetworkFailure in getSingleOpBatch) for individual failed fetches, such that we get insights into things like 429s that may block the fetching process (but are currently not visible in telemetry).

Outcome:

This does address many, but not all, NoJoinOp issues (the remaining ones need a deeper look).
But this in turn brings back "too many retries" errors, indicating that one of the reasons we run into the initial problem is the client not being able to find relevant ops (and on top of that, not failing sooner, but hanging). These errors also need a deeper look to understand whether the bugs are on the client or the server side.

* Throttler unit tests (#6909)

Fixes #6472.

Adds comments and tests to Throttler.
Changes SummaryManager to depend on IThrottler to be passed in rather than creating locally.
"Fixes" and enables a test to handle the case when subsequent getDelay() is called before the previous delay elapsed. Before the virtual times would keep increasing further into the future, but now they are capped at the real current time by subtracting them back down.

* E2E Pipeline: Run Local to get baseline logs (#6934)

Run the local server tests as a baseline in the e2e pipeline. The primary benefit here is to get logs for those tests, so when analyzing we can easily tell whether an error is unique to a server or not.

related to #6910

* Fewer events in stress tests (#6959)

* Use PropertiesManager to handle property merges on Interval (#6824)

Add a PropertiesManager to Interval and SequenceInterval and let it handle changes to Interval properties. Add a changeProperties API. Support cross-client property changes via the change op.

* Update test-real-service.yml for Azure Pipelines

* Update test-real-service.yml for Azure Pipelines

* Add if-match header while summary upload (#6963)

* Test summarizer node (#6885)

Fixes #4459 by adding some tests for SummarizerNode, setting up the infrastructure for more.

* Adding tracking of get_ops PUSH requests (#6966)

Tracking from, to, duration.

* Extract request summarizer from SummaryManager (#6908)

Extract request summarizer function from SummaryManager.
Switch to use requestFluidObject.

* Add rushstack-based eslint config (#6920)

* Add some more throttler tests (#6967)

Consolidate and add more Throttler tests

* Pull out opsUntilFirstConnect check and use opsSinceLastAck instead (#6907)

Instead of doing "join" op sequence number minus DeltaManager.initialSequenceNumber, we use SummaryCollection.opsSinceLastAck. The former was a count of ops we "caught up" with, but the latter is ops since the last summary ack. Both should be equivalent when loading from latest snapshot, but in cases of loading from cached snapshot, the latter is more accurate.

Also removes use of PromiseTimer for initial delay, and instead uses the new delay common-utils function.
Also pulls out the handler to further reduce number of ContainerContext references within SummaryManager.

Changed the logic to just check opsSinceLastAck at the point of deciding whether to apply the initial delay or not. Now it uses a new checkBypassInitialDelay() function in conjunction with a deferred to simplify this logic. We also now check more frequently if we need to bypass the delay: any time refreshSummarizer() is called in either Off or Starting state (Starting state is new here).

* Docs: Add package placeholder (#6970)

* Snapshot tests: Added option to generate new reference snapshot files (#6925)

* Clarify intent of new hash fallback chunk (#6964)

Recent change to do automatic hash fallback when running in insecure contexts uses a dynamic import which webpack will create under all circumstances even if it is not expected to ever get served. As is it receives anonymous naming ("1.js" in our local outputs), which is non-descriptive and can be confusing for those updating their fluid version and seeing a large bundle size increase. This change adds additional notes around that increase and gives the chunk a more descriptive name matching its functionality.

* Removed client-api dependency from replay tool (#6941)

- Removed client-api dependency from replay tool.
- Added simple code loader, data object factory and runtime factory to replay tool.

* Bump tar from 4.4.13 to 4.4.15 in /server/routerlicious (#6975)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.15.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.15)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump tar from 4.4.13 to 4.4.15 in /server/historian (#6974)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.15.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.15)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump tar from 4.4.13 to 4.4.15 in /server/gitrest (#6973)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.15.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.15)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add more logging and recovery attempts for stalled connection (#6978)

1. Added NoJoinOpRecovered to see if connection ever recovers
2. Converted NoJoinOp to heartbeat instead of single event. Likely will undo that in future when we have more data here, but currently it will help understand how long connection existed in cases where currently NoJoinOp is the last event in whole trace (not clear if that's because process dies, or some other reasons).
3. Added noop pings and reconnects as recovery options taking on broken connection. Telemetry around if it helps will point us to next steps to take in this area.
4. Validate that we somehow did not miss "disconnect" event - all key methods on connection to assert it's not used in disconnected state.

* Docs: Broken link (#6981)

The Docs on the Fluid Service https://fluidframework.com/docs/deep/containers-runtime/ attempts to reference the Summarizer topic, but the outgoing link is incorrect.

* Add explicit types for some public APIs (#6983)

* hot fix: change compressSmallBlobs level to 2 (#6918)

* Scribe metrics (#6944)

Scribe metrics

* Snapshot tests: Enabled GC and added ability to test snapshots created via detached container (#6942)

* Test summary manager (#6916)

* Add FrsClient integration test infra (#6761)

* added tinyliciousclient test cases + existing flag fix + allow DiceRoller in test import

* removed extra new line

* added first test case for FrsClient

* added local tinylicious test case + scripts

* added frs test infra

* removed tinylicious test case since it's not required anymore with the toggle

* added CreateClient + run frs test script

* rebuilt project

* reverted back lock packages

* removed js files

* reverted back routes.ts

* frsclient package.json script ordering

* added copyright to CreateClient

* reordered package.json

* added cross-env to define env variable

* converted CreateClient.ts to a single function + changed file name to FrsClientFactory

* converted CreateClient.ts to a single function + changed file name to FrsClientFactory

* removed assert + type/assert and renamed scripts for consistency

* Shared object cleanup (#6990)

null check in constructor, toJSON, setOwner, debugAssert

* Publish initial sequence docs on website (#6265)

* Fix SharedInterval e2e tests (#6993)

The conflicting op test in e2e's SharedInterval.spec.ts expects certain operations to be sequenced in a certain order.
Even if we create the ops in some order in the tests, if they are from different client sessions, the network doesn't guarantee the order of arrival. So we need to call processOutgoing in between to make sure the generated op has round-tripped through the server and been sequenced (but not processed yet) before generating the second conflicting op.

Also reduce the test timeout targeting r11s from 30s to 5s (in the past, all tests finished in < 2.2s).

* Improved console output for getkeys (#6988)

* Adjustments based on latest scalability run (#6994)

1. We incorrectly quantify fetch timeouts as non-recoverable errors - that's not the case.
2. sending noops on broken connection does not really make a difference (though these ops make it to server), so remove this code and switch to forcing reconnect on error
3. Non-recoverable errors do not register as errors. Fix it.
4. Assert is hit due to timer callback firing after timer was cancelled - looks like a race condition.

* Canvas view/model separation (#6897)

This change does the following:

-Exposes a public interface on the Canvas data object to be used by a separate view (really just the Ink DDS -- a better public interface is probably possible, potentially by more deeply incorporating elements of the InkCanvas class).
-Splits the view out from the data object, and rewrites the view in React
-Removes usage of IFluidHTMLView
-Uses the example-utils ContainerViewRuntimeFactory to combine the view/model.
-Enables strict mode for the package

* React view/model separation (#6913)

This change updates SyncedDataObject and all of its customers to no longer implement IFluidHTMLView but instead use a split view/model approach.

For the clicker-react examples which use webpack-fluid-loader and a container-views approach it does this with a ContainerViewRuntimeFactory.

For likes-and-comments which does not use webpack-fluid-loader this change converts it to an external-views approach, where the app pairs the view with the data object itself.

* Adding Fetch timeouts (#7001)

Closes #6997
We will monitor telemetry to assess whether the timeout approach is valid, and whether a 30-second timeout is the right value.

* Moving off method from LocalDocumentDeltaConnection.create() and adding connected property expected on socket (#7006)

Required for PR #6986.
It needs connected property. I can't patch getter property similar to how we patch "off" method, so I have to change it, and given I'm changing it, remove patching part of "off" method as well

* Don't try to insert blob if already present (#7008)

* Update reference to test snapshots in (#7011)

* [Property Proxy Typescript Migration 3/3] migrating property proxy to TS (#6742)

* rename files to *.ts
* migrate utilities to TS
* remove tsdoc incompatible type description in comment
* migrate proxyhandler to TS
* migrate propertyProxy to TS
* cast property to its correct type
* migrate componentSet to TS
* remove unused ambient types
* add type guards in utilities.ts
* fixes for propertyProxy.ts
* use proxy type in proxyHandler.ts
* remove unneeded import in utility.ts
* update interfaces
* add import to propertyProxy.ts
* migrate componentMap to TS
* migrate componentArray to TS
* add missing include to propertyProxy.ts
* componentArray.ts fix missing changes
* add comment to lastIndexOf issue in componentArray.ts
* update comment componentArray.ts
* adjust eslint, tsconfig, build & test scripts
* update configs
* add exports to index.ts
* move ReferenceType to interface.ts
* update imports
* cleanup tsconfig
* cleanup package.json
* cleanup arrayProxyHandler.ts
* cleanup
* lint fixes
* fix some linting errors
* fix linting errors
* remove unused index.d.ts
* cleanup
* fix @typescript-eslint/ban-types linter errors
* fix @typescript-eslint/consistent-type-assertions linter errors
* convert utility class to namespace
* refactor set to be more type friendly
* cleanup
* adjust jest version
* remove tsdoc-metadata.json
* add handling of NaN
* cleanup componentSet.ts
* further cleanup
* fix comment
* add todo

* Snapshot tests: Added information to README on adding and updating and submitting changes to snapshots (#6984)

Added the following information to the snapshot tests README:
- How to submit changes to the test snapshot content.
- How to add new test snapshots to the repo.
- How to update existing test snapshots in the repo.

* Guarantee expected op ordering in conflicting changeProperties test (#7009)

Use processOutgoing to make sure ops that originate on different clients are processed in the order that the test expects.

* Copyprops precedence (#6969)

Fixes #6758.

As an error flows through FF, we may add telemetry properties at various points. The closer to the source that a prop is added, the more authoritative it is, so when adding subsequent properties, do not overwrite.
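
The precedence rule can be illustrated with a small helper. `addPropsWithoutOverwrite` and `TelemetryProps` are hypothetical names used only for this sketch; treating an `undefined` value as "not set" is an assumption.

```typescript
type TelemetryProps = Record<string, string | number | boolean | undefined>;

// Hypothetical helper: copy incoming props onto target, but keep any value
// that was already set closer to the source of the error.
function addPropsWithoutOverwrite(target: TelemetryProps, incoming: TelemetryProps): TelemetryProps {
    for (const key of Object.keys(incoming)) {
        // Assumption: a key whose value is undefined counts as not yet set.
        if (target[key] === undefined) {
            target[key] = incoming[key];
        }
    }
    return target;
}
```

If a lower layer already set `source: "driver"`, a later call that passes `source: "container"` along with a new `docId` would leave `source` untouched and only add `docId`.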

* Primitives example view/model separation (#6879)

This change does the following:

-Exposes a public interface on the DdsCollection data object to be used by a separate view
-Modifies the view to take that object and use its public interface
-Removes usage of IFluidHTMLView from the model
-Uses the example-utils ContainerViewRuntimeFactory to combine the now-separate view and model
-Minor renames and cleanup (further improvements are certainly possible but the purpose/scope of this change is view/model separation).

* Image gallery example view/model separation (#6965)

Also rewrite of the view and updating libraries used

* Deprecate unused DriverErrorType.genericError (#6489)

Since it's unused and indistinguishable from ContainerErrorType.genericError.

* Remove stop() from IRuntime (#6998)

Also throw in implementation in ContainerRuntime.

* CreateProcessingError now annotates all errors as dataProcessingErrors (#7012)

* Introduce UsageError to replace some asserts (#6961)

Fixes #6315

Asserts should be reserved for low-level internal invariants that indicate core flaws when hit. These two asserts are about proper use of the API, so we're converting them to a new error type.

These also include the telemetry prop usageError which would indicate that an error isn't a fault of the framework, but how the API is being invoked by the consumer.
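
As a rough sketch of the idea, assuming a simplified error shape: the class below is hypothetical and only mirrors the `usageError` telemetry prop named above, and `setBatchSize` is an invented API used for illustration.

```typescript
// Hypothetical, simplified error class; the real UsageError may differ.
class UsageError extends Error {
    // Telemetry prop marking the failure as API misuse, not a framework fault.
    public readonly usageError = true;

    constructor(message: string) {
        super(message);
        this.name = "UsageError";
    }
}

// Invented API: argument validation throws UsageError instead of asserting,
// because a bad argument is the caller's mistake, not a broken invariant.
function setBatchSize(size: number): number {
    if (!Number.isInteger(size) || size <= 0) {
        throw new UsageError("size must be a positive integer");
    }
    return size;
}
```

An assert would still be appropriate for conditions that can only fail if the framework itself is in an inconsistent state.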

* Expose R11s WholeSummaryUpload functionality in client (#7020)

* Add hooks for taggedLogger in the runtime. (#6926)

Last step of #5560. Creates a tagged-logger adapter class (that can wrap vanilla loggers that do not handle tags) and adds runtime handling for taggedLogger property of IContainerContext.

* Removing legacy container creation path from `container.load()` (#7005)

See #3429 and #6033

The snapshot tests were the only consumers (there was an assert which could only be bypassed with a 'magic string' used solely by these tests) of this code path. Considering that snapshot tests are now always starting with an empty snapshot, this is no longer needed.

* Enable API report for all client packages (#6888)

There are several reasons for this change. First, changes to the API report don't always represent breaking changes, but they do always represent changes to the public API - even if the change is just "this method wasn't documented but now it is." Changes to BREAKING.md are intentional so auto-labeling them remains.

The other reason is that we don't have good visibility into the changes that are being made that affect the public API. It's difficult to tell from code changes alone if a change will affect a public API. With this change, all the packages' public API changes will be tracked, and we can use this history over the coming months to help inform how we manage changes and breakages.

* Optionally have the Alfred generate a container id on creation (#7022)

* Add utility api to convert uint8Array to Array buffer (#7013)

* Fixing a bug: always fetching ops from PUSH (#7017)

Fetching ops from PUSH was guarded by an "if (from < this.firstCacheMiss)" check, so ops were not always requested from PUSH.
The fix makes fetching from PUSH a separate callback that is always invoked.

How found:

In case of customDimensions.containerId == "65f88376-9665-465b-bc79-ae18d0ea0647", forcing reconnect actually resolved an issue of stalled client, because initial ops resolved the op gap. Client was stalled prior to that for two reasons:

fetching ops was first not bringing any ops, and then it started timing out.
The fact that it was resolved on reconnect prompted me to inspect the code, where I saw that we do not always request ops from PUSH when we are looking for ops. That is the bug.

* Verify runtime sequence number matches protocol sequence number (#7015)

Fixes #7002 by storing the sequenceNumber in .metadata blob within the runtime-generated summary. Then comparing this to the one that the server generated in the .protocol tree.

If they don't match, close the container with a critical error by default. Allow this behavior to be overridden with a new runtimeOption: "loadSequenceNumberVerification", which defaults to "close". If set to "log", instead just log an error to telemetry on mismatch, and if set to "bypass", do not even perform the check.

Split IContainerRuntimeMetadata into a separate ReadContainerRuntimeMetadata (which unions with undefined and has sequenceNumber as required) type and WriteContainerRuntimeMetadata (which has sequenceNumber required).

This PR also removes the getBackCompatRuntimeOptions function, which was used to handle compatibility when the format of the runtimeOptions changed from a flat list of properties.
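
The three values of "loadSequenceNumberVerification" described above could behave roughly like this. It is a sketch under assumptions: `verifySequenceNumber` and `VerificationHooks` are hypothetical names, and the real check lives inside the container runtime.

```typescript
type LoadSequenceNumberVerification = "close" | "log" | "bypass";

// Hypothetical hooks standing in for closing the container and for telemetry.
interface VerificationHooks {
    closeContainer(error: Error): void;
    logError(error: Error): void;
}

function verifySequenceNumber(
    runtimeSequenceNumber: number,   // read from the .metadata blob
    protocolSequenceNumber: number,  // read from the .protocol tree
    hooks: VerificationHooks,
    mode: LoadSequenceNumberVerification = "close",
): void {
    if (mode === "bypass") {
        return; // do not even perform the check
    }
    if (runtimeSequenceNumber === protocolSequenceNumber) {
        return;
    }
    const error = new Error(
        `Sequence number mismatch: runtime=${runtimeSequenceNumber}, protocol=${protocolSequenceNumber}`,
    );
    if (mode === "log") {
        hooks.logError(error);
    } else {
        hooks.closeContainer(error);
    }
}
```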

* fix blobManager assert tripped after serialization/rehydration (#7003)

Include detached blob IDs in snapshot returned by serialize() and load them upon rehydration.

* policy check (#7024)

* Add documentation to FF.com for SharedMap (#6867)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
Co-authored-by: Skyler Jokiel <skjokiel@microsoft.com>

* Integrate new binary snapshot format in odsp driver (#6962)

* policy (#7029)

* Markdown-lint fixes (#7026)

* version (#7031)

* Add SHA256 and base64 encoding to common-utils hashFile fn (#7007)

P1 #6999

odsp-driver is using sha.js for some hashing stuff which is pulling in a bunch of extra packages. This didn't show when I was doing the buffer removal before because we were just looking at base-host (which doesn't use odsp-driver). We can move odsp-driver to use common-utils hash instead by adding support for SHA256 algorithm and base64 output format. This removes the sha.js and buffer packages from the odsp-driver bundle, which is roughly 37KB parsed size.

* fix returned objects in docs (#7027)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Revert "Add SHA256 and base64 encoding to common-utils hashFile fn (#7007)" (#7032)

This reverts commit 9f8c9dd21c44785e6003621644ed15d8300bf87d.

* [bump] package version to 0.46.0 for development after client release (#7033)

            @fluidframework/build-common:     0.23.0 (unchanged)
     @fluidframework/eslint-config-fluid:     0.24.0 (unchanged)
      @fluidframework/common-definitions:     0.21.0 (unchanged)
            @fluidframework/common-utils:     0.32.0 -> 0.33.0
   @fluidframework/container-definitions:     0.40.0 (unchanged)
         @fluidframework/core-interfaces:     0.40.0 (unchanged)
      @fluidframework/driver-definitions:     0.40.0 (unchanged)
    @fluidframework/protocol-definitions:   0.1025.0 (unchanged)
                                  Server:   0.1028.0 -> 0.1029.0
                                  Client:     0.45.0 -> 0.46.0
                  @fluid-tools/benchmark:     0.40.0 (unchanged)
                         generator-fluid:      0.3.0 (unchanged)
                             tinylicious:      0.4.0 (unchanged)
                             dice-roller:      0.0.1 (unchanged)

* Improved logging of ODSP error responses (#6958)

Add types modeling the shape of response from ODSP, based on looking at telemetry, and only log response if it matches.

* Append dev-defined user properties to the JWT token (#6982)

* custom user properties changes

* generic typing and rendering additional details

* custom interface in view

* remove urls

* iteration

* replace with memberStrings

* frs-client.api.md

* Ensure `flush` cannot be called from `orderSequentially`'s callback in `ContainerRuntime` (#6991)

Allowing flush would silently break orderSequentially's guarantees.

Part of #4048
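
One way such a guard could look, as a minimal sketch: `BatchingRuntime` is a hypothetical stand-in, not `ContainerRuntime`, and the real implementation may track this state differently.

```typescript
// Hypothetical stand-in for a runtime that batches ops.
class BatchingRuntime {
    private insideOrderSequentially = false;
    public flushCount = 0;

    public orderSequentially(callback: () => void): void {
        const wasInside = this.insideOrderSequentially;
        this.insideOrderSequentially = true;
        try {
            callback();
        } finally {
            // Restore the previous state even if the callback throws.
            this.insideOrderSequentially = wasInside;
        }
    }

    public flush(): void {
        if (this.insideOrderSequentially) {
            // Flushing here would split the batch and break the ordering guarantee.
            throw new Error("flush() cannot be called from orderSequentially's callback");
        }
        this.flushCount++;
    }
}
```

Failing loudly makes the misuse visible at the call site instead of letting the batch be split silently.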

* Remove audience error logging in container (#7014)

Fixes #6910
We're frequently hitting the race condition on initial connection (such as with transition from read to write client) where a client disconnects very close to when another client connects, and the disconnect audience signal is sent to the connecting client that never knew about the disconnecting client. This is obfuscating what we really want to check (mismatched audience join/leaves e.g. legitimately lost signals), so just remove it because it's nbd(TM) (jk see the attached bug for more info)

* Restore #7007 to add SHA256 hash and fix tests (#7041)

This change restores #7007 which was reverted due to broken coverage tests and not to block the release. It also fixes the tests, which were broken because of an incompatibility between nyc and jest code coverage. I opened #7039 to track fixing this nyc/jest issue more broadly across the code base.

* Update tenant manager to include documentId (#7043)

* @fluid-experimental/property-changeset to TS [2/3]  (#6893)

* Rename files to use camelCase partial porting to TS
* Fix linting
* Fix imports + auto fixable issues
* Fix build + tests
* Fix build
* Fix policy check + file export
* Fix exports/imports
* Try to fix build
* Add missing scripts
* Fix policy check again
* Fix package scripts
* require() must match filename casing
* Remove lib for now
* Address spaces issue
* Address comments on lodash imports
* Remove added async declarations
* Fix policy check

Co-authored-by: Daniel Lehenbauer <DLehenbauer@users.noreply.github.com>

* r11s: Explicitly allow blank document ids on create doc only (#7038)

* Remove extraneous devDependency from frs-client (#7051)

A dependency is duplicated in devDependencies and isn't getting bumped by the bump tool

* Bump common-utils prerelease version dep to release (#7055)

* remove prerelease version

* missing lock files

* Use path-browserify instead of node path in server-services-telemetry (#7057)

We should not be using node libraries in code that is not explicitly node-only. Webpack no longer implicitly does a polyfill fallback, so our downstream consumers are stuck handling this when we do. path-browserify is used elsewhere (dds/map) over node path, so use that there as well.

* Use common-utils instead of shajs in odsp-driver (#7010)

Fixes #6999, next part of PR #7007

Use the updated hash functions from common-utils (with new SHA256/base64 support) instead of sha.js, which removes the sha.js and downstream dependencies and cuts ~37KB from the odsp-driver package. Requires making getHashedDocumentId async.

* Update server dependencies in Historian (#7056)

* r11s-driver: Use documentId from server as source of truth (#7037)

* changed comment for disableIsolatedChannels (#7058)

* Bump dependencies version (#7068)

Server -> ^0.1028.1

* Add preliminary doc for testing/automation (#6854)

Add a doc for getting started with writing automation against tinylicious or frs, and also re-order the docs in the testing group to better reflect how they should be read when not jumping around

* move cra-demo to FluidExamples, remove from monorepo (#7059)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Remove Legacy Debug Logging (#7082)

Before we had logging infra we used the debug library. This was never removed as we moved to logging infra. These old logs are generally not useful, and are only available on the client. This change removes the dependency, and many of the spurious log statements, which will save some bits over the wire as well.

fixes #6253

* Small telemetry adjustments based on analyzing stress tests. (#7087)

Changes are mostly around uniformly representing data, i.e. using same property names and event names that better reflect what it tracks.

* Bump path-parse from 1.0.6 to 1.0.7 in /server/routerlicious (#7079)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump jszip from 3.6.0 to 3.7.1 in /server/routerlicious (#7078)

Bumps [jszip](https://github.com/Stuk/jszip) from 3.6.0 to 3.7.1.
- [Release notes](https://github.com/Stuk/jszip/releases)
- [Changelog](https://github.com/Stuk/jszip/blob/master/CHANGES.md)
- [Commits](https://github.com/Stuk/jszip/compare/v3.6.0...v3.7.1)

---
updated-dependencies:
- dependency-name: jszip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump path-parse from 1.0.6 to 1.0.7 in /server/gateway (#7077)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump jszip from 3.6.0 to 3.7.1 in /server/historian (#7076)

Bumps [jszip](https://github.com/Stuk/jszip) from 3.6.0 to 3.7.1.
- [Release notes](https://github.com/Stuk/jszip/releases)
- [Changelog](https://github.com/Stuk/jszip/blob/master/CHANGES.md)
- [Commits](https://github.com/Stuk/jszip/compare/v3.6.0...v3.7.1)

---
updated-dependencies:
- dependency-name: jszip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump path-parse from 1.0.6 to 1.0.7 in /server/historian (#7075)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump url-parse from 1.5.1 to 1.5.3 in /server/gitrest (#7074)

Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.3.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.3)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump path-parse from 1.0.6 to 1.0.7 in /server/gitrest (#7073)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Rework non-immediate noop sending logic (send them more often) (#7085)

Please see issues #5629 & #6364 for context, as well as big comment in the code on new characteristics

* Disable running it in stress test for now (#7093)

* ODSP driver: Tell PUSH if client supports get_ops flow (#7025)

* FRS documentation with TokenProvider and Azure function (#7052)

* custom user properties changes

* generic typing and rendering additional details

* custom interface in view

* remove urls

* iteration

* replace with memberStrings

* frs-client.api.md

* token provider docs draft

* link

* optional

* return token

* changes

* update sample FrsMember

* space

* Replace CreateContainerError usages with either normalizeError or new GenericError (#6940)

CreateContainerError as-is is basically equivalent to normalizeError*. So wherever CreateContainerError was given a thrown object, just use normalizeError.

We also called CreateContainerError to raise new cases, those should directly create GenericError with the string used as the error code as well as message.

I also reimplemented CreateProcessingError with normalizeError in here but could split into a different PR.

* Here are the differences between what CreateContainerError returned, and what normalizeError now returns where it's used instead:

The returned error is no longer an instance of LoggingError. So arbitrary props set on there won't be logged, you must use addTelemetryProperties
"Partially" valid errors - e.g. with an errorType but no telemetry prop functions or vice-versa - will not have those properties brought over. These are hypothetical cases that don't happen in practice so we decided to stop supporting them for simplicity.
Tangentially related to @vladsud 's PR #6936

* Bump socket.io-client dep from 2.1.1 to 2.4.0 to resolve security issue in xmlhttprequest-ssl (#7099)

There is a security issue in xmlhttprequest-ssl 1.5.5, which we are getting from our socket.io-client version. It is resolved in 1.6.1, which we can get by bumping our dep for socket.io-client to 2.4.0. We already resolve socket.io-client to 2.4.0, so this should functionally be a no-op for us.

* Revert 1c141b - Early signal processing, PR #6935 (Main) (#7098)

Per feedback, that exposes issues in Audience consumers as signals are dropped.
We will do more thorough investigation if this is a race condition that was always the case but more exposed by this change or this code fundamentally needs to stay as is

* Enable Alfred and Scribe to upload using one single call to storage (#7088)

* Uploading the initial summary in one single call

* Added support for scribe

* Add server-side doc id generation to Tinylicious (#7104)

* Put arraybuffer contents in blobs while rehydrating container (#7030)

* Session metrics (#7047)

Introducing session and startsession metrics

* Fix contents of blobs when binary contents are specified for r11s driver (#7103)

* [0.45] Add assert tags (#7107) (#7109)

* add tags

* fix line lengths

* ODSP driver: flush_ops() implementation for single-commit summary  (#7086)

Please see #6685 for more details on API.
Flush workflow is only enabled if full summary tree (including .protocol tree) is uploaded.
And only if flush_ops feature is supported by PUSH (i.e. PUSH has kill-switch).
Client attempts to ensure that required ops are flushed from PUSH's redis to SPO before summary is uploaded to SPO.

* We may miss "disconnected" event on socket (#6986)

When a connection is established, the connection object is returned to DeltaManager. During this transition, nobody listens for the "disconnected" event, making it possible to miss it.

Add proper handling for such cases, as well as validation that if the connection object is not disposed, the socket itself should be connected.

We had somewhat similar behavior earlier for the "error" handler, but it was not fully correct. Extending it to the "disconnect" event and fixing issues.

* Bugfix and added whole summary option as config. (#7114)

* Bump color-string from 1.5.4 to 1.6.0 (#7113)

Bumps [color-string](https://github.com/Qix-/color-string) from 1.5.4 to 1.6.0.
- [Release notes](https://github.com/Qix-/color-string/releases)
- [Changelog](https://github.com/Qix-/color-string/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Qix-/color-string/compare/1.5.4...1.6.0)

---
updated-dependencies:
- dependency-name: color-string
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump handlebars from 4.7.6 to 4.7.7 (#7112)

Bumps [handlebars](https://github.com/wycats/handlebars.js) from 4.7.6 to 4.7.7.
- [Release notes](https://github.com/wycats/handlebars.js/releases)
- [Changelog](https://github.com/handlebars-lang/handlebars.js/blob/master/release-notes.md)
- [Commits](https://github.com/wycats/handlebars.js/compare/v4.7.6...v4.7.7)

---
updated-dependencies:
- dependency-name: handlebars
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump tar from 4.4.13 to 4.4.17 (#7111)

Bumps [tar](https://github.com/npm/node-tar) from 4.4.13 to 4.4.17.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.13...v4.4.17)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump jszip from 3.6.0 to 3.7.1 (#7091)

Bumps [jszip](https://github.com/Stuk/jszip) from 3.6.0 to 3.7.1.
- [Release notes](https://github.com/Stuk/jszip/releases)
- [Changelog](https://github.com/Stuk/jszip/blob/master/CHANGES.md)
- [Commits](https://github.com/Stuk/jszip/compare/v3.6.0...v3.7.1)

---
updated-dependencies:
- dependency-name: jszip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* remove document id from IFluidDataStoreContext and IFluidDataStoreRuntime (#7064)

* Fix tinylicious (#7115)

* Fix tinylicious

* sort dependencies

* fixed tinylicious server readme extra word (#7117)

* Strip .app in buildHeirarchy from protocol-base (#7116)

* Added test for validating audience correctness (#7097)

* upgrade server deps in client packages (#7118)

* Removed gcData from summarize (#7090)

* Disable TLS 1.0 and 1.1 in Nginx Ingress Controller for security reasons (#7105)

* Disable TLS 1.0 and 1.1 in Nginx Ingress Controller for security reasons

* Add description to README

* Also add comments in tmpl file

Co-authored-by: yhou46 <yunpengdevelop@gmail.com>

* get_ops flow is broken due to raising events on wrong object. (#7123)

The bug became obvious due to latest PUSH rollout to SPDF that removes initial Ops on connection, making this part of code critical in loading flow.
Unfortunately we did not see this issue despite a bunch of testing, and it was missed on PUSH rollout as PUSH is using 0.44.x bits.

* Introduce errorInstanceId for errors telemetry (#7045)

This is a scoped version of #6968.

Add errorInstanceId to IFluidErrorBase (which none of our error classes implement yet)
In wrapError, add the inner error's errorInstanceId to the wrapping error if present
Add wrapErrorandLog which wraps then logs the inner error.
No longer copy telemetry props from inner error in wrapError, since we can now tie the inner/outer errors together in telemetry via ErrorInstanceId.
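
The propagation described above might look roughly like this. It is a simplified sketch: the interface and both function signatures are assumptions for illustration and do not reproduce the real `IFluidErrorBase` or the real helpers.

```typescript
// Simplified stand-in for an error that may carry an instance id.
interface ErrorWithInstanceId extends Error {
    errorInstanceId?: string;
}

// Wraps an inner error and carries over only its errorInstanceId, so the
// inner and outer errors can be tied together in telemetry.
function wrapError(inner: ErrorWithInstanceId, message: string): ErrorWithInstanceId {
    const outer: ErrorWithInstanceId = new Error(message);
    if (inner.errorInstanceId !== undefined) {
        outer.errorInstanceId = inner.errorInstanceId;
    }
    return outer;
}

// Hypothetical event shape and logger callback.
interface WrapEvent {
    eventName: string;
    errorInstanceId: string | undefined;
    message: string;
}

// Wraps, then logs the inner error, since its other props are no longer copied.
function wrapErrorAndLog(
    inner: ErrorWithInstanceId,
    message: string,
    log: (event: WrapEvent) => void,
): ErrorWithInstanceId {
    const outer = wrapError(inner, message);
    log({
        eventName: "WrapError",
        errorInstanceId: inner.errorInstanceId,
        message: inner.message,
    });
    return outer;
}
```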

Co-authored-by: Tony Murphy <anthonm@microsoft.com>

* Start running e2e test targeting tinylicious in realsvc pipeline (#7122)

* Move tinylicious-client from experimental/framework -> packages/framework (#7101)

Move tinylicious-client from experimental/framework -> packages/framework

* Change references to "frs" -> "azure" in service-specific client packages (#7084)


Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* add exponential backoff retry mechanism for writing ops to mongodb
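
A generic exponential backoff retry can be sketched as below. This is not the code from this commit: the function names, the default values, and the cap on the delay are all assumptions chosen for illustration.

```typescript
// Delay before retry number `attempt` (0-based): doubles each time, capped.
function backoffDelayMs(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
    return Math.min(maxDelayMs, initialDelayMs * Math.pow(2, attempt));
}

// Runs `operation`, retrying on failure with exponentially growing delays.
// Rethrows the last error once `maxRetries` retries have been used up.
async function retryWithBackoff<T>(
    operation: () => Promise<T>,
    maxRetries: number = 3,
    initialDelayMs: number = 100,
    maxDelayMs: number = 5000,
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await operation();
        } catch (error) {
            if (attempt >= maxRetries) {
                throw error;
            }
            const delayMs = backoffDelayMs(attempt, initialDelayMs, maxDelayMs);
            await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
        }
    }
}
```

A database write would be passed in as `operation`, for example `retryWithBackoff(() => writeOps(batch))`, where `writeOps` is a placeholder for the actual write call.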

* fix blob test failing on r11s (#7126)

This code was left out when replacing the error string with the shortcodes.

The behavior being tested is not fully implemented, and is only partially implemented on ODSP. The test checks that attach() throws an error when attachment blobs are present, or that an error is thrown by the r11s or local document service factory beforehand. The error message is the same in r11s and local, but was replaced by different shortcodes.

* Rename uber package to drop "@fluid-experimental" scope (#7108)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Move fluid-static from experimental/framework -> packages/framework (#7133)

Move fluid-static from experimental/framework -> packages/framework

* update packages to reflect latest layers (#7140)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Add a hook for an event that will fire on changes to Interval properties (#7019)

* Add a hook for an event that will fire on changes to Interval properties

* Update API signature

* Remove redundant init of range labels

* Respond to review feedback

* Fixed not waiting for clients to get to connected state in Audience tests (#7142)

* use "container" type payload when uploading summary in attach() with blobs (#7110)

Use "container" type summary payload instead of "channel" type when uploading initial summary in attach() when attachment blobs are present.

* Adjust latency telemetry based on recent regression in scalability runs experience (#7143)

One of the recent client changes overwhelmed PUSH but it was not very obvious from client logs that regression occurred.
Make sure such issues are more visible through usage of error events.
We will likely need to adjust threshold in the future, https://onedrive.visualstudio.com/SPIN/_queries/edit/1083644/?triage=true is tracking work on PUSH side to better understand latencies, and what to do next.

Also correctly format duration for another event by using performance event API (it converts floats to ints, making it easier to consume in telemetry and reducing size a bit).

* Follow ups to fluid:telemetry:OdspDriver:GetDeltas_cancel events #7040 (#7144)

Please see issue #7040 for more details - this event duplicates another event that has exactly the same name, which makes data analysis very confusing.
Closes #7040

* remove value from DB close log (#7150)

* Bump path-parse from 1.0.6 to 1.0.7 (#7081)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Rename FrsAzTokenProvider -> AzureFunctionTokenProvider on FF website

* Adjust nop frequency up from 250ms to 2s based on feedback from ODSP TAP 40 scalability run. (#7127)

With current (prior to fix) numbers, we were spamming PUSH with too many ops, so it could not handle the workload.
2s adjustment will have (using TAP 40 payload) 8x difference in outbound traffic, and should have zero impact on final file size (due to 5 ops / sec of real traffic that results in all noops not being sequenced).
That said, it will regress certain workloads (search quality, size of snapshots), so we will need to re-evaluate impact and next steps.

Long term, we should strongly consider "alternative solution" proposed in #5629, as well as make our system less susceptible to collab window size (i.e. how summaries are working for Sequence and how search works for Sequence).

* Enable using multiple account for odsp e2e tests (#7148)

ODSP throttles and causes the tests to fail if there are too many concurrent real service e2e jobs running in the pipeline.
To avoid it, use multiple accounts to spread the load.
- Add a new specification for multiple accounts that can be used in login__odsp__test__tenants
- ODSP test driver will generate a set of accounts from which it picks randomly when the test driver is created.
  - For e2e, that means every test file.
  - If there are sufficient accounts, the workload should be spread enough to avoid throttling.
 
To support multiple users in a single session:
- Change the token cache to be keyed on the full user id when the username auth method is used.
- ODSP driver multiplexes the socket, but it is associated with a single user.  Add an option to the ODSP driver to specify whether we want to isolate the socket cache by factory.

Also updated the pipeline to use the format.

* reuse retry logic

* Enable multiuser account for stress test (#7161)

- Increase the time to wait for the token cache file lock (as more accounts need to be authenticated and cached).
- Switch the pipeline to use the tenant accounts format

* Enforce single-use tokens in R11s createDocument API (#7141)

* Bump path-parse from 1.0.6 to 1.0.7 in /docs (#7072)

Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.6 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Allow changing local lambda controller setup type (#7165)

* Update CI pipeline to support non-scoped packages (#7159)

Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>

* Move summary nack messages logic to deli (#7166)

* set perMessageDeflate (#7167)

* Add RestLessServer and RestLessClient (#7168)

* Fixing parameters when creating Routerlicious driver's DocumentStorageService (#7174)

* Add option to always record telemetry during token fetch (#7172)

* WebPack: Remove '--package' arg (#7182)

* Update Tinylicious to use socket.io v4 (#7176)

Updates Tinylicious to use socket.io 4.1.2, same as Routerlicious.  Removes @types/socket.io (no longer required, the more recent versions of socket.io include typing).

* Fix published path for server docker builds, optional NOTICE, and test docker image tag (#7181)

- PR #7159 subdivided the published packages into scoped and non-scoped, to be published in separate steps.  The server packages are all scoped and need to be put into the right path for publishing
- NOTICE generation fails randomly, so make it not required if we don't publish the docker image
- If we are doing a manual test build, don't push it to the same container repository as the main official CI build (use `test/` instead of `build/`).

* Adding TLS support for rdkafka and other Kafka changes (#7171)

* Making sure node-rdkafka is compiled with SSL by adding SSL deps in our Dockerfile and avoiding volume mapping for it in docker-compose.dev.yaml. Also starting the experiment with some variables

* Adding support for SSL configs in RdKafka consumer and producer classes

* Also enabling TLS for Kafka Admin Client

* Adding Readme and updating config name

* Removing debug logs and updating Readme

* Throwing error if rdkafka tries to setup SSL but SSL is not enabled. Also adding more comments

* move logic to services-core, add logger

* artificially throw error

Co-authored-by: Jatin Garg <48029724+jatgarg@users.noreply.github.com>
Co-authored-by: Zach Newton <znewton@microsoft.com>
Co-authored-by: Arin Taylor <artaylor@microsoft.com>
Co-authored-by: chensixx <34214774+chensixx@users.noreply.github.com>
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
Co-authored-by: Kabir Brar <kabir@brar.xyz>
Co-authored-by: Tyler Butler <tylerbu@microsoft.com>
Co-authored-by: Rick Kirkham <Rick-Kirkham@users.noreply.github.com>
Co-authored-by: Nedal Horany <nedalhy@gmail.com>
Co-authored-by: Marcus Karlbowski <43415869+karlbom@users.noreply.github.com>
Co-authored-by: Henrique Da Silveira <41453887+hedasilv@users.noreply.github.com>
Co-authored-by: Helio Liu <59622401+heliocliu@users.noreply.github.com>
Co-authored-by: Paul Leathers <pleath@users.noreply.github.com>
Co-authored-by: Gary Wilber <41303831+GaryWilber@users.noreply.github.com>
Co-authored-by: Vlad Sudzilouski <vlad@sudzilouski.com>
Co-authored-by: Wes Carlson <49205066+wes-carlson@users.noreply.github.com>
Co-authored-by: Pragya Garg <praggarg@microsoft.com>
Co-authored-by: Andrei Iacob <84357545+andre4i@users.noreply.github.com>
Co-authored-by: Navin Agarwal <45832642+agarwal-navin@users.noreply.github.com>
Co-authored-by: Pradeep Vairamani <pradeeprv123@gmail.com>
Co-authored-by: Elchin Valiyev <elchin.valiyev@autodesk.com>
Co-authored-by: Tony Murphy <anthony.murphy@microsoft.com>
Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Donovan Lange <dolange@microsoft.com>
Co-authored-by: Tim Wang <82841707+timtwang@users.noreply.github.com>
Co-authored-by: Curtis Man <curtism@microsoft.com>
Co-authored-by: sumedhb1995 <sumedhb1995@gmail.com>
Co-authored-by: Sumedh Bhattacharya <sumbhatt@microsoft.com>
Co-authored-by: Tyler Butler <tyler@tylerbutler.com>
Co-authored-by: Skyler Jokiel <skjokiel@microsoft.com>
Co-authored-by: sdeshpande3 <46719950+sdeshpande3@users.noreply.github.com>
Co-authored-by: Tanvir Aumi <mdaumi@microsoft.com>
Co-authored-by: Daniel Lehenbauer <DLehenbauer@users.noreply.github.com>
Co-authored-by: yunho-microsoft <75456899+yunho-microsoft@users.noreply.github.com>
Co-authored-by: yhou46 <yunpengdevelop@gmail.com>
Co-authored-by: Tony Murphy <anthonm@microsoft.com>
Labels
area: runtime Runtime related issues
Development

Successfully merging this pull request may close these issues.

Add telemetry on number of ops we are hoarding for delay-loaded data stores / DDSs
4 participants