[HOLD for payment 2024-06-28] LOW: [Tracking] Metric - Send a message TTI (Performance audit by Callstack) #35225
| messageProcessing [ms] |
| --- |
| 14588.59 |
| 11782.90 |
| 20523.72 |
| 15050.67 |
| 14571.76 |
| 17800.07 |
| 13568.82 |
| 16305.16 |
| 10060.16 |
| 10869.97 |

The average time of messageProcessing is: 14512.18 ms
Auxiliary data
Additionally, for a broader perspective, the resource consumption during one of the test runs is included below.
*(Screenshots: CPU usage, JS Thread usage, RAM usage)*
@kacper-mikolajczak Thanks! Sorry if I am misunderstanding this, but did it take 14 seconds from clicking the Send button to showing the message in the report in your testing? 😱 I have never experienced a delay remotely close to this
It may be the Heavy account setting we're using. This was all on a physical device, production build and this exact account. @rinej will post a recording soon so we can see how the experience looked.
Hi @mountiny 👋 No misunderstanding whatsoever - I did not believe it at first as well :D As @adhorodyski already pointed out, the measurements were aligned with the guidelines, but the app with this specific setup is not performing as it should. I will post the analysis soon, just to unwrap what is going on under the hood 👍
@mountiny I am sending a video example with around 15s for sending a message: short-flow.mp4. Additionally, on that account sending a message is sometimes buggy: it sends just part of the message and keeps the whole message in the input. Example videos: bug1.mp4, bug2.mp4
Holy Moly
Give us some time, we're already cookin' something!! 😊
Analysis

In order to analyse the underlying issues on the Android platform, the app was run in the release build variant and profiled using both native and JS tools (Android Studio Profiler, Chrome Profiler and React Profiler respectively). As a brief overview, here is a high-level representation of the analysis flow:
Finding the dark side

Android Studio Profiler is used to determine whether performance issues stem from the native side, the JS side, or both. The results of native profiling, when plugged into the Perfetto visualiser, look as follows:
The focus of the analysis is bound to the 4 main threads from a React Native standpoint: the UI (main) thread, the JS thread, the native modules thread, and the render/shadow thread.
By looking at the CPU usage summary, there are two areas of high-density calculations. The biggest concern seems to be the overload of JS.

Profiling

The detailed JS profiling trace is obtained thanks to the JS profiling tools mentioned above:
From the trace observed above, the following key takeaways were made:
The main goal of the analysis is to reduce the amount of calculations as much as possible. Each section below contains information about a specific finding and its effect on overall performance. The results will be summed up between sections, eventually presenting the final outcome in the very last part. After each section with meaningful changes in timings, a table like the one below will be presented. The first (baseline) and the last (final) tables consist of 10 samples, while the intermediate ones have 5 samples for brevity.
Hermes Intl
*Stack trace after Hermes Intl changes shows significant improvement*
| # | Duration [ms] |
| --- | --- |
| 1 | 3572.07 |
| 2 | 3512.99 |
| 3 | 3542.68 |
| 4 | 3710.00 |
| 5 | 3672.54 |
| Avg | 3602.06 |
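The exact Hermes Intl changes are not reproduced in this summary. For context, a common source of `Intl` overhead on Hermes is constructing formatter objects over and over; purely as an illustrative sketch (an assumption about the class of fix, not the audited change itself), caching formatters avoids that repeated cost:

```ts
// Illustrative only: constructing an Intl formatter is the expensive part,
// so build each formatter once and reuse it across calls.
const dateTimeFormatters = new Map<string, Intl.DateTimeFormat>();

function getDateTimeFormatter(locale: string, options: Intl.DateTimeFormatOptions = {}): Intl.DateTimeFormat {
    const cacheKey = `${locale}|${JSON.stringify(options)}`;
    let formatter = dateTimeFormatters.get(cacheKey);
    if (!formatter) {
        formatter = new Intl.DateTimeFormat(locale, options);
        dateTimeFormatters.set(cacheKey, formatter);
    }
    return formatter;
}

// Thousands of timestamps can now share a single formatter instance.
const formatted = getDateTimeFormatter('en-US', {dateStyle: 'medium'}).format(Date.now());
```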
The 3 bumps
While the previous step brought enormous gains, there is still quite a lot of work going on under the hood. Even with such a reduction, we can still observe a pattern when looking at the JS profiling trace - there are three significant overheads when sending a message. After investigating, those “bumps” seem to be:
- update of the `hasDraft` property in the report object when the user starts typing a message
- optimistic update when sending the message
- actual Onyx update after the message was successfully sent
Even though those bumps are a result of different application flows, they share the same root cause, which is tied to the `SidebarLinksData` component and its internal calculations.

`SidebarLinksData` subscribes to the collection of all `chatReports` from Onyx (about 14k reports in the testing account). The collection is invalidated every time one of the reports changes, which happens in the aforementioned bumps (updating the `hasDraft` property and sending the actual message).

Whenever the `chatReports` collection is invalidated, it breaks the memoization of the heavy `optionListItems` object. Keep in mind that the memoization might also break because of other dependencies. That was not the case during the analysis of sending a message, but those dependencies appear just as fragile as `chatReports`.
*Brief explanation of SidebarLinksData ordering mechanism and its pitfalls*
New Order
The major suspect of `optionListItems`' bad performance seems to be the `getOrderedReportIDs` method from `SidebarUtils`. It is called to calculate the new order of reports for `SidebarLinksData` whenever dependencies change.

Looking at the stack trace, the `getOrderedReportIDs` function call is responsible for the biggest time overhead, about 2.5-3s per call.
*Stack trace presenting getOrderedReportIDs overhead inside SidebarLinksData*
Each call consists of a number of operations performed on the set of reports (and other arguments) to determine the correct order of reports:

- caching reports (`JSON.stringify`)
- filtering out reports (`arrayPrototypeFilter`)
- modifying reports’ display names (`arrayPrototypeForEach`)
- sorting reports (`arrayPrototypeSort`)
Those calculations are done for all of the thousands of reports every time the collection is invalidated, resulting in high overhead. Let’s look closely at them in the sections below.
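To make the scale concrete, here is a rough sketch of that pipeline as a whole (helper names and the report shape are assumed for illustration; this is not the actual `SidebarUtils` code):

```ts
type Report = {reportID: string; displayName?: string; lastVisibleActionCreated?: string};

// Assumed stand-ins for the real App helpers.
declare function shouldReportBeInOptionList(report: Report): boolean;
declare function getReportName(report: Report): string;

function orderReports(allReports: Record<string, Report>): string[] {
    const reports = Object.values(allReports); // ~14k entries on the test account

    // Filtering out reports (arrayPrototypeFilter).
    const visibleReports = reports.filter(shouldReportBeInOptionList);

    // Modifying reports' display names (arrayPrototypeForEach).
    visibleReports.forEach((report) => {
        report.displayName = getReportName(report);
    });

    // Sorting reports (arrayPrototypeSort).
    visibleReports.sort((a, b) => (a.displayName ?? '').localeCompare(b.displayName ?? ''));

    return visibleReports.map((report) => report.reportID);
}
```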
The Cache
As the old joke goes, the two hardest things in computer science are cache invalidation, naming things and off-by-one errors. Based on the previous findings, the audit will try to address them in this section.
The `getOrderedReportIDs` function tries to avoid recalculations by introducing a cache. While caching is generally a good practice, in this specific case it might add more overhead than it mitigates.

The app spends about 150ms performing `JSON.stringify` on a heavy list of data in order to compute the `cachedReportsKey`. The generated key is then checked against the existing cache. If a cache entry is hit, the cached ordered `reportIDs` are returned. The issue here is that a cache hit is unlikely, especially given the maximum cache size of 5 entries and the rather large number of input variables (a sketch of the pattern follows the list below):
- `currentReportId`
- `allReports`
- `betas`
- `policies`
- `priorityMode`
- `allReportActions`
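A minimal sketch of this caching pattern, with the shapes and eviction policy assumed for illustration (not the actual `SidebarUtils` code); it shows why computing the key alone is already expensive:

```ts
declare function computeOrderedReportIDs(allReports: Record<string, unknown>): string[];

const cache = new Map<string, string[]>();
const MAX_CACHE_SIZE = 5;

function getOrderedReportIDsCached(
    currentReportId: string,
    allReports: Record<string, unknown>,
    betas: string[],
    policies: Record<string, unknown>,
    priorityMode: string,
    allReportActions: Record<string, unknown>,
): string[] {
    // Serialising every input just to build the key is ~150ms of work by itself.
    const cachedReportsKey = JSON.stringify([currentReportId, allReports, betas, policies, priorityMode, allReportActions]);

    const cached = cache.get(cachedReportsKey);
    if (cached) {
        return cached; // Rarely hit: a change to any input yields a brand-new key.
    }

    const orderedReportIDs = computeOrderedReportIDs(allReports); // The expensive ordering itself.

    if (cache.size >= MAX_CACHE_SIZE) {
        const oldestKey = cache.keys().next().value;
        if (oldestKey !== undefined) {
            cache.delete(oldestKey); // Evict the oldest entry.
        }
    }
    cache.set(cachedReportsKey, orderedReportIDs);
    return orderedReportIDs;
}
```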
By removing the caching mechanism, the time required to calculate `getOrderedReportIDs` decreases by about 0.3s (~15%).
*By removing the caching mechanism getOrderedReportIDs gets 15% faster*
Additionally, `allReportActions` is solely used to compute the cache key, so when the cache implementation is removed, the entire `allReportActions` retrieval process can be omitted as well (including the request to Onyx), which is another potential gain for the entire `SidebarLinksData`. Comparing the stack traces shows a ~40ms gain in performance:
*Before removing allReportActions retrieval*
*After removing allReportActions retrieval*
| # | Message | Duration [ms] |
| --- | --- | --- |
| 1 | testing1 | 3032.67 |
| 2 | testing2 | 3216.57 |
| 3 | testing3 | 3232.38 |
| 4 | testing4 | 3295.99 |
| 5 | testing5 | 3195.70 |
| Avg | - | 3194.86 |
The naming
Another heavy lifter inside `getOrderedReportIDs` is the computation of a report’s `displayName` property via `getReportName`.
*Overhead of forEach that is mostly related to getReportName calculations*
In order to calculate the display name, `getReportName` internally calls many different helpers, but the ones that stand out the most are `getDisplayNameForParticipant` and `getPolicyExpenseChatName`.
*getDisplayNameForParticipant and getPolicyExpenseChatName overhead of ~200ms*
While one could try to squeeze out gains by looking for a more performant library or doing micro-optimisations, the best way to save time is to calculate static properties once (preferably server-side), as they are not intended to change, and then derive dynamic values from them.
For example, in the `getDisplayNameForParticipant` implementation we can see that the value returned from the `formatPhoneNumber` call can be re-used:
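A minimal sketch of that re-use, assuming a string-to-string `formatPhoneNumber` helper (the map mirrors the “makeshift map” experiment described next):

```ts
// Declared here only to keep the sketch self-contained; the App provides the real helper.
declare function formatPhoneNumber(login: string): string;

// Format each login once and reuse the result on subsequent getReportName passes.
const formattedLoginCache = new Map<string, string>();

function getFormattedLogin(login: string): string {
    let formatted = formattedLoginCache.get(login);
    if (formatted === undefined) {
        formatted = formatPhoneNumber(login); // Expensive parsing/formatting, now done once per login.
        formattedLoginCache.set(login, formatted);
    }
    return formatted;
}
```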
The `formattedLogin` field could be part of the `personalDetails` object itself or might live in a separate entity. As an experiment, the formatted login was stored in a makeshift map, which reduced the overhead by about 500ms:
*Still costly forEach, with runtime reduced by about 500ms*
The examples brought up here are meant to illustrate what can be done on a broader scale in the implementation phase with the app’s data flow in general. The higher in the tree of data sources we can reduce the number of calculations, the better - and this is something we will do in the future.
| # | Message | Duration [ms] |
| --- | --- | --- |
| 1 | testing1 | 2482.31 |
| 2 | testing2 | 2512.94 |
| 3 | testing3 | 2474.32 |
| 4 | testing4 | 2461.08 |
| 5 | testing5 | 2493.86 |
| Avg | - | 2484.12 |
Report’s `hasDraft` property
An example of such a high-level improvement could be the process of determining whether a report has a draft message. It fires whenever the user starts typing a new message and is responsible for the first of the three bumps described above.
Updating the draft state is mainly used to show the correct order and UI in the LHN. While it is an important part of the UX, we don’t necessarily want to perform 1.5s of calculations to update one property.
Unfortunately, the `hasDraft` property is strictly bound to the report object. Changing (part of) the report results in the expensive `getOrderedReportIDs` calculations mentioned above.
In order to fix this, the `hasDraft` property could be extracted from the core report structure and served as a separate dependency.
Here are the results of mitigating the `hasDraft` impact - the first bump of calculation, present in one of the first measurements, is now almost completely gone. The overhead drops from about 1.5s to under 100ms. In comparison to the other bumps, it now looks negligible.
*Results of avoiding the unnecessary getOrderedReportIDs updates on draft message*
This was achieved by replacing the `hasDraft` property with a separate Onyx key, `REPORT_HAS_DRAFT_`, which is then used in the components of interest (e.g. `SidebarLinksData`, `OptionRowLHN`). The most significant improvement comes from skipping the costly `getOrderedReportIDs` call whenever the draft status changes - instead of calculating the entire list from scratch, the newly created draft is injected into the previous list.

Here is the partial implementation as seen from the `SidebarLinksData` component:
*SidebarLinksData improvements*
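The screenshot of the diff is not reproduced here; below is a hedged sketch of the described approach (component shape, prop names and helpers are assumed, not the actual PR code). The expensive ordering memo no longer depends on draft state, and drafted reports are injected into the previously computed list:

```tsx
import React, {useMemo} from 'react';

type Report = {reportID: string};

// Assumed stand-ins for the real App pieces.
declare function getOrderedReportIDs(currentReportID: string, chatReports: Record<string, Report>): string[];
declare function LHNOptionsList(props: {reportIDs: string[]}): React.ReactElement;

type Props = {
    currentReportID: string;
    chatReports: Record<string, Report>;
    // Draft state now arrives via the separate REPORT_HAS_DRAFT_ Onyx keys,
    // not via the report objects themselves.
    draftReportIDs: string[];
};

function SidebarLinksData({currentReportID, chatReports, draftReportIDs}: Props) {
    // The costly ordering is memoized on the report collection only;
    // typing a draft no longer invalidates it.
    const orderedReportIDs = useMemo(
        () => getOrderedReportIDs(currentReportID, chatReports),
        [currentReportID, chatReports],
    );

    // Inject drafted reports into the previous list instead of reordering from scratch.
    const optionListItems = useMemo(() => {
        const drafted = draftReportIDs.filter((id) => !orderedReportIDs.includes(id));
        return [...drafted, ...orderedReportIDs];
    }, [orderedReportIDs, draftReportIDs]);

    return <LHNOptionsList reportIDs={optionListItems} />;
}
```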
A similar pattern was applied to other draft properties, like `REPORT_DRAFT_COMMENT`, so it might be worthwhile to spark a discussion about handling such structures in the system.
| # | Message | Duration [ms] |
| --- | --- | --- |
| 1 | testing5 | 2485.97 |
| 2 | testing 😄 | 2314.00 |
| 3 | testing6 | 2237.64 |
| 4 | testing7 | 2247.78 |
| 5 | testing8 | 2390.47 |
| 6 | testing9 | 2271.90 |
| 7 | testing1 | 2255.88 |
| 8 | testing3 | 2309.42 |
| 9 | testing5:smile: | 2319.85 |
| 10 | testing6 | 2280.91 |
| Avg | - | 2310.77 |
Summary
While there is still room for improvement (the goal we are striving for is 100ms of execution time), with the proposed changes the time to send a message went down from ~15s to ~2.5s.
These changes are meant to be taken as a general direction or as example implementations, which will be continued in the next phase of the audit.
The main goal is to reduce costly calculations that freeze the thread and make the app unresponsive for the user - which is even worse UX than inaccessible inputs! Following that, as shown with the smaller improvements, we aim to improve both performance and developer experience by writing more reliable code.
Hope you like the analysis - see you soon in the implementation phase! :)
These analyses are really impressive and a great opportunity for everyone to learn from. It's also great to see that this is in the open source channel, as new engineers/anyone can get inspired by these. I am very excited about bringing the proposals from this into the App!
Love to hear that @mountiny! We are also eager to implement the proposal as soon as possible to get the benefits out of it. Before that, would you like to break down the analysis with me into some Actionable Points we can treat as entries into the Implementation Phase? For convenience, here is a quick summary of the main points we can try to turn into APs and prioritize:
Please keep in mind the actual time gains might differ - their purpose here is to give a general overview of the level at which each presented change might impact performance 👍
Thank you Kacper, I think my reasoning for ordering would be: make the biggest-impact changes first and also prioritize the lowest-hanging fruit. Seems like your list achieves that, so let's do that
Great, I find such ordering best as well. Just to sum up, here are the ordered Action Items:
We are going to use those items as an input into the Implementation Phase. CC: @mountiny
Thanks!
We're working on it
Working through it
Most recent Slack convo: https://expensify.slack.com/archives/C05LX9D6E07/p1713358946087599 TL;DR: currently we're talking about standardizing on moize as our go-to solution for memoization outside of React.
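For readers unfamiliar with it, moize wraps a plain function in a configurable memoization cache. A tiny illustrative sketch (the wrapped function is a placeholder, not an agreed-upon target):

```ts
import moize from 'moize';

// Placeholder for any expensive pure function living outside React.
declare function getOrderedReportIDs(currentReportID: string): string[];

// Memoize with a small, bounded cache: repeated calls with the same
// argument return the cached result instead of recomputing.
const getOrderedReportIDsMemoized = moize(getOrderedReportIDs, {maxSize: 5});

getOrderedReportIDsMemoized('1234'); // computed
getOrderedReportIDsMemoized('1234'); // cache hit
```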
This issue was mentioned in https://expensify.slack.com/archives/C05LX9D6E07/p1711455442814489?thread_ts=1711010712.155509&cid=C05LX9D6E07 I'm going to write an e2e test for that (or extend a current one) to ensure the performance of such critical flows does not degrade over time 👍
@kacper-mikolajczak are you still on this one?
Hi @muttmuure, it was part of the last performance audit and is completed for now.
Closing
The solution for this issue has been 🚀 deployed to production 🚀 in version 1.4.85-7 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue: If no regressions arise, payment will be issued on 2024-06-28. 🎊 For reference, here are some details about the assignees on this issue:
Intro
As a part of the Callstack performance audit, this issue tracks and summarises the phases for the `Send a message` metric.

The motivation behind measuring the TTI of this particular behaviour was the many reported issues of bad UX whenever a user tried to send a message. Thanks to the offline-first approach, we can also discard network latency in measurements, but we keep an eye on optimistic/actual updates in general during the analysis.
Issue Owner
Current Issue Owner: @muttmuure