
[APM] MVP for new service landing page experience #300

Closed
alex-fedotyev opened this issue Jul 21, 2020 · 12 comments
alex-fedotyev commented Jul 21, 2020

Summary of the problem (If there are multiple problems or use cases, prioritize them)
This MVP would be the first step in improving the service landing page.
The main goal is to introduce more actionable troubleshooting workflows and to leverage more data points about the service's performance.

User stories

  • As a service owner, I would like to see key details about my service, such as version, framework, runtime, platform, or cloud, so that I can validate the current state of the service.
  • As App Ops, I need visibility into a timeline of anomalies, alerts, and change events correlated with the service KPIs so that I can better identify and isolate issues.
  • As App Ops, I need to understand the impact of downstream services and backends on the service in question in order to isolate a problem's root cause and reduce MTTR.
  • As App Ops, I need visibility into how the service is performing across its infrastructure in order to isolate issues related to specific instances, geos, deployments, etc.

List known (technical) restrictions and requirements

  1. Must scale across time ranges from 15 minutes to multiple weeks.
  2. Must scale to lists of many dependencies (10+).
  3. Must scale to lists of many service instances (10 to 100s).

If in doubt, don’t hesitate to reach out to the #observability-design Slack channel.

Design issue: https://www.figma.com/proto/WkQsIVDmiYuHkvcXbzYBtg/268-%2F-Service-landing-page?node-id=513%3A2599&viewport=2040%2C-1389%2C0.5&scaling=min-zoom

@elasticmachine
Contributor

Pinging @elastic/observability-design (design)

@alex-fedotyev
Author

alex-fedotyev commented Jul 21, 2020

Quick mock of the service page: [screenshot: Svc 1]
Dependencies view: [screenshot: Svc 2]
Service instances view: [screenshot: Test - Svc 3]

@felixbarny
Member

Related but probably out of scope:
Another nice addition could be adding dots on the transaction duration chart representing exemplar transactions. Clicking such a dot would take the user to the transaction details for that particular instance of a transaction. This makes the process of selecting a representative trace, given a timestamp and a duration, much simpler compared to selecting a distribution bucket and toggling through the 10 examples in that bucket.
Another use case this would help with is a dev workflow where a developer would like to find the request they just made.

@graphaelli
Member

@felixbarny I agree that using the transaction duration chart to zoom in on the interesting transactions would be an improvement. Currently, it's simple to zoom in on a particular spike (x-axis/time filter) and drill into the interesting ones, but y-axis filtering (on transaction duration) must be done manually with the kuery bar.
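
For reference, that manual y-axis filter boils down to a range query on the transaction duration field in the kuery bar. A minimal example (KQL; the field is transaction.duration.us, in microseconds, and the threshold is only an illustrative value):

    transaction.duration.us > 500000 and transaction.type : "request"

Brushing on the chart's y-axis would express the same filter without the user having to know the field name or that the unit is microseconds.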

I'm interested to hear @formgeist's thoughts on the dots approach as part of the overall design for this issue.

@formgeist formgeist changed the title [APM] Design MVP for new service landing page experience [APM] MVP for new service landing page experience Aug 19, 2020
@formgeist
Contributor

Design update Sep 10, 2020

These first mocks are primarily focused on delivering an MVP overview experience for the service. I've made a walk-through of the concepts and variations, so please have a watch 🍿

Loom video walk-through
Figma prototypes [click the Overview link in the tab navigation to switch between the two variations]


[screenshots: Overview; Overview - Pillars widgets]

cc @cyrille-leclerc @nehaduggal @alex-fedotyev

@nehaduggal

Love this iteration. The overview page now allows you to view the event timeline to identify whether there's an issue, and then directly lets a user figure out when and where the potential issue could be (with app KPIs, slowest transactions, errors, and the dependencies view). The only thing I am missing from this view is infrastructure. If we could add infrastructure/instance-based KPIs, that would complete this story and give users a good reference point on where to look for potential issues.

@graphaelli
Member

graphaelli commented Sep 10, 2020

I also really like the direction this is taking, particularly the time overlays for hour over hour and week over week - understanding what's normal, even if not anomalous, is great.

I miss the time spent by span type. Knowing that my application as a whole is spending most of its time in the DB or in application code is extremely valuable at first glance.

Also, is transaction duration represented twice here, in the timeline and in its own chart?

@formgeist
Contributor

Design update Sep 17, 2020

Apologies for the late response, but first of all thank you for the feedback! I've been working on some enhancements and changes based on these suggestions and other feedback that I have received from the team.

[screenshot: Overview]

Figma prototype

There are a lot of changes, probably too many to mention since there are a lot of little tweaks, but here are some highlights:

  • Added the Time spent by span type breakdown as a chart alongside the dependencies table.
  • Updated a lot of the charts to reduce oversaturation of information and colors.
    • We have a new Traffic chart that no longer groups our transactions per minute by HTTP status code, but is instead a pure average chart.
    • The Timeline (new name pending) has been aligned with the duration (now called Latency) chart underneath for consistency.
  • We have some new names that in general should be made consistent throughout the rest of the app. I look forward to hearing what you think of those.

There are plenty of remaining tasks, but I'm eager to hear any feedback on this.

@axw
Member

axw commented Sep 18, 2020

I think this is looking great -- I really like the timeline chart. ++ on having a time scrubber eventually.

Do I understand correctly that the shaded area in the timeline is throughput? Or is it the comparison ("A week ago") latency? This isn't intuitive to me, but maybe I'm just dense. If it's throughput, perhaps it would be helpful to use the same style in the "Traffic" chart?

(BTW, this is the kind of view I had in mind for CPU/heap profiling too. Timeline at the top with interesting events, and details below focused on a selected time range.)

A small detail I noticed in the Figma prototype:

Inside the Cloud Provider details at the top, there's Machine Type and Availability Zone. I would expect it to be very common to have multiple Availability Zones, and perhaps multiple Machine/Instance Types. In theory multiple cloud providers, but that's going to be less common. How will we capture this mix of details at the service level?

@alex-fedotyev
Author

@axw - thanks for the feedback!

I think that almost any information displayed under the top service info icons could have multiple values: a service running on different JVM versions (canary deployment), two agent versions monitoring separate instances of the same service, multiple cloud providers, anything!
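
One rough way to capture that mix at query time (not necessarily how the UI will implement it) is to aggregate the distinct values per field for the selected service rather than assuming a single value. A minimal sketch as an Elasticsearch terms aggregation (Kibana Dev Tools syntax; the apm-* index pattern and service name are placeholders):

    GET apm-*/_search
    {
      "size": 0,
      "query": { "term": { "service.name": "my-service" } },
      "aggs": {
        "runtime_versions":   { "terms": { "field": "service.runtime.version" } },
        "agent_versions":     { "terms": { "field": "agent.version" } },
        "cloud_providers":    { "terms": { "field": "cloud.provider" } },
        "availability_zones": { "terms": { "field": "cloud.availability_zone" } },
        "machine_types":      { "terms": { "field": "cloud.machine.type" } }
      }
    }

Each bucket list returns the distinct values and their document counts, so the icons could show e.g. "2 availability zones" and list them on hover instead of assuming a single value.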

@formgeist
Contributor

Design update Oct 19, 2020

It's been a little while since we've provided an update, but we've been iterating on the layout and design of the overview page quite a bit based on further feedback and discussions around the components in the view.

We're planning on moving to implementation for the layout very soon, so we're focusing on the parts of the layout that we already have, and we should be able to build out the majority of the design in a first iteration. Then come the new components, such as the Dependencies table and the updated Span type breakdown chart, along with the comparison data, which is considered a feature in itself.

We've previously mentioned a History component, which was a container for all the relevant service events (anomalies, annotations, and deployments) combined with a separate visualization of the latency metrics. We've decided to visualize only the existing latency metric chart and use that for hosting the events. The feedback we received was that the History component was confusing to most of the users we presented it to: the concept of having two separate time ranges to control was too difficult to grok. So we've dropped it for the MVP and let the latency metric chart host those events by giving it the full width at the top of the view.

[screenshot: Overview]

As we reviewed the initial draft, we decided we needed to give the tables some more space, and we re-arranged the charts and tables so that related ones are displayed in the same row. We imagine this layout is a good template for adding more in future iterations. Secondly, there are many new controls that allow the user to show/hide comparison time range data and change the latency metric aggregation (which in turn changes it in the tables as well).

We're working on completing the outstanding design tasks in order to finalize the design for implementation, which should start in the coming weeks. Let me know if you have any feedback or questions.

cc @alex-fedotyev

@formgeist
Contributor

Closing this design issue as all of the requirements have been addressed in this first iteration. I've created implementation issues for the UI dev team to proceed with building the view.

elastic/kibana#81147
elastic/kibana#81135
elastic/kibana#81120

If there's a need for additional design, we'll open new issues to handle those requirements.
