Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dashboard): automated analysis dashboard card #618

Merged
merged 53 commits into from
Dec 8, 2022

Conversation

maxcao13
Copy link
Member

@maxcao13 maxcao13 commented Nov 9, 2022

@maxcao13 maxcao13 added the feat New feature or request label Nov 9, 2022
Copy link
Member

@andrewazores andrewazores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking super cool. I have some minor feedback so far about the implementation itself.

src/app/Dashboard/AutomatedAnalysisCard.tsx Outdated Show resolved Hide resolved
src/app/Dashboard/ClickableAutomatedAnalysisLabel.tsx Outdated Show resolved Hide resolved
src/app/Dashboard/ClickableAutomatedAnalysisLabel.tsx Outdated Show resolved Hide resolved
src/app/Dashboard/ClickableAutomatedAnalysisLabel.tsx Outdated Show resolved Hide resolved
src/app/Dashboard/ClickableAutomatedAnalysisLabel.tsx Outdated Show resolved Hide resolved
@andrewazores
Copy link
Member

I wonder what to do, or what to show to the user, in the case where there are no recordings to source data from. Should the card just show an empty content? Maybe it should explain why it is empty and provide a link to the Recordings view to prompt the user to start a recording?

What if the target has no running active recordings but it does have a recent archive? What if this is a Cryostat Agent target that we eventually have implemented, and so perhaps we cannot even start active recordings or maybe even can't list active recordings, but it may have new archives that appear over time as it pushes them?

Would it make sense to have a "source" that can be selected for the card to switch between taking snapshots vs using the latest archive?

@maxcao13 maxcao13 marked this pull request as draft November 9, 2022 18:31
@maxcao13
Copy link
Member Author

maxcao13 commented Nov 9, 2022

I wonder what to do, or what to show to the user, in the case where there are no recordings to source data from. Should the card just show an empty content? Maybe it should explain why it is empty and provide a link to the Recordings view to prompt the user to start a recording?

Some good points. My first impression is that if there are no source recordings on the target, there should be an explanation why we couldn't perform automated analysis, and then maybe have a prompt or button that allows Cryostat to start a profiling recording directly from the card itself without having to move, and then automatically, perform a snapshot, parse the report etc. (also maybe having a second option in the prompt to create the recording manually if the user wants to do so). Maybe also, there can be a customizable setting in the Settings, where the use can turn on and off whether recordings should be automatically started on targets for automated analysis, (during startup or when rendering the dashboard?), so that the user doesn't have to manually create a recording every time.
EDIT: I just realized that I am describing an Automated Rule.

What if the target has no running active recordings but it does have a recent archive? What if this is a Cryostat Agent target that we eventually have implemented, and so perhaps we cannot even start active recordings or maybe even can't list active recordings, but it may have new archives that appear over time as it pushes them?

Would it make sense to have a "source" that can be selected for the card to switch between taking snapshots vs using the latest archive?

First impression is that if an active recording isn't currently started, then maybe the Card will fallback to check if there are any archived recordings listed. If there are, then having the card acknowledge that it is using an Archived Recording to generate a report on, and also having a timestamp or "Seconds since stale" increasing timer number along with the analysis may be useful for the user to check how stale the analysis is?

I'm not sure about how the source idea would go; I don't if it may be confusing to people but if you think people may want that fine-grained control over their analysis, then it can definitely be worked in.

@andrewazores
Copy link
Member

EDIT: I just realized that I am describing an Automated Rule.

:-)

I thought about that, too. It might make sense to point the user to creating an Automated Rule if there are no source recordings for the card to draw from, rather than having them create a recording manually. I think Automated Rules fit better with the story about continuous monitoring, which is also what the Dashboard should be about, whereas a custom recording created manually is more of a deepdive profiling task.

First impression is that if an active recording isn't currently started, then maybe the Card will fallback to check if there are any archived recordings listed. If there are, then having the card acknowledge that it is using an Archived Recording to generate a report on, and also having a timestamp or "Seconds since stale" increasing timer number along with the analysis may be useful for the user to check how stale the analysis is?

I'm not sure about how the source idea would go; I don't if it may be confusing to people but if you think people may want that fine-grained control over their analysis, then it can definitely be worked in.

I think your idea makes a lot of sense, actually. Given that this is the Dashboard and meant to just give a quick overview of things, the "staleness" on the card seems like a really good compromise. The user can decide what to do from there or interpret that information according to the extra context they might know about the target or that data.

@maxcao13
Copy link
Member Author

maxcao13 commented Nov 9, 2022

Adding on to that, it may make sense to cache reports within the web-client and also maybe use the most recently cached report for a target in-case maybe a report generation or snapshot fails upon re-rendering. Then in that case, the staleness should also probably be used.

@maxcao13 maxcao13 force-pushed the automated-analysis-dashboard branch from d2541e5 to 96d9e41 Compare November 10, 2022 01:38
@maxcao13 maxcao13 force-pushed the automated-analysis-dashboard branch 2 times, most recently from 4e4b903 to 6c819b0 Compare November 15, 2022 20:12
@maxcao13 maxcao13 marked this pull request as ready for review November 15, 2022 23:36
@maxcao13
Copy link
Member Author

maxcao13 commented Nov 15, 2022

Ready for review, except for tests, which I will do after the OK.

There are issues with how snapshots seem to be flakey to work sometimes, and this results in snapshot recordings that are created under the target that don't seem to be useful. Also, another problem that happens is that eventually, if a recording has gone on a while and a snapshot is taken and sent to the /reports sidecar, there will be a ReportGeneration error because the Request Entity is supposedly too large. I assume these both are backend issues.

@andrewazores
Copy link
Member

Those do sound like backend issues.

There are issues with how snapshots seem to be flakey to work sometimes, and this results in snapshot recordings that are created under the target that don't seem to be useful.

Any more details on the flakiness or how the snapshots aren't useful?

Also, another problem that happens is that eventually, if a recording has gone on a while and a snapshot is taken and sent to the /reports sidecar, there will be a ReportGeneration error because the Request Entity is supposedly too large.

This could just be a report sidecar resource constraints configuration issue, or it could be an issue with the configuration of the active recording on the target being observed. What recording settings were used for that? Maybe it needs a slimmer event template, or a maxAge/maxSize.

@github-actions
Copy link

This PR/issue depends on:

@maxcao13
Copy link
Member Author

maxcao13 commented Nov 17, 2022

I'm not sure when it happens, but here is an example I just did by just repeatedly taking snapshots until it happens:
image
image
image
As you can see the recording is called "Snapshot", and if this occurs more than once, there will be multiple recordings called "Snapshot" seemingly existing at once within the Target.

As for the request entity problem, I generate snapshots using this recording as a basis:

  name: 'automated-analysis',
  events: 'template=Continuous,type=TARGET',
  duration: undefined,
  archiveOnStop: false,
  options: {
    toDisk: true,
    maxAge: 0,
    maxSize: 0,
  },
  metadata: {
    labels: {
      createdFrom: 'automatedAnalysis',
    },
  },

A maxSize constraint makes the most sense here.

@andrewazores
Copy link
Member

As you can see the recording is called "Snapshot", and if this occurs more than once, there will be multiple recordings called "Snapshot" seemingly existing at once within the Target.

Weird.

JFR doesn't use recording names as unique identifiers, they have incrementing numeric IDs that are used instead. The unique name thing is a Cryostat application restriction and we hide the numeric IDs. When you ask the target JVM to take a snapshot it could very well create snapshots with the same repeated name but distinct IDs. I don't know why it would just be creating one called Snapshot instead of snapshot-n, though.

But, do we really need to take snapshots anymore for this usecase? If there is already a recording running, or if one is started by clicking that text/button on the card, then that recording can be used for the report generation request. Taking a snapshot in this case would only help if there was some extra work done to limit the age of data within the snapshot for example, but we don't really have a good story for support that yet anyway. For now I think it would make sense to simply check for a recording with a specific name, like automated-analysis - or even use GraphQL to query by that label you've added instead of looking for a name - and simply try to generate reports from that. If the user clicked that text/button to start that recording from the card then apply settings with a maxAge and the Continuous event template.

@maxcao13
Copy link
Member Author

maxcao13 commented Nov 17, 2022

But, do we really need to take snapshots anymore for this usecase? If there is already a recording running, or if one is started by clicking that text/button on the card, then that recording can be used for the report generation request. Taking a snapshot in this case would only help if there was some extra work done to limit the age of data within the snapshot for example, but we don't really have a good story for support that yet anyway. For now I think it would make sense to simply check for a recording with a specific name, like automated-analysis - or even use GraphQL to query by that label you've added instead of looking for a name - and simply try to generate reports from that. If the user clicked that text/button to start that recording from the card then apply settings with a maxAge and the Continuous event template.

Makes sense, probably don't need snapshots if an 'automated-analysis' recording acts as a profiling recording.

@andrewazores
Copy link
Member

Anyway, the card itself looks great and generally behaves as expected other than the snapshotting behaviour. I really like the filters, especially by score severity.

One question/note, would it make sense to "collapse" NA scores into the n more labels? Or maybe have a default filter applied to not display those?

@maxcao13
Copy link
Member Author

Anyway, the card itself looks great and generally behaves as expected other than the snapshotting behaviour. I really like the filters, especially by score severity.

One question/note, would it make sense to "collapse" NA scores into the n more labels? Or maybe have a default filter applied to not display those?

Yeah, I was thinking about that, thanks for reminding me. I think I will go with just defaultly not showing N/A scores.

@maxcao13
Copy link
Member Author

Makes sense, probably don't need snapshots if an 'automated-analysis' recording acts as a profiling recording.

Mmm... I'm wondering one thing. If we just take report generation repeatedly from a recording, let's say from a fixed recording called 'automated-analysis' with a max size of something like 1MB, is the recording report representative of the current state of the target and previous events? Because if there is a max-size, if I recall correctly, the recording will chop off the previous recording timeline data to fit the max-size constraint?

@andrewazores
Copy link
Member

That's true, the maxSize/maxAge settings will both cause older events to be dropped from the recording, and therefore they won't be used when calculating the automated analysis scores. So if something noteworthy happened to your application half an hour ago but your recording only preserves the last 5 minutes of data, you would never be able to learn about that event that happened half an hour ago.

This is why I think we need to expand the card with some configuration, eventually, that lets the user choose which event template to use and also values for maxSize/maxAge. We don't know how much traffic the target gets and how many events will be emitted, and we don't know how large or how many reports sidecar generators the user will provision, so there are some things that are unknown to us but hopefully the user can figure out and configure.

@andrewazores
Copy link
Member

andrewazores commented Nov 17, 2022

Side-benefit of using the active recording (by name or label) directly rather than taking snapshots is that it means this card will actually serve reports backed by Cryostat's active report cache. Using snapshots this way actually circumvents the cache entirely even though the data may be unlikely to have changed, or changed significantly.

@maxcao13 on your backend support PR, could you touch up the active report cache so that the expiry and refresh times are configured by environment variable? I think in theory we would want to tie the lifecycle of the cache entries to the configured maxAge for the recording, but that's a level of integration that we just don't have right now and would take a lot of refactoring to get to. Once that's done then it's another set of configuration options for the Operator to add to the Cryostat CR, and I guess something else for the Helm chart to support as well.

@maxcao13 maxcao13 force-pushed the automated-analysis-dashboard branch from 86e9d9e to a69af48 Compare December 8, 2022 19:01
Copy link
Member

@andrewazores andrewazores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing looks good to me, skimming the code looks good too. @tthvo seems to have already reviewed the implementation quite thoroughly.

@andrewazores andrewazores merged commit f8ae71e into cryostatio:main Dec 8, 2022
@maxcao13 maxcao13 deleted the automated-analysis-dashboard branch December 8, 2022 19:32
@mergify mergify bot added the safe-to-test label Jan 10, 2023
@github-actions
Copy link

Test image available:

CRYOSTAT_IMAGE=ghcr.io/cryostatio/cryostat-web:pr-618-4f650e4789bf88a5ff4d3a3808eba021322f1cc8 sh smoketest.sh

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Task] Embed Automated Analysis report scores in Dashboard
3 participants