This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Launcher should not allow users to continue if environment(s) have reached resource quotas #3346

Open
andrewazores opened this issue Apr 27, 2018 · 29 comments

Comments

@andrewazores
Collaborator

Parent: #3344

Task

As seen in the design link in the parent issue, the Launcher should display an overlay and block the user from continuing to create a new application if they do not have sufficient resources available for the application to be deployed.

@andrewazores
Collaborator Author

@mceledonia to clarify the intent here: should the Launcher block creation if either the Stage or Run environment has reached its resource quota?

@mceledonia
Collaborator

mceledonia commented Apr 27, 2018

@andrewazores I was always under the assumption that this would apply when both Stage and Run are full, but for this stage of implementation and for the sake of simplicity I think we should block if either is full. If in the future we have the granular control to allow deployment into one stage or the other in this wizard, I think we can change that, but for now this seems to be the way to go.

@qodfathr
Collaborator

Please see my comments here: #3344 (comment)

I do not feel this should be a p0.

@bmicklea bmicklea removed the priority/P1 Critical label Apr 30, 2018
@bmicklea
Collaborator

+1 to @qodfathr's assessment that this is a SEV2, but not a P0. Removing label.

@sunilk747
Collaborator

sunilk747 commented May 4, 2018

@invincibleJai @arunkumars08 I guess this issue can be solved from the client side. We just need to add a check comparing the available resources against the used resources. We can use this API endpoint to get the metadata:
https://api.starter-us-east-2.openshift.com/api/v1/namespaces/ssamal/resourcequotas. It will return something like this:
[screenshot from 2018-05-04 14-16-21: resource quota API response]
Let me know what you guys think.
cc: @animuk
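
For illustration, a minimal sketch of such a client-side check, assuming the standard Kubernetes/OpenShift ResourceQuota response shape (status.hard and status.used) and a bearer token; the function and parameter names are hypothetical, not existing fabric8-ui APIs:

// Hypothetical client-side quota check against the resourcequotas endpoint above.
// The status.hard / status.used fields follow the Kubernetes ResourceQuota API.
interface ResourceQuota {
  status: {
    hard: { [resource: string]: string };
    used: { [resource: string]: string };
  };
}

interface ResourceQuotaList {
  items: ResourceQuota[];
}

async function hasReachedQuota(apiUrl: string, namespace: string, token: string): Promise<boolean> {
  const response = await fetch(`${apiUrl}/api/v1/namespaces/${namespace}/resourcequotas`, {
    headers: { Authorization: `Bearer ${token}` }
  });
  const quotas: ResourceQuotaList = await response.json();
  // Report "at quota" if any tracked resource has used >= hard in any quota object.
  return quotas.items.some(q =>
    Object.keys(q.status.hard).some(resource =>
      parseQuantity(q.status.used[resource]) >= parseQuantity(q.status.hard[resource])
    )
  );
}

// Very simplified quantity parser; real Kubernetes quantities use suffixes
// such as "m", "Ki", "Mi", "Gi" and need proper unit handling.
function parseQuantity(value: string | undefined): number {
  if (value === undefined) {
    return 0;
  }
  const match = value.match(/^([0-9.]+)([a-zA-Z]*)$/);
  if (!match) {
    return NaN;
  }
  const scale: { [suffix: string]: number } = { '': 1, m: 1e-3, Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30 };
  return parseFloat(match[1]) * (scale[match[2]] !== undefined ? scale[match[2]] : 1);
}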

@invincibleJai
Collaborator

@sunilk747 agree it should be handled on the client side, but it needs some development effort, and the flow still needs to be decided, i.e. when to show the quota screen (https://redhat.invisionapp.com/share/CTGJ6F7V4K9#/screens). It's more like a task to me than an issue, which we need to have in the sprint plan.

@andrewazores
Collaborator Author

@sunilk747 you can use this service: https://github.com/fabric8-ui/fabric8-ui/blob/3683fb70017dd16a96951d259a08cf30698e09ea/src/app/space/create/deployments/services/deployment-api.service.ts

This provides access to the backend WIT endpoints developed for the Deployments page. This way you don't need to reimplement proxy handling etc. since it is done by the backend.

@andrewazores
Collaborator Author

In particular you can get CPU and Memory usage vs quota by calling getEnvironments, which returns an Observable<EnvironmentStat[]>. You can narrow this down from EnvironmentStat -> EnvironmentAttributes -> Quota -> {MemoryStat or CpuStat}. Both MemoryStat and CpuStat extend an interface Stat, which looks like this:

export interface Stat {
  readonly used: number;      // amount of the resource currently consumed
  readonly quota: number;     // hard limit available for the resource
  readonly timestamp?: number;
}

used and quota are the exact info you need here.
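
As a sketch of how that might be consumed (the exact property path from EnvironmentStat down to the CPU and memory Stats is an assumption based on the description above, and the rxjs import style may differ in fabric8-ui):

import { Observable } from 'rxjs';
import { map } from 'rxjs/operators';

// Shapes assumed from the description above; the real interfaces live in the
// fabric8-ui deployments services.
interface Stat {
  readonly used: number;
  readonly quota: number;
  readonly timestamp?: number;
}

interface EnvironmentStat {
  attributes: {
    name: string;
    quota: {
      cpucores: Stat;
      memory: Stat;
    };
  };
}

// Emits true if any environment has exhausted its CPU or memory quota --
// the condition under which the Launcher would block creation.
function anyEnvironmentAtQuota(environments$: Observable<EnvironmentStat[]>): Observable<boolean> {
  return environments$.pipe(
    map(envs =>
      envs.some(env => {
        const { cpucores, memory } = env.attributes.quota;
        return cpucores.used >= cpucores.quota || memory.used >= memory.quota;
      })
    )
  );
}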

@jyasveer jyasveer self-assigned this May 7, 2018
@jyasveer jyasveer added this to the Sprint 149 milestone May 7, 2018
@jyasveer
Collaborator

jyasveer commented May 7, 2018

@catrobson @bdellasc
As per my understanding, we have to show this overlay while a user is trying to create a space. The steps could be something like this:

  1. User clicks create space link/button.
  2. As soon as the user clicks the button/link, we check the resource limits and display the overlay with the resource warning message (a rough sketch of this flow follows below).

Please correct me if I am wrong.
cc @animuk @invincibleJai @arunkumars08 @sunilk747
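
A rough sketch of that flow, assuming a quota check like the ones sketched earlier; the service and overlay hooks here are placeholders, not existing fabric8-ui APIs:

// Hypothetical gate on the "Create Space" action: check quota first, then
// either show the blocking overlay or continue into the wizard.
interface QuotaCheck {
  isAtQuota(): Promise<boolean>;
}

async function onCreateSpaceClick(
  quota: QuotaCheck,
  showQuotaOverlay: () => void,
  openCreateSpaceWizard: () => void
): Promise<void> {
  if (await quota.isAtQuota()) {
    showQuotaOverlay(); // overlay with the resource warning message
  } else {
    openCreateSpaceWizard();
  }
}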

@bdellasc
Collaborator

bdellasc commented May 7, 2018

@jyasveer I believe this would work. For this use case, showing this overlay instead of the actual "Create Space" overlay makes sense to me.

CC @catrobson @mceledonia

@qodfathr
Collaborator

qodfathr commented May 7, 2018

I want to make sure we all understand and agree to the implications of this.

In the free tier, very quickly, all users will essentially be at quota. So we agree that a good experience is that a very disruptive overall is presented to them every single time they try to create a new app. Even though there is nothing wrong with creating a new app -- the user will not be stuck in any way if they create a new app. The only thing that will "go wrong" is the deployment may fail (but may not -- this is highly situationally dependent). They will be able to edit the app in Che, create work items, etc. etc. Moreover, it is entirely possible that if the user is not at quota, but is close to quota, the app deployment will fail anyway, and we are not warning them in this case. Lastly, there are several ways a deployment may fail -- CPU, mem, disk, routes, services, etc. Are we committing to (1) ensure we are checking all of those, (2) keeping pace with new artefacts with limits in the future (e.g. stateful sets), and (3) keeping pace with changes to those limits (e.g. combing Run and Stage into one pool of resources rather than two independent pools).

I remain greatly concerned that we are trying to solve the wrong problem, and to do so in the wrong way. If we pursue this change as described, then I would like to see those in support of the change actually use the product, with that feature flag turned on, to do real work with the service for a few weeks and see if they still feel the same way about the proposal. We have to balance what is logical/analytical against the emotional response to the feel of the service. I believe this proposal is far too analytical and does not give enough consideration to the desired emotional experience.

@catrobson
Collaborator

I agree @qodfathr - I think the blocking of creation is being put into place because we do not have the right information, actions, and capabilities built into the system today to gracefully handle how the system acts, and what the user can do about it, in the case that they go over quota. This is something that I think will be really important for us to focus on improving as soon as possible.

I'm not convinced that, as of today, not blocking users will result in a better experience than blocking them, only because recovery from an over-quota experience seems very difficult to handle - the user can get into too much trouble without clear recovery paths. I'd like to talk about this as a team as soon as possible so we can consider the smallest set of changes required for the information, user recovery, and resource management to be good enough that we no longer need to block users in an attempt to protect them.

@bmicklea
Collaborator

bmicklea commented May 7, 2018

Is this a case where, if the deployment fails, we could warn them and suggest they try again later? If this was paired with some kind of alert on the system delivery side, could we then clear the issue for them from the back end?

@jyasveer
Collaborator

@catrobson can you please let me know if the UX team is still working on the UXD for this?

@bdellasc
Collaborator

bdellasc commented May 11, 2018

My impression, reading the last comments from @catrobson and @qodfathr: it sounds like we (UXD, Development and PM) need to look at the information, user recovery, and resource management topics. Based on that conversation, we would try to figure out how to add or improve those so we don't have to "block" users when they are maxed out on current resource usage. "Blocking" them feels like an interim solution, when we really need to improve those other supporting pieces.

Since Summit was this week, I don’t think this is something UXD has taken up yet, as a number of key stakeholders have been focused on Summit. If this is the path we agree needs to be taken, we’ll need to lay some groundwork to create stories/issues in order to begin fielding the wider topics mentioned above.

We might need to stick with a simple “block” solution (+ some basic mechanisms to get unblocked) for now because the issues above might take some time to work through. If we run with the interim “block” solution now, keep in mind that we need to address those other areas as a more permanent solution. I’m not sure if @catrobson and @qodfathr would be alright with this approach and I don't want to speak for them.

So, bottom line is - we do intend on fielding the wider issues with the rest of the team, but haven’t yet because of Summit this week.

@ldimaggi
Collaborator

I think that as a general rule, if we block a user from performing a task, we have to also provide them with an easy path to get un-blocked. Blocking a user from performing a task without providing a path forward/around the block is a dead-end situation. Thx!

@jyasveer
Collaborator

@ldimaggi I agree with you.
@qodfathr @bdellasc @catrobson
My question here is: do we want a temporary solution of blocking the user from creating a space for now, OR do we want to wait for a permanent solution and implement that later?

@bdellasc
Collaborator

bdellasc commented May 11, 2018

If it is critical to have something implemented ASAP, it might make sense to arrive at a very basic solution where we block the user but give them a path to get unblocked, so we don't end up with the dead end that @ldimaggi mentioned. It may not be an ideal flow, but it would serve the purpose for now. Then we can take a wider look at a better, more permanent solution for the longer term?

I think these are our options going forward:

  • spend more time NOW on a more integrated, longer-term solution (meaning the current implementation remains a dead end for the user in the meantime)
  • get something quick/basic implemented NOW that prevents the dead end, then look at a better, longer-term solution once the basic one is in place...

I suspect the second option is the least painful for our users...something that is potentially clunky, but doesn't dead-end them right now.

@catrobson @qodfathr

@mceledonia
Collaborator

mceledonia commented May 11, 2018

My understanding of this is that it was a temporary first step of a multi-step implementation to solve the problem. I'm assuming (let me know if I'm wrong) that the dead end is remedied with a direct link to a page that allows the user to scale down, delete, or both... However, there are a couple of issues with this which need to be solved, which is why we are implementing only the first increment (blocking the user and using language to direct them to a solution):

  • We don't have backend support to know which space is using the most resources
  • Even if we did, we can't be sure we're sending them to the right space, because that space may not be the one the user wishes to compromise

So we need to figure out what the best next step is for the user. Because we don't have the ability to upgrade the account we need to offer a solution that exists within the current resource limitations.

  • Do we block and send them to an account-wide overview page with quick actions to scale up/down and or delete? Do we offer this experience in an abbreviated way directly in the overlay?
  • Do we offer a way to quickly "disable" or "turn off" all deployments in a given space or application, which scales them all the way down with the click of a button? Essentially streamlining the scale down process in this view?
  • Do we allow the user to continue but offer adequate/detailed information as to why their deployments fail when they do, and defer the "clear path forward" to that point?
  • Do we allow the user to continue and give them a one-time trial experience of what it's like to have more resources (perhaps a slightly extreme answer to Todd's point about emotional response), with persistent reminders that they are over quota and that the trial period will end after a certain time? Then, after it ends, defer to a more hard-blocked solution?

Those are only a few ideas off the top of my head (it's Friday and almost lunch time), but all that to say I understood this solution as an incremental first step towards a better solution, with the assumption that scaling up/down or deleting was the only path forward for the user. That assumption sounds like it's not correct, as we could instead defer the blocking of the user to later, when a deployment fails (if it does). It sounds like there needs to be more discussion around those two options and the possible solutions that branch from them.

I agree with @qodfathr that we need to consider the balance between emotional and analytical responses, and with @catrobson that this needs more discussion with the larger team.

@animuk
Collaborator

animuk commented May 11, 2018

I would say it would be good if we give users

  1. A warning message that their quota limit is exceeded and the pipeline build may fail.
  2. A URL/link to the OpenShift console so that they can free up unused resources in a separate window.
  3. Then proceed with creating the new app.

That way, openshift.io is not blocking its users, and it also warns them that their deployment may fail and displays the action points to counter that.

@qodfathr
Collaborator

Being out of quota (which has something like 5 dimensions -- disk, cpu, ram, services, routes, others?) should be something known to users at all times, in a non-intrusive way. Perhaps a warning icon in the topnav, which, when clicked, gives details plus links on how to remediate. But even as a "short term" solution, I wouldn't want us to proceed until all quota dimensions are known and documented and we've got the APIs to properly articulate them. Moreover, we cannot hard-code the quota limits into the front-end logic -- e.g., don't query to find out there are 5 routes in use and assume that the user cannot create a 6th. The API must make this determination in conjunction with the backend. Assume the answer is potentially different for every single user. (i.e. a short-term solution is not an excuse to add to technical debt.)
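
To make the "don't hard-code limits" point concrete, a sketch of a quota summary whose dimensions come entirely from the backend response; the shape here is hypothetical:

// Hypothetical quota summary: the set of dimensions (cpu, memory, routes,
// services, ...) is whatever the backend reports, so the front end never
// assumes which limits exist or what their values are.
interface QuotaDimension {
  name: string;
  used: number;
  hard: number;
}

interface QuotaReport {
  dimensions: QuotaDimension[];
}

// Returns the exhausted dimensions, e.g. to drive a non-blocking topnav
// warning icon rather than a blocking overlay.
function exhaustedDimensions(report: QuotaReport): QuotaDimension[] {
  return report.dimensions.filter(d => d.used >= d.hard);
}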

Being out of quota may impact users in many ways, but completing the Launcher is not one of those ways, so I still argue that solving it in the Launcher is the wrong approach. Rather, if the user knows that they are out of quota before they even start the Launcher, I feel this is a better overall experience. An unobtrusive, non-blocking topnav warning is one way to achieve that goal. (Read: the user sees the warning icon and elects to do something about it before clicking Add to Space; I am not suggesting the user clicks Add to Space and then we put a pop-up in their face. It's more like this: I get in my car and the low-gas light is on. I likely choose to go to the gas station and fill up before embarking on my journey, but maybe I'm only going on a very short trip and know I can deal with buying gas later. In any event, the light is a non-obtrusive warning. It doesn't stop me from starting my car or driving to the destination of my choice. If I go too far and run out of gas, oh well, I made a choice. Given that this rarely happens, the unobtrusive warning model is proven to work.)

The primary impact of quota maximization is that app deployment fails. So, as a longer-term goal, IMHO, this is where we should look to improve the experience. E.g., rather than failing the pipeline, put it in a warning state requiring input (much like the Promotion step we have today). The two options could be "try again" or "cancel." The idea being that if the deployment fails due to some quota limitation (or really for any reason), the user is given the option to go correct the cause of the deployment failure and then try the deployment again (without having to rerun the whole pipeline). Taken further, imagine if the user could redirect the deployment elsewhere -- e.g. "oh, yeah, my free Stage env is full or too small to ever run that app, but let's deploy it to OSO Pro instead." A quick edit to the pipeline (perhaps via a GUI) and a click of "Try Again" and the user is back on track.

Related to all of this, pipelines that are in any "input required" state probably should also have some sort of non-blocking notification to the user (e.g. an icon in the topnav). Clicking would give details and a link to the awaiting pipeline. This helps to address the primary impact of failed deployments (assuming that the pipeline does not fail but rather goes into the remediation state as described above). Moreover, this would help address a concern whereby a pipeline is not starting in Space S2 because another pipeline is awaiting input in Space S1 and I only have the authority to run one build at a time. (Note that in the current UI, you cannot see the awaiting S1 pipeline when in S2, so, today, the user is somewhat in the dark as to why the S2 pipeline is not starting.)

@ldimaggi
Collaborator

+1 to Todd's description of handling a deployment that would fail due to a lack of resources. Instead of blocking a user, I think it's better to inform the user, and provide the user with a path forward.

Also +1 to Todd's recommendation of our having a clear definition of all the dimensions of a user being out of quota (we've been focusing on a subset of these in our discussions) AND to our providing easy to understand quota information to our users.

@animuk
Collaborator

animuk commented May 14, 2018

@pradeepto Please see Todd's comments above. As part of the long-term approach, instead of a pipeline failure when the resource quota limit is exceeded in OpenShift, can we give users a warning message with the option to go correct the cause of the deployment failure at the pipeline build level? Can you analyse the possibility with the build team as a future process improvement?
cc @krishnapaparaju

@animuk
Collaborator

animuk commented Jun 14, 2018

@krishnapaparaju As discussed, we need an endpoint in the Jenkins proxy layer which will call the OpenShift API to fetch the usage quota details and show the user two options, "try again" or "cancel", in the pipeline screen before it starts deployment in OpenShift.
cc @pradeepto

@andrewazores
Collaborator Author

The API you're asking for sounds like a combination of ones that already exist and one we have in progress for the Deployments page: #3388. We also have an OpenShift events bus API in progress which would allow you to discover when deployment failures occur: #2528.

@piyush-garg
Collaborator

piyush-garg commented Jun 20, 2018

@andrewazores @animuk @jyasveer Is there anything the build-cd team needs to do? Can we drop that label if everything is fine?

@animuk
Collaborator

animuk commented Jun 25, 2018

@piyush1594 As discussed, if you already have that endpoint in the Jenkins proxy layer, then use it to prompt the user with "try again" and/or "cancel" in the pipeline screen before it starts deployment in OpenShift.

@bartoszmajsak
Contributor

@GeorgeActon @animuk just wanted to understand how much has been done and what is still needed to make this feature a reality. /cc @slemeur

@animuk
Collaborator

animuk commented Aug 13, 2018

@bartoszmajsak this was assigned to the build team before the start of the J train. Can you ask them?
